CreativeSynth: Cross-Art-Attention for Artistic Image Synthesis with Multimodal Diffusion

1Tsinghua University, 2Pengcheng Laboratory, 3MAIS, Institute of Automation, Chinese Academy of Sciences, 4School of AI, UCAS, 5School of Computer Science and Technology, Chinese Academy of Sciences, 6ByteDance Inc., 7National Cheng Kung University

CreativeSynth Teaser

CreativeSynth generates personalized digital art from a supplied art image, guided by either unimodal or multimodal prompts. The method not only yields high-fidelity artwork but also faithfully preserves the foundational concepts, composition, stylistic elements, and visual symbolism of genuine artworks. CreativeSynth supports a wide array of applications, including (a) image variation, (b) image editing, (c) style transfer, (d) image fusion, and (e) multimodal blending.

Abstract

Although remarkable progress has been made in image style transfer, style is only one component of artistic paintings. Directly transferring extracted style features to natural images often produces outputs with obvious synthetic traces, because key painting attributes, including layout, perspective, shape, and semantics, cannot be conveyed through style transfer alone. Large-scale pretrained text-to-image generation models have demonstrated their capability to synthesize high-quality images at scale. However, even with extensive textual descriptions, it is challenging to fully express the unique visual properties and details of paintings. Moreover, generic models often disrupt the overall artistic effect when modifying specific areas, making it harder to achieve a unified aesthetic in artworks. Our key idea is to integrate multimodal semantic information into artworks as a synthesis guide, rather than transferring style to the real world, while reducing disruption to the harmony of the artwork and simplifying the guidance conditions. Specifically, we propose CreativeSynth, a multi-task unified framework based on a diffusion model that coordinates multimodal inputs. CreativeSynth combines multimodal features with customized attention mechanisms, using Cross-Art-Attention to seamlessly integrate real-world semantic content into the art domain for aesthetic maintenance and semantic fusion. We demonstrate our method across a wide range of art categories, showing that CreativeSynth bridges the gap between generative models and artistic expression.

CreativeSynth Framework

CreativeSynth incorporates information from both text and image modalities to sample artwork under guiding conditions. The approach begins by encoding semantic prompts from images and textual prompts, laying the groundwork for condition guidance. The framework then handles aesthetic maintenance with a dedicated processor that aligns the style of the semantic image with the target image through adaptive instance normalization. For semantic fusion, CreativeSynth employs a decoupled cross-attention mechanism that coordinates the interplay between visual and textual features, producing a cohesive synthesis rather than a mere sum of its parts. Finally, sampling builds on image inversion: the output is recovered from the initial noise through iterative denoising. The result is customized artwork that reflects the given semantic prompts and the chosen aesthetic style. The overall architecture of the method is shown in the following figure.
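The two operations named above can be illustrated concretely. Below is a minimal NumPy sketch of (1) adaptive instance normalization, which re-aligns the channel-wise statistics of one feature map to another, and (2) decoupled cross-attention, in which text and image conditions use separate key/value projections whose attention outputs are summed. All shapes, function names, and the `scale` parameter are illustrative assumptions for exposition, not the paper's actual implementation.

```python
# Illustrative sketch only -- shapes and names are assumptions, not the paper's code.
import numpy as np

def adain(content_feat, style_feat, eps=1e-5):
    """Adaptive instance normalization: normalize content features per channel,
    then rescale/shift them with the style features' channel statistics.
    Both inputs have shape (C, H, W)."""
    c_mean = content_feat.mean(axis=(1, 2), keepdims=True)
    c_std = content_feat.std(axis=(1, 2), keepdims=True) + eps
    s_mean = style_feat.mean(axis=(1, 2), keepdims=True)
    s_std = style_feat.std(axis=(1, 2), keepdims=True) + eps
    return s_std * (content_feat - c_mean) / c_std + s_mean

def attention(q, k, v):
    """Plain scaled dot-product attention. q: (N, d); k, v: (M, d)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def decoupled_cross_attention(q, k_txt, v_txt, k_img, v_img, scale=1.0):
    """Decoupled cross-attention: the text and image conditions each get their
    own key/value pairs, and the two attention outputs are combined additively,
    with `scale` weighting the image branch."""
    return attention(q, k_txt, v_txt) + scale * attention(q, k_img, v_img)
```

Because the two attention branches are decoupled, the image condition can be strengthened or muted via `scale` without retraining the text branch, which is what lets a single framework serve both unimodal and multimodal prompts.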


The overall pipeline of CreativeSynth.

Overview

Different modalities naturally arise from the same underlying data source. CreativeSynth connects these modalities in a common embedding space, enabling new emergent alignments and functionality.


CreativeSynth overview.

Comparisons

Qualitative comparisons of our proposed CreativeSynth with existing methods. The results visualize image fusion between artistic and real images.


Visual comparison of our proposed CreativeSynth with state-of-the-art methods for text-guided editing of diverse types of art images.


BibTeX

@article{huang2024creativesynth,
  title={CreativeSynth: Creative Blending and Synthesis of Visual Arts based on Multimodal Diffusion},
  author={Huang, Nisha and Dong, Weiming and Zhang, Yuxin and Tang, Fan and Li, Ronghui and Ma, Chongyang and Li, Xiu and Lee, Tong-Yee and Xu, Changsheng},
  journal={arXiv preprint arXiv:2401.14066},
  year={2024}
}