CreativeSynth: Cross-Art-Attention for Artistic Image Synthesis with Multimodal Diffusion

1Tsinghua University, 2Pengcheng Laboratory, 3MAIS, Institute of Automation, Chinese Academy of Sciences, 4School of AI, UCAS, 5School of Computer Science and Technology, Chinese Academy of Sciences, 6ByteDance Inc., 7National Cheng Kung University

CreativeSynth Teaser

CreativeSynth generates personalized digital art from a supplied art image, guided by either unimodal or multimodal prompts. The method not only yields high-fidelity artwork but also faithfully preserves the foundational concepts, composition, stylistic elements, and visual symbolism of genuine artworks. CreativeSynth supports a wide array of applications, including (a) image variation, (b) image editing, (c) style transfer, (d) image fusion, and (e) multimodal blending.

Abstract

Although remarkable progress has been made in image style transfer, style is only one component of artistic paintings. Directly transferring extracted style features to natural images often produces outputs with obvious synthetic traces, because key painting attributes, including layout, perspective, shape, and semantics, cannot be conveyed through style transfer alone. Large-scale pretrained text-to-image generation models have demonstrated their capability to synthesize high-quality images at scale. However, even with extensive textual descriptions, it is challenging to fully express the unique visual properties and details of paintings. Moreover, generic models often disrupt the overall artistic effect when modifying specific areas, making it harder to achieve a unified aesthetic in artworks. Our key idea is to integrate multimodal semantic information into artworks as a synthesis guide, rather than transferring style to the real world, while reducing disruption to the harmony of the artwork and simplifying the guidance conditions. Specifically, we propose CreativeSynth, a multi-task unified framework based on a diffusion model that coordinates multimodal inputs. CreativeSynth combines multimodal features with customized attention mechanisms, using Cross-Art-Attention to seamlessly integrate real-world semantic content into the art domain for aesthetic maintenance and semantic fusion. We demonstrate our method across a wide range of art categories, showing that CreativeSynth bridges the gap between generative models and artistic expression.

CreativeSynth Framework

CreativeSynth incorporates information from both text and image modalities to sample artwork under guiding conditions. The approach begins by encoding semantic prompts from images and textual prompts, laying the groundwork for condition guidance. The framework then handles aesthetic maintenance with a dedicated processor that aligns the style of the semantic image with the target image through adaptive instance normalization. For semantic fusion, CreativeSynth employs a decoupled cross-attention mechanism that coordinates the interplay between visual and textual features, producing a cohesive synthesis rather than a mere sum of its parts. Finally, sampling builds on image inversion: the output is recovered from the initial noise through iterative denoising. The result is customized artwork that reflects the given semantic prompts and the chosen aesthetic style. The overall architecture of the method is shown in the following figure.
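The two operations named above can be illustrated concretely. Below is a minimal NumPy sketch of (1) adaptive instance normalization, which re-aligns the channel-wise statistics of one feature map to another, and (2) decoupled cross-attention, in which text and image conditions use separate key/value projections whose attention outputs are summed. All shapes, function names, and the `scale` parameter are illustrative assumptions for exposition, not the paper's actual implementation.

```python
# Illustrative sketch only -- shapes and names are assumptions, not the paper's code.
import numpy as np

def adain(content_feat, style_feat, eps=1e-5):
    """Adaptive instance normalization: normalize content features per channel,
    then rescale/shift them with the style features' channel statistics.
    Both inputs have shape (C, H, W)."""
    c_mean = content_feat.mean(axis=(1, 2), keepdims=True)
    c_std = content_feat.std(axis=(1, 2), keepdims=True) + eps
    s_mean = style_feat.mean(axis=(1, 2), keepdims=True)
    s_std = style_feat.std(axis=(1, 2), keepdims=True) + eps
    return s_std * (content_feat - c_mean) / c_std + s_mean

def attention(q, k, v):
    """Plain scaled dot-product attention. q: (N, d); k, v: (M, d)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def decoupled_cross_attention(q, k_txt, v_txt, k_img, v_img, scale=1.0):
    """Decoupled cross-attention: the text and image conditions each get their
    own key/value pairs, and the two attention outputs are combined additively,
    with `scale` weighting the image branch."""
    return attention(q, k_txt, v_txt) + scale * attention(q, k_img, v_img)
```

Because the two attention branches are decoupled, the image condition can be strengthened or muted via `scale` without retraining the text branch, which is what lets a single framework serve both unimodal and multimodal prompts.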


The overall pipeline of CreativeSynth.

Overview

Different modalities naturally arise from the same underlying data source. CreativeSynth connects these modalities in a common embedding space, enabling new emergent alignments and functionality.


CreativeSynth overview.

Comparisons

Qualitative comparisons of our proposed CreativeSynth with existing methods. The results visualize image fusion between artistic and real images.


Visual comparison of our proposed CreativeSynth with state-of-the-art methods for text-guided editing of diverse types of art images.


BibTeX

@article{huang2024creativesynth,
  title={CreativeSynth: Creative Blending and Synthesis of Visual Arts based on Multimodal Diffusion},
  author={Huang, Nisha and Dong, Weiming and Zhang, Yuxin and Tang, Fan and Li, Ronghui and Ma, Chongyang and Li, Xiu and Lee, Tong-Yee and Xu, Changsheng},
  journal={arXiv preprint arXiv:2401.14066},
  year={2024}
}