CO3: Contrasting Concepts Compose Better

Debottam Dutta    Jianchong Chen    Rajalaxmi Rajagopalan    Yu-Lin Wei    Romit Roy Choudhury

University of Illinois Urbana-Champaign

Under review

(TL;DR) We propose a corrective sampling strategy to improve multi-concept prompt fidelity in text-to-image diffusion models by suppressing overlapping modes.

Abstract

We propose an algorithm to improve multi-concept prompt fidelity in text-to-image diffusion models. We start from a common failure: prompts like "a cat and a clock" sometimes yield images where one concept is missing, faint, or colliding awkwardly with another. We hypothesize this happens when the model drifts into mixed modes that over-emphasize a single concept pattern it learned strongly during training, while the others are weakened. Instead of retraining, we introduce a corrective sampling strategy that gently suppresses regions where the joint prompt behavior overlaps too strongly with any single concept's dominant pattern, steering generation toward "pure" joint modes where all concepts can co-exist with balanced visual presence. We further show that existing multi-concept guidance schemes can operate in unstable weight regimes that amplify imbalance; we characterize favorable regions and adapt sampling to remain within them. The approach is plug-and-play, requires no model tuning, and complements standard classifier-free guidance. Experiments on diverse multi-concept prompts show consistent gains in concept coverage, relative prominence balance, and robustness, reducing dropped or distorted concepts compared to standard baselines and prior compositional methods. Results suggest that lightweight corrective guidance can substantially mitigate brittle semantic behavior in modern diffusion systems.

Method

Problematic (Overlapping) Mode Hypothesis

Mode overlap illustration
Figure 1: Illustration of our mode-overlap hypothesis on a simple 2D toy example. Two modes of the distribution \(p_t(x \mid C = \text{'a cat and a dog'})\) (green contours) have significant overlap with the modes of the individual concept distributions \(p_t(x \mid c_1 = \text{'a cat'})\) (red contours) and \(p_t(x \mid c_2 = \text{'a dog'})\) (orange contours). The proposed corrector distribution suppresses these overlaps, steering generation away from problematic modes. Arrows indicate denoising directions.

T2I models like Stable Diffusion sample from the modes of the learned distribution \(p(x|C)\). While such models generally produce high-resolution images, the results are sometimes surprisingly misaligned even for simple prompts like \(C\)="a cat and a dog". We hypothesize that problematic modes in \(p(x|C)\) arise when they overlap with modes of the individual concept distributions \(p(x|c_i)\). Such overlap biases generation toward a single concept, reducing the prominence of the others.
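The hypothesis can be made concrete with a toy setup in the spirit of Figure 1. Below is a minimal 2D sketch, assuming Gaussian modes; all means, variances, and positions are illustrative choices of ours, not values from the paper.

```python
import numpy as np

# Toy 2D illustration of the overlap hypothesis (Figure 1-style setup).
# Each distribution is modeled as a mixture of isotropic Gaussians; one
# mode of the joint prompt p(x|C) sits close to a single-concept mode.

def gaussian_pdf(x, mean, var):
    """Isotropic 2D Gaussian density evaluated on a grid of points."""
    d = x - mean
    return np.exp(-0.5 * np.sum(d * d, axis=-1) / var) / (2 * np.pi * var)

grid = np.stack(np.meshgrid(np.linspace(-4, 4, 200),
                            np.linspace(-4, 4, 200)), axis=-1)

# p(x|C): two joint modes; the first lies near the 'cat' mode below.
p_joint = 0.5 * gaussian_pdf(grid, np.array([-1.0, 0.0]), 0.3) \
        + 0.5 * gaussian_pdf(grid, np.array([2.0, 2.0]), 0.3)

# Individual concept distributions p(x|c1), p(x|c2).
p_cat = gaussian_pdf(grid, np.array([-1.2, 0.2]), 0.4)   # near a joint mode
p_dog = gaussian_pdf(grid, np.array([3.0, -2.5]), 0.4)   # far from both

# A simple overlap measure: shared mass between the joint density and
# each single-concept density (min of the two, summed over the grid).
overlap_cat = np.sum(np.minimum(p_joint, p_cat))
overlap_dog = np.sum(np.minimum(p_joint, p_dog))
print(overlap_cat > overlap_dog)  # the 'cat'-adjacent joint mode overlaps far more
```

A joint mode with high overlap (here, the one near the 'cat' mode) is exactly the kind of region the corrector is designed to suppress.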

Remedy: Concept-Contrasting Corrector (CO3)

To remedy this, we propose a corrector that samples from a distribution assigning low probability to regions where \(p(x|C)\) overlaps with any individual \(p(x|c_i)\), steering generation toward pure multi-concept modes.

Corrected Distribution
Proposed Concept-Contrasting Corrector (CO3) Distribution
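One way to realize such a corrector is to tilt the joint distribution against the concept distributions, e.g. \(\tilde{p}(x|C) \propto p(x|C)\,\prod_i p(x|c_i)^{-\gamma}\), whose score is the joint score minus a down-weighted sum of concept-score deviations. The sketch below expresses this in \(\epsilon\)-prediction form; the exponent \(\gamma\), the combination rule, and the function name are our own illustrative assumptions, not necessarily the exact CO3 formulation.

```python
import numpy as np

def co3_style_correction(eps_joint, eps_concepts, eps_uncond, gamma=0.3):
    """Hedged sketch of a concept-contrasting correction.

    Assumes a corrected distribution proportional to
    p(x|C) * prod_i p(x|c_i)^(-gamma), whose score is the joint
    score minus gamma times the sum of concept-score deviations.
    gamma and the exact combination rule are illustrative choices.
    """
    correction = sum(e - eps_uncond for e in eps_concepts)
    return eps_joint - gamma * correction

# Shapes as in a latent-diffusion step: (C, H, W) noise predictions.
shape = (4, 8, 8)
rng = np.random.default_rng(0)
eps_C = rng.normal(size=shape)    # joint-prompt prediction
eps_c1 = rng.normal(size=shape)   # concept 1 ('a cat') prediction
eps_c2 = rng.normal(size=shape)   # concept 2 ('a dog') prediction
eps_0 = rng.normal(size=shape)    # unconditional prediction

eps_corr = co3_style_correction(eps_C, [eps_c1, eps_c2], eps_0)
print(eps_corr.shape)  # (4, 8, 8)
```

Because the correction only combines noise predictions that standard samplers already compute, it is gradient-free and adds no backpropagation memory.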

Algorithm

The CO3 algorithm consists of three key components, described in the algorithm boxes below.

Algorithm 1
Algorithm 2
Algorithm 3
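To show how such components might fit together in a sampler, here is a minimal sketch of a corrected denoising loop. It is a simplification under our own assumptions (a plain Euler update, fixed weights `cfg` and `gamma`, and an abstract `denoiser` callable); the paper's actual schedules and stability conditions live in Algorithms 1-3 above.

```python
import numpy as np

def sample_with_corrector(denoiser, prompts, steps=50, cfg=7.5, gamma=0.3,
                          shape=(4, 64, 64), seed=0):
    """Illustrative CO3-style sampling loop (a sketch, not the paper's code).

    denoiser(x, t, cond) stands for any eps-prediction network;
    cond=None means the unconditional prediction.
    """
    joint, concepts = prompts["joint"], prompts["concepts"]
    rng = np.random.default_rng(seed)
    x = rng.normal(size=shape)
    for t in np.linspace(1.0, 0.0, steps, endpoint=False):
        eps_0 = denoiser(x, t, None)       # unconditional prediction
        eps_C = denoiser(x, t, joint)      # joint-prompt prediction
        # Standard classifier-free guidance on the joint prompt.
        eps = eps_0 + cfg * (eps_C - eps_0)
        # Concept-contrasting correction: push away from regions where the
        # joint behavior aligns too strongly with any single concept.
        for c in concepts:
            eps -= gamma * (denoiser(x, t, c) - eps_0)
        x = x - (1.0 / steps) * eps        # simple Euler update
    return x
```

Note that the correction composes with, rather than replaces, classifier-free guidance, consistent with the plug-and-play claim.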

Results

Quantitative

We evaluate the generated images using two metrics: BLIP-VQA and ImageReward. The table below presents a quantitative comparison of different methods on compositional generation tasks.

Quantitative Results Table
Table 1: Quantitative comparison. The top-performing model is highlighted in black and the second best in blue. Higher scores are better.

Qualitative – Simple Prompts

Qualitative comparison on simple prompts
Figure 2: Qualitative comparison of different methods on simpler prompts.

Qualitative – Complex Prompts

Qualitative comparison on complex prompts
Figure 3: Qualitative comparison of CO3 with competing methods on prompts with multiple concepts and complex scenarios.

Qualitative – Rare concept scenarios

Qualitative comparison on rarebench prompts
Figure 4: Qualitative comparison of CO3 with competing methods on rare-concept scenarios.

Model Agnostic Behavior & Efficiency

CO3 is model-agnostic and can be applied to various diffusion model architectures.

  • Agnostic to architectures (U-Nets, ViTs, etc.)
  • Agnostic to samplers (DDIM, DPM++, etc.)
  • Plug-and-play, no fine-tuning, gradient-free
  • Uses 3× less memory than gradient-based baselines
  • Complements CFG
PixArt-\(\Sigma\) comparison
Figure 5: Model-agnostic behavior: qualitative comparison of generations from the PixArt-\(\Sigma\) base diffusion model, PixArt-\(\Sigma\) + CO3, and PixArt-\(\Sigma\) + Composable Diffusion.