HiPrompt: Tuning-free Higher-Resolution Generation with Hierarchical MLLM Prompts

Xinyu Liu¹, Yingqing He¹, Lanqing Guo², Xiang Li³, Bu Jin⁴, Peng Li¹, Yan Li¹, Chi-Min Chan¹, Qifeng Chen¹, Wei Xue¹, Wenhan Luo¹, Qifeng Liu¹, Yike Guo¹

¹Hong Kong University of Science and Technology, ²Nanyang Technological University, ³Tsinghua University, ⁴University of Chinese Academy of Sciences

Abstract

The potential for higher-resolution image generation using pretrained diffusion models is immense, yet these models often struggle with object repetition and structural artifacts, especially when scaling to 4K resolution and higher. We find that the problem arises because a single prompt is insufficient to guide generation across multiple scales. In response, we propose HiPrompt, a new tuning-free solution that tackles these problems by introducing hierarchical prompts, which offer both global and local guidance. Specifically, the global guidance comes from the user input that describes the overall content, while the local guidance uses patch-wise descriptions from multimodal large language models (MLLMs) to guide the generation of regional structure and texture in detail. Furthermore, during the inverse denoising process, the generated noise is decomposed into low- and high-frequency spatial components. These components are conditioned on multiple prompt levels, including detailed patch-wise descriptions and broader image-level prompts, which enables prompt-guided denoising under hierarchical semantic guidance. This also lets the generation focus more on local spatial regions and keeps the local and global semantics, structures, and textures of the generated images coherent at high definition. Extensive experiments demonstrate that HiPrompt outperforms state-of-the-art works in higher-resolution image generation, significantly reducing object repetition and enhancing structural quality.

HiPrompt

Given a low-resolution image, an MLLM generates a dense local description for each overlapping local patch. To improve the quality of these detailed prompts, we apply N-gram (n = 1) refinement to filter out irrelevant noise. HiPrompt then decomposes the noisy image into low- and high-spatial-frequency components using low-pass and high-pass Gaussian filters. These components are denoised in parallel, conditioned on the hierarchical prompts, and then combined into the final estimate during the inverse denoising process.
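
The patch tiling and the frequency split described above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation. The function names (`overlapping_boxes`, `low_pass`, `high_pass`, `combine`), the kernel settings, and the recombination rule are all assumptions made for illustration; in particular, the rule here takes the low-frequency part from a globally conditioned prediction and the high-frequency part from a locally conditioned one.

```python
import math

def overlapping_boxes(size, patch, stride):
    """Return (top, left, bottom, right) boxes tiling a size x size image.

    Assumes size >= patch. A stride smaller than the patch gives overlap.
    """
    starts = list(range(0, size - patch + 1, stride))
    if starts[-1] != size - patch:
        starts.append(size - patch)  # make sure the last patch reaches the border
    return [(t, l, t + patch, l + patch) for t in starts for l in starts]

def gaussian_kernel(sigma, radius):
    """Normalized 1-D Gaussian weights of length 2 * radius + 1."""
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def blur_1d(values, kernel):
    """Convolve one row or column, replicating the edge values."""
    radius = len(kernel) // 2
    n = len(values)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)
            acc += w * values[j]
        out.append(acc)
    return out

def low_pass(img, sigma=1.0, radius=2):
    """Separable Gaussian blur: rows first, then columns."""
    kernel = gaussian_kernel(sigma, radius)
    rows = [blur_1d(r, kernel) for r in img]
    cols = [blur_1d(list(c), kernel) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]

def high_pass(img, sigma=1.0, radius=2):
    """High-frequency residual: the input minus its low-pass version."""
    low = low_pass(img, sigma, radius)
    return [[x - l for x, l in zip(r, lr)] for r, lr in zip(img, low)]

def combine(eps_global, eps_local, sigma=1.0, radius=2):
    """Hypothetical recombination: low frequencies from the globally
    conditioned estimate plus high frequencies from the locally
    conditioned estimate."""
    low = low_pass(eps_global, sigma, radius)
    high = high_pass(eps_local, sigma, radius)
    return [[a + b for a, b in zip(lr, hr)] for lr, hr in zip(low, high)]
```

Because the high-pass output is defined as the residual of the low-pass output, the two components sum back to the original input. As a result, `combine` returns its input unchanged when both estimates are identical, and differs only where the global and local predictions disagree.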

More Samples at Various Resolutions

2048×2048

"Ethereal fantasy concept art of an elf, magnificent, celestial, ethereal, painterly, epic, majestic, magical, fantasy art, cover art, dreamy."

4096×4096

"A loyal robotic dog with sleek, cyber-enhanced features standing guard beside its owner in a neon-lit cyberpunk city, bond depicted amidst a chaotic megacity at dusk, painted by Hajime Sorayama and Greg Rutkowski, sharp details and dynamic lighting."

2048×4096

"A pair of cuddly rabbits, one white with floppy ears and the other brown with a twitching nose, snuggling together in a cozy hutch filled with straw."

4096×4096

"Beautiful angle view of Species woman individual from another planet with multicolored hair, beautiful colorful eyes, run in autumn garden, breeze at dawn, alcohol ink painting, psychedelic art by Ross Tran, Antonio J. Manzanedo, Tom Bagshaw, mandy disher, cinematic, 32k, stills from Steven Spielberg epic film, clear focus, hyperrealistic repin artstation painting, detailed character design concept art, matte painting."

4096×4096

"Cute foxy magic harry potter style smiling funny comical set in hogwarts caricutre style close up, intricate, magical, volumetric lighting, beautiful masterpiece, rich in deep colors, sharp focus, ultra detailed, trending on artstation, studio photos, cinematic lighting, fantastic view, hyperrealistic, 4K 3D, high definition."

4096×4096

"A serene mountain landscape with towering snow-capped peaks, a crystal-clear blue lake reflecting the mountains, dense pine forests, and a vibrant orange sunrise illuminating the sky."

4096×4096

"A beautiful photo of an lion that got lost in the amazon rainforest, rain, mist, 8k, sharp intricate details, masterpiece, imaginative, raytracing, octane render, studio lighting, professionally shot nature photo, godrays, hyperrealistic, ultra high quality, realism, wet, dripping water, wandering through the undergrowth"

4096×4096

"Spectacular Tiny World in the Transparent Jar On the Table, interior of the Great Hall, Elaborate, Carved Architecture, Anatomy, Symetrical, Geometric and Parameteric Details, Precision Flat line Details, Pattern, Dark fantasy, Dark errie mood and ineffably mysterious mood, Technical design, Intricate Ultra Detail, Ornate Detail, Stylized and Futuristic and Biomorphic Details, Architectural Concept, Low contrast Details, Cinematic Lighting, 8k, by moebius, Fullshot, Epic, Fullshot, Octane render, Unreal ,Photorealistic, Hyperrealism. "

4096×4096

"Close up illustration of a cybernetic panda in cyberpunk room typing, technology, neon, futuristic, sci-fi, electro, neon, science fiction, maximum details, fine art, 4k, highres, unforgettable."

3072×3072

"Portrait of a young woman with a serene expression and delicate features. Her light brown hair is styled into a loose braid over one shoulder, and she wears a blue headband with orange floral patterns. She has clear, luminous skin and soft pale blue eyes that convey a gentle confidence. Her attire is casually elegant, with a relaxed blue denim garment. The lighting is soft and natural, enhancing the warmth and inviting quality of the portrait."

4096×4096

"Primitive forest, towering trees, sunlight falling, vivid colors."

Detailed Comparison

Qualitative Comparison Results

The red boxes highlight the repeated object problem, while the yellow boxes denote areas with blurred and unreasonable structures.

"A professional photograph of an astronaut riding a horse."

"A cute and adorable fluffy puppy wearing a witch hat in a halloween autumn evening forest, falling autumn leaves, brown acorns on the ground, halloween pumpkins spiderwebs, bats, a witch’s broom."
