🤗 HuggingFace | 💻 Official Website: Try our model!
This repo contains PyTorch model definitions, pretrained weights, and inference/sampling code for HunyuanImage-2.1. You can try the model directly on our official website and find more visualizations on our project page.
- September 18, 2025: ✨ Try the PromptEnhancer-32B model for higher-quality prompt enhancement!
- September 18, 2025: ✨ ComfyUI workflow of HunyuanImage-2.1 is available now!
- September 16, 2025: 👑 We ranked #1 on the Arena leaderboard for open-source text-to-image models. Leaderboard
- September 12, 2025: 🚀 Released FP8 quantized models, making it possible to generate 2K images with only 24 GB of GPU memory!
- September 8, 2025: 🚀 Released inference code and model weights for HunyuanImage-2.1.
We are excited to introduce HunyuanImage-2.1, a 17B text-to-image model capable of generating 2K (2048 × 2048) resolution images.
Our architecture consists of two stages:
- Base text-to-image Model: The first stage is a text-to-image model that utilizes two text encoders: a multimodal large language model (MLLM) to improve image-text alignment, and a multi-language, character-aware encoder to enhance text rendering across various languages.
- Refiner Model: The second stage introduces a refiner model that further enhances image quality and clarity, while minimizing artifacts.
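To make the two-text-encoder design concrete, here is a minimal, illustrative sketch of how two encoder streams can feed a single DiT backbone. The hidden sizes, projection layers, and concatenation step are our own assumptions for illustration, not the repository's actual modules.

import torch

# Stand-ins for the two text encoders described above: an MLLM stream for
# semantics and a character-aware (ByT5-style) stream for glyphs.
mllm_emb = torch.randn(1, 77, 3584)   # hypothetical MLLM hidden size
glyph_emb = torch.randn(1, 77, 1472)  # hypothetical ByT5-style hidden size

# One common pattern: project each stream to the backbone width and
# concatenate along the sequence axis before attention.
model_width = 2048  # hypothetical DiT hidden size
proj_mllm = torch.nn.Linear(3584, model_width)
proj_glyph = torch.nn.Linear(1472, model_width)
text_context = torch.cat([proj_mllm(mllm_emb), proj_glyph(glyph_emb)], dim=1)
print(text_context.shape)  # torch.Size([1, 154, 2048])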
👑 We ranked #1 on the Arena leaderboard for open-source text-to-image models.
- High-Quality Generation: Efficiently produces ultra-high-definition (2K) images with cinematic composition.
- Multilingual Support: Provides native support for both Chinese and English prompts.
- Advanced Architecture: Built on a multi-modal, single- and dual-stream combined DiT (Diffusion Transformer) backbone.
- Glyph-Aware Processing: Utilizes ByT5's text rendering capabilities for improved text generation accuracy.
- Flexible Aspect Ratios: Supports a variety of image aspect ratios (1:1, 16:9, 9:16, 4:3, 3:4, 3:2, 2:3).
- Prompt Enhancement: Automatically rewrites prompts to improve descriptive accuracy and visual quality.
Hardware and OS Requirements:
- NVIDIA GPU with CUDA support. Minimum requirement for now: 24 GB GPU memory for 2048x2048 image generation.
  Note: The memory requirement above is measured with model CPU offloading and FP8 quantization enabled. If your GPU has sufficient memory, you may disable offloading for improved inference speed.
- Supported operating system: Linux.
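As a concrete starting point for a 24 GB card, the sketch below loads the pipeline with the `use_fp8` flag used in the example later in this README. The `enable_model_cpu_offload()` call is an assumption borrowed from the diffusers convention (hence the guard); check the pipeline's actual offloading API.

import torch
from hyimage.diffusion.pipelines.hunyuanimage_pipeline import HunyuanImagePipeline

# FP8-quantized weights substantially reduce memory versus full precision.
pipe = HunyuanImagePipeline.from_pretrained(model_name="hunyuanimage-v2.1", use_fp8=True)

# Hypothetical diffusers-style offload hook; offloading may already be handled
# internally, in which case moving the pipeline to the GPU is all that is needed.
if hasattr(pipe, "enable_model_cpu_offload"):
    pipe.enable_model_cpu_offload()
else:
    pipe = pipe.to("cuda")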
- Clone the repository:
git clone https://github.com/Tencent-Hunyuan/HunyuanImage-2.1.git
cd HunyuanImage-2.1
- Install dependencies:
pip install -r requirements.txt
pip install flash-attn==2.7.3 --no-build-isolation
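Before downloading weights, a quick sanity check that the GPU and the flash-attn install are usable (plain PyTorch, nothing repo-specific):

import torch

# Verify CUDA is visible and report available memory before loading the model.
assert torch.cuda.is_available(), "CUDA GPU required"
free, total = torch.cuda.mem_get_info()
print(f"GPU: {torch.cuda.get_device_name(0)}, "
      f"{free / 1024**3:.1f} GiB free of {total / 1024**3:.1f} GiB")

# flash-attn should import cleanly if the wheel built against your CUDA/torch.
import flash_attn
print("flash-attn", flash_attn.__version__)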
Details on downloading the pretrained models can be found here.
Prompt enhancement plays a crucial role in enabling our model to generate high-quality images: longer, more detailed prompts significantly improve the generated image. We encourage you to craft comprehensive, descriptive prompts to achieve the best possible image quality.
We highly recommend trying the PromptEnhancer-32B model for higher-quality prompt enhancement.
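The authoritative usage for PromptEnhancer-32B is on its model card; as a rough sketch, a standard transformers chat-style call might look like the following. The model path and chat template here are assumptions, not confirmed by this repo.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical Hugging Face path; check the PromptEnhancer-32B model card.
model_id = "tencent/PromptEnhancer-32B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

user_prompt = "A penguin plush toy painting the Mona Lisa"
messages = [{"role": "user", "content": user_prompt}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
out = model.generate(inputs, max_new_tokens=512)
enhanced = tokenizer.decode(out[0, inputs.shape[-1]:], skip_special_tokens=True)
print(enhanced)  # feed this into the HunyuanImage pipeline as the prompt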
HunyuanImage-2.1 supports only 2K image generation (e.g., 2048x2048 for 1:1 images, 2560x1536 for 16:9 images, etc.). Generating images at 1K resolution will produce artifacts.
Additionally, we highly recommend using the full generation pipeline for better quality (i.e., enabling prompt enhancement and refinement).
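Since 1K output is unsupported, it can help to snap any requested size to the closest supported 2K resolution before calling the pipeline. The helper below is our own illustration (not a repo API), using the resolutions from the usage example further down:

# Supported 2K resolutions (width, height), matching the usage example below.
SUPPORTED_2K = {
    "1:1": (2048, 2048),
    "16:9": (2560, 1536),
    "9:16": (1536, 2560),
    "4:3": (2304, 1792),
    "3:4": (1792, 2304),
}

def snap_to_supported(width: int, height: int) -> tuple[int, int]:
    """Return the supported 2K resolution whose aspect ratio is closest."""
    target = width / height
    _, (w, h) = min(
        SUPPORTED_2K.items(), key=lambda kv: abs(kv[1][0] / kv[1][1] - target)
    )
    return w, h

print(snap_to_supported(1024, 1024))  # -> (2048, 2048)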
| model type | model name | description | num_inference_steps | guidance_scale | shift |
|---|---|---|---|---|---|
| Base text-to-image model | hunyuanimage-v2.1 | Undistilled model for the best quality | 50 | 3.5 | 5 |
| Distilled text-to-image model | hunyuanimage-v2.1-distilled | Distilled model for faster inference | 8 | 3.25 | 4 |
| Refiner model | hunyuanimage-refiner | Refines the base model's output for enhanced quality and clarity | N/A | N/A | N/A |
import os
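# expandable_segments lets the CUDA allocator grow memory segments instead of
# fragmenting them, which helps at 2K resolutions.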
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'expandable_segments:True'
import torch
from hyimage.diffusion.pipelines.hunyuanimage_pipeline import HunyuanImagePipeline
# Supported model_name: hunyuanimage-v2.1, hunyuanimage-v2.1-distilled
model_name = "hunyuanimage-v2.1"
pipe = HunyuanImagePipeline.from_pretrained(model_name=model_name, use_fp8=True)
pipe = pipe.to("cuda")
# The input prompt
prompt = "A cute, cartoon-style anthropomorphic penguin plush toy with fluffy fur, standing in a painting studio, wearing a red knitted scarf and a red beret with the word \"Tencent\" on it, holding a paintbrush with a focused expression as it paints an oil painting of the Mona Lisa, rendered in a photorealistic photographic style."
# Generate with different aspect ratios
aspect_ratios = {
"16:9": (2560, 1536),
"4:3": (2304, 1792),
"1:1": (2048, 2048),
"3:4": (1792, 2304),
"9:16": (1536, 2560),
}
width, height = aspect_ratios["1:1"]
image = pipe(
prompt=prompt,
width=width,
height=height,
    # Set use_reprompt=False if you have already enhanced the prompt yourself.
    use_reprompt=False,  # built-in prompt enhancement (may increase GPU memory usage)
use_refiner=True, # Enable refiner model
# For the distilled model, use 8 steps for faster inference.
# For the non-distilled model, use 50 steps for better quality.
num_inference_steps=8 if "distilled" in model_name else 50,
guidance_scale=3.25 if "distilled" in model_name else 3.5,
shift=4 if "distilled" in model_name else 5,
seed=649151,
)
image.save("generated_image.png")
Our model can follow complex instructions to generate high‑quality, creative images.
We recommend using longer, more detailed prompts. You can also try the prompts we provide.
To improve the quality and detail of generated images, we use a prompt rewriting model. This model automatically enhances user-provided text prompts by adding detailed and descriptive information.
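Continuing from the usage example above, this rewriting step is controlled by the pipeline's `use_reprompt` flag:

# Let the pipeline rewrite the raw prompt itself (may increase GPU memory usage).
image = pipe(
    prompt="A cute penguin plush toy painting the Mona Lisa",
    width=2048,
    height=2048,
    use_reprompt=True,  # enable the built-in prompt rewriting model
    use_refiner=True,
)
image.save("generated_image_reprompted.png")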
SSAE (Structured Semantic Alignment Evaluation) is an evaluation metric for image-text alignment based on advanced multimodal large language models (MLLMs). We extracted 3500 key points across 12 categories, then used MLLMs to automatically score the generated images against these key points based on their visual content. Mean Image Accuracy averages the per-image scores (each image's accuracy over its own key points), while Global Accuracy pools all key points and averages over them directly.
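The difference between the two aggregates is easiest to see with a tiny worked example (the scores below are made up; this is not the evaluation harness):

# Per-image key-point scores (1 = key point satisfied), illustrative only.
images = [
    [1, 1, 0, 1],  # image A: 4 key points, accuracy 0.75
    [1, 0],        # image B: 2 key points, accuracy 0.5
]

# Mean Image Accuracy: average the per-image accuracies.
mean_image_acc = sum(sum(s) / len(s) for s in images) / len(images)

# Global Accuracy: pool every key point and average once.
flat = [s for img in images for s in img]
global_acc = sum(flat) / len(flat)

print(mean_image_acc)  # 0.625   ((0.75 + 0.5) / 2)
print(global_acc)      # 0.666...  (4 of 6 key points)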
| Model | Open Source | Mean Image Accuracy | Global Accuracy | Primary Subject: Noun | Primary Subject: Key Attributes | Primary Subject: Other Attributes | Primary Subject: Action | Secondary Subject: Noun | Secondary Subject: Attributes | Secondary Subject: Action | Scene: Noun | Scene: Attributes | Other: Shot | Other: Style | Other: Composition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FLUX-dev | ✅ | 0.7122 | 0.6995 | 0.7965 | 0.7824 | 0.5993 | 0.5777 | 0.7950 | 0.6826 | 0.6923 | 0.8453 | 0.8094 | 0.6452 | 0.7096 | 0.6190 |
| Seedream-3.0 | ❌ | 0.8827 | 0.8792 | 0.9490 | 0.9311 | 0.8242 | 0.8177 | 0.9747 | 0.9103 | 0.8400 | 0.9489 | 0.8848 | 0.7582 | 0.8726 | 0.7619 |
| Qwen-Image | ✅ | 0.8854 | 0.8828 | 0.9502 | 0.9231 | 0.8351 | 0.8161 | 0.9938 | 0.9043 | 0.8846 | 0.9613 | 0.8978 | 0.7634 | 0.8548 | 0.8095 |
| GPT-Image | ❌ | 0.8952 | 0.8929 | 0.9448 | 0.9289 | 0.8655 | 0.8445 | 0.9494 | 0.9283 | 0.8800 | 0.9432 | 0.9017 | 0.7253 | 0.8582 | 0.7143 |
| HunyuanImage 2.1 | ✅ | 0.8888 | 0.8832 | 0.9339 | 0.9341 | 0.8363 | 0.8342 | 0.9627 | 0.8870 | 0.9615 | 0.9448 | 0.9254 | 0.7527 | 0.8689 | 0.7619 |
The SSAE results show that our model currently achieves the best semantic alignment among open-source models and comes very close to closed-source commercial models (GPT-Image).
We adopted the GSB (Good/Same/Bad) evaluation method, commonly used to assess the relative performance of two models from an overall image-perception perspective. In total, we used 1000 text prompts and generated an equal number of image samples for all compared models in a single run. For a fair comparison, we ran inference only once per prompt, avoiding any cherry-picking of results, and kept the default settings for all baseline models. The evaluation was performed by more than 100 professional evaluators.

HunyuanImage 2.1 achieved a relative win rate of -1.36% against Seedream 3.0 (closed-source) and +2.89% against Qwen-Image (open-source). These results demonstrate that HunyuanImage 2.1, as an open-source model, has reached an image-generation quality comparable to closed-source commercial models (Seedream 3.0), while holding an advantage over comparable open-source models (Qwen-Image). This validates the technical advancement and practical value of HunyuanImage 2.1 for text-to-image generation.
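For reference on how such percentages are read: in a GSB comparison, the relative win rate is conventionally (Good - Bad) / Total. A quick illustration with invented vote counts (not the actual evaluation data):

def gsb_win_rate(good: int, same: int, bad: int) -> float:
    """Relative win rate as a percentage: (good - bad) / total votes."""
    total = good + same + bad
    return 100 * (good - bad) / total

# Invented counts, chosen only to show how a figure like -1.36% can arise.
print(round(gsb_win_rate(good=800, same=866, bad=834), 2))  # -1.36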
Feel free to join our Discord server or our WeChat groups, not only to exchange ideas and explore collaboration but also to ask any questions you might have. You're also welcome to open an issue or submit a pull request on GitHub. Your feedback is valuable to us and helps drive HunyuanImage forward. Thank you for being part of our community!
If you find this project useful for your research and applications, please cite as:
@misc{HunyuanImage-2.1,
title={HunyuanImage 2.1: An Efficient Diffusion Model for High-Resolution (2K) Text-to-Image Generation},
author={Tencent Hunyuan Team},
year={2025},
howpublished={\url{https://github.com/Tencent-Hunyuan/HunyuanImage-2.1}},
}
We would like to thank the following open-source projects and communities for their contributions to open research and exploration: Qwen, FLUX, diffusers and HuggingFace.