init #2852


Open · wants to merge 15 commits into base: main

Conversation

DerekLiu35

Preparing the Article

  • Add an entry to _blog.yml.
  • Add a thumbnail. There are no requirements here, but there is a template if it's helpful.
  • Check you use a short title and blog path.
  • Upload any additional assets (such as images) to the Documentation Images repo. This is to reduce bloat in the GitHub base repo when cloning and pulling. Try to have small images to avoid a slow or expensive user experience.
  • Add metadata (such as authors) to your md file. You can also specify guest or org for the authors.
  • Ensure the publication date is correct.
  • Preview the content. A quick way is to paste the markdown content into https://huggingface.co/new-blog. Do not click publish; this is just a way to do an early check.

DerekLiu35 marked this pull request as ready for review May 13, 2025 20:33
@DerekLiu35 (Author)

@SunMarc

SunMarc requested a review from sayakpaul May 14, 2025 17:17
@sayakpaul (Member) left a comment

Thanks for working on this.


**BF16:**

![Baroque, Futurist, and Noir style images generated with BF16 precision](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/quantization-backends-diffusers/combined_flux-dev_bf16_combined.png)
Member

Let's also provide an actual caption for the figure.

Comment on lines 37 to 42
**BnB 4-bit:**

![Baroque, Futurist, and Noir style images generated with BnB 4-bit](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/quantization-backends-diffusers/combined_flux-dev_bnb_4bit_combined.png)

**BnB 8-bit:**
![Baroque, Futurist, and Noir style images generated with BnB 8-bit](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/quantization-backends-diffusers/combined_flux-dev_bnb_8bit_combined.png)
Member

Can we combine three images here?

  1. BF16
  2. 4bit
  3. 8bit

Along with the caption?

**BnB 8-bit:**
![Baroque, Futurist, and Noir style images generated with BnB 8-bit](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/quantization-backends-diffusers/combined_flux-dev_bnb_8bit_combined.png)

| BnB Precision | Memory after loading | Peak memory | Inference time |
Member

Let's include the BF16 numbers too.

| Q8_0 | 21.502 GB | 25.973 GB | 15 seconds |
| Q2_k | 13.264 GB | 17.752 GB | 26 seconds |

**Example (Flux-dev with GGUF Q4_1)**
Member

I don't think we have to be exhaustive about showing snippets for every configuration unless they vary significantly from one another.


For more information check out the [GGUF docs](https://huggingface.co/docs/diffusers/quantization/gguf).
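
For reference, here is a rough sketch of what the GGUF Q4_1 snippet could look like (the exact checkpoint filename from the community `city96/FLUX.1-dev-gguf` repo is an assumption):

```python
import torch
from diffusers import FluxPipeline, FluxTransformer2DModel, GGUFQuantizationConfig

# Assumed path to a community Q4_1 GGUF checkpoint of the Flux-dev transformer.
ckpt_path = "https://huggingface.co/city96/FLUX.1-dev-gguf/blob/main/flux1-dev-Q4_1.gguf"

# Load only the transformer from the single GGUF file, computing in BF16.
transformer = FluxTransformer2DModel.from_single_file(
    ckpt_path,
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
)

# Plug the GGUF transformer into the regular Flux pipeline.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()

image = pipe("A serene lake at dawn, watercolor style").images[0]
image.save("flux-dev_gguf_q4_1.png")
```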

### FP8 Layerwise Casting (`enable_layerwise_casting`)
Member

Could also write that it can be combined with group_offloading:
https://huggingface.co/docs/diffusers/en/optimization/memory#group-offloading
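
For example, a rough sketch of combining the two, assuming a recent diffusers version where `enable_layerwise_casting` and `enable_group_offload` are available on model components (exact arguments are illustrative):

```python
import torch
from diffusers import AutoModel, FluxPipeline

model_id = "black-forest-labs/FLUX.1-dev"

transformer = AutoModel.from_pretrained(
    model_id, subfolder="transformer", torch_dtype=torch.bfloat16
)

# Store weights in FP8 and upcast each layer to BF16 on the fly during compute.
transformer.enable_layerwise_casting(
    storage_dtype=torch.float8_e4m3fn, compute_dtype=torch.bfloat16
)

# Keep most of the transformer on CPU and stream groups of layers to the GPU as needed.
transformer.enable_group_offload(
    onload_device=torch.device("cuda"),
    offload_device=torch.device("cpu"),
    offload_type="leaf_level",
    use_stream=True,
)

pipe = FluxPipeline.from_pretrained(
    model_id, transformer=transformer, torch_dtype=torch.bfloat16
)
# The group-offloaded transformer stays managed by its hooks; move the other components to the GPU.
# (The T5 encoder could be treated the same way; kept simple here.)
pipe.text_encoder.to("cuda")
pipe.text_encoder_2.to("cuda")
pipe.vae.to("cuda")
```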


We created a setup where you can provide a prompt, and we generate results using both the original, high-precision model (e.g., Flux-dev in BF16) and several quantized versions (BnB 4-bit, BnB 8-bit). The generated images are then presented to you and your challenge is to identify which ones came from the quantized models.

Try it out [here](https://huggingface.co/spaces/derekl35/flux-quant)!
Member

Let's embed the space inside the blog post.

Here's a quick guide to choosing a quantization backend:

* **Easiest Memory Savings (NVIDIA):** Start with `bitsandbytes` 4/8-bit.
* **Prioritize Inference Speed:** `torchao` + `torch.compile` offers the best performance potential.
Member

GGUF also supports torch.compile(). So does bitsandbytes. I think we should mention that.

* **Simplicity (Hopper/Ada):** Explore FP8 Layerwise Casting (`enable_layerwise_casting`).
* **For Using Existing GGUF Models:** Use GGUF loading (`from_single_file`).

Quantization significantly lowers the barrier to entry for using large diffusion models. Experiment with these backends to find the best balance of memory, speed, and quality for your needs.
Member

Should we hint to the readers that they can expect a follow-up blog post around training with quantization?

Member

Yeah, would be great!


ChunTeLee commented May 15, 2025

[Thumbnail: Exploring Quantization Backends in Diffusers]

Hey there, here is the thumbnail suggestion! cc @sayakpaul

@sayakpaul (Member)

@ChunTeLee is it possible to reduce the size of the middle object a bit so that the words "exploring" and "quantization" are clear?

@ChunTeLee

[Thumbnail: Exploring Quantization Backends in Diffusers]

Here you go!

@SunMarc (Member) left a comment

Thanks a lot! This is really nice. I feel like it could be nice to add a bit more detail, but overall I think we can ship this very soon!

Comment on lines 317 to 319
## Combining with Memory Optimizations

Most of these quantization backends can be combined with the memory optimization techniques offered in Diffusers. For example, using `enable_model_cpu_offload()` with `bitsandbytes` cuts the memory further, giving a reasonable trade-off between memory and latency. You can learn more about these techniques in the [Diffusers documentation](https://huggingface.co/docs/diffusers/main/en/optimization/memory).
Member

What would be nice would be to show some results here with enable_model_cpu_offload and torch.compile! Maybe we can test these with bnb 4-bit, as it is compatible with the main version.

Then we can also combine all the benchmark results and put them into a dataset so that users can easily compare.
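
For reference, a rough sketch of the combination discussed here, i.e. bnb 4-bit (NF4) on the transformer and T5 encoder plus `enable_model_cpu_offload()`; the prompt and exact settings are placeholders:

```python
import torch
from diffusers import AutoModel, FluxPipeline
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
from transformers import T5EncoderModel
from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig

model_id = "black-forest-labs/FLUX.1-dev"

# 4-bit NF4 quantization for the T5 encoder (a transformers model)...
text_encoder_2 = T5EncoderModel.from_pretrained(
    model_id,
    subfolder="text_encoder_2",
    quantization_config=TransformersBitsAndBytesConfig(
        load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16
    ),
    torch_dtype=torch.bfloat16,
)

# ...and for the Flux transformer (a diffusers model).
transformer = AutoModel.from_pretrained(
    model_id,
    subfolder="transformer",
    quantization_config=DiffusersBitsAndBytesConfig(
        load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16
    ),
    torch_dtype=torch.bfloat16,
)

pipe = FluxPipeline.from_pretrained(
    model_id,
    transformer=transformer,
    text_encoder_2=text_encoder_2,
    torch_dtype=torch.bfloat16,
)

# Components are moved to the GPU only while they are needed, then offloaded back to CPU.
pipe.enable_model_cpu_offload()

image = pipe("A cozy cabin in a snowy forest, warm light in the windows").images[0]
```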

@sayakpaul (Member) May 19, 2025

Having a central dataset sounds good, and it enables systematic exploration.

Author

Sounds good to me. Should I include code too? Not sure how to do torch.compile with pipeline level quant config

Member

This would be a cool feature to add to the pipeline-level quant config actually, @sayakpaul. Maybe you can open a PR in diffusers to support this scenario, @DerekLiu35?

Member

@SunMarc which scenario are you talking about? To maybe better focus on the blog post, it'd be good to file that request on our repo or discuss it on Slack.

@DerekLiu35

> Sounds good to me. Should I include code too? Not sure how to do torch.compile with pipeline level quant config

We can individually compile the components of a pipeline. Like this:
https://github.com/sayakpaul/diffusers-torchao/blob/9b9f2383dccb8eb73a4c6a8ffe736dd9610c26d2/inference/benchmark_image.py#L40
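
In other words, something along these lines (a sketch assuming a pipeline-level bnb 4-bit config; the compile settings are illustrative and not benchmarked):

```python
import torch
from diffusers import FluxPipeline
from diffusers.quantizers import PipelineQuantizationConfig

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    quantization_config=PipelineQuantizationConfig(
        quant_backend="bitsandbytes_4bit",
        quant_kwargs={"load_in_4bit": True, "bnb_4bit_compute_dtype": torch.bfloat16},
        components_to_quantize=["transformer", "text_encoder_2"],
    ),
    torch_dtype=torch.bfloat16,
).to("cuda")

# Compile the components one by one rather than the whole pipeline.
# Stricter settings (fullgraph=True, mode="max-autotune") follow the linked script
# but may need adjustment depending on the quantization backend.
pipe.transformer = torch.compile(pipe.transformer)
pipe.vae.decode = torch.compile(pipe.vae.decode)

# The first call pays the compilation cost; later calls are faster.
image = pipe("warm-up prompt", num_inference_steps=4).images[0]
```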

Member

Yeah, let's do that later!

></script>

Building on our previous post, "[Memory-efficient Diffusion Transformers with Quanto and Diffusers](https://huggingface.co/blog/quanto-diffusers)", this post explores the diverse quantization backends integrated directly into Hugging Face Diffusers. We'll examine how bitsandbytes, GGUF, torchao, and native FP8 support make large and powerful models more accessible, demonstrating their use with Flux (a flow-based text-to-image generation model).

Member

Let's do a quick introduction of the pipeline that we will be quantizing: what each component does, how much memory it takes, and what might be interesting to quantize.

Member

I believe it could be done in bullet points.

Comment on lines 70 to 73
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
from diffusers.quantizers import PipelineQuantizationConfig
from transformers import T5EncoderModel
from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig
Member

Maybe explain somewhere that readers need to be careful when they import the configs DiffusersBitsAndBytesConfig and TransformersBitsAndBytesConfig, since the components come from different libraries, and that if they don't want to deal with that, they can use this instead.
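
Presumably this refers to the pipeline-level quantization config imported above; a rough sketch of that alternative (the exact quant_kwargs are illustrative):

```python
import torch
from diffusers import FluxPipeline
from diffusers.quantizers import PipelineQuantizationConfig

# One config object for the whole pipeline; no need to juggle the two
# same-named BitsAndBytesConfig classes from diffusers and transformers.
pipeline_quant_config = PipelineQuantizationConfig(
    quant_backend="bitsandbytes_4bit",
    quant_kwargs={
        "load_in_4bit": True,
        "bnb_4bit_quant_type": "nf4",
        "bnb_4bit_compute_dtype": torch.bfloat16,
    },
    components_to_quantize=["transformer", "text_encoder_2"],
)

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    quantization_config=pipeline_quant_config,
    torch_dtype=torch.bfloat16,
).to("cuda")
```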


| torchao Precision | Memory after loading | Peak memory | Inference time |
|-------------------------------|----------------------|-------------|----------------|
| int4_weight_only | 10.635 GB | 14.654 GB | 109 seconds |
Member

Any idea why it takes so much time?

Member

I remember I faced something like this while working on https://github.com/sayakpaul/diffusers-torchao. IIUC it was a combination of the unavailability of a good kernel, shape constraints, etc.


| quanto Precision | Memory after loading | Peak memory | Inference time |
|------------------|----------------------|-------------|----------------|
| int4 | 12.254 GB | 16.139 GB | 109 seconds |
Member

same here

)
```

> **Note:** At the time of writing, for float8 support with Quanto, you'll need `optimum-quanto<0.2.5` and use quanto directly.
Member

Is this why you didn't put the results for float8?

Author

Yeah, would need to add code to show how to quantize with float8:

```python
import torch
from diffusers import AutoModel, FluxPipeline
from transformers import T5EncoderModel
from optimum.quanto import freeze, qfloat8, quantize

model_id = "black-forest-labs/FLUX.1-dev"

# Quantize the T5 text encoder to float8 with optimum-quanto, then freeze the weights.
text_encoder_2 = T5EncoderModel.from_pretrained(
    model_id,
    subfolder="text_encoder_2",
    torch_dtype=torch.bfloat16,
)
quantize(text_encoder_2, weights=qfloat8)
freeze(text_encoder_2)

# Do the same for the Flux transformer, the largest component of the pipeline.
transformer = AutoModel.from_pretrained(
    model_id,
    subfolder="transformer",
    torch_dtype=torch.bfloat16,
)
quantize(transformer, weights=qfloat8)
freeze(transformer)

pipe = FluxPipeline.from_pretrained(
    model_id,
    transformer=transformer,
    text_encoder_2=text_encoder_2,
    torch_dtype=torch.bfloat16,
).to("cuda")

pipe_kwargs = {
    "prompt": "Ghibli style, a fantasy landscape with a grey castle with multiple tall, blue-roofed towers, beside a clear, flowing river in a lush green valley.",
    "height": 1024,
    "width": 1024,
    "guidance_scale": 3.5,
    "num_inference_steps": 50,
    "max_sequence_length": 512,
}

# Peak reserved memory before and after running inference.
print(f"Pipeline memory usage: {torch.cuda.max_memory_reserved() / 1024**3:.3f} GB")

image = pipe(**pipe_kwargs, generator=torch.manual_seed(0)).images[0]

print(f"Pipeline memory usage: {torch.cuda.max_memory_reserved() / 1024**3:.3f} GB")

image.save("flux-dev_quanto_fp8.png")
```

Member

Yeah, let's add the code, but make sure to say that we will be working on fixing this.

Comment on lines 321 to 331
## Spot The Quantized Model

Quantization sounds great for saving memory, but how much does it *really* affect the final image? Can you even spot the difference? We invite you to test your perception!

We created a setup where you can provide a prompt, and we generate results using both the original, high-precision model (e.g., Flux-dev in BF16) and several quantized versions (BnB 4-bit, BnB 8-bit). The generated images are then presented to you and your challenge is to identify which ones came from the quantized models.

Try it out here!

<gradio-app theme_mode="light" space="derekl35/flux-quant"></gradio-app>

Often, especially with 8-bit quantization, the differences are subtle and may not be noticeable without close inspection. More aggressive quantization like 4-bit or lower might be more noticeable, but the results can still be good, especially considering the massive memory savings.
Member

Maybe we can put that at the top, since users will definitely be super interested in reading the rest of the blog post after playing with the space!

Member

NF4 often gives the best trade-off though, at least for image models, and it saves a considerable amount of memory. Also see this tip here:


* **Simplicity (Hopper/Ada):** Explore FP8 Layerwise Casting (`enable_layerwise_casting`).
* **For Using Existing GGUF Models:** Use GGUF loading (`from_single_file`).

Quantization significantly lowers the barrier to entry for using large diffusion models. Experiment with these backends to find the best balance of memory, speed, and quality for your needs.
Member

Yeah, would be great!

* **For Using Existing GGUF Models:** Use GGUF loading (`from_single_file`).
* **Curious about training with quantization?** Stay tuned for a follow-up blog post on that topic!

Quantization significantly lowers the barrier to entry for using large diffusion models. Experiment with these backends to find the best balance of memory, speed, and quality for your needs.
Member

Let's make sure to acknowledge @ChunTeLee for providing a nice thumbnail.

Comment on lines 38 to 50
Before diving into the quantization backends, let's introduce the FluxPipeline and its components, which we'll be quantizing:

* **Text Encoders (CLIP and T5):**
* **Function:** Process input text prompts. FLUX-dev uses CLIP for initial understanding and a larger T5 for nuanced comprehension and better text rendering.
* **Memory:** T5 - 9.52 GB; CLIP - 246 MB (in BF16)
* **Transformer (Main Model - MMDiT):**
* **Function:** Core generative part (Multimodal Diffusion Transformer). Generates images in latent space from text embeddings.
* **Memory:** 23.8 GB (in BF16)
* **Variational Auto-Encoder (VAE):**
* **Function:** Translates images between pixel and latent space. Decodes generated latent representation to a pixel-based image.
* **Memory:** 168 MB (in BF16)
* **Focus of Quantization:** Examples will primarily focus on the `transformer` and `text_encoder_2` (T5) for the most substantial memory savings.

Member

Nice! Can you add a link to https://huggingface.co/black-forest-labs/FLUX.1-dev? You can also specify the memory needed to load the model.
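
One way to estimate those per-component numbers is to sum the parameter sizes, e.g. with this rough sketch (buffers and activation memory are not counted):

```python
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)

# Sum parameter sizes to estimate how much memory each component needs in BF16.
for name, component in pipe.components.items():
    if isinstance(component, torch.nn.Module):
        size_gb = sum(p.numel() * p.element_size() for p in component.parameters()) / 1024**3
        print(f"{name}: {size_gb:.2f} GB")
```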

Comment on lines 69 to 72
from diffusers import AutoModel, FluxPipeline
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
from diffusers.quantizers import PipelineQuantizationConfig
from transformers import T5EncoderModel
@SunMarc (Member) May 20, 2025

Clean up some of the imports since we don't use them; same for the other snippets.
