init #2852


Open · wants to merge 15 commits into base: main

Conversation

DerekLiu35

Preparing the Article

  • Add an entry to _blog.yml.
  • Add a thumbnail. There are no requirements here, but there is a template if it's helpful.
  • Check you use a short title and blog path.
  • Upload any additional assets (such as images) to the Documentation Images repo. This is to reduce bloat in the GitHub base repo when cloning and pulling. Try to have small images to avoid a slow or expensive user experience.
  • Add metadata (such as authors) to your md file. You can also specify guest or org for the authors.
  • Ensure the publication date is correct.
  • Preview the content. A quick way is to paste the markdown content into https://huggingface.co/new-blog. Do not click publish; this is just a way to do an early check.

DerekLiu35 marked this pull request as ready for review May 13, 2025 20:33
@DerekLiu35 (Author)

@SunMarc

SunMarc requested a review from sayakpaul May 14, 2025 17:17
@sayakpaul (Member) left a comment

Thanks for working on this.


**BF16:**

![Baroque, Futurist, and Noir style images generated with BF16 precision](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/quantization-backends-diffusers/combined_flux-dev_bf16_combined.png)
Member

Let's also provide an actual caption for the figure.

Comment on lines 37 to 42
**BnB 4-bit:**

![Baroque, Futurist, and Noir style images generated with BnB 4-bit](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/quantization-backends-diffusers/combined_flux-dev_bnb_4bit_combined.png)

**BnB 8-bit:**
![Baroque, Futurist, and Noir style images generated with BnB 8-bit](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/quantization-backends-diffusers/combined_flux-dev_bnb_8bit_combined.png)
Member

Can we combine three images here?

  1. BF16
  2. 4bit
  3. 8bit

Along with the caption?

**BnB 8-bit:**
![Baroque, Futurist, and Noir style images generated with BnB 8-bit](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/quantization-backends-diffusers/combined_flux-dev_bnb_8bit_combined.png)

| BnB Precision | Memory after loading | Peak memory | Inference time |
Member

Let's include the BF16 numbers too.

| Q8_0 | 21.502 GB | 25.973 GB | 15 seconds |
| Q2_k | 13.264 GB | 17.752 GB | 26 seconds |

**Example (Flux-dev with GGUF Q4_1)**
Member

I don't think we have to be exhaustive about showing snippets for every configuration unless they vary significantly from one another.


For more information check out the [GGUF docs](https://huggingface.co/docs/diffusers/quantization/gguf).
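
For reference, here is a rough sketch of what the GGUF Q4_1 snippet could look like (the exact checkpoint filename from the community `city96/FLUX.1-dev-gguf` repo is an assumption):

```python
import torch
from diffusers import FluxPipeline, FluxTransformer2DModel, GGUFQuantizationConfig

# Assumed path to a community Q4_1 GGUF checkpoint of the Flux-dev transformer.
ckpt_path = "https://huggingface.co/city96/FLUX.1-dev-gguf/blob/main/flux1-dev-Q4_1.gguf"

# Load only the transformer from the single GGUF file, computing in BF16.
transformer = FluxTransformer2DModel.from_single_file(
    ckpt_path,
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
)

# Plug the GGUF transformer into the regular Flux pipeline.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()

image = pipe("A serene lake at dawn, watercolor style").images[0]
image.save("flux-dev_gguf_q4_1.png")
```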

### FP8 Layerwise Casting (`enable_layerwise_casting`)
Member

Could also write that it can be combined with group_offloading:
https://huggingface.co/docs/diffusers/en/optimization/memory#group-offloading
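
For example, a rough sketch of combining the two, assuming a recent diffusers version where `enable_layerwise_casting` and `enable_group_offload` are available on model components (exact arguments are illustrative):

```python
import torch
from diffusers import AutoModel, FluxPipeline

model_id = "black-forest-labs/FLUX.1-dev"

transformer = AutoModel.from_pretrained(
    model_id, subfolder="transformer", torch_dtype=torch.bfloat16
)

# Store weights in FP8 and upcast each layer to BF16 on the fly during compute.
transformer.enable_layerwise_casting(
    storage_dtype=torch.float8_e4m3fn, compute_dtype=torch.bfloat16
)

# Keep most of the transformer on CPU and stream groups of layers to the GPU as needed.
transformer.enable_group_offload(
    onload_device=torch.device("cuda"),
    offload_device=torch.device("cpu"),
    offload_type="leaf_level",
    use_stream=True,
)

pipe = FluxPipeline.from_pretrained(
    model_id, transformer=transformer, torch_dtype=torch.bfloat16
)
# The group-offloaded transformer stays managed by its hooks; move the other components to the GPU.
# (The T5 encoder could be treated the same way; kept simple here.)
pipe.text_encoder.to("cuda")
pipe.text_encoder_2.to("cuda")
pipe.vae.to("cuda")
```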


We created a setup where you can provide a prompt, and we generate results using both the original, high-precision model (e.g., Flux-dev in BF16) and several quantized versions (BnB 4-bit, BnB 8-bit). The generated images are then presented to you and your challenge is to identify which ones came from the quantized models.

Try it out [here](https://huggingface.co/spaces/derekl35/flux-quant)!
Member

Let's embed the space inside the blog post.

Here's a quick guide to choosing a quantization backend:

* **Easiest Memory Savings (NVIDIA):** Start with `bitsandbytes` 4/8-bit.
* **Prioritize Inference Speed:** `torchao` + `torch.compile` offers the best performance potential.
Member

GGUF also supports torch.compile(). So does bitsandbytes. I think we should mention that.

* **Simplicity (Hopper/Ada):** Explore FP8 Layerwise Casting (`enable_layerwise_casting`).
* **For Using Existing GGUF Models:** Use GGUF loading (`from_single_file`).

Quantization significantly lowers the barrier to entry for using large diffusion models. Experiment with these backends to find the best balance of memory, speed, and quality for your needs.
Member

Should we hint to the readers that they can expect a follow-up blog post around training with quantization?

Member

Yeah, would be great!


ChunTeLee commented May 15, 2025

[Thumbnail: Exploring Quantization Backends in Diffusers]

Hey there, here is the thumbnail suggestion! cc @sayakpaul

@sayakpaul (Member)

@ChunTeLee is it possible to reduce the size of the middle object a bit so that the words "exploring" and "quantization" are clear?

@ChunTeLee

[Thumbnail: Exploring Quantization Backends in Diffusers]

Here you go!

@SunMarc (Member) left a comment

Thanks a lot! This is really nice. I feel like it could be nice to add a bit more detail, but overall I think we can ship this very soon!

Comment on lines 317 to 319
## Combining with Memory Optimizations

Most of these quantization backends can be combined with the memory optimization techniques offered in Diffusers. For example, using `enable_model_cpu_offload()` with `bitsandbytes` cuts the memory further, giving a reasonable trade-off between memory and latency. You can learn more about these techniques in the [Diffusers documentation](https://huggingface.co/docs/diffusers/main/en/optimization/memory).
Member

What would be nice would be to show some results here with enable_model_cpu_offload and torch.compile! Maybe we can test these with bnb 4-bit, as it is compatible with the main version.

Then we can also combine all the benchmark results and put them into a dataset so that users can easily compare.
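
For reference, a rough sketch of the combination discussed here, i.e. bnb 4-bit (NF4) on the transformer and T5 encoder plus `enable_model_cpu_offload()`; the prompt and exact settings are placeholders:

```python
import torch
from diffusers import AutoModel, FluxPipeline
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
from transformers import T5EncoderModel
from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig

model_id = "black-forest-labs/FLUX.1-dev"

# 4-bit NF4 quantization for the T5 encoder (a transformers model)...
text_encoder_2 = T5EncoderModel.from_pretrained(
    model_id,
    subfolder="text_encoder_2",
    quantization_config=TransformersBitsAndBytesConfig(
        load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16
    ),
    torch_dtype=torch.bfloat16,
)

# ...and for the Flux transformer (a diffusers model).
transformer = AutoModel.from_pretrained(
    model_id,
    subfolder="transformer",
    quantization_config=DiffusersBitsAndBytesConfig(
        load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16
    ),
    torch_dtype=torch.bfloat16,
)

pipe = FluxPipeline.from_pretrained(
    model_id,
    transformer=transformer,
    text_encoder_2=text_encoder_2,
    torch_dtype=torch.bfloat16,
)

# Components are moved to the GPU only while they are needed, then offloaded back to CPU.
pipe.enable_model_cpu_offload()

image = pipe("A cozy cabin in a snowy forest, warm light in the windows").images[0]
```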

@sayakpaul (Member) May 19, 2025

Having a central dataset sounds good, and it enables systematic exploration.

Author

Sounds good to me. Should I include code too? Not sure how to do torch.compile with pipeline level quant config

Member

This would be a cool feature to add to the pipeline-level quant config actually, @sayakpaul. Maybe you can open a PR in diffusers to support this scenario, @DerekLiu35?

Member

@SunMarc which scenario are you talking about? To maybe better focus on the blog post, it'd be good to file that request on our repo or discuss it on Slack.

@DerekLiu35

> Sounds good to me. Should I include code too? Not sure how to do torch.compile with pipeline level quant config

We can individually compile the components of a pipeline. Like this:
https://github.com/sayakpaul/diffusers-torchao/blob/9b9f2383dccb8eb73a4c6a8ffe736dd9610c26d2/inference/benchmark_image.py#L40
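
In other words, something along these lines (a sketch assuming a pipeline-level bnb 4-bit config; the compile settings are illustrative and not benchmarked):

```python
import torch
from diffusers import FluxPipeline
from diffusers.quantizers import PipelineQuantizationConfig

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    quantization_config=PipelineQuantizationConfig(
        quant_backend="bitsandbytes_4bit",
        quant_kwargs={"load_in_4bit": True, "bnb_4bit_compute_dtype": torch.bfloat16},
        components_to_quantize=["transformer", "text_encoder_2"],
    ),
    torch_dtype=torch.bfloat16,
).to("cuda")

# Compile the components one by one rather than the whole pipeline.
# Stricter settings (fullgraph=True, mode="max-autotune") follow the linked script
# but may need adjustment depending on the quantization backend.
pipe.transformer = torch.compile(pipe.transformer)
pipe.vae.decode = torch.compile(pipe.vae.decode)

# The first call pays the compilation cost; later calls are faster.
image = pipe("warm-up prompt", num_inference_steps=4).images[0]
```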

Member

Yeah, let's do that later!

></script>

Building on our previous post, "[Memory-efficient Diffusion Transformers with Quanto and Diffusers](https://huggingface.co/blog/quanto-diffusers)", this post explores the diverse quantization backends integrated directly into Hugging Face Diffusers. We'll examine how bitsandbytes, GGUF, torchao, and native FP8 support make large and powerful models more accessible, demonstrating their use with Flux (a flow-based text-to-image generation model).

Member

Let's do a quick introduction of the pipeline that we will be quantizing: what each component does, how much memory it takes, and what might be interesting to quantize.

Member

I believe it could be done in bullet points.

Comment on lines 70 to 73
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
from diffusers.quantizers import PipelineQuantizationConfig
from transformers import T5EncoderModel
from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig
Member

Maybe explain somewhere that readers need to be careful when they import the configs DiffusersBitsAndBytesConfig and TransformersBitsAndBytesConfig, since the components come from different libraries, and that if they don't want to deal with that, they can use this instead.
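
Presumably this refers to the pipeline-level quantization config imported above; a rough sketch of that alternative (the exact quant_kwargs are illustrative):

```python
import torch
from diffusers import FluxPipeline
from diffusers.quantizers import PipelineQuantizationConfig

# One config object for the whole pipeline; no need to juggle the two
# same-named BitsAndBytesConfig classes from diffusers and transformers.
pipeline_quant_config = PipelineQuantizationConfig(
    quant_backend="bitsandbytes_4bit",
    quant_kwargs={
        "load_in_4bit": True,
        "bnb_4bit_quant_type": "nf4",
        "bnb_4bit_compute_dtype": torch.bfloat16,
    },
    components_to_quantize=["transformer", "text_encoder_2"],
)

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    quantization_config=pipeline_quant_config,
    torch_dtype=torch.bfloat16,
).to("cuda")
```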


| torchao Precision | Memory after loading | Peak memory | Inference time |
|-------------------------------|----------------------|-------------|----------------|
| int4_weight_only | 10.635 GB | 14.654 GB | 109 seconds |
Member

Any idea why it takes so much time?

Member

I remember I faced something like this while working on https://github.com/sayakpaul/diffusers-torchao. IIUC it was a combination of the unavailability of a good kernel, shape constraints, etc.


| quanto Precision | Memory after loading | Peak memory | Inference time |
|------------------|----------------------|-------------|----------------|
| int4 | 12.254 GB | 16.139 GB | 109 seconds |
Member

same here

)
```

> **Note:** At the time of writing, for float8 support with Quanto, you'll need `optimum-quanto<0.2.5` and use quanto directly.
Member

Is this why you didn't put the results for float8?

Author

Yeah, would need to add code to show how to quantize with float8:

```python
import torch
from diffusers import AutoModel, FluxPipeline
from transformers import T5EncoderModel
from optimum.quanto import freeze, qfloat8, quantize

model_id = "black-forest-labs/FLUX.1-dev"

# Quantize the T5 text encoder to float8 with optimum-quanto, then freeze the weights.
text_encoder_2 = T5EncoderModel.from_pretrained(
    model_id,
    subfolder="text_encoder_2",
    torch_dtype=torch.bfloat16,
)
quantize(text_encoder_2, weights=qfloat8)
freeze(text_encoder_2)

# Do the same for the Flux transformer, the largest component of the pipeline.
transformer = AutoModel.from_pretrained(
    model_id,
    subfolder="transformer",
    torch_dtype=torch.bfloat16,
)
quantize(transformer, weights=qfloat8)
freeze(transformer)

pipe = FluxPipeline.from_pretrained(
    model_id,
    transformer=transformer,
    text_encoder_2=text_encoder_2,
    torch_dtype=torch.bfloat16,
).to("cuda")

pipe_kwargs = {
    "prompt": "Ghibli style, a fantasy landscape with a grey castle with multiple tall, blue-roofed towers, beside a clear, flowing river in a lush green valley.",
    "height": 1024,
    "width": 1024,
    "guidance_scale": 3.5,
    "num_inference_steps": 50,
    "max_sequence_length": 512,
}

# Peak reserved memory before and after running inference.
print(f"Pipeline memory usage: {torch.cuda.max_memory_reserved() / 1024**3:.3f} GB")

image = pipe(**pipe_kwargs, generator=torch.manual_seed(0)).images[0]

print(f"Pipeline memory usage: {torch.cuda.max_memory_reserved() / 1024**3:.3f} GB")

image.save("flux-dev_quanto_fp8.png")
```

Member

Yeah, let's add the code, but make sure to say that we will be working on fixing this.

Comment on lines 321 to 331
## Spot The Quantized Model

Quantization sounds great for saving memory, but how much does it *really* affect the final image? Can you even spot the difference? We invite you to test your perception!

We created a setup where you can provide a prompt, and we generate results using both the original, high-precision model (e.g., Flux-dev in BF16) and several quantized versions (BnB 4-bit, BnB 8-bit). The generated images are then presented to you and your challenge is to identify which ones came from the quantized models.

Try it out here!

<gradio-app theme_mode="light" space="derekl35/flux-quant"></gradio-app>

Often, especially with 8-bit quantization, the differences are subtle and may not be noticeable without close inspection. More aggressive quantization like 4-bit or lower might be more noticeable, but the results can still be good, especially considering the massive memory savings.
Member

Maybe we can put that at the top, since users will definitely be super interested in reading the rest of the blog post after playing with the space!

Member

NF4 often gives the best trade-off though, at least for image models, and it saves a considerable amount of memory. Also see this tip here:


* **Simplicity (Hopper/Ada):** Explore FP8 Layerwise Casting (`enable_layerwise_casting`).
* **For Using Existing GGUF Models:** Use GGUF loading (`from_single_file`).

Quantization significantly lowers the barrier to entry for using large diffusion models. Experiment with these backends to find the best balance of memory, speed, and quality for your needs.
Member

Yeah, would be great!

* **For Using Existing GGUF Models:** Use GGUF loading (`from_single_file`).
* **Curious about training with quantization?** Stay tuned for a follow-up blog post on that topic!

Quantization significantly lowers the barrier to entry for using large diffusion models. Experiment with these backends to find the best balance of memory, speed, and quality for your needs.
Member

Let's make sure to acknowledge @ChunTeLee for providing a nice thumbnail.

Comment on lines 38 to 50
Before diving into the quantization backends, let's introduce the FluxPipeline and its components, which we'll be quantizing:

* **Text Encoders (CLIP and T5):**
* **Function:** Process input text prompts. FLUX-dev uses CLIP for initial understanding and a larger T5 for nuanced comprehension and better text rendering.
* **Memory:** T5 - 9.52 GB; CLIP - 246 MB (in BF16)
* **Transformer (Main Model - MMDiT):**
* **Function:** Core generative part (Multimodal Diffusion Transformer). Generates images in latent space from text embeddings.
* **Memory:** 23.8 GB (in BF16)
* **Variational Auto-Encoder (VAE):**
* **Function:** Translates images between pixel and latent space. Decodes generated latent representation to a pixel-based image.
* **Memory:** 168 MB (in BF16)
* **Focus of Quantization:** Examples will primarily focus on the `transformer` and `text_encoder_2` (T5) for the most substantial memory savings.

Member

Nice! Can you add a link to https://huggingface.co/black-forest-labs/FLUX.1-dev? You can also specify the memory needed to load the model.
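
One way to estimate those per-component numbers is to sum the parameter sizes, e.g. with this rough sketch (buffers and activation memory are not counted):

```python
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)

# Sum parameter sizes to estimate how much memory each component needs in BF16.
for name, component in pipe.components.items():
    if isinstance(component, torch.nn.Module):
        size_gb = sum(p.numel() * p.element_size() for p in component.parameters()) / 1024**3
        print(f"{name}: {size_gb:.2f} GB")
```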

Comment on lines 69 to 72
from diffusers import AutoModel, FluxPipeline
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
from diffusers.quantizers import PipelineQuantizationConfig
from transformers import T5EncoderModel
@SunMarc (Member) May 20, 2025

Clean up some of the imports since we don't use them; same for the other snippets.
