Created FAQ page first draft #1896
base: main
Conversation
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review. Note: This is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.
Summary of Changes: Hello @cajeonrh, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces a new Frequently Asked Questions (FAQ) page within the "Getting Started" section of the documentation. The primary goal is to centralize answers to common user queries about LLM Compressor, thereby enhancing user self-service and clarity on topics such as model performance post-compression, integration with other tools like sglang, and practical guidance on compression strategies and memory requirements.
Code Review
This pull request adds a new FAQ page, which is a great addition to the documentation. The content is relevant and covers important user questions. I've identified a few areas for improvement, mainly related to Markdown link formatting, content clarity, and consistency. There are several instances of incorrect link syntax that need to be fixed across the document. I've also suggested consolidating a couple of redundant questions and using relative paths for internal links to improve maintainability.
Force-pushed from 21cb4f0 to 34edf9d
Thanks Cassie! Added a couple of suggestions below.
Also, some of your links are incorrectly formatted. They should be `[link text](link url)`.
Thanks Fynn! I've incorporated your feedback.
Looks good! One comment to add a note on multimodal models for question 5.
Could we add a quick question on installation: vLLM and llmcompressor should be used in separate environments as they may have dependency mismatches?
The other common question we get asked is about multi-GPU support.
Can we add the following?
- LLM Compressor handles all GPU movement for you.
- For data-free pathways, we leverage all available GPUs and offload anything that doesn't fit onto the allocated GPUs. If using pathways that require data, we sequentially onload model layers onto a single GPU. This is the case for LLM Compressor 0.6-0.8.
I've incorporated feedback, added more questions, and also added a FAQ box on the Getting Started page. Please let me know if I missed anything.
Looks great! Thanks for making those changes!
Looks like you need to fix DCO though. There are some instructions here: https://github.com/vllm-project/llm-compressor/pull/1896/checks?check_run_id=52066401360.
I'm really not a fan of using casual pronouns like "us", "we", "my". This may sound pedantic, but speaking from personal experience contributing to other OSS repos, words like "we" have the effect of alienating open source contributors. LLM Compressor is owned by everyone; the Red Hat / LLM Compressor team helps to maintain and shepherd it.
We should add a section titled "Where can I learn more about LLM Compressor?" which links to talks we've given.
- https://www.youtube.com/watch?v=caLYSZMVQ1c
- https://www.youtube.com/watch?v=GrhuqQDmBk8
- https://www.youtube.com/watch?v=WVenRmF4dPY
- https://www.youtube.com/watch?v=G1WNlLxPLSE
**7. Does LLM Compressor have multi-GPU support?**

LLM Compressor handles all GPU movement for you. For data-free pathways, we leverage all available GPUs and offload anything that doesn't fit onto the allocated GPUs. If you are using pathways that require data, we sequentially onload model layers onto a single GPU. This is the case for LLM Compressor 0.6-0.8.
We essentially do not have multi-GPU support right now.
`dispatch_for_generation` is used for data-free at the moment?
Is this FAQ still relevant then? Should I remove it or does it need to be changed?
Still relevant
**7. Does LLM Compressor support compressing large models?**

LLM Compressor enables the compression of large models via sequential onloading, whereby layers of the model are jointly onloaded to a single GPU, optimized, then offloaded back to the CPU. While this is the default data pipeline for most cases, you can override this setting for slightly better runtime by using `oneshot(..., pipeline="basic")` if your model can fit across all GPU resources. Multi-GPU parallel optimization is currently in development and being tracked here.
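For illustration, a minimal sketch of that override, assuming the `pipeline` keyword behaves as described in this thread; the model, dataset, and recipe below are placeholders rather than part of this PR:

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

# Placeholder recipe: INT4 weight-only quantization of Linear layers, skipping lm_head.
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

oneshot(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # placeholder model
    dataset="open_platypus",                     # calibration data for the data-dependent pathway
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
    pipeline="basic",  # skip sequential onloading when the model fits across available GPUs
)
```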
I think we need to add a point re: data-free pathways for FP8 quantization @kylesayrs
Also, we need to keep the original question, as most users ask about multi-GPU support specifically / are confused when they only see one active GPU.
Hm, maybe something like this:

**7. Does LLM Compressor have multi-GPU support?**
LLM Compressor enables the compression of large models via sequential onloading, whereby layers of the model are jointly onloaded to a single GPU, optimized, then offloaded back to the CPU. This is why, in most cases, only one GPU is used at a time.
In cases where no calibration data is needed, the model is dispatched to all GPUs, although only one GPU is used at a time for compression.
Multi-GPU parallel optimization is currently in development and being tracked here.
Signed-off-by: Cassie Jeon <cajeon@redhat.com>
Force-pushed from 286d259 to 8108e51
@kylesayrs
@cajeonrh That's fine, we can table the discussion for now.
LGTM, one question on links but we can revisit in a follow-up
**5. What layers should be quantized?**

Typically, all linear layers are quantized except the `lm_head` layer. This is because the `lm_head` layer is the last layer of the model and is sensitive to quantization, which would impact the model's accuracy. For example, [this code snippet shows how to ignore the lm_head layer](https://github.com/vllm-project/llm-compressor/blob/main/examples/quantization_w8a8_fp8/llama3_example.py#L18).
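For reference, a condensed sketch along the lines of the linked FP8 example; the model ID and save path are illustrative:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"  # illustrative model
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Quantize all Linear layers to FP8 (dynamic activations), but leave the
# sensitive lm_head layer untouched via the ignore list.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

oneshot(model=model, recipe=recipe)

SAVE_DIR = MODEL_ID.split("/")[-1] + "-FP8-Dynamic"
model.save_pretrained(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)
```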
This link and the one below might become stale if those files are ever changed, since they point to a file on main that can be changed. Should we use a tagged version instead?
SUMMARY:
Created a FAQ page under the "Getting Started" section
TEST PLAN:
Requesting review of content