Created FAQ page first draft #1896
base: main
Conversation
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review. Note: This is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.
Summary of Changes: Hello @cajeonrh, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces a new Frequently Asked Questions (FAQ) page within the "Getting Started" section of the documentation. The primary goal is to centralize answers to common user queries about LLM Compressor, thereby enhancing user self-service and clarity on topics such as model performance post-compression, integration with other tools like sglang, and practical guidance on compression strategies and memory requirements.
Code Review
This pull request adds a new FAQ page, which is a great addition to the documentation. The content is relevant and covers important user questions. I've identified a few areas for improvement, mainly related to Markdown link formatting, content clarity, and consistency. There are several instances of incorrect link syntax that need to be fixed across the document. I've also suggested consolidating a couple of redundant questions and using relative paths for internal links to improve maintainability.
Force-pushed from 21cb4f0 to 34edf9d
Thanks Cassie! Added a couple of suggestions below.
Also, some of your links are incorrectly formatted. They should be `[link text](link url)`.
Thanks Fynn! I've incorporated your feedback.
Looks good! One comment to add a note on multimodal models for question 5.
Could we add a quick question on installation: vLLM and llmcompressor should be used in separate environments as they may have dependency mismatches?
The other common question we get asked is about multi-GPU support.
Can we add the following?
- LLM Compressor handles all GPU movement for you.
- For data-free pathways, we leverage all available GPUs and offload anything that doesn't fit onto the allocated GPUs. If using pathways that require data, we sequentially onload model layers onto a single GPU. This is the case for LLM Compressor 0.6-0.8.
I've incorporated feedback, added more questions, and also added a FAQ box on the Getting Started page. Please let me know if I missed anything.
Looks great! Thanks for making those changes!
Looks like you need to fix DCO though. There are some instructions here: https://github.com/vllm-project/llm-compressor/pull/1896/checks?check_run_id=52066401360.
I'm really not a fan of using casual pronouns like "us", "we", "my". This may sound pedantic, but speaking from personal experience contributing to other OSS repos, words like "we" have the effect of alienating open source contributors. LLM Compressor is owned by everyone; the Red Hat / LLM Compressor team helps to maintain and shepherd it.
We should add a section titled "Where can I learn more about LLM Compressor?" which links to talks we've given.
- https://www.youtube.com/watch?v=caLYSZMVQ1c
- https://www.youtube.com/watch?v=GrhuqQDmBk8
- https://www.youtube.com/watch?v=WVenRmF4dPY
- https://www.youtube.com/watch?v=G1WNlLxPLSE
**7. Does LLM Compressor have multi-GPU support?**

LLM Compressor handles all GPU movement for you. For data-free pathways, we leverage all available GPUs and offload anything that doesn't fit onto the allocated GPUs. If you are using pathways that require data, we sequentially onload model layers onto a single GPU. This is the case for LLM Compressor 0.6-0.8.
We essentially do not have multi-GPU support right now.
`dispatch_for_generation` is used for data-free at the moment?
Is this FAQ still relevant then? Should I remove it or does it need to be changed?
Still relevant
**7. Does LLM Compressor support compressing large models?**

LLM Compressor enables the compression of large models via sequential onloading, whereby layers of the model are jointly onloaded to a single GPU, optimized, then offloaded back to the CPU. While this is the default data pipeline for most cases, you can override this setting for slightly better runtime by using `oneshot(..., pipeline="basic")` if your model can fit across all GPU resources. Multi-GPU parallel optimization is currently in development and being tracked here.
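For illustration, a minimal sketch of that override, assuming the `pipeline` keyword behaves as described in this thread; the model, dataset, and recipe below are placeholders rather than part of this PR:

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

# Placeholder recipe: INT4 weight-only quantization of Linear layers, skipping lm_head.
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

oneshot(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # placeholder model
    dataset="open_platypus",                     # calibration data for the data-dependent pathway
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
    pipeline="basic",  # skip sequential onloading when the model fits across available GPUs
)
```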
I think we need to add a point re: data-free pathways for FP8 quantization @kylesayrs
Also, we need to keep the original question, as most users ask about multi-GPU support specifically / are confused when they only see one active GPU.
Hm, maybe something like this:

**7. Does LLM Compressor have multi-GPU support?**
LLM Compressor enables the compression of large models via sequential onloading, whereby layers of the model are jointly onloaded to a single GPU, optimized, then offloaded back to the CPU. This is why, in most cases, only one GPU is used at a time.
In cases where no calibration data is needed, the model is dispatched to all GPUs, although only one GPU is used at a time for compression.
Multi-GPU parallel optimization is currently in development and being tracked here.
Signed-off-by: Cassie Jeon <cajeon@redhat.com>
Force-pushed from 286d259 to 8108e51
@kylesayrs
@cajeonrh That's fine, we can table the discussion for now.
LGTM, one question on links but we can revisit in a follow-up
**5. What layers should be quantized?**

Typically, all linear layers are quantized except the `lm_head` layer. This is because the `lm_head` layer is the last layer of the model and is sensitive to quantization, which would impact the model's accuracy. For example, [this code snippet shows how to ignore the lm_head layer](https://github.com/vllm-project/llm-compressor/blob/main/examples/quantization_w8a8_fp8/llama3_example.py#L18).
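For reference, a condensed sketch along the lines of the linked FP8 example; the model ID and save path are illustrative:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"  # illustrative model
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Quantize all Linear layers to FP8 (dynamic activations), but leave the
# sensitive lm_head layer untouched via the ignore list.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

oneshot(model=model, recipe=recipe)

SAVE_DIR = MODEL_ID.split("/")[-1] + "-FP8-Dynamic"
model.save_pretrained(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)
```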
This link and the one below might become stale if those files are ever changed, since they point to a file on main that can be changed. Should we use a tagged version instead?
SUMMARY:
Created a FAQ page under the "Getting Started" section
TEST PLAN:
Requesting review of content