Add vision contribution guide #41456
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Nice, my top pain points when reviewing model PRs are below. Feel free to reuse them if you think they're worth adding to the doc 😄
- Do not add code paths with if/else when the released weights support only one path. Code paths used only for experiments are not needed and bloat the code in the end.
- Try to use `@auto_docstring` and `@check_model_inputs`.
- The model structure has to follow the standards of other VLMs, with correct class names. Some users add a vision backbone under its original name, e.g. `RiceModel` used in llava-onevision-1.5. IMO we can either add a separate vision model and call it `RiceModel`, or we add it in the llava model file and call it `LlavaOnevision1_5VisionModel`.
- Users add many assertions; some are basic shape checks and not needed, while some are valid but lack an actionable error message. Let's nudge contributors to add only valid checks with good error messages.
- Users name model inputs `x, y, z`; we can nudge for better naming conventions such as `pixel_values` and `hidden_states`. This happens especially in audio backbones, apparently copied from the original source code.
- Always go over the PR again, looking for commented-out code or left-over TODO/FIXME comments.
- Do not use `_checkpoint_conversion_mapping`; convert checkpoints and release them on the Hub if needed.
- Vision-language models should have `get_image_features()` and `get_placeholder_mask()` methods.
- When copying tests from similar models, delete all skipped tests and check whether each skip applies to your model as well. Blind test skipping is not good.
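The point about assertions can be illustrated with a small sketch. This is a hypothetical helper (the function name and expected shapes are made up for illustration, not taken from any real model): a bare `assert` gives the user nothing to act on, while a `ValueError` with context does.

```python
def validate_pixel_values(pixel_values, expected_channels=3):
    """Sketch of an input check with an actionable error message.

    A bare `assert pixel_values.ndim == 4` tells the user nothing;
    the ValueError messages below say what was received and what to fix.
    Works with any array-like object exposing .ndim and .shape.
    """
    if pixel_values.ndim != 4:
        raise ValueError(
            f"pixel_values should be a 4D tensor of shape "
            f"(batch_size, channels, height, width), but got "
            f"{pixel_values.ndim} dimensions. Did you forget to add a "
            f"batch dimension or to run the image processor?"
        )
    if pixel_values.shape[1] != expected_channels:
        raise ValueError(
            f"pixel_values should have {expected_channels} channels, got "
            f"{pixel_values.shape[1]}. Check the processor configuration."
        )
```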
CONTRIBUTING.md
Outdated
### Vision-Language Model Contribution Checklist

If you're contributing a **vision-language model** (or any multimodal model that processes images), please follow this checklist. Maintainers will use this to review your PR, and completing these steps will significantly increase the likelihood of your PR being merged quickly.
maybe "or any multimodal model that processes images/videos", or simply vision objects?
All new models should use the modular architecture pattern. Create a `modular_<model_name>.py` file using the modular model converter:

- Use `transformers-cli add-new-model-like` to generate a modular skeleton and get started
- All code should be in the modular file if possible. The modeling code must be in it; it's better if the configuration is in it as well.
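The core idea of the modular pattern can be sketched in plain Python, with hypothetical class names (this is an analogy, not actual converter output or real transformers classes): the new model inherits an existing, reviewed implementation and overrides only what genuinely differs.

```python
class LlavaLikeVisionModel:
    """Stands in for an existing, battle-tested vision backbone."""

    def embed(self, pixel_values):
        # Placeholder projection standing in for real embedding logic.
        return [v * 0.5 for v in pixel_values]

    def forward(self, pixel_values):
        return self.embed(pixel_values)


class MyNewVisionModel(LlavaLikeVisionModel):
    """The 'modular' model: only the differing method is overridden.

    Everything not redefined here (forward, ...) is inherited as-is,
    which is what keeps modular files short and reviewable.
    """

    def embed(self, pixel_values):
        # Hypothetical difference: an extra bias on the projection.
        return [v * 0.5 + 1.0 for v in pixel_values]
```

In a real modular file the converter expands this inheritance into a standalone `modeling_<model_name>.py`, so reviewers only need to read the diff against the parent model.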
"it's better if configuration and processing is in it"
CONTRIBUTING.md
Outdated
☐ **4. Add integration tests with exact output matching**

At minimum, add an `IntegrationTest` class that tests end-to-end generation with **exact** output matching:
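The exact-match style can be sketched with `unittest`. The `fake_generate` function below is a deterministic stand-in for a real `model.generate` + decode call, and `EXPECTED_TEXT` is hypothetical; in a real PR the test would load a real checkpoint and real inputs, but the strict equality is the point.

```python
import unittest


def fake_generate(prompt):
    """Stand-in for model.generate + tokenizer.decode (deterministic)."""
    return prompt + " a red stop sign"


class MyModelIntegrationTest(unittest.TestCase):
    """Sketch of exact-match testing: compare full strings, not substrings."""

    def test_generation_exact_match(self):
        EXPECTED_TEXT = "The image shows a red stop sign"
        generated = fake_generate("The image shows")
        # Exact equality, so any regression in decoding is caught.
        self.assertEqual(generated, EXPECTED_TEXT)
```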
can we nudge people to add tests for all components which are not 100% copies? In other words, if the model has its own processing, then we also need processor tests, etc. I've seen several times people adding only model tests, and then we find issues in the processing...
CONTRIBUTING.md
Outdated
- For generative models: test that generated text matches the expected output exactly
- For non-generative models: test that output logits match expected values
- Tests should use real checkpoints and inputs
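The runner-size concern from the review can be made concrete with rough footprint arithmetic, assuming a hypothetical 7B-parameter model and counting only the weights (activations and overhead ignored):

```python
def weight_footprint_gb(num_params, bits_per_param):
    """Approximate memory needed just to hold the model weights."""
    return num_params * bits_per_param / 8 / 1024**3


params = 7_000_000_000           # hypothetical 7B-parameter model
fp32 = weight_footprint_gb(params, 32)  # ~26 GB: too big for most CI runners
fp16 = weight_footprint_gb(params, 16)  # ~13 GB: half precision halves it
int4 = weight_footprint_gb(params, 4)   # ~3.3 GB: 4-bit fits comfortably
```

This is why large checkpoints in integration tests are typically loaded in half precision or 4-bit, and why small input image sizes keep activation memory down as well.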
to clarify, maybe: checkpoints should fit on our runners, so if the model size is huge, they need to be loaded in 4-bit or in half precision, etc. Using small image sizes also helps.
One small comment, but very eager to have this merged in general; this should simplify first reviews quite a bit 🤗
- Search for similar models (e.g., other vision-language models)
- Reuse attention mechanisms, layer implementations, and processing patterns
- Check models like LLaVA, Idefics2, and Fuyu for vision-language patterns
- Don't reinvent the wheel
I would add `auto_docstring`, `can_return_tuple`, `check_model_inputs`, and `_can_record_outputs`. I think the more precise we can be, the better! I see a lot of confusion around the use of the decorators, and around how it's not necessary to add `output_attentions`, `output_hidden_states`, etc.
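The general shape of an input-checking decorator can be sketched generically. This is a toy analogy, NOT the real transformers `check_model_inputs` implementation; the class and argument names are hypothetical:

```python
import functools


def check_inputs(required):
    """Toy analogy of an input-checking decorator (not the real
    transformers one): verify required kwargs before forward runs,
    failing with an actionable message instead of a deep stack trace."""

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(self, **kwargs):
            missing = [name for name in required if kwargs.get(name) is None]
            if missing:
                raise ValueError(
                    f"{fn.__qualname__} is missing required inputs: "
                    f"{missing}. Did the processor produce them?"
                )
            return fn(self, **kwargs)

        return wrapper

    return decorator


class ToyVisionModel:
    @check_inputs(["pixel_values"])
    def forward(self, **kwargs):
        return len(kwargs["pixel_values"])
```

Centralizing such checks in a decorator is what lets individual `forward` methods stay free of repeated boilerplate validation.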
All checks must pass before your PR can be merged.

**If this checklist is complete, your PR has a very high likelihood of being merged!** Following these steps makes the maintainers' work much easier and will reduce the number of review iterations, getting your important work out there faster.
🤗
Adding links to scripts might make it more useful for the readers.
I have added the reviews from @zucchini-nlp and @yonigozlan as suggestions for the ease of @molbap. Hope this helps!
Co-authored-by: Aritra Roy Gosthipaty <aritra.born2fly@gmail.com>
What does this PR do?
This updates the vision contribution guide to help external contributors merge their work faster and to reduce the load on maintainers with a checklist. It is a very minimal checklist that will need to be complemented by specific model implementation guides following the new API.