
Conversation

@molbap (Contributor) commented Oct 8, 2025

What does this PR do?

This updates the vision contribution guide to help external contributors get their work merged faster and to reduce the load on maintainers by providing a checklist. It is a very minimal checklist that will need to be complemented with model-specific implementation guides following the new API.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@zucchini-nlp (Member) left a comment

Nice, my top pain points when reviewing model PRs are below. Feel free to re-use them if you think they're worth adding to the doc 😄

  • Do not add code paths with if/else when the released weights support only one path. Code paths used for experiments are not needed and only bloat the code in the end.
  • Try to use @auto_docstring and @check_model_inputs.
  • The model structure has to follow the standards of other VLMs, with correct class names. Some users add a vision backbone under its original name, e.g. RiceModel used in llava-onevision-1.5. IMO we can either add a separate vision model and call it RiceModel, or add it in the llava model file and call it LlavaOnevision1_5VisionModel.
  • Users add many assertions; some are basic shape checks and not needed, while others are valid but lack an actionable error message. Let's nudge contributors to add only valid checks with good error messages.
  • Users name model inputs x, y, z; we can nudge them toward proper naming conventions such as pixel_values and hidden_states. This happens especially in audio backbones, apparently copied from the original source code.
  • Always go over the PR again and look for commented-out code or leftover TODO/FIXME comments.
  • Do not use _checkpoint_conversion_mapping; convert the checkpoints and release them on the Hub if needed.
  • Vision-language models should have get_image_features() and get_placeholder_mask() methods (a rough sketch follows this list).
  • When copying tests from similar models, delete all skipped tests and check whether each skip applies to your model as well. Blind test skipping is not good.
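
For illustration, a minimal sketch of the expected `get_image_features()` / `get_placeholder_mask()` pattern, with an actionable error message instead of a bare assertion (all `TinyVlm` names are hypothetical placeholders, not a real transformers model):

```python
# Hypothetical sketch: vision_tower, multi_modal_projector, and config are
# assumed to be set in __init__ of the real model.
import torch
from torch import nn


class TinyVlmForConditionalGeneration(nn.Module):
    def get_image_features(self, pixel_values: torch.Tensor) -> torch.Tensor:
        # Encode the images and project them to the language model hidden size.
        image_outputs = self.vision_tower(pixel_values)
        return self.multi_modal_projector(image_outputs.last_hidden_state)

    def get_placeholder_mask(self, input_ids: torch.Tensor) -> torch.Tensor:
        # Boolean mask over the image placeholder tokens in the prompt.
        return input_ids == self.config.image_token_id

    def _merge_image_features(self, inputs_embeds, image_features, placeholder_mask):
        n_placeholders = placeholder_mask.sum().item()
        n_features = image_features.shape[0] * image_features.shape[1]
        if n_placeholders != n_features:
            # A valid check with an actionable message, unlike `assert a == b`.
            raise ValueError(
                f"Got {n_placeholders} image placeholder tokens but {n_features} "
                "image features. Check that the processor inserts one placeholder "
                "token per image patch."
            )
        inputs_embeds[placeholder_mask] = image_features.flatten(0, 1)
        return inputs_embeds
```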

CONTRIBUTING.md Outdated

### Vision-Language Model Contribution Checklist

If you're contributing a **vision-language model** (or any multimodal model that processes images), please follow this checklist. Maintainers will use this to review your PR, and completing these steps will significantly increase the likelihood of your PR being merged quickly.
A Member commented:

maybe "or any multimodal model that processes images/videos" or simply vision objects?

All new models should use the modular architecture pattern. Create a `modular_<model_name>.py` file using the modular model converter:

- Use `transformers-cli add-new-model-like` to generate a modular skeleton and get started
- All code should be in the modular file if possible. The modeling code must be in it; it's better if the configuration is in it as well.
A Member commented:

"it's better if configuration and processing is in it"

CONTRIBUTING.md Outdated
Comment on lines 155 to 157
**4. Add integration tests with exact output matching**

At minimum, add an `IntegrationTest` class that tests end-to-end generation with **exact** output matching:
A Member commented:

can we nudge contributors to add tests for all components that are not 100% copies? In other words, if a model has its own processing, then we also need processor tests, etc. I've seen several times that people add only model tests and then we find issues in processing...
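
For illustration, a sketch of an integration test with exact output matching plus a dedicated processor test (the `TinyVlm` names, checkpoint, and expected string are placeholders; the expected output must come from running the original implementation):

```python
# Hypothetical sketch: integration test with exact output matching, plus a
# processor test since the model has its own processing.
import unittest

import requests
import torch
from PIL import Image

from transformers import AutoModelForImageTextToText, AutoProcessor
from transformers.testing_utils import require_torch, slow

CHECKPOINT = "org/tinyvlm"  # placeholder checkpoint name
IMAGE_URL = "http://images.cocodataset.org/val2017/000000039769.jpg"


@require_torch
@slow
class TinyVlmIntegrationTest(unittest.TestCase):
    def test_generation_exact_match(self):
        model = AutoModelForImageTextToText.from_pretrained(
            CHECKPOINT, torch_dtype=torch.float16, device_map="auto"
        )
        processor = AutoProcessor.from_pretrained(CHECKPOINT)
        image = Image.open(requests.get(IMAGE_URL, stream=True).raw)
        inputs = processor(
            images=image, text="<image> What is shown?", return_tensors="pt"
        ).to(model.device)
        output = model.generate(**inputs, max_new_tokens=20, do_sample=False)
        # Expected text obtained by running the original implementation.
        EXPECTED_TEXT = "Two cats lying on a pink couch."  # placeholder
        self.assertEqual(
            processor.decode(output[0], skip_special_tokens=True), EXPECTED_TEXT
        )


class TinyVlmProcessorTest(unittest.TestCase):
    def test_image_token_expansion(self):
        processor = AutoProcessor.from_pretrained(CHECKPOINT)
        image = Image.open(requests.get(IMAGE_URL, stream=True).raw)
        inputs = processor(images=image, text="<image> Describe.", return_tensors="pt")
        # The processor should insert at least one placeholder id per image.
        image_token_id = processor.tokenizer.convert_tokens_to_ids("<image>")
        self.assertGreater((inputs["input_ids"] == image_token_id).sum(), 0)
```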

CONTRIBUTING.md Outdated

- For generative models: test that generated text matches expected output exactly
- For non-generative models: test that output logits match expected values
- Tests should use real checkpoints and inputs
A Member commented:

to clarify, maybe: checkpoints should fit in our runners, so if the model is huge, it needs to be loaded in 4-bit or in half precision, etc. Using small image sizes also helps
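
For illustration, a sketch of loading a large checkpoint within runner limits (the checkpoint name is a placeholder):

```python
# Sketch: keep a huge checkpoint within CI runner limits by loading it in
# 4-bit, or in half precision.
import torch
from transformers import AutoModelForImageTextToText, BitsAndBytesConfig

model = AutoModelForImageTextToText.from_pretrained(
    "org/big-vlm",  # placeholder checkpoint name
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16
    ),
    device_map="auto",
)

# Half precision alone is often enough:
# model = AutoModelForImageTextToText.from_pretrained(
#     "org/big-vlm", torch_dtype=torch.float16, device_map="auto"
# )
# Small input images (e.g. 64x64) also keep memory and runtime down.
```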

@yonigozlan (Member) left a comment

One small comment, but I'm very eager to have this merged in general; it should simplify first reviews quite a bit 🤗

Comment on lines 192 to 195
- Search for similar models (e.g., other vision-language models)
- Reuse attention mechanisms, layer implementations, and processing patterns
- Check models like LLaVA, Idefics2, Fuyu for vision-language patterns
- Don't reinvent the wheel
A Member commented:

I would add auto_docstring, can_return_tuple, check_model_inputs, and _can_record_outputs. I think the more precise we can be, the better! I see a lot of confusion about the use of these decorators, and about how it's no longer necessary to handle output_attentions, output_hidden_states, etc. by hand.
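
For illustration, a sketch of how these decorators are typically wired up (`TinyVlm` classes are hypothetical; import paths reflect recent transformers versions, so confirm against an existing model file):

```python
# Hypothetical sketch of the decorator pattern. With check_model_inputs and
# _can_record_outputs, there is no need to thread output_attentions /
# output_hidden_states flags through forward() manually. @can_return_tuple
# plays a related role on forward methods that return a ModelOutput.
from torch import nn
from transformers import PreTrainedModel
from transformers.utils import auto_docstring
from transformers.utils.generic import check_model_inputs


class TinyVlmAttention(nn.Module): ...     # hypothetical stub
class TinyVlmDecoderLayer(nn.Module): ...  # hypothetical stub


class TinyVlmModel(PreTrainedModel):  # hypothetical model
    _can_record_outputs = {
        "hidden_states": TinyVlmDecoderLayer,  # record each layer's output
        "attentions": TinyVlmAttention,        # record attention weights
    }

    @check_model_inputs
    @auto_docstring
    def forward(self, input_ids=None, pixel_values=None, **kwargs):
        ...
```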


All checks must pass before your PR can be merged.

**If this checklist is complete, your PR has a very high likelihood of being merged!** Following these steps makes the maintainers' work much easier and will reduce the number of review iterations, getting your important work out there faster.
A Member commented:
🤗

@ariG23498 (Contributor) left a comment

Adding links to scripts might make this more useful for readers.

@ariG23498 (Contributor) left a comment

I have added the reviews from @zucchini-nlp and @yonigozlan as suggestions, for the convenience of @molbap.

Hope this helps!

molbap and others added 7 commits October 20, 2025 16:17
Co-authored-by: Aritra Roy Gosthipaty <aritra.born2fly@gmail.com>
@molbap molbap merged commit 9aab965 into main Oct 20, 2025
15 checks passed
@molbap molbap deleted the contributing_vision branch October 20, 2025 16:56
