
Conversation

@molbap (Contributor) commented Oct 8, 2025

What does this PR do?

This updates the vision contribution guide to help external contributors get their work merged faster and to reduce the load on maintainers by providing a checklist. It is a very minimal checklist that will need to be complemented with model-specific implementation guides following the new API.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@zucchini-nlp (Member) left a comment

Nice, my top pain points when reviewing model PRs are below. Feel free to re-use them if you think they're worth adding to the doc 😄

  • Do not add code paths with if/else when the released weights support only one path. Code paths used for experiments are not needed and only bloat the code in the end.
  • Try to use @auto_docstring and @check_model_inputs.
  • The model structure has to follow the standards of other VLMs, with correct class names. Some users add a vision backbone under its original name, e.g. RiceModel used in llava-onevision-1.5. IMO we can either add a separate vision model and call it RiceModel, or add it in the llava model file and call it LlavaOnevision1_5VisionModel.
  • Users add many assertions; some are basic shape checks and not needed, while others are valid but lack an actionable error message. Let's nudge contributors to add only valid checks with good error messages.
  • Users name model inputs x, y, z; we can nudge them toward proper naming conventions such as pixel_values and hidden_states. This happens especially in audio backbones, apparently copied from the original source code.
  • Always go over the PR again and look for commented-out code or leftover TODO/FIXME comments.
  • Do not use _checkpoint_conversion_mapping; convert the checkpoints and release them on the Hub if needed.
  • Vision-language models should have get_image_features() and get_placeholder_mask() methods (a rough sketch follows this list).
  • When copying tests from similar models, delete all skipped tests and check whether each skip applies to your model as well. Blind test skipping is not good.
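
For illustration, a minimal sketch of the expected `get_image_features()` / `get_placeholder_mask()` pattern, with an actionable error message instead of a bare assertion (all `TinyVlm` names are hypothetical placeholders, not a real transformers model):

```python
# Hypothetical sketch: vision_tower, multi_modal_projector, and config are
# assumed to be set in __init__ of the real model.
import torch
from torch import nn


class TinyVlmForConditionalGeneration(nn.Module):
    def get_image_features(self, pixel_values: torch.Tensor) -> torch.Tensor:
        # Encode the images and project them to the language model hidden size.
        image_outputs = self.vision_tower(pixel_values)
        return self.multi_modal_projector(image_outputs.last_hidden_state)

    def get_placeholder_mask(self, input_ids: torch.Tensor) -> torch.Tensor:
        # Boolean mask over the image placeholder tokens in the prompt.
        return input_ids == self.config.image_token_id

    def _merge_image_features(self, inputs_embeds, image_features, placeholder_mask):
        n_placeholders = placeholder_mask.sum().item()
        n_features = image_features.shape[0] * image_features.shape[1]
        if n_placeholders != n_features:
            # A valid check with an actionable message, unlike `assert a == b`.
            raise ValueError(
                f"Got {n_placeholders} image placeholder tokens but {n_features} "
                "image features. Check that the processor inserts one placeholder "
                "token per image patch."
            )
        inputs_embeds[placeholder_mask] = image_features.flatten(0, 1)
        return inputs_embeds
```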

CONTRIBUTING.md Outdated

### Vision-Language Model Contribution Checklist

If you're contributing a **vision-language model** (or any multimodal model that processes images), please follow this checklist. Maintainers will use this to review your PR, and completing these steps will significantly increase the likelihood of your PR being merged quickly.
A Member commented:

maybe "or any multimodal model that processes images/videos" or simply vision objects?

All new models should use the modular architecture pattern. Create a `modular_<model_name>.py` file using the modular model converter:

- Use `transformers-cli add-new-model-like` to generate a modular skeleton and get started
- All code should be in the modular file if possible. The modeling code must be in it; it's better if the configuration is in it as well.
A Member commented:

"it's better if configuration and processing is in it"

CONTRIBUTING.md Outdated
Comment on lines 155 to 157
**4. Add integration tests with exact output matching**

At minimum, add an `IntegrationTest` class that tests end-to-end generation with **exact** output matching:
A Member commented:

can we nudge contributors to add tests for all components that are not 100% copies? In other words, if a model has its own processing, then we also need processor tests, etc. I've seen several times that people add only model tests and then we find issues in processing...
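
For illustration, a sketch of an integration test with exact output matching plus a dedicated processor test (the `TinyVlm` names, checkpoint, and expected string are placeholders; the expected output must come from running the original implementation):

```python
# Hypothetical sketch: integration test with exact output matching, plus a
# processor test since the model has its own processing.
import unittest

import requests
import torch
from PIL import Image

from transformers import AutoModelForImageTextToText, AutoProcessor
from transformers.testing_utils import require_torch, slow

CHECKPOINT = "org/tinyvlm"  # placeholder checkpoint name
IMAGE_URL = "http://images.cocodataset.org/val2017/000000039769.jpg"


@require_torch
@slow
class TinyVlmIntegrationTest(unittest.TestCase):
    def test_generation_exact_match(self):
        model = AutoModelForImageTextToText.from_pretrained(
            CHECKPOINT, torch_dtype=torch.float16, device_map="auto"
        )
        processor = AutoProcessor.from_pretrained(CHECKPOINT)
        image = Image.open(requests.get(IMAGE_URL, stream=True).raw)
        inputs = processor(
            images=image, text="<image> What is shown?", return_tensors="pt"
        ).to(model.device)
        output = model.generate(**inputs, max_new_tokens=20, do_sample=False)
        # Expected text obtained by running the original implementation.
        EXPECTED_TEXT = "Two cats lying on a pink couch."  # placeholder
        self.assertEqual(
            processor.decode(output[0], skip_special_tokens=True), EXPECTED_TEXT
        )


class TinyVlmProcessorTest(unittest.TestCase):
    def test_image_token_expansion(self):
        processor = AutoProcessor.from_pretrained(CHECKPOINT)
        image = Image.open(requests.get(IMAGE_URL, stream=True).raw)
        inputs = processor(images=image, text="<image> Describe.", return_tensors="pt")
        # The processor should insert at least one placeholder id per image.
        image_token_id = processor.tokenizer.convert_tokens_to_ids("<image>")
        self.assertGreater((inputs["input_ids"] == image_token_id).sum(), 0)
```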

CONTRIBUTING.md Outdated

- For generative models: test that generated text matches expected output exactly
- For non-generative models: test that output logits match expected values
- Tests should use real checkpoints and inputs
A Member commented:

to clarify, maybe: checkpoints should fit in our runners, so if the model is huge, it needs to be loaded in 4-bit or in half precision, etc. Using small image sizes also helps
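
For illustration, a sketch of loading a large checkpoint within runner limits (the checkpoint name is a placeholder):

```python
# Sketch: keep a huge checkpoint within CI runner limits by loading it in
# 4-bit, or in half precision.
import torch
from transformers import AutoModelForImageTextToText, BitsAndBytesConfig

model = AutoModelForImageTextToText.from_pretrained(
    "org/big-vlm",  # placeholder checkpoint name
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16
    ),
    device_map="auto",
)

# Half precision alone is often enough:
# model = AutoModelForImageTextToText.from_pretrained(
#     "org/big-vlm", torch_dtype=torch.float16, device_map="auto"
# )
# Small input images (e.g. 64x64) also keep memory and runtime down.
```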

@yonigozlan (Member) left a comment

One small comment, but I'm very eager to have this merged in general; it should simplify first reviews quite a bit 🤗

Comment on lines 192 to 195
- Search for similar models (e.g., other vision-language models)
- Reuse attention mechanisms, layer implementations, and processing patterns
- Check models like LLaVA, Idefics2, Fuyu for vision-language patterns
- Don't reinvent the wheel
A Member commented:

I would add auto_docstring, can_return_tuple, check_model_inputs, and _can_record_outputs. I think the more precise we can be, the better! I see a lot of confusion about the use of these decorators, and about how it's no longer necessary to handle output_attentions, output_hidden_states, etc. by hand.
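
For illustration, a sketch of how these decorators are typically wired up (`TinyVlm` classes are hypothetical; import paths reflect recent transformers versions, so confirm against an existing model file):

```python
# Hypothetical sketch of the decorator pattern. With check_model_inputs and
# _can_record_outputs, there is no need to thread output_attentions /
# output_hidden_states flags through forward() manually. @can_return_tuple
# plays a related role on forward methods that return a ModelOutput.
from torch import nn
from transformers import PreTrainedModel
from transformers.utils import auto_docstring
from transformers.utils.generic import check_model_inputs


class TinyVlmAttention(nn.Module): ...     # hypothetical stub
class TinyVlmDecoderLayer(nn.Module): ...  # hypothetical stub


class TinyVlmModel(PreTrainedModel):  # hypothetical model
    _can_record_outputs = {
        "hidden_states": TinyVlmDecoderLayer,  # record each layer's output
        "attentions": TinyVlmAttention,        # record attention weights
    }

    @check_model_inputs
    @auto_docstring
    def forward(self, input_ids=None, pixel_values=None, **kwargs):
        ...
```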


All checks must pass before your PR can be merged.

**If this checklist is complete, your PR has a very high likelihood of being merged!** Following these steps makes the maintainers' work much easier and will reduce the number of review iterations, getting your important work out there faster.
A Member commented:
🤗

@ariG23498 (Contributor) left a comment

Adding links to scripts might make this more useful for readers.

@ariG23498 (Contributor) left a comment

I have added the reviews from @zucchini-nlp and @yonigozlan as suggestions, for the convenience of @molbap.

Hope this helps!

molbap and others added 7 commits October 20, 2025 16:17
Co-authored-by: Aritra Roy Gosthipaty <aritra.born2fly@gmail.com>
@molbap molbap merged commit 9aab965 into main Oct 20, 2025
15 checks passed
@molbap molbap deleted the contributing_vision branch October 20, 2025 16:56
