Mistral: Add support for interleaved attention #39799
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
run-slow: mistral
This comment contains run-slow, running the specified jobs: models: ['models/mistral']
A few comments to make it fit our standard way of doing 🤗
src/transformers/models/phi4_multimodal/modular_phi4_multimodal.py (review thread resolved)
[For maintainers] Suggested jobs to run (before merge): run-slow: minimax, mistral, mixtral, phi3, phi4_multimodal, starcoder2
"past_key_values": past_key_values, | ||
"position_ids": position_ids, | ||
} | ||
full_mask_already_prepared = isinstance(attention_mask, torch.Tensor) and len(attention_mask.shape) > 2 |
This check is necessary because test_modeling_mistral.py::Mask4DTestHard::test_stacked_causal_mask passes a raw 4D mask.
Maybe I should change the test to pass a dict instead? It would break BC...
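For context, here is a minimal standalone illustration (not from the PR) of the two mask shapes the check distinguishes, assuming the usual convention of a 2D padding mask of shape (batch, seq_len) versus a user-supplied 4D mask of shape (batch, 1, query_len, kv_len):

```python
import torch

# A 2D padding mask is the common case and still needs to be expanded by the model.
padding_mask = torch.ones(1, 7, dtype=torch.long)  # (batch, seq_len)

# A raw 4D mask, as passed by Mask4DTestHard::test_stacked_causal_mask, is already
# in its final (batch, 1, query_len, kv_len) form and must be used as-is.
full_4d_mask = torch.tril(torch.ones(7, 7)).view(1, 1, 7, 7)

def full_mask_already_prepared(attention_mask):
    # Same condition as in the diff above: a tensor with more than 2 dims is
    # treated as an already-prepared full mask.
    return isinstance(attention_mask, torch.Tensor) and len(attention_mask.shape) > 2

assert not full_mask_already_prepared(padding_mask)
assert full_mask_already_prepared(full_4d_mask)
```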
This does not make sense; the model is sliding-only.
run-slow: mistral
This comment contains run-slow, running the specified jobs: models: ['models/mistral']
The Mistral model was never interleaved, so my first review leans towards no.
We can maybe define this in the config but hardcode it to be sliding. Our philosophy is not to add abstraction when it is not relevant. Same for Mixtral: the modeling code should be unchanged.
"past_key_values": past_key_values, | ||
"position_ids": position_ids, | ||
} | ||
full_mask_already_prepared = isinstance(attention_mask, torch.Tensor) and len(attention_mask.shape) > 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this does not make sense, the model is sliding only
mistralai/Ministral-8B-Instruct-2410 is interleaved, as reported by @hmellor; quoting the model's README: "Trained with a 128k context window with interleaved sliding-window attention". I had understood that we wanted to support this 😅 But if Ministral isn't relevant enough to justify the modeling change, sure, just close the PR 🤗
The PR already does this: defaults are set to sliding or full attention unless the model's config says otherwise (see transformers/src/transformers/models/mistral/configuration_mistral.py, lines 164 to 167 at 2166534).
(several Mistral models use full attention, example here)
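As a rough standalone sketch of the default behavior described above, following the per-layer layer_types convention used elsewhere in the library (not the exact lines from the PR):

```python
# Illustrative only: unless the checkpoint's config explicitly lists per-layer
# attention types, every layer defaults to sliding attention when a sliding
# window is configured, and to full attention otherwise.
def default_layer_types(num_hidden_layers, sliding_window=None, layer_types=None):
    if layer_types is None:
        attention_type = "sliding_attention" if sliding_window is not None else "full_attention"
        layer_types = [attention_type] * num_hidden_layers
    return layer_types

print(default_layer_types(3, sliding_window=4096))
# ['sliding_attention', 'sliding_attention', 'sliding_attention']
print(default_layer_types(3))
# ['full_attention', 'full_attention', 'full_attention']
```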
This PR doesn't change Mixtral's modeling, it only moves …
I don't mind the config, but the modeling code is no longer …
🤯 wow, you have every model in your head! I made a quick Ministral modular from Qwen2, and it matches this PR's outputs exactly with just the bias removal:
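(The original snippet is not reproduced here; below is a minimal hypothetical sketch of what such a modular file could look like, assuming the standard modular-transformers pattern, Qwen2's public class names, and only the bias change mentioned above — not the author's actual code.)

```python
# Hypothetical modular_ministral.py sketch.
from torch import nn

from transformers.models.mistral.configuration_mistral import MistralConfig
from transformers.models.qwen2.modeling_qwen2 import Qwen2Attention, Qwen2ForCausalLM, Qwen2Model


class MinistralConfig(MistralConfig):
    pass


class MinistralAttention(Qwen2Attention):
    def __init__(self, config, layer_idx):
        super().__init__(config, layer_idx)
        # Qwen2 adds biases to the q/k/v projections; Mistral-family models do not.
        self.q_proj = nn.Linear(config.hidden_size, config.num_attention_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(config.hidden_size, config.num_key_value_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(config.hidden_size, config.num_key_value_heads * self.head_dim, bias=False)


class MinistralModel(Qwen2Model):
    pass


class MinistralForCausalLM(Qwen2ForCausalLM):
    pass
```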
So I understand now! What's the preferred approach?
https://huggingface.co/mistralai/Ministral-8B-Instruct-2410/blob/main/params.json (the Mistral-format config) shows that Ministral was supposed to be interleaved. From what I understand of the history, when this model was originally contributed to Transformers, interleaved sliding attention was not supported, so the model was capped to use all-sliding attention.
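For illustration only (the actual Ministral layer pattern is not shown in this thread), "interleaved" here means mixing sliding-window and full-attention layers rather than using a sliding window everywhere:

```python
# Hypothetical 3:1 interleaving pattern, purely to illustrate the difference
# from the all-sliding configuration the model was capped to.
num_hidden_layers = 8
all_sliding = ["sliding_attention"] * num_hidden_layers
interleaved = [
    "full_attention" if (i + 1) % 4 == 0 else "sliding_attention"
    for i in range(num_hidden_layers)
]
print(interleaved)
# ['sliding_attention', 'sliding_attention', 'sliding_attention', 'full_attention',
#  'sliding_attention', 'sliding_attention', 'sliding_attention', 'full_attention']
```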
I am interested in learning about model bring-up, so I'll open the PR! Closing this one.
Adds support for interleaved attention masks to the Mistral model.