
Mistral: Add support for interleaved attention #39799


Closed · 16 commits

Conversation

manueldeprada (Contributor)

Adds support for interleaved attention masks to the Mistral model.
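For context, a rough sketch of the idea (illustrative only: the every-Nth-layer rule and names below are assumptions, not this PR's exact config surface). "Interleaved" means some decoder layers use sliding-window attention while others attend over the full context, chosen per layer:

# Illustrative sketch of interleaved layer types (assumed pattern, not the PR's code).
num_hidden_layers = 8
full_attention_every = 4  # assumption: every 4th layer attends over the full context

layer_types = [
    "full_attention" if (i + 1) % full_attention_every == 0 else "sliding_attention"
    for i in range(num_hidden_layers)
]
print(layer_types)
# ['sliding_attention', 'sliding_attention', 'sliding_attention', 'full_attention', ...]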

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@manueldeprada manueldeprada changed the title Adds support for interleaved attention on Mistral models Mistral: Add support for interleaved attention Jul 31, 2025
@manueldeprada manueldeprada marked this pull request as ready for review July 31, 2025 13:16
@manueldeprada (Contributor Author)

run-slow: mistral

github-actions bot (Contributor)

This comment contains run-slow, running the specified jobs:

models: ['models/mistral']
quantizations: [] ...

@huggingface huggingface deleted a comment from github-actions bot Jul 31, 2025
@huggingface huggingface deleted a comment from github-actions bot Jul 31, 2025
@huggingface huggingface deleted a comment from github-actions bot Jul 31, 2025
@Cyrilvallez (Member) left a comment


A few comments to make it fit our standard way of doing 🤗

github-actions bot (Contributor) commented Aug 6, 2025

[For maintainers] Suggested jobs to run (before merge)

run-slow: minimax, mistral, mixtral, phi3, phi4_multimodal, starcoder2

"past_key_values": past_key_values,
"position_ids": position_ids,
}
full_mask_already_prepared = isinstance(attention_mask, torch.Tensor) and len(attention_mask.shape) > 2
@manueldeprada (Contributor Author) commented Aug 6, 2025


This check is necessary due to test_modeling_mistral.py::Mask4DTestHard::test_stacked_causal_mask passing a raw 4d mask.

Maybe I should change the test to pass a dict instead? But that would break BC...
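For illustration, a small sketch of the shapes this check distinguishes (the 4D mask below is a stand-in for what the test passes, not the test's actual tensors):

import torch

batch, seq_len = 2, 8

# Typical 2D padding mask, shape (batch, seq_len): the model still has to
# expand it into a 4D attention mask.
padding_mask = torch.ones(batch, seq_len, dtype=torch.long)

# A raw 4D mask, shape (batch, 1, query_len, kv_len), already prepared by the
# caller, roughly what Mask4DTestHard::test_stacked_causal_mask provides.
raw_4d_mask = torch.tril(torch.ones(seq_len, seq_len)).expand(batch, 1, seq_len, seq_len)

def full_mask_already_prepared(attention_mask):
    # Equivalent to the check above: more than 2 dims means "already prepared".
    return isinstance(attention_mask, torch.Tensor) and attention_mask.dim() > 2

assert not full_mask_already_prepared(padding_mask)
assert full_mask_already_prepared(raw_4d_mask)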

Collaborator


this does not make sense, the model is sliding only

@manueldeprada (Contributor Author)

run-slow: mistral

github-actions bot (Contributor) commented Aug 6, 2025

This comment contains run-slow, running the specified jobs:

models: ['models/mistral']
quantizations: [] ...

@ArthurZucker (Collaborator) left a comment


The Mistral model was never interleaved, so my first review leans towards no.
We can maybe define this in the config but hardcode it to be sliding. Our philosophy is to not add abstraction when it is not relevant. Same for mixtral: modeling code should be unchanged.

"past_key_values": past_key_values,
"position_ids": position_ids,
}
full_mask_already_prepared = isinstance(attention_mask, torch.Tensor) and len(attention_mask.shape) > 2
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this does not make sense, the model is sliding only

manueldeprada (Contributor Author) commented Aug 8, 2025

The Mistral model was never interleaved, so my first review leans towards no.
We can maybe define this in the config but hardcode it to be sliding. Our philosophy is to not add abstraction when it is not relevant.

mistralai/Ministral-8B-Instruct-2410 is interleaved, as reported by @hmellor, quoting the model's README: "Trained with a 128k context window with interleaved sliding-window attention".

I had understood that we wanted to support this 😅 But if Ministral isn’t relevant enough to justify the modeling change, sure just close the PR 🤗

We can maybe define this in the config but hardcode it to be sliding.

The PR already does this: defaults are set to sliding or full attention unless the model's config says otherwise:

if self.layer_types is None:
    self.layer_types = [
        "sliding_attention" if self.sliding_window is not None else "full_attention"
    ] * num_hidden_layers

(several Mistral models use full attention, example here)
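For illustration, a minimal sketch of how such layer_types could drive per-layer mask selection (the mask builders below are simplified stand-ins, not the actual transformers masking utilities):

import torch

def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    # True where key j is visible to query i: causal and within the window.
    idx = torch.arange(seq_len)
    return (idx[:, None] >= idx[None, :]) & ((idx[:, None] - idx[None, :]) < window)

def full_causal_mask(seq_len: int) -> torch.Tensor:
    idx = torch.arange(seq_len)
    return idx[:, None] >= idx[None, :]

seq_len, window = 6, 3
layer_types = ["sliding_attention", "sliding_attention", "full_attention"]
masks = {
    "sliding_attention": sliding_window_causal_mask(seq_len, window),
    "full_attention": full_causal_mask(seq_len),
}
# Each decoder layer picks the mask matching its configured type.
per_layer_masks = [masks[layer_type] for layer_type in layer_types]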

Same for mixtral: modeling code should be unchanged.

This PR doesn't change Mixtral's modeling; it only moves Attention's sliding_window into __init__ so that inheriting models can modify it.
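For illustration, a minimal sketch of that pattern (simplified, not the actual modeling code): resolving the window once in __init__ lets an inheriting model, or an interleaved layer_types entry, change it per layer without touching forward():

class SketchAttention:
    # Simplified stand-in for the Attention classes touched by this PR.
    def __init__(self, config, layer_idx: int):
        # Resolved here rather than hardcoded in forward(), so a subclass or a
        # per-layer config entry can override it.
        self.sliding_window = (
            config.sliding_window
            if config.layer_types[layer_idx] == "sliding_attention"
            else None
        )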

@ArthurZucker (Collaborator)

I don't mind the config, but the modeling code is no longer mistral, but qwen2. We need to check, because that architecture already exists -> it does not make sense to me to make an exception here! So as much as possible, let's check which model is the closest.

manueldeprada (Contributor Author) commented Aug 8, 2025

I don't mind the config, but the modeling code is no longer mistral, but qwen2. We need to check, because that architecture already exists -> it does not make sense to me to make an exception here! So as much as possible, let's check which model is the closest.

🤯 wow, you have every model in your head! I made a quick Ministral modular from Qwen2, and it exactly matches this PR's outputs with just the bias removed:

class MinistralAttention(Qwen2Attention):
    def __init__(self, config, layer_idx: int):
        super().__init__(config, layer_idx)
        # Qwen2 uses biased q/k/v projections; Mistral/Ministral do not,
        # so re-create them without bias.
        self.q_proj = nn.Linear(config.hidden_size, config.num_attention_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(config.hidden_size, config.num_key_value_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(config.hidden_size, config.num_key_value_heads * self.head_dim, bias=False)

So I understand now! What's the preferred approach?

  1. Give up on Ministral and only do the config change for external libraries (and maybe warn that Ministral may be incorrect on long context windows?).
  2. Add a new Ministral model with the slim modular from Qwen2.

hmellor (Member) commented Aug 8, 2025

https://huggingface.co/mistralai/Ministral-8B-Instruct-2410/blob/main/params.json (Mistral format config) shows that Ministral was supposed to be interleaved.

From what I understand about the history, when this model was originally contributed to Transformers, interleaved sliding attention was not supported, so the model was capped to use all sliding attention.

@ArthurZucker (Collaborator)

"Add a new Ministral model with the slim modular from Qwen2" would be the best IMO! We can open a PR to support it, or just hardcode a case for mistral + interleaved -> load Ministral.

@manueldeprada (Contributor Author)

"Add a new Ministral model with the slim modular from Qwen2" would be the best IMO! We can open a PR to support it, or just hardcode a case for mistral + interleaved -> load Ministral.

I am interested in learning about model bring-up, so I’ll open the PR! Closing this
