Commits
24 commits
5bee627
add ministral model
manueldeprada Aug 18, 2025
528fe18
docs, tests
manueldeprada Aug 18, 2025
e9d8bbc
nits
manueldeprada Aug 20, 2025
c8c2157
Merge branch 'main' into model-add-ministral
manueldeprada Aug 21, 2025
d251e88
fix tests
manueldeprada Aug 21, 2025
fc7de60
run modular after merge
manueldeprada Aug 21, 2025
18d2bf9
Merge branch 'main' into model-add-ministral
manueldeprada Aug 21, 2025
e515a5b
opsie
manueldeprada Aug 22, 2025
bdadc2c
Merge branch 'model-add-ministral' of github.com:manueldeprada/transf…
manueldeprada Aug 22, 2025
2eed170
integration tests
manueldeprada Aug 22, 2025
b2ca2ae
again
manueldeprada Aug 22, 2025
f47b51f
fff
manueldeprada Aug 22, 2025
44597b1
Merge branch 'main' into model-add-ministral
manueldeprada Aug 22, 2025
7f94b14
Merge branch 'main' into model-add-ministral
manueldeprada Aug 22, 2025
8476dec
dtype
manueldeprada Aug 22, 2025
16814c4
Merge branch 'main' of github.com:huggingface/transformers into model…
manueldeprada Aug 22, 2025
8b8433d
Merge branch 'model-add-ministral' of github.com:manueldeprada/transf…
manueldeprada Aug 22, 2025
d9aa776
Merge branch 'main' into model-add-ministral
manueldeprada Aug 25, 2025
e5ff6dc
Merge branch 'main' into model-add-ministral
manueldeprada Aug 25, 2025
9edd006
rerun modular
manueldeprada Aug 25, 2025
914cf2f
Merge branch 'main' of github.com:huggingface/transformers into model…
manueldeprada Aug 29, 2025
765e82a
arthur review
manueldeprada Aug 29, 2025
3ff9b6b
ops
manueldeprada Aug 29, 2025
40d35ab
Merge branch 'main' into model-add-ministral
manueldeprada Aug 29, 2025
2 changes: 2 additions & 0 deletions docs/source/en/_toctree.yml
@@ -581,6 +581,8 @@
title: MegatronGPT2
- local: model_doc/minimax
title: MiniMax
- local: model_doc/ministral
title: Ministral
- local: model_doc/mistral
title: Mistral
- local: model_doc/mixtral
86 changes: 86 additions & 0 deletions docs/source/en/model_doc/ministral.md
@@ -0,0 +1,86 @@
<!--Copyright 2024 Mistral AI and The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

<div style="float: right;">
<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
<img alt="Tensor parallelism" src="https://img.shields.io/badge/Tensor%20parallelism-06b6d4?style=flat&logoColor=white">
</div>
</div>

# Ministral

[Ministral](https://huggingface.co/mistralai/Ministral-8B-Instruct-2410) is an 8B parameter language model that extends the Mistral architecture with an alternating attention pattern. Unlike Mistral, which uses either full attention or sliding window attention consistently, Ministral alternates between full attention and sliding window attention layers, in a pattern of one full attention layer followed by three sliding window attention layers. This enables support for a 128K context length.

This architecture turns out to coincide with that of Qwen2, the main difference being that Qwen2 uses biases in its attention projections while Ministral does not.
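
The sketch below shows one way to express this one-full / three-sliding layout as the `layer_types` list that [`MinistralConfig`] accepts. It is an illustration only: the layer count and the exact positions of the full-attention layers are assumptions, not values read from the released checkpoint.

```python
# Illustrative sketch: build a 1-full / 3-sliding attention pattern as a `layer_types` list.
# The layer count and the positions of the full-attention layers are assumptions.
def alternating_layer_types(num_hidden_layers: int) -> list[str]:
    return ["full_attention" if i % 4 == 0 else "sliding_attention" for i in range(num_hidden_layers)]

print(alternating_layer_types(8))
# ['full_attention', 'sliding_attention', 'sliding_attention', 'sliding_attention',
#  'full_attention', 'sliding_attention', 'sliding_attention', 'sliding_attention']
```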


You can find the Ministral checkpoints under the [Mistral AI](https://huggingface.co/mistralai) organization.

## Usage

The example below demonstrates how to use Ministral for text generation:

```python
>>> import torch
>>> from transformers import AutoModelForCausalLM, AutoTokenizer

>>> model = AutoModelForCausalLM.from_pretrained("mistralai/Ministral-8B-Instruct-2410", torch_dtype=torch.bfloat16, attn_implementation="sdpa", device_map="auto")
>>> tokenizer = AutoTokenizer.from_pretrained("mistralai/Ministral-8B-Instruct-2410")

>>> messages = [
...     {"role": "user", "content": "What is your favourite condiment?"},
...     {"role": "assistant", "content": "Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!"},
...     {"role": "user", "content": "Do you have mayonnaise recipes?"}
... ]

>>> model_inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to("cuda")

>>> generated_ids = model.generate(model_inputs, max_new_tokens=100, do_sample=True)
>>> tokenizer.batch_decode(generated_ids)[0]
"Mayonnaise can be made as follows: (...)"
```
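
A [`pipeline`]-based variant is sketched below. It reuses the `messages` list from the example above; the dtype and device settings are illustrative, and the way the reply is extracted assumes the chat-style return format of recent Transformers versions.

```python
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="mistralai/Ministral-8B-Instruct-2410",
    torch_dtype="auto",
    device_map="auto",
)

outputs = generator(messages, max_new_tokens=100)
# With chat-style input, the pipeline returns the continued conversation;
# the last message holds the assistant reply.
print(outputs[0]["generated_text"][-1]["content"])
```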

## MinistralConfig

[[autodoc]] MinistralConfig

## MinistralModel

[[autodoc]] MinistralModel
- forward

## MinistralForCausalLM

[[autodoc]] MinistralForCausalLM
- forward

## MinistralForSequenceClassification

[[autodoc]] MinistralForSequenceClassification
- forward

## MinistralForTokenClassification

[[autodoc]] MinistralForTokenClassification
- forward

## MinistralForQuestionAnswering

[[autodoc]] MinistralForQuestionAnswering
- forward
1 change: 1 addition & 0 deletions src/transformers/models/__init__.py
@@ -208,6 +208,7 @@
from .mgp_str import *
from .mimi import *
from .minimax import *
from .ministral import *
from .mistral import *
from .mistral3 import *
from .mixtral import *
2 changes: 2 additions & 0 deletions src/transformers/models/auto/configuration_auto.py
@@ -249,6 +249,7 @@
("mgp-str", "MgpstrConfig"),
("mimi", "MimiConfig"),
("minimax", "MiniMaxConfig"),
("ministral", "MinistralConfig"),
("mistral", "MistralConfig"),
("mistral3", "Mistral3Config"),
("mixtral", "MixtralConfig"),
@@ -679,6 +680,7 @@
("mgp-str", "MGP-STR"),
("mimi", "Mimi"),
("minimax", "MiniMax"),
("ministral", "Ministral"),
("mistral", "Mistral"),
("mistral3", "Mistral3"),
("mixtral", "Mixtral"),
5 changes: 5 additions & 0 deletions src/transformers/models/auto/modeling_auto.py
@@ -249,6 +249,7 @@ class _BaseModelWithGenerate(PreTrainedModel, GenerationMixin):
("mgp-str", "MgpstrForSceneTextRecognition"),
("mimi", "MimiModel"),
("minimax", "MiniMaxModel"),
("ministral", "MinistralModel"),
("mistral", "MistralModel"),
("mistral3", "Mistral3Model"),
("mixtral", "MixtralModel"),
@@ -682,6 +683,7 @@ class _BaseModelWithGenerate(PreTrainedModel, GenerationMixin):
("mega", "MegaForCausalLM"),
("megatron-bert", "MegatronBertForCausalLM"),
("minimax", "MiniMaxForCausalLM"),
("ministral", "MinistralForCausalLM"),
("mistral", "MistralForCausalLM"),
("mixtral", "MixtralForCausalLM"),
("mllama", "MllamaForCausalLM"),
@@ -1233,6 +1235,7 @@ class _BaseModelWithGenerate(PreTrainedModel, GenerationMixin):
("mega", "MegaForSequenceClassification"),
("megatron-bert", "MegatronBertForSequenceClassification"),
("minimax", "MiniMaxForSequenceClassification"),
("ministral", "MinistralForSequenceClassification"),
("mistral", "MistralForSequenceClassification"),
("mixtral", "MixtralForSequenceClassification"),
("mobilebert", "MobileBertForSequenceClassification"),
@@ -1331,6 +1334,7 @@ class _BaseModelWithGenerate(PreTrainedModel, GenerationMixin):
("mega", "MegaForQuestionAnswering"),
("megatron-bert", "MegatronBertForQuestionAnswering"),
("minimax", "MiniMaxForQuestionAnswering"),
("ministral", "MinistralForQuestionAnswering"),
("mistral", "MistralForQuestionAnswering"),
("mixtral", "MixtralForQuestionAnswering"),
("mobilebert", "MobileBertForQuestionAnswering"),
@@ -1443,6 +1447,7 @@ class _BaseModelWithGenerate(PreTrainedModel, GenerationMixin):
("mega", "MegaForTokenClassification"),
("megatron-bert", "MegatronBertForTokenClassification"),
("minimax", "MiniMaxForTokenClassification"),
("ministral", "MinistralForTokenClassification"),
("mistral", "MistralForTokenClassification"),
("mixtral", "MixtralForTokenClassification"),
("mobilebert", "MobileBertForTokenClassification"),
2 changes: 1 addition & 1 deletion src/transformers/models/auto/tokenization_auto.py
@@ -780,7 +780,7 @@ def tokenizer_class_from_name(class_name: str) -> Union[type[Any], None]:
    for module_name, tokenizers in TOKENIZER_MAPPING_NAMES.items():
        if class_name in tokenizers:
            module_name = model_type_to_module_name(module_name)
            if module_name in ["mistral", "mixtral"] and class_name == "MistralCommonTokenizer":
            if module_name in ["mistral", "mixtral", "ministral"] and class_name == "MistralCommonTokenizer":
                module = importlib.import_module(".tokenization_mistral_common", "transformers")
            else:
                module = importlib.import_module(f".{module_name}", "transformers.models")
27 changes: 27 additions & 0 deletions src/transformers/models/ministral/__init__.py
@@ -0,0 +1,27 @@
# Copyright 2023 Mistral AI and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING

from ...utils import _LazyModule
from ...utils.import_utils import define_import_structure


if TYPE_CHECKING:
    from .configuration_ministral import *
    from .modeling_ministral import *
else:
    import sys

    _file = globals()["__file__"]
    sys.modules[__name__] = _LazyModule(__name__, _file, define_import_structure(_file), module_spec=__spec__)
164 changes: 164 additions & 0 deletions src/transformers/models/ministral/configuration_ministral.py
@@ -0,0 +1,164 @@
# 🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨
# This file was automatically generated from src/transformers/models/ministral/modular_ministral.py.
# Do NOT edit this file manually as any edits will be overwritten by the generation of
# the file from the modular. If any change should be done, please apply the change to the
# modular_ministral.py file directly. One of our CI enforces this.
# 🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨

from ...configuration_utils import PretrainedConfig


class MinistralConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`MinistralModel`]. It is used to instantiate a
    Ministral model according to the specified arguments, defining the model architecture. Instantiating a configuration
    with the defaults will yield a similar configuration to that of the Ministral-8B-Instruct-2410.

    [mistralai/Ministral-8B-Instruct-2410](https://huggingface.co/mistralai/Ministral-8B-Instruct-2410)

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.


    Args:
        vocab_size (`int`, *optional*, defaults to 32000):
            Vocabulary size of the Ministral model. Defines the number of different tokens that can be represented by the
            `inputs_ids` passed when calling [`MinistralModel`]
        hidden_size (`int`, *optional*, defaults to 4096):
            Dimension of the hidden representations.
        intermediate_size (`int`, *optional*, defaults to 14336):
            Dimension of the MLP representations.
        num_hidden_layers (`int`, *optional*, defaults to 32):
            Number of hidden layers in the Transformer encoder.
        num_attention_heads (`int`, *optional*, defaults to 32):
            Number of attention heads for each attention layer in the Transformer encoder.
        num_key_value_heads (`int`, *optional*, defaults to 8):
            This is the number of key_value heads that should be used to implement Grouped Query Attention. If
            `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
            `num_key_value_heads=1` the model will use Multi Query Attention (MQA) otherwise GQA is used. When
            converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
            by meanpooling all the original heads within that group. For more details, check out [this
            paper](https://huggingface.co/papers/2305.13245). If it is not specified, will default to `8`.
        head_dim (`int`, *optional*, defaults to `hidden_size // num_attention_heads`):
            The attention head dimension.
        hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
            The non-linear activation function (function or string) in the decoder.
        max_position_embeddings (`int`, *optional*, defaults to `4096*32`):
            The maximum sequence length that this model might ever be used with. Ministral's sliding window attention
            allows sequences of up to 4096*32 tokens.
        initializer_range (`float`, *optional*, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        rms_norm_eps (`float`, *optional*, defaults to 1e-06):
            The epsilon used by the rms normalization layers.
        use_cache (`bool`, *optional*, defaults to `True`):
            Whether or not the model should return the last key/values attentions (not used by all models). Only
            relevant if `config.is_decoder=True`.
        pad_token_id (`int`, *optional*):
            The id of the padding token.
        bos_token_id (`int`, *optional*, defaults to 1):
            The id of the "beginning-of-sequence" token.
        eos_token_id (`int`, *optional*, defaults to 2):
            The id of the "end-of-sequence" token.
        tie_word_embeddings (`bool`, *optional*, defaults to `False`):
            Whether the model's input and output word embeddings should be tied.
        rope_theta (`float`, *optional*, defaults to 10000.0):
            The base period of the RoPE embeddings.
        sliding_window (`int`, *optional*, defaults to 4096):
            Sliding window attention window size. If not specified, will default to `4096`.
        attention_dropout (`float`, *optional*, defaults to 0.0):
            The dropout ratio for the attention probabilities.
        layer_types (`list`, *optional*):
            Attention pattern for each layer, as a list of `"full_attention"` or `"sliding_attention"` entries. If not
            specified, defaults to `"sliding_attention"` for every layer when `sliding_window` is set, and to
            `"full_attention"` otherwise.

    ```python
    >>> from transformers import MinistralModel, MinistralConfig

    >>> # Initializing a Ministral 8B style configuration
    >>> configuration = MinistralConfig()

    >>> # Initializing a model from the Ministral 8B style configuration
    >>> model = MinistralModel(configuration)

    >>> # Accessing the model configuration
    >>> configuration = model.config
    ```"""

    model_type = "ministral"
    keys_to_ignore_at_inference = ["past_key_values"]
    # Default tensor parallel plan for base model `MinistralModel`
    base_model_tp_plan = {
        "layers.*.self_attn.q_proj": "colwise",
        "layers.*.self_attn.k_proj": "colwise",
        "layers.*.self_attn.v_proj": "colwise",
        "layers.*.self_attn.o_proj": "rowwise",
        "layers.*.mlp.gate_proj": "colwise",
        "layers.*.mlp.up_proj": "colwise",
        "layers.*.mlp.down_proj": "rowwise",
    }
    base_model_pp_plan = {
        "embed_tokens": (["input_ids"], ["inputs_embeds"]),
        "layers": (["hidden_states", "attention_mask"], ["hidden_states"]),
        "norm": (["hidden_states"], ["hidden_states"]),
    }

    def __init__(
        self,
        vocab_size=32000,
        hidden_size=4096,
        intermediate_size=14336,
        num_hidden_layers=32,
        num_attention_heads=32,
        num_key_value_heads=8,
        head_dim=None,
        hidden_act="silu",
        max_position_embeddings=4096 * 32,
        initializer_range=0.02,
        rms_norm_eps=1e-6,
        use_cache=True,
        pad_token_id=None,
        bos_token_id=1,
        eos_token_id=2,
        tie_word_embeddings=False,
        rope_theta=10000.0,
        sliding_window=4096,
        attention_dropout=0.0,
        layer_types=None,
        **kwargs,
    ):
        super().__init__(
            pad_token_id=pad_token_id,
            bos_token_id=bos_token_id,
            eos_token_id=eos_token_id,
            tie_word_embeddings=tie_word_embeddings,
            **kwargs,
        )
        self.vocab_size = vocab_size
        self.max_position_embeddings = max_position_embeddings
        self.hidden_size = hidden_size
        self.intermediate_size = intermediate_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.sliding_window = sliding_window
        self.head_dim = head_dim

        # for backward compatibility
        if num_key_value_heads is None:
            num_key_value_heads = num_attention_heads

        self.num_key_value_heads = num_key_value_heads
        self.hidden_act = hidden_act
        self.initializer_range = initializer_range
        self.rms_norm_eps = rms_norm_eps
        self.use_cache = use_cache
        self.rope_theta = rope_theta
        self.attention_dropout = attention_dropout
        self.layer_types = layer_types

        if self.layer_types is None:
            self.layer_types = [
                "sliding_attention" if self.sliding_window is not None else "full_attention"
            ] * num_hidden_layers


__all__ = ["MinistralConfig"]
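
As a quick illustration of the `layer_types` default in the `__init__` above (a minimal sketch, assuming a Transformers build that includes this `MinistralConfig`), instantiating the config without `layer_types` fills the list from `sliding_window`:

```python
from transformers import MinistralConfig

# With the default sliding_window=4096, every layer falls back to sliding attention.
config = MinistralConfig(num_hidden_layers=4)
print(config.layer_types)
# ['sliding_attention', 'sliding_attention', 'sliding_attention', 'sliding_attention']

# An explicit pattern can be passed instead, e.g. one full-attention layer per block of four.
config = MinistralConfig(
    num_hidden_layers=4,
    layer_types=["full_attention", "sliding_attention", "sliding_attention", "sliding_attention"],
)
```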