Skip to content

Conversation

juliendenize
Copy link
Contributor

@juliendenize juliendenize commented Oct 8, 2025

What does this PR do?

This PR patches the MistralCommonTokenizer:

  1. [BUG FIX] spm now is correctly supported as previous usage would result in an error for _piece_to_id and _is_control_token.
  2. [BUG FIX] previous implementation of _piece_to_id was incorrect for Tekkenizer.

a) the special tokens were not supported
b) the normal tokens were not shifted by adding the number of special tokens

  1. [MAYBE BUG FIX] Changed get_vocab to a function that should better mimic what happens in Transformers: now the mapping is based on the real ids but some ids are missing due to conversion loss of some tokens in Tekken.
  2. [FEATURE] add_generation_prompt has been added. This is to match signature of Transformers. In practice this value is ignored except if:

a) continue_final_message and add_generation_prompt are True an error is raised.
b) if add_generation_prompt is True and the last message is assistant then an error is raised as the user should have passed continue_final_message.

  1. [FEATURE] Now the tokenizer is set to fast because:

a) it is true(Edit: not so sure about that after discussion)
b) it removes annoying message when initializing the tokenizer.

Edit 2: this has been reverted as it doesn't bring value and is misleading.

  1. [OPTIMIZATION] now image tensors are converted without slowness warnings from torch.
  2. [DOCS] updated some docs to remove unused args.

Also added minimal tests to ensure SPM also works in the future.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@patrickvonplaten @ArthurZucker @itazap

@juliendenize juliendenize changed the title Fix token_to_id and add add_generation_prompt Patch MistralCommonTokenizer Oct 9, 2025
Copy link
Contributor

github-actions bot commented Oct 9, 2025

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto

Copy link
Collaborator

@ArthurZucker ArthurZucker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for updating! 🤗

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@juliendenize juliendenize force-pushed the patch_mistral_tokenizer branch from bbe0c0f to 8938039 Compare October 10, 2025 17:53
@ArthurZucker ArthurZucker enabled auto-merge (squash) October 14, 2025 11:05
@ArthurZucker ArthurZucker merged commit 0566b6f into huggingface:main Oct 14, 2025
17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants