Conversation

sanderland (Contributor)

This PR introduces a new boolean option, enforce_utf8_boundaries, to the BpeTrainer. In recent work we've shown that preventing BPE merges that cross UTF-8 character boundaries leads to higher-quality tokenizers.
Although we have provided our own implementation, several people at the ICML TokShop suggested that the community would benefit far more if this were available in 🤗 tokenizers.

The flag ensures that every token learned corresponds to either a full character sequence or a valid, contiguous byte sequence prefix within a single character, leading to:

  • Improved Token Quality: It prevents the creation of unintuitive and potentially harmful tokens that mix partial and complete characters (e.g., <0x95>\n\n as seen in GPT-4o; see the sketch after this list).
  • Better Compression: Counter-intuitively, this constraint can also lead to a slight improvement in the final compression rate, as the model is guided toward more semantically meaningful merges.
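
To make the first bullet concrete, a minimal sketch (not from the PR): 0x95 is a UTF-8 continuation byte, so a token whose bytes begin with it can never decode to text on its own.

fn main() {
    // 0x95 (0b10010101) is a continuation byte; it cannot start a character,
    // so a token like <0x95>\n\n is not valid UTF-8 by itself.
    let token_bytes = [0x95u8, b'\n', b'\n'];
    assert!(std::str::from_utf8(&token_bytes).is_err());
}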

To maintain backward compatibility, this option is disabled by default.
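
For illustration, opting in could look roughly like this. This is a sketch only: the enforce_utf8_boundaries builder method is assumed to mirror the new trainer field, and the trainer should be paired with a ByteLevel pre-tokenizer (see the compatibility check discussed further down).

use tokenizers::models::bpe::BpeTrainer;

fn main() {
    let trainer = BpeTrainer::builder()
        .vocab_size(32_000)
        // Assumed builder method mirroring the new field; off by default
        // for backward compatibility.
        .enforce_utf8_boundaries(true)
        .build();
    let _ = trainer;
}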

Unfortunately, the option requires some interaction with the pretokenizer, which does the encoding. I've tried to keep this clean, but I'm open to suggestions, especially as my skills in the language are a bit rusty.

TODO

  • Node bindings. Will see if I can add this after initial review, or get help on it.

 /// Converts bytes to unicode characters.
 /// See https://github.com/openai/gpt-2/blob/master/src/encoder.py#L9
-pub(crate) fn bytes_char() -> AHashMap<u8, char> {
+pub fn bytes_char() -> AHashMap<u8, char> {
Contributor
Why is this made pub? This seems like a mistake. In fact, it seems even the previous pub(crate) is misplaced. Can you make this private and reuse the CHAR_BYTES static in this module by making it pub(crate)?

@sanderland (Contributor Author), Aug 3, 2025

The function is used in normalizers/byte_level.rs, but I'll make it pub(crate) again and use CHAR_BYTES.

Contributor

Yes, but normalizers/byte_level.rs only needs the function to build an exact copy of the static BYTE_TO_CHAR. So as a drive-by cleanup, we could make just the statics pub(crate) and reuse them across both pre_tokenizers/byte_level.rs and normalizers/byte_level.rs. We can then make this function private.
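
A sketch of that cleanup in pre_tokenizers/byte_level.rs, using the names quoted in this thread (the bytes_char body is elided):

use std::sync::LazyLock;
use ahash::AHashMap;

// bytes_char stays private; only the statics become pub(crate) so that
// normalizers/byte_level.rs can reuse them instead of rebuilding the map.
fn bytes_char() -> AHashMap<u8, char> {
    todo!("GPT-2 byte-to-unicode mapping, elided in this sketch")
}

pub(crate) static BYTE_TO_CHAR: LazyLock<AHashMap<u8, char>> = LazyLock::new(bytes_char);
pub(crate) static CHAR_TO_BYTE: LazyLock<AHashMap<char, u8>> =
    LazyLock::new(|| BYTE_TO_CHAR.iter().map(|(b, c)| (*c, *b)).collect());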

Comment on lines 928 to 930
static BYTE_TO_CHAR: LazyLock<AHashMap<u8, char>> = LazyLock::new(bytes_char);
static CHAR_TO_BYTE: LazyLock<AHashMap<char, u8>> =
    LazyLock::new(|| BYTE_TO_CHAR.iter().map(|(b, c)| (*c, *b)).collect());
Contributor

Can you reuse the statics already in the codebase? See the other comment.

Contributor Author

will do, just missed them I think :)

Comment on lines 335 to 338
// Rule 3 (Implicit): Any mix of complete and incomplete is disallowed.
if is_a_complete || is_b_complete {
    return false;
}
Contributor

Can this be changed to use xor? It took me a bit longer than necessary to grasp that the || works the same as ^ here, because the is_a_complete && is_b_complete check has already been done. With ^, the operator matches the comment more closely.

Contributor Author

sure
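
For reference, the agreed change would look like this; it assumes the is_a_complete && is_b_complete case has already returned earlier in the function, which is what makes ^ equivalent to || here.

// Rule 3 (Implicit): Any mix of complete and incomplete is disallowed.
// Exactly one side being complete means the merge would mix a complete
// and an incomplete character.
if is_a_complete ^ is_b_complete {
    return false;
}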

Comment on lines 543 to 565
/// Validates compatibility between a trainer and the current tokenizer configuration.
/// Currently only checks:
/// For BpeTrainer with `enforce_utf8_boundaries=True` => pretokenizer must be ByteLevel.
fn _check_trainer_compat<T: Trainer<Model = M> + 'static>(&self, trainer: &T) -> Result<()> {
    // Use `Any` to safely check for the BpeTrainer type at runtime
    if let Some(bpe_trainer) = (trainer as &dyn Any).downcast_ref::<bpe::BpeTrainer>() {
        if bpe_trainer.enforce_utf8_boundaries {
            // Now check if the pre_tokenizer is ByteLevel
            let is_byte_level = self.pre_tokenizer.as_ref().map_or(false, |pretok| {
                (pretok as &dyn Any).is::<pre_tokenizers::byte_level::ByteLevel>()
            });

            if !is_byte_level {
                return Err(
                    "`enforce_utf8_boundaries=True` can only be used with a `ByteLevel` pre-tokenizer."
                        .into(),
                );
            }
        }
    }
    Ok(())
}

@sftse (Contributor), Aug 3, 2025

This code does not compile. The cast pretok as &dyn Any is the culprit, because there isn't a 'static bound on the pretokenizer.

Because of the ugly restrictions on Any, it might be better to introduce default trait methods on Trainer and PreTokenizer, like this:

trait Trainer {
    // ... previous methods
    fn enforce_utf8_boundaries(&self) -> Option<bool> {
        None
    }
}

impl Trainer for BpeTrainer {
    // .. previous methods
    fn enforce_utf8_boundaries(&self) -> Option<bool> {
        Some(self.enforce_utf8_boundaries)
    }
}

trait PreTokenizer {
    // ... previous methods
    fn is_byte_level(&self) -> bool {
        false
    }
}

impl PreTokenizer for ByteLevel {
    // .. previous methods
    fn is_byte_level(&self) -> bool {
        true
    }
}

This is a bit more type-safe, catches more errors at compile time, and is imo easier to understand than the 'static bounds. But this should probably get the input of the maintainers.
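
For comparison, a sketch of _check_trainer_compat rewritten against these default trait methods (method names as suggested above; they are not part of the crate):

fn _check_trainer_compat<T: Trainer<Model = M>>(&self, trainer: &T) -> Result<()> {
    // No downcasting needed: trainers other than BpeTrainer return None
    // and are always considered compatible.
    if trainer.enforce_utf8_boundaries() == Some(true) {
        let is_byte_level = self
            .pre_tokenizer
            .as_ref()
            .map_or(false, |pretok| pretok.is_byte_level());
        if !is_byte_level {
            return Err(
                "`enforce_utf8_boundaries=True` can only be used with a `ByteLevel` pre-tokenizer."
                    .into(),
            );
        }
    }
    Ok(())
}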

Contributor Author

I've replaced this with a placeholder while awaiting comments from the maintainers.

@ArthurZucker (Collaborator)

Sorry for being super late here and thanks a lot for the PR, will have a look in a bit!
