Commit cad6344

Generalise re_replacement_seq to deal with special symbols
This PR is similar to #90 and generalises the regex to deal with all the previous cases, and hopefully all future ones as well. The new special cases not covered by the previous approach are the `�?` and `�,` tokens, used by Salamandra models. Since all these special tokens (new and old) consist of one or more � symbols, with an optional single-character prefix and/or suffix, we can simplify and generalise the pattern to `r"^.?�+.?$"`.
1 parent 78eb908 commit cad6344
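A quick sketch of what the generalised pattern accepts (the token list below is drawn from the models named in the commit message; it is illustrative, not part of the diff):

```python
import re

# Generalised pattern from this commit: one or more replacement
# characters (�), optionally wrapped by a single-character prefix
# and/or suffix.
re_replacement_seq = re.compile(r"^.?�+.?$")

# Special tokens of the kind described above all match:
tokens = ["�", "▁�", ".�", "�?", "�,", "�s"]
print(all(re_replacement_seq.fullmatch(t) for t in tokens))  # True

# An ordinary token does not:
print(re_replacement_seq.fullmatch("hello") is None)  # True
```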

File tree

2 files changed: +12 -10 lines changed


python/outlines_core/fsm/regex.py

Lines changed: 5 additions & 5 deletions
@@ -342,11 +342,11 @@ def make_deterministic_fsm(fsm: FSM) -> Tuple[BetterFSM, Dict[int, int]]:
 
 re_llama_byte_token = re.compile(r"^<0x[0-9A-F]{2}>$")
 
-# The "▁*" prefix is required to handle Gemma and GPT-SW3 tokenizers.
-# The "\.*" suffix is required to handle the NorwAI tokenizer.
-# The "\.*" prefix is required to handle the Salamandra tokenizer.
-# The "s*$" suffix is required to handle the OpenCoder tokenizer.
-re_replacement_seq = re.compile(r"^▁*\.*�+\.*s*$")
+# The ".?" prefix and suffix is to handle special cases in some model vocabularies. This
+# includes Gemma models (which use "▁�" as a token), NorwAI models (which use ".�" as a
+# token), Salamandra models (which use ".�" and "�?" as tokens) and OpenCoder models
+# (which use "�s" as a token).
+re_replacement_seq = re.compile(r"^.?�+.?$")
 
 
 # Copied from transformers.models.gpt2.tokenization_gpt2.bytes_to_unicode

tests/fsm/test_regex.py

Lines changed: 7 additions & 5 deletions
@@ -542,12 +542,14 @@ def convert_token_to_string(self, token):
         "�",
         "��",
         "�.",
-        "�..",
+        ".�",
+        ".�.",
         "▁�",
-        "▁▁�",
-        "▁�.",
-        "▁�.",
-        "▁▁�..",
+        "�▁",
+        "▁�▁",
+        "?�",
+        "�?",
+        "?�?",
     ],
 )
 def test_reduced_vocabulary_with_rare_tokens(rare_token):
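As a standalone sanity check (not part of the test file), every rare token in the updated parametrisation matches the generalised pattern from `regex.py`:

```python
import re

# Same pattern as re_replacement_seq in python/outlines_core/fsm/regex.py
re_replacement_seq = re.compile(r"^.?�+.?$")

# The full rare-token list from the updated parametrisation above.
rare_tokens = ["�", "��", "�.", ".�", ".�.", "▁�", "�▁", "▁�▁", "?�", "�?", "?�?"]

# Each consists of one or more � with at most one prefix/suffix character,
# so each should be recognised as a replacement sequence.
assert all(re_replacement_seq.fullmatch(t) for t in rare_tokens)
```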
