Update rabbit_hole.py: tokenizer encounters a special token (`\n`) #1054
Conversation
The problem is that the tokenizer encounters a special token (`\n`) that is not allowed by default, causing an error. To fix it, we explicitly allow `\n` using `allowed_special={"\n"}` and disable checks for other special tokens with `disallowed_special=()`.
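For reference, a minimal sketch of the change described above, using tiktoken's `Encoding.encode` keyword arguments (the encoding name and variable names here are illustrative, not necessarily what `rabbit_hole.py` uses):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # illustrative encoding choice

text = "some imported document\nwith newlines and special markers"

# By default, encode() raises a ValueError when the text contains a token the
# encoding registers as "special" (disallowed_special defaults to "all").
# The patch explicitly allows "\n" and disables that check:
tokens = enc.encode(text, allowed_special={"\n"}, disallowed_special=())
```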
P.S. I never saw a problem with this before; the error log was resolved with the added code. The reason is that I often import documentation on prompts and LLM systems, and these documents often contain special characters that cause this type of error. I have also implemented a function to keep the data flow from blocking the execution of processes:

```python
import re

# clean strings of harmful characters #
########################################
def kre(text: str) -> str:
    # def kre(text: str, cat) -> str:
    """
    Clean the text of harmful characters.

    Args:
        text (str): The text to modify.

    Returns:
        str: The modified text.
    """
    # settings = cat.mad_hatter.get_plugin().load_settings()
    sostituzioni = [
        ('<think>', '<Ragionamento>'),
        ('</think>', '</Ragionamento>'),
        # strip markdown backslash-escapes, e.g. "\[" -> "["
        (r'\\\[', '['),
        (r'\\\]', ']'),
        (r'\\\|', '|'),
        # decode HTML entities back to literal characters
        ('&lt;', '<'),
        ('&gt;', '>'),
        ('&#64;', '@'),
        ('&#123;', '{'),
        ('&#125;', '}'),
    ]
    for old, new in sostituzioni:
        text = re.sub(old, new, text)
    return text
```
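A quick usage sketch (the input string here is hypothetical, just to show the effect):

```python
sample = r"<think>ok</think> see \[docs\] at &lt;https://example.com&gt;"
print(kre(sample))
# prints: <Ragionamento>ok</Ragionamento> see [docs] at <https://example.com>
```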
The variant below applies the same cleanup and additionally renames the conversation speakers using the plugin settings:

```python
# clean strings of harmful characters #
########################################
def krec(text: str, cat) -> str:
    """
    Clean the text of harmful characters.

    Args:
        text (str): The text to modify.

    Returns:
        str: The modified text.
    """
    settings = cat.mad_hatter.get_plugin().load_settings()
    sostituzioni = [
        ('- AI', '- KaguraAI'),
        ('<think>', '<Ragionamento>'),
        ('</think>', '</Ragionamento>'),
        # ('- Human', '- Canapaio'),
        ('- Human', f" - {settings['user_name']}"),
        # strip markdown backslash-escapes, e.g. "\[" -> "["
        (r'\\\[', '['),
        (r'\\\]', ']'),
        (r'\\\|', '|'),
        # decode HTML entities back to literal characters
        ('&lt;', '<'),
        ('&gt;', '>'),
        ('&#64;', '@'),
        ('&#123;', '{'),
        ('&#125;', '}'),
    ]
    for old, new in sostituzioni:
        text = re.sub(old, new, text)
    return text
```
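And a sketch of `krec` in use. The stub below only mimics the pieces of the `cat` object that the function touches (`mad_hatter.get_plugin().load_settings()`); it is a hypothetical stand-in for demonstration, not the real Cheshire Cat API:

```python
from types import SimpleNamespace

# hypothetical stand-in for the real `cat` object
plugin = SimpleNamespace(load_settings=lambda: {"user_name": "Canapaio"})
cat = SimpleNamespace(mad_hatter=SimpleNamespace(get_plugin=lambda: plugin))

print(krec("- Human: hello\n- AI: hi", cat))
# prints:
#  - Canapaio: hello
# - KaguraAI: hi
```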