Update rabbit_hole.py tokenizer encounters a special token (`\n`) #1054

canapaio · 2025-04-02T07:05:55Z

The problem is that the tokenizer encounters a special token (`\n`) that is not allowed by default, causing an error. To fix it, we explicitly allow `\n` using `allowed_special={"\n"}` and disable checks for other special tokens with `disallowed_special=()`.
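The diff itself is not reproduced in this thread, so the following is only a sketch of where such arguments could be passed, assuming the splitter is built with LangChain's `from_tiktoken_encoder` (the traceback further down goes through its tiktoken length function). The helper name `build_splitter`, the chunk sizes, and the encoding name are placeholders chosen for illustration.

```python
# Keyword arguments forwarded to tiktoken's Encoding.encode().
# disallowed_special=() turns off the special-token check, so text such as
# "<|endoftext|>" is tokenized as ordinary characters instead of raising.
TIKTOKEN_KWARGS = {"disallowed_special": ()}


def build_splitter(chunk_size: int = 256, chunk_overlap: int = 64):
    """Build a token-aware splitter (sizes and encoding are placeholders)."""
    # Imported lazily: langchain-text-splitters and tiktoken are third-party.
    from langchain_text_splitters import RecursiveCharacterTextSplitter

    return RecursiveCharacterTextSplitter.from_tiktoken_encoder(
        encoding_name="cl100k_base",  # assumption, not taken from the PR
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        **TIKTOKEN_KWARGS,
    )
```

With both packages installed, `build_splitter().split_text(text)` would count tokens without raising on marker text. Whether this matches the actual change in the PR is not confirmed here.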



pieroit · 2025-04-03T09:15:55Z

I have never seen a problem with `\n`. What actually happens?

P.S. Please open the PR against develop and write comments in English.

canapaio · 2025-04-03T13:18:58Z

Error log, resolved by the added code:
cheshire_cat_core_190 | ERROR: Exception in ASGI application
cheshire_cat_core_190 | Traceback (most recent call last):
cheshire_cat_core_190 | File "/usr/local/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 419, in run_asgi
cheshire_cat_core_190 | result = await app( # type: ignore[func-returns-value]
cheshire_cat_core_190 | File "/usr/local/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 78, in call
cheshire_cat_core_190 | return await self.app(scope, receive, send)
cheshire_cat_core_190 | File "/usr/local/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in call
cheshire_cat_core_190 | await super().call(scope, receive, send)
cheshire_cat_core_190 | File "/usr/local/lib/python3.10/site-packages/starlette/applications.py", line 112, in call
cheshire_cat_core_190 | await self.middleware_stack(scope, receive, send)
cheshire_cat_core_190 | File "/usr/local/lib/python3.10/site-packages/starlette/middleware/errors.py", line 187, in call
cheshire_cat_core_190 | raise exc
cheshire_cat_core_190 | File "/usr/local/lib/python3.10/site-packages/starlette/middleware/errors.py", line 165, in call
cheshire_cat_core_190 | await self.app(scope, receive, _send)
cheshire_cat_core_190 | File "/usr/local/lib/python3.10/site-packages/starlette/middleware/cors.py", line 93, in call
cheshire_cat_core_190 | await self.simple_response(scope, receive, send, request_headers=headers)
cheshire_cat_core_190 | File "/usr/local/lib/python3.10/site-packages/starlette/middleware/cors.py", line 144, in simple_response
cheshire_cat_core_190 | await self.app(scope, receive, send)
cheshire_cat_core_190 | File "/usr/local/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 62, in call
cheshire_cat_core_190 | await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
cheshire_cat_core_190 | File "/usr/local/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
cheshire_cat_core_190 | raise exc
cheshire_cat_core_190 | File "/usr/local/lib/python3.10/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
cheshire_cat_core_190 | await app(scope, receive, sender)
cheshire_cat_core_190 | File "/usr/local/lib/python3.10/site-packages/starlette/routing.py", line 714, in call
cheshire_cat_core_190 | await self.middleware_stack(scope, receive, send)
cheshire_cat_core_190 | File "/usr/local/lib/python3.10/site-packages/starlette/routing.py", line 734, in app
cheshire_cat_core_190 | await route.handle(scope, receive, send)
cheshire_cat_core_190 | File "/usr/local/lib/python3.10/site-packages/starlette/routing.py", line 288, in handle
cheshire_cat_core_190 | await self.app(scope, receive, send)
cheshire_cat_core_190 | File "/usr/local/lib/python3.10/site-packages/starlette/routing.py", line 76, in app
cheshire_cat_core_190 | await wrap_app_handling_exceptions(app, request)(scope, receive, send)
cheshire_cat_core_190 | File "/usr/local/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
cheshire_cat_core_190 | raise exc
cheshire_cat_core_190 | File "/usr/local/lib/python3.10/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
cheshire_cat_core_190 | await app(scope, receive, sender)
cheshire_cat_core_190 | File "/usr/local/lib/python3.10/site-packages/starlette/routing.py", line 74, in app
cheshire_cat_core_190 | await response(scope, receive, send)
cheshire_cat_core_190 | File "/usr/local/lib/python3.10/site-packages/starlette/responses.py", line 160, in call
cheshire_cat_core_190 | await self.background()
cheshire_cat_core_190 | File "/usr/local/lib/python3.10/site-packages/starlette/background.py", line 41, in call
cheshire_cat_core_190 | await task()
cheshire_cat_core_190 | File "/usr/local/lib/python3.10/site-packages/starlette/background.py", line 28, in call
cheshire_cat_core_190 | await run_in_threadpool(self.func, *self.args, **self.kwargs)
cheshire_cat_core_190 | File "/usr/local/lib/python3.10/site-packages/starlette/concurrency.py", line 37, in run_in_threadpool
cheshire_cat_core_190 | return await anyio.to_thread.run_sync(func)
cheshire_cat_core_190 | File "/usr/local/lib/python3.10/site-packages/anyio/to_thread.py", line 56, in run_sync
cheshire_cat_core_190 | return await get_async_backend().run_sync_in_worker_thread(
cheshire_cat_core_190 | File "/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 2461, in run_sync_in_worker_thread
cheshire_cat_core_190 | return await future
cheshire_cat_core_190 | File "/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 962, in run
cheshire_cat_core_190 | result = context.run(func, *args)
cheshire_cat_core_190 | File "/app/cat/rabbit_hole.py", line 163, in ingest_file
cheshire_cat_core_190 | docs = self.file_to_docs(
cheshire_cat_core_190 | File "/app/cat/rabbit_hole.py", line 252, in file_to_docs
cheshire_cat_core_190 | return self.string_to_docs(
cheshire_cat_core_190 | File "/app/cat/rabbit_hole.py", line 309, in string_to_docs
cheshire_cat_core_190 | docs = self.__split_text(
cheshire_cat_core_190 | File "/app/cat/rabbit_hole.py", line 456, in __split_text
cheshire_cat_core_190 | docs = text_splitter.split_documents(text)
cheshire_cat_core_190 | File "/usr/local/lib/python3.10/site-packages/langchain_text_splitters/base.py", line 96, in split_documents
cheshire_cat_core_190 | return self.create_documents(texts, metadatas=metadatas)
cheshire_cat_core_190 | File "/usr/local/lib/python3.10/site-packages/langchain_text_splitters/base.py", line 79, in create_documents
cheshire_cat_core_190 | for chunk in self.split_text(text):
cheshire_cat_core_190 | File "/usr/local/lib/python3.10/site-packages/langchain_text_splitters/character.py", line 126, in split_text
cheshire_cat_core_190 | return self._split_text(text, self._separators)
cheshire_cat_core_190 | File "/usr/local/lib/python3.10/site-packages/langchain_text_splitters/character.py", line 100, in _split_text
cheshire_cat_core_190 | if self._length_function(s) < self._chunk_size:
cheshire_cat_core_190 | File "/usr/local/lib/python3.10/site-packages/langchain_text_splitters/base.py", line 196, in _tiktoken_encoder
cheshire_cat_core_190 | enc.encode(
cheshire_cat_core_190 | File "/usr/local/lib/python3.10/site-packages/tiktoken/core.py", line 117, in encode
cheshire_cat_core_190 | raise_disallowed_special_token(match.group())
cheshire_cat_core_190 | File "/usr/local/lib/python3.10/site-packages/tiktoken/core.py", line 400, in raise_disallowed_special_token
cheshire_cat_core_190 | raise ValueError(
cheshire_cat_core_190 | ValueError: Encountered text corresponding to disallowed special token '<|endoftext|>'.
cheshire_cat_core_190 | If you want this text to be encoded as a special token, pass it to allowed_special, e.g. allowed_special={'<|endoftext|>', ...}.
cheshire_cat_core_190 | If you want this text to be encoded as normal text, disable the check for this token by passing disallowed_special=(enc.special_tokens_set - {'<|endoftext|>'}).
cheshire_cat_core_190 | To disable this check for all special tokens, pass disallowed_special=().

pieroit · 2025-04-03T14:51:03Z

ValueError: Encountered text corresponding to disallowed special token '<|endoftext|>'.
This looks related to the special tokens LLMs use to mark the end of a sequence.
Which model are you using? Can you check whether this happens only with that model?
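One way to narrow this down is to check whether the ingested text itself contains such a marker, independent of the model. Below is a minimal sketch using only the standard library. The `<|name|>` pattern and the helper name `find_special_markers` are assumptions for illustration; the authoritative list is the special-token set of the encoding in use.

```python
import re

# Assumption: special-token markers look like <|name|>. The real set is
# defined by the tokenizer encoding, not by this pattern.
SPECIAL_MARKER_RE = re.compile(r"<\|[A-Za-z0-9_]+\|>")


def find_special_markers(text: str) -> list[str]:
    """Return the distinct marker-like substrings in first-seen order."""
    seen: dict[str, None] = {}
    for match in SPECIAL_MARKER_RE.finditer(text):
        seen.setdefault(match.group(), None)
    return list(seen)


sample = "Prompt docs often quote tokens such as <|endoftext|> verbatim.\nSecond line."
print(find_special_markers(sample))
```

If a scan like this finds markers in the uploaded page, the error comes from the document content rather than from a particular model.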

canapaio · 2025-04-03T19:48:38Z

The thing is that I often import documentation about prompts and LLM systems, and these documents often contain special characters. Because of this type of error, I have also implemented a function that keeps the data flow from blocking the execution of processes.
Also, the error shown occurs when entering URLs via the upload URL bar in the admin.

# Clean harmful characters out of strings #
###########################################
def kre(text: str) -> str:
    """
    Replace characters that cause problems downstream.

    Args:
        text (str): The text to modify.

    Returns:
        str: The modified text.
    """
    # Literal (old, new) pairs, applied in order: the '<' and '>' rules run
    # after the <think> rules, so the renamed tags are escaped too.
    sostituzioni = [
        ('<think>', '<Ragionamento>'),
        ('</think>', '</Ragionamento>'),
        ('[', '&#91;'),
        (']', '&#93;'),
        ('|', '&#124;'),
        ('<', '&lt;'),
        ('>', '&gt;'),
        ('@', '&commat;'),
        ('{', '&#123;'),
        ('}', '&#125;')
    ]

    # Every pattern is a plain literal, so str.replace is enough and no
    # regex import is needed.
    for old, new in sostituzioni:
        text = text.replace(old, new)

    return text

# Clean harmful characters out of strings #
###########################################
def krec(text: str, cat) -> str:
    """
    Replace characters that cause problems downstream and rename the
    speaker labels using the plugin settings.

    Args:
        text (str): The text to modify.
        cat: The Cat instance, used to load the plugin settings.

    Returns:
        str: The modified text.
    """
    settings = cat.mad_hatter.get_plugin().load_settings()
    # Literal (old, new) pairs, applied in order: the '<' and '>' rules run
    # after the <think> rules, so the renamed tags are escaped too.
    sostituzioni = [
        ('- AI', '- KaguraAI'),
        ('<think>', '<Ragionamento>'),
        ('</think>', '</Ragionamento>'),
        ('- Human', f" - {settings['user_name']}"),
        ('[', '&#91;'),
        (']', '&#93;'),
        ('|', '&#124;'),
        ('<', '&lt;'),
        ('>', '&gt;'),
        ('@', '&commat;'),
        ('{', '&#123;'),
        ('}', '&#125;')
    ]

    # Every pattern is a plain literal, so str.replace is enough and no
    # regex import is needed.
    for old, new in sostituzioni:
        text = text.replace(old, new)

    return text
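
For illustration, here is how an ordered replacement list like the one above treats the marker from the traceback. The condensed rule list and the helper name `escape_markers` are assumptions made for this sketch, not the plugin's actual code.

```python
# Condensed copy of the escaping rules, applied as plain substring
# replacements in this order.
RULES = [
    ("|", "&#124;"),
    ("<", "&lt;"),
    (">", "&gt;"),
]


def escape_markers(text: str) -> str:
    """Apply each (old, new) rule in sequence and return the result."""
    for old, new in RULES:
        text = text.replace(old, new)
    return text


escaped = escape_markers("<|endoftext|>")
print(escaped)  # &lt;&#124;endoftext&#124;&gt;
```

The escaped text no longer contains the literal marker, so the tokenizer check has nothing to match. The trade-off is that every bracket and pipe in the stored text is altered as well.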

Update rabbit_hole.py

c82ee12
