Update rabbit_hole.py: tokenizer encounters a special token (\n) #1054

Open: wants to merge 2 commits into main

canapaio
Contributor

@canapaio canapaio commented Apr 2, 2025

The tokenizer encounters a special token (`\n`) that is not allowed by default, which raises an error. The fix explicitly allows `\n` with `allowed_special={"\n"}` and disables the check for all other special tokens with `disallowed_special=()`.
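
A minimal sketch of what `disallowed_special=()` does at the tiktoken level, assuming the `cl100k_base` encoding (the thread does not say which encoding rabbit_hole.py uses). The names `count_tokens` and `demo` are hypothetical. tiktoken's check fires on text that matches one of the encoding's registered special tokens, such as `<|endoftext|>`; a newline is ordinary text to the encoder.

```python
def count_tokens(encoder, text: str) -> int:
    """Count tokens while treating special-token text as ordinary text."""
    # disallowed_special=() turns off the check for every special token,
    # so text such as "<|endoftext|>" is encoded as plain characters.
    return len(encoder.encode(text, disallowed_special=()))


def demo() -> None:
    # Imported lazily so count_tokens can be used with any object that
    # exposes a compatible encode() method.
    import tiktoken  # third-party: pip install tiktoken

    # Assumption: cl100k_base; the project may use a different encoding.
    enc = tiktoken.get_encoding("cl100k_base")
    sample = "first line\nsecond line <|endoftext|>"
    try:
        enc.encode(sample)  # default is disallowed_special="all"
    except ValueError as err:
        print("default settings raise:", str(err).splitlines()[0])
    print("tokens with the check disabled:", count_tokens(enc, sample))
```

Calling `demo()` needs tiktoken installed and its encoding files available. With the check disabled, the special-token text is tokenized like any other string, which is usually what is wanted when ingesting documents rather than building prompts.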

Description

Please include a summary of the change and which issue is fixed. Please also include relevant motivation and context. List any dependencies that are required for this change.

Related to issue #(issue)

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas

@pieroit
Member

pieroit commented Apr 3, 2025

I have never seen a problem with \n. What actually happens?

P.S. Please open the PR against develop and write comments in English.

@canapaio
Contributor Author

canapaio commented Apr 3, 2025

Error log for the failure that the added code resolves:
cheshire_cat_core_190 | ERROR: Exception in ASGI application
cheshire_cat_core_190 | Traceback (most recent call last):
cheshire_cat_core_190 | File "/usr/local/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 419, in run_asgi
cheshire_cat_core_190 | result = await app( # type: ignore[func-returns-value]
cheshire_cat_core_190 | File "/usr/local/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 78, in __call__
cheshire_cat_core_190 | return await self.app(scope, receive, send)
cheshire_cat_core_190 | File "/usr/local/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
cheshire_cat_core_190 | await super().__call__(scope, receive, send)
cheshire_cat_core_190 | File "/usr/local/lib/python3.10/site-packages/starlette/applications.py", line 112, in __call__
cheshire_cat_core_190 | await self.middleware_stack(scope, receive, send)
cheshire_cat_core_190 | File "/usr/local/lib/python3.10/site-packages/starlette/middleware/errors.py", line 187, in __call__
cheshire_cat_core_190 | raise exc
cheshire_cat_core_190 | File "/usr/local/lib/python3.10/site-packages/starlette/middleware/errors.py", line 165, in __call__
cheshire_cat_core_190 | await self.app(scope, receive, _send)
cheshire_cat_core_190 | File "/usr/local/lib/python3.10/site-packages/starlette/middleware/cors.py", line 93, in __call__
cheshire_cat_core_190 | await self.simple_response(scope, receive, send, request_headers=headers)
cheshire_cat_core_190 | File "/usr/local/lib/python3.10/site-packages/starlette/middleware/cors.py", line 144, in simple_response
cheshire_cat_core_190 | await self.app(scope, receive, send)
cheshire_cat_core_190 | File "/usr/local/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
cheshire_cat_core_190 | await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
cheshire_cat_core_190 | File "/usr/local/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
cheshire_cat_core_190 | raise exc
cheshire_cat_core_190 | File "/usr/local/lib/python3.10/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
cheshire_cat_core_190 | await app(scope, receive, sender)
cheshire_cat_core_190 | File "/usr/local/lib/python3.10/site-packages/starlette/routing.py", line 714, in __call__
cheshire_cat_core_190 | await self.middleware_stack(scope, receive, send)
cheshire_cat_core_190 | File "/usr/local/lib/python3.10/site-packages/starlette/routing.py", line 734, in app
cheshire_cat_core_190 | await route.handle(scope, receive, send)
cheshire_cat_core_190 | File "/usr/local/lib/python3.10/site-packages/starlette/routing.py", line 288, in handle
cheshire_cat_core_190 | await self.app(scope, receive, send)
cheshire_cat_core_190 | File "/usr/local/lib/python3.10/site-packages/starlette/routing.py", line 76, in app
cheshire_cat_core_190 | await wrap_app_handling_exceptions(app, request)(scope, receive, send)
cheshire_cat_core_190 | File "/usr/local/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
cheshire_cat_core_190 | raise exc
cheshire_cat_core_190 | File "/usr/local/lib/python3.10/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
cheshire_cat_core_190 | await app(scope, receive, sender)
cheshire_cat_core_190 | File "/usr/local/lib/python3.10/site-packages/starlette/routing.py", line 74, in app
cheshire_cat_core_190 | await response(scope, receive, send)
cheshire_cat_core_190 | File "/usr/local/lib/python3.10/site-packages/starlette/responses.py", line 160, in __call__
cheshire_cat_core_190 | await self.background()
cheshire_cat_core_190 | File "/usr/local/lib/python3.10/site-packages/starlette/background.py", line 41, in __call__
cheshire_cat_core_190 | await task()
cheshire_cat_core_190 | File "/usr/local/lib/python3.10/site-packages/starlette/background.py", line 28, in __call__
cheshire_cat_core_190 | await run_in_threadpool(self.func, *self.args, **self.kwargs)
cheshire_cat_core_190 | File "/usr/local/lib/python3.10/site-packages/starlette/concurrency.py", line 37, in run_in_threadpool
cheshire_cat_core_190 | return await anyio.to_thread.run_sync(func)
cheshire_cat_core_190 | File "/usr/local/lib/python3.10/site-packages/anyio/to_thread.py", line 56, in run_sync
cheshire_cat_core_190 | return await get_async_backend().run_sync_in_worker_thread(
cheshire_cat_core_190 | File "/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 2461, in run_sync_in_worker_thread
cheshire_cat_core_190 | return await future
cheshire_cat_core_190 | File "/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 962, in run
cheshire_cat_core_190 | result = context.run(func, *args)
cheshire_cat_core_190 | File "/app/cat/rabbit_hole.py", line 163, in ingest_file
cheshire_cat_core_190 | docs = self.file_to_docs(
cheshire_cat_core_190 | File "/app/cat/rabbit_hole.py", line 252, in file_to_docs
cheshire_cat_core_190 | return self.string_to_docs(
cheshire_cat_core_190 | File "/app/cat/rabbit_hole.py", line 309, in string_to_docs
cheshire_cat_core_190 | docs = self.__split_text(
cheshire_cat_core_190 | File "/app/cat/rabbit_hole.py", line 456, in __split_text
cheshire_cat_core_190 | docs = text_splitter.split_documents(text)
cheshire_cat_core_190 | File "/usr/local/lib/python3.10/site-packages/langchain_text_splitters/base.py", line 96, in split_documents
cheshire_cat_core_190 | return self.create_documents(texts, metadatas=metadatas)
cheshire_cat_core_190 | File "/usr/local/lib/python3.10/site-packages/langchain_text_splitters/base.py", line 79, in create_documents
cheshire_cat_core_190 | for chunk in self.split_text(text):
cheshire_cat_core_190 | File "/usr/local/lib/python3.10/site-packages/langchain_text_splitters/character.py", line 126, in split_text
cheshire_cat_core_190 | return self._split_text(text, self._separators)
cheshire_cat_core_190 | File "/usr/local/lib/python3.10/site-packages/langchain_text_splitters/character.py", line 100, in _split_text
cheshire_cat_core_190 | if self._length_function(s) < self._chunk_size:
cheshire_cat_core_190 | File "/usr/local/lib/python3.10/site-packages/langchain_text_splitters/base.py", line 196, in _tiktoken_encoder
cheshire_cat_core_190 | enc.encode(
cheshire_cat_core_190 | File "/usr/local/lib/python3.10/site-packages/tiktoken/core.py", line 117, in encode
cheshire_cat_core_190 | raise_disallowed_special_token(match.group())
cheshire_cat_core_190 | File "/usr/local/lib/python3.10/site-packages/tiktoken/core.py", line 400, in raise_disallowed_special_token
cheshire_cat_core_190 | raise ValueError(
cheshire_cat_core_190 | ValueError: Encountered text corresponding to disallowed special token '<|endoftext|>'.
cheshire_cat_core_190 | If you want this text to be encoded as a special token, pass it to allowed_special, e.g. allowed_special={'<|endoftext|>', ...}.
cheshire_cat_core_190 | If you want this text to be encoded as normal text, disable the check for this token by passing disallowed_special=(enc.special_tokens_set - {'<|endoftext|>'}).
cheshire_cat_core_190 | To disable this check for all special tokens, pass disallowed_special=().

@pieroit
Member

pieroit commented Apr 3, 2025

ValueError: Encountered text corresponding to disallowed special token '<|endoftext|>'.
Looks related to special tokens used by LLMs to indicate end of sequence.
Which model are you using? Can you check this happens only with that model?
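
To check whether a given document triggers this error independently of the model, a rough sketch like the following can scan the text before it reaches the splitter. It assumes special tokens follow the `<|name|>` convention seen in the error message; it is not tiktoken's own check, and `find_special_token_text` is a hypothetical helper.

```python
import re

# Assumption: special tokens look like "<|name|>", as "<|endoftext|>" does.
# This only approximates the tokenizer's real special-token set.
SPECIAL_TOKEN_LIKE = re.compile(r"<\|[A-Za-z0-9_]+\|>")


def find_special_token_text(text: str) -> list[str]:
    """Return the distinct special-token-like substrings found in text."""
    return sorted(set(SPECIAL_TOKEN_LIKE.findall(text)))
```

Running it on the ingested page content would show whether the literal string `<|endoftext|>` appears in the document, which is the case the ValueError describes; plain newlines produce no matches.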

@canapaio
Contributor Author

canapaio commented Apr 3, 2025

I often import documentation about prompts and LLM systems, and these documents frequently contain special characters. Because of this type of error, I have also implemented a function that keeps such data from blocking the execution of processes.
Note also that the error shown occurs when entering URLs via the upload URL bar in the admin.

# Clean strings of harmful characters #
#######################################
def kre(text: str) -> str:
    """
    Replace markup-like characters in the text with HTML entities.

    Args:
        text (str): The text to modify.

    Returns:
        str: The modified text.
    """
    # (old, new) pairs, applied in order. The <think> tags are renamed first,
    # so the '<' and '>' entries further down also escape the renamed tags.
    sostituzioni = [
        ('<think>', '<Ragionamento>'),
        ('</think>', '</Ragionamento>'),
        ('[', '&#91;'),
        (']', '&#93;'),
        ('|', '&#124;'),
        ('<', '&lt;'),
        ('>', '&gt;'),
        ('@', '&commat;'),
        ('{', '&#123;'),
        ('}', '&#125;')
    ]

    # Every pattern is a literal string, so str.replace is enough: no regex
    # escaping and no `re` import are needed.
    for old, new in sostituzioni:
        text = text.replace(old, new)

    return text

# Clean strings of harmful characters (variant using plugin settings) #
#######################################################################
def krec(text: str, cat) -> str:
    """
    Replace speaker labels and markup-like characters in the text.

    Args:
        text (str): The text to modify.
        cat: The Cat instance, used to load the plugin settings.

    Returns:
        str: The modified text.
    """
    settings = cat.mad_hatter.get_plugin().load_settings()
    # (old, new) pairs, applied in order. The <think> tags are renamed first,
    # so the '<' and '>' entries further down also escape the renamed tags.
    sostituzioni = [
        ('- AI', '- KaguraAI'),
        ('<think>', '<Ragionamento>'),
        ('</think>', '</Ragionamento>'),
        ('- Human', f" - {settings['user_name']}"),
        ('[', '&#91;'),
        (']', '&#93;'),
        ('|', '&#124;'),
        ('<', '&lt;'),
        ('>', '&gt;'),
        ('@', '&commat;'),
        ('{', '&#123;'),
        ('}', '&#125;')
    ]

    # Every pattern is a literal string, so str.replace is enough: no regex
    # escaping and no `re` import are needed.
    for old, new in sostituzioni:
        text = text.replace(old, new)

    return text
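
As a quick check of why this kind of cleaning avoids the tokenizer error, the sketch below applies only the entity part of the replacement table to a string containing `<|endoftext|>`. `ENTITY_TABLE` and `escape_entities` are hypothetical names for a condensed copy of the table above, not code from the PR.

```python
# Condensed, hypothetical copy of the entity entries from the table above.
ENTITY_TABLE = [
    ("[", "&#91;"),
    ("]", "&#93;"),
    ("|", "&#124;"),
    ("<", "&lt;"),
    (">", "&gt;"),
    ("@", "&commat;"),
    ("{", "&#123;"),
    ("}", "&#125;"),
]


def escape_entities(text: str) -> str:
    """Apply the literal replacements in table order."""
    for old, new in ENTITY_TABLE:
        text = text.replace(old, new)
    return text


sample = "docs end with <|endoftext|>"
escaped = escape_entities(sample)
print(escaped)  # docs end with &lt;&#124;endoftext&#124;&gt;
```

After escaping, the literal `<|endoftext|>` sequence no longer appears, so the special-token check has nothing to match. The trade-off is that the stored text contains HTML entities instead of the original characters.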
