Metadata length longer than chunk size #108

Open
@shackmann

Describe the bug
This error occurs repeatedly when using the sustainable living subset of the BRIGHT dataset.

Exception:  ValueError
--------------------------------------------------------------------------------
Message:  Metadata length (134) is longer than chunk size (128). Consider increasing the chunk size or decreasing the size of your metadata to avoid this.
--------------------------------------------------------------------------------
Traceback:  Traceback (most recent call last):
  File "/tmp/ray/session_2025-06-05_07-19-04_642991_1676/runtime_resources/py_modules_files/_ray_pkg_9d9949a5341176cd/syftr/tuner/qa_tuner.py", line 349, in objective
    obj1, obj2, metrics, flow_json = evaluate(params, study_config)
                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/ray/session_2025-06-05_07-19-04_642991_1676/runtime_resources/py_modules_files/_ray_pkg_9d9949a5341176cd/syftr/tuner/qa_tuner.py", line 112, in evaluate
    obj1, obj2, results = _evaluate(params, study_config)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/ray/session_2025-06-05_07-19-04_642991_1676/runtime_resources/py_modules_files/_ray_pkg_9d9949a5341176cd/syftr/tuner/qa_tuner.py", line 305, in _evaluate
    flow = build_flow(params, study_config)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/ray/session_2025-06-05_07-19-04_642991_1676/runtime_resources/py_modules_files/_ray_pkg_9d9949a5341176cd/syftr/tuner/qa_tuner.py", line 186, in build_flow
    rag_retriever, rag_docstore = build_rag_retriever(study_config, params)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/ray/session_2025-06-05_07-19-04_642991_1676/runtime_resources/py_modules_files/_ray_pkg_9d9949a5341176cd/syftr/retrievers/build.py", line 163, in build_rag_retriever
    dense_index, dense_docstore = get_or_build_dense_index(
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/ray/session_2025-06-05_07-19-04_642991_1676/runtime_resources/py_modules_files/_ray_pkg_9d9949a5341176cd/syftr/retrievers/build.py", line 47, in get_or_build_dense_index
    index, docstore = _build_dense_index(
                      ^^^^^^^^^^^^^^^^^^^
  File "/tmp/ray/session_2025-06-05_07-19-04_642991_1676/runtime_resources/py_modules_files/_ray_pkg_9d9949a5341176cd/syftr/retrievers/build.py", line 75, in _build_dense_index
    nodes = pipeline.run(
            ^^^^^^^^^^^^^
  File "/tmp/ray/session_2025-06-05_07-19-04_642991_1676/runtime_resources/pip/dd5f956fcc327946303d03fcf07dea86900ea86c/virtualenv/lib/python3.12/site-packages/llama_index/core/instrumentation/dispatcher.py", line 324, in wrapper
    result = func(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/ray/session_2025-06-05_07-19-04_642991_1676/runtime_resources/pip/dd5f956fcc327946303d03fcf07dea86900ea86c/virtualenv/lib/python3.12/site-packages/llama_index/core/ingestion/pipeline.py", line 550, in run
    nodes = run_transformations(
            ^^^^^^^^^^^^^^^^^^^^
  File "/tmp/ray/session_2025-06-05_07-19-04_642991_1676/runtime_resources/pip/dd5f956fcc327946303d03fcf07dea86900ea86c/virtualenv/lib/python3.12/site-packages/llama_index/core/ingestion/pipeline.py", line 98, in run_transformations
    nodes = transform(nodes, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/ray/session_2025-06-05_07-19-04_642991_1676/runtime_resources/pip/dd5f956fcc327946303d03fcf07dea86900ea86c/virtualenv/lib/python3.12/site-packages/llama_index/core/instrumentation/dispatcher.py", line 324, in wrapper
    result = func(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/ray/session_2025-06-05_07-19-04_642991_1676/runtime_resources/pip/dd5f956fcc327946303d03fcf07dea86900ea86c/virtualenv/lib/python3.12/site-packages/llama_index/core/node_parser/interface.py", line 194, in __call__
    return self.get_nodes_from_documents(nodes, **kwargs)  # type: ignore
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/ray/session_2025-06-05_07-19-04_642991_1676/runtime_resources/pip/dd5f956fcc327946303d03fcf07dea86900ea86c/virtualenv/lib/python3.12/site-packages/llama_index/core/node_parser/interface.py", line 166, in get_nodes_from_documents
    nodes = self._parse_nodes(documents, show_progress=show_progress, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/ray/session_2025-06-05_07-19-04_642991_1676/runtime_resources/pip/dd5f956fcc327946303d03fcf07dea86900ea86c/virtualenv/lib/python3.12/site-packages/llama_index/core/instrumentation/dispatcher.py", line 324, in wrapper
    result = func(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/ray/session_2025-06-05_07-19-04_642991_1676/runtime_resources/pip/dd5f956fcc327946303d03fcf07dea86900ea86c/virtualenv/lib/python3.12/site-packages/llama_index/core/node_parser/interface.py", line 261, in _parse_nodes
    splits = self.split_text_metadata_aware(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/ray/session_2025-06-05_07-19-04_642991_1676/runtime_resources/pip/dd5f956fcc327946303d03fcf07dea86900ea86c/virtualenv/lib/python3.12/site-packages/llama_index/core/instrumentation/dispatcher.py", line 324, in wrapper
    result = func(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/ray/session_2025-06-05_07-19-04_642991_1676/runtime_resources/pip/dd5f956fcc327946303d03fcf07dea86900ea86c/virtualenv/lib/python3.12/site-packages/llama_index/core/node_parser/text/token.py", line 122, in split_text_metadata_aware
    raise ValueError(
ValueError: Metadata length (134) is longer than chunk size (128). Consider increasing the chunk size or decreasing the size of your metadata to avoid this.

To Reproduce
The following configuration causes the issue:

{'additional_context_enabled': False,
 'few_shot_embedding_model': 'sentence-transformers/paraphrase-multilingual-mpnet-base-v2',
 'few_shot_enabled': True,
 'few_shot_top_k': 15,
 'hyde_enabled': False,
 'lats_max_rollouts': 2,
 'lats_num_expansions': 3,
 'rag_embedding_model': 'BAAI/bge-multilingual-gemma2',
 'rag_method': 'dense',
 'rag_mode': 'lats_rag_agent',
 'rag_query_decomposition_enabled': False,
 'rag_top_k': 9,
 'reranker_enabled': False,
 'response_synthesizer_llm': 'Qwen/Qwen2.5',
 'splitter_chunk_exp': 7,
 'splitter_chunk_overlap_frac': 0.0,
 'splitter_method': 'token',
 'template_name': 'concise'}
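
For context, the numbers in the error line up with the configuration if the splitter's chunk size is derived as 2 ** splitter_chunk_exp (an inference from 2 ** 7 = 128, not something confirmed here). The sketch below is a plain-Python illustration of that arithmetic, not the syftr or llama_index implementation; the function names are hypothetical, and it assumes the splitter needs at least one token left for text after the metadata is accounted for.

```python
def effective_chunk_size(chunk_exp: int, metadata_len: int) -> int:
    """Tokens left for text after metadata.

    Assumption: chunk size = 2 ** chunk_exp (inferred from the report,
    where splitter_chunk_exp=7 coincides with a chunk size of 128).
    """
    return 2 ** chunk_exp - metadata_len


def min_chunk_exp(metadata_len: int) -> int:
    """Smallest exponent whose chunk size strictly exceeds metadata_len."""
    exp = 0
    while 2 ** exp <= metadata_len:
        exp += 1
    return exp


# Values taken from the error message: chunk size 128, metadata length 134.
print(effective_chunk_size(7, 134))  # -6, no room left for text
print(min_chunk_exp(134))            # 8, i.e. a chunk size of 256
```

Under these assumptions, any trial with splitter_chunk_exp of 7 or lower would hit the same error on this subset unless the metadata is shortened.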

Labels

bug: Something isn't working
