Describe the bug
The following error occurs repeatedly when using the sustainable living subset of the BRIGHT dataset.
Exception: ValueError
--------------------------------------------------------------------------------
Message: Metadata length (134) is longer than chunk size (128). Consider increasing the chunk size or decreasing the size of your metadata to avoid this.
--------------------------------------------------------------------------------
Traceback: Traceback (most recent call last):
File "/tmp/ray/session_2025-06-05_07-19-04_642991_1676/runtime_resources/py_modules_files/_ray_pkg_9d9949a5341176cd/syftr/tuner/qa_tuner.py", line 349, in objective
obj1, obj2, metrics, flow_json = evaluate(params, study_config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/ray/session_2025-06-05_07-19-04_642991_1676/runtime_resources/py_modules_files/_ray_pkg_9d9949a5341176cd/syftr/tuner/qa_tuner.py", line 112, in evaluate
obj1, obj2, results = _evaluate(params, study_config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/ray/session_2025-06-05_07-19-04_642991_1676/runtime_resources/py_modules_files/_ray_pkg_9d9949a5341176cd/syftr/tuner/qa_tuner.py", line 305, in _evaluate
flow = build_flow(params, study_config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/ray/session_2025-06-05_07-19-04_642991_1676/runtime_resources/py_modules_files/_ray_pkg_9d9949a5341176cd/syftr/tuner/qa_tuner.py", line 186, in build_flow
rag_retriever, rag_docstore = build_rag_retriever(study_config, params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/ray/session_2025-06-05_07-19-04_642991_1676/runtime_resources/py_modules_files/_ray_pkg_9d9949a5341176cd/syftr/retrievers/build.py", line 163, in build_rag_retriever
dense_index, dense_docstore = get_or_build_dense_index(
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/ray/session_2025-06-05_07-19-04_642991_1676/runtime_resources/py_modules_files/_ray_pkg_9d9949a5341176cd/syftr/retrievers/build.py", line 47, in get_or_build_dense_index
index, docstore = _build_dense_index(
^^^^^^^^^^^^^^^^^^^
File "/tmp/ray/session_2025-06-05_07-19-04_642991_1676/runtime_resources/py_modules_files/_ray_pkg_9d9949a5341176cd/syftr/retrievers/build.py", line 75, in _build_dense_index
nodes = pipeline.run(
^^^^^^^^^^^^^
File "/tmp/ray/session_2025-06-05_07-19-04_642991_1676/runtime_resources/pip/dd5f956fcc327946303d03fcf07dea86900ea86c/virtualenv/lib/python3.12/site-packages/llama_index/core/instrumentation/dispatcher.py", line 324, in wrapper
result = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/tmp/ray/session_2025-06-05_07-19-04_642991_1676/runtime_resources/pip/dd5f956fcc327946303d03fcf07dea86900ea86c/virtualenv/lib/python3.12/site-packages/llama_index/core/ingestion/pipeline.py", line 550, in run
nodes = run_transformations(
^^^^^^^^^^^^^^^^^^^^
File "/tmp/ray/session_2025-06-05_07-19-04_642991_1676/runtime_resources/pip/dd5f956fcc327946303d03fcf07dea86900ea86c/virtualenv/lib/python3.12/site-packages/llama_index/core/ingestion/pipeline.py", line 98, in run_transformations
nodes = transform(nodes, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/ray/session_2025-06-05_07-19-04_642991_1676/runtime_resources/pip/dd5f956fcc327946303d03fcf07dea86900ea86c/virtualenv/lib/python3.12/site-packages/llama_index/core/instrumentation/dispatcher.py", line 324, in wrapper
result = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/tmp/ray/session_2025-06-05_07-19-04_642991_1676/runtime_resources/pip/dd5f956fcc327946303d03fcf07dea86900ea86c/virtualenv/lib/python3.12/site-packages/llama_index/core/node_parser/interface.py", line 194, in __call__
return self.get_nodes_from_documents(nodes, **kwargs) # type: ignore
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/ray/session_2025-06-05_07-19-04_642991_1676/runtime_resources/pip/dd5f956fcc327946303d03fcf07dea86900ea86c/virtualenv/lib/python3.12/site-packages/llama_index/core/node_parser/interface.py", line 166, in get_nodes_from_documents
nodes = self._parse_nodes(documents, show_progress=show_progress, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/ray/session_2025-06-05_07-19-04_642991_1676/runtime_resources/pip/dd5f956fcc327946303d03fcf07dea86900ea86c/virtualenv/lib/python3.12/site-packages/llama_index/core/instrumentation/dispatcher.py", line 324, in wrapper
result = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/tmp/ray/session_2025-06-05_07-19-04_642991_1676/runtime_resources/pip/dd5f956fcc327946303d03fcf07dea86900ea86c/virtualenv/lib/python3.12/site-packages/llama_index/core/node_parser/interface.py", line 261, in _parse_nodes
splits = self.split_text_metadata_aware(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/ray/session_2025-06-05_07-19-04_642991_1676/runtime_resources/pip/dd5f956fcc327946303d03fcf07dea86900ea86c/virtualenv/lib/python3.12/site-packages/llama_index/core/instrumentation/dispatcher.py", line 324, in wrapper
result = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/tmp/ray/session_2025-06-05_07-19-04_642991_1676/runtime_resources/pip/dd5f956fcc327946303d03fcf07dea86900ea86c/virtualenv/lib/python3.12/site-packages/llama_index/core/node_parser/text/token.py", line 122, in split_text_metadata_aware
raise ValueError(
ValueError: Metadata length (134) is longer than chunk size (128). Consider increasing the chunk size or decreasing the size of your metadata to avoid this.
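For context, the exception is raised in the metadata-aware token splitter, which has to fit each document's metadata into every chunk alongside the text. Below is a rough sketch of that kind of budget check. It is not the llama_index implementation: the function name `effective_chunk_size` is hypothetical, the error message is reworded, and a whitespace split stands in for the real tokenizer.

```python
def effective_chunk_size(chunk_size: int, metadata_str: str) -> int:
    """Return the number of tokens left for text once metadata is reserved."""
    # Assumption: whitespace splitting is only a stand-in for the real tokenizer.
    metadata_len = len(metadata_str.split())
    remaining = chunk_size - metadata_len
    if remaining <= 0:
        # No room is left for any document text in the chunk.
        raise ValueError(
            f"Metadata length ({metadata_len}) leaves no room for text "
            f"in chunk size ({chunk_size})."
        )
    return remaining


# 10 metadata tokens fit comfortably in a 128-token chunk.
print(effective_chunk_size(128, " ".join(["m"] * 10)))

# 134 metadata tokens do not fit, mirroring the numbers in the report.
try:
    effective_chunk_size(128, " ".join(["m"] * 134))
except ValueError as err:
    print(err)
```

With the numbers from this report (134 metadata tokens, chunk size 128), no text budget remains, so the splitter cannot produce any chunk.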
To Reproduce
The following configuration causes the issue:
{'additional_context_enabled': False,
'few_shot_embedding_model': 'sentence-transformers/paraphrase-multilingual-mpnet-base-v2',
'few_shot_enabled': True,
'few_shot_top_k': 15,
'hyde_enabled': False,
'lats_max_rollouts': 2,
'lats_num_expansions': 3,
'rag_embedding_model': 'BAAI/bge-multilingual-gemma2',
'rag_method': 'dense',
'rag_mode': 'lats_rag_agent',
'rag_query_decomposition_enabled': False,
'rag_top_k': 9,
'reranker_enabled': False,
'response_synthesizer_llm': 'Qwen/Qwen2.5',
'splitter_chunk_exp': 7,
'splitter_chunk_overlap_frac': 0.0,
'splitter_method': 'token',
'template_name': 'concise'}
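The chunk size of 128 in the error message is consistent with `splitter_chunk_exp: 7` if the exponent is read as a power of two (2**7 = 128). That mapping is an assumption and has not been confirmed against the syftr source. Under that assumption, a small helper (hypothetical name `min_chunk_exp`) can estimate the smallest exponent that still leaves room for text after the metadata:

```python
def min_chunk_exp(metadata_len: int, min_text_tokens: int = 1) -> int:
    """Smallest exponent e such that 2**e holds the metadata plus some text.

    Assumes chunk size = 2 ** splitter_chunk_exp, which is inferred from the
    report (exp 7 -> chunk size 128), not confirmed.
    """
    needed = metadata_len + min_text_tokens
    exp = 0
    while 2 ** exp < needed:
        exp += 1
    return exp


print(2 ** 7)              # 128, the chunk size in the error message
print(min_chunk_exp(134))  # 8, i.e. a chunk size of 256
```

If the assumption holds, either constraining `splitter_chunk_exp` to at least 8 for this subset or trimming the document metadata before splitting would avoid the error.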