Commit 30e507b

feat(split): add line/column to output of SplitRecursively (#668)
* feat(split): add line/column to output of `SplitRecursively`
* docs(split): add docs for `start`/`end` fields for `SplitRecursively`
1 parent 9a4c899 commit 30e507b

2 files changed: +209 -65 lines changed

docs/docs/ops/functions.md

Lines changed: 24 additions & 17 deletions
@@ -22,11 +22,22 @@ Return type: `Json`
 It tries to split at higher-level boundaries. If each chunk is still too large, it tries at the next level of boundaries.
 For example, for a Markdown file, it identifies boundaries in this order: level-1 sections, level-2 sections, level-3 sections, paragraphs, sentences, etc.
 
+The spec takes the following fields:
+
+* `custom_languages` (`list[CustomLanguageSpec]`, optional): This allows you to customize the way to chunking specific languages using regular expressions. Each `CustomLanguageSpec` is a dict with the following fields:
+    * `language_name` (`str`): Name of the language.
+    * `aliases` (`list[str]`, optional): A list of aliases for the language.
+      It's an error if any language name or alias is duplicated.
+
+    * `separators_regex` (`list[str]`): A list of regex patterns to split the text.
+      Higher-level boundaries should come first, and lower-level should be listed later. e.g. `[r"\n# ", r"\n## ", r"\n\n", r"\. "]`.
+      See [regex Syntax](https://docs.rs/regex/latest/regex/#syntax) for supported regular expression syntax.
+
 Input data:
 
-* `text` (type: `str`, required): The text to split.
-* `chunk_size` (type: `int`, required): The maximum size of each chunk, in bytes.
-* `min_chunk_size` (type: `int`, optional): The minimum size of each chunk, in bytes. If not provided, default to `chunk_size / 2`.
+* `text` (*Str*): The text to split.
+* `chunk_size` (*Int64*): The maximum size of each chunk, in bytes.
+* `min_chunk_size` (*Int64*, optional): The minimum size of each chunk, in bytes. If not provided, default to `chunk_size / 2`.
 
 :::note
 
@@ -37,34 +48,30 @@ Input data:
 
 :::
 
-* `chunk_overlap` (type: `int`, optional): The maximum overlap size between adjacent chunks, in bytes.
-* `language` (type: `str`, optional): The language of the document.
+* `chunk_overlap` (*Int64*, optional): The maximum overlap size between adjacent chunks, in bytes.
+* `language` (*Str*, optional): The language of the document.
   Can be a language name (e.g. `Python`, `Javascript`, `Markdown`) or a file extension (e.g. `.py`, `.js`, `.md`).
 
-* `custom_languages` (type: `list[CustomLanguageSpec]`, optional): This allows you to customize the way to chunking specific languages using regular expressions. Each `CustomLanguageSpec` is a dict with the following fields:
-    * `language_name` (type: `str`, required): Name of the language.
-    * `aliases` (type: `list[str]`, optional): A list of aliases for the language.
-      It's an error if any language name or alias is duplicated.
-
-    * `separators_regex` (type: `list[str]`, required): A list of regex patterns to split the text.
-      Higher-level boundaries should come first, and lower-level should be listed later. e.g. `[r"\n# ", r"\n## ", r"\n\n", r"\. "]`.
-      See [regex Syntax](https://docs.rs/regex/latest/regex/#syntax) for supported regular expression syntax.
 
 :::note
 
 We use the `language` field to determine how to split the input text, following these rules:
 
-* We'll match the input `language` field against the `language_name` or `aliases` of each custom language specification, and use the matched one. If value of `language` is null, it'll be treated as empty string when matching `language_name` or `aliases`.
+* We'll match the input `language` field against the `language_name` or `aliases` of each element of `custom_languages`, and use the matched one. If value of `language` is null, it'll be treated as empty string when matching `language_name` or `aliases`.
 * If no match is found, we'll match the `language` field against the builtin language configurations.
   For all supported builtin language names and aliases (extensions), see [the code](https://github.com/search?q=org%3Acocoindex-io+lang%3Arust++%22static+TREE_SITTER_LANGUAGE_BY_LANG%22&type=code).
 * If no match is found, the input will be treated as plain text.
 
 :::
 
-Return type: [KTable](/docs/core/data_types#ktable), each row represents a chunk, with the following sub fields:
+Return type: [*KTable*](/docs/core/data_types#ktable), each row represents a chunk, with the following sub fields:
 
-* `location` (type: `range`): The location of the chunk.
-* `text` (type: `str`): The text of the chunk.
+* `location` (*Range*): The location of the chunk.
+* `text` (*Str*): The text of the chunk.
+* `start` / `end` (*Struct*): Details about the start position (inclusive) and end position (exclusive) of the chunk. They have the following sub fields:
+    * `offset` (*Int64*): The byte offset of the position.
+    * `line` (*Int64*): The line number of the position. Starting from 1.
+    * `column` (*Int64*): The column number of the position. Starting from 1.
 
 ## SentenceTransformerEmbed

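The `start` / `end` sub fields documented in the diff above (byte `offset`, 1-based `line`, 1-based `column`) can be illustrated with a small standalone sketch. This is not the code from this commit: `Position` and `position_at` are hypothetical names, and counting `column` in characters rather than bytes is an assumption the docs do not confirm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Position:
    """Hypothetical mirror of the `start` / `end` struct described in the docs."""
    offset: int  # byte offset into the UTF-8 encoded text
    line: int    # 1-based line number
    column: int  # 1-based column; counted in characters here (an assumption)


def position_at(text: str, byte_offset: int) -> Position:
    """Derive line and column for a byte offset into `text` (UTF-8)."""
    data = text.encode("utf-8")
    if not 0 <= byte_offset <= len(data):
        raise ValueError("byte_offset out of range")
    # Decoding raises UnicodeDecodeError if the offset splits a multi-byte character.
    prefix = data[:byte_offset].decode("utf-8")
    line = prefix.count("\n") + 1
    # Characters since the last newline, plus 1 for the 1-based column.
    column = len(prefix) - (prefix.rfind("\n") + 1) + 1
    return Position(byte_offset, line, column)


text = "# Title\n\nHéllo world\n"
start = position_at(text, 9)   # inclusive start of the "Héllo world" chunk
end = position_at(text, 21)    # exclusive end; "é" occupies 2 bytes
print(start, end)
```

Under these assumptions the chunk starts at line 3, column 1 and ends at line 3, column 12, even though it spans 12 bytes, because the column is counted in characters.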

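The language-matching rules in the note from the diff (custom languages first, then builtin configurations, then plain text, with a null `language` treated as an empty string) can be sketched as follows. `BUILTIN_LANGUAGES`, `build_custom_index`, and `resolve_language` are hypothetical names for illustration only; the real builtin table lives in the Rust code linked from the docs, and exact, case-sensitive matching is an assumption.

```python
from typing import Optional

# Hypothetical stand-in for the builtin language table (assumption, not the real list).
BUILTIN_LANGUAGES = {
    "Python": "python", ".py": "python",
    "Markdown": "markdown", ".md": "markdown",
}


def build_custom_index(custom_languages: list) -> dict:
    """Map every language name and alias to its spec; duplicates are an error."""
    index = {}
    for spec in custom_languages:
        for name in [spec["language_name"], *spec.get("aliases", [])]:
            if name in index:
                raise ValueError(f"duplicated language name or alias: {name!r}")
            index[name] = spec
    return index


def resolve_language(language: Optional[str], custom_languages: list) -> tuple:
    """Return (kind, config) following the three matching rules in order."""
    key = language if language is not None else ""  # null is treated as empty string
    custom = build_custom_index(custom_languages)
    if key in custom:
        return ("custom", custom[key])
    if key in BUILTIN_LANGUAGES:
        return ("builtin", BUILTIN_LANGUAGES[key])
    return ("plain_text", None)


custom_languages = [
    {
        "language_name": "MyNotes",
        "aliases": [".notes"],
        "separators_regex": [r"\n# ", r"\n## ", r"\n\n", r"\. "],
    }
]
print(resolve_language(".notes", custom_languages)[0])
print(resolve_language(".py", custom_languages)[0])
print(resolve_language(None, custom_languages)[0])
```

A custom spec whose `language_name` or alias is the empty string would, under this reading, also capture inputs where `language` is null.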