docs/docs/ops/functions.md
Lines changed: 24 additions & 17 deletions
@@ -22,11 +22,22 @@ Return type: `Json`
 It tries to split at higher-level boundaries. If each chunk is still too large, it tries at the next level of boundaries.
 For example, for a Markdown file, it identifies boundaries in this order: level-1 sections, level-2 sections, level-3 sections, paragraphs, sentences, etc.
 
+The spec takes the following fields:
+
+* `custom_languages` (`list[CustomLanguageSpec]`, optional): This allows you to customize the way to chunking specific languages using regular expressions. Each `CustomLanguageSpec` is a dict with the following fields:
+  * `language_name` (`str`): Name of the language.
+  * `aliases` (`list[str]`, optional): A list of aliases for the language.
+    It's an error if any language name or alias is duplicated.
+
+  * `separators_regex` (`list[str]`): A list of regex patterns to split the text.
+    Higher-level boundaries should come first, and lower-level should be listed later. e.g. `[r"\n# ", r"\n## ", r"\n\n", r"\. "]`.
+    See [regex Syntax](https://docs.rs/regex/latest/regex/#syntax) for supported regular expression syntax.
+
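To illustrate how an ordered `separators_regex` list could drive splitting, here is a minimal sketch. It is not the library's implementation: the spec values (`MyMarkup`, `.mm`) and the helper `split_recursively` are hypothetical, Python's `re` stands in for the Rust regex engine the docs link to, and the sketch drops the separators and never merges small pieces or applies overlap.

```python
import re

# Hypothetical spec for illustration; field names follow the list above.
custom_language = {
    "language_name": "MyMarkup",
    "aliases": [".mm"],
    "separators_regex": [r"\n# ", r"\n## ", r"\n\n", r"\. "],
}


def split_recursively(text, separators, chunk_size):
    """Split at the highest-level separator first, then recurse on oversized pieces.

    chunk_size is measured in UTF-8 bytes, as in the field description.
    """
    if len(text.encode("utf-8")) <= chunk_size or not separators:
        return [text]
    # Try the current (highest remaining) boundary level; drop empty pieces.
    pieces = [p for p in re.split(separators[0], text) if p]
    chunks = []
    for piece in pieces:
        # Pieces that are still too large fall through to lower-level boundaries.
        chunks.extend(split_recursively(piece, separators[1:], chunk_size))
    return chunks


sample = "Intro\n# A\npara one.\n\npara two.\n# B\nshort"
chunks = split_recursively(sample, custom_language["separators_regex"], 20)
```

In this sample the second section exceeds 20 bytes, so it is split again at the paragraph boundary, while the other sections are kept whole.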
 Input data:
 
-* `text` (type: `str`, required): The text to split.
-* `chunk_size` (type: `int`, required): The maximum size of each chunk, in bytes.
-* `min_chunk_size` (type: `int`, optional): The minimum size of each chunk, in bytes. If not provided, default to `chunk_size / 2`.
+* `text` (*Str*): The text to split.
+* `chunk_size` (*Int64*): The maximum size of each chunk, in bytes.
+* `min_chunk_size` (*Int64*, optional): The minimum size of each chunk, in bytes. If not provided, default to `chunk_size / 2`.
 
 :::note
 
@@ -37,34 +48,30 @@ Input data:
 
 :::
 
-* `chunk_overlap` (type: `int`, optional): The maximum overlap size between adjacent chunks, in bytes.
-* `language` (type: `str`, optional): The language of the document.
+* `chunk_overlap` (*Int64*, optional): The maximum overlap size between adjacent chunks, in bytes.
+* `language` (*Str*, optional): The language of the document.
   Can be a language name (e.g. `Python`, `Javascript`, `Markdown`) or a file extension (e.g. `.py`, `.js`, `.md`).
 
-* `custom_languages` (type: `list[CustomLanguageSpec]`, optional): This allows you to customize the way to chunking specific languages using regular expressions. Each `CustomLanguageSpec` is a dict with the following fields:
-  * `language_name` (type: `str`, required): Name of the language.
-  * `aliases` (type: `list[str]`, optional): A list of aliases for the language.
-    It's an error if any language name or alias is duplicated.
-
-  * `separators_regex` (type: `list[str]`, required): A list of regex patterns to split the text.
-    Higher-level boundaries should come first, and lower-level should be listed later. e.g. `[r"\n# ", r"\n## ", r"\n\n", r"\. "]`.
-    See [regex Syntax](https://docs.rs/regex/latest/regex/#syntax) for supported regular expression syntax.
 
 :::note
 
 We use the `language` field to determine how to split the input text, following these rules:
 
-* We'll match the input `language` field against the `language_name` or `aliases` of each custom language specification, and use the matched one. If value of `language` is null, it'll be treated as empty string when matching `language_name` or `aliases`.
+* We'll match the input `language` field against the `language_name` or `aliases` of each element of `custom_languages`, and use the matched one. If value of `language` is null, it'll be treated as empty string when matching `language_name` or `aliases`.
 * If no match is found, we'll match the `language` field against the builtin language configurations.
   For all supported builtin language names and aliases (extensions), see [the code](https://github.com/search?q=org%3Acocoindex-io+lang%3Arust++%22static+TREE_SITTER_LANGUAGE_BY_LANG%22&type=code).
 * If no match is found, the input will be treated as plain text.
 
 :::
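The resolution order in the note above can be sketched as follows. This is an illustration only: `resolve_language`, the sample spec, and the `builtin_names` set are hypothetical, and exact case-sensitive matching is an assumption, since the note does not say how names are compared.

```python
def resolve_language(language, custom_languages, builtin_names):
    """Return (kind, name) following the rules in the note, in order."""
    # A null `language` is treated as an empty string when matching.
    key = language if language is not None else ""

    # Rule 1: custom language specs take precedence.
    for spec in custom_languages:
        if key == spec["language_name"] or key in spec.get("aliases", []):
            return ("custom", spec["language_name"])

    # Rule 2: fall back to the builtin language configurations.
    if key in builtin_names:
        return ("builtin", key)

    # Rule 3: nothing matched, so treat the input as plain text.
    return ("plain_text", None)


# Hypothetical inputs for illustration.
custom_languages = [
    {"language_name": "MyMarkup", "aliases": [".mm"], "separators_regex": [r"\n\n"]},
]
builtin_names = {"Python", ".py", "Markdown", ".md"}

resolved = resolve_language(".mm", custom_languages, builtin_names)
```

Checking custom specs first means a custom entry can shadow a builtin name or extension, which matches the order the rules are listed in.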
 
-Return type: [KTable](/docs/core/data_types#ktable), each row represents a chunk, with the following sub fields:
+Return type: [*KTable*](/docs/core/data_types#ktable), each row represents a chunk, with the following sub fields:
 
-* `location` (type: `range`): The location of the chunk.
-* `text` (type: `str`): The text of the chunk.
+* `location` (*Range*): The location of the chunk.
+* `text` (*Str*): The text of the chunk.
+* `start` / `end` (*Struct*): Details about the start position (inclusive) and end position (exclusive) of the chunk. They have the following sub fields:
+  * `offset` (*Int64*): The byte offset of the position.
+  * `line` (*Int64*): The line number of the position. Starting from 1.
+  * `column` (*Int64*): The column number of the position. Starting from 1.
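As a rough illustration of the `start` / `end` sub fields, the sketch below derives `offset`, `line`, and `column` from a byte offset. The helper `position_at` is hypothetical, and two points are assumptions the text does not settle: that columns count characters rather than bytes, and that the offset falls on a UTF-8 character boundary.

```python
def position_at(text: str, byte_offset: int) -> dict:
    """Map a UTF-8 byte offset to a 1-based line and column."""
    # Decode only the bytes before the offset; this raises if the offset
    # splits a multi-byte character (assumed not to happen here).
    prefix = text.encode("utf-8")[:byte_offset].decode("utf-8")
    line = prefix.count("\n") + 1
    # Characters since the last newline, plus 1 for 1-based columns.
    column = len(prefix) - (prefix.rfind("\n") + 1) + 1
    return {"offset": byte_offset, "line": line, "column": column}


text = "héllo\nworld"
# The chunk "world" occupies bytes 7..12: start is inclusive, end is exclusive.
start = position_at(text, 7)
end = position_at(text, 12)
```

Because `end` is exclusive, its column points one past the last character of the chunk.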