-* `text` (type: `str`, required): The source text to parse.
-* `language` (type: `str`, optional): The language of the source text. Only `json` is supported now. Default to `json`.
+* `text` (`str`): The source text to parse.
+* `language` (`str`, optional): The language of the source text. Only `json` is supported now. Default to `json`.

-Return type: `Json`
+Return: *Json*

 ## SplitRecursively

@@ -64,7 +64,7 @@ Input data:

 :::

-Return type: [*KTable*](/docs/core/data_types#ktable), each row represents a chunk, with the following sub fields:
+Return: [*KTable*](/docs/core/data_types#ktable), each row represents a chunk, with the following sub fields:

 * `location` (*Range*): The location of the chunk.
 * `text` (*Str*): The text of the chunk.
@@ -79,22 +79,22 @@ Return type: [*KTable*](/docs/core/data_types#ktable), each row represents a chu

 The spec takes the following fields:

-* `model` (type: `str`, required): The name of the SentenceTransformer model to use.
-* `args` (type: `dict[str, Any]`, optional): Additional arguments to pass to the SentenceTransformer constructor. e.g. `{"trust_remote_code": True}`
+* `model` (`str`): The name of the SentenceTransformer model to use.
+* `args` (`dict[str, Any]`, optional): Additional arguments to pass to the SentenceTransformer constructor. e.g. `{"trust_remote_code": True}`

 Input data:

-* `text` (type: `str`, required): The text to embed.
+* `text` (*Str*): The text to embed.

-Return type: `vector[float32; N]`, where `N` is determined by the model
+Return: *Vector[Float32, N]*, where *N* is determined by the model

 ## ExtractByLlm

 `ExtractByLlm` extracts structured information from a text using specified LLM. The spec takes the following fields:

-* `llm_spec` (type: `cocoindex.LlmSpec`, required): The specification of the LLM to use. See [LLM Spec](/docs/ai/llm#llm-spec) for more details.
-* `output_type` (type: `type`, required): The type of the output. e.g. a dataclass type name. See [Data Types](/docs/core/data_types) for all supported data types. The LLM will output values that match the schema of the type.
-* `instruction` (type: `str`, optional): Additional instruction for the LLM.
+* `llm_spec` (`cocoindex.LlmSpec`): The specification of the LLM to use. See [LLM Spec](/docs/ai/llm#llm-spec) for more details.
+* `output_type` (`type`): The type of the output. e.g. a dataclass type name. See [Data Types](/docs/core/data_types) for all supported data types. The LLM will output values that match the schema of the type.
+* `instruction` (`str`, optional): Additional instruction for the LLM.

 :::tip Clear type definitions

@@ -109,25 +109,25 @@ To improve the quality of the extracted information, giving clear definitions fo

 Input data:

-* `text` (type: `str`, required): The text to extract information from.
+* `text` (*Str*): The text to extract information from.

-Return type: As specified by the `output_type` field in the spec. The extracted information from the input text.
+Return: As specified by the `output_type` field in the spec. The extracted information from the input text.

 ## EmbedText

 `EmbedText` embeds a text into a vector space using various LLM APIs that support text embedding.

 The spec takes the following fields:

-* `api_type` (type: [`cocoindex.LlmApiType`](/docs/ai/llm#llm-api-types), required): The type of LLM API to use for embedding.
-* `model` (type: `str`, required): The name of the embedding model to use.
-* `address` (type: `str`, optional): The address of the LLM API. If not specified, uses the default address for the API type.
-* `output_dimension` (type: `int`, optional): The expected dimension of the output embedding vector. If not specified, use the default dimension of the model.
+* `api_type` ([`cocoindex.LlmApiType`](/docs/ai/llm#llm-api-types)): The type of LLM API to use for embedding.
+* `model` (`str`): The name of the embedding model to use.
+* `address` (`str`, optional): The address of the LLM API. If not specified, uses the default address for the API type.
+* `output_dimension` (`int`, optional): The expected dimension of the output embedding vector. If not specified, use the default dimension of the model.

 For most API types, the function internally keeps a registry for the default output dimension of known model.
 You need to explicitly specify the `output_dimension` if you want to use a new model that is not in the registry yet.

-* `task_type` (type: `str`, optional): The task type for embedding, used by some embedding models to optimize the embedding for specific use cases.
+* `task_type` (`str`, optional): The task type for embedding, used by some embedding models to optimize the embedding for specific use cases.

 :::note Supported APIs for Text Embedding

@@ -137,6 +137,6 @@ Not all LLM APIs support text embedding. See the [LLM API Types table](/docs/ai/

 Input data:

-* `text` (type: `str`, required): The text to embed.
+* `text` (*Str*, required): The text to embed.

-Return type: `vector[float32; N]`, where `N` is the dimension of the embedding vector determined by the model.
+Return: *Vector[Float32, N]*, where *N* is the dimension of the embedding vector determined by the model.
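
The `EmbedText` text in the hunks above says that a registry supplies the default output dimension for known models, and that `output_dimension` must be given explicitly for a model the registry does not know. A minimal sketch of that lookup rule follows. The registry contents, the function name `resolve_output_dimension`, and the error type are assumptions made for illustration; this is not CocoIndex's implementation.

```python
from typing import Optional

# Hypothetical registry of default output dimensions, keyed by model name.
# The entries are illustrative and are not CocoIndex's actual table.
DEFAULT_OUTPUT_DIMENSIONS = {
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
}


def resolve_output_dimension(model: str, output_dimension: Optional[int] = None) -> int:
    """Return the explicit dimension if given, else the registered default."""
    # An explicitly specified dimension always wins.
    if output_dimension is not None:
        return output_dimension
    try:
        return DEFAULT_OUTPUT_DIMENSIONS[model]
    except KeyError:
        # Unknown model and no explicit dimension: the caller must supply one.
        raise ValueError(
            f"Unknown model {model!r}: specify output_dimension explicitly"
        ) from None
```

Under these assumptions, a known model resolves without extra configuration, while a model missing from the registry only works when the caller passes `output_dimension`.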
-* `prefix` (type: `str`, optional): if provided, only files with path starting with this prefix will be imported.
-* `binary` (type: `bool`, optional): whether reading files as binary (instead of text).
-* `included_patterns` (type: `list[str]`, optional): a list of glob patterns to include files, e.g. `["*.txt", "docs/**/*.md"]`.
+* `bucket_name` (`str`): Amazon S3 bucket name.
+* `prefix` (`str`, optional): if provided, only files with path starting with this prefix will be imported.
+* `binary` (`bool`, optional): whether reading files as binary (instead of text).
+* `included_patterns` (`list[str]`, optional): a list of glob patterns to include files, e.g. `["*.txt", "docs/**/*.md"]`.
 If not specified, all files will be included.
-* `excluded_patterns` (type: `list[str]`, optional): a list of glob patterns to exclude files, e.g. `["*.tmp", "**/*.log"]`.
+* `excluded_patterns` (`list[str]`, optional): a list of glob patterns to exclude files, e.g. `["*.tmp", "**/*.log"]`.
 Any file or directory matching these patterns will be excluded even if they match `included_patterns`.
 If not specified, no files will be excluded.

@@ -136,7 +136,7 @@ The spec takes the following fields:

 :::

-* `sqs_queue_url` (type: `str`, optional): if provided, the source will receive change event notifications from Amazon S3 via this SQS queue.
+* `sqs_queue_url` (`str`, optional): if provided, the source will receive change event notifications from Amazon S3 via this SQS queue.

 :::info

@@ -147,9 +147,9 @@ The spec takes the following fields:

 ### Schema

-The output is a [KTable](/docs/core/data_types#ktable) with the following sub fields:
-* `filename` (key, type: `str`): the filename of the file, including the path, relative to the root directory, e.g. `"dir1/file1.md"`.
-* `content` (type: `str` if `binary` is `False`, otherwise `bytes`): the content of the file.
+The output is a [*KTable*](/docs/core/data_types#ktable) with the following sub fields:
+* `filename` (*Str*, key): the filename of the file, including the path, relative to the root directory, e.g. `"dir1/file1.md"`.
+* `content` (*Str* if `binary` is `False`, otherwise *Bytes*): the content of the file.


 ## GoogleDrive
@@ -176,10 +176,10 @@ To access files in Google Drive, the `GoogleDrive` source will need to authentic

 The spec takes the following fields:

-* `service_account_credential_path` (type: `str`, required): full path to the service account credential file in JSON format.
-* `root_folder_ids` (type: `list[str]`, required): a list of Google Drive folder IDs to import files from.
-* `binary` (type: `bool`, optional): whether reading files as binary (instead of text).
-* `recent_changes_poll_interval` (type: `datetime.timedelta`, optional): when set, this source provides a change capture mechanism by polling Google Drive for recent modified files periodically.
+* `service_account_credential_path` (`str`): full path to the service account credential file in JSON format.
+* `root_folder_ids` (`list[str]`): a list of Google Drive folder IDs to import files from.
+* `binary` (`bool`, optional): whether reading files as binary (instead of text).
+* `recent_changes_poll_interval` (`datetime.timedelta`, optional): when set, this source provides a change capture mechanism by polling Google Drive for recent modified files periodically.

 :::info

@@ -198,9 +198,9 @@ The spec takes the following fields:

 ### Schema

-The output is a [KTable](/docs/core/data_types#ktable) with the following sub fields:
+The output is a [*KTable*](/docs/core/data_types#ktable) with the following sub fields:

-* `file_id` (key, type: `str`): the ID of the file in Google Drive.
-* `filename` (type: `str`): the filename of the file, without the path, e.g. `"file1.md"`
-* `mime_type` (type: `str`): the MIME type of the file.
-* `content` (type: `str` if `binary` is `False`, otherwise `bytes`): the content of the file.
+* `file_id` (*Str*, key): the ID of the file in Google Drive.
+* `filename` (*Str*): the filename of the file, without the path, e.g. `"file1.md"`
+* `mime_type` (*Str*): the MIME type of the file.
+* `content` (*Str* if `binary` is `False`, otherwise *Bytes*): the content of the file.
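
The `included_patterns` and `excluded_patterns` fields documented earlier in this diff (the spec that also takes `bucket_name`) combine under three rules: exclusion wins over inclusion, an unspecified include list admits every file, and an unspecified exclude list rejects none. The sketch below illustrates only that precedence. The helper name `is_imported` is hypothetical, and Python's `fnmatch` is used as a rough stand-in for the real glob matcher, so `**` and directory-level matching may behave differently in the actual source.

```python
from fnmatch import fnmatchcase
from typing import Optional, Sequence


def is_imported(
    path: str,
    included_patterns: Optional[Sequence[str]] = None,
    excluded_patterns: Optional[Sequence[str]] = None,
) -> bool:
    """Hypothetical helper: decide whether `path` passes the two pattern lists."""
    # Exclusion takes precedence, even when the path also matches an include pattern.
    if excluded_patterns and any(fnmatchcase(path, p) for p in excluded_patterns):
        return False
    # No include list specified: every remaining file is included.
    if included_patterns is None:
        return True
    # Otherwise the path must match at least one include pattern.
    return any(fnmatchcase(path, p) for p in included_patterns)
```

For example, with `included_patterns=["*.tmp"]` and `excluded_patterns=["*.tmp"]`, a file named `draft.tmp` is still skipped, because the exclude check runs first.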