diff --git a/bin/find-notebooks-to-test.sh b/bin/find-notebooks-to-test.sh
index 97b428dd..3beb0ec5 100755
--- a/bin/find-notebooks-to-test.sh
+++ b/bin/find-notebooks-to-test.sh
@@ -31,6 +31,7 @@ EXEMPT_NOTEBOOKS=(
"notebooks/enterprise-search/app-search-engine-exporter.ipynb",
"notebooks/playground-examples/bedrock-anthropic-elasticsearch-client.ipynb",
"notebooks/playground-examples/openai-elasticsearch-client.ipynb",
+ "notebooks/integrations/hugging-face/huggingface-integration-millions-of-documents-with-cohere-reranking.ipynb",
"notebooks/integrations/cohere/updated-cohere-elasticsearch-inference-api.ipynb",
)
diff --git a/notebooks/integrations/hugging-face/huggingface-integration-millions-of-documents-with-cohere-reranking.ipynb b/notebooks/integrations/hugging-face/huggingface-integration-millions-of-documents-with-cohere-reranking.ipynb
new file mode 100644
index 00000000..cfc5b18d
--- /dev/null
+++ b/notebooks/integrations/hugging-face/huggingface-integration-millions-of-documents-with-cohere-reranking.ipynb
@@ -0,0 +1,810 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "7a765629",
+ "metadata": {},
+ "source": [
+ "# Semantic Search using the Inference API with the Hugging Face Inference Endpoints Service\n",
+ "\n",
+ "[](https://colab.research.google.com/github/elastic/elasticsearch-labs/blob/main/notebooks/integrations/hugging-face/huggingface-integration-millions-of-documents-with-cohere-reranking.ipynb)\n",
+ "\n",
+ "\n",
+ "Learn how to use the [Inference API](https://www.elastic.co/guide/en/elasticsearch/reference/current/inference-apis.html) with the Hugging Face Inference Endpoint service for semantic search."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f9101eb9",
+ "metadata": {},
+ "source": [
+ "# 🧰 Requirements\n",
+ "\n",
+ "For this example, you will need:\n",
+ "\n",
+ "- An Elastic deployment:\n",
+ " - We'll be using [Elastic serverless](https://www.elastic.co/docs/current/serverless) for this example (available with a [free trial](https://cloud.elastic.co/registration?utm_source=github&utm_content=elasticsearch-labs-notebook))\n",
+ "\n",
+ "- Elasticsearch 8.14 or above.\n",
+ " \n",
+ "- A paid [Hugging Face Inference Endpoint](https://huggingface.co/docs/inference-endpoints/guides/create_endpoint) is required to use the Inference API with \n",
+ "the Hugging Face Inference Endpoint service."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "4cd69cc0",
+ "metadata": {},
+ "source": [
+ "# Create Elastic Cloud deployment or serverless project\n",
+ "\n",
+ "If you don't have an Elastic Cloud deployment, sign up [here](https://cloud.elastic.co/registration?utm_source=github&utm_content=elasticsearch-labs-notebook) for a free trial."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f27dffbf",
+ "metadata": {},
+ "source": [
+ "# Install packages and connect with Elasticsearch Client\n",
+ "\n",
+ "To get started, we'll need to connect to our Elastic deployment using the Python client (version 8.12.0 or above).\n",
+ "Because we're using an Elastic Cloud deployment, we'll use the **Cloud ID** to identify our deployment.\n",
+ "\n",
+ "First we need to `pip` install the following packages:\n",
+ "\n",
+ "- `elasticsearch`\n- `datasets`"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "8c4b16bc",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "%pip install elasticsearch\n",
+ "%pip install datasets"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "41ef96b3",
+ "metadata": {},
+ "source": [
+ "Next, we import the modules we need. 🔐 NOTE: `getpass` lets us prompt the user for credentials without echoing them to the terminal or hard-coding them in the notebook."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "id": "690ff9af",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from elasticsearch import Elasticsearch, helpers\n",
+ "from getpass import getpass\n",
+ "import datasets"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "23fa2b6c",
+ "metadata": {},
+ "source": [
+ "Now we can instantiate the Python Elasticsearch client.\n",
+ "\n",
+ "First we prompt the user for their Cloud ID and API key.\n",
+ "Then we create a `client` object, which is an instance of the `Elasticsearch` class."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "id": "195cc597",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#finding-your-cloud-id\n",
+ "ELASTIC_CLOUD_ID = getpass(\"Elastic Cloud ID: \")\n",
+ "\n",
+ "# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#creating-an-api-key\n",
+ "ELASTIC_API_KEY = getpass(\"Elastic API Key: \")\n",
+ "\n",
+ "# Create the client instance\n",
+ "client = Elasticsearch(\n",
+ " # For local development\n",
+ " # hosts=[\"http://localhost:9200\"]\n",
+ " cloud_id=ELASTIC_CLOUD_ID,\n",
+ " api_key=ELASTIC_API_KEY,\n",
+ " request_timeout=120,\n",
+ " max_retries=10,\n",
+ " retry_on_timeout=True,\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b1115ffb",
+ "metadata": {},
+ "source": [
+ "### Test the Client\n",
+ "Before you continue, confirm that the client has connected with this test."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "id": "cc0de5ea",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "{'name': 'serverless', 'cluster_name': 'd3ae40d244564c39961aa942d9d47f84', 'cluster_uuid': 'poKWeRbiS--nyD43R_NROw', 'version': {'number': '8.11.0', 'build_flavor': 'serverless', 'build_type': 'docker', 'build_hash': '00000000', 'build_date': '2023-10-31', 'build_snapshot': False, 'lucene_version': '9.7.0', 'minimum_wire_compatibility_version': '8.11.0', 'minimum_index_compatibility_version': '8.11.0'}, 'tagline': 'You Know, for Search'}\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(client.info())\n",
+ "\n",
+ "\n",
+ "# define this now so we can use it later\n",
+ "def pretty_search_response(response):\n",
+ " if len(response[\"hits\"][\"hits\"]) == 0:\n",
+ " print(\"Your search returned no results.\")\n",
+ " else:\n",
+ " for hit in response[\"hits\"][\"hits\"]:\n",
+ " id = hit[\"_id\"]\n",
+ " score = hit[\"_score\"]\n",
+ " text = hit[\"_source\"][\"text_field\"]\n",
+ "\n",
+ " pretty_output = f\"\\nID: {id}\\nScore: {score}\\nText: {text}\"\n",
+ "\n",
+ " print(pretty_output)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "659c5890",
+ "metadata": {},
+ "source": [
+ "Refer to [the documentation](https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/connecting.html#connect-self-managed-new) to learn how to connect to a self-managed deployment.\n",
+ "\n",
+ "Read [this page](https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/connecting.html#connect-self-managed-new) to learn how to connect using API keys."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "840d92f0",
+ "metadata": {},
+ "source": [
+ "\n",
+ "## Create the inference endpoint object\n",
+ "\n",
+ "Let's create the inference endpoint by using the [Create inference API](https://www.elastic.co/guide/en/elasticsearch/reference/current/put-inference-api.html).\n",
+ "\n",
+ "You'll need a Hugging Face API key (access token) for this, which you can find in your Hugging Face account under [Access Tokens](https://huggingface.co/settings/tokens).\n",
+ "\n",
+ "You will also need to have created a [Hugging Face Inference Endpoint service instance](https://huggingface.co/docs/inference-endpoints/guides/create_endpoint) and noted the `url` of your instance. For this notebook, we deployed the `multilingual-e5-small` model."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "id": "0d007737",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "ObjectApiResponse({'inference_id': 'my_hf_endpoint_object', 'task_type': 'text_embedding', 'service': 'hugging_face', 'service_settings': {'url': 'https://yb0j0ol2xzvro0oc.us-east-1.aws.endpoints.huggingface.cloud', 'similarity': 'dot_product', 'dimensions': 384, 'rate_limit': {'requests_per_minute': 3000}}, 'task_settings': {}})"
+ ]
+ },
+ "execution_count": 5,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "API_KEY = getpass(\"Hugging Face API key: \")\n",
+ "client.inference.put(\n",
+ " inference_id=\"my_hf_endpoint_object\",\n",
+ " body={\n",
+ " \"service\": \"hugging_face\",\n",
+ " \"service_settings\": {\n",
+ " \"api_key\": API_KEY,\n",
+ " \"url\": \"\",  # set this to the URL of your Hugging Face Inference Endpoint\n",
+ " \"similarity\": \"dot_product\",\n",
+ " },\n",
+ " },\n",
+ " task_type=\"text_embedding\",\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "id": "67f4201d",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "ObjectApiResponse({'text_embedding': [{'embedding': [0.026027203, -0.011120652, -0.048804738, -0.108695105, 0.06134937, -0.003066093, 0.053232085, 0.103629395, 0.046043355, 0.0055427994, 0.036174323, 0.022110537, 0.084891565, -0.008215214, -0.017915571, 0.041923355, 0.048264034, -0.0404355, -0.02609504, -0.023076748, 0.0077286777, 0.023034474, 0.010379155, 0.06257496, 0.025658935, 0.040398516, -0.059809092, 0.032451782, 0.020798752, -0.053219322, -0.0447653, -0.033474423, 0.085040554, -0.051343303, 0.081006914, 0.026895791, -0.031822708, -0.06217641, 0.069435075, -0.055062667, -0.014967285, -0.0040517864, 0.03874908, 0.07854211, 0.017526977, 0.040629108, -0.023190023, 0.056913305, -0.06422566, -0.009403182, -0.06666503, 0.035270344, 0.004515737, 0.07347306, 0.011125566, -0.07184689, -0.08095445, -0.04214626, -0.108447045, -0.019494658, 0.06303337, 0.019757038, -0.014584281, 0.060923614, 0.06465893, 0.108431116, 0.04072316, 0.03705652, -0.06975359, -0.050562095, -0.058487326, 0.05989619, 0.008454561, -0.02706363, -0.017974045, 0.030698266, 0.046484154, -0.06212431, 0.009513307, -0.056369964, -0.052940592, -0.05834985, -0.02096531, 0.03910419, -0.054484386, 0.06231919, 0.044607673, -0.064030685, 0.067746714, -0.0291515, 0.06992093, 0.06300958, -0.07530936, -0.06167211, -0.0681666, -0.042375665, -0.05200085, 0.058336657, 0.039630838, -0.03444309, 0.030615594, -0.042388055, 0.03127304, -0.059075136, -0.05925558, 0.019864058, 0.0311022, -0.11285156, 0.02264027, -0.0676216, 0.011842404, -0.0157365, 0.06580391, 0.023665493, -0.05072435, -0.039492164, -0.06390325, -0.067074455, 0.032680944, -0.05243909, 0.06721114, -0.005195616, -0.0458316, -0.046202496, -0.07942237, -0.011754681, 0.026515028, 0.04761297, 0.08130492, 0.0118014645, 0.025956452, 0.039976373, 0.050196614, 0.052609406, 0.063223615, 0.06121741, -0.028745022, 0.0008677591, 0.038760003, -0.021240402, -0.073974326, 0.0548761, -0.047403768, 0.025582938, 0.0585596, 0.056284837, 0.08381001, -0.02149303, 
0.09447917, -0.04940235, 0.018470071, -0.044996567, 0.08062048, 0.05162519, 0.053831138, -0.052980945, -0.08226773, -0.068137355, 0.028439872, 0.049932946, -0.07633764, -0.08649836, -0.07108301, 0.017650153, -0.065348, -0.038191773, 0.040068675, 0.05870959, -0.04707911, -0.04340612, -0.044621766, 0.030800574, -0.042227603, 0.0604754, 0.010891958, 0.057460006, -0.046362966, 0.046009373, 0.07293652, 0.09398854, -0.017035728, -0.010618687, -0.09326647, -0.03877647, -0.026517635, -0.047411792, -0.073266074, 0.033911563, 0.0642687, -0.02208107, 0.0040624263, -0.003194478, -0.082016475, -0.088730805, -0.084694624, -0.03364641, -0.05026475, 0.051665384, 0.058177516, 0.02759865, -0.034461632, 0.0027396793, 0.013807217, 0.040009033, 0.06346369, 0.05832441, -0.07451158, 0.028601868, -0.022494016, 0.04229324, 0.027883757, -0.0673137, -0.07119014, 0.047188714, -0.033077974, -0.028302893, -0.028704679, 0.043902606, -0.05147592, 0.045782477, 0.08077521, -0.01782404, 0.0242885, -0.0711172, -0.023565968, 0.041291755, 0.084907316, -0.101972945, -0.038989857, 0.025122978, -0.014144972, -0.010975231, -0.0357049, -0.09243826, -0.023552464, -0.08525497, -0.018912667, 0.049455214, 0.06532829, -0.031223357, -0.013451132, -0.00037671064, 0.04600707, -0.057603396, 0.08035837, -0.026429964, -0.0962299, 0.022606302, -0.0116137, 0.062264528, 0.033446472, -0.06123555, -0.09909991, -0.07459225, -0.018707436, 0.028753517, 0.06808565, 0.023965191, -0.04717076, 0.026551146, 0.019655682, -0.009233348, 0.10465723, 0.046420176, 0.03295103, 0.053024694, -0.03854051, -0.0058735567, -0.061238136, -0.048678573, -0.05362055, 0.048028357, 0.003013557, -0.06505121, -0.020536456, -0.020093206, 0.014102229, 0.10254222, -0.027084326, -0.061477777, 0.03478813, -0.00029115603, 0.053552967, 0.056773122, 0.048566766, 0.027371235, -0.015398839, 0.0511229, -0.03932426, -0.043879736, -0.03872225, -0.08171432, 0.01703992, -0.04535995, 0.03194781, 0.011413799, 0.036786903, 0.021306055, -0.06722324, 0.034231987, 
-0.027529748, -0.059552487, 0.050244797, 0.08905617, -0.071323626, 0.05047076, 0.003429174, 0.034673557, 0.009984501, 0.056842286, 0.0683513, 0.023990847, -0.04053898, -0.022724004, 0.026175855, 0.027319307, -0.055451974, -0.053907238, -0.05359307, -0.035025068, -0.03776361, -0.02973751, -0.037610233, -0.051089168, 0.04428633, 0.06276192, -0.03754498, -0.060270913, 0.043127347, 0.016669549, 0.024885416, -0.027190097, -0.011614101, 0.077848606, -0.007924398, -0.061833344, -0.015071012, 0.023127502, -0.07634841, -0.015780756, 0.031652045, 0.0031123296, -0.032643825, 0.05640234, -0.02685534, -0.04942714, 0.048498664, 0.00043902535, -0.043975227, 0.017389799, 0.07734344, -0.090009265, 0.019997133, 0.10055134, -0.05671741, 0.048755262, -0.02514076, -0.011394784, 0.049053214, 0.04264309, -0.06451125, -0.029034287, 0.07762039, 0.06809162, 0.059983794, 0.035379365, -0.007960272, 0.019705113, -0.02518122, -0.05767321, 0.038523413, 0.081652805, -0.032829504, -0.0023197657, -0.018218426, -0.0885769, -0.094963886, 0.057851806, -0.041729856, -0.045802936, 0.0570079, 0.047811687, 0.017810043, 0.09373594]}]})"
+ ]
+ },
+ "execution_count": 6,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "client.inference.inference(\n",
+ " inference_id=\"my_hf_endpoint_object\", input=\"this is the raw text of my document!\"\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1f2e48b7",
+ "metadata": {},
+ "source": [
+ "**IMPORTANT:** If you use Elasticsearch 8.12, you must change `inference_id` in the snippet above to `model_id`! "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "0346151d",
+ "metadata": {},
+ "source": [
+ "#"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1024d070",
+ "metadata": {},
+ "source": [
+ "## Create an ingest pipeline with an inference processor\n",
+ "\n",
+ "Create an ingest pipeline with an inference processor by using the [`put_pipeline`](https://www.elastic.co/guide/en/elasticsearch/reference/master/put-pipeline-api.html) method. Reference the `inference_id` created above as `model_id` to infer on the data that is being ingested by the pipeline."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "id": "6ace9e2e",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "ObjectApiResponse({'acknowledged': True})"
+ ]
+ },
+ "execution_count": 7,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "client.ingest.put_pipeline(\n",
+ " id=\"hf_pipeline\",\n",
+ " processors=[\n",
+ " {\n",
+ " \"inference\": {\n",
+ " \"model_id\": \"my_hf_endpoint_object\",\n",
+ " \"input_output\": {\n",
+ " \"input_field\": \"text_field\",\n",
+ " \"output_field\": \"text_embedding\",\n",
+ " },\n",
+ " }\n",
+ " }\n",
+ " ],\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "76d07567",
+ "metadata": {},
+ "source": [
+ "Let's note a few important parameters from that API call:\n",
+ "\n",
+ "- `inference`: A processor that performs inference using a machine learning model.\n",
+ "- `model_id`: Specifies the ID of the inference endpoint to be used. In this example, the inference ID is set to `my_hf_endpoint_object`. Use the inference ID you defined when you created the inference endpoint.\n",
+ "- `input_output`: Specifies input and output fields.\n",
+ "- `input_field`: Field name from which the `dense_vector` representation is created.\n",
+ "- `output_field`: Name of the field that contains the inference results."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "28e12d7a",
+ "metadata": {},
+ "source": [
+ "## Create index\n",
+ "\n",
+ "Next, create the mapping for the destination index, that is, the index that will contain the embeddings the model generates from your input text. The destination index must have a field with the [dense_vector](https://www.elastic.co/guide/en/elasticsearch/reference/current/dense-vector.html) field type to index the output of the model we deployed in Hugging Face (`multilingual-e5-small`).\n",
+ "\n",
+ "Let's create an index named `hf-endpoint-index` with the mappings we need."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "id": "6ddcbca3",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'hf-endpoint-index'})"
+ ]
+ },
+ "execution_count": 8,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "client.indices.create(\n",
+ " index=\"hf-endpoint-index\",\n",
+ " settings={\n",
+ " \"index\": {\n",
+ " \"default_pipeline\": \"hf_pipeline\",\n",
+ " }\n",
+ " },\n",
+ " mappings={\n",
+ " \"properties\": {\n",
+ " \"text_field\": {\"type\": \"text\"},\n",
+ " \"text_embedding\": {\n",
+ " \"type\": \"dense_vector\",\n",
+ " \"dims\": 384,\n",
+ " \"similarity\": \"dot_product\",\n",
+ " },\n",
+ " }\n",
+ " },\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "12aa3452",
+ "metadata": {},
+ "source": [
+ "## Using the `semantic_text` field (Elasticsearch serverless or v8.15+)\n",
+ "If you are using Elasticsearch serverless or v8.15+, you have access to the new `semantic_text` field. `semantic_text` has significantly faster ingest times and is recommended.\n",
+ "\n",
+ "https://github.com/elastic/elasticsearch/blob/main/docs/reference/mapping/types/semantic-text.asciidoc"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "id": "5eeb3755",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'hf-semantic-text-index'})"
+ ]
+ },
+ "execution_count": 9,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "client.indices.create(\n",
+ " index=\"hf-semantic-text-index\",\n",
+ " mappings={\n",
+ " \"properties\": {\n",
+ " \"infer_field\": {\n",
+ " \"type\": \"semantic_text\",\n",
+ " \"inference_id\": \"my_hf_endpoint_object\",\n",
+ " },\n",
+ " \"text_field\": {\"type\": \"text\", \"copy_to\": \"infer_field\"},\n",
+ " }\n",
+ " },\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "07c187a9",
+ "metadata": {},
+ "source": [
+ "## Insert Documents\n",
+ "\n",
+ "In this example, we want to show the power of using GPUs in Hugging Face's Inference Endpoint service by indexing millions of multilingual documents from the MIRACL corpus. The speed at which these documents are ingested depends on whether you use a `semantic_text` field (faster) or an ingest pipeline (slower), and on how much hardware you rent for your Hugging Face inference endpoint. Using a `semantic_text` field with a single T4 GPU, it may take about 3 hours to index 1 million documents."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "id": "d68737cb",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "ca38aa5c44824945a744c6c3a75756ce",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "miracl-corpus.py: 0%| | 0.00/3.15k [00:00, ?B/s]"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "2e8b8c5ce485492ab80cbd34df4b1e39",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "README.md: 0%| | 0.00/6.85k [00:00, ?B/s]"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "90a4a2866aca4834ac3561be8f2815ac",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "Loading dataset shards: 0%| | 0/28 [00:00, ?it/s]"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "langs = [\n",
+ " \"ar\",\n",
+ " \"bn\",\n",
+ " \"en\",\n",
+ " \"es\",\n",
+ " \"fa\",\n",
+ " \"fi\",\n",
+ " \"fr\",\n",
+ " \"hi\",\n",
+ " \"id\",\n",
+ " \"ja\",\n",
+ " \"ko\",\n",
+ " \"ru\",\n",
+ " \"sw\",\n",
+ " \"te\",\n",
+ " \"th\",\n",
+ " \"zh\",\n",
+ "]\n",
+ "\n",
+ "\n",
+ "all_langs_datasets = [\n",
+ " iter(datasets.load_dataset(\"miracl/miracl-corpus\", lang)[\"train\"]) for lang in langs\n",
+ "]"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "id": "7ccf9835",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Docs uplaoded: 1000\n",
+ "Docs uplaoded: 2000\n"
+ ]
+ },
+ {
+ "ename": "KeyboardInterrupt",
+ "evalue": "",
+ "output_type": "error",
+ "traceback": [
+ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+ "\u001b[0;31mKeyboardInterrupt\u001b[0m Traceback (most recent call last)",
+ "Cell \u001b[0;32mIn[11], line 27\u001b[0m\n\u001b[1;32m 17\u001b[0m \u001b[38;5;66;03m# if you are using an ingest pipeline instead of a\u001b[39;00m\n\u001b[1;32m 18\u001b[0m \u001b[38;5;66;03m# semantic text field, use this instead:\u001b[39;00m\n\u001b[1;32m 19\u001b[0m \u001b[38;5;66;03m# documents.append(\u001b[39;00m\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 23\u001b[0m \u001b[38;5;66;03m# }\u001b[39;00m\n\u001b[1;32m 24\u001b[0m \u001b[38;5;66;03m# )\u001b[39;00m\n\u001b[1;32m 26\u001b[0m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[0;32m---> 27\u001b[0m response \u001b[38;5;241m=\u001b[39m \u001b[43mhelpers\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mbulk\u001b[49m\u001b[43m(\u001b[49m\u001b[43mclient\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mdocuments\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mraise_on_error\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;28;43;01mFalse\u001b[39;49;00m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mtimeout\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43m60s\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[43m)\u001b[49m\n\u001b[1;32m 28\u001b[0m \u001b[38;5;28mprint\u001b[39m(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mDocs uplaoded:\u001b[39m\u001b[38;5;124m\"\u001b[39m, (j \u001b[38;5;241m+\u001b[39m \u001b[38;5;241m1\u001b[39m) \u001b[38;5;241m*\u001b[39m MAX_BULK_SIZE)\n\u001b[1;32m 30\u001b[0m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mException\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m e:\n",
+ "File \u001b[0;32m~/workplace/elasticsearch-labs/.venv/lib/python3.11/site-packages/elasticsearch/helpers/actions.py:531\u001b[0m, in \u001b[0;36mbulk\u001b[0;34m(client, actions, stats_only, ignore_status, *args, **kwargs)\u001b[0m\n\u001b[1;32m 529\u001b[0m \u001b[38;5;66;03m# make streaming_bulk yield successful results so we can count them\u001b[39;00m\n\u001b[1;32m 530\u001b[0m kwargs[\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124myield_ok\u001b[39m\u001b[38;5;124m\"\u001b[39m] \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;01mTrue\u001b[39;00m\n\u001b[0;32m--> 531\u001b[0m \u001b[43m\u001b[49m\u001b[38;5;28;43;01mfor\u001b[39;49;00m\u001b[43m \u001b[49m\u001b[43mok\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mitem\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;129;43;01min\u001b[39;49;00m\u001b[43m \u001b[49m\u001b[43mstreaming_bulk\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 532\u001b[0m \u001b[43m \u001b[49m\u001b[43mclient\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mactions\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mignore_status\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mignore_status\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mspan_name\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43mhelpers.bulk\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mkwargs\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;66;43;03m# type: ignore[misc]\u001b[39;49;00m\n\u001b[1;32m 533\u001b[0m \u001b[43m\u001b[49m\u001b[43m)\u001b[49m\u001b[43m:\u001b[49m\n\u001b[1;32m 534\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;66;43;03m# go through request-response pairs and detect failures\u001b[39;49;00m\n\u001b[1;32m 535\u001b[0m \u001b[43m 
\u001b[49m\u001b[38;5;28;43;01mif\u001b[39;49;00m\u001b[43m \u001b[49m\u001b[38;5;129;43;01mnot\u001b[39;49;00m\u001b[43m \u001b[49m\u001b[43mok\u001b[49m\u001b[43m:\u001b[49m\n\u001b[1;32m 536\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43;01mif\u001b[39;49;00m\u001b[43m \u001b[49m\u001b[38;5;129;43;01mnot\u001b[39;49;00m\u001b[43m \u001b[49m\u001b[43mstats_only\u001b[49m\u001b[43m:\u001b[49m\n",
+ "File \u001b[0;32m~/workplace/elasticsearch-labs/.venv/lib/python3.11/site-packages/elasticsearch/helpers/actions.py:445\u001b[0m, in \u001b[0;36mstreaming_bulk\u001b[0;34m(client, actions, chunk_size, max_chunk_bytes, raise_on_error, expand_action_callback, raise_on_exception, max_retries, initial_backoff, max_backoff, yield_ok, ignore_status, span_name, *args, **kwargs)\u001b[0m\n\u001b[1;32m 442\u001b[0m time\u001b[38;5;241m.\u001b[39msleep(\u001b[38;5;28mmin\u001b[39m(max_backoff, initial_backoff \u001b[38;5;241m*\u001b[39m \u001b[38;5;241m2\u001b[39m \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39m (attempt \u001b[38;5;241m-\u001b[39m \u001b[38;5;241m1\u001b[39m)))\n\u001b[1;32m 444\u001b[0m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[0;32m--> 445\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43;01mfor\u001b[39;49;00m\u001b[43m \u001b[49m\u001b[43mdata\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m(\u001b[49m\u001b[43mok\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43minfo\u001b[49m\u001b[43m)\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;129;43;01min\u001b[39;49;00m\u001b[43m \u001b[49m\u001b[38;5;28;43mzip\u001b[39;49m\u001b[43m(\u001b[49m\n\u001b[1;32m 446\u001b[0m \u001b[43m \u001b[49m\u001b[43mbulk_data\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 447\u001b[0m \u001b[43m \u001b[49m\u001b[43m_process_bulk_chunk\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 448\u001b[0m \u001b[43m \u001b[49m\u001b[43mclient\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 449\u001b[0m \u001b[43m \u001b[49m\u001b[43mbulk_actions\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 450\u001b[0m \u001b[43m \u001b[49m\u001b[43mbulk_data\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 451\u001b[0m \u001b[43m \u001b[49m\u001b[43motel_span\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 452\u001b[0m \u001b[43m \u001b[49m\u001b[43mraise_on_exception\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 453\u001b[0m \u001b[43m 
\u001b[49m\u001b[43mraise_on_error\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 454\u001b[0m \u001b[43m \u001b[49m\u001b[43mignore_status\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 455\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 456\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mkwargs\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 457\u001b[0m \u001b[43m \u001b[49m\u001b[43m)\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 458\u001b[0m \u001b[43m \u001b[49m\u001b[43m)\u001b[49m\u001b[43m:\u001b[49m\n\u001b[1;32m 459\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43;01mif\u001b[39;49;00m\u001b[43m \u001b[49m\u001b[38;5;129;43;01mnot\u001b[39;49;00m\u001b[43m \u001b[49m\u001b[43mok\u001b[49m\u001b[43m:\u001b[49m\n\u001b[1;32m 460\u001b[0m \u001b[43m \u001b[49m\u001b[43maction\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43minfo\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43m \u001b[49m\u001b[43minfo\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mpopitem\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\n",
+ "File \u001b[0;32m~/workplace/elasticsearch-labs/.venv/lib/python3.11/site-packages/elasticsearch/helpers/actions.py:343\u001b[0m, in \u001b[0;36m_process_bulk_chunk\u001b[0;34m(client, bulk_actions, bulk_data, otel_span, raise_on_exception, raise_on_error, ignore_status, *args, **kwargs)\u001b[0m\n\u001b[1;32m 339\u001b[0m ignore_status \u001b[38;5;241m=\u001b[39m (ignore_status,)\n\u001b[1;32m 341\u001b[0m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[1;32m 342\u001b[0m \u001b[38;5;66;03m# send the actual request\u001b[39;00m\n\u001b[0;32m--> 343\u001b[0m resp \u001b[38;5;241m=\u001b[39m \u001b[43mclient\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mbulk\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43moperations\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mbulk_actions\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m \u001b[38;5;66;03m# type: ignore[arg-type]\u001b[39;00m\n\u001b[1;32m 344\u001b[0m \u001b[38;5;28;01mexcept\u001b[39;00m ApiError \u001b[38;5;28;01mas\u001b[39;00m e:\n\u001b[1;32m 345\u001b[0m gen \u001b[38;5;241m=\u001b[39m _process_bulk_chunk_error(\n\u001b[1;32m 346\u001b[0m error\u001b[38;5;241m=\u001b[39me,\n\u001b[1;32m 347\u001b[0m bulk_data\u001b[38;5;241m=\u001b[39mbulk_data,\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 350\u001b[0m raise_on_error\u001b[38;5;241m=\u001b[39mraise_on_error,\n\u001b[1;32m 351\u001b[0m )\n",
+ "File \u001b[0;32m~/workplace/elasticsearch-labs/.venv/lib/python3.11/site-packages/elasticsearch/_sync/client/utils.py:446\u001b[0m, in \u001b[0;36m_rewrite_parameters..wrapper..wrapped\u001b[0;34m(*args, **kwargs)\u001b[0m\n\u001b[1;32m 443\u001b[0m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mKeyError\u001b[39;00m:\n\u001b[1;32m 444\u001b[0m \u001b[38;5;28;01mpass\u001b[39;00m\n\u001b[0;32m--> 446\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mapi\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n",
+ "File \u001b[0;32m~/workplace/elasticsearch-labs/.venv/lib/python3.11/site-packages/elasticsearch/_sync/client/__init__.py:717\u001b[0m, in \u001b[0;36mElasticsearch.bulk\u001b[0;34m(self, operations, body, index, error_trace, filter_path, human, pipeline, pretty, refresh, require_alias, routing, source, source_excludes, source_includes, timeout, wait_for_active_shards)\u001b[0m\n\u001b[1;32m 712\u001b[0m __body \u001b[38;5;241m=\u001b[39m operations \u001b[38;5;28;01mif\u001b[39;00m operations \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m \u001b[38;5;28;01melse\u001b[39;00m body\n\u001b[1;32m 713\u001b[0m __headers \u001b[38;5;241m=\u001b[39m {\n\u001b[1;32m 714\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124maccept\u001b[39m\u001b[38;5;124m\"\u001b[39m: \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mapplication/json\u001b[39m\u001b[38;5;124m\"\u001b[39m,\n\u001b[1;32m 715\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mcontent-type\u001b[39m\u001b[38;5;124m\"\u001b[39m: \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mapplication/x-ndjson\u001b[39m\u001b[38;5;124m\"\u001b[39m,\n\u001b[1;32m 716\u001b[0m }\n\u001b[0;32m--> 717\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mperform_request\u001b[49m\u001b[43m(\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;66;43;03m# type: ignore[return-value]\u001b[39;49;00m\n\u001b[1;32m 718\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43mPUT\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[43m,\u001b[49m\n\u001b[1;32m 719\u001b[0m \u001b[43m \u001b[49m\u001b[43m__path\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 720\u001b[0m \u001b[43m \u001b[49m\u001b[43mparams\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43m__query\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 721\u001b[0m \u001b[43m 
\u001b[49m\u001b[43mheaders\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43m__headers\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 722\u001b[0m \u001b[43m \u001b[49m\u001b[43mbody\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43m__body\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 723\u001b[0m \u001b[43m \u001b[49m\u001b[43mendpoint_id\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43mbulk\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[43m,\u001b[49m\n\u001b[1;32m 724\u001b[0m \u001b[43m \u001b[49m\u001b[43mpath_parts\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43m__path_parts\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 725\u001b[0m \u001b[43m\u001b[49m\u001b[43m)\u001b[49m\n",
+ "File \u001b[0;32m~/workplace/elasticsearch-labs/.venv/lib/python3.11/site-packages/elasticsearch/_sync/client/_base.py:271\u001b[0m, in \u001b[0;36mBaseClient.perform_request\u001b[0;34m(self, method, path, params, headers, body, endpoint_id, path_parts)\u001b[0m\n\u001b[1;32m 255\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m \u001b[38;5;21mperform_request\u001b[39m(\n\u001b[1;32m 256\u001b[0m \u001b[38;5;28mself\u001b[39m,\n\u001b[1;32m 257\u001b[0m method: \u001b[38;5;28mstr\u001b[39m,\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 264\u001b[0m path_parts: Optional[Mapping[\u001b[38;5;28mstr\u001b[39m, Any]] \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;01mNone\u001b[39;00m,\n\u001b[1;32m 265\u001b[0m ) \u001b[38;5;241m-\u001b[39m\u001b[38;5;241m>\u001b[39m ApiResponse[Any]:\n\u001b[1;32m 266\u001b[0m \u001b[38;5;28;01mwith\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_otel\u001b[38;5;241m.\u001b[39mspan(\n\u001b[1;32m 267\u001b[0m method,\n\u001b[1;32m 268\u001b[0m endpoint_id\u001b[38;5;241m=\u001b[39mendpoint_id,\n\u001b[1;32m 269\u001b[0m path_parts\u001b[38;5;241m=\u001b[39mpath_parts \u001b[38;5;129;01mor\u001b[39;00m {},\n\u001b[1;32m 270\u001b[0m ) \u001b[38;5;28;01mas\u001b[39;00m otel_span:\n\u001b[0;32m--> 271\u001b[0m response \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_perform_request\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 272\u001b[0m \u001b[43m \u001b[49m\u001b[43mmethod\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 273\u001b[0m \u001b[43m \u001b[49m\u001b[43mpath\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 274\u001b[0m \u001b[43m \u001b[49m\u001b[43mparams\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mparams\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 275\u001b[0m \u001b[43m \u001b[49m\u001b[43mheaders\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mheaders\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 276\u001b[0m \u001b[43m 
\u001b[49m\u001b[43mbody\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mbody\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 277\u001b[0m \u001b[43m \u001b[49m\u001b[43motel_span\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43motel_span\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 278\u001b[0m \u001b[43m \u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 279\u001b[0m otel_span\u001b[38;5;241m.\u001b[39mset_elastic_cloud_metadata(response\u001b[38;5;241m.\u001b[39mmeta\u001b[38;5;241m.\u001b[39mheaders)\n\u001b[1;32m 280\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m response\n",
+ "File \u001b[0;32m~/workplace/elasticsearch-labs/.venv/lib/python3.11/site-packages/elasticsearch/_sync/client/_base.py:316\u001b[0m, in \u001b[0;36mBaseClient._perform_request\u001b[0;34m(self, method, path, params, headers, body, otel_span)\u001b[0m\n\u001b[1;32m 313\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[1;32m 314\u001b[0m target \u001b[38;5;241m=\u001b[39m path\n\u001b[0;32m--> 316\u001b[0m meta, resp_body \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mtransport\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mperform_request\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 317\u001b[0m \u001b[43m \u001b[49m\u001b[43mmethod\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 318\u001b[0m \u001b[43m \u001b[49m\u001b[43mtarget\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 319\u001b[0m \u001b[43m \u001b[49m\u001b[43mheaders\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mrequest_headers\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 320\u001b[0m \u001b[43m \u001b[49m\u001b[43mbody\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mbody\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 321\u001b[0m \u001b[43m \u001b[49m\u001b[43mrequest_timeout\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_request_timeout\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 322\u001b[0m \u001b[43m \u001b[49m\u001b[43mmax_retries\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_max_retries\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 323\u001b[0m \u001b[43m \u001b[49m\u001b[43mretry_on_status\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_retry_on_status\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 324\u001b[0m \u001b[43m 
\u001b[49m\u001b[43mretry_on_timeout\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_retry_on_timeout\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 325\u001b[0m \u001b[43m \u001b[49m\u001b[43mclient_meta\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_client_meta\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 326\u001b[0m \u001b[43m \u001b[49m\u001b[43motel_span\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43motel_span\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 327\u001b[0m \u001b[43m\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 329\u001b[0m \u001b[38;5;66;03m# HEAD with a 404 is returned as a normal response\u001b[39;00m\n\u001b[1;32m 330\u001b[0m \u001b[38;5;66;03m# since this is used as an 'exists' functionality.\u001b[39;00m\n\u001b[1;32m 331\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m (method \u001b[38;5;241m==\u001b[39m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mHEAD\u001b[39m\u001b[38;5;124m\"\u001b[39m \u001b[38;5;129;01mand\u001b[39;00m meta\u001b[38;5;241m.\u001b[39mstatus \u001b[38;5;241m==\u001b[39m \u001b[38;5;241m404\u001b[39m) \u001b[38;5;129;01mand\u001b[39;00m (\n\u001b[1;32m 332\u001b[0m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;241m200\u001b[39m \u001b[38;5;241m<\u001b[39m\u001b[38;5;241m=\u001b[39m meta\u001b[38;5;241m.\u001b[39mstatus \u001b[38;5;241m<\u001b[39m \u001b[38;5;241m299\u001b[39m\n\u001b[1;32m 333\u001b[0m \u001b[38;5;129;01mand\u001b[39;00m (\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 337\u001b[0m )\n\u001b[1;32m 338\u001b[0m ):\n",
+ "File \u001b[0;32m~/workplace/elasticsearch-labs/.venv/lib/python3.11/site-packages/elastic_transport/_transport.py:342\u001b[0m, in \u001b[0;36mTransport.perform_request\u001b[0;34m(self, method, target, body, headers, max_retries, retry_on_status, retry_on_timeout, request_timeout, client_meta, otel_span)\u001b[0m\n\u001b[1;32m 340\u001b[0m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[1;32m 341\u001b[0m otel_span\u001b[38;5;241m.\u001b[39mset_node_metadata(node\u001b[38;5;241m.\u001b[39mhost, node\u001b[38;5;241m.\u001b[39mport, node\u001b[38;5;241m.\u001b[39mbase_url, target)\n\u001b[0;32m--> 342\u001b[0m resp \u001b[38;5;241m=\u001b[39m \u001b[43mnode\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mperform_request\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 343\u001b[0m \u001b[43m \u001b[49m\u001b[43mmethod\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 344\u001b[0m \u001b[43m \u001b[49m\u001b[43mtarget\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 345\u001b[0m \u001b[43m \u001b[49m\u001b[43mbody\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mrequest_body\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 346\u001b[0m \u001b[43m \u001b[49m\u001b[43mheaders\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mrequest_headers\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 347\u001b[0m \u001b[43m \u001b[49m\u001b[43mrequest_timeout\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mrequest_timeout\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 348\u001b[0m \u001b[43m \u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 349\u001b[0m _logger\u001b[38;5;241m.\u001b[39minfo(\n\u001b[1;32m 350\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;132;01m%s\u001b[39;00m\u001b[38;5;124m \u001b[39m\u001b[38;5;132;01m%s\u001b[39;00m\u001b[38;5;132;01m%s\u001b[39;00m\u001b[38;5;124m [status:\u001b[39m\u001b[38;5;132;01m%s\u001b[39;00m\u001b[38;5;124m duration:\u001b[39m\u001b[38;5;132;01m%.3f\u001b[39;00m\u001b[38;5;124ms]\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 351\u001b[0m 
\u001b[38;5;241m%\u001b[39m (\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 357\u001b[0m )\n\u001b[1;32m 358\u001b[0m )\n\u001b[1;32m 360\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m method \u001b[38;5;241m!=\u001b[39m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mHEAD\u001b[39m\u001b[38;5;124m\"\u001b[39m:\n",
+ "File \u001b[0;32m~/workplace/elasticsearch-labs/.venv/lib/python3.11/site-packages/elastic_transport/_node/_http_urllib3.py:167\u001b[0m, in \u001b[0;36mUrllib3HttpNode.perform_request\u001b[0;34m(self, method, target, body, headers, request_timeout)\u001b[0m\n\u001b[1;32m 164\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[1;32m 165\u001b[0m body_to_send \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;01mNone\u001b[39;00m\n\u001b[0;32m--> 167\u001b[0m response \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mpool\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43murlopen\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 168\u001b[0m \u001b[43m \u001b[49m\u001b[43mmethod\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 169\u001b[0m \u001b[43m \u001b[49m\u001b[43mtarget\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 170\u001b[0m \u001b[43m \u001b[49m\u001b[43mbody\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mbody_to_send\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 171\u001b[0m \u001b[43m \u001b[49m\u001b[43mretries\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mRetry\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;28;43;01mFalse\u001b[39;49;00m\u001b[43m)\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 172\u001b[0m \u001b[43m \u001b[49m\u001b[43mheaders\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mrequest_headers\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 173\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mkw\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;66;43;03m# type: ignore[arg-type]\u001b[39;49;00m\n\u001b[1;32m 174\u001b[0m \u001b[43m\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 175\u001b[0m response_headers \u001b[38;5;241m=\u001b[39m HttpHeaders(response\u001b[38;5;241m.\u001b[39mheaders)\n\u001b[1;32m 176\u001b[0m data \u001b[38;5;241m=\u001b[39m response\u001b[38;5;241m.\u001b[39mdata\n",
+ "File \u001b[0;32m~/workplace/elasticsearch-labs/.venv/lib/python3.11/site-packages/urllib3/connectionpool.py:789\u001b[0m, in \u001b[0;36mHTTPConnectionPool.urlopen\u001b[0;34m(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, preload_content, decode_content, **response_kw)\u001b[0m\n\u001b[1;32m 786\u001b[0m response_conn \u001b[38;5;241m=\u001b[39m conn \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m release_conn \u001b[38;5;28;01melse\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m\n\u001b[1;32m 788\u001b[0m \u001b[38;5;66;03m# Make the request on the HTTPConnection object\u001b[39;00m\n\u001b[0;32m--> 789\u001b[0m response \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_make_request\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 790\u001b[0m \u001b[43m \u001b[49m\u001b[43mconn\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 791\u001b[0m \u001b[43m \u001b[49m\u001b[43mmethod\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 792\u001b[0m \u001b[43m \u001b[49m\u001b[43murl\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 793\u001b[0m \u001b[43m \u001b[49m\u001b[43mtimeout\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mtimeout_obj\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 794\u001b[0m \u001b[43m \u001b[49m\u001b[43mbody\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mbody\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 795\u001b[0m \u001b[43m \u001b[49m\u001b[43mheaders\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mheaders\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 796\u001b[0m \u001b[43m \u001b[49m\u001b[43mchunked\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mchunked\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 797\u001b[0m \u001b[43m \u001b[49m\u001b[43mretries\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mretries\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 798\u001b[0m \u001b[43m 
\u001b[49m\u001b[43mresponse_conn\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mresponse_conn\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 799\u001b[0m \u001b[43m \u001b[49m\u001b[43mpreload_content\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mpreload_content\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 800\u001b[0m \u001b[43m \u001b[49m\u001b[43mdecode_content\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mdecode_content\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 801\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mresponse_kw\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 802\u001b[0m \u001b[43m\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 804\u001b[0m \u001b[38;5;66;03m# Everything went great!\u001b[39;00m\n\u001b[1;32m 805\u001b[0m clean_exit \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;01mTrue\u001b[39;00m\n",
+ "File \u001b[0;32m~/workplace/elasticsearch-labs/.venv/lib/python3.11/site-packages/urllib3/connectionpool.py:536\u001b[0m, in \u001b[0;36mHTTPConnectionPool._make_request\u001b[0;34m(self, conn, method, url, body, headers, retries, timeout, chunked, response_conn, preload_content, decode_content, enforce_content_length)\u001b[0m\n\u001b[1;32m 534\u001b[0m \u001b[38;5;66;03m# Receive the response from the server\u001b[39;00m\n\u001b[1;32m 535\u001b[0m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[0;32m--> 536\u001b[0m response \u001b[38;5;241m=\u001b[39m \u001b[43mconn\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mgetresponse\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 537\u001b[0m \u001b[38;5;28;01mexcept\u001b[39;00m (BaseSSLError, \u001b[38;5;167;01mOSError\u001b[39;00m) \u001b[38;5;28;01mas\u001b[39;00m e:\n\u001b[1;32m 538\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_raise_timeout(err\u001b[38;5;241m=\u001b[39me, url\u001b[38;5;241m=\u001b[39murl, timeout_value\u001b[38;5;241m=\u001b[39mread_timeout)\n",
+ "File \u001b[0;32m~/workplace/elasticsearch-labs/.venv/lib/python3.11/site-packages/urllib3/connection.py:464\u001b[0m, in \u001b[0;36mHTTPConnection.getresponse\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m 461\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mresponse\u001b[39;00m \u001b[38;5;28;01mimport\u001b[39;00m HTTPResponse\n\u001b[1;32m 463\u001b[0m \u001b[38;5;66;03m# Get the response from http.client.HTTPConnection\u001b[39;00m\n\u001b[0;32m--> 464\u001b[0m httplib_response \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;43msuper\u001b[39;49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mgetresponse\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 466\u001b[0m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[1;32m 467\u001b[0m assert_header_parsing(httplib_response\u001b[38;5;241m.\u001b[39mmsg)\n",
+ "File \u001b[0;32m~/.pyenv/versions/3.11.4/lib/python3.11/http/client.py:1378\u001b[0m, in \u001b[0;36mHTTPConnection.getresponse\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m 1376\u001b[0m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[1;32m 1377\u001b[0m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[0;32m-> 1378\u001b[0m \u001b[43mresponse\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mbegin\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 1379\u001b[0m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mConnectionError\u001b[39;00m:\n\u001b[1;32m 1380\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mclose()\n",
+ "File \u001b[0;32m~/.pyenv/versions/3.11.4/lib/python3.11/http/client.py:318\u001b[0m, in \u001b[0;36mHTTPResponse.begin\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m 316\u001b[0m \u001b[38;5;66;03m# read until we get a non-100 response\u001b[39;00m\n\u001b[1;32m 317\u001b[0m \u001b[38;5;28;01mwhile\u001b[39;00m \u001b[38;5;28;01mTrue\u001b[39;00m:\n\u001b[0;32m--> 318\u001b[0m version, status, reason \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_read_status\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 319\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m status \u001b[38;5;241m!=\u001b[39m CONTINUE:\n\u001b[1;32m 320\u001b[0m \u001b[38;5;28;01mbreak\u001b[39;00m\n",
+ "File \u001b[0;32m~/.pyenv/versions/3.11.4/lib/python3.11/http/client.py:279\u001b[0m, in \u001b[0;36mHTTPResponse._read_status\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m 278\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m \u001b[38;5;21m_read_status\u001b[39m(\u001b[38;5;28mself\u001b[39m):\n\u001b[0;32m--> 279\u001b[0m line \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mstr\u001b[39m(\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mfp\u001b[38;5;241m.\u001b[39mreadline(_MAXLINE \u001b[38;5;241m+\u001b[39m \u001b[38;5;241m1\u001b[39m), \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124miso-8859-1\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n\u001b[1;32m 280\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mlen\u001b[39m(line) \u001b[38;5;241m>\u001b[39m _MAXLINE:\n\u001b[1;32m 281\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m LineTooLong(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mstatus line\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n",
+ "File \u001b[0;32m~/.pyenv/versions/3.11.4/lib/python3.11/socket.py:706\u001b[0m, in \u001b[0;36mSocketIO.readinto\u001b[0;34m(self, b)\u001b[0m\n\u001b[1;32m 704\u001b[0m \u001b[38;5;28;01mwhile\u001b[39;00m \u001b[38;5;28;01mTrue\u001b[39;00m:\n\u001b[1;32m 705\u001b[0m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[0;32m--> 706\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_sock\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mrecv_into\u001b[49m\u001b[43m(\u001b[49m\u001b[43mb\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 707\u001b[0m \u001b[38;5;28;01mexcept\u001b[39;00m timeout:\n\u001b[1;32m 708\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_timeout_occurred \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;01mTrue\u001b[39;00m\n",
+ "File \u001b[0;32m~/.pyenv/versions/3.11.4/lib/python3.11/ssl.py:1278\u001b[0m, in \u001b[0;36mSSLSocket.recv_into\u001b[0;34m(self, buffer, nbytes, flags)\u001b[0m\n\u001b[1;32m 1274\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m flags \u001b[38;5;241m!=\u001b[39m \u001b[38;5;241m0\u001b[39m:\n\u001b[1;32m 1275\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mValueError\u001b[39;00m(\n\u001b[1;32m 1276\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mnon-zero flags not allowed in calls to recv_into() on \u001b[39m\u001b[38;5;132;01m%s\u001b[39;00m\u001b[38;5;124m\"\u001b[39m \u001b[38;5;241m%\u001b[39m\n\u001b[1;32m 1277\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m\u001b[38;5;18m__class__\u001b[39m)\n\u001b[0;32m-> 1278\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mread\u001b[49m\u001b[43m(\u001b[49m\u001b[43mnbytes\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mbuffer\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 1279\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[1;32m 1280\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28msuper\u001b[39m()\u001b[38;5;241m.\u001b[39mrecv_into(buffer, nbytes, flags)\n",
+ "File \u001b[0;32m~/.pyenv/versions/3.11.4/lib/python3.11/ssl.py:1134\u001b[0m, in \u001b[0;36mSSLSocket.read\u001b[0;34m(self, len, buffer)\u001b[0m\n\u001b[1;32m 1132\u001b[0m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[1;32m 1133\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m buffer \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n\u001b[0;32m-> 1134\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_sslobj\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mread\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;28;43mlen\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mbuffer\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 1135\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[1;32m 1136\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_sslobj\u001b[38;5;241m.\u001b[39mread(\u001b[38;5;28mlen\u001b[39m)\n",
+ "\u001b[0;31mKeyboardInterrupt\u001b[0m: "
+ ]
+ }
+ ],
+ "source": [
+ "MAX_BULK_SIZE = 1000\n",
+ "MAX_BULK_UPLOADS = 1000\n",
+ "\n",
+ "# sentinel marks an exhausted dataset iterator\n",
+ "sentinel = object()\n",
+ "for j in range(MAX_BULK_UPLOADS):\n",
+ "    documents = []\n",
+ "    # take one document from each language dataset in turn until the batch is full\n",
+ "    while len(documents) < MAX_BULK_SIZE - len(all_langs_datasets):\n",
+ "        for ds in all_langs_datasets:\n",
+ "            text = next(ds, sentinel)\n",
+ "            if text is not sentinel:\n",
+ "                documents.append(\n",
+ "                    {\n",
+ "                        \"_index\": \"hf-semantic-text-index\",\n",
+ "                        \"_source\": {\"text_field\": text[\"text\"]},\n",
+ "                    }\n",
+ "                )\n",
+ "                # if you are using an ingest pipeline instead of a\n",
+ "                # semantic text field, use this instead:\n",
+ "                # documents.append(\n",
+ "                #     {\n",
+ "                #         \"_index\": \"hf-endpoint-index\",\n",
+ "                #         \"_source\": {\"text\": text['text']},\n",
+ "                #     }\n",
+ "                # )\n",
+ "\n",
+ "    try:\n",
+ "        response = helpers.bulk(client, documents, raise_on_error=False, timeout=\"60s\")\n",
+ "        print(\"Docs uploaded:\", (j + 1) * MAX_BULK_SIZE)\n",
+ "\n",
+ "    except Exception as e:\n",
+ "        print(\"exception:\", str(e))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "cf0f6df7",
+ "metadata": {},
+ "source": [
+ "## Semantic search\n",
+ "\n",
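+ "After the dataset has been enriched with the embeddings, you can query the data using semantic search. Because `hf-semantic-text-index` uses a semantic text field (`infer_field` in the query below), run a `semantic` query against that field and provide the query text; Elasticsearch embeds the query with the inference endpoint attached to the field. If you used an ingest pipeline instead, pass a `query_vector_builder` to the [k-nearest neighbor (kNN) vector search API](https://www.elastic.co/guide/en/elasticsearch/reference/current/knn-search.html#knn-semantic-search), and provide the query text and the model you used to create the embeddings."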
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "id": "d9b21b71",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "query = \"English speaking countries\"\n",
+ "semantic_search_results = client.search(\n",
+ " index=\"hf-semantic-text-index\",\n",
+ " query={\"semantic\": {\"field\": \"infer_field\", \"query\": query}},\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "id": "8ef79d16",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "\n",
+ "ID: DDbC4pEBhYre9Ocn7zIr\n",
+ "Score: 0.92574656\n",
+ "Text: Orodha ya nchi kufuatana na wakazi\n",
+ "\n",
+ "ID: bjbC4pEBhYre9OcnzC3U\n",
+ "Score: 0.9159906\n",
+ "Text: Intercontinental Cup\n",
+ "\n",
+ "ID: njbC4pEBhYre9OcnzC3U\n",
+ "Score: 0.91523564\n",
+ "Text: รายการจัดเรียงตามทวีปและประเทศ\n",
+ "\n",
+ "ID: bDbC4pEBhYre9Ocn3jBM\n",
+ "Score: 0.9142189\n",
+ "Text: a b c ĉ d e f g ĝ h ĥ i j ĵ k l m n o p r s ŝ t u ŭ v z\n",
+ "\n",
+ "ID: 8jbD4pEBhYre9OcnDTSL\n",
+ "Score: 0.9127883\n",
+ "Text: With Australia:\n",
+ "With Adelaide United:\n",
+ "\n",
+ "ID: MzbC4pEBhYre9Ocn_TQ1\n",
+ "Score: 0.9116771\n",
+ "Text: Más información en .\n",
+ "\n",
+ "ID: _DbC4pEBhYre9Ocn7zEr\n",
+ "Score: 0.9106927\n",
+ "Text: (AS)= Asia (AF)= Afrika (NA)= Amerika ya kaskazini (SA)= Amerika ya kusini (A)= Antaktika (EU)= Ulaya na (AU)= Australia na nchi za Pasifiki.\n",
+ "\n",
+ "ID: fDbC4pEBhYre9Ocn7zEr\n",
+ "Score: 0.9096315\n",
+ "Text: Stadi za lugha ya mazungumzo ni kuzungumza na kusikiliza.\n",
+ "\n",
+ "ID: DDbC4pEBhYre9Ocn3jBL\n",
+ "Score: 0.90771043\n",
+ "Text: \"*(Meksiko mara nyingi huhesabiwa katika Amerika ya Kati kwa sababu za kiutamaduni)\"\n",
+ "\n",
+ "ID: IjbC4pEBhYre9Ocn3i9L\n",
+ "Score: 0.9070151\n",
+ "Text: Englan is a small village in the district of Wokha, in the Nagaland state of India. Its name literally means \"The Path of the Sun\". It is one of the main centers of the district and is an active center of the Lotha language and culture.\n"
+ ]
+ }
+ ],
+ "source": [
+ "pretty_search_response(semantic_search_results)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 17,
+ "id": "a7053b1e",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "ObjectApiResponse({'inference_id': 'my_cohere_rerank_endpoint', 'task_type': 'rerank', 'service': 'cohere', 'service_settings': {'model_id': 'rerank-english-v3.0', 'rate_limit': {'requests_per_minute': 10000}}, 'task_settings': {'top_n': 100, 'return_documents': True}})"
+ ]
+ },
+ "execution_count": 17,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "try:\n",
+ " client.inference.delete(inference_id=\"my_cohere_rerank_endpoint\")\n",
+ "except Exception:\n",
+ " pass\n",
+ "client.inference.put(\n",
+ " task_type=\"rerank\",\n",
+ " inference_id=\"my_cohere_rerank_endpoint\",\n",
+ " body={\n",
+ " \"service\": \"cohere\",\n",
+ " \"service_settings\": {\n",
+ " \"api_key\": \"\",\n",
+ " \"model_id\": \"rerank-english-v3.0\",\n",
+ " },\n",
+ " \"task_settings\": {\"top_n\": 100, \"return_documents\": True},\n",
+ " },\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 18,
+ "id": "6fba4f22",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "reranked_search_results = client.search(\n",
+ " index=\"hf-semantic-text-index\",\n",
+ " retriever={\n",
+ " \"text_similarity_reranker\": {\n",
+ " \"retriever\": {\n",
+ " \"standard\": {\n",
+ " \"query\": {\"semantic\": {\"field\": \"infer_field\", \"query\": query}}\n",
+ " }\n",
+ " },\n",
+ " \"field\": \"text_field\",\n",
+ " \"inference_id\": \"my_cohere_rerank_endpoint\",\n",
+ " \"inference_text\": query,\n",
+ " \"rank_window_size\": 100,\n",
+ " }\n",
+ " },\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 19,
+ "id": "fd4ef932",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "\n",
+ "ID: _DbC4pEBhYre9Ocn7zEr\n",
+ "Score: 0.1766716\n",
+ "Text: (AS)= Asia (AF)= Afrika (NA)= Amerika ya kaskazini (SA)= Amerika ya kusini (A)= Antaktika (EU)= Ulaya na (AU)= Australia na nchi za Pasifiki.\n",
+ "\n",
+ "ID: zDbC4pEBhYre9OcnzC7V\n",
+ "Score: 0.06394842\n",
+ "Text: Waingereza nao wakatawala Afrika Mashariki na Kusini, na kuwa sehemu ya Sudan na Somalia, Uganda, Kenya, Tanzania (chini ya jina la Tanganyika), Zanzibar, Nyasaland, Rhodesia, Bechuanaland, Basutoland na Swaziland chini ya utawala wao na baada ya kushinda katika vita huko Afrika ya Kusini walitawala Transvaal, Orange Free State, Cape Colony na Natal, na huko Afrika ya Magharibi walitawala Gambia, Sierra Leone, the Gold Coast na Nigeria.\n",
+ "\n",
+ "ID: bDbC4pEBhYre9Ocn3jBM\n",
+ "Score: 0.013532149\n",
+ "Text: a b c ĉ d e f g ĝ h ĥ i j ĵ k l m n o p r s ŝ t u ŭ v z\n",
+ "\n",
+ "ID: LDbD4pEBhYre9OcnHje5\n",
+ "Score: 0.010130412\n",
+ "Text: Mifano maarufu ya bunge ni Majumba ya Bunge mjini London, Kongresi mjini Washingtin D.C., Bundestag mjini Berlin na Duma nchini Moscow, Parlamento Italiano mjini Roma na \"Assemblée nationale\" mjini Paris. Kwa kanuni ya serikali wakilishi watu hupigia kura wanasiasa ili watimize \"matakwa\" yao. Ingawa nchi kama Israeli, Ugiriki, Uswidi na Uchina zina nyumba moja ya bunge, nchi nyingi zina nyumba mbili za bunge, kumaanisha kuwa zina nyumba mbili za kibunge zinazochaguliwa tofauti. Katika 'nyumba ya chini' wanasiasa wanachaguliwa kuwakilisha maeneo wakilishi bungeni. 'Nymba ya juu' kawaida huchaguliwa kuwakilisha majimbo katika mfumo wa majimbo (kama vile nchii Australia, Ujerumani au Marekani) au upigaji kura tofauti katika katika mfumo wa umoja (kama vile nchini Ufaransa). Nchini Uingereza nyumba ya juu inachaguliwa na na serikali kama nyumba ya marudio. Ukosoaji mmoja wa mifumo yenye nyumba mbili yenye nyumba mbili zilizochaguliwa ni kuwa nyumba ya juu na ya chini huenda zikafanana. Utetezi wa tangu jadi wa mifumo ya nyumba mbili nni kuwa chumba cha juu huwa kama nyumba ya marekebisho. Hili linaweza kupunguza uonevu na dhuluma katika hatua ya kiserikali\", 101\n",
+ "\n",
+ "ID: lzbC4pEBhYre9Ocn7zIr\n",
+ "Score: 0.0033897832\n",
+ "Text: इसके अलावा हिन्दी और संस्कृत में\n",
+ "\n",
+ "ID: wDbC4pEBhYre9Ocn7zIr\n",
+ "Score: 0.0025311112\n",
+ "Text: 2. التزام بريطانيا وفرنسا وفيما بعد إيطاليا بإدارة دولية لفلسطين.\n",
+ "\n",
+ "ID: IjbC4pEBhYre9Ocn3i9L\n",
+ "Score: 0.0023596606\n",
+ "Text: Englan is a small village in the district of Wokha, in the Nagaland state of India. Its name literally means \"The Path of the Sun\". It is one of the main centers of the district and is an active center of the Lotha language and culture.\n",
+ "\n",
+ "ID: jTbD4pEBhYre9OcnDTWL\n",
+ "Score: 0.0022694687\n",
+ "Text: ఇండియా గేటు\n",
+ "\n",
+ "ID: 4zbC4pEBhYre9Ocn_TM0\n",
+ "Score: 0.0018458483\n",
+ "Text: Más información en la web de la Generalidad Valenciana o en la web de la FEDME\n",
+ "\n",
+ "ID: 8jbD4pEBhYre9OcnDTSL\n",
+ "Score: 0.0016875096\n",
+ "Text: With Australia:\n",
+ "With Adelaide United:\n"
+ ]
+ }
+ ],
+ "source": [
+ "pretty_search_response(reranked_search_results)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7e4055ba",
+ "metadata": {},
+ "source": [
+ "**NOTE:** If you query with a `query_vector_builder` instead of the `semantic` query, the value of `model_id` in the `query_vector_builder` must match the value of `inference_id` you created in the [first step](#create-the-inference-endpoint)."
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.11.4"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}