-
Notifications
You must be signed in to change notification settings - Fork 1.8k
[ENH] add chroma cloud embedding function to python client #5246
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Reviewer ChecklistPlease leverage this checklist to ensure your code review is thorough before approving Testing, Bugs, Errors, Logs, Documentation
System Compatibility
Quality
|
Chroma Cloud Embedding Function Integration for Python Client This PR introduces a new embedding function for the Python client that enables integration with the Chroma Cloud embedding service, currently supporting only the BAAI/bge-m3 model. The implementation provides a formal EmbeddingFunction interface, supports loading configuration from environment variables, enforces key validation and error handling, and adds thorough tests to cover initialization, input handling, config serialization, and config validation. The new embedding function is also registered and exposed in the embedding function registry for programmatic access. Key Changes• Created Affected Areas• chromadb/utils/ This summary was automatically generated by @propel-code-bot |
self._api_url = "https://api.trychroma.com/api/v2/embed" | ||
self._session = httpx.Client() | ||
self._session.headers.update( | ||
{"x-chroma-token": self.api_key} | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[BestPractice]
The API key is being passed directly to the header without considering potential timeout scenarios or connection pooling. Consider adding timeout configuration to prevent hanging requests:
self._api_url = "https://api.trychroma.com/api/v2/embed" | |
self._session = httpx.Client() | |
self._session.headers.update( | |
{"x-chroma-token": self.api_key} | |
) | |
self._api_url = "https://api.trychroma.com/api/v2/embed" | |
self._session = httpx.Client(timeout=30.0) | |
self._session.headers.update( | |
{"x-chroma-token": self.api_key} | |
) |
⚡ Committable suggestion
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.
# Get embeddings from /embed endpoint | ||
response = self._session.post( | ||
self._api_url, | ||
json={"model": str(self.model.value), "texts": input, "tenant_uuid": self.tenant_uuid}, | ||
).json() | ||
|
||
# Extract embeddings from response | ||
return [np.array(data.embedding, dtype=np.float32) for data in response.data] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[BestPractice]
The HTTP request lacks proper error handling. Network failures, API errors, or malformed responses could cause the application to crash. Add comprehensive error handling:
# Get embeddings from /embed endpoint | |
response = self._session.post( | |
self._api_url, | |
json={"model": str(self.model.value), "texts": input, "tenant_uuid": self.tenant_uuid}, | |
).json() | |
# Extract embeddings from response | |
return [np.array(data.embedding, dtype=np.float32) for data in response.data] | |
# Get embeddings from /embed endpoint | |
try: | |
response = self._session.post( | |
self._api_url, | |
json={"model": str(self.model.value), "texts": input, "tenant_uuid": self.tenant_uuid}, | |
) | |
response.raise_for_status() | |
response_data = response.json() | |
except httpx.RequestError as e: | |
raise ValueError(f"Failed to connect to Chroma Cloud API: {e}") | |
except httpx.HTTPStatusError as e: | |
raise ValueError(f"Chroma Cloud API returned error {e.response.status_code}: {e.response.text}") | |
except Exception as e: | |
raise ValueError(f"Failed to parse API response: {e}") | |
# Extract embeddings from response | |
if not hasattr(response_data, 'data') or not response_data.data: | |
raise ValueError("Invalid response format from Chroma Cloud API") | |
return [np.array(data.embedding, dtype=np.float32) for data in response_data.data] |
⚡ Committable suggestion
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.
""" | ||
Initialize the ChromaCloudEmbeddingFunction. | ||
|
||
Args: | ||
api_key (str, optional): The API key for the Chroma API. If not provided, | ||
it will be read from the environment variable specified by api_key_env_var. | ||
api_key_env_var (str, optional): Environment variable name that contains your API key. | ||
Defaults to "CHROMA_API_KEY". | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Documentation]
The docstring parameters don't match the actual function parameters. The docstring mentions api_key
and api_key_env_var
but omits model
and tenant_uuid
which are required parameters:
""" | |
Initialize the ChromaCloudEmbeddingFunction. | |
Args: | |
api_key (str, optional): The API key for the Chroma API. If not provided, | |
it will be read from the environment variable specified by api_key_env_var. | |
api_key_env_var (str, optional): Environment variable name that contains your API key. | |
Defaults to "CHROMA_API_KEY". | |
""" | |
""" | |
Initialize the ChromaCloudEmbeddingFunction. | |
Args: | |
model (ChromaCloudEmbeddingModel): The embedding model to use. | |
tenant_uuid (str): The UUID of the tenant for the Chroma Cloud service. | |
api_key (str, optional): The API key for the Chroma API. If not provided, | |
it will be read from the environment variable specified by api_key_env_var. | |
api_key_env_var (str, optional): Environment variable name that contains your API key. | |
Defaults to "CHROMA_API_KEY". | |
""" |
⚡ Committable suggestion
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.
|
||
@staticmethod | ||
def name() -> str: | ||
return "chroma_hosted" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[CriticalError]
There's a naming inconsistency between the function name returned by name()
method and the key used in the known_embedding_functions
dictionary. The function returns "chroma_hosted"
but it's registered as "chroma_cloud"
. This could cause configuration issues:
return "chroma_hosted" | |
return "chroma_cloud" |
⚡ Committable suggestion
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.
def get_config(self) -> Dict[str, Any]: | ||
return { | ||
"api_key_env_var": self.api_key_env_var, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[CriticalError]
The get_config()
method is missing the model
and tenant_uuid
parameters which are essential for reconstructing the embedding function. This could lead to incomplete configuration serialization:
def get_config(self) -> Dict[str, Any]: | |
return { | |
"api_key_env_var": self.api_key_env_var, | |
} | |
def get_config(self) -> Dict[str, Any]: | |
return { | |
"model": self.model.value, | |
"tenant_uuid": self.tenant_uuid, | |
"api_key_env_var": self.api_key_env_var, | |
} |
⚡ Committable suggestion
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.
3 Jobs Failed: PR checks / Python tests / test-rust-bindings (3.9, --ignore-glob 'chromadb/test/property/*' --ignore-glob 'chromadb/test/st...
PR checks / Python tests / test-rust-single-node-integration (3.9, --ignore-glob 'chromadb/test/property/*' --ignore-glob 'c...
PR checks / all-required-pr-checks-passed
Summary: 1 successful workflow, 1 failed workflow
Last updated: 2025-08-12 03:26:27 UTC |
try: | ||
model = ChromaCloudEmbeddingModel(model) | ||
except: | ||
raise ValueError("The valid ") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[BestPractice]
The error message on line 42 is incomplete and provides no useful information. A proper error message should explain what valid models are supported.
raise ValueError("The valid ") | |
raise ValueError("The model is not supported. Valid models are: " + ", ".join([m.value for m in ChromaCloudEmbeddingModel])) |
⚡ Committable suggestion
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.
|
||
try: | ||
model = ChromaCloudEmbeddingModel(model) | ||
except: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[BestPractice]
The bare except:
clause on line 41 is too broad and can catch unexpected exceptions like KeyboardInterrupt or SystemExit. Use a specific exception type.
except: | |
except ValueError: |
⚡ Committable suggestion
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.
Raises: | ||
ValidationError: If the configuration does not match the schema | ||
""" | ||
validate_config_schema(config, "chroma_cloud") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[CriticalError]
The validate_config_schema()
call uses "chroma_cloud" but this should match the name returned by the name()
method. Given the test expects "chroma_hosted", this needs to be consistent across the codebase.
api_key_env_var=API_KEY_ENV_VAR, | ||
) | ||
|
||
assert ef.name() == "chroma_hosted" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[CriticalError]
The test expects the name to be "chroma_hosted" but the implementation returns "chroma_cloud". This inconsistency needs to be resolved.
"api_key_env_var": API_KEY_ENV_VAR, | ||
} | ||
ChromaCloudEmbeddingFunction.validate_config(valid_config) | ||
mock_validate.assert_called_once_with(valid_config, "chroma_hosted") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[CriticalError]
The test expects "chroma_hosted" but the implementation's validate_config_schema()
call uses "chroma_cloud". This mismatch will cause the test to fail.
if not self.api_key: | ||
raise ValueError(f"The {api_key_env_var} environment variable is not set.") | ||
|
||
self._api_url = "https://chroma-core-staging--chroma-trieve-ingest-immediateembed-10d1fc.modal.run" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[BestPractice]
The API URL is hardcoded to a staging environment. While this is acceptable for a demo as mentioned in the PR description, for future production use, consider making this configurable, for example via an environment variable or a parameter to the constructor.
Description of changes
Adds the Chroma Cloud embedding function to the Python client. Currently only supports BAAI/bge-m3 as the embedding model.
Test plan
pytest
for python,yarn test
for js,cargo test
for rustMigration plan
N/A
Observability plan
N/A
Documentation Changes
We should add docs when this is GA but since this is just for a demo, no new docs changes