Skip to content

Conversation

c-gamble
Copy link
Contributor

Description of changes

Adds the Chroma Cloud embedding function to the Python client. Currently only supports BAAI/bge-m3 as the embedding model.

Test plan

  • Tests pass locally with pytest for python, yarn test for js, cargo test for rust

Migration plan

N/A

Observability plan

N/A

Documentation Changes

We should add docs when this is GA but since this is just for a demo, no new docs changes

Copy link

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of a unexpectedly high quality (Readability, Modularity, Intuitiveness)

Copy link
Contributor

propel-code-bot bot commented Aug 11, 2025

Chroma Cloud Embedding Function Integration for Python Client

This PR introduces a new embedding function for the Python client that enables integration with the Chroma Cloud embedding service, currently supporting only the BAAI/bge-m3 model. The implementation provides a formal EmbeddingFunction interface, supports loading configuration from environment variables, enforces key validation and error handling, and adds thorough tests to cover initialization, input handling, config serialization, and config validation. The new embedding function is also registered and exposed in the embedding function registry for programmatic access.

Key Changes

• Created chroma_cloud_embedding_function.py to implement the ChromaCloudEmbeddingFunction class, which calls Chroma Cloud endpoints to obtain embeddings.
• Defined a single supported embedding model (BAAI/bge-m3) using an Enum for validation.
• Implemented error handling for missing packages, API connection failures, and response validation.
• Wired up environment variable-based API key retrieval, with deprecation warning for direct API key usage.
• Integrated config round-trip and build_from_config logic for reproducible instantiation.
• Hooked the embedding function into known_embedding_functions, all, and module-level imports in init.py.
• Created a suite of pytest tests (mocked) for instantiation, call behavior, config methods, and config validation.
• Updated the module's public API and registration dictionary to recognize the new embedding function.

Affected Areas

• chromadb/utils/embedding_functions/chroma_cloud_embedding_function.py
• chromadb/utils/embedding_functions/init.py
• chromadb/test/ef/test_chroma_cloud_ef.py

This summary was automatically generated by @propel-code-bot

Comment on lines 50 to 54
self._api_url = "https://api.trychroma.com/api/v2/embed"
self._session = httpx.Client()
self._session.headers.update(
{"x-chroma-token": self.api_key}
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[BestPractice]

The API key is being passed directly to the header without considering potential timeout scenarios or connection pooling. Consider adding timeout configuration to prevent hanging requests:

Suggested change
self._api_url = "https://api.trychroma.com/api/v2/embed"
self._session = httpx.Client()
self._session.headers.update(
{"x-chroma-token": self.api_key}
)
self._api_url = "https://api.trychroma.com/api/v2/embed"
self._session = httpx.Client(timeout=30.0)
self._session.headers.update(
{"x-chroma-token": self.api_key}
)

Committable suggestion

Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.

Comment on lines 70 to 77
# Get embeddings from /embed endpoint
response = self._session.post(
self._api_url,
json={"model": str(self.model.value), "texts": input, "tenant_uuid": self.tenant_uuid},
).json()

# Extract embeddings from response
return [np.array(data.embedding, dtype=np.float32) for data in response.data]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[BestPractice]

The HTTP request lacks proper error handling. Network failures, API errors, or malformed responses could cause the application to crash. Add comprehensive error handling:

Suggested change
# Get embeddings from /embed endpoint
response = self._session.post(
self._api_url,
json={"model": str(self.model.value), "texts": input, "tenant_uuid": self.tenant_uuid},
).json()
# Extract embeddings from response
return [np.array(data.embedding, dtype=np.float32) for data in response.data]
# Get embeddings from /embed endpoint
try:
response = self._session.post(
self._api_url,
json={"model": str(self.model.value), "texts": input, "tenant_uuid": self.tenant_uuid},
)
response.raise_for_status()
response_data = response.json()
except httpx.RequestError as e:
raise ValueError(f"Failed to connect to Chroma Cloud API: {e}")
except httpx.HTTPStatusError as e:
raise ValueError(f"Chroma Cloud API returned error {e.response.status_code}: {e.response.text}")
except Exception as e:
raise ValueError(f"Failed to parse API response: {e}")
# Extract embeddings from response
if not hasattr(response_data, 'data') or not response_data.data:
raise ValueError("Invalid response format from Chroma Cloud API")
return [np.array(data.embedding, dtype=np.float32) for data in response_data.data]

Committable suggestion

Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.

Comment on lines 20 to 28
"""
Initialize the ChromaCloudEmbeddingFunction.

Args:
api_key (str, optional): The API key for the Chroma API. If not provided,
it will be read from the environment variable specified by api_key_env_var.
api_key_env_var (str, optional): Environment variable name that contains your API key.
Defaults to "CHROMA_API_KEY".
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Documentation]

The docstring parameters don't match the actual function parameters. The docstring mentions api_key and api_key_env_var but omits model and tenant_uuid which are required parameters:

Suggested change
"""
Initialize the ChromaCloudEmbeddingFunction.
Args:
api_key (str, optional): The API key for the Chroma API. If not provided,
it will be read from the environment variable specified by api_key_env_var.
api_key_env_var (str, optional): Environment variable name that contains your API key.
Defaults to "CHROMA_API_KEY".
"""
"""
Initialize the ChromaCloudEmbeddingFunction.
Args:
model (ChromaCloudEmbeddingModel): The embedding model to use.
tenant_uuid (str): The UUID of the tenant for the Chroma Cloud service.
api_key (str, optional): The API key for the Chroma API. If not provided,
it will be read from the environment variable specified by api_key_env_var.
api_key_env_var (str, optional): Environment variable name that contains your API key.
Defaults to "CHROMA_API_KEY".
"""

Committable suggestion

Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.


@staticmethod
def name() -> str:
return "chroma_hosted"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[CriticalError]

There's a naming inconsistency between the function name returned by name() method and the key used in the known_embedding_functions dictionary. The function returns "chroma_hosted" but it's registered as "chroma_cloud". This could cause configuration issues:

Suggested change
return "chroma_hosted"
return "chroma_cloud"

Committable suggestion

Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.

Comment on lines +106 to +109
def get_config(self) -> Dict[str, Any]:
return {
"api_key_env_var": self.api_key_env_var,
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[CriticalError]

The get_config() method is missing the model and tenant_uuid parameters which are essential for reconstructing the embedding function. This could lead to incomplete configuration serialization:

Suggested change
def get_config(self) -> Dict[str, Any]:
return {
"api_key_env_var": self.api_key_env_var,
}
def get_config(self) -> Dict[str, Any]:
return {
"model": self.model.value,
"tenant_uuid": self.tenant_uuid,
"api_key_env_var": self.api_key_env_var,
}

Committable suggestion

Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.

Copy link
Contributor Author

c-gamble commented Aug 11, 2025

3 Jobs Failed:

PR checks / Python tests / test-rust-bindings (3.9, --ignore-glob 'chromadb/test/property/*' --ignore-glob 'chromadb/test/st...
Step "Test" from job "Python tests / test-rust-bindings (3.9, --ignore-glob 'chromadb/test/property/*' --ignore-glob 'chromadb/test/st..." is failing. The last 20 log lines are:

[...]
        'GoogleVertexEmbeddingFunction',
        'HuggingFaceEmbeddingFunction',
        'HuggingFaceEmbeddingServer',
        'InstructorEmbeddingFunction',
        'JinaEmbeddingFunction',
        'MistralEmbeddingFunction',
        'MorphEmbeddingFunction',
        'ONNXMiniLM_L6_V2',
        'OllamaEmbeddingFunction',
        'OpenAIEmbeddingFunction',
        'OpenCLIPEmbeddingFunction',
        'RoboflowEmbeddingFunction',
        'SentenceTransformerEmbeddingFunction',
        'Text2VecEmbeddingFunction',
        'TogetherAIEmbeddingFunction',
        'VoyageAIEmbeddingFunction',
    }
ERROR chromadb/test/ef/test_chroma_cloud_ef.py::test_chroma_cloud_embedding_function_init - KeyError: 'CHROMA_API_KEY'
= 5 failed, 375 passed, 30 skipped, 1 xfailed, 243 warnings, 1 error in 153.93s (0:02:33) =
Error: Process completed with exit code 1.
PR checks / Python tests / test-rust-single-node-integration (3.9, --ignore-glob 'chromadb/test/property/*' --ignore-glob 'c...
Step "Rust Integration Test" from job "Python tests / test-rust-single-node-integration (3.9, --ignore-glob 'chromadb/test/property/*' --ignore-glob 'c..." is failing. The last 20 log lines are:

[...]
        'InstructorEmbeddingFunction',
        'JinaEmbeddingFunction',
        'MistralEmbeddingFunction',
        'MorphEmbeddingFunction',
        'ONNXMiniLM_L6_V2',
        'OllamaEmbeddingFunction',
        'OpenAIEmbeddingFunction',
        'OpenCLIPEmbeddingFunction',
        'RoboflowEmbeddingFunction',
        'SentenceTransformerEmbeddingFunction',
        'Text2VecEmbeddingFunction',
        'TogetherAIEmbeddingFunction',
        'VoyageAIEmbeddingFunction',
    }
ERROR chromadb/test/ef/test_chroma_cloud_ef.py::test_chroma_cloud_embedding_function_init - KeyError: 'CHROMA_API_KEY'
= 5 failed, 370 passed, 33 skipped, 1 xfailed, 350 warnings, 1 error in 189.56s (0:03:09) =
  2025-08-12T03:10:51.656729Z ERROR  , name: "BatchSpanProcessor.Flush.ExportError", reason: "ExportFailed(Status { code: Unavailable, message: \", detailed error message: dns error: failed to lookup address information: Temporary failure in name resolution\" })", Failed during the export process
    at /home/runner/.cargo/registry/src/index.crates.io-6f17d22bba15001f/opentelemetry_sdk-0.27.0/src/trace/span_processor.rs:334

Error: Process completed with exit code 1.
PR checks / all-required-pr-checks-passed
Step "Decide whether the needed jobs succeeded or failed" from job "all-required-pr-checks-passed" is failing. The last 20 log lines are:

[...]
}
EOM
)"
shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
env:
  GITHUB_REPO_NAME: chroma-core/chroma
  PYTHONPATH: /home/runner/_work/_actions/re-actors/alls-green/release/v1/src
# ❌ Some of the required to succeed jobs failed 😢😢😢

📝 Job statuses:
📝 python-tests → ❌ failure [required to succeed or be skipped]
📝 python-vulnerability-scan → ✓ success [required to succeed or be skipped]
📝 javascript-client-tests → ✓ success [required to succeed or be skipped]
📝 rust-tests → ✓ success [required to succeed or be skipped]
📝 go-tests → ✓ success [required to succeed or be skipped]
📝 lint → ✓ success [required to succeed]
📝 check-helm-version-bump → ⬜ skipped [required to succeed or be skipped]
📝 delete-helm-comment → ✓ success [required to succeed or be skipped]
Error: Process completed with exit code 1.

Summary: 1 successful workflow, 1 failed workflow

Last updated: 2025-08-12 03:26:27 UTC

try:
model = ChromaCloudEmbeddingModel(model)
except:
raise ValueError("The valid ")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[BestPractice]

The error message on line 42 is incomplete and provides no useful information. A proper error message should explain what valid models are supported.

Suggested change
raise ValueError("The valid ")
raise ValueError("The model is not supported. Valid models are: " + ", ".join([m.value for m in ChromaCloudEmbeddingModel]))

Committable suggestion

Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.


try:
model = ChromaCloudEmbeddingModel(model)
except:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[BestPractice]

The bare except: clause on line 41 is too broad and can catch unexpected exceptions like KeyboardInterrupt or SystemExit. Use a specific exception type.

Suggested change
except:
except ValueError:

Committable suggestion

Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.

Raises:
ValidationError: If the configuration does not match the schema
"""
validate_config_schema(config, "chroma_cloud")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[CriticalError]

The validate_config_schema() call uses "chroma_cloud" but this should match the name returned by the name() method. Given the test expects "chroma_hosted", this needs to be consistent across the codebase.

api_key_env_var=API_KEY_ENV_VAR,
)

assert ef.name() == "chroma_hosted"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[CriticalError]

The test expects the name to be "chroma_hosted" but the implementation returns "chroma_cloud". This inconsistency needs to be resolved.

"api_key_env_var": API_KEY_ENV_VAR,
}
ChromaCloudEmbeddingFunction.validate_config(valid_config)
mock_validate.assert_called_once_with(valid_config, "chroma_hosted")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[CriticalError]

The test expects "chroma_hosted" but the implementation's validate_config_schema() call uses "chroma_cloud". This mismatch will cause the test to fail.

if not self.api_key:
raise ValueError(f"The {api_key_env_var} environment variable is not set.")

self._api_url = "https://chroma-core-staging--chroma-trieve-ingest-immediateembed-10d1fc.modal.run"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[BestPractice]

The API URL is hardcoded to a staging environment. While this is acceptable for a demo as mentioned in the PR description, for future production use, consider making this configurable, for example via an environment variable or a parameter to the constructor.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant