[MOD-9576] Fix memory safety in cosine preprocessing #670

meiravgri · 2025-05-08T04:56:32Z

This PR fixes a memory safety issue in our vector preprocessing pipeline, particularly affecting int8_t and uint8_t vectors with cosine similarity.

The Bug

Previously, CosinePreprocessor copied processed_bytes_count bytes from the source blob.
For int8_t / uint8_t with cosine similarity, that value is the input size plus 4 bytes (for the appended norm), so

memcpy(blob, original_blob, processed_bytes_count);

could read up to 4 bytes past the end of original_blob, which is undefined behaviour.
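To make the size mismatch concrete, here is a minimal sketch of a copy that is bounded by the source size rather than the destination size. `checked_blob_copy` is a hypothetical helper for illustration, not a function in this repository.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not part of the library): copies exactly the bytes the
// source holds and returns how many destination bytes are still unwritten.
// The buggy form passed dst_size as the length, reading past the end of src.
inline std::size_t checked_blob_copy(void *dst, std::size_t dst_size,
                                     const void *src, std::size_t src_size) {
    assert(src_size <= dst_size);    // the destination may be larger, never smaller
    std::memcpy(dst, src, src_size); // bounded by the source, not the destination
    return dst_size - src_size;      // e.g. the 4-byte slot reserved for the norm
}
```

For a 4-element int8_t vector the source holds 4 bytes and the destination holds 8, so the helper copies 4 bytes and reports 4 bytes left over.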

Key Changes

Preprocessors API

Blob-size propagation – each preprocess* function now receives input_blob_size by reference and writes back the new size after it allocates or extends the blob.
Fixed destination size – processed_bytes_count is stored once, in the cosine-preprocessor constructor. It is no longer taken from the preprocess* arguments, so we no longer rely on processed_bytes_count == input_blob_size.
Assertions – any pre-allocated storage_blob or query_blob must already be processed_bytes_count bytes long, which makes in-place processing safe. Dynamic resizing is not implemented for now, to avoid the added complexity and performance cost; if it is needed later, these checks can be removed and proper allocation logic added.
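The three points above can be sketched roughly as follows. This is an illustration under assumptions, not the code in preprocessors.h: the class name `GrowingPreprocessorSketch` is hypothetical, and it uses `std::calloc` where the real code uses the library allocator.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical stand-in for a preprocessor whose output is larger than its input.
class GrowingPreprocessorSketch {
public:
    // The destination size is fixed once, at construction.
    explicit GrowingPreprocessorSketch(std::size_t processed_bytes_count)
        : processed_bytes_count_(processed_bytes_count) {}

    // On success, storage_blob holds processed_bytes_count_ bytes and
    // input_blob_size has been updated to match. The caller frees the blob.
    void preprocessForStorage(const void *original_blob, void *&storage_blob,
                              std::size_t &input_blob_size) const {
        // A blob allocated by an earlier stage must already have the final size.
        assert(storage_blob == nullptr || input_blob_size == processed_bytes_count_);
        if (storage_blob == nullptr) {
            assert(input_blob_size <= processed_bytes_count_);
            // calloc zero-fills, so bytes past input_blob_size are defined.
            storage_blob = std::calloc(processed_bytes_count_, 1);
            if (storage_blob == nullptr)
                return; // allocation failed; leave the size untouched
            // Copy only what the caller actually provided.
            std::memcpy(storage_blob, original_blob, input_blob_size);
        }
        input_blob_size = processed_bytes_count_; // write the new size back
    }

private:
    std::size_t processed_bytes_count_;
};
```

Because the size travels by reference, a caller that passes a 4-byte input to a preprocessor constructed with 8 sees its size variable change from 4 to 8 after the call.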

Factory

Added processed_bytes_count to PreprocessorsContainerParams.
Converted CreatePreprocessorsContainerParams from a regular function to a function template so that it can handle type-specific size calculations.
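A type-specific size calculation of this kind might look like the sketch below. The `Metric` enum and the free function `processed_bytes_count` are hypothetical names chosen for illustration; the only rule taken from this PR is that int8_t / uint8_t with cosine similarity need 4 extra bytes for the norm, assumed here to be a float.

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Hypothetical metric tag; the real project defines its own enum.
enum class Metric { L2, IP, Cosine };

// Bytes a vector occupies after preprocessing. Only int8_t / uint8_t with
// cosine similarity grow, by one float for the appended norm (an assumption
// about the norm's type, matching the "+ 4" in the description).
template <typename DataType>
constexpr std::size_t processed_bytes_count(std::size_t dim, Metric metric) {
    constexpr bool is_byte_type = std::is_same<DataType, std::int8_t>::value ||
                                  std::is_same<DataType, std::uint8_t>::value;
    return dim * sizeof(DataType) +
           ((is_byte_type && metric == Metric::Cosine) ? sizeof(float) : 0);
}
```

Making the function a template lets the element type decide the result at compile time, so float vectors and non-cosine metrics keep their original size.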

Impact

Eliminates out‑of‑bounds reads for int8_t / uint8_t cosine vectors.
API change: downstream code must propagate the (possibly updated) input_blob_size.
No functional change for other datatype‑metric pairs.
Negligible perf hit: at most 4 bytes of extra memset.

…er file
add logic to determine the cosine processed_bytes_count
add processed_bytes_count to the cosine component
change blob_size param to the original blob size instead of processed_bytes_count in all preprocessing functions

codecov · 2025-05-08T05:37:13Z

Codecov Report

Attention: Patch coverage is 89.36170% with 5 lines in your changes missing coverage. Please review.

Project coverage is 96.29%. Comparing base (fbff84f) to head (8967d40).
Report is 1 commit behind head on main.

Files with missing lines	Patch %	Lines
.../VecSim/spaces/computer/preprocessor_container.cpp	75.00%	2 Missing ⚠️
src/VecSim/spaces/computer/preprocessors.h	89.47%	2 Missing ⚠️
...rc/VecSim/spaces/computer/preprocessor_container.h	90.90%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #670      +/-   ##
==========================================
+ Coverage   96.24%   96.29%   +0.05%     
==========================================
  Files         112      111       -1     
  Lines        6278     6288      +10     
==========================================
+ Hits         6042     6055      +13     
+ Misses        236      233       -3



meiravgri · 2025-05-12T15:22:11Z

src/VecSim/spaces/computer/preprocessor_container.h


    for (auto pp : preprocessors) {
        if (!pp)
            break;
        // modifies the memory in place
-        pp->preprocessStorageInPlace(blob, processed_bytes_count);
+        pp->preprocessStorageInPlace(blob, input_blob_size);


Will be covered by the SVS code.
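The reason for passing input_blob_size instead of processed_bytes_count is that each stage should see the size the previous stage left behind. The sketch below shows that chaining on a simplified interface: it uses `std::vector` instead of raw pointers, all type names are hypothetical, and its stage is allowed to grow the blob, which the real in-place path does not do.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical interface mirroring the by-reference size parameter.
struct PreprocessorSketch {
    virtual ~PreprocessorSketch() = default;
    virtual void preprocess(std::vector<std::uint8_t> &blob,
                            std::size_t &input_blob_size) const = 0;
};

// Grows the blob by a fixed number of zero bytes and reports the new size.
struct AppendBytesSketch : PreprocessorSketch {
    explicit AppendBytesSketch(std::size_t extra) : extra_(extra) {}
    void preprocess(std::vector<std::uint8_t> &blob,
                    std::size_t &input_blob_size) const override {
        assert(blob.size() == input_blob_size); // size handed in must be current
        blob.resize(input_blob_size + extra_, 0);
        input_blob_size = blob.size(); // write the new size back
    }
    std::size_t extra_;
};

// Runs the stages in order; stops at the first empty slot, as the loop above does.
std::size_t run_pipeline(const std::vector<const PreprocessorSketch *> &preprocessors,
                         std::vector<std::uint8_t> &blob) {
    std::size_t input_blob_size = blob.size();
    for (const PreprocessorSketch *pp : preprocessors) {
        if (!pp)
            break;
        pp->preprocess(blob, input_blob_size);
    }
    return input_blob_size;
}
```

If a stage forgot to update input_blob_size, the assertion in the next stage would fire, which is the failure mode the DummyChangeAllocSizePreprocessor test is described as targeting.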


Copilot

Pull Request Overview

This PR refactors the preprocessor pipeline API to propagate and fix the blob size across all stages, and it ensures safe in-place processing for cosine similarity by storing the expected processed_bytes_count. It also updates the factory to calculate and pass the new size, and adds a Valgrind build option.

Switch from a processed_bytes_count value to a size_t &input_blob_size reference throughout all preprocessors.
Store processed_bytes_count in CosinePreprocessor and add assertions to guard in-place processing.
Update factories to compute and forward the correct blob size; add USE_VALGRIND option in tests.

Reviewed Changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 1 comment.


File	Description
tests/unit/test_hnsw_tiered.cpp	Adapted test preprocessor implementations to new `input_blob_size` API
tests/unit/CMakeLists.txt	Added `USE_VALGRIND` CMake option
src/VecSim/spaces/computer/preprocessors.h	Changed preprocess interfaces to use size reference; updated `CosinePreprocessor`
src/VecSim/spaces/computer/preprocessor_container.h	Propagate size reference through container API
src/VecSim/spaces/computer/preprocessor_container.cpp	Updated container implementation to use new size
src/VecSim/index_factories/components/preprocessors_factory.h	Added `processed_bytes_count` to params
src/VecSim/index_factories/components/components_factory.h	Created templated `CreatePreprocessorsContainerParams` to compute new size
src/VecSim/CMakeLists.txt	Removed deprecated `components_factory.cpp` from build
Makefile	Enable `USE_VALGRIND` flag when requested

Comments suppressed due to low confidence (1)

src/VecSim/spaces/computer/preprocessors.h:61

The both-null case (when storage_blob and query_blob are both initially nullptr) is never entered because it's inside the if (storage_blob != query_blob) block. You need an else branch matching the original implementation to handle allocating and splitting a single shared nullptr blob.

if (storage_blob != query_blob) {

Copilot · 2025-05-18T12:26:24Z

src/VecSim/spaces/computer/preprocessors.h

+                memcpy(storage_blob, original_blob, input_blob_size);
            } else if (query_blob == nullptr) {
                query_blob = this->allocator->allocate_aligned(processed_bytes_count, alignment);
-                // TODO: handle original_blob_size != processed_bytes_count
-                memcpy(query_blob, original_blob, processed_bytes_count);
+                memcpy(query_blob, original_blob, input_blob_size);


You allocate processed_bytes_count bytes but only copy input_blob_size bytes, leaving the extra norm space uninitialized. Consider initializing or writing the norm value explicitly before use to avoid undefined data in the trailing bytes.
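One way to address this comment is to write the norm into the trailing bytes immediately after the copy. The sketch below assumes the layout described in the PR (int8_t elements followed by a 4-byte norm) and assumes the norm is a float L2 norm; `write_trailing_norm` is a hypothetical helper, not the library's normalization function.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: blob must hold dim int8_t elements followed by
// sizeof(float) writable bytes. Computes the L2 norm of the elements and
// stores it in the tail, so no byte of the blob is left uninitialised.
inline float write_trailing_norm(void *blob, std::size_t dim) {
    assert(blob != nullptr);
    const auto *elems = static_cast<const std::int8_t *>(blob);
    float sum_sq = 0.0f;
    for (std::size_t i = 0; i < dim; ++i) {
        const float x = static_cast<float>(elems[i]);
        sum_sq += x * x;
    }
    const float norm = std::sqrt(sum_sq);
    // memcpy avoids an unaligned float store at byte offset dim.
    std::memcpy(static_cast<unsigned char *>(blob) + dim, &norm, sizeof(norm));
    return norm;
}
```

Using `memcpy` for the store sidesteps alignment concerns when dim is not a multiple of 4.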

plan for the tests

cc4281a

meiravgri changed the title ~~make CreatePreprocessorsContainerParams templated and move it to header file~~ [MOD-9576] make CreatePreprocessorsContainerParams templated and move it to header file May 11, 2025

meiravgri added 5 commits May 11, 2025 16:35

Merge remote-tracking branch 'origin/main' into meiravg_fix_blob_copy…

74885a3

…_size

rename original_blob_size -> input_blob_size

86a44a9

add DummyChangeAllocSizePreprocessor class to test pp that changes the size
add tests; one will fail because the input blob size is not updated

preprocessors now change the blob size

3e15e76

fix test

1863722

fix tiered test

55837ba

meiravgri commented May 12, 2025


meiravgri added 3 commits May 17, 2025 04:50

add assert storage_blob == nullptr || input_blob_size == processed_by…

b1699ad

…tes_count
add valgrind macro
value_to_add of DummyQueryPreprocessor and DummyMixedPreprocessor is now DataType instead of int

enable assert only in debug

6dc543d

use constexpr for blob size

3e673b7

meiravgri added the benchmarks-all label May 18, 2025

meiravgri changed the title ~~[MOD-9576] make CreatePreprocessorsContainerParams templated and move it to header file~~ [MOD-9576] Fix memory safety in cosine preprocessing May 18, 2025

meiravgri added the backport 8.0 label May 18, 2025

small docs changes

8967d40

meiravgri requested review from alonre24, GuyAv46, Copilot and dor-forer and removed request for GuyAv46 May 18, 2025 12:23

Copilot AI reviewed May 18, 2025
