Add mmap based storage of vectors, while keeping buffered storage as default #57

rchitale7 · 2025-04-18T16:38:00Z

Description

This PR enables the vector and doc ID binaries downloaded from the Remote Store to be saved to disk and read using mmap. A caller of run_tasks can download the vectors to disk by setting storage_mode = "disk"; the default storage mode remains memory. I refactored the code so that other storage modes can be added if necessary. Key to the refactoring is the BinarySource object, which wraps the storage mechanism: FileSource wraps the file object used for numpy.memmap, while BufferSource wraps the buffer object used for numpy.frombuffer.

I manually verified that both storage modes work and create the expected faiss graph.
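For readers who want a feel for the BinarySource split described above, here is a minimal sketch. It is not the code in this PR: only the names BinarySource, FileSource, BufferSource and the "disk" / "memory" storage modes come from the description, while `read_array`, `close`, `make_source` and `roundtrip` are hypothetical helpers made up for illustration.

```python
import io
import os
import tempfile
from abc import ABC, abstractmethod

import numpy as np


class BinarySource(ABC):
    """Hypothetical interface hiding where the downloaded bytes live."""

    @abstractmethod
    def read_array(self, dtype, shape=None) -> np.ndarray:
        """Return the bytes as a numpy array, without copying them."""

    @abstractmethod
    def close(self) -> None:
        """Release whatever backs this source."""


class BufferSource(BinarySource):
    """In-memory storage: wraps a BytesIO and reads it with numpy.frombuffer."""

    def __init__(self, buffer: io.BytesIO):
        self._buffer = buffer

    def read_array(self, dtype, shape=None):
        arr = np.frombuffer(self._buffer.getbuffer(), dtype=dtype)
        return arr if shape is None else arr.reshape(shape)

    def close(self):
        # Only drop our reference; arrays still viewing the buffer keep it alive.
        self._buffer = None


class FileSource(BinarySource):
    """Disk storage: wraps a file path and reads it with numpy.memmap."""

    def __init__(self, path: str):
        self._path = path

    def read_array(self, dtype, shape=None):
        arr = np.memmap(self._path, dtype=dtype, mode="r")
        return arr if shape is None else arr.reshape(shape)

    def close(self):
        # Assumption of this sketch: the source owns the file and deletes it.
        os.remove(self._path)


def make_source(storage_mode: str, payload: bytes) -> BinarySource:
    """Hypothetical factory: pick a source from the storage mode."""
    if storage_mode == "memory":
        return BufferSource(io.BytesIO(payload))
    if storage_mode == "disk":
        fd, path = tempfile.mkstemp(suffix=".bin")
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
        return FileSource(path)
    raise ValueError(f"unsupported storage_mode: {storage_mode!r}")


def roundtrip(storage_mode: str) -> np.ndarray:
    """Write 12 float32 values through a source and read them back as 3x4."""
    expected = np.arange(12, dtype=np.float32).reshape(3, 4)
    source = make_source(storage_mode, expected.tobytes())
    try:
        view = source.read_array(np.float32, shape=(3, 4))
        result = np.array(view)  # copy so the result outlives the source
        del view  # drop the view before the backing storage is released
    finally:
        source.close()
    return result
```

In this sketch the consumer only ever sees a numpy array, so supporting another storage mode would mean adding one more BinarySource subclass and one more branch in the factory.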

Issues Resolved

Partially resolves #49. I still need to update USER_GUIDE.md to state that memory is the default storage mode, but that with CUDA versions >= 12 we see a memory spike. To avoid the spike, users can choose disk storage, at the expense of slower index builds.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.


rchitale7 · 2025-04-29T17:43:11Z

PR is pending resolution of the discussion here: facebookresearch/faiss#4274

rchitale7 · 2025-05-12T18:43:22Z

Based on the discussion in facebookresearch/faiss#4274, we determined that the memory spike seen when storing vectors in CPU memory is caused by the CUDA version; the issue is observed with CUDA versions >= 12. I've updated this PR to make memory-based storage the default, while still giving the caller the option to use disk/mmap-based storage. I'll update USER_GUIDE.md to specify when to use disk vs. memory storage after this PR is merged.

navneet1v · 2025-05-13T18:57:43Z

@rchitale7 fix the commit message and also the PR description to correctly point out what the PR is doing


rchitale7 · 2025-05-13T21:08:43Z

> @rchitale7 fix the commit message and also the PR description to correctly point out what the PR is doing

missed this, fixed it now.

navneet1v · 2025-05-15T16:19:49Z

@rchitale7 Based on the response here: facebookresearch/faiss#4274 (comment) I think we might not even need this PR. So, I will wait for the bug to be fixed.

rchitale7 marked this pull request as ready for review April 18, 2025 16:39

rchitale7 requested review from Rajrahane, navneet1v, yigithub, vamshin, jed326, owenhalpert and neetikasinghal as code owners April 18, 2025 16:39

jed326 reviewed Apr 21, 2025


remote_vector_index_builder/core/binary_source/file_source.py

jed326 previously approved these changes Apr 23, 2025


rchitale7 dismissed jed326’s stale review via 36484c7 May 12, 2025 17:31

rchitale7 force-pushed the mmap branch from 1f88f07 to 36484c7 Compare May 12, 2025 17:31

jed326 approved these changes May 12, 2025


rchitale7 changed the title ~~Add mmap based storage of vectors as a default~~ Add mmap based storage of vectors May 13, 2025

Add mmap based storage of vectors, while keeping buffered storage as …

100305d

…default Signed-off-by: Rohan Chitale <rchital@amazon.com>

rchitale7 force-pushed the mmap branch from 414fe19 to 100305d Compare May 13, 2025 21:00

rchitale7 changed the title ~~Add mmap based storage of vectors~~ Add mmap based storage of vectors, while keeping buffered storage as default May 13, 2025
