Skip to content

Add mmap based storage of vectors, while keeping buffered storage as default #57

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

rchitale7
Copy link
Member

@rchitale7 rchitale7 commented Apr 18, 2025

Description

This issue enables the vector and doc id binary downloaded from Remote Store to be saved to disk, and read using mmap. The caller of run_tasks can download the vectors to disk if they set storage_mode = "disk", but the default storage mode is still memory. I refactored the code so that we can support other storage modes if necessary. Key to the refactoring is the BinarySource object that wraps the storage mechanism; FileSource wraps the file object used for numpy.mmap, while BufferSource wraps the buffer object used for numpy.frombuffer.

I verified manually that both storage modes work, and create the expected faiss graph.

Issues Resolved

Partially resolves for #49. I need to still update the USER_GUIDE.md to specify that memory is the default approach, but with CUDA versions >= 12, we see memory spike. So to avoid memory spike, user can use disk (at the expense of slower index builds)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@rchitale7 rchitale7 marked this pull request as ready for review April 18, 2025 16:39
jed326
jed326 previously approved these changes Apr 23, 2025
@rchitale7
Copy link
Member Author

PR is pending resolution of the discussion here: facebookresearch/faiss#4274

@rchitale7
Copy link
Member Author

Based on discussion in facebookresearch/faiss#4274, we determined that the reason the memory spikes when storing vectors in CPU memory is because of the CUDA version. This issue is observed with CUDA version >= 12. I've updated this PR to make memory based storage the default, while still giving the caller the option to use disk/mmap based storage as another option. I'll update the USER_GUIDE.md to specify when to use disk v.s. memory storage after this PR is merged.

@navneet1v
Copy link
Collaborator

@rchitale7 fix the commit message and also the PR description to correctly point out what the PR is doing

@rchitale7 rchitale7 changed the title Add mmap based storage of vectors as a default Add mmap based storage of vectors May 13, 2025
…default

Signed-off-by: Rohan Chitale <rchital@amazon.com>
@rchitale7 rchitale7 changed the title Add mmap based storage of vectors Add mmap based storage of vectors, while keeping buffered storage as default May 13, 2025
@rchitale7
Copy link
Member Author

@rchitale7 fix the commit message and also the PR description to correctly point out what the PR is doing

missed this, fixed it now.

@navneet1v
Copy link
Collaborator

@rchitale7 Based on the response here: facebookresearch/faiss#4274 (comment) I think we might not even need this PR. So, I will wait for the bug to be fixed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants