Add a flat vector format for bfloat16 vector storage #132533

Open · wants to merge 14 commits into lucene_snapshot from bfloat16-vector-format

Conversation

thecoop
Member

@thecoop thecoop commented Aug 7, 2025

Add a bfloat16 flat vector storage format for bbq indices. It is selected by specifying the element_type: bfloat16 index option, currently enabled for bbq_flat and bbq_hnsw. This only changes the bytes stored on disk; vectors are still processed in memory as float[].
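For illustration, a mapping along these lines would opt into the new format (a sketch based on the description above; the field name and dims are placeholders, and the final option spelling may differ from what ships):

```json
{
  "mappings": {
    "properties": {
      "embedding": {
        "type": "dense_vector",
        "dims": 1024,
        "element_type": "bfloat16",
        "index_options": { "type": "bbq_hnsw" }
      }
    }
  }
}
```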

@thecoop thecoop requested a review from a team as a code owner August 7, 2025 11:05
@thecoop thecoop changed the base branch from main to lucene_snapshot August 7, 2025 11:05
@thecoop thecoop removed request for a team August 7, 2025 12:04
@thecoop
Member Author

thecoop commented Aug 7, 2025

bbq_flat float32:

index_name            index_type  num_docs  index_time(ms)  force_merge_time(ms)  num_segments
--------------------  ----------  --------  --------------  --------------------  ------------  
testwiki1024en.train        flat    499001            6536                  4732             0

index_name            index_type  n_probe  latency(ms)  net_cpu_time(ms)  avg_cpu_count    QPS  recall    visited  filter_selectivity
--------------------  ----------  -------  -----------  ----------------  -------------  -----  ------  ---------  ------------------  
testwiki1024en.train        flat        0        29.92              0.00           0.00  33.42    0.68  499000.00                1.00

bbq_flat bfloat16:

index_name            index_type  num_docs  index_time(ms)  force_merge_time(ms)  num_segments
--------------------  ----------  --------  --------------  --------------------  ------------  
testwiki1024en.train        flat    499001            9055                  6355             0

index_name            index_type  n_probe  latency(ms)  net_cpu_time(ms)  avg_cpu_count    QPS  recall    visited  filter_selectivity
--------------------  ----------  -------  -----------  ----------------  -------------  -----  ------  ---------  ------------------  
testwiki1024en.train        flat        0        31.90              0.00           0.00  31.35    0.69  499000.00                1.00

bbq_hnsw float32:

index_name            index_type  num_docs  index_time(ms)  force_merge_time(ms)  num_segments
--------------------  ----------  --------  --------------  --------------------  ------------  
testwiki1024en.train        hnsw    499001          731577                130158             0

index_name            index_type  n_probe  latency(ms)  net_cpu_time(ms)  avg_cpu_count     QPS  recall   visited  filter_selectivity
--------------------  ----------  -------  -----------  ----------------  -------------  ------  ------  --------  ------------------  
testwiki1024en.train        hnsw        0         2.83              0.00           0.00  352.98    0.68  11987.75                1.00

bbq_hnsw bfloat16:

index_name            index_type  num_docs  index_time(ms)  force_merge_time(ms)  num_segments
--------------------  ----------  --------  --------------  --------------------  ------------  
testwiki1024en.train        hnsw    499001          763079                222260             0

index_name            index_type  n_probe  latency(ms)  net_cpu_time(ms)  avg_cpu_count     QPS  recall   visited  filter_selectivity
--------------------  ----------  -------  -----------  ----------------  -------------  ------  ------  --------  ------------------  
testwiki1024en.train        hnsw        0         2.93              0.00           0.00  340.72    0.68  12227.39                1.00

@thecoop
Copy link
Member Author

thecoop commented Aug 7, 2025

From esbench runs: around a 5% performance drop for a 40-50% reduction in disk usage.

Comment on lines +32 to +34
public static float bFloat16ToFloat(short bf) {
return Float.intBitsToFloat(bf << 16);
}
Member

Do we need to worry about the short being signed? Since we are just storing the bits, do we have to do a Short.toUnsignedInt call?

Member Author

I don't think so: as it's a left shift, the sign-extended high bits get shifted out, so there's no sign-extension to worry about.
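A small sketch of why the shift is safe (class and variable names here are illustrative, not the PR's code): promoting the short to int sign-extends the high 16 bits, but the left shift by 16 discards exactly those bits, so the result matches the explicit unsigned conversion.

```java
public class BFloat16ShiftDemo {
    // Same conversion as in the PR snippet above
    static float bFloat16ToFloat(short bf) {
        return Float.intBitsToFloat(bf << 16);
    }

    public static void main(String[] args) {
        short negativeBits = (short) 0xBF80; // bfloat16 bit pattern for -1.0f
        // (int) negativeBits == 0xFFFFBF80 after sign extension,
        // but << 16 yields 0xBF800000 either way
        System.out.println(bFloat16ToFloat(negativeBits)); // prints -1.0
        System.out.println(Float.intBitsToFloat(Short.toUnsignedInt(negativeBits) << 16)); // prints -1.0
    }
}
```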

import java.nio.ByteOrder;
import java.nio.ShortBuffer;

class BFloat16 {
Contributor

Might be a silly question, mostly for my own knowledge. Why not use jdk.incubator.vector.Float16? Just because it's in incubator and we can pretty easily write/maintain our own?

Member Author

That is the IEEE 16-bit float, which is different from bfloat16. IEEE 16-bit floats have a smaller range than 32-bit floats; bfloat16 has the same range as 32-bit, but less precision - https://en.wikipedia.org/wiki/Bfloat16_floating-point_format

Member

@john-wagster utilizing Float16 pollutes our type space with something that actually isn't supported. It also does boxing.

I think there will be room for Panama Vector when it comes to scoring between two bfloat16 arrays: right now this PR decodes to float and then compares, which is not only an extra step but also loses any potential speed improvements.

Member Author

@thecoop thecoop Aug 7, 2025

Yeah, this PR is here for the disk space and memory usage reductions. Passing around the bfloat16 values directly as short[] is a much more involved piece of work, and you would still need to convert to 32-bit floats at some point, as Panama doesn't support working with 16-bit bfloat16s.
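The decode-then-compare approach being discussed can be sketched as a scalar dot product over bfloat16-encoded arrays (illustrative names only; this is not the PR's actual scoring code):

```java
public class BFloat16DotProduct {
    static float bFloat16ToFloat(short bf) {
        return Float.intBitsToFloat(bf << 16);
    }

    // Each element pays a decode step before the multiply-accumulate; a
    // Panama-vectorized variant would have to widen shorts to int lanes,
    // shift, and reinterpret as float lanes to avoid the scalar loop.
    static float dotProduct(short[] a, short[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            sum += bFloat16ToFloat(a[i]) * bFloat16ToFloat(b[i]);
        }
        return sum;
    }
}
```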

@benwtrent
Member

I wonder @thecoop if it would be better to add an element_type: bfloat16 to the vector mapping?

Since you are adding a new flat vector format, that means we could add it to ALL our formats almost for "free".

Then for diskbbq, hnsw_bbq, etc. when choosing the element_type, we can adjust the higher level format, which just calls the same readers/writers, but with a different flat vector format.

@thecoop
Member Author

thecoop commented Aug 7, 2025

I'm not sure how that would work with the Lucene VectorEncoding enum, which doesn't have a value for 2-byte BFLOAT16; all the formats only support 1-byte bytes or 4-byte floats.

I have a separate PR that adds VectorEncoding.BFLOAT16 to Lucene, but that would need a Lucene major version upgrade to use.

@john-wagster
Contributor

Around a 5% performance drop, for a 40-50% reduction in disk usage

Looks like merge times went way up too? Around 50% more time in merge, if I'm reading this right.

@benwtrent
Member

I'm not sure how that would work with the lucene VectorEncoding enum, which doesn't have a value for 2-byte BFLOAT16, and all the formats only support 1-byte bytes or 4-byte floats.

We don't need to do anything with Lucene. Our element types don't directly translate; we can just tell Lucene whatever we want (maybe just float). Our format will just know that float actually means bfloat16.

@benwtrent
Member

@thecoop see what I did for bit element type for inspiration into what I mean.

@thecoop
Member Author

thecoop commented Aug 7, 2025

looks like merge times went way up too? around 50% more time in merge if I'm reading this right.

Yup, but it actually showed a decrease in merge time in esbench runs. Needs a closer check.

@benwtrent
Member

yup, but it actually showed a decrease in merge time in esbench runs. Needs a closer check.

I wonder if it's a difference of incremental merge times vs force-merge time.

We may just be copying fewer bytes, and the reduction in overall off-heap memory requirements might offset the impact of having to decode the floats for regular merges.

@thecoop
Member Author

thecoop commented Aug 7, 2025

I've defined a new ElementType, rather than creating a special index option.

@thecoop thecoop force-pushed the bfloat16-vector-format branch 4 times, most recently from 0fa437b to 175cd80 on August 8, 2025 09:22
@thecoop thecoop force-pushed the bfloat16-vector-format branch from 175cd80 to f979e89 on August 8, 2025 10:36
// write vector values
long vectorDataOffset = vectorData.alignFilePointer(Float.BYTES);
switch (fieldData.fieldInfo.getVectorEncoding()) {
case BYTE -> writeByteVectors(fieldData);
Member

Since now we are creating this with element_type, we should just throw if byte is used when creating a field and remove all the byte support here on the writer & on the reader.

Member Author

Yes, quite right - done. This code obviously needs a proper cleanup pass after prototyping.

@jeramysoucy jeramysoucy self-requested a review August 11, 2025 15:14
@jeramysoucy

@thecoop I am planning to take a look at this on behalf of the Kibana Security team. Could you point me to changes that are relevant for our team, or anywhere specific that you'd like us to focus on?

@thecoop
Member Author

thecoop commented Aug 11, 2025

This was done as part of spacetime, and it's still WIP. I'm not sure why you were automatically pinged by the bots for this PR.

@jeramysoucy

@thecoop Ah, ok. I am going to remove us for now. If you want us to take a look if/when it is ready for review/merge, let us know.

@jeramysoucy jeramysoucy removed request for a team and jeramysoucy August 11, 2025 15:44