I start a server with
Then when I get the embeddings, it seems like I get multiple vectors in return.
I think these are 4 distinct vectors. Each of them has as many elements as I expect in a single embedding vector.
The number of vectors I get changes with the length of the text. My guess is that it's tokenizing and then embedding each of the tokens, or something similar. I'm not sure why it would do that, though. Any ideas how I can get the embedding vector for the whole string?
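In case it helps to show what I mean, here's a sketch (with made-up numbers, not actual server output) of how per-token vectors could be collapsed into a single sentence vector by averaging on the client side, i.e. mean pooling:

```python
import numpy as np

# Hypothetical response: one vector per token, each of embedding dimension 3.
token_vectors = np.array([
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
])  # shape (4, 3): 4 tokens, dimension 3

# Mean pooling: average across the token axis to get one vector
# for the whole string.
sentence_vector = token_vectors.mean(axis=0)
print(sentence_vector.shape)  # (3,)
```

This is just post-processing on my side, though; ideally the server would do it.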
Alright, I found a bit more. This is affected by the `--pooling` flag. Without it, the server just generates a vector for every token. Using `--pooling cls` generates a single vector for classification. If I understand correctly, it picks a representative token, which doesn't sound right to me. I expect the embedding model to create an embedding over the whole string, not just look at a specific token to generate the embedding. Maybe I misunderstand what is going on though.
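For what it's worth, here is how I currently picture the difference between the two pooling modes, sketched with toy numbers (not real model output). As I understand it, `cls` pooling takes the first token's vector, which BERT-style models are trained to use as a summary of the whole input, while mean pooling averages over all token vectors:

```python
import numpy as np

# Hypothetical per-token embeddings (4 tokens, dimension 3).
tokens = np.array([
    [1.0, 0.0, 0.0],  # first token: what cls pooling returns, if the model
                      # prepends a classification token to the sequence
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0],
])

cls_vector = tokens[0]             # cls pooling: first token's vector only
mean_vector = tokens.mean(axis=0)  # mean pooling: average of all tokens
```

So even though `cls` looks at a single position, that position's vector is (supposedly) trained to represent the whole string via attention, so it's not as arbitrary as "picking one token" sounds.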