Spark NLP gte-small model not generating the same vector embeddings as from transformers.js or sentence-transformers #14539

lsli8888 · 2025-03-25T18:20:55Z

This is a followup from:

Here's my PySpark ML pipeline for generating vector embeddings for multiple sentences using Spark NLP and its official gte-small embedding model:

        document_assembler = DocumentAssembler() \
            .setInputCol(self.embedding_column) \
            .setOutputCol("document")

        sentence = SentenceDetector() \
            .setInputCols(["document"]) \
            .setOutputCol("sentence")

        tokenizer = Tokenizer() \
            .setInputCols(["document"]) \
            .setOutputCol("token")

        bert_embeddings = BertEmbeddings().load(gte_small_model_file_location) \
            .setInputCols(["sentence", 'token']) \
            .setOutputCol("embeddings") \
            .setCaseSensitive(True)

        sentence_embedding = SentenceEmbeddings() \
            .setInputCols(["sentence", "embeddings"]) \
            .setOutputCol("sentence_embedding") \
            .setPoolingStrategy("AVERAGE")

        embeddings_finisher = EmbeddingsFinisher() \
            .setInputCols("sentence_embedding") \
            .setOutputCols([self.vector_column]) \
            .setOutputAsVector(False)

        return Pipeline(
            stages=[
                document_assembler,
                sentence,
                tokenizer,
                bert_embeddings,
                sentence_embedding,
                embeddings_finisher
            ]
        )

My goal is to use the built-in embedding API in Supabase Edge Functions to generate gte-small vector embeddings from user queries. These embeddings are then compared for similarity against embeddings generated on my server side (hopefully by Spark NLP) from data in a Delta Lake. Obviously, I want Spark NLP to produce the same embedding values as Supabase Edge Functions when both are given the same input data.

I tried various inputs for a single sentence, from a multi-word sentence down to a single-word sentence, and the results differ between Supabase Edge Functions and Spark NLP using the code above. BTW, I also tried case sensitive set to both true and false. Just to confirm that I'm not doing anything unusual, and that Supabase isn't doing something wrong, I generated vector embeddings from the same input using SentenceTransformers. I got the same vector embeddings from SentenceTransformers as I did from Supabase.

Yes, I could create a UDF that uses SentenceTransformers to generate embeddings from the data in my Delta Lake, but I really want to stick with Spark NLP, as I'm sure it will be faster and more efficient. Is there anything in my code above that is wrong or needs to change? Are there any Spark NLP APIs I can use to match the transformations/embedding implementation from Supabase/transformers.js or SentenceTransformers, so that Spark NLP generates the same vector embeddings?
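
For context on the AVERAGE pooling step in the pipeline above, here is a toy sketch of how the choice of token vectors that go into an average changes the resulting sentence vector. The numbers are made up, the helper names `mean_pool` and `l2_normalize` are hypothetical, and the "special token" vectors only stand in for [CLS]/[SEP]-style positions. It is an illustration of why two pipelines built on the same weights can disagree, not a diagnosis of this pipeline.

```python
import math


def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one sentence vector."""
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]


def l2_normalize(vector):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return list(vector) if norm == 0.0 else [x / norm for x in vector]


# Toy 3-dimensional "token embeddings" (made-up numbers, not model output).
special_start = [0.9, 0.1, 0.0]  # stands in for a [CLS]-style position
word_tokens = [[0.2, 0.4, 0.6], [0.4, 0.2, 0.6]]
special_end = [0.0, 0.1, 0.9]  # stands in for a [SEP]-style position

# Average over the word tokens only.
words_only = mean_pool(word_tokens)

# Average over every position, including the special tokens.
with_specials = mean_pool([special_start] + word_tokens + [special_end])

print("words only:    ", words_only)
print("with specials: ", with_specials)
print("normalized:    ", l2_normalize(words_only))
```

The two averages differ even though the word vectors are identical, so which positions are pooled, and whether the result is normalized afterwards, are worth checking on both sides before comparing outputs.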

maziyarpanahi · 2025-03-25T20:32:50Z

Maintainer

Hi @lsli8888

BertEmbeddings combined with SentenceEmbeddings is not a good choice for your use case. I would use something like BGEEmbeddings, which has the whole process in one place. The output of the annotators that are meant for text embeddings (E5, MPNet, Nomic, etc.) is identical to that of the original models from sentence-transformers (Hugging Face).

The only issue is that you would need to import the GTE models through the BGEEmbeddings annotator. Since they have the same architecture (BERT), it should be fine.

PS: We have fixed some issues in the text embeddings annotators to make sure their output is exactly the same as that of the original implementations: https://github.com/JohnSnowLabs/spark-nlp/releases/tag/5.5.3

6 replies

maziyarpanahi Mar 26, 2025
Maintainer

As I mentioned, that model doesn't exist for this annotator. You need to import it yourself. This is how you can import ONNX models into these annotators (you can use BGE): https://github.com/JohnSnowLabs/spark-nlp/tree/master/examples/python/transformers/onnx

lsli8888 Mar 26, 2025
Author

Thanks for the clarification and help, @maziyarpanahi. Using the link above, I was able to import the ONNX model with BGEEmbeddings and generate a bert_onnx file. BTW, I tried setting case sensitive to both True and False when saving the model:

BGEEmbeddings.loadSavedModel(f"{EXPORT_PATH}", spark_session) \
    .setInputCols(["document"]) \
    .setOutputCol("bge") \
    .setCaseSensitive(False) # Also tried true

I then generated the vector embeddings using the following code. To start, I tested on the simple single-token input "Hello":

        document_assembler = DocumentAssembler() \
            .setInputCol(self.embedding_column) \
            .setOutputCol("document")

        # This is where I saved the model
        bge_embeddings = BGEEmbeddings().load(f"/tmp/{self.embedding_algorithm}") \
            .setInputCols(["document"]) \
            .setOutputCol("embeddings")

        embeddings_finisher = EmbeddingsFinisher() \
            .setInputCols("embeddings") \
            .setOutputCols(self.vector_column) \
            .setOutputAsVector(False)

Again, I tried setting case sensitive to both True and False. I didn't get my previous error. However, the vector embedding generated still didn't match the vector embedding generated using Supabase Edge Functions or sentence-transformers (again, Supabase and sentence-transformers both yielded the same vector embeddings).

Am I still doing something wrong or forgetting to do something?

BTW, I'm using PySpark 3.5.4, Spark NLP 5.5.3.

maziyarpanahi Mar 26, 2025
Maintainer

Thank you. Do you have a notebook that shows this difference? A Colab that shows what you are loading in ONNX and then in Spark NLP, and how you calculate the cosine similarity that shows the difference, would be very helpful.

lsli8888 Mar 27, 2025
Author

Hi @maziyarpanahi,

Unfortunately, I couldn't get Colab to work for me in short order, but I was able to create a local Jupyter notebook and get it to run in VS Code. I've attached the Jupyter notebook file with some of the output. I'll try again tomorrow and see if I can get it to work on Colab.

The cosine similarity I got is 0.9460, which is similar, though not the same.
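
For reference, a cosine-similarity check like the one discussed here can be sketched in plain Python as follows. This is a minimal sketch and not the code from the attached notebook: the function names are hypothetical, and the two vectors are short made-up placeholders standing in for the Spark NLP output and the sentence-transformers output.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def max_abs_difference(a, b):
    """Largest element-wise absolute difference between two vectors."""
    return max(abs(x - y) for x, y in zip(a, b))


# Hypothetical placeholder vectors (not real model output).
spark_nlp_vector = [0.10, 0.20, 0.30]
reference_vector = [0.10, 0.25, 0.28]

similarity = cosine_similarity(spark_nlp_vector, reference_vector)
difference = max_abs_difference(spark_nlp_vector, reference_vector)
print(f"cosine similarity:  {similarity:.4f}")
print(f"max abs difference: {difference:.4f}")
```

Cosine similarity ignores vector length, so two vectors that differ only by a scale factor (for example, one normalized and one not) still score 1.0. A value below 1.0 therefore means the directions differ, and the element-wise difference helps show by how much.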

FYI, I had to rename the file from comparison.ipynb to comparison.txt as GitHub wasn't allowing me to upload comparison.ipynb.

comparison.txt

lsli8888 Mar 27, 2025
Author

Hi @maziyarpanahi, I sent a link to the Google Colab notebook to the email address listed in your GitHub profile.
