Hi @lsli8888, BERT and SentenceEmbeddings are not a good choice for your use case. I would use something like [...]. The only issue is that you might want to import the GTE models through [...]. PS: we have fixed some issues in the text embeddings annotators to make sure they output exactly the same values as the ones from outside: https://github.com/JohnSnowLabs/spark-nlp/releases/tag/5.5.3
This is a follow-up to #14535.
Here's my PySpark ML pipeline for generating vector embeddings for multiple sentences using Spark NLP and its official gte-small embedding model:
My goal is to use the built-in API of Supabase Edge Functions to generate gte-small vector embeddings from user queries. These embeddings are then compared for similarity against embeddings generated on my server side (hopefully by Spark NLP) from data in a Delta Lake. Obviously, for the same input, I want Spark NLP to produce the same embedding values that Supabase Edge Functions would.
I tried various inputs for a single sentence, from multi-word sentences down to a single-word sentence, and the embeddings generated by Supabase Edge Functions differ from those generated by Spark NLP using the code above. BTW, I also tried with case sensitivity set to both true and false. To confirm that I'm not doing anything unusual, and that Supabase isn't doing something wrong, I generated vector embeddings from the same input using SentenceTransformers. I got the same vector embeddings from SentenceTransformers as I did from Supabase.
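To make "the results are different" concrete, here is a minimal sketch of how two embedding vectors for the same input could be compared. The helper names (`cosine_similarity`, `max_abs_diff`, `embeddings_match`) and the tolerance of `1e-4` are illustrative assumptions, not part of Spark NLP, Supabase, or SentenceTransformers; the vectors are assumed to have already been exported from each system as plain lists of floats.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (1.0 means same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return max(abs(x - y) for x, y in zip(a, b))


def embeddings_match(a, b, atol=1e-4):
    """True if every component agrees within atol (hypothetical tolerance)."""
    return max_abs_diff(a, b) <= atol
```

A cosine similarity close to 1.0 with a large `max_abs_diff` would suggest the two vectors differ only in scale (for example, one side is normalized and the other is not), while a low cosine similarity points to a genuinely different representation.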
Yes, I could create a UDF that uses SentenceTransformers to generate embeddings from the data in my Delta Lake, but I really want to stick with Spark NLP, as I expect it to be faster and more efficient. Is there anything in my code above that is wrong or needs to change? Are there any Spark NLP APIs I can use to match the transformations/embedding implementation of Supabase/transformers.js or SentenceTransformers, so that Spark NLP generates the same vector embeddings?
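For reference, the kind of transformation that would need to match is the post-processing applied to the model's per-token outputs. The sketch below assumes the reference side applies mean pooling over non-padding tokens followed by L2 normalization; that is an assumption for illustration, not something confirmed in this thread, and the function names are hypothetical.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose attention-mask entry is 1 (padding is skipped)."""
    kept = [vec for vec, mask in zip(token_vectors, attention_mask) if mask == 1]
    if not kept:
        raise ValueError("attention mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


# Assumed post-processing order: pool the token vectors first, then normalize.
def sentence_embedding(token_vectors, attention_mask):
    return l2_normalize(mean_pool(token_vectors, attention_mask))
```

If one side pools or normalizes differently (or not at all), the resulting sentence vectors will not agree even when the underlying model weights are identical.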