-
Notifications
You must be signed in to change notification settings - Fork 1
initial commit for embedding linker #65
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, this sounds great!
I've just left "a few" comments on things I'm getting confused about.
for name in self._name_keys | ||
] | ||
|
||
def embed_cui_names(self, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this would need to be explicitly be called to init the class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure myself about the best way to expose and use these methods that is consistent within the medcat library.
Currently on my testing I'm doing this with an existing medcat, and well defined pipeline:
cat.config.components.linking = EmbeddingLinking()
cat.config.components.linking.comp_name = EmbeddingLinker.name
embedding_linker = EmbeddingLinker(
cdb=cat.cdb,
config=cat.config
)
cat._pipeline._components[-1] = embedding_linker
cat._recreate_pipe()
embedding_linker.embed_cui_names(embedding_model_name="NeuML/pubmedbert-base-embeddings")
embedding_linker.embed_names(embedding_model_name="NeuML/pubmedbert-base-embeddings")
But obviously this isn't standard at all.
I would like to know is there a "standard" way of implementing this when initialising a linker in general, or if you have a recommendation of what I should do instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My intuition would be that if you need something initialised for inference, you'd initialise it at init time (i.e in __init__
).
With that said, since you write this data onto disk (as part of cdb.addl_info
), you would need to check for its existance first. But even then, if the CDB changes, (some of) these would need to be recalculated. The CDB
keeps track of the is_dirty
attribute. If that's False, there's been no (API-based) changes to the CDB since the CDB load, and if it's True, then there's been changes. But I don't think I've set up anything to mark it undirty automatically, really. The only thing that's automatically dealt with is the subnames, but there's a separate flag for that.
TODO or discuss currently that I know of:
Otherwise the changes I've submitted have done a few things:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some small comments on the code
You could also init at As for
I don't think there's really a way to do that gracefully. This would be a generative process - you're wanting to generate information that isn't there. Overall, you migth also want to add some tests to this. Probably with a very low-performant but small model. But something that shows the entire thing works in the technical sense. |
I've just this in the call method: if self.cdb.is_dirty: I think there are tons are permutations I can do around being smart with embedding names and cuis. But I think it might just not be worth it. |
I think that's fair. Though maybe also tell them the method to use for the re-embedding. I.e: linker = self.cat._pipeline.get_component(CoreComponentType.linking)
linker.embed_cui_names(linker.embedding_model_name) PS: I'd probably need to improve getting the pipe from the CAT object instead of using the protected attribute. |
Hihi, Additional changes:
Regarding:
I was curious what's the best way to access the embed linker, as it's needed to be called when embedding names and cuis? Currently I access the component via: Which isn't ideal I suppose. I think the only TODO in terms of development is deciding on what to do with similarity thresholds. I'm thinking that; if you generate your own link candidates a higher threshold is appropriate, but when you get to larger contexts then that should be a smaller threshold. |
The API to access the component would be: from medcat.components.types import CoreComponentType
comp = cat.pipe.get_component(CoreComponentType.linking) EDIT: You may want to resync with master - otherwise you're stuck with using |
Yep. That works nicely. |
The test failure should be fixed by #125 |
This is WIP just showing the inital commit of the embedding linker, and how I've tested / instantiated it.
WHAT IT DOES:
This is a linker that requires no training. However it does have higher computational overheads.
Given an embedding model name (such as sentence-transformers/all-MiniLM-L6-v2 or BAAI/bge-m3) we embed all names in cdb.name2info. This will result in a N dimensional vector for each name.
When provided with a bunch of entities from the ner step we then embed the detected names, or the raw texts and compare them to all embedded potential names via cosine similarity.
With the most similar name we then obtain all cuis from that name via
cdb.name2info[predicted_name]["per_cui_status"].keys()
.HOW IT'S BEEN TESTED:
I am currently changing the linker in an already existing model:
Just a note embedding with sentence-transformers/all-MiniLM-L6-v2 takes 10 minutes for 3 million names. BAAI/bge-m3 takes 1 hour 30. Although this is a one time cost.
I then run this to test performance:
I've tested this on the original model, and twice with the new embedding linker using "sentence-transformers/all-MiniLM-L6-v2" as the embedding model. Firstly, just chosing the first cui in self.cdb.name2info[predicted_name]["per_cui_status"].keys() if there are multiple. The second tries to do disambiguation with cui's prefered names and the detected name.
Using context_based_linker:
Epoch: 0, Prec: 0.09023910064860668, Rec: 0.3306035738148675, F1: 0.14177920150470955
Picking first cui, with an embedding linker:
Epoch: 0, Prec: 0.07762498227887255, Rec: 0.3095756163585588, F1: 0.12412585010647798
With embedding disambiguation:
Epoch: 0, Prec: 0.0650762760285772, Rec: 0.2595346605900696, F1: 0.10406025987586687
The only real import stat here is recall. While these look low, I'm led to believe that these usually come with filters, which I've avoided.
Issues / Discussion:
embed_names
, besides addressing the component itself via protected variables:TODO:
Let me know any suggestions / whatever else I've missed. Thanks!