
Conversation


@adam-sutton-1992 adam-sutton-1992 commented Aug 4, 2025

This is WIP, just showing the initial commit of the embedding linker and how I've tested / instantiated it.

WHAT IT DOES:

This is a linker that requires no training; however, it does have higher computational overhead.

Given an embedding model name (such as sentence-transformers/all-MiniLM-L6-v2 or BAAI/bge-m3), we embed all names in cdb.name2info. This results in an N-dimensional vector for each name.

When provided with a set of entities from the NER step, we embed the detected names (or the raw texts) and compare them to all embedded candidate names via cosine similarity.

From the most similar name we then obtain all CUIs for that name via cdb.name2info[predicted_name]["per_cui_status"].keys().
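In isolation, the retrieval step can be sketched like this (a toy character-trigram embedder stands in for the sentence-transformer, and the `name2info` entries below are made up):

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy stand-in for SentenceTransformer.encode: a normalised
    bag-of-character-trigrams vector."""
    vec = np.zeros(dim)
    t = text.lower()
    for i in range(len(t) - 2):
        vec[hash(t[i:i + 3]) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# Made-up stand-in for cdb.name2info.
name2info = {
    "diabetes mellitus": {"per_cui_status": {"C0011849": "P"}},
    "myocardial infarction": {"per_cui_status": {"C0027051": "P"}},
    "asthma": {"per_cui_status": {"C0004096": "P"}},
}

# One-time cost: embed every known name into an (N, dim) matrix.
names = list(name2info)
name_matrix = np.stack([embed(n) for n in names])

def link(detected_name: str) -> tuple[str, list[str]]:
    """Return the most similar known name and its candidate CUIs."""
    # Rows are unit-norm, so a dot product is the cosine similarity.
    sims = name_matrix @ embed(detected_name)
    best = names[int(np.argmax(sims))]
    return best, list(name2info[best]["per_cui_status"].keys())

print(link("diabetes melitus"))  # the typo still maps to "diabetes mellitus"
```

The win over an entity-by-entity loop is that a single matrix-vector product scores all N names at once.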

HOW IT'S BEEN TESTED:

I am currently replacing the linker in an existing model:

from medcat.components.types import CoreComponentType
from medcat.components.linking.embedding_linker import Linker as EmbeddingLinker
from medcat.components.types import register_core_component

cat = CAT.load_model_pack(...)
register_core_component(CoreComponentType.linking, EmbeddingLinker.name, EmbeddingLinker.create_new_component)

embedding_linker = EmbeddingLinker(
    cdb=cat.cdb,
    config=cat.config,
    embedding_model_name="sentence-transformers/all-MiniLM-L6-v2",
    max_length=64
)
cat.config.components.linking.comp_name = EmbeddingLinker.name
cat._recreate_pipe()
embedding_linker = cat._pipeline._components[-1]
embedding_linker.embed_names(embedding_model_name="sentence-transformers/all-MiniLM-L6-v2", batch_size=4096)

cat.save_model_pack(...)

Just a note: embedding with sentence-transformers/all-MiniLM-L6-v2 takes about 10 minutes for 3 million names; BAAI/bge-m3 takes about 1 hour 30 minutes. This is a one-time cost, though.

I then run this to test performance:

from medcat.cat import CAT
from medcat.components.types import register_core_component
from medcat.components.types import CoreComponentType
from medcat.components.linking.embedding_linker import Linker as EmbeddingLinker
from medcat.stats.stats import get_stats
from medcat.data.mctexport import MedCATTrainerExport

register_core_component(CoreComponentType.linking, EmbeddingLinker.name, EmbeddingLinker.create_new_component)
cat = CAT.load_model_pack("/data/adam/models/kch_gstt_v2_embedding_linker")

# load the distemist and snomed_ct test sets

all_projects = MedCATTrainerExport(projects=[])
all_projects["projects"].append(distemist)
all_projects["projects"].append(snomed_ct)
get_stats(cat, all_projects)

I've tested this on the original model, and twice with the new embedding linker using "sentence-transformers/all-MiniLM-L6-v2" as the embedding model. The first run just chooses the first CUI in self.cdb.name2info[predicted_name]["per_cui_status"].keys() if there are multiple. The second tries to disambiguate using the CUIs' preferred names and the detected name.
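The second strategy (disambiguating between candidate CUIs via their preferred names) amounts to something like the sketch below; the `embed` function, CUIs, and preferred names are all hypothetical stand-ins, not medcat API:

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy normalised character-trigram embedding (stand-in for the model)."""
    vec = np.zeros(dim)
    t = text.lower()
    for i in range(len(t) - 2):
        vec[hash(t[i:i + 3]) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# Hypothetical cdb.cui2info fragment: two CUIs share the ambiguous name "cold".
cui2info = {
    "C0009443": {"preferred_name": "common cold"},
    "C0009264": {"preferred_name": "cold temperature"},
}

def disambiguate(detected_text: str, candidate_cuis: list[str]) -> str:
    """Pick the candidate whose preferred name is closest to the detected text."""
    q = embed(detected_text)
    return max(candidate_cuis,
               key=lambda cui: float(embed(cui2info[cui]["preferred_name"]) @ q))

print(disambiguate("a cold temperature outside", ["C0009443", "C0009264"]))
```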

Using context_based_linker:
Epoch: 0, Prec: 0.09023910064860668, Rec: 0.3306035738148675, F1: 0.14177920150470955

Picking first cui, with an embedding linker:

Epoch: 0, Prec: 0.07762498227887255, Rec: 0.3095756163585588, F1: 0.12412585010647798

With embedding disambiguation:

Epoch: 0, Prec: 0.0650762760285772, Rec: 0.2595346605900696, F1: 0.10406025987586687

The only really important stat here is recall. While these numbers look low, I'm led to believe such runs usually come with filters, which I've avoided.

Issues / Discussion:

  • I am currently storing the embeddings in cdb.name2info[name]["context_vectors"] for each name. This makes the model pack 4.5 GB larger and increases loading time from 40 seconds to 240 seconds. If possible it would be nice to save a "context_matrix" where each row corresponds to a name (as long as name positioning persists), maybe via torch.save, although I'm not sure how to slot that in here.
  • Where the context-based linker takes about 15 minutes on a project, the embedding linker takes 1 hour 30 minutes. Doing it entity by entity took about 6 hours, but doing the cosine similarity calculation for all detected names at once cut that time down. There might be more efficiencies to shave off here which I don't see... yet
  • The performance numbers suggest I don't have disambiguation down yet; I'm open to suggestions. I am planning to try embedding only cdb.cui2info[cui]['preferred_name'], so I don't have to use cdb.name2info[predicted_name]["per_cui_status"].keys(). I'm also open to other suggestions, maybe using name status itself.
  • I'm not sure of a nice way to open up embed_names, besides addressing the component itself via protected variables:
embedding_linker = cat._pipeline._components[-1]
embedding_linker.embed_names(embedding_model_name="sentence-transformers/all-MiniLM-L6-v2", batch_size=4096)
  • I don't save the models themselves locally to the model pack. We're not training them, so just let them live in the Hugging Face Hub repo with their buddies.
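For the first bullet, the "context_matrix" idea could look roughly like this (shown with numpy instead of torch.save so the sketch is self-contained; the vectors and file path are placeholders — the key point is persisting one (N, dim) array plus the name order):

```python
import os
import tempfile

import numpy as np

# Hypothetical per-name vectors, as currently stored per name in
# cdb.name2info[name]["context_vectors"].
per_name_vectors = {
    "diabetes mellitus": np.ones(4, dtype=np.float32),
    "asthma": np.zeros(4, dtype=np.float32),
}

# Collapse the dict into one matrix; row i corresponds to names[i].
names = sorted(per_name_vectors)  # fix an order so row positions persist
matrix = np.stack([per_name_vectors[n] for n in names])

# Save matrix and name order together (torch.save of a dict would work
# the same way).
path = os.path.join(tempfile.mkdtemp(), "context_matrix.npz")
np.savez(path, matrix=matrix, names=np.array(names))

# On load, row positions are recovered from the saved name order.
loaded = np.load(path, allow_pickle=False)
row = {name: i for i, name in enumerate(loaded["names"].tolist())}
assert np.array_equal(loaded["matrix"][row["asthma"]],
                      per_name_vectors["asthma"])
```

One dense array loads far faster than millions of small per-name arrays scattered through a pickled dict, which is presumably where most of the 40 s → 240 s regression comes from.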

TODO:

  1. Figure out the best linking / disambiguation method
  2. Ensure lightweight saving/loading of models / vectors
  3. Create a config for this type of embedding linker. I think I only need to handle the embedding model and the batch size used to generate embeddings, as they aren't gracefully exposed at the moment.
  4. Add whatever is required to make this a default component
  5. Measure performance with other embedding models (e.g. BAAI/bge-m3)

Let me know any suggestions / whatever else I've missed. Thanks!

@mart-r mart-r marked this pull request as draft August 4, 2025 15:22

@mart-r mart-r left a comment


Overall, this sounds great!

I've just left "a few" comments on things I'm getting confused about.

for name in self._name_keys
]

def embed_cui_names(self,
Contributor


Looks like this would need to be explicitly called to init the class?

Contributor Author


I'm not sure myself about the best way to expose and use these methods that is consistent within the medcat library.

Currently in my testing I'm doing this with an existing MedCAT model and a well-defined pipeline:

cat.config.components.linking = EmbeddingLinking()
cat.config.components.linking.comp_name = EmbeddingLinker.name

embedding_linker = EmbeddingLinker(
    cdb=cat.cdb,
    config=cat.config
)

cat._pipeline._components[-1] = embedding_linker

cat._recreate_pipe()

embedding_linker.embed_cui_names(embedding_model_name="NeuML/pubmedbert-base-embeddings")
embedding_linker.embed_names(embedding_model_name="NeuML/pubmedbert-base-embeddings")

But obviously this isn't standard at all.

I would like to know whether there is a "standard" way of implementing this when initialising a linker in general, or if you have a recommendation of what I should do instead.

Contributor


My intuition would be that if you need something initialised for inference, you'd initialise it at init time (i.e in __init__).

With that said, since you write this data onto disk (as part of cdb.addl_info), you would need to check for its existence first. But even then, if the CDB changes, (some of) these would need to be recalculated. The CDB keeps track of the is_dirty attribute. If that's False, there have been no (API-based) changes to the CDB since the CDB load, and if it's True, then there have been changes. But I don't think I've set up anything to mark it undirty automatically, really. The only thing that's automatically dealt with is the subnames, but there's a separate flag for that.

@adam-sutton-1992
Contributor Author

TODO or discuss currently that I know of:

  • Initialising for inference: I think having it embed upon init might not be the best way, as it can be time consuming, especially if you're planning to change the embedding model. Although handling is_dirty gracefully is also required.
  • Testing with removal of stop words / punctuation.
  • Best way to embed names with separators gracefully, via reverse-engineering the method or a suitable compromise.
  • Optimal settings in the config as a base.
  • Optimal similarity_threshold for precision / recall / f1.

Otherwise the changes I've submitted have done a few things:

  • Added generating link candidates via names. If you choose to ignore candidates provided by the NER step, or have a detected entity with no candidates, some are generated. This slightly improves performance in both cases, and is even better when combined with filtering.
  • Removed infer-by-CUI. With the generation of link candidates via names it feels quite obsolete; we only infer via CUI names when disambiguating based on the longest name OR the preferred name (which could do with a bit more testing).
  • Various bug fixes and improvements as discussed here.

Contributor

@mart-r mart-r left a comment


Some small comments on the code

@mart-r
Contributor

mart-r commented Aug 22, 2025

Initialising for inference: I think having it embed upon init might not be the best way, as it can be time consuming, especially if you're planning to change the embedding model. Although handling is_dirty gracefully is also required.

You could also init at __call__ time when it's first needed. But that's really quite unintuitive, since it would mean the first inference call would take "forever" - even if it was for one small piece of text. Which could cause the user to think it's broken.
Like I said before though - as long as you provide a tutorial alongside that shows what the initial step(s) are to get this saved on disk, it'll probably be fine. So if the initial model creation involves an additional step (after setting config and stuff) to do the embedding, and the model is saved after that, I'd say that would be good enough.

As for is_dirty, I think the best you can do is to check for the flag and log a warning. Because you could recalculate at __call__ time if you find the CDB to be dirty, but this may - again - take so long that the user thinks the whole thing is broken. Though if the change is additive (in terms of CUIs or names) then you should be able to only embed the bits that were added. But things may not be as easy if there's stuff removed (though probably still technically doable).
The other option would be to overwrite some of the CDB's methods (to add / remove cuis / names) and either log a warning at that time or even raise an exception (this could also be a configuration option).
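That last option could be sketched as a thin wrapper; the class, method names, and config option here are all hypothetical, not the real CDB API:

```python
import warnings

class GuardedCDB:
    """Hypothetical wrapper that warns or raises when the wrapped CDB is
    mutated after name embeddings have been computed."""

    def __init__(self, cdb, on_change: str = "warn"):
        self._cdb = cdb
        self.on_change = on_change  # "warn" or "raise" (could be a config option)
        self.embeddings_stale = False

    def add_names(self, cui: str, names: list[str]) -> None:
        # Intercept the mutating call before delegating to the real CDB.
        if self.on_change == "raise":
            raise RuntimeError(
                "CDB changed after embedding; re-run embed_cui_names first.")
        warnings.warn("CDB changed after embedding; linking quality may drop.")
        self.embeddings_stale = True
        self._cdb.add_names(cui, names)

    def __getattr__(self, name):
        # Everything else passes through to the wrapped CDB unchanged.
        return getattr(self._cdb, name)
```

The stale flag would also let the linker decide at __call__ time whether to re-embed only the added names rather than everything.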

Best way to embed names with separators gracefully, via reverse-engineering the method or a suitable compromise.

I don't think there's really a way to do that gracefully. This would be a generative process - you're wanting to generate information that isn't there.
The only real way I could see this working is if you had the SNOMED release (and the UMLS bits that were used for enrichment) and then just mapped all the processed names to their originals. You could do this as a one time thing on a per model basis, but it'd be tedious to do in a general manner.

Overall, you might also want to add some tests to this. Probably with a very low-performing but small model - but something that shows the entire thing works in the technical sense.

@adam-sutton-1992
Contributor Author

I've just added this in the call method:

if self.cdb.is_dirty:
    logging.warning("CDB has been modified since last save/load. This might significantly affect linking performance.")
    logging.warning("If you have added new concepts or changes, please re-embed the CDB names and cuis before linking.")

I think there are tons of permutations I can do around being smart with embedding names and CUIs. But I think it might just not be worth it.

@mart-r
Contributor

mart-r commented Aug 22, 2025

I think that's fair.

Though maybe also tell them the method to use for the re-embedding, i.e.:

linker = self.cat._pipeline.get_component(CoreComponentType.linking)
linker.embed_cui_names(linker.embedding_model_name)

PS: I'd probably need to improve getting the pipe from the CAT object instead of using the protected attribute.

@adam-sutton-1992
Contributor Author

Hihi,

Additional changes:

  • first bits of unit testing.
  • Fixes based on feedback from said unit testing :)
  • Optimal hyper-params for the linker (including stop words, context window length, etc.).

Regarding:

  • PS: I'd probably need to improve getting the pipe from the CAT object instead of using the protected attribute.

I was curious what the best way is to access the embedding linker, as it needs to be called when embedding names and CUIs?

Currently I access the component via: cat._pipeline._components[-1], then call embedding_linker.create_embeddings(embedding_model_name="abhinand/MedEmbed-small-v0.1").

Which isn't ideal I suppose.

I think the only TODO in terms of development is deciding what to do with similarity thresholds. I'm thinking that if you generate your own link candidates, a higher threshold is appropriate, but when you get to larger contexts, a smaller threshold is better.
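That two-threshold idea could be sketched as follows; the threshold values and the `generated` flag are placeholders, not anything from the actual linker:

```python
def accept(similarity: float, generated: bool,
           th_generated: float = 0.7, th_provided: float = 0.4) -> bool:
    """Hypothetical acceptance rule: self-generated candidates must clear a
    stricter similarity bar than candidates handed over by the NER step."""
    return similarity >= (th_generated if generated else th_provided)

# A self-generated candidate needs high similarity to be kept...
print(accept(0.75, generated=True))   # True
print(accept(0.50, generated=True))   # False
# ...while an NER-provided candidate is accepted at a lower bar.
print(accept(0.50, generated=False))  # True
```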

@mart-r
Contributor

mart-r commented Aug 29, 2025

The API to access the component would be:

from medcat.components.types import CoreComponentType
comp = cat.pipe.get_component(CoreComponentType.linking)

EDIT: You may want to resync with master - otherwise you're stuck with using cat._pipeline to access the pipe.

@adam-sutton-1992
Contributor Author

Yep. That works nicely.

@mart-r
Contributor

mart-r commented Sep 2, 2025

The test failure should be fixed by #125
