
Conversation


@adam-sutton-1992 adam-sutton-1992 commented Aug 4, 2025

This is WIP, just showing the initial commit of the embedding linker and how I've tested / instantiated it.

WHAT IT DOES:

This is a linker that requires no training; however, it does have higher computational overhead.

Given an embedding model name (such as sentence-transformers/all-MiniLM-L6-v2 or BAAI/bge-m3), we embed all names in cdb.name2info. This results in an N-dimensional vector for each name.

When provided with a set of entities from the NER step, we embed the detected names (or the raw texts) and compare them to all embedded candidate names via cosine similarity.

From the most similar name we then obtain all CUIs for that name via cdb.name2info[predicted_name]["per_cui_status"].keys().
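In isolation, the retrieval step can be sketched like this (a toy character-trigram embedder stands in for the sentence-transformer, and the `name2info` entries below are made up):

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy stand-in for SentenceTransformer.encode: a normalised
    bag-of-character-trigrams vector."""
    vec = np.zeros(dim)
    t = text.lower()
    for i in range(len(t) - 2):
        vec[hash(t[i:i + 3]) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# Made-up stand-in for cdb.name2info.
name2info = {
    "diabetes mellitus": {"per_cui_status": {"C0011849": "P"}},
    "myocardial infarction": {"per_cui_status": {"C0027051": "P"}},
    "asthma": {"per_cui_status": {"C0004096": "P"}},
}

# One-time cost: embed every known name into an (N, dim) matrix.
names = list(name2info)
name_matrix = np.stack([embed(n) for n in names])

def link(detected_name: str) -> tuple[str, list[str]]:
    """Return the most similar known name and its candidate CUIs."""
    # Rows are unit-norm, so a dot product is the cosine similarity.
    sims = name_matrix @ embed(detected_name)
    best = names[int(np.argmax(sims))]
    return best, list(name2info[best]["per_cui_status"].keys())

print(link("diabetes melitus"))  # the typo still maps to "diabetes mellitus"
```

The win over an entity-by-entity loop is that a single matrix-vector product scores all N names at once.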

HOW IT'S BEEN TESTED:

I am currently replacing the linker in an existing model:

from medcat.components.types import CoreComponentType
from medcat.components.linking.embedding_linker import Linker as EmbeddingLinker
from medcat.components.types import register_core_component

cat = CAT.load_model_pack(...)
register_core_component(CoreComponentType.linking, EmbeddingLinker.name, EmbeddingLinker.create_new_component)

embedding_linker = EmbeddingLinker(
    cdb=cat.cdb,
    config=cat.config,
    embedding_model_name="sentence-transformers/all-MiniLM-L6-v2",
    max_length=64
)
cat.config.components.linking.comp_name = EmbeddingLinker.name
cat._recreate_pipe()
embedding_linker = cat._pipeline._components[-1]
embedding_linker.embed_names(embedding_model_name="sentence-transformers/all-MiniLM-L6-v2", batch_size=4096)

cat.save_model_pack(...)

Just a note: embedding with sentence-transformers/all-MiniLM-L6-v2 takes about 10 minutes for 3 million names; BAAI/bge-m3 takes about 1 hour 30 minutes. This is a one-time cost, though.

I then run this to test performance:

from medcat.cat import CAT
from medcat.components.types import register_core_component
from medcat.components.types import CoreComponentType
from medcat.components.linking.embedding_linker import Linker as EmbeddingLinker
from medcat.stats.stats import get_stats
from medcat.data.mctexport import MedCATTrainerExport

register_core_component(CoreComponentType.linking, EmbeddingLinker.name, EmbeddingLinker.create_new_component)
cat = CAT.load_model_pack("/data/adam/models/kch_gstt_v2_embedding_linker")

# load the distemist and snomed_ct test sets

all_projects = MedCATTrainerExport(projects=[])
all_projects["projects"].append(distemist)
all_projects["projects"].append(snomed_ct)
get_stats(cat, all_projects)

I've tested this on the original model, and twice with the new embedding linker using "sentence-transformers/all-MiniLM-L6-v2" as the embedding model. The first run just chooses the first CUI in self.cdb.name2info[predicted_name]["per_cui_status"].keys() if there are multiple. The second tries to disambiguate using the CUIs' preferred names and the detected name.
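The second strategy (disambiguating between candidate CUIs via their preferred names) amounts to something like the sketch below; the `embed` function, CUIs, and preferred names are all hypothetical stand-ins, not medcat API:

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy normalised character-trigram embedding (stand-in for the model)."""
    vec = np.zeros(dim)
    t = text.lower()
    for i in range(len(t) - 2):
        vec[hash(t[i:i + 3]) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# Hypothetical cdb.cui2info fragment: two CUIs share the ambiguous name "cold".
cui2info = {
    "C0009443": {"preferred_name": "common cold"},
    "C0009264": {"preferred_name": "cold temperature"},
}

def disambiguate(detected_text: str, candidate_cuis: list[str]) -> str:
    """Pick the candidate whose preferred name is closest to the detected text."""
    q = embed(detected_text)
    return max(candidate_cuis,
               key=lambda cui: float(embed(cui2info[cui]["preferred_name"]) @ q))

print(disambiguate("a cold temperature outside", ["C0009443", "C0009264"]))
```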

Using context_based_linker:
Epoch: 0, Prec: 0.09023910064860668, Rec: 0.3306035738148675, F1: 0.14177920150470955

Picking first cui, with an embedding linker:

Epoch: 0, Prec: 0.07762498227887255, Rec: 0.3095756163585588, F1: 0.12412585010647798

With embedding disambiguation:

Epoch: 0, Prec: 0.0650762760285772, Rec: 0.2595346605900696, F1: 0.10406025987586687

The only really important stat here is recall. While these numbers look low, I'm led to believe such runs usually come with filters, which I've avoided.

Issues / Discussion:

  • I am currently storing the embeddings in cdb.name2info[name]["context_vectors"] for each name. This makes the model pack 4.5 GB larger and increases loading time from 40 seconds to 240 seconds. If possible it would be nice to save a "context_matrix" where each row corresponds to a name (as long as name positioning persists), maybe via torch.save, although I'm not sure how to slot that in here.
  • Where the context-based linker takes about 15 minutes on a project, the embedding linker takes 1 hour 30 minutes. Doing it entity by entity took about 6 hours, but doing the cosine similarity calculation for all detected names at once cut that time down. There might be more efficiencies to shave off here which I don't see... yet
  • The performance numbers suggest I don't have disambiguation down yet; I'm open to suggestions. I am planning to try embedding only cdb.cui2info[cui]['preferred_name'], so I don't have to use cdb.name2info[predicted_name]["per_cui_status"].keys(). I'm also open to other suggestions, maybe using name status itself.
  • I'm not sure of a nice way to open up embed_names, besides addressing the component itself via protected variables:
embedding_linker = cat._pipeline._components[-1]
embedding_linker.embed_names(embedding_model_name="sentence-transformers/all-MiniLM-L6-v2", batch_size=4096)
  • I don't save the models themselves locally to the model pack. We're not training them, so just let them live in the Hugging Face Hub repo with their buddies.
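For the first bullet, the "context_matrix" idea could look roughly like this (shown with numpy instead of torch.save so the sketch is self-contained; the vectors and file path are placeholders — the key point is persisting one (N, dim) array plus the name order):

```python
import os
import tempfile

import numpy as np

# Hypothetical per-name vectors, as currently stored per name in
# cdb.name2info[name]["context_vectors"].
per_name_vectors = {
    "diabetes mellitus": np.ones(4, dtype=np.float32),
    "asthma": np.zeros(4, dtype=np.float32),
}

# Collapse the dict into one matrix; row i corresponds to names[i].
names = sorted(per_name_vectors)  # fix an order so row positions persist
matrix = np.stack([per_name_vectors[n] for n in names])

# Save matrix and name order together (torch.save of a dict would work
# the same way).
path = os.path.join(tempfile.mkdtemp(), "context_matrix.npz")
np.savez(path, matrix=matrix, names=np.array(names))

# On load, row positions are recovered from the saved name order.
loaded = np.load(path, allow_pickle=False)
row = {name: i for i, name in enumerate(loaded["names"].tolist())}
assert np.array_equal(loaded["matrix"][row["asthma"]],
                      per_name_vectors["asthma"])
```

One dense array loads far faster than millions of small per-name arrays scattered through a pickled dict, which is presumably where most of the 40 s → 240 s regression comes from.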

TODO:

  1. Figure out the best linking / disambiguation method
  2. Ensure lightweight saving/loading of models / vectors
  3. Create a config for this type of embedding linker. I think I only need to handle the embedding model and the batch size used to generate embeddings, as they aren't gracefully exposed at the moment.
  4. Add whatever is required to make this a default component
  5. Measure performance with other embedding models (e.g. BAAI/bge-m3)

Let me know any suggestions / whatever else I've missed. Thanks!

@mart-r mart-r marked this pull request as draft August 4, 2025 15:22

@mart-r mart-r left a comment


Overall, this sounds great!

I've just left "a few" comments on things I'm getting confused about.

for name in self._name_keys
]

def embed_cui_names(self,
Contributor


Looks like this would need to be explicitly called to init the class?

Contributor Author


I'm not sure myself about the best way to expose and use these methods that is consistent within the medcat library.

Currently in my testing I'm doing this with an existing MedCAT model and a well-defined pipeline:

cat.config.components.linking = EmbeddingLinking()
cat.config.components.linking.comp_name = EmbeddingLinker.name

embedding_linker = EmbeddingLinker(
    cdb=cat.cdb,
    config=cat.config
)

cat._pipeline._components[-1] = embedding_linker

cat._recreate_pipe()

embedding_linker.embed_cui_names(embedding_model_name="NeuML/pubmedbert-base-embeddings")
embedding_linker.embed_names(embedding_model_name="NeuML/pubmedbert-base-embeddings")

But obviously this isn't standard at all.

I would like to know whether there is a "standard" way of implementing this when initialising a linker in general, or if you have a recommendation of what I should do instead.

Contributor


My intuition would be that if you need something initialised for inference, you'd initialise it at init time (i.e in __init__).

With that said, since you write this data onto disk (as part of cdb.addl_info), you would need to check for its existence first. But even then, if the CDB changes, (some of) these would need to be recalculated. The CDB keeps track of the is_dirty attribute. If that's False, there have been no (API-based) changes to the CDB since the CDB load, and if it's True, then there have been changes. But I don't think I've set up anything to mark it undirty automatically, really. The only thing that's automatically dealt with is the subnames, but there's a separate flag for that.

@adam-sutton-1992
Contributor Author

TODO or discuss currently that I know of:

  • Initialising for inference: I think having it embed upon init might not be the best way, as it can be time consuming, especially if you're planning to change the embedding model. Although handling is_dirty gracefully is also required.
  • Testing with removal of stop words / punctuation.
  • Best way to embed names with separators gracefully, via reverse-engineering the method or a suitable compromise.
  • Optimal settings in the config as a base.
  • Optimal similarity_threshold for precision / recall / f1.

Otherwise the changes I've submitted have done a few things:

  • Added generating link candidates via names. If you choose to ignore candidates provided by the NER step, or have a detected entity with no candidates, some are generated. This slightly improves performance in both cases, and is even better when combined with filtering.
  • Removed infer-by-CUI. With the generation of link candidates via names it feels quite obsolete; we only infer via CUI names when disambiguating based on the longest name OR the preferred name (which could do with a bit more testing).
  • Various bug fixes and improvements as discussed here.

Contributor

@mart-r mart-r left a comment


Some small comments on the code

@mart-r
Contributor

mart-r commented Aug 22, 2025

Initialising for inference: I think having it embed upon init might not be the best way, as it can be time consuming, especially if you're planning to change the embedding model. Although handling is_dirty gracefully is also required.

You could also init at __call__ time when it's first needed. But that's really quite unintuitive, since it would mean the first inference call would take "forever" - even if it was for one small piece of text. Which could cause the user to think it's broken.
Like I said before though - as long as you provide a tutorial alongside that shows what the initial step(s) are to get this saved on disk, it'll probably be fine. So if the initial model creation involves an additional step (after setting config and stuff) to do the embedding, and the model is saved after that, I'd say that would be good enough.

As for is_dirty, I think the best you can do is to check for the flag and log a warning. Because you could recalculate at __call__ time if you find the CDB to be dirty, but this may - again - take so long that the user thinks the whole thing is broken. Though if the change is additive (in terms of CUIs or names) then you should be able to only embed the bits that were added. But things may not be as easy if there's stuff removed (though probably still technically doable).
The other option would be to overwrite some of the CDB's methods (to add / remove cuis / names) and either log a warning at that time or even raise an exception (this could also be a configuration option).
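That last option could be sketched as a thin wrapper; the class, method names, and config option here are all hypothetical, not the real CDB API:

```python
import warnings

class GuardedCDB:
    """Hypothetical wrapper that warns or raises when the wrapped CDB is
    mutated after name embeddings have been computed."""

    def __init__(self, cdb, on_change: str = "warn"):
        self._cdb = cdb
        self.on_change = on_change  # "warn" or "raise" (could be a config option)
        self.embeddings_stale = False

    def add_names(self, cui: str, names: list[str]) -> None:
        # Intercept the mutating call before delegating to the real CDB.
        if self.on_change == "raise":
            raise RuntimeError(
                "CDB changed after embedding; re-run embed_cui_names first.")
        warnings.warn("CDB changed after embedding; linking quality may drop.")
        self.embeddings_stale = True
        self._cdb.add_names(cui, names)

    def __getattr__(self, name):
        # Everything else passes through to the wrapped CDB unchanged.
        return getattr(self._cdb, name)
```

The stale flag would also let the linker decide at __call__ time whether to re-embed only the added names rather than everything.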

Best way to embed names with separators gracefully, via reverse-engineering the method or a suitable compromise.

I don't think there's really a way to do that gracefully. This would be a generative process - you're wanting to generate information that isn't there.
The only real way I could see this working is if you had the SNOMED release (and the UMLS bits that were used for enrichment) and then just mapped all the processed names to their originals. You could do this as a one time thing on a per model basis, but it'd be tedious to do in a general manner.

Overall, you might also want to add some tests to this. Probably with a very low-performing but small model - but something that shows the entire thing works in the technical sense.

@adam-sutton-1992
Contributor Author

I've just added this in the call method:

if self.cdb.is_dirty:
    logging.warning("CDB has been modified since last save/load. This might significantly affect linking performance.")
    logging.warning("If you have added new concepts or changes, please re-embed the CDB names and cuis before linking.")

I think there are tons of permutations I can do around being smart with embedding names and CUIs. But I think it might just not be worth it.

@mart-r
Contributor

mart-r commented Aug 22, 2025

I think that's fair.

Though maybe also tell them the method to use for the re-embedding, i.e.:

linker = self.cat._pipeline.get_component(CoreComponentType.linking)
linker.embed_cui_names(linker.embedding_model_name)

PS: I'd probably need to improve getting the pipe from the CAT object instead of using the protected attribute.

@adam-sutton-1992
Contributor Author

Hihi,

Additional changes:

  • first bits of unit testing.
  • Fixes based on feedback from said unit testing :)
  • Optimal hyper-params for the linker (including stop words, context window length, etc.).

Regarding:

  • PS: I'd probably need to improve getting the pipe from the CAT object instead of using the protected attribute.

I was curious what the best way is to access the embedding linker, as it needs to be called when embedding names and CUIs?

Currently I access the component via: cat._pipeline._components[-1], then call embedding_linker.create_embeddings(embedding_model_name="abhinand/MedEmbed-small-v0.1").

Which isn't ideal I suppose.

I think the only TODO in terms of development is deciding what to do with similarity thresholds. I'm thinking that if you generate your own link candidates, a higher threshold is appropriate, but when you get to larger contexts, a smaller threshold is better.
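That two-threshold idea could be sketched as follows; the threshold values and the `generated` flag are placeholders, not anything from the actual linker:

```python
def accept(similarity: float, generated: bool,
           th_generated: float = 0.7, th_provided: float = 0.4) -> bool:
    """Hypothetical acceptance rule: self-generated candidates must clear a
    stricter similarity bar than candidates handed over by the NER step."""
    return similarity >= (th_generated if generated else th_provided)

# A self-generated candidate needs high similarity to be kept...
print(accept(0.75, generated=True))   # True
print(accept(0.50, generated=True))   # False
# ...while an NER-provided candidate is accepted at a lower bar.
print(accept(0.50, generated=False))  # True
```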

@mart-r
Contributor

mart-r commented Aug 29, 2025

The API to access the component would be:

from medcat.components.types import CoreComponentType
comp = cat.pipe.get_component(CoreComponentType.linking)

EDIT: You may want to resync with master - otherwise you're stuck with using cat._pipeline to access the pipe.

@adam-sutton-1992
Contributor Author

Yep. That works nicely.

@mart-r
Contributor

mart-r commented Sep 2, 2025

The test failure should be fixed by #125
