Code: Open-source framework for detecting bias and overfitting for large pathology images

This repository contains the code used in the paper "Open-source framework for detecting bias and overfitting for large pathology images".

The code here covers:

- preprocessing slides into tiles using Vahadane normalization
- creating SSL models based on MoCo v1
  - these models can also be configured to do conditional sampling
- exporting embeddings to zarr arrays
- generating a tile-level annotation file
- fine-tuning Phikon/MoCo v1
- creating figures for the paper
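
The list above mentions conditional sampling, and the training commands further down pass a --batch-slide-num flag. One plausible reading is that each training batch is drawn from only a few slides. The sketch below illustrates that idea in plain Python; it is not the repository's implementation, and the function name, the tile-to-slide dictionary, and the equal share per slide are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def conditional_batches(tile_to_slide, batch_size, slides_per_batch, seed=0):
    """Yield batches of tile ids, each built from exactly `slides_per_batch` slides.

    tile_to_slide is a hypothetical dict mapping a tile id to its slide id.
    """
    per_slide = batch_size // slides_per_batch
    if per_slide < 1:
        raise ValueError("batch_size must be at least slides_per_batch")

    rng = random.Random(seed)
    by_slide = defaultdict(list)
    for tile, slide in tile_to_slide.items():
        by_slide[slide].append(tile)
    for slide_tiles in by_slide.values():
        rng.shuffle(slide_tiles)

    while True:
        # only slides that can still contribute a full share are eligible
        candidates = [s for s, t in by_slide.items() if len(t) >= per_slide]
        if len(candidates) < slides_per_batch:
            return
        batch = []
        for slide in rng.sample(candidates, slides_per_batch):
            for _ in range(per_slide):
                batch.append(by_slide[slide].pop())
        yield batch


# toy data: 6 made-up slides with 8 tiles each
tiles = {f"slide{s}_tile{t}": f"slide{s}" for s in range(6) for t in range(8)}
batches = list(conditional_batches(tiles, batch_size=8, slides_per_batch=4))
print(len(batches), "batches, first batch:", batches[0])
```

Every batch here holds 8 tiles taken from 4 slides, and no tile is used twice. Leftover tiles that cannot fill a complete batch are simply dropped in this sketch.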

The UMAP and linear probing code from the paper is kept separate from this repository to make it easier to use as standalone tools.

Dataset

The dataset is TCGA-LUSC. It can be downloaded from the official portals (I'm not giving a link since it keeps changing). The annotations I used are in the annotations folder. For clinical annotations, you only need the filename, but if you want extended clinical information I recommend downloading the TCGA annotations from Liu et al. (2018). These annotations are downloaded from the same dataset; look for "clinical" and "slide", which should give you two separate .tsv files.
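
Since the basic labels can be read from the slide filename, here is a minimal sketch of splitting a TCGA slide barcode into its fields. The field layout follows the public TCGA barcode convention (project, tissue source site, participant, sample type and vial, portion, slide). The function name and the example filename are made up for illustration, and this is not necessarily how the scripts in this repository parse filenames.

```python
import re
from pathlib import Path

# TCGA slide barcode: TCGA-<TSS>-<participant>-<sample><vial>-<portion>-<slide>
BARCODE_RE = re.compile(
    r"^TCGA-(?P<tss>[A-Z0-9]{2})-(?P<participant>[A-Z0-9]{4})"
    r"-(?P<sample>\d{2})(?P<vial>[A-Z])-(?P<portion>\d{2})"
    r"-(?P<slide>[A-Z]{2}[A-Z0-9])"
)


def parse_tcga_filename(filename):
    """Return the barcode fields of a TCGA slide filename as a dict."""
    match = BARCODE_RE.match(Path(filename).name)
    if match is None:
        raise ValueError(f"not a TCGA slide barcode: {filename}")
    info = match.groupdict()
    # the first three fields identify the patient (case)
    info["case_id"] = f"TCGA-{info['tss']}-{info['participant']}"
    return info


# hypothetical filename, not a real slide
info = parse_tcga_filename("/data/TCGA-LUSC/TCGA-XX-0000-01A-01-TS1.example.svs")
print(info["case_id"], info["tss"], info["sample"], info["slide"])
```

In the barcode convention, sample code 01 denotes a primary solid tumor and 11 denotes solid normal tissue, and the TSS field is the tissue source site.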

Installation

# install miniconda. Then:
conda create -y -c conda-forge --name overfit-detection python=3.12.1 --file requirements.txt
# OR, with python >=3.11:
# python -m venv venv && source venv/bin/activate && pip install -r requirements.txt

# torch is no longer available from conda, so it has to be installed afterwards with pip
pip install monai torch torchvision opencv-python

Recreating the paper

The steps below assume that the raw dataset from the TCGA portal is in "/data/TCGA-LUSC". We use ipython since regular python may give a "module not found" error:

# create tiles and annotations
# the current default color normalization is Vahadane, but you can change this in the script
# other papers report that normalization has little impact on TSS (tissue source site) bias,
# so you could switch to a faster color normalization to speed this step up
ipython preprocessing/process_tcga.py -- --wsi-path /data/TCGA-LUSC --out-dir /data/TCGA-LUSC-tiles
ipython preprocessing/gen_tcga_tile_labels.py -- --data-dir /data/TCGA-LUSC-tiles --out-dir out

# train the models. On our machine this took about 3 days per model
# you can also skip this and just use PhikonV2 (next steps) to avoid training
ipython train_model/train_ssl.py -- --condition --batch-slide-num 4 --src-dir /data/TCGA-LUSC-tiles --epochs 300 --moco-k 128
ipython train_model/train_ssl.py -- --no-condition --src-dir /data/TCGA-LUSC-tiles --epochs 300 --moco-k 128
ipython train_model/train_ssl.py -- --no-condition --src-dir /data/TCGA-LUSC-tiles --epochs 300 --moco-k 65536

ipython feature_extraction/extract_features_phikon2.py -- --src-dir /data/TCGA-LUSC-tiles --out-dir out/
ipython feature_extraction/extract_features_inceptionv4.py -- --src-dir /data/TCGA-LUSC-tiles \
  --out-dir out --model-pth 'out/models/MoCo/TCGA_LUSC/model/checkpoint_MoCo_TCGA_LUSC_0200_False_m128_n0_o0_K128.pth.tar'
# ...repeat for the other models...

After running the above, you'll have embeddings saved in ./out/*.zarr. These can then be used by our feature_inspect package. To view model stats for InceptionV4, you can use tensorboard: tensorboard --logdir=out/models/MoCo/TCGA_LUSC/model/
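
As a quick check that the extraction step produced output, the sketch below lists the zarr stores in an output directory using only the standard library. The helper name is hypothetical, and the demo runs against a temporary directory with made-up store names rather than real embeddings.

```python
import tempfile
from pathlib import Path


def find_embedding_stores(out_dir):
    """Return all *.zarr stores directly under out_dir, sorted by path."""
    return sorted(Path(out_dir).glob("*.zarr"))


# demo on a throwaway directory; in practice you would pass "out"
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.zarr", "a.zarr", "models"):
        (Path(tmp) / name).mkdir()
    store_names = [p.name for p in find_embedding_stores(tmp)]

print(store_names)
```

Opening a store requires the third-party zarr package (for example zarr.open(str(path), mode="r")). The array names inside each store are not documented in this README, so inspect a store before relying on a particular layout.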

License

This code is under the Apache 2.0 license. See LICENSE.

uit-hdl/code-overfit-detection-framework

Folders and files

Latest commit

History

Repository files navigation

Code: Open-source framework for detecting bias and overfitting for large pathology images

Dataset

Installation

Recreating the paper

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages