Source code of the 70th-place solution for the Bristol-Myers Squibb – Molecular Translation Kaggle competition.
You can read about the solution in detail here.
The main goal is to translate chemical structure images into InChI notation, a machine-readable text format. This is an image-captioning task, solved here with a CNN+RNN architecture.
- EfficientNet as the encoder and LSTM+Attention as the decoder.
- Adaptive batch sampler (the higher a sample's loss, the higher the probability the sample is added to a batch).
- Predict SMILES notation and then convert it to InChI (SMILES is much simpler to predict: the notation is shorter and has fewer token classes).
- Freeze the encoder and train the LSTM decoder separately.
- More experiments with synthetic image generation.
- Split the InChI into up to 8 independent string layers (separated by the "/" delimiter: "/b", "/t", "/m", "/s", etc.) and train a separate model for each layer. Each layer of an InChI describes different information about the molecule, so several models trained on separate InChI layers can achieve good results.
- Add rotation transforms at different angles.
- Add beam search.
- Try replacing the LSTM with a Transformer.
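The adaptive batch sampler mentioned above can be sketched in a few lines. This is a minimal illustration, not the repo's implementation: the class and method names are hypothetical, and it assumes per-sample losses are fed back to the sampler after each training step.

```python
import random

class AdaptiveBatchSampler:
    """Draw batch indices with probability proportional to each sample's last loss."""

    def __init__(self, num_samples, batch_size, seed=0):
        # Optimistic init: every sample starts equally likely to be drawn.
        self.losses = [1.0] * num_samples
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def update(self, indices, losses):
        # Record the latest loss for each sample seen in the last batch.
        for i, loss in zip(indices, losses):
            self.losses[i] = loss

    def next_batch(self):
        # Weighted sampling: higher-loss samples appear in batches more often.
        population = range(len(self.losses))
        return self.rng.choices(population, weights=self.losses, k=self.batch_size)
```

In practice this idea is usually wrapped in a framework-specific sampler (e.g. a PyTorch `Sampler` subclass), but the weighting logic is the same.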
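To illustrate the layer-splitting idea, here is a sketch of how an InChI string can be decomposed into its "/"-separated layers. The helper name and the exact prefix set are illustrative assumptions, not code from this repo:

```python
def split_inchi_layers(inchi, prefixes=("c", "h", "b", "t", "m", "s", "i")):
    """Split an InChI into its layers, keyed by the layer's letter prefix.

    e.g. "InChI=1S/C2H6O/c1-3-2/h3H,1-2H3" ->
         {"formula": "C2H6O", "c": "1-3-2", "h": "3H,1-2H3"}
    """
    body = inchi.split("=", 1)[1]   # drop the "InChI=" header
    parts = body.split("/")         # parts[0] is the version ("1S"), parts[1] the formula
    layers = {"formula": parts[1]}
    for part in parts[2:]:
        if part and part[0] in prefixes:
            layers[part[0]] = part[1:]
    return layers
```

Each resulting layer can then serve as the target string for its own model, and the per-layer predictions are concatenated back into a full InChI at inference time.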
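Beam search from the list above can be sketched as a framework-free decoder that keeps the top-`beam_width` partial sequences by cumulative log-probability. The `step_fn` interface (a callable returning next-token log-probs for a prefix) is a hypothetical stand-in for the decoder network:

```python
def beam_search(step_fn, start_token, end_token, beam_width=3, max_len=10):
    """step_fn(seq) -> {token: log_prob} for the next token after prefix `seq`."""
    beams = [([start_token], 0.0)]  # (sequence, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        # Expand every live beam by every candidate next token.
        candidates = []
        for seq, score in beams:
            for tok, logp in step_fn(seq).items():
                candidates.append((seq + [tok], score + logp))
        # Keep only the best beam_width candidates.
        candidates.sort(key=lambda x: x[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_width]:
            if seq[-1] == end_token:
                finished.append((seq, score))  # completed hypothesis
            else:
                beams.append((seq, score))
        if not beams:
            break
    # Best hypothesis overall, finished or not.
    return max(finished + beams, key=lambda x: x[1])
```

A greedy decoder is the `beam_width=1` special case; widening the beam trades compute for a better chance of recovering the globally most likely transcription.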
- Nvidia drivers >= 465, CUDA >= 11.3
- Docker, nvidia-docker
A Dockerfile is provided to build an image with CUDA and cuDNN support.
- Clone the repo:

  ```bash
  git clone git@github.com:skalinin/BMS-Molecular-Translation.git
  cd BMS-Molecular-Translation
  ```

- Download and extract the dataset to the `data/bms-molecular-translation` folder.

- Run `sudo make all` to build a docker image and create a container.

- Preprocess the dataset CSVs, generate additional synthetic images, and tokenize the data:

  ```bash
  python src/scripts/data_preprocess/prepare_csv/BMS_preprocess.py
  python src/scripts/data_preprocess/prepare_csv/external_synth_data.py
  python src/scripts/data_preprocess/tokenize_csv.py
  ```

- Train the model:

  ```bash
  python src/scripts/train.py
  ```

- Make the submission CSV:

  ```bash
  python src/scripts/submission.py \
      --encoder_pretrain /path/to/encoder-weights \
      --decoder_pretrain /path/to/decoder-weights
  ```