skalinin/BMS-Molecular-Translation

Solution for Bristol-Myers Squibb – Molecular Translation

Source code of the 70th place solution for Bristol-Myers Squibb – Molecular Translation.

You can read about the solution in detail here.

Solution

The goal is to translate images of chemical structures into InChI strings, a machine-readable text format. This is an image-captioning task, solved here with a CNN+RNN architecture.

Key points:

  • EfficientNet as the encoder and an LSTM with attention as the decoder.
  • Adaptive batch sampler: the higher a sample's loss, the higher the probability that it is added to a batch.
  • Predict SMILES notation and then convert it to InChI (SMILES is much simpler to predict: the notation is shorter and has fewer token classes).
  • Freeze the encoder and train the LSTM decoder separately.
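
The adaptive batch sampler idea can be sketched in a few lines of plain Python. This is a minimal illustration, not the repository's implementation: the class name `LossWeightedBatchSampler`, the `init_loss` and `min_weight` parameters, and the choice to sample without replacement are all assumptions made for the example.

```python
import random


class LossWeightedBatchSampler:
    """Draws batches in which high-loss samples are more likely to appear.

    Hypothetical sketch: names and defaults are illustrative only.
    """

    def __init__(self, num_samples, batch_size, init_loss=1.0, min_weight=1e-6, seed=0):
        if batch_size > num_samples:
            raise ValueError("batch_size cannot exceed num_samples")
        self.batch_size = batch_size
        self.min_weight = min_weight  # floor that keeps every sample reachable
        # Before a sample has been seen, assume the same loss for all samples.
        self.losses = [float(init_loss)] * num_samples
        self.rng = random.Random(seed)

    def _weights(self):
        return [max(loss, self.min_weight) for loss in self.losses]

    def update(self, indices, losses):
        """Store the latest per-sample losses reported by the training step."""
        for idx, loss in zip(indices, losses):
            self.losses[idx] = float(loss)

    def probabilities(self):
        """Single-draw probability of each sample, normalized to sum to 1."""
        weights = self._weights()
        total = sum(weights)
        return [w / total for w in weights]

    def sample_batch(self):
        """Weighted sampling without replacement (Efraimidis-Spirakis keys)."""
        keys = [
            (self.rng.random() ** (1.0 / w), idx)
            for idx, w in enumerate(self._weights())
        ]
        keys.sort(reverse=True)  # the largest keys win
        return [idx for _, idx in keys[: self.batch_size]]
```

In a training loop, one would call `sample_batch()` to pick indices, compute an unreduced per-sample loss for that batch, and pass the values back through `update()` so that hard samples are revisited more often.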

What other experiments could be done:

  • More experiments with synthetic image generation.
  • Split InChI into up to 8 independent string layers (separated by "/" and identified by prefixes such as "/b", "/t", "/m", "/s") and train a separate model for each layer. Each layer describes different information about the molecule, so several models trained on separate InChI layers may give good results.
  • Add rotation transforms at different angles.
  • Add beam search.
  • Try replacing the LSTM with a Transformer.
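
To make the layer-splitting idea concrete, here is a small sketch of how an InChI string could be broken into per-layer training targets. The function name `split_inchi_layers` and the labels "version" and "formula" are assumptions for illustration; the result is an ordered list rather than a dictionary because some prefixes can appear more than once in a full InChI (for example, in the isotopic section).

```python
def split_inchi_layers(inchi):
    """Split an InChI string into an ordered list of (layer_name, value) pairs.

    Hypothetical helper: layers are labeled by their one-letter prefix, while
    the version and formula parts (which have no prefix) get explicit names.
    """
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI string: %r" % inchi)
    parts = inchi[len("InChI="):].split("/")
    layers = [("version", parts[0])]
    for position, part in enumerate(parts[1:]):
        if position == 0 and part and not part[0].islower():
            # The formula directly follows the version and carries no prefix.
            layers.append(("formula", part))
        else:
            # All other layers start with a lowercase one-letter prefix.
            layers.append((part[:1], part[1:]))
    return layers
```

For ethanol, `split_inchi_layers("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3")` yields the version, the formula, the connectivity layer "c" and the hydrogen layer "h", each of which could serve as the target string for its own model.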

Quick setup and start

The provided Dockerfile builds an image with CUDA and cuDNN support.

Preparations

  • Clone the repo.

    git clone git@github.com:skalinin/BMS-Molecular-Translation.git
    cd BMS-Molecular-Translation
  • Download and extract the dataset to the data/bms-molecular-translation folder.

  • Run sudo make all to build a Docker image and create a container.

  • Preprocess the dataset CSV, generate additional synthetic images, and tokenize the data.

    python src/scripts/data_preprocess/prepare_csv/BMS_preprocess.py
    python src/scripts/data_preprocess/prepare_csv/external_synth_data.py
    python src/scripts/data_preprocess/tokenize_csv.py
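
What tokenization means for SMILES targets can be illustrated with a short regex-based sketch. This is not the code in tokenize_csv.py: the pattern, the function names `tokenize_smiles` and `build_vocab`, and the special tokens are assumptions chosen for the example, and the pattern does not validate chemistry.

```python
import re

# Order matters: longer tokens are listed before the single-character fallbacks.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]"            # bracket atoms such as [NH4+] or [C@@H]
    r"|Br|Cl"                # two-letter organic-subset atoms
    r"|%\d{2}"               # two-digit ring-closure labels
    r"|[A-Za-z]"             # any single letter (not checked against real elements)
    r"|\d"                   # one-digit ring-closure labels
    r"|[=#\-+()/\\.:@*~$]"   # bonds, branches and other symbols
)


def tokenize_smiles(smiles):
    """Split a SMILES string into tokens; fail if any character is skipped."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError("unrecognized characters in SMILES: %r" % smiles)
    return tokens


def build_vocab(token_lists, specials=("<pad>", "<sos>", "<eos>")):
    """Map each token to an integer id, reserving the first ids for specials."""
    vocab = {tok: i for i, tok in enumerate(specials)}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab
```

Treating "Cl", "Br" and bracket atoms as single tokens keeps the vocabulary small and the target sequences short, which is the property that makes SMILES easier for the decoder to predict than InChI.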

Run

  • Train the model

    python src/scripts/train.py
  • Make the submission CSV

    python src/scripts/submission.py \
      --encoder_pretrain /path/to/encoder-weights \
      --decoder_pretrain /path/to/decoder-weights
