Source code of the 70th-place solution for the Bristol-Myers Squibb – Molecular Translation Kaggle competition.
You can read about the solution in detail here.
The main goal is to translate chemical structure images into InChI notation, a machine-readable text format. This is an image-captioning task, solved here with a CNN+RNN architecture.
- EfficientNet as the encoder and LSTM+Attention as the decoder.
- Adaptive batch sampler (the higher a sample's loss, the higher the probability the sample is added to a batch).
- Predict SMILES notation and then convert it to InChI (SMILES is much simpler to predict: the notation is shorter and has fewer token classes).
- Freeze the encoder and train the LSTM decoder separately.
- More experiments with synthetic image generation.
- Split the InChI into up to 8 independent string layers (separated by the "/" delimiter: "/b", "/t", "/m", "/s", etc.) and train a separate model for each layer. Each layer of an InChI describes different information about the molecule, so several models trained on separate InChI layers can achieve good results.
- Add rotation transforms at different angles.
- Add beam search.
- Try replacing the LSTM with a Transformer.
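The adaptive batch sampler mentioned above can be sketched in a few lines. This is a minimal illustration, not the repo's implementation: the class and method names are hypothetical, and it assumes per-sample losses are fed back to the sampler after each training step.

```python
import random

class AdaptiveBatchSampler:
    """Draw batch indices with probability proportional to each sample's last loss."""

    def __init__(self, num_samples, batch_size, seed=0):
        # Optimistic init: every sample starts equally likely to be drawn.
        self.losses = [1.0] * num_samples
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def update(self, indices, losses):
        # Record the latest loss for each sample seen in the last batch.
        for i, loss in zip(indices, losses):
            self.losses[i] = loss

    def next_batch(self):
        # Weighted sampling: higher-loss samples appear in batches more often.
        population = range(len(self.losses))
        return self.rng.choices(population, weights=self.losses, k=self.batch_size)
```

In practice this idea is usually wrapped in a framework-specific sampler (e.g. a PyTorch `Sampler` subclass), but the weighting logic is the same.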
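To illustrate the layer-splitting idea, here is a sketch of how an InChI string can be decomposed into its "/"-separated layers. The helper name and the exact prefix set are illustrative assumptions, not code from this repo:

```python
def split_inchi_layers(inchi, prefixes=("c", "h", "b", "t", "m", "s", "i")):
    """Split an InChI into its layers, keyed by the layer's letter prefix.

    e.g. "InChI=1S/C2H6O/c1-3-2/h3H,1-2H3" ->
         {"formula": "C2H6O", "c": "1-3-2", "h": "3H,1-2H3"}
    """
    body = inchi.split("=", 1)[1]   # drop the "InChI=" header
    parts = body.split("/")         # parts[0] is the version ("1S"), parts[1] the formula
    layers = {"formula": parts[1]}
    for part in parts[2:]:
        if part and part[0] in prefixes:
            layers[part[0]] = part[1:]
    return layers
```

Each resulting layer can then serve as the target string for its own model, and the per-layer predictions are concatenated back into a full InChI at inference time.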
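Beam search from the list above can be sketched as a framework-free decoder that keeps the top-`beam_width` partial sequences by cumulative log-probability. The `step_fn` interface (a callable returning next-token log-probs for a prefix) is a hypothetical stand-in for the decoder network:

```python
def beam_search(step_fn, start_token, end_token, beam_width=3, max_len=10):
    """step_fn(seq) -> {token: log_prob} for the next token after prefix `seq`."""
    beams = [([start_token], 0.0)]  # (sequence, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        # Expand every live beam by every candidate next token.
        candidates = []
        for seq, score in beams:
            for tok, logp in step_fn(seq).items():
                candidates.append((seq + [tok], score + logp))
        # Keep only the best beam_width candidates.
        candidates.sort(key=lambda x: x[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_width]:
            if seq[-1] == end_token:
                finished.append((seq, score))  # completed hypothesis
            else:
                beams.append((seq, score))
        if not beams:
            break
    # Best hypothesis overall, finished or not.
    return max(finished + beams, key=lambda x: x[1])
```

A greedy decoder is the `beam_width=1` special case; widening the beam trades compute for a better chance of recovering the globally most likely transcription.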
- Nvidia drivers >= 465, CUDA >= 11.3
- Docker, nvidia-docker
A Dockerfile is provided to build an image with CUDA and cuDNN support.
- Clone the repo:

  ```bash
  git clone git@github.com:skalinin/BMS-Molecular-Translation.git
  cd BMS-Molecular-Translation
  ```

- Download and extract the dataset to the `data/bms-molecular-translation` folder.

- Run `sudo make all` to build a docker image and create a container.

- Preprocess the dataset CSVs, generate additional synthetic images, and tokenize the data:

  ```bash
  python src/scripts/data_preprocess/prepare_csv/BMS_preprocess.py
  python src/scripts/data_preprocess/prepare_csv/external_synth_data.py
  python src/scripts/data_preprocess/tokenize_csv.py
  ```

- Train the model:

  ```bash
  python src/scripts/train.py
  ```

- Make the submission CSV:

  ```bash
  python src/scripts/submission.py \
      --encoder_pretrain /path/to/encoder-weights \
      --decoder_pretrain /path/to/decoder-weights
  ```