
Conditional Unigram Tokenization with Parallel Data

This is the repository for the paper Conditional Unigram Tokenization with Parallel Data. Conditional unigram tokenization extends unigram tokenization by conditioning target-token probabilities on source-language tokens from parallel data.
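
As a rough, informal illustration of the idea (a sketch only, not the exact formulation from the paper): a standard unigram tokeniser scores a candidate target token with a context-independent probability, while the conditional variant scores it with probabilities conditioned on the tokens of the parallel source sentence, aggregated for example by summing or taking the maximum (cf. the sum and max model types below). All tokens and numbers in the snippet are made up.

# Toy sketch of unigram vs. conditional unigram scoring (illustration only).
unigram_p = {"_token": 0.010, "_tok": 0.008}

# p(target token | source token), toy values
cond_p = {
    "_token": {"_Token": 0.40, "_das": 0.01},
    "_tok":   {"_Token": 0.05, "_das": 0.01},
}

def unigram_score(trg_token):
    # Context-independent: ignores the source sentence entirely
    return unigram_p.get(trg_token, 0.0)

def conditional_score(trg_token, src_tokens):
    # "sum" variant: aggregate p(trg | src) over the source-sentence tokens
    return sum(cond_p.get(trg_token, {}).get(s, 0.0) for s in src_tokens)

print(unigram_score("_token"))                          # 0.01, regardless of context
print(conditional_score("_token", ["_Token", "_das"]))  # 0.41, given this source sentence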

Usage

Training the tokeniser

Example:

python -m paired_sp train -m cond-exact --src tokenised-source-data --trg target-data -p -v 32000 -o tokeniser.tar.gz -i 2 --sub-iter 2

Parameters:

usage: Paired SentencePiece train [-h] [--no-preload] [--tqdm] [--debug] [--info] --model {max,sum,max-unpaired,sum-unpaired,cond-exact,pmi,cpmi} [--src SRC] --trg TRG [--maxlen MAXLEN] [--pretokenize]
                                  [--vocab-size VOCAB_SIZE] --output OUTPUT [--iter ITER] [--sub-iter SUB_ITER] [--reduce REDUCE] [--seed SEED] [--char-coverage CHAR_COVERAGE] [--temp TEMP]
                                  [--threshold THRESHOLD] [--spans SPANS] [--trainer {mc,em}]

options:
  -h, --help            show this help message and exit
  --no-preload          Do not preload all the datasets in memory.
  --tqdm                Show tqdm progress bars.
  --debug               Logger debug
  --info                Logger info
  --model {max,sum,max-unpaired,sum-unpaired,cond-exact,pmi,cpmi}, -m {max,sum,max-unpaired,sum-unpaired,cond-exact,pmi,cpmi}
                        Type of model.
  --src SRC             TOKENIZED sentences in the source language. For Unpaired models: use the same file for --trg.
  --trg TRG             PLAIN TEXT sentences in the target language.
  --maxlen MAXLEN, -l MAXLEN
                        Maximum length of the spans in the target language. -1 for unlimited length.
  --pretokenize, -p     Split the target sentences on white spaces.
  --vocab-size VOCAB_SIZE, -v VOCAB_SIZE
                        Vocabulary size, default 8000.
  --output OUTPUT, -o OUTPUT
                        Where to save the model
  --iter ITER, -i ITER  Maximum number of EM iterations. Stop early if the desired vocabulary size is reached.
  --sub-iter SUB_ITER   Reduce the vocabulary after this many iterations of the EM algorithm.
  --reduce REDUCE, -r REDUCE
                        Reduce this portion of vocabulary at each iteration.
  --seed SEED, -s SEED  RNG seed.
  --char-coverage CHAR_COVERAGE
                        Keep this fraction of the most frequent characters.
  --temp TEMP           Temp. file for intermediate tokenization.
  --threshold THRESHOLD
                        Remove spans that occur fewer times than the threshold.
  --spans SPANS         Use a fixed number of spans.
  --trainer {mc,em}     Trainer to use.
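
Taken together, --iter, --sub-iter, and --reduce determine how the vocabulary shrinks during training: the EM algorithm runs for at most --iter iterations, the vocabulary is pruned by the --reduce fraction after every --sub-iter iterations, and training stops early once --vocab-size is reached. A back-of-the-envelope sketch of that schedule follows; the starting size is a toy number, and the real trainer prunes spans by score rather than just tracking sizes, so its exact behaviour may differ.

# Hedged sketch of the vocabulary-size schedule implied by --iter, --sub-iter
# and --reduce. It only tracks sizes; the actual pruning criterion differs.
def vocab_schedule(start_size, target_size, iters, sub_iter, reduce):
    sizes = [start_size]
    size = start_size
    for em_step in range(1, iters + 1):
        if em_step % sub_iter == 0:            # prune after every `sub_iter` EM steps
            size = max(target_size, int(size * (1.0 - reduce)))
            sizes.append(size)
        if size <= target_size:                # stop early; --iter is only a maximum
            break
    return sizes

# E.g. starting from 200k candidate spans, pruning 25% every 2 EM iterations:
print(vocab_schedule(200_000, 32_000, iters=40, sub_iter=2, reduce=0.25))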

Tokenising a text file:

Example:

python -m paired_sp tokenize -m cond-exact --model-file tokeniser.tar.gz --src tokenised-source-data --trg target-data -o tokenised-target-data

Parameters:

usage: Paired SentencePiece tokenize [-h] [--no-preload] [--tqdm] [--debug] [--info] --model {max,sum,max-unpaired,sum-unpaired,cond-exact,pmi,cpmi} --model-file MODEL_FILE [--src SRC] --trg TRG
                                     [--alignment ALIGNMENT] [--output OUTPUT]

options:
  -h, --help            show this help message and exit
  --no-preload          Do not preload all the datasets in memory.
  --tqdm                Show tqdm progress bars.
  --debug               Logger debug
  --info                Logger info
  --model {max,sum,max-unpaired,sum-unpaired,cond-exact,pmi,cpmi}, -m {max,sum,max-unpaired,sum-unpaired,cond-exact,pmi,cpmi}
                        Type of model.
  --model-file MODEL_FILE
                        tar.gz file that contains the model.
  --src SRC             TOKENIZED sentences in the source language. For Unpaired models: use the same file for --trg.
  --trg TRG             PLAIN TEXT sentences in the target language.
  --alignment ALIGNMENT
                        Eflomal alignment.
  --output OUTPUT, -o OUTPUT
                        Where to save the tokenized data. Default STDOUT.

Exporting the vocabulary

Example:

python -m paired_sp vocab -m cond-exact --model-file tokeniser.tar.gz -o vocab.yaml

Parameters:

usage: Paired SentencePiece vocab [-h] [--no-preload] [--tqdm] [--debug] [--info] --model {max,sum,max-unpaired,sum-unpaired,cond-exact,pmi,cpmi} --model-file MODEL_FILE -o OUTPUT

options:
  -h, --help            show this help message and exit
  --no-preload          Do not preload all the datasets in memory.
  --tqdm                Show tqdm progress bars.
  --debug               Logger debug
  --info                Logger info
  --model {max,sum,max-unpaired,sum-unpaired,cond-exact,pmi,cpmi}, -m {max,sum,max-unpaired,sum-unpaired,cond-exact,pmi,cpmi}
                        Type of model.
  --model-file MODEL_FILE
                        tar.gz file that contains the model.
  -o OUTPUT, --output OUTPUT
                        Output file (yaml).
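
The exported file can then be read with any YAML library. A minimal sketch, assuming vocab.yaml is a flat mapping from tokens (spans) to a score or count; the exact schema is defined by paired_sp and may differ.

# Hedged sketch: inspect the exported vocabulary. Assumes a flat
# token -> value mapping; adjust if paired_sp uses a different schema.
import yaml  # pip install pyyaml

with open("vocab.yaml", encoding="utf-8") as f:
    vocab = yaml.safe_load(f)

print(len(vocab), "entries")
for token, value in list(vocab.items())[:10]:
    print(repr(token), value)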

Project Structure

Main folder

  • Snakefile: main snakemake file to train and score most of the models
  • Snakefile-run.sh: example of how to run Snakefile on a Slurm cluster
  • do_lm_train.sh, lm_psp_train.sh, lm_sp_train.sh: scripts to train some of the LMs without Snakemake

paired_sp

Main Python module for training and using the PairedSP tokeniser.

data

This folder contains the training data for the tokeniser and the tokenised data for the other tasks. It has a subfolder for each language pair (e.g., ces-ukr, and the link ukr-ces).

The training data is in the following files (ces-ukr as an example):

  • flores.dev.ces
  • flores.dev.ukr
  • flores.devtest.ces
  • flores.devtest.ukr
  • train.100000.ces (similar pattern for Ukrainian and the other training sizes)
  • train.ces
  • train.ukr

For each language pair there is an align folder with the word-aligned training data (e.g., align/train.500000.ces) and the intermediate eflomal files.

The tokenised data uses the following patterns:

  • psp/8000/100000/align/flores.dev.ces.tok: tokenised with PairedSP trained on 100k sentences with an 8k vocabulary
  • unigram/16000/500000/flores.devtest.ukr.tok: tokenised with SentencePiece trained on 500k sentences with a 16k vocabulary
  • unigram/16000/500000/align/train.500000.ces.tok: tokenised word aligned training data

Snakemake should be able to download and tokenise the data, except for deu-hsb, which has to be downloaded manually.

models

This folder contains the tokenisers, MT models, and LMs.

For the tokenisers, there is a subfolder for each language pair. The pattern is as follows:

  • 8000/100000/align_unigram.ukr.scores.eflomal: alignment scores for Ukrainian SentencePiece (100k training sentences, 8k vocabulary)
  • 8000/100000/unigram.ukr.scores: intrinsic scores for Ukrainian SentencePiece
  • 8000/100000/unigram.ces.model: Czech SentencePiece model
  • 8000/100000/unigram.ces.vocab: Czech SentencePiece vocabulary (generated by SentencePiece)
  • 8000/100000/unigram.ukr.yml: Ukrainian SentencePiece vocabulary (converted to paired_sp)
  • 8000/100000/align/align_unigram.ukr.scores.eflomal: PairedSP alignment scores for Ukrainian
  • 8000/100000/align/ukr.scores: PairedSP intrinsic scores
  • 8000/100000/align/ukr.tar.gz: PairedSP model
  • 8000/100000/align/ukr.yml: PairedSP vocabulary

A PairedSP model is saved in a tar.gz file with the following structure:

  • self_data.json: the settings needed to run the tokeniser. Example:

     {
       "vocab_size": 8000,
       "maxlen": 23,
       "pretokenize": true,
       "seed": 0,
       "type": "PairedSPExactCondModel"
     }
  • table.tar.gz: tar.gz file with the co-occurrence table and the indexes (json files).

    • _count: dictionary of dictionaries with the (expected) counts; the first key is the target-language token id, the second key is the source-language token id
    • _span_index: dictionary mapping target tokens (keys) to ids (values)
    • _token_index: dictionary mapping source tokens (keys) to ids (values)

An example tokeniser is included in the repository.
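
Given this layout, a saved model can be inspected without importing paired_sp. The following is a minimal sketch, assuming the member names listed above and that the table contents are stored as JSON; the normalisation at the end only illustrates how the counts relate to conditional probabilities and is not the model's actual estimator.

# Hedged sketch: peek inside a PairedSP archive following the structure above
# (self_data.json plus table.tar.gz with _count, _span_index, _token_index).
import io, json, tarfile

def read_json_member(tar, name):
    # Assumes exactly one member whose name contains `name`
    member = next(m for m in tar.getmembers() if name in m.name)
    return json.load(tar.extractfile(member))

with tarfile.open("tokeniser.tar.gz", "r:gz") as outer:
    settings = read_json_member(outer, "self_data.json")
    inner = next(m for m in outer.getmembers() if "table.tar.gz" in m.name)
    table_bytes = outer.extractfile(inner).read()

print(settings)  # e.g. {"vocab_size": 8000, "maxlen": 23, ...}

with tarfile.open(fileobj=io.BytesIO(table_bytes), mode="r:gz") as table:
    counts = read_json_member(table, "_count")             # target id -> {source id -> count}
    span_index = read_json_member(table, "_span_index")    # target token -> id
    token_index = read_json_member(table, "_token_index")  # source token -> id

# Illustration only: normalising the counts for one source token over all
# target tokens gives something like p(target span | source token).
src_id = str(next(iter(token_index.values())))
column = {trg: row.get(src_id, 0.0) for trg, row in counts.items()}
total = sum(column.values()) or 1.0
p_trg_given_src = {trg: c / total for trg, c in column.items()}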

The LMs and MT models are in the folders lm and mt, which contain a subfolder for each language pair. For the MT task, the structure is as follows:

  • 8000/1000000/align/seed_3/model.npz: ces->ukr model trained with PairedSP (8k vocabulary, 1M training samples), seed 3
  • 8000/1000000/align/seed_3/model.npz.yml: model settings (generated by MarianMT, used during training)
  • 8000/1000000/baseline/seed_3/model.npz.decoder.yml: decoder settings of the baseline model (generated by MarianMT, used for inference)
  • 8000/1000000/baseline/seed_3/model.npz.comet: baseline COMET scores
  • 8000/1000000/align/seed_3/model.npz.scores: SacreBLEU scores
  • 8000/1000000/align/seed_3/model.npz.tok: tokenised predictions on the test set
  • 8000/1000000/align/seed_3/model.npz.detok: detokenised predictions on the test set

The structure is the same for LMs; the main differences are the following files:

  • all_results.json: metrics (validation and testing)
  • model.safetensors: model weights
  • test_examples.txt: prediction examples

downloader

Script to generate the datasets (except for deu-hsb).

intrinsic

Python module to compute the intrinsic scores (see python -m intrinsic --help) and the alignment scores (see python -m intrinsic.eflomal --help).

mt

Script for training the MT models.

lm

Training script for the LM task. It uses some scripts from the mt folder.

pretokenize

Script to compute the word-aligned (and phrase-aligned) datasets.

results

Scripts to plot the results and partially generate the paper's tables.

Snakefiles

Snakemake rules to generate intermediate files.

sp_tools

Scripts for training and using SentencePiece.

Cite

TODO
