mtb-mixed-infection-pipeline is a reproducible bioinformatics workflow for detecting mixed Mycobacterium tuberculosis infections from whole-genome sequencing (WGS) data. It integrates Snippy, FreeBayes (pooled-discrete mode), and a patched version of MixInfect2.R to take raw sequencing reads through alignment, joint variant calling, and statistical analysis. The pipeline outputs a multi-sample VCF file carrying the genotype information needed to identify mixed-strain infections.
- Organism: Mycobacterium tuberculosis H37Rv
- GenBank accession: AL123456.3
conda env create -f snippy_env.yaml
conda activate snippy-env

Format: SampleID<TAB>Forward_Read<TAB>Reverse_Read
TN106985 ~/TN106985_1.fastq.gz ~/TN106985_2.fastq.gz
TN106727 ~/TN106727_1.fastq.gz ~/TN106727_2.fastq.gz
TN106925 ~/TN106925_1.fastq.gz ~/TN106925_2.fastq.gz
TN106439 ~/TN106439_1.fastq.gz ~/TN106439_2.fastq.gz
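Before launching Snippy, it can help to confirm that the sample list really follows the three-column, tab-separated format. The following is a minimal sketch and not part of the pipeline; the function name `check_sample_list` is hypothetical, and it checks only the column count, not whether the read files exist.

```shell
# Hypothetical helper (not shipped with this pipeline): verifies that every
# non-empty line of a sample list has exactly three tab-separated fields.
check_sample_list() {
    awk -F'\t' '
        NF == 0 { next }   # ignore blank lines
        NF != 3 {
            printf "line %d: expected 3 tab-separated fields, found %d\n", NR, NF
            bad = 1
        }
        END { exit bad + 0 }   # non-zero exit status if any line was malformed
    ' "$1"
}

# Example (commented out because it needs a real file):
# check_sample_list ~/sample.list.txt && echo "sample list looks OK"
```

A line that was typed with spaces instead of tabs is reported with its line number, which is a common cause of a sample being silently skipped or misparsed.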
./run_snippy_mtb.sh -i ~/sample.list.txt -r ~/AL123456_MTB_H37Rv.fasta -t 8
conda install -c bioconda freebayes
Running Snippy on multiple samples produces a samtools-indexed BAM file (**snps.bam**) for each sample. These BAM files are the input to FreeBayes:
freebayes -f AL123456_MTB_H37Rv.fasta \
-b TN106985.snps.bam -b TN106727.snps.bam -b TN106925.snps.bam -b TN106439.snps.bam \
--pooled-discrete \
--use-best-n-alleles 4 \
--genotype-qualities \
--min-alternate-fraction 0.01 \
--min-alternate-count 2 \
> multisample.vcf

samtools faidx AL123456_MTB_H37Rv.fasta
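With more than a handful of samples, typing each `-b` argument by hand becomes error-prone. The sketch below derives the `-b` list from the sample list instead. It assumes the BAM files sit in the current directory and are named `<SampleID>.snps.bam`, as in the FreeBayes command above; the function name `bam_args` is hypothetical.

```shell
# Hypothetical helper: prints "-b <SampleID>.snps.bam" for every sample in a
# tab-separated sample list, all on one line.
# Assumption: BAMs are named <SampleID>.snps.bam and IDs contain no whitespace.
bam_args() {
    awk -F'\t' '
        $1 != "" { printf "%s-b %s.snps.bam", sep, $1; sep = " " }
        END      { print "" }
    ' "$1"
}

# Example (commented out because it needs freebayes and the BAM files).
# The command substitution is left unquoted on purpose so that the shell
# splits it into separate arguments:
# freebayes -f AL123456_MTB_H37Rv.fasta $(bam_args ~/sample.list.txt) \
#     --pooled-discrete --use-best-n-alleles 4 --genotype-qualities \
#     --min-alternate-fraction 0.01 --min-alternate-count 2 > multisample.vcf
```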
conda env create -f r-mixinfect.yaml
conda activate r-mixinfect
conda install -c conda-forge r-data.table r-optparse r-ggplot2 r-mclust "r-base>=4.0" "icu=73.2"
This script analyzes multi-strain Mycobacterium tuberculosis infections from VCF files using variant clustering and SNP-based heuristics.
Rscript ~/MixInfect2/MixInfect2.R \
--VCFfile ~/multisample.vcf \
--prefix output \
--maskFile ~/MixInfect2/MaskedRegions.csv \
--minQual 10 \
--useFilter FALSE

- output_MixSampleSummary.csv: Summary of sample classifications.
- output_BICvalues.csv: BIC scores and inferred strain counts.
Loading required package: mclust
Package 'mclust' version 6.1
Type 'citation("mclust")' for citing this R package in publications.
Loading required package: stringr
Loading required package: optparse
Loading required package: foreach
Loading required package: doParallel
Loading required package: iterators
Loading required package: parallel
Total variants read from VCF: 590,051
After removing indels: 576,927
After applying QUAL > 10: 1,121
After removing spanning variants (*): 1,121
After masking regions: 884
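The counts in the log correspond to successive filters applied to the VCF. To sanity-check a run independently, the first few steps can be approximated with awk. This is only an illustrative sketch and is not MixInfect2's own code: the function name `count_vcf_filters` is hypothetical, the definition of an indel (REF or any ALT allele other than `*` longer than one base) and of a spanning variant (`*` present in ALT) are assumptions, and the masking step is not reproduced, so the numbers may differ from the script's.

```shell
# Hypothetical helper: counts VCF records surviving each filter step.
# Usage: count_vcf_filters <file.vcf> <minQual>
count_vcf_filters() {
    awk -F'\t' -v minq="$2" '
        /^#/ { next }                       # skip header lines
        {
            total++
            # Assumption: indel = REF or any non-"*" ALT allele is not 1 base long
            n = split($5, alt, ",")
            indel = (length($4) != 1)
            for (i = 1; i <= n; i++)
                if (alt[i] != "*" && length(alt[i]) != 1) indel = 1
            if (indel) next
            snps++
            if ($6 == "." || $6 + 0 <= minq) next   # keep QUAL > minQual only
            qual++
            if ($5 ~ /\*/) next                     # drop spanning-deletion alleles
            nospan++
        }
        END {
            printf "total=%d snps=%d qual=%d no_spanning=%d\n", total, snps, qual, nospan
        }
    ' "$1"
}

# Example (commented out because it needs the real VCF):
# count_vcf_filters ~/multisample.vcf 10
```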
| SampleName | Mix.Non.mix | hSNPs | Total.SNPs | % hSNPs_totalSNPs | No.strains | Major.strain.prop |
|---|---|---|---|---|---|---|
| TN106985 | Mix | 459 | 517 | 88.78% | 2 | 0.78 |
| TN106727 | Mix | 477 | 505 | 94.46% | 2 | 0.71 |
| TN106925 | Non-mix | 6 | 149 | 4.03% | 1 | NA |
| TN106439 | Non-mix | 9 | 252 | 3.57% | 1 | NA |
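The percentage column in the table is simply hSNPs divided by Total.SNPs. As a quick check of the arithmetic, here is a small sketch; it assumes hSNPs denotes heterozygous SNP calls, and the function name `hsnp_pct` is hypothetical.

```shell
# Hypothetical helper: percentage of hSNPs among all SNPs for one sample.
# Arguments: <hSNPs> <Total.SNPs> (Total.SNPs must be non-zero).
hsnp_pct() {
    awk -v h="$1" -v t="$2" 'BEGIN { printf "%.2f%%\n", 100 * h / t }'
}

hsnp_pct 459 517   # TN106985, prints 88.78%
hsnp_pct 6 149     # TN106925, prints 4.03%
```

In the example run, the two samples classified as mixed have most of their SNPs called as hSNPs, while the non-mixed samples stay below 5%.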
This project is licensed under the MIT License.