This repository contains the SNP and gene datasets of M. tuberculosis (MTB) for drug resistance prediction. Here is a brief description of each file:
- AllLabels.csv contains the susceptibility/resistance status (susceptible: 0, resistant: 1) of each sample isolate for 12 different drugs.
- SNPList.csv contains the list of all loci on the MTB genome where a mutation was detected using the variant calling tools, based on the reference genome provided here.
- SNP_data_part*.zip contains CSV files with the binary SNPs. The CSV files are concatenated using the loading_data package (refer to this repo).
- gene_data.csv.zip contains a CSV file that summarizes the SNPs by the gene they fall into, forming a matrix with a single feature for each gene of each sample isolate.
- iso_list.csv contains a list of all isolate IDs used in the training data.
- sparsetableFeb27.npz contains the binary SNP data in npz format for ease of use.
To learn how to load and use this data, please visit the LRCN-drug-resistance repository, especially the loading_data section.
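As a rough, dependency-free illustration of working with the label file (this is not the loading_data code from that repository), the sketch below counts susceptible, resistant, and missing labels per drug. It assumes a layout that is not confirmed here: a header row, the isolate identifier in the first column, and one column per drug holding 0, 1, or an empty value. The function name `count_labels` and the sample column names are hypothetical.

```python
import csv
import io
from collections import Counter


def count_labels(csv_file):
    """Count susceptible (0), resistant (1), and missing labels per drug.

    Assumption (not confirmed by the dataset description): the first column
    identifies the isolate and every remaining column is one drug.
    """
    reader = csv.reader(csv_file)
    header = next(reader)
    drugs = header[1:]
    counts = {drug: Counter() for drug in drugs}
    for row in reader:
        for drug, value in zip(drugs, row[1:]):
            value = value.strip()
            if value in ("0", "0.0"):
                counts[drug]["susceptible"] += 1
            elif value in ("1", "1.0"):
                counts[drug]["resistant"] += 1
            else:
                # Anything else (e.g. an empty cell) is treated as missing.
                counts[drug]["missing"] += 1
    return counts


# Tiny made-up sample standing in for AllLabels.csv (not real data).
sample = io.StringIO(
    "isolate,drug_a,drug_b\n"
    "iso1,1,0\n"
    "iso2,0,\n"
    "iso3,1,1\n"
)
summary = count_labels(sample)
for drug, drug_counts in summary.items():
    print(drug, dict(drug_counts))
```

To try this on the real file, open it with `open("AllLabels.csv", newline="")` and pass the file object to `count_labels`; adjust the column handling if the actual layout differs from the assumption above.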
If you found the content of this repository useful, please cite us:
https://dl.acm.org/doi/abs/10.1145/3459930.3469534