Graphtyper copying the same files multiple times if using "--region_file"  #159

@sroener

Hi,

Thank you for writing and maintaining graphtyper.

I noticed that graphtyper copies the same CRAM files multiple times when I use the "--region_file" option. My log contains messages like the following:

[2024-11-22 00:29:36.836] SV genotyping region chr2:1010000-1221700
[2024-11-22 00:29:36.836] Path to genome is 'GRCh38_full_analysis_set_plus_decoy_hla.fa'
[2024-11-22 00:29:36.836] Running with up to 72 threads.
[2024-11-22 00:29:36.836] Copying data from 288 input SAM/BAM/CRAMs to local disk.
[2024-11-22 00:29:36.836] Temporary folder is /tmp/graphtyper_241122_002936_chr2_001010000.iWGl68
[2024-11-22 00:29:36.836] Copying reference genome FASTA and its index to temporary folder.
[2024-11-22 00:29:39.496] Genotype calling step starting.
[2024-11-22 00:29:39.497] Padded region is: chr2:1009000-1422700
[2024-11-22 00:29:39.497] Constructing graph.
[2024-11-22 00:29:39.520] Calculating contig offsets.
[2024-11-22 00:30:47.770] Finished calling. Thread work: 5/2/4/3/3/3/4/2/3/4/3/3/3/3/2/3/4/4/4/3/3/3/2/3/3/3/2/3/4/4/2/2/3/3/4/4/3/4/3/2/3/2/4/4/3/3/3/2/3/4/4/2/2/3/3/4/4/3/2/2/3/2/3/2/2/3/3/2/3/3/3/2
[2024-11-22 00:30:47.770] Merging output VCFs.
[2024-11-22 00:30:49.878] Cleaning up temporary files.
[2024-11-22 00:30:50.219] Finished! Output written at: batch1/chr2/001010000-001221700.vcf.gz

[2024-11-22 00:30:50.219] SV genotyping region chr2:1223900-1594700
[2024-11-22 00:30:50.219] Path to genome is 'GRCh38_full_analysis_set_plus_decoy_hla.fa'
[2024-11-22 00:30:50.219] Running with up to 72 threads.
[2024-11-22 00:30:50.219] Copying data from 288 input SAM/BAM/CRAMs to local disk.
[2024-11-22 00:30:50.219] Temporary folder is /tmp/graphtyper_241122_003050_chr2_001223900.wcbtZp
[2024-11-22 00:30:50.219] Copying reference genome FASTA and its index to temporary folder.
[2024-11-22 00:30:52.815] Genotype calling step starting.
[2024-11-22 00:30:52.815] Padded region is: chr2:1222900-1795700
[2024-11-22 00:30:52.815] Constructing graph.
[2024-11-22 00:30:52.853] Calculating contig offsets.
[2024-11-22 00:32:04.971] Finished calling. Thread work: 4/3/3/3/3/4/3/3/3/3/4/3/4/3/3/4/3/3/3/3/2/3/3/4/3/4/3/3/3/3/3/3/3/2/3/3/4/4/3/3/3/3/3/3/4/3/3/3/4/4/3/2/3/3/3/3/2/3/3/3/3/3/3/2/2/2/2/2/2/3/2/2
[2024-11-22 00:32:04.972] Merging output VCFs.
[2024-11-22 00:32:10.079] Cleaning up temporary files.

I interpret the logs as follows: graphtyper iterates over the regions in the region file and repeats the same steps for each one. These steps include copying the input CRAMs and the reference genome to a new temporary directory and deleting that directory after calling. This creates a lot of I/O overhead that seems unnecessary to me, since the CRAM files and the reference genome do not change between regions.
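To put a rough number on the per-region overhead, the timestamps in the log can be compared directly. The following is a minimal Python sketch, not part of graphtyper: the helper names `parse` and `phase_seconds` are made up for illustration, and the embedded log is an abbreviated copy of the first region above. For that region it gives about 2.7 s between the start and "Genotype calling step starting", and about 0.3 s for cleanup, out of about 73 s in total. These numbers only bound the time between log messages; if graphtyper reads the CRAM data lazily during the calling step, the real copying cost could be larger.

```python
from datetime import datetime

# Abbreviated copy of the first region's log from this issue
# (the long "Thread work" suffix is omitted).
LOG = """\
[2024-11-22 00:29:36.836] SV genotyping region chr2:1010000-1221700
[2024-11-22 00:29:36.836] Copying reference genome FASTA and its index to temporary folder.
[2024-11-22 00:29:39.496] Genotype calling step starting.
[2024-11-22 00:30:47.770] Finished calling.
[2024-11-22 00:30:49.878] Cleaning up temporary files.
[2024-11-22 00:30:50.219] Finished! Output written at: batch1/chr2/001010000-001221700.vcf.gz
"""


def parse(line):
    """Split '[timestamp] message' into (datetime, message)."""
    stamp, _, message = line.partition("] ")
    return datetime.strptime(stamp.lstrip("["), "%Y-%m-%d %H:%M:%S.%f"), message


def phase_seconds(log_text):
    """Return seconds spent between selected log messages for one region."""
    events = [parse(line) for line in log_text.splitlines() if line.startswith("[")]

    def first(prefix):
        # Timestamp of the first message starting with the given prefix.
        return next(t for t, m in events if m.startswith(prefix))

    start = first("SV genotyping region")
    calling = first("Genotype calling step starting")
    finished_calling = first("Finished calling")
    cleanup = first("Cleaning up temporary files")
    done = first("Finished!")
    return {
        "setup": (calling - start).total_seconds(),
        "calling": (finished_calling - calling).total_seconds(),
        "cleanup": (done - cleanup).total_seconds(),
        "total": (done - start).total_seconds(),
    }


if __name__ == "__main__":
    for phase, seconds in phase_seconds(LOG).items():
        print(f"{phase}: {seconds:.3f} s")
```

Running the same function over each region block of a full log would show how much of the wall time is spent outside the calling step.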

My main question is: does graphtyper internally copy the whole CRAM file, or only the part that covers the current region? In the latter case, would it be possible to load the data for all regions first and then start processing?

My suggestion would be to move the copying and cleanup of the temporary directory outside of the "loop". This could save a lot of I/O and would probably make calling multiple regions faster.
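The control flow I have in mind looks roughly like the sketch below. It is written in Python purely for illustration and does not reflect graphtyper's actual C++ code: `genotype_regions` and `call_region` are hypothetical names, and the sketch only applies to data that is identical for every region (for example the reference FASTA). If graphtyper copies only a region-specific slice of each CRAM, that part would still have to stay inside the loop.

```python
import shutil
import tempfile
from pathlib import Path


def genotype_regions(inputs, regions, call_region):
    """Copy region-independent inputs once, call every region, clean up once.

    `call_region(region, local_paths)` is a hypothetical stand-in for the
    per-region genotyping step.
    """
    tmp = Path(tempfile.mkdtemp(prefix="shared_inputs_"))
    try:
        # One copy per input file, shared by all regions.
        # Assumes the input file names are unique.
        local = [Path(shutil.copy(src, tmp)) for src in inputs]
        return [call_region(region, local) for region in regions]
    finally:
        # Single cleanup after the last region, even if a call fails.
        shutil.rmtree(tmp, ignore_errors=True)
```

With this structure, the number of file copies depends only on the number of inputs and no longer on the number of regions in the region file.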

I assume the changes would have to be done in genotype_sv.cpp.

Please let me know if my suggestions are feasible.
