@Bowen0715 Bowen0715 commented May 9, 2024

Commit 1:

fix: Handle NaN values in Spearman correlation by replacing with 0

This commit addresses NaN values produced during Spearman correlation computation, which led to errors later in the code. Spearman correlation is undefined for a constant input array, so the result is NaN; replacing NaN with 0 lets the code handle constant inputs gracefully instead of failing.

Changes:

  • Modified _spearman_dissim function to replace NaN values with 0 in Spearman correlation computation.

This change resolves issue 10, so the program no longer fails on NaN correlations.
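To illustrate the idea behind Commit 1, here is a minimal standard-library sketch. It is not the project's `_spearman_dissim` code: the helper names `average_ranks`, `spearman`, and `spearman_nan_as_zero` are hypothetical, and the real function may rely on a numerical library instead. The sketch only shows why a constant input yields NaN (zero rank variance) and how that NaN can be mapped to 0.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation; NaN when either input is constant."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        # A constant array has zero rank variance, so the ratio is undefined.
        return float("nan")
    return cov / math.sqrt(vx * vy)


def spearman_nan_as_zero(x, y):
    """Same as spearman(), but report an undefined correlation as 0."""
    rho = spearman(x, y)
    return 0.0 if math.isnan(rho) else rho
```

Treating an undefined correlation as 0 is a convention ("no measurable association"), not a mathematically derived value, so it is worth keeping in mind when interpreting results for constant inputs.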

Commit 2:

fix: Exclude gene position plot when no taxonomy file or gene position file is provided

This commit resolves an issue where the gene position plot failed when no taxonomy file or gene position file was provided. The gene position plot is now skipped in that case, so the rest of the plotting works correctly.

Changes:

  • Set p2 to None when geneposs is None
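The pattern in Commit 2 can be sketched roughly as follows. Only the names `p2` and `geneposs` come from the commit message; `build_panels`, `p1`, and the dictionary stand-ins for figures are hypothetical and exist only to keep the example self-contained.

```python
def build_panels(coverage, geneposs=None):
    """Return the panels to render, omitting the gene position panel when absent."""
    # Stand-in for the main figure; the real code would build a plot object.
    p1 = {"title": "coverage", "data": coverage}

    # Only build the gene position panel when gene positions were supplied.
    p2 = None
    if geneposs is not None:
        p2 = {"title": "gene positions", "data": geneposs}

    # Drop missing panels so the layout step never receives None.
    return [p for p in (p1, p2) if p is not None]
```

Filtering out `None` at a single point keeps the later layout code free of repeated "is the optional panel present?" checks.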

Commit 3:

fix: Fix Bokeh error when generating the plot

This commit addresses a Bokeh error encountered during plot generation.

Changes:

  • Updated plot generation to use lists for range objects, addressing the Bokeh error.
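The PR does not show the Bokeh error or the affected call, so the following is only a guess at the shape of the fix: it assumes Python `range` objects were passed where plain lists were expected. The helper `ranges_to_lists` is hypothetical.

```python
def ranges_to_lists(data):
    """Return a copy of a column dict with every range value turned into a list."""
    return {
        key: list(value) if isinstance(value, range) else value
        for key, value in data.items()
    }
```

A dict prepared this way contains only plain lists for its range-valued columns, which is the form the commit message describes passing to the plotting code.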

Commit 4:

fix: Introduce sam2pmp.py script for parallel execution of the sam2pmp function

This commit introduces sam2pmp.py, a script dedicated to independently executing the sam2pmp function outside the main project context. The sam2pmp function consumes significantly less memory compared to the icra iterations, making it suitable for parallel execution on large datasets. This approach aims to optimize processing time by preparing data before icra execution.

Changes:

  • Added sam2pmp.py script for standalone execution of the sam2pmp function.

This improves the efficiency of data processing workflows, particularly for large-scale datasets.

@Bowen0715 Bowen0715 changed the title from "Handle NaN values in Spearman correlation by replacing with 0" to "Fixes for NaN Handling and Plot Generation Issues" on Jun 24, 2024
@ym2877
Contributor

ym2877 commented Jun 24, 2024

Commits 1-3 look good to me. Regarding Commit 4, a couple of comments:

  1. Why not just expose the sam2pmp function in __init__.py?
  2. Looks like the script also exposes a CLI, so can we add this to the cli/ folder instead? And add a corresponding entry to entry_points within setup.py
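For reference, the second suggestion could look roughly like the fragment below. The package name `mypackage` and the module path `mypackage.cli.sam2pmp:main` are placeholders, since the project's actual layout is not shown in this thread.

```python
# setup.py (fragment). "mypackage" and the module path are placeholders.
ENTRY_POINTS = {
    "console_scripts": [
        # Format: "<command name> = <package>.<module>:<function>"
        "sam2pmp = mypackage.cli.sam2pmp:main",
    ],
}

# In the real setup.py this dict would be passed to setuptools:
#   setup(..., entry_points=ENTRY_POINTS)
```

With an entry like this, installing the package would create a `sam2pmp` command that calls the named `main` function.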

@Bowen0715
Author

Bowen0715 commented Jun 27, 2024

@ym2877 Thank you for your comments!

The sam2pmp.py has been set up to modularize the steps of icra. The general idea is to leverage parallel computing to accelerate processing, particularly with large datasets.

The sam2pmp function does not support multithreading, but it consumes little memory (less than 1 GB) and has a long runtime (1 to 6 hours for a 16 GB fastq file). It is therefore well suited to processing multiple samples concurrently, one process per sample.
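As a rough sketch of that idea, one single-threaded job per sample can be spread over a process pool. `convert_many` and its `worker` argument are hypothetical; the worker stands in for whatever callable wraps the per-sample sam2pmp step, and the `executor_cls` parameter exists only so the sketch can be tried with threads.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor, as_completed


def convert_many(bam_paths, worker, max_workers=4, executor_cls=ProcessPoolExecutor):
    """Run worker(path) for every path concurrently and map each path to its result.

    Each job is single-threaded, so max_workers is effectively the number of
    samples processed at once. With under 1 GB per job, the CPU core count is
    usually the tighter limit than memory.
    """
    results = {}
    with executor_cls(max_workers=max_workers) as pool:
        futures = {pool.submit(worker, path): path for path in bam_paths}
        for future in as_completed(futures):
            # result() re-raises any exception from the worker for that sample.
            results[futures[future]] = future.result()
    return results
```

With the default process pool, the worker has to be a picklable top-level function, and on platforms that spawn new processes the call should sit under an `if __name__ == "__main__":` guard.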

Regarding Commit 4 (c721631), I acknowledge that the script is hard to understand and to use as submitted. It depends on pre-existing BAM files generated by Bowtie2, which makes it 'step 2' in the icra pipeline.

To enhance clarity and functionality, I have introduced two new arguments to the icra command: generate_bam and bam_to_pmp, each corresponding to distinct steps of the process. Further explanations have been documented in LargeDataTips.md.
Additionally, I've added bamfol and pmpfol arguments to facilitate separate distribution of BAM and PMP files.
These new arguments do not affect the original pipeline unless explicitly specified.
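A minimal sketch of how such opt-in arguments might be wired is shown below. Only the names generate_bam, bam_to_pmp, bamfol, and pmpfol come from this comment; the flag spelling, the types, and the `selected_steps` helper are assumptions, and LargeDataTips.md remains the authoritative description.

```python
import argparse


def build_parser():
    """Hypothetical parser exposing the four new arguments as optional flags."""
    parser = argparse.ArgumentParser(prog="icra")
    parser.add_argument("--generate_bam", action="store_true",
                        help="only run the step that produces BAM files")
    parser.add_argument("--bam_to_pmp", action="store_true",
                        help="only convert existing BAM files to PMP files")
    parser.add_argument("--bamfol", default=None,
                        help="folder for BAM files")
    parser.add_argument("--pmpfol", default=None,
                        help="folder for PMP files")
    return parser


def selected_steps(args):
    """Return the steps to run; the full pipeline when no step flag is given."""
    if not (args.generate_bam or args.bam_to_pmp):
        return ["full_pipeline"]
    steps = []
    if args.generate_bam:
        steps.append("generate_bam")
    if args.bam_to_pmp:
        steps.append("bam_to_pmp")
    return steps
```

Because every flag defaults to off or `None`, a call without them falls through to the original pipeline, which matches the statement that the new arguments change nothing unless explicitly specified.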

Thank you for your consideration. I apologize for any confusion caused. Please feel free to adjust the code and documentation I committed as needed.

@Bowen0715 Bowen0715 changed the title from "Fixes for NaN Handling and Plot Generation Issues" to "Fixes for NaN Handling & Plot Generation Issues; Tips for Large Datasets" on Jun 28, 2024

Development

Successfully merging this pull request may close these issues.

I got an error in createdb.py