Fixes for NaN Handling & Plot Generation Issues; Tips for Large Datasets #19

Bowen0715 · 2024-05-09T14:13:14Z

Commit 1:

fix: Handle NaN values in Spearman correlation by replacing with 0

This commit addresses an issue where the Spearman correlation computation produced NaN values, which caused errors later in the code. NaN arises when an input array is constant, because the correlation is then undefined. Replacing NaN with 0 lets the code handle constant input arrays gracefully.

Changes:

Modified _spearman_dissim function to replace NaN values with 0 in Spearman correlation computation.

This change resolves issue 10 and ensures smoother execution of the program.
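The NaN replacement described in Commit 1 can be sketched roughly as follows. This is a minimal illustration, not the project's `_spearman_dissim`: the helper names `average_ranks` and `spearman_or_zero` are hypothetical, and the sketch uses only the standard library, whereas the actual function is not shown in this thread and may rely on a numerical library instead.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_or_zero(x, y):
    """Spearman correlation (Pearson on ranks), with an undefined result mapped to 0."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    denom = math.sqrt(var_x * var_y)
    # A constant input has zero rank variance, so the correlation is undefined (NaN).
    rho = cov / denom if denom else float("nan")
    # Assumption for this sketch: treat an undefined correlation as 0.
    return 0.0 if math.isnan(rho) else rho
```

With this sketch, `spearman_or_zero([5, 5, 5], [1, 2, 3])` returns 0.0 instead of NaN, while non-constant inputs are unaffected.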

Commit 2:

fix: Exclude gene position plot when no taxonomy file or gene position file is provided

This commit resolves an issue where the gene position plot failed when no taxonomy file or gene position file was provided. The gene position plot is now skipped in that case, so the rest of the plotting works correctly.

Changes:

Set p2 to None when geneposs is None
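The pattern behind this change might look like the sketch below. Only `p2` and `geneposs` come from the commit description; the function name `build_plot_panels`, the `coverage` argument, and the use of plain dicts in place of real figure objects are assumptions made for illustration, as is the step that filters out `None` before layout.

```python
def build_plot_panels(coverage, geneposs=None):
    """Return the panels to render, leaving out the gene position panel if there is no data."""
    # p1 stands in for the main plot; a plain dict replaces the real figure object.
    p1 = {"title": "coverage", "data": list(coverage)}
    # p2 is only built when gene positions exist; otherwise it stays None.
    p2 = None
    if geneposs is not None:
        p2 = {"title": "gene positions", "data": list(geneposs)}
    # Assumption: the layout step should never receive None, so missing panels are dropped.
    return [panel for panel in (p1, p2) if panel is not None]
```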

Commit 3:

fix: Fix Bokeh error when generating the plot

This commit addresses a Bokeh error encountered during plot generation.

Changes:

Updated plot generation to use lists for range objects, addressing the Bokeh error.

Commit 4:

fix: Introduce sam2pmp.py script for parallel execution of the sam2pmp function

This commit introduces sam2pmp.py, a script that runs the sam2pmp function independently, outside the main project context. The sam2pmp function consumes significantly less memory than the icra iterations, which makes it suitable for parallel execution on large datasets. The aim is to reduce processing time by preparing the data before icra runs.

Changes:

Added sam2pmp.py script for standalone execution of the sam2pmp function.

This change improves the efficiency of data processing workflows, particularly for large-scale datasets.

ym2877 · 2024-06-24T19:31:15Z

Commits 1-3 look good to me. Regarding Commit 4, a couple of comments:

Why not just expose the sam2pmp function in __init__.py?
It looks like the script also exposes a CLI, so can we add it to the cli/ folder instead, with a corresponding entry in entry_points within setup.py?


Bowen0715 · 2024-06-27T15:08:27Z

@ym2877 Thank you for your comments!

sam2pmp.py was set up to modularize the steps of icra. The general idea is to use parallel computing to speed up processing, particularly for large datasets.

The sam2pmp function does not support multithreading, but it uses little memory (less than 1 GB) and has a long runtime (1 to 6 hours for a 16 GB fastq file). It is therefore well suited to processing multiple samples concurrently.
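One way to run several single-threaded, low-memory jobs side by side is to launch one child process per sample and let a small thread pool wait on them. The sketch below is an assumption about how that could be arranged, not part of this PR: `run_command` and `run_many` are hypothetical helpers, and the commands are stand-ins that only print a label, since the real sam2pmp.py invocation is documented in LargeDataTips.md rather than in this thread.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_command(cmd):
    """Run one command as a child process and return (cmd, exit code)."""
    completed = subprocess.run(cmd, capture_output=True, text=True)
    return cmd, completed.returncode


def run_many(commands, max_workers=4):
    """Run independent commands concurrently.

    The threads only wait on child processes, so each job still runs
    single-threaded in its own process.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_command, commands))


# Stand-in commands, one per sample; replace each with the real per-sample call.
commands = [[sys.executable, "-c", f"print('sample {i}')"] for i in range(3)]
results = run_many(commands, max_workers=3)
failed = [cmd for cmd, code in results if code != 0]
```

Because each job needs less than 1 GB, `max_workers` would mostly be limited by the number of available CPU cores rather than by memory.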

Regarding Commit 4 (c721631), I acknowledge that the script was hard to understand and use. It depends on pre-existing BAM files generated by Bowtie2, which makes it 'step 2' in the icra pipeline.

To enhance clarity and functionality, I have introduced two new arguments to the icra command: generate_bam and bam_to_pmp, each corresponding to distinct steps of the process. Further explanations have been documented in LargeDataTips.md.
Additionally, I've added bamfol and pmpfol arguments to facilitate separate distribution of BAM and PMP files.
These new arguments do not affect the original pipeline unless explicitly specified.

Thank you for your consideration. I apologize for any confusion caused. Please feel free to adjust the code and documentation I committed as needed.


Bowen0715 added 3 commits May 9, 2024 20:01

Handle NaN values in Spearman correlation by replacing with 0

6e85fa0

Fix gene positions plot error when no taxonomy file is provided

8513abc

Fix bokeh error by making range object a list

69f11d9

Bowen0715 changed the title ~~Handle NaN values in Spearman correlation by replacing with 0~~ Fixes for NaN Handling and Plot Generation Issues Jun 24, 2024

Add sam2pmp.py script for processing large datasets

c721631

Bowen0715 and others added 8 commits June 27, 2024 17:24

Seperate steps of icra

406ae02

Bug fixed

cdd8d67

Bug fixed

acc4c54

Bug fixed

3a2d0ef

Add example codes

e5b9f11

Add example codes

c42b06e

Merge pull request #1 from Bowen0715/test

1da2376

Test

Fix typo

b54464c

Bowen0715 changed the title ~~Fixes for NaN Handling and Plot Generation Issues~~ Fixes for NaN Handling & Plot Generation Issues; Tips for Large Datasets Jun 28, 2024

Bowen0715 and others added 2 commits August 8, 2024 20:39

Fixed bugs

daca4a5

Merge pull request #2 from Bowen0715/test

054278c

Fixed bugs
