Conversation

@nbedanova
Contributor

Added two functions:
  • downsample_counts_multinomial produces a dataset in which the counts for each cell are reduced by a specified percentage. It uses multinomial sampling to construct the downsampled dataset.
  • fms_percent_drop_counts_multinomial downsamples the data, normalizes it, and then factorizes it. The resulting factorization is compared to the baseline pf2 factorization of the full counts. The function assumes that low-expressing genes have already been filtered out, and it sets geneThreshold to 0 when calling prepare_dataset so that the full-count and downsampled datasets contain the same number of genes.
The figure CountFMS plots this comparison using our cytokine dataset.
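
As a rough illustration of the multinomial downsampling described above, the sketch below thins the count vector of a single cell. The helper name `downsample_cell` and the `percent_drop` parameter are hypothetical and are not taken from this PR; the actual `downsample_counts_multinomial` may differ in its signature and details.

```python
import numpy as np


def downsample_cell(counts, percent_drop, rng):
    """Hypothetical helper: thin one cell's counts via multinomial sampling.

    percent_drop is the fraction of the cell's total counts to remove (0 to 1).
    """
    counts = np.asarray(counts)
    total = int(counts.sum())
    if total == 0:
        return counts.copy()
    # Number of counts to keep after dropping the requested fraction.
    target = int(round(total * (1.0 - percent_drop)))
    # Each gene is drawn with probability proportional to its original count.
    probs = counts / total
    return rng.multinomial(target, probs).astype(counts.dtype)


rng = np.random.default_rng(0)
cell = np.array([10, 0, 5, 25, 0, 60])
thinned = downsample_cell(cell, 0.5, rng)  # total drops from 100 to 50
```

Genes with zero counts stay at zero because their sampling probability is zero. Multinomial draws are made with replacement, so only the cell total is fixed; an individual gene's count can in principle exceed its original value.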

@nbedanova nbedanova requested a review from Copilot August 6, 2025 21:24
Contributor

Copilot AI left a comment

Pull Request Overview

This PR adds functionality to analyze factorization performance under reduced count conditions by implementing multinomial downsampling and FMS comparison metrics.

  • Implements multinomial count downsampling to simulate sequencing depth variations
  • Creates FMS comparison framework between full and downsampled datasets
  • Adds visualization capabilities for count drop analysis
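
Assuming FMS here refers to the factor match score, one common way to compare two factorizations of equal rank is sketched below. The function name `factor_match_score` is illustrative, component weights are ignored, and the comparison in this PR may treat individual modes differently.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def factor_match_score(factors_a, factors_b):
    """Sketch of an FMS between two factorizations with the same rank.

    factors_a and factors_b are lists of (dimension x rank) matrices,
    one per mode. Component weights are ignored in this simplified version.
    """
    rank = factors_a[0].shape[1]
    similarity = np.ones((rank, rank))
    for a, b in zip(factors_a, factors_b):
        a_unit = a / np.linalg.norm(a, axis=0)
        b_unit = b / np.linalg.norm(b, axis=0)
        # Absolute cosine similarity between every pair of components.
        similarity *= np.abs(a_unit.T @ b_unit)
    # Match components one-to-one so the total similarity is maximal.
    rows, cols = linear_sum_assignment(similarity, maximize=True)
    return float(similarity[rows, cols].mean())
```

Because columns are normalized and matched by optimal assignment, the score in this sketch is unaffected by component order, scaling, or sign flips, which is why a value near 1 indicates that the downsampled factors recover the baseline.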

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| pf2rnaseq/factorization.py | Adds core functions for multinomial downsampling and FMS analysis with count reduction |
| pf2rnaseq/figures/commonFuncs/plotGeneral.py | Implements plotting function for FMS vs count drop percentage visualization |
| pf2rnaseq/figures/figureCountFMS.py | Creates figure demonstrating FMS analysis on cytokine dataset with various count drop percentages |

@nbedanova nbedanova requested a review from aarmey August 6, 2025 21:27
Member

@aarmey aarmey left a comment

Good stuff. Just a couple of minor comments.

return sampled_data


def fms_percent_drop_counts_multinomial(
Member

I'm not generally a fan of having functions like this that combine functionality and loops. Make a function that does the processing for one situation, then you can put the loops elsewhere.
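
One possible shape for the split suggested here is sketched below: a function that handles a single downsampling level, and a separate sweep that owns the loops. All names (`fms_one_condition`, `sweep_conditions`, and the `downsample`, `factorize`, and `score` callables) are hypothetical stand-ins rather than functions from this PR.

```python
import numpy as np


def fms_one_condition(counts, percent_drop, baseline, downsample, factorize, score):
    """Process exactly one downsampling level and return a single score."""
    sampled = downsample(counts, percent_drop)
    return score(baseline, factorize(sampled))


def sweep_conditions(counts, percent_list, runs, downsample, factorize, score):
    """Own the loops; delegate all per-condition work to fms_one_condition."""
    # The baseline factorization is computed once and reused for every condition.
    baseline = factorize(counts)
    results = np.zeros((runs, len(percent_list)))
    for j in range(runs):
        for i, percent in enumerate(percent_list):
            results[j, i] = fms_one_condition(
                counts, percent, baseline, downsample, factorize, score
            )
    return results
```

With this layout the single-condition function can be tested or called on its own, for example at one downsampling level, and the sweep can be moved into the figure code or dropped entirely.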

results = np.zeros((runs, len(percentList)))

# Main loop
for j in range(runs):
Member

I'm not sure that you need to sweep across lots of runs. The result should be obvious at 0.5 downsampling, or there isn't a meaningful difference.

data[start_idx:end_idx] = new_counts.astype(cell_data.dtype)

# Create new sparse matrix
sampled_csr = sp.csr_matrix((data, indices, indptr), shape=original_csr.shape)
Member

Ohh preserving the sparsity is clever.
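
For readers without the full diff, the idea can be sketched as follows: only the stored values of the CSR matrix are resampled, and the original `indices` and `indptr` arrays are reused. The function name `downsample_csr` and the `percent_drop` parameter are assumptions made for this illustration; only the last two statements mirror the lines quoted above.

```python
import numpy as np
import scipy.sparse as sp


def downsample_csr(original_csr, percent_drop, rng):
    """Sketch: multinomially thin each row (cell) of a CSR count matrix."""
    original_csr = sp.csr_matrix(original_csr)
    data = original_csr.data.copy()
    indices = original_csr.indices.copy()
    indptr = original_csr.indptr.copy()
    for row in range(original_csr.shape[0]):
        start_idx, end_idx = indptr[row], indptr[row + 1]
        cell_data = data[start_idx:end_idx]
        total = int(cell_data.sum())
        if total == 0:
            continue  # empty cell: nothing to resample
        target = int(round(total * (1.0 - percent_drop)))
        # Sample only over the stored nonzero entries of this cell.
        new_counts = rng.multinomial(target, cell_data / total)
        data[start_idx:end_idx] = new_counts.astype(cell_data.dtype)
    # Reuse the original index arrays; only the stored values change.
    return sp.csr_matrix((data, indices, indptr), shape=original_csr.shape)
```

Entries that are thinned to zero remain as explicitly stored zeros in this sketch; calling `eliminate_zeros()` on the result would drop them if that matters downstream.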

3 participants