cRowflow

Overview

cRowflow is an R package designed for assessing clustering stability through repeated stochastic clustering. It is compatible with any clustering algorithm that outputs labels or returns results containing cluster assignments. By running clustering multiple times with different seeds, cRowflow quantifies clustering consistency using Element-Centric Similarity (ECS) and Element-Centric Consistency (ECC), offering insights into the robustness and reproducibility of cluster assignments. The package enables users to optimize feature subsets, fine-tune clustering parameters, and evaluate clustering robustness against perturbations.

cRowflow generalizes the ClustAssess package, which focuses on parameter selection for community-detection clustering in single-cell analysis. It extends this approach to any clustering task, enabling a data-driven identification of robust and reproducible clustering solutions across diverse applications.

Features

1. Stochastic Clustering Runner

Runs a stochastic clustering algorithm multiple times with different random seeds and evaluates the stability of results using ECC. It identifies, with element-wise precision, the stability of clustering results and provides majority voting labels.

Important Note: R's kmeans() returns a list that stores the cluster labels in its "cluster" element, so set labels_name = "cluster" to extract them correctly. If the clustering algorithm returns the labels directly, labels_name can be left as NULL. kmeans() also has no default number of clusters, so the value must be supplied explicitly (e.g., centers = 5).
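To make the role of labels_name concrete, the base-R sketch below shows where kmeans() keeps its labels and how a thin wrapper can return them directly. It does not call cRowflow; the wrapper name kmeans_labels is hypothetical and only illustrates the "returns labels directly" case.

```r
set.seed(1)
x <- matrix(rnorm(100 * 4), ncol = 4)

# kmeans() returns a list; the labels live in the "cluster" element,
# which is the element that labels_name = "cluster" refers to
fit <- kmeans(x, centers = 3)
labels <- fit[["cluster"]]

# Hypothetical wrapper that returns the label vector itself,
# the situation in which labels_name could stay NULL
kmeans_labels <- function(data, ...) {
  kmeans(data, ...)$cluster
}
direct_labels <- kmeans_labels(x, centers = 3)
```

Both labels and direct_labels are integer vectors with one entry per row of x.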

Function: `stochastic_clustering_runner()`

set.seed(42)
my_data <- matrix(rnorm(200 * 10), ncol = 10)

# Run stochastic clustering
result <- stochastic_clustering_runner(
  data = my_data,
  clustering_algo = kmeans,
  labels_name = "cluster",
  n_runs = 30,
  centers = 5
)

Returns:

partitions: List of clustering assignments for each run
majority_voting_labels: Consensus labels from majority voting
ecc: ECC scores indicating clustering stability
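The idea of repeated seeded runs can be sketched in base R without the package. The example below is only an illustration under stated assumptions: it swaps in the plain Rand index as a simple agreement measure (it is not ECC), and the helper rand_index is hypothetical rather than part of cRowflow.

```r
set.seed(42)
my_data <- matrix(rnorm(200 * 10), ncol = 10)

# One kmeans run per seed; each list element is an integer label vector
seeds <- 1:10
partitions <- lapply(seeds, function(s) {
  set.seed(s)
  kmeans(my_data, centers = 5)$cluster
})

# Rand index: share of point pairs on which two partitions agree
# (both together or both apart); invariant to label renaming
rand_index <- function(a, b) {
  same_a <- outer(a, a, "==")
  same_b <- outer(b, b, "==")
  pairs <- upper.tri(same_a)
  mean(same_a[pairs] == same_b[pairs])
}

# Agreement of every run with the first run
agreement <- vapply(partitions, rand_index, numeric(1), b = partitions[[1]])
```

Values of agreement close to 1 indicate runs that group the points much like the first run; lower values point to seed-dependent results.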

2. Genetic Algorithm Feature Selector

Uses a genetic algorithm to iteratively optimize feature selection for clustering stability. It repeatedly applies stochastic clustering with different feature subsets and evaluates stability using ECC. The algorithm evolves through selection, crossover, and mutation, converging on the feature set that maximizes clustering robustness.

Function: `genetic_algorithm_feature_selector()`

set.seed(42)
my_data <- matrix(rnorm(200 * 10), ncol = 10)

# Run genetic algorithm feature selection
result <- genetic_algorithm_feature_selector(
  data = my_data,
  clustering_algo = kmeans,
  labels_name = "cluster",
  n_runs = 30,
  population_size = 20,
  generations = 50,
  centers = 5
)

Returns:

best_features: Optimal feature subset
best_ecc: Highest median ECC achieved
history: Evolution of fitness across generations
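The crossover and mutation steps mentioned above can be illustrated on logical feature masks. This is a minimal base-R sketch, not the package's implementation: the helper names random_mask, crossover, and mutate, the one-point crossover, and the 0.1 mutation rate are all assumptions chosen for illustration.

```r
set.seed(42)
n_features <- 10

# A candidate solution is a logical mask over the feature columns
random_mask <- function(n) {
  mask <- sample(c(TRUE, FALSE), n, replace = TRUE)
  if (!any(mask)) mask[sample(n, 1)] <- TRUE  # keep at least one feature
  mask
}

# One-point crossover: head of one parent, tail of the other
crossover <- function(p1, p2) {
  cut <- sample(seq_len(length(p1) - 1), 1)
  c(p1[seq_len(cut)], p2[(cut + 1):length(p2)])
}

# Bit-flip mutation with a small per-feature probability
mutate <- function(mask, rate = 0.1) {
  flip <- runif(length(mask)) < rate
  mask[flip] <- !mask[flip]
  if (!any(mask)) mask[sample(length(mask), 1)] <- TRUE
  mask
}

parent1 <- random_mask(n_features)
parent2 <- random_mask(n_features)
child <- mutate(crossover(parent1, parent2))

# Column indices that this child would pass on to the clustering step
selected_columns <- which(child)
```

In a full genetic algorithm, each mask would be scored by a stability measure and the better-scoring masks would be preferred as parents in the next generation.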

3. Parameter Optimizer

Systematically tunes each hyperparameter separately by performing repeated clustering and evaluating stability using ECC.

Function: `parameter_optimizer()`

set.seed(42)
my_data <- matrix(rnorm(200 * 10), ncol = 10)

# Run parameter optimization
# Find the number of clusters (between 2 and 5) resulting in the most stable partitions.
result <- parameter_optimizer(
  data = my_data,
  clustering_algo = kmeans,
  labels_name = "cluster",
  parameters_optimise_list = list(centers = 2:5),
  n_runs = 30
)

Returns:

results_df: ECC scores for each hyperparameter setting
stochastic_clustering_results: Full clustering results for each setting

4. Parameter Searcher

Evaluates all possible combinations (exhaustive grid search) of specified parameters, running repeated clustering and computing ECC for each combination. The purpose is to find the configuration (set of hyperparameter values) that provides the most stable clustering results.

Function: `parameter_searcher()`

set.seed(42)
my_data <- matrix(rnorm(200 * 10), ncol = 10)

# Run exhaustive parameter search
# Find combination of centers and nstart that results in the most stable partitions.
result <- parameter_searcher(
  data = my_data,
  clustering_algo = kmeans,
  labels_name = "cluster",
  param_grid = list(centers = 2:4, nstart = c(1, 5)),
  n_runs = 30
)

Returns:

results_df: A dataframe of ECC values for different parameter combinations
stochastic_clustering_results: Detailed clustering results
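An exhaustive grid like the one above can be enumerated in base R with expand.grid(). The sketch below only shows how the combinations are formed and passed to kmeans(); it makes no claim about how parameter_searcher() builds its grid internally, and it computes no stability score.

```r
set.seed(42)
my_data <- matrix(rnorm(200 * 10), ncol = 10)

param_grid <- list(centers = 2:4, nstart = c(1, 5))

# Every combination of the listed values: 3 values x 2 values = 6 rows
combos <- expand.grid(param_grid)

# Run kmeans once per combination, passing each row as named arguments
n_clusters_found <- vapply(seq_len(nrow(combos)), function(i) {
  args <- c(list(x = my_data), as.list(combos[i, ]))
  fit <- do.call(kmeans, args)
  length(unique(fit$cluster))
}, integer(1))
```

Because the grid grows multiplicatively with each added parameter, the number of clustering runs is the number of combinations times n_runs.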

5. K-Fold Clustering Validator

Evaluates how stable clustering assignments remain across different data partitions by comparing clustering results on k-fold subsets with those from the full dataset. ECS is used to quantify similarity/stability between fold-level clustering and the baseline (full dataset).

Function: `kfold_clustering_validator()`

set.seed(42)
my_data <- matrix(rnorm(200 * 10), ncol = 10)

# Run k-fold clustering validation
result <- kfold_clustering_validator(
  data = my_data,
  clustering_algo = kmeans,
  labels_name = "cluster",
  k_folds = 5,
  n_runs = 30,
  centers = 5
)

Returns:

baseline_results: Clustering results on full dataset
kfolds_robustness_results: Clustering robustness metrics for each fold
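The fold-versus-baseline comparison can be sketched in base R as follows. This assumes that each subset is the data with one fold left out, which the text above does not specify, and it stops short of computing ECS; it only shows how the labels are aligned on the same rows.

```r
set.seed(42)
my_data <- matrix(rnorm(200 * 10), ncol = 10)
k_folds <- 5

# Randomly assign each row to one of k folds of equal size (200 / 5 = 40)
fold_id <- sample(rep(seq_len(k_folds), length.out = nrow(my_data)))

# Baseline clustering on the full dataset
baseline <- kmeans(my_data, centers = 5)$cluster

fold_results <- lapply(seq_len(k_folds), function(f) {
  keep <- which(fold_id != f)  # assumption: leave fold f out
  list(
    rows = keep,
    subset_labels = kmeans(my_data[keep, , drop = FALSE], centers = 5)$cluster,
    baseline_labels = baseline[keep]  # same rows, so labels align element-wise
  )
})
```

Each entry of fold_results holds two label vectors over the same rows, which is the form an element-wise similarity measure needs.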

6. Perturbation Robustness Tester

Tests how stable clustering results are when features are altered/perturbed. The user must provide a perturbation function, which modifies the dataset before clustering is re-run. Stability is assessed using ECS between the baseline clustering and perturbation-induced clusterings.

Function: `perturbation_robustness_tester()`

set.seed(42)
my_data <- matrix(rnorm(200 * 10), ncol = 10)

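# Define a feature-shuffling perturbation function (permutes the column order;
# rows stay in place so observations remain aligned with the baseline)
shuffle_features <- function(data) {
  data[, sample(ncol(data)), drop = FALSE]
}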

# Run perturbation robustness test
result <- perturbation_robustness_tester(
  data = my_data,
  clustering_algo = kmeans,
  labels_name = "cluster",
  perturbation_func = shuffle_features,
  n_perturbations = 10,
  n_runs = 30,
  centers = 5
)

Returns:

perturbation_el_sim_scores: Similarity scores after perturbation
mean_score: Mean robustness score
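Other perturbations can be written in the same form: a function that takes the data and returns a modified copy. The base-R sketch below adds Gaussian noise to every value; the name add_gaussian_noise and the default sd = 0.1 are hypothetical choices. It keeps the dimensions and row order unchanged, which an element-wise comparison with the baseline would presumably require.

```r
set.seed(42)
my_data <- matrix(rnorm(200 * 10), ncol = 10)

# Hypothetical perturbation: add independent Gaussian noise to each entry
add_gaussian_noise <- function(data, sd = 0.1) {
  noise <- matrix(rnorm(length(data), mean = 0, sd = sd), nrow = nrow(data))
  data + noise
}

perturbed <- add_gaussian_noise(my_data)
```

Such a function could then be supplied as perturbation_func in place of shuffle_features.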

Installation

To install the latest version from GitHub, use:

# install.packages("devtools")
devtools::install_github("Core-Bioinformatics/cRowflow")

Load the package and necessary dependencies:

library(cRowflow)

Dependencies

caret
ggplot2
dplyr
tidyr
viridis
purrr
rlang
ClustAssess

Tutorials

The package can be applied to any clustering task (as long as the clustering algorithm used is stochastic).

In the wine dataset vignette, we show how to use cRowflow with kmeans to assess clustering stability, optimize hyperparameters, and refine feature selection. We first evaluate clustering consistency on typical unoptimised parameters, then identify the most stable configuration, and finally improve stability by selecting an optimal feature subset.

License

This package is released under the MIT License.

Developed by Rafael Kollyfas (rk720@cam.ac.uk), Core Bioinformatics (Mohorianu Lab) group, University of Cambridge. February 2025.
