Contraction Clustering (RASTER): A very fast and parallelizable clustering algorithm

**Copied from: https://github.com/scikit-learn/scikit-learn/issues/27848
In the comments, it was suggested that scikit-learn-extras may be a better fit for this algorithm.**

### Describe the workflow you want to enable

RASTER is a very fast clustering algorithm that runs in linear time, uses constant memory, and only requires a single pass. The relevant package is `cluster`.

### Describe your proposed solution

RASTER has been shown to be faster than all other clustering algorithms that are part of the `cluster` package (see comparative results in the "alternatives" field). A detailed description of the algorithm is in [this paper](https://arxiv.org/pdf/1907.03620.pdf). The key idea is that data points are projected onto a grid. This helper data structure allows us to cluster data points at the desired level of precision and at a speed much faster than any other clustering algorithm we encountered in the literature. The closest comparison we were made aware of was CLIQUE, but RASTER is more efficient and, in fact, many orders of magnitude faster, which we have also shown experimentally (see Appendix B in the paper above).
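To make the grid idea concrete, here is a minimal Python sketch, not the reference implementation linked under "Implementation". It assumes 2D points, a grid obtained by truncating coordinates to a fixed number of decimal places, and clusters formed from adjacent tiles (8-neighborhood) that hold enough points. The function name `raster_sketch` and the parameters `precision`, `threshold`, and `min_size` are hypothetical names chosen for illustration.

```python
from collections import Counter, deque
import math


def raster_sketch(points, precision=1, threshold=2, min_size=2):
    """Illustrative grid-based clustering; all names here are assumptions."""
    scale = 10 ** precision

    # Projection: map every point to an integer grid tile in one pass.
    counts = Counter(
        (math.floor(x * scale), math.floor(y * scale)) for x, y in points
    )

    # Keep only tiles that contain at least `threshold` points.
    significant = {tile for tile, n in counts.items() if n >= threshold}

    # Group adjacent significant tiles (8-neighborhood) into clusters.
    clusters = []
    while significant:
        start = significant.pop()
        cluster = {start}
        queue = deque([start])
        while queue:
            tx, ty = queue.popleft()
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    neighbor = (tx + dx, ty + dy)
                    if neighbor in significant:
                        significant.remove(neighbor)
                        cluster.add(neighbor)
                        queue.append(neighbor)
        # Discard groups with fewer than `min_size` tiles.
        if len(cluster) >= min_size:
            clusters.append(cluster)
    return clusters


points = [(0.11, 0.11), (0.12, 0.13), (0.21, 0.11), (0.22, 0.14), (5.0, 5.0)]
clusters = raster_sketch(points, precision=1, threshold=2, min_size=2)
```

In this sketch the clusters are sets of tiles rather than sets of points. The projection step touches each point once, and the remaining work depends only on the number of occupied tiles.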

Plots with comparisons:
<img width="549" alt="Screen Shot 2023-11-26 at 16 10 16" src="https://github.com/scikit-learn/scikit-learn/assets/3864047/6e8fc819-3e60-44ed-9591-75d19a2a5e6d">

Example of adjusting the precision parameter:
<img width="219" alt="Screen Shot 2023-11-26 at 16 12 47" src="https://github.com/scikit-learn/scikit-learn/assets/3864047/01a482e1-3b0e-4456-9dc5-59a92fa8bed1">

Pseudo-code:
<img width="296" alt="Screen Shot 2023-11-26 at 16 06 36" src="https://github.com/scikit-learn/scikit-learn/assets/3864047/383a419d-4342-4203-910b-2b4be83c2c44">

Implementation:
https://github.com/FraunhoferChalmersCentre/raster/tree/master

The algorithm is furthermore parallelizable.
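One way the projection step could be parallelized, sketched under assumptions: each worker counts points per grid tile for its own chunk of the data, and the partial counts are then summed. The helper names `count_tiles` and `parallel_tile_counts` are hypothetical, and threads are used only to keep the example short; a process pool would be the more natural choice for CPU-bound work in Python.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
import math


def count_tiles(chunk, precision=1):
    """Count points per grid tile for one chunk of 2D points."""
    scale = 10 ** precision
    return Counter(
        (math.floor(x * scale), math.floor(y * scale)) for x, y in chunk
    )


def parallel_tile_counts(chunks, precision=1, workers=4):
    """Project chunks independently, then merge the per-chunk tile counts."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(lambda c: count_tiles(c, precision), chunks):
            total.update(partial)  # Counter.update adds counts
    return total


chunks = [[(0.11, 0.11)], [(0.12, 0.13), (5.0, 5.0)]]
tile_counts = parallel_tile_counts(chunks, precision=1, workers=2)
```

Because tile counts are additive, the merged result does not depend on how the data is split into chunks.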

### Describe alternatives you've considered, if relevant

We compared RASTER with 10 other clustering algorithms and found that it outperforms them. RASTER is not only faster; it can also process larger amounts of data, all else being equal. Here is a summary of the results of our research:
<img width="494" alt="Screen Shot 2023-11-26 at 16 04 57" src="https://github.com/scikit-learn/scikit-learn/assets/3864047/9b8e625b-8f87-4156-be69-f43fab6d7ced">

### Additional context

RASTER was discovered in the context of an industrial research project "FUMA - Fleet telematics big data analytics for vehicle Usage Modeling and Analysis" ([description](https://www.vinnova.se/en/p/fuma---fleet-telematics-big-data-analytics-for-vehicle-usage-modeling-and-analysis/?_t_id=t-Q45eA1octj8p7pTqureg%3d%3d&_t_uuid=iUB_TZq6SlS81DNItabcVg&_t_q=bada&_t_tags=language%3aen%2csiteid%3a6a0eda26-a5be-4f47-a778-b9393a63f812%2candquerymatch&_t_hit.id=Vinnova_Models_Pages_ProjectPage/_a03ac4c1-16a3-4f78-989b-3a1920df8f9d_en&_t_hit.pos=2), [final report](https://www.vinnova.se/globalassets/mikrosajter/ffi/dokument/slutrapporter-ffi/effektiva-och-uppkopplade-transporter-rapporter/2016-02207sv.pdf?cb=20200115151638)), which was administered by the Swedish research agency VINNOVA. It was a collaboration between Scania, a Volkswagen subsidiary, and the Fraunhofer-Chalmers Center for Industrial Mathematics. This research project ran from 2016 to 2019.

A practical result of RASTER was that it made it possible to process terabytes of real-world geospatial data on a local workstation instead of in a data center, leading to significant cost and time savings. This also meant that we could eliminate security risks, as we could keep this highly confidential dataset in-house. In fact, due to the single-pass nature of RASTER, we could generate results locally much faster than we could have obtained them via a data center.
