Skip to content

Contraction Clustering (RASTER): A very fast and parallelizable clustering algorithm #170

Open
@gregorulm

Description

@gregorulm

Copied from: scikit-learn/scikit-learn#27848
In the comments, it was suggested that scikit-learn-extras may be a better fit for this algorithm.

Describe the workflow you want to enable

RASTER is a very fast clustering algorithm that runs in linear time, uses constant memory, and only requires a single pass. The relevant package is cluster.

Describe your proposed solution

RASTER has been shown to be faster than all other clustering algorithms that are part of the cluster package (see comparative results in the "alternatives" field). A detailed description of the algorithm is in this paper. The key idea is that data points are projected onto a grid. This helper data structure that allows us to cluster data points at the desired level of precision and at a speed much faster than any other clustering algorithm we encountered in the literature. The closest comparison we were made aware of was CLIQUE, but RASTER is more efficient and, in fact, many orders of magnitude faster, which we have also shown experimentally, see Appendix B in the paper above.

Plots with comparisons:
Screen Shot 2023-11-26 at 16 10 16

Example of adjusting the precision parameter:
Screen Shot 2023-11-26 at 16 12 47

Pseudo-code:
Screen Shot 2023-11-26 at 16 06 36

Implementation:
https://github.com/FraunhoferChalmersCentre/raster/tree/master

The algorithm is furthermore parallelizable.

Describe alternatives you've considered, if relevant

We compare RASTER to 10 other clustering algorithms, and have found that it outperforms them. RASTER is not only faster, it is also able to process greater amounts of data, ceteris paribus. Here is a summary of the results of our research:
Screen Shot 2023-11-26 at 16 04 57

Additional context

RASTER was discovered in the context of an industrial research project "FUMA - Fleet telematics big data analytics for vehicle Usage Modeling and Analysis" (description, final report), which was administered by the Swedish research agency VINNOVA. It was a collaboration between Scania, a Volkswagen subsidiary, and the Fraunhofer-Chalmers Center for Industrial Mathematics. This research project ran from 2016 to 2019.

A practical result of RASTER was that it made it possible to process TBs of real-world geo-spatial data on a local workstation instead of a data center, leading to significant cost and time-savings. This also implies that we could eliminate security risks as we could keep this highly confidential dataset in-house. In fact, due to the single-pass nature of RASTER, we could generate results locally much faster than we could have gotten them via a data center.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions