Description
Hi,
If one runs sklearn's k-medoids algorithm on a dataset with many duplicate datapoints without removing those points (I'll give an example below), it runs inefficiently: an n**2 NumPy multiplication is performed to evaluate the distances between all points for every point being considered as a medoid (Code reference).
A way around this would be to eliminate duplicates beforehand but retain a frequency vector of their occurrences, which could then be used to weight in_cluster_distances before summing (sketched below).
For example, if one had many customers with non-differentiated profiles, one might want those duplicates to influence the choice of the center and the spread of the cluster containing them, but there is no reason to cycle through all identical customers to recompute the distance matrix at every step.
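A minimal sketch of what I mean, assuming precomputed pairwise distances on the unique points; the function and variable names here are just illustrative, not existing sklearn API:

```python
import numpy as np
from scipy.spatial.distance import cdist

def weighted_medoid_update(D, labels, counts, n_clusters):
    """Pick, per cluster, the point minimizing the count-weighted sum of
    distances to the other members; each unique point stands in for
    `counts` identical datapoints."""
    medoids = np.empty(n_clusters, dtype=int)
    for k in range(n_clusters):
        members = np.flatnonzero(labels == k)
        # weight each column j of the in-cluster distance block by counts[j]
        in_cluster = D[np.ix_(members, members)] * counts[members]
        medoids[k] = members[np.argmin(in_cluster.sum(axis=1))]
    return medoids

# Deduplicate once and keep the occurrence counts; the algorithm then runs on
# the (much smaller) unique set, e.g. a 3x3 distance matrix instead of 55x55.
X_full = np.vstack([np.zeros((50, 2)), np.ones((3, 2)), np.full((2, 2), 5.0)])
X_unique, counts = np.unique(X_full, axis=0, return_counts=True)
D = cdist(X_unique, X_unique)
labels = np.array([0, 0, 1])  # toy cluster assignment of the 3 unique points
print(weighted_medoid_update(D, labels, counts, n_clusters=2))
```

This is just the medoid-update step; the same count-weighting would need to apply wherever per-point distance sums are taken.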
I could not find a way to do this in sklearn. If there is one, then please let me know and add it to the docs, and I can post it on StackExchange and other places where I've seen people ask about it. If there isn't a way, then please add it to the codebase so people don't have to code up a bug-prone wrapper.
On StackExchange, a responder pointed to an R package that had such an option: https://rdrr.io/cran/WeightedCluster/man/kmedoids.html (I'm not an R coder, so I can't verify)
(I would love to help, but as I started a new, intense job, it will probably be a while until I can contribute to open-source again.)