Commit c667669

mnist datasets
1 parent eb60a83 commit c667669

13 files changed: +293 -87 lines changed

README.md

Lines changed: 94 additions & 85 deletions
@@ -141,7 +141,13 @@ We have tried to resolve any conflicts in the *best* possible manner.
 Each dataset consists of 200-1050 observations in 2 dimensions.
 
 
-3. [`other`](catalogue/other.md) includes:
+3. [`mnist`](catalogue/mnist.md) -
+LeCun's MNIST database of handwritten digits
+and Zalando's Fashion-MNIST dataset.
+
+
+
+4. [`other`](catalogue/other.md) includes:
 
 * `hdbscan` - a dataset used for demonstrating the outputs of the
 [Python implementation](https://github.com/scikit-learn-contrib/hdbscan)
@@ -172,7 +178,7 @@ We have tried to resolve any conflicts in the *best* possible manner.
 (TODO: help needed).
 
 
-4. [`sipu`](catalogue/sipu.md) -
+5. [`sipu`](catalogue/sipu.md) -
 datasets available at the SIPU (Speech and Image Processing Unit,
 School of Computing, University of Eastern Finland) website
 
@@ -190,7 +196,7 @@ We have tried to resolve any conflicts in the *best* possible manner.
 We excluded the `DIM`-sets as they turn out to be too easy
 for most algorithms.
 
-5. [`uci`](catalogue/uci.md) -
+6. [`uci`](catalogue/uci.md) -
 a selection of datasets available at the University of California, Irvine,
 [Machine Learning Repository](http://archive.ics.uci.edu/ml/)
 (Dua and Graff, 2019)
@@ -201,23 +207,23 @@ We have tried to resolve any conflicts in the *best* possible manner.
 also listed in the SIPU repository.
 Note that "the" Iris dataset is available elsewhere (see `other`).
 
-6. [`wut`](catalogue/wut.md) -
+7. [`wut`](catalogue/wut.md) -
 authored by the fantastic students
 of Marek Gagolewski's Python for Data Analysis course at
 Warsaw University of Technology:
 Przemysław Kosewski, Jędrzej Krauze, Eliza Kaczorek, Anna Gierlak,
 Adam Wawrzyniak, Aleksander Truszczyński, Mateusz Kobyłka and Michał Maciąg.
 
 
-7. [`g2mg`](catalogue/g2mg.md) -
+8. [`g2mg`](catalogue/g2mg.md) -
 a modified version of `G2`-sets from SIPU with variances
 dependent on datasets' dimensionalities, i.e., s*np.sqrt(d/2),
 which makes these problems more difficult.
 
 Each dataset consists of 2048 observations belonging
 to either of two Gaussian clusters in 1, 2, ..., 128 dimensions.
 
-8. [`h2mg`](catalogue/h2mg.md) -
+9. [`h2mg`](catalogue/h2mg.md) -
 two Gaussian-like hubs with spread dependent on datasets' dimensionalities
 
 Each dataset consists of 2048 observations in 1, 2, ..., 128 dimensions.
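To make the `s*np.sqrt(d/2)` correction in the `g2mg` description concrete, here is a minimal sketch of the scaling factor on its own. It assumes that `s` is a base spread parameter and `d` is the dimensionality, as the expression suggests. The helper name `scaled_spread` is hypothetical, and this is not the code from `generate_gKmg.py`.

```python
import math


def scaled_spread(s, d):
    """Return s * sqrt(d / 2), i.e. the spread after the dimensionality correction.

    Hypothetical helper: `s` is taken to be a base spread, `d` the number of dimensions.
    """
    return s * math.sqrt(d / 2)


# Dimensionalities mentioned in the README text: 1, 2 and 128.
for d in (1, 2, 128):
    print(d, scaled_spread(1.0, d))
```

Under this reading the factor equals 1 at d=2 and grows to 8 at d=128, so the spread widens with the dimensionality.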
@@ -231,85 +237,88 @@ We have tried to resolve any conflicts in the *best* possible manner.
 ## List of Datasets
 
 
-| |dataset | n| d|
-|:--|:----------------------|------:|--:|
-|1 |fcps/atom | 800| 3|
-|2 |fcps/chainlink | 1000| 3|
-|3 |fcps/engytime | 4096| 2|
-|4 |fcps/hepta | 212| 3|
-|5 |fcps/lsun | 400| 2|
-|6 |fcps/target | 770| 2|
-|7 |fcps/tetra | 400| 3|
-|8 |fcps/twodiamonds | 800| 2|
-|9 |fcps/wingnut | 1016| 2|
-|10 |graves/dense | 200| 2|
-|11 |graves/fuzzyx | 1000| 2|
-|12 |graves/line | 250| 2|
-|13 |graves/parabolic | 1000| 2|
-|14 |graves/ring | 1000| 2|
-|15 |graves/ring_noisy | 1050| 2|
-|16 |graves/ring_outliers | 1030| 2|
-|17 |graves/zigzag | 250| 2|
-|18 |graves/zigzag_noisy | 300| 2|
-|19 |graves/zigzag_outliers | 280| 2|
-|20 |other/chameleon_t4_8k | 8000| 2|
-|21 |other/chameleon_t5_8k | 8000| 2|
-|22 |other/chameleon_t7_10k | 10000| 2|
-|23 |other/chameleon_t8_8k | 8000| 2|
-|24 |other/hdbscan | 2309| 2|
-|25 |other/iris | 150| 4|
-|26 |other/iris5 | 105| 4|
-|27 |other/square | 1000| 2|
-|28 |sipu/a1 | 3000| 2|
-|29 |sipu/a2 | 5250| 2|
-|30 |sipu/a3 | 7500| 2|
-|31 |sipu/aggregation | 788| 2|
-|32 |sipu/birch1 | 100000| 2|
-|33 |sipu/birch2 | 100000| 2|
-|34 |sipu/compound | 399| 2|
-|35 |sipu/d31 | 3100| 2|
-|36 |sipu/flame | 240| 2|
-|37 |sipu/jain | 373| 2|
-|38 |sipu/pathbased | 300| 2|
-|39 |sipu/r15 | 600| 2|
-|40 |sipu/s1 | 5000| 2|
-|41 |sipu/s2 | 5000| 2|
-|42 |sipu/s3 | 5000| 2|
-|43 |sipu/s4 | 5000| 2|
-|44 |sipu/spiral | 312| 2|
-|45 |sipu/unbalance | 6500| 2|
-|46 |sipu/worms_2 | 105600| 2|
-|47 |sipu/worms_64 | 105000| 64|
-|48 |uci/ecoli | 336| 7|
-|49 |uci/glass | 214| 9|
-|50 |uci/ionosphere | 351| 34|
-|51 |uci/sonar | 208| 60|
-|52 |uci/statlog | 2310| 19|
-|53 |uci/wdbc | 569| 30|
-|54 |uci/wine | 178| 13|
-|55 |uci/yeast | 1484| 8|
-|56 |wut/circles | 4000| 2|
-|57 |wut/cross | 2000| 2|
-|58 |wut/graph | 2500| 2|
-|59 |wut/isolation | 9000| 2|
-|60 |wut/labirynth | 3546| 2|
-|61 |wut/mk1 | 300| 2|
-|62 |wut/mk2 | 1000| 2|
-|63 |wut/mk3 | 600| 3|
-|64 |wut/mk4 | 1500| 3|
-|65 |wut/olympic | 5000| 2|
-|66 |wut/smile | 1000| 2|
-|67 |wut/stripes | 5000| 2|
-|68 |wut/trajectories | 10000| 2|
-|69 |wut/trapped_lovers | 5000| 3|
-|70 |wut/twosplashes | 400| 2|
-|71 |wut/windows | 2977| 2|
-|72 |wut/x1 | 120| 2|
-|73 |wut/x2 | 120| 2|
-|74 |wut/x3 | 185| 2|
-|75 |wut/z1 | 192| 2|
-|76 |wut/z2 | 900| 2|
-|77 |wut/z3 | 1000| 2|
+| |dataset | n| d|
+|:--|:----------------------|------:|---:|
+|1 |fcps/atom | 800| 3|
+|2 |fcps/chainlink | 1000| 3|
+|3 |fcps/engytime | 4096| 2|
+|4 |fcps/hepta | 212| 3|
+|5 |fcps/lsun | 400| 2|
+|6 |fcps/target | 770| 2|
+|7 |fcps/tetra | 400| 3|
+|8 |fcps/twodiamonds | 800| 2|
+|9 |fcps/wingnut | 1016| 2|
+|10 |graves/dense | 200| 2|
+|11 |graves/fuzzyx | 1000| 2|
+|12 |graves/line | 250| 2|
+|13 |graves/parabolic | 1000| 2|
+|14 |graves/ring | 1000| 2|
+|15 |graves/ring_noisy | 1050| 2|
+|16 |graves/ring_outliers | 1030| 2|
+|17 |graves/zigzag | 250| 2|
+|18 |graves/zigzag_noisy | 300| 2|
+|19 |graves/zigzag_outliers | 280| 2|
+|20 |mnist/digits | 70000| 784|
+|21 |mnist/fashion | 70000| 784|
+|22 |other/chameleon_t4_8k | 8000| 2|
+|23 |other/chameleon_t5_8k | 8000| 2|
+|24 |other/chameleon_t7_10k | 10000| 2|
+|25 |other/chameleon_t8_8k | 8000| 2|
+|26 |other/hdbscan | 2309| 2|
+|27 |other/iris | 150| 4|
+|28 |other/iris5 | 105| 4|
+|29 |other/square | 1000| 2|
+|30 |sipu/a1 | 3000| 2|
+|31 |sipu/a2 | 5250| 2|
+|32 |sipu/a3 | 7500| 2|
+|33 |sipu/aggregation | 788| 2|
+|34 |sipu/birch1 | 100000| 2|
+|35 |sipu/birch2 | 100000| 2|
+|36 |sipu/compound | 399| 2|
+|37 |sipu/d31 | 3100| 2|
+|38 |sipu/flame | 240| 2|
+|39 |sipu/jain | 373| 2|
+|40 |sipu/pathbased | 300| 2|
+|41 |sipu/r15 | 600| 2|
+|42 |sipu/s1 | 5000| 2|
+|43 |sipu/s2 | 5000| 2|
+|44 |sipu/s3 | 5000| 2|
+|45 |sipu/s4 | 5000| 2|
+|46 |sipu/spiral | 312| 2|
+|47 |sipu/unbalance | 6500| 2|
+|48 |sipu/worms_2 | 105600| 2|
+|49 |sipu/worms_64 | 105000| 64|
+|50 |uci/ecoli | 336| 7|
+|51 |uci/glass | 214| 9|
+|52 |uci/ionosphere | 351| 34|
+|53 |uci/sonar | 208| 60|
+|54 |uci/statlog | 2310| 19|
+|55 |uci/wdbc | 569| 30|
+|56 |uci/wine | 178| 13|
+|57 |uci/yeast | 1484| 8|
+|58 |wut/circles | 4000| 2|
+|59 |wut/cross | 2000| 2|
+|60 |wut/graph | 2500| 2|
+|61 |wut/isolation | 9000| 2|
+|62 |wut/labirynth | 3546| 2|
+|63 |wut/mk1 | 300| 2|
+|64 |wut/mk2 | 1000| 2|
+|65 |wut/mk3 | 600| 3|
+|66 |wut/mk4 | 1500| 3|
+|67 |wut/olympic | 5000| 2|
+|68 |wut/smile | 1000| 2|
+|69 |wut/stripes | 5000| 2|
+|70 |wut/trajectories | 10000| 2|
+|71 |wut/trapped_lovers | 5000| 3|
+|72 |wut/twosplashes | 400| 2|
+|73 |wut/windows | 2977| 2|
+|74 |wut/x1 | 120| 2|
+|75 |wut/x2 | 120| 2|
+|76 |wut/x3 | 185| 2|
+|77 |wut/z1 | 192| 2|
+|78 |wut/z2 | 900| 2|
+|79 |wut/z3 | 1000| 2|
+
 
 
 
catalogue/mnist.csv

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+dataset,n,d,labels,k,noise,g
+mnist/digits,70000,784,labels0,10,0,0.03069206349206349
+mnist/fashion,70000,784,labels0,10,0,0.0
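The commit does not explain the `g` column. As a hedged illustration, the values recorded above agree with the normalised Gini index of the label counts listed in `catalogue/mnist.md`: the sum of all pairwise absolute differences divided by (k-1) times the total count. The function name `normalised_gini` is ours, and whether the catalogue generator computes `g` in exactly this way is an assumption.

```python
def normalised_gini(counts):
    """Normalised Gini index of non-negative counts (0 = perfectly balanced).

    Assumed definition: sum_{i<j} |x_i - x_j| / ((k - 1) * sum_i x_i).
    """
    xs = sorted(counts)
    k = len(xs)
    total = sum(xs)
    # For sorted data, the sum of pairwise absolute differences equals
    # sum_i (2*i - k + 1) * xs[i] with 0-based i.
    pairwise = sum((2 * i - k + 1) * x for i, x in enumerate(xs))
    return pairwise / ((k - 1) * total)


# Label counts copied from catalogue/mnist.md in this commit.
digits = [7877, 6990, 7141, 6824, 6313, 6876, 7293, 6825, 6958, 6903]
fashion = [7000] * 10

print(round(normalised_gini(digits), 3))   # 0.031, matching true_g for mnist/digits
print(normalised_gini(fashion))            # 0.0, matching true_g for mnist/fashion
```

A value near zero indicates almost equally sized classes, which is why the perfectly balanced Fashion-MNIST labels yield exactly 0.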

catalogue/mnist.md

Lines changed: 90 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,90 @@
1+
**[Benchmark Suite for Clustering Algorithms -- Version 1](https://github.com/gagolews/clustering_benchmarks_v1)
2+
is maintained by [Marek Gagolewski](http://www.gagolewski.com)**
3+
4+
5+
--------------------------------------------------------------------------------
6+
7+
**Datasets**
8+
9+
* [mnist/digits](#mnist_digits)
10+
* [mnist/fashion](#mnist_fashion)
11+
12+
--------------------------------------------------------------------------------
13+
14+
## mnist/digits (n=70000, d=784) <a name="mnist_digits"></a>
15+
16+
THE MNIST DATABASE of handwritten digits
17+
-- train and test sample combined
18+
19+
Authors: Yann LeCun, Corinna Cortes and Christopher J.C. Burges
20+
21+
Source: http://yann.lecun.com/exdb/mnist/
22+
23+
24+
`labels0` come from the authors.
25+
26+
27+
28+
#### `labels0`
29+
30+
true_k=10, noise= 0, true_g=0.031
31+
32+
label_counts=[7877, 6990, 7141, 6824, 6313, 6876, 7293, 6825, 6958, 6903]
33+
34+
> **(preview generation suppressed)**
35+
36+
37+
38+
39+
40+
## mnist/fashion (n=70000, d=784) <a name="mnist_fashion"></a>
41+
42+
Fashion-MNIST is a dataset of Zalando’s article images
43+
-- train and test sample combined
44+
45+
Authors: Han Xiao, Kashif Rasul and Roland Vollgraf
46+
47+
Source: https://github.com/zalandoresearch/fashion-mnist
48+
49+
Citation: Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine
50+
Learning Algorithms. Han Xiao, Kashif Rasul, Roland Vollgraf. arXiv:1708.07747
51+
52+
53+
`labels0` come from the authors.
54+
55+
License: MIT
56+
57+
Copyright © [2017] Zalando SE, https://tech.zalando.com
58+
59+
Permission is hereby granted, free of charge, to any person obtaining a copy
60+
of this software and associated documentation files (the “Software”), to deal
61+
in the Software without restriction, including without limitation the rights
62+
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
63+
copies of the Software, and to permit persons to whom the Software is
64+
furnished to do so, subject to the following conditions:
65+
66+
The above copyright notice and this permission notice shall be included
67+
in all copies or substantial portions of the Software.
68+
69+
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
70+
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
71+
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
72+
THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
73+
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
74+
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
75+
SOFTWARE.
76+
77+
78+
79+
#### `labels0`
80+
81+
true_k=10, noise= 0, true_g=0.000
82+
83+
label_counts=[7000, 7000, 7000, 7000, 7000, 7000, 7000, 7000, 7000, 7000]
84+
85+
> **(preview generation suppressed)**
86+
87+
88+
89+
90+

catalogue_generate_all.sh

Lines changed: 1 addition & 0 deletions
@@ -5,6 +5,7 @@
 ./catalogue_generate.py other
 ./catalogue_generate.py sipu
 ./catalogue_generate.py uci
+./catalogue_generate.py mnist
 ./catalogue_generate.py wut
 ./catalogue_generate.py h2mg
 ./catalogue_generate.py g2mg

generate_gKmg.py

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@
 
 Copyright (C) 2018-2020 Marek.Gagolewski.com
 
-A generalized version of SIPU's G2 sets
+A generalised version of SIPU's G2 sets
 (see https://cs.joensuu.fi/sipu/datasets/G2.txt),
 but with a correction for dimensionality.
 

generate_hKmg.py

Lines changed: 1 addition & 1 deletion
@@ -28,7 +28,7 @@ def generate_hKmg(d, n, mu, s, random_state=None):
 corresponding labels.
 
 The i-th group, i=1,...,K, consists of n[i-1] points
-that are sampled from a sphere centered at mu[i-1,:], of radius that follows
+that are sampled from a sphere centred at mu[i-1,:], of radius that follows
 the Gaussian distribution with mean 0 and standard deviation of s[i-1].
 """
 assert mu.shape[0] == n.shape[0] == s.shape[0]
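The docstring in the hunk above can be read as the following sampling scheme. This is a sketch under assumptions, not the body of `generate_hKmg`, which this diff does not show. It assumes that directions are uniform on the unit sphere, that `n` and `s` are 1-D arrays of length K, that `mu` is a K-by-d array, and that labels run from 1 to K. The name `sketch_hKmg` is hypothetical.

```python
import numpy as np


def sketch_hKmg(d, n, mu, s, random_state=None):
    """Illustrative sampler: K hubs in d dimensions with Gaussian radii.

    Hypothetical reading of the docstring; not the repository's implementation.
    """
    assert mu.shape[0] == n.shape[0] == s.shape[0]
    assert mu.shape[1] == d
    rng = np.random.default_rng(random_state)
    points, labels = [], []
    for i in range(n.shape[0]):
        # Uniform directions: normalise standard Gaussian vectors to unit length.
        directions = rng.standard_normal((int(n[i]), d))
        directions /= np.linalg.norm(directions, axis=1, keepdims=True)
        # Signed radii drawn from N(0, s[i]^2), one per point.
        radii = rng.normal(0.0, s[i], size=(int(n[i]), 1))
        points.append(mu[i, :] + radii * directions)
        labels.append(np.full(int(n[i]), i + 1))
    return np.vstack(points), np.concatenate(labels)


# Two hubs in 3 dimensions with 10 and 20 points respectively.
mu = np.array([[0.0, 0.0, 0.0], [5.0, 5.0, 5.0]])
X, y = sketch_hKmg(3, np.array([10, 20]), mu, np.array([1.0, 2.0]), random_state=123)
print(X.shape, np.bincount(y)[1:])
```

Because the radius is Gaussian with mean 0, most points concentrate near each centre, which is one way to obtain the "hub" shape described in the README.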
