It is not uncommon for clustering papers or graduate theses to consider only a small number of datasets, say 5–10 UCI-sourced ones, which is obviously too few for a rigorous evaluation. Other authors propose their own datasets but forget to test their methods against other benchmark batteries, risking biased conclusions.

Authors who share their data (kudos to them!) do not necessarily make the use of their suites particularly smooth (different file formats, different access methods, etc., even within a single repository). On the other hand, other machine learning domains (and also optimisation) have long had standardised, well-agreed-upon approaches for testing the quality of algorithms.

For these reasons, I started a project that aims to aggregate, polish, and standardise the existing clustering benchmark suites referred to across the machine learning and data mining literature. Moreover, it adds a few new datasets of different dimensionalities, sizes, and cluster types.
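To illustrate the kind of standardised evaluation such a battery enables, here is a minimal sketch (not the project's actual API) of scoring clustering algorithms against reference labels with the adjusted Rand index. The dataset generators and algorithm choices below are purely illustrative stand-ins for a real benchmark suite.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.datasets import make_blobs, make_moons
from sklearn.metrics import adjusted_rand_score

# Stand-ins for a benchmark battery: (name, points X, reference labels y)
datasets = [
    ("blobs", *make_blobs(n_samples=300, centers=3, random_state=0)),
    ("moons", *make_moons(n_samples=300, noise=0.05, random_state=0)),
]

def benchmark(datasets):
    """Score several algorithms on each dataset via the adjusted Rand index."""
    results = {}
    for name, X, y_true in datasets:
        k = len(np.unique(y_true))  # assume the true cluster count is given
        algorithms = {
            "kmeans": KMeans(n_clusters=k, n_init=10, random_state=0),
            "ward": AgglomerativeClustering(n_clusters=k, linkage="ward"),
        }
        for alg_name, alg in algorithms.items():
            y_pred = alg.fit_predict(X)
            # ARI = 1 means perfect agreement; ~0 means chance-level labelling
            results[(name, alg_name)] = adjusted_rand_score(y_true, y_pred)
    return results

scores = benchmark(datasets)
for key, ari in sorted(scores.items()):
    print(key, round(ari, 3))
```

The point of a standardised suite is precisely that the loop above can range over many datasets and many methods with no per-dataset glue code.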

Also check out my recent paper: M. Gagolewski, M. Bartoszuk, A. Cena, Are cluster validity measures (in)valid?, Information Sciences, 2021, in press, DOI: 10.1016/j.ins.2021.10.004.


Remember that most datasets are boring and there’s rarely anything of significance therein (the null hypothesis). It is usually not your fault if a dataset fails to provide sufficient evidence for anything “mind-boggling”.

If your manager/thesis supervisor/client forces you to squeeze datasets too hard or to start cherry-picking, it is your ethical duty to say no. The reproducibility/replication crisis in science is real. This rat race is simply unsustainable.

Here are some datasets I use for teaching/training data wrangling and modelling skills. It’s worth having them at hand in case you come across a hidden gem after all.

Also, you should study maths in order to understand the limitations of the methods/models you use.

Ordinal Regression