Marek Gagolewski

It is not rare for clustering papers or graduate theses to consider only a small number of datasets, say 5-10 UCI-sourced ones, which is obviously too few for any rigorous evaluation. Other authors propose their own datasets but forget to test their methods against other benchmark batteries, risking biased evaluations.

Authors who share their data (kudos to them!) do not necessarily make the use of their suites particularly smooth (different file formats, different access methods, etc., even within a single repository). On the other hand, other machine learning domains (but also: optimisation) have long had standardised, well-agreed-upon approaches to testing the quality of algorithms.

For these reasons, I started a project that aims to aggregate, polish, and standardise the existing clustering benchmark suites referred to across the machine learning and data mining literature. Moreover, it adds a few new datasets of different dimensionalities, sizes, and cluster types.
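To illustrate the kind of evaluation such a battery enables, here is a minimal sketch (not the project's actual API; the dataset names and generators are stand-ins) of scoring a single clustering algorithm against several labelled benchmark datasets with an external validity measure:

```python
# A hedged sketch: benchmark one algorithm over a small battery of
# labelled datasets, scoring each run with the adjusted Rand index.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Stand-ins for real benchmark suites: name -> (X, true_labels).
battery = {
    "blobs3": make_blobs(n_samples=300, centers=3, random_state=42),
    "blobs5": make_blobs(n_samples=500, centers=5, random_state=42),
}

scores = {}
for name, (X, y_true) in battery.items():
    # Use the reference partition's cluster count, as benchmark
    # suites typically provide it.
    k = len(np.unique(y_true))
    y_pred = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    # ARI: 1.0 = perfect agreement with the reference labels,
    # values near 0.0 = no better than random labelling.
    scores[name] = adjusted_rand_score(y_true, y_pred)

for name, ari in sorted(scores.items()):
    print(f"{name}: ARI = {ari:.3f}")
```

The point of a large, standardised battery is precisely that such a loop can range over dozens of heterogeneous datasets rather than a handful of hand-picked ones.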

Also check out my recent paper: M. Gagolewski, M. Bartoszuk, A. Cena, Are cluster validity measures (in)valid?, Information Sciences, 2021, in press, doi:10.1016/j.ins.2021.10.004.



Here are some datasets I use for teaching/training data wrangling and modelling skills.

Remember that, in real life, most datasets are boring and there is rarely anything of significance therein. It is usually not your fault if a dataset fails to provide sufficient evidence for anything "mind-boggling".

If your manager/thesis supervisor/client forces you to squeeze them too hard or to start cherry-picking, it is your ethical duty to say no. The reproducibility/replication crisis in science is real. This rat race is simply unsustainable.

Also, you should study maths in order to understand the limitations of the methods/models you use. Check out my open-access textbook Minimalist Data Wrangling with Python to learn more.

Ordinal Regression