2019-01-16 new paper

Supervised Learning to Aggregate Data with the Sugeno Integral

Supervised Learning to Aggregate Data with the Sugeno Integral, co-authored with Simon James and Gleb Beliakov, will appear in IEEE Trans. Fuzzy Systems.

Abstract. The problem of learning symmetric capacities (or fuzzy measures) from data is investigated toward applications in data analysis and prediction as well as decision making. Theoretical results regarding the solution minimizing the mean absolute error are exploited to develop an exact branch-refine-and-bound-type algorithm for fitting Sugeno integrals (weighted lattice polynomial functions, max-min operators) with respect to symmetric capacities. The proposed method turns out to be particularly suitable for acting on ordinal data. In addition to providing a model that can be used for the general data regression task, the results can be used, among others, to calibrate generalized h-indices to bibliometric data.
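To make the objects in the abstract concrete, here is a minimal sketch of how a Sugeno integral with respect to a symmetric capacity can be evaluated, and how the h-index arises as a special case. It only evaluates the integral; it is not the branch-refine-and-bound fitting algorithm from the paper. The function names `sugeno_symmetric` and `h_index` are illustrative, and the capacity is assumed to be given as a nondecreasing list `h`, where `h[i-1]` is the measure of any subset of cardinality `i`.

```python
def sugeno_symmetric(x, h):
    """Sugeno integral of x w.r.t. a symmetric capacity (illustrative sketch).

    x : input values on some ordinal scale
    h : h[i-1] is the capacity of any subset with i elements
        (assumed nondecreasing; not checked here)
    """
    if len(x) != len(h) or not x:
        raise ValueError("x and h must be nonempty and of equal length")
    xs = sorted(x, reverse=True)  # x_(1) >= x_(2) >= ... >= x_(n)
    # max over i of min(x_(i), h(i)): the max-min form of the integral
    return max(min(xi, hi) for xi, hi in zip(xs, h))


def h_index(citations):
    """h-index as a Sugeno integral with h(i) = i on the counting scale."""
    n = len(citations)
    if n == 0:
        return 0
    return sugeno_symmetric(citations, list(range(1, n + 1)))


print(h_index([10, 8, 5, 4, 3]))                       # 4
print(sugeno_symmetric([0.2, 0.9, 0.6], [1/3, 2/3, 1]))  # 0.6
```

Replacing `h(i) = i` with another nondecreasing sequence gives a generalized h-index; choosing that sequence from data is the calibration task the abstract refers to.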

2018-12-11 new Ph.D.

Anna Cena's Ph.D. defense

My Ph.D. student, Anna Cena, has defended her doctoral thesis, Adaptive hierarchical clustering algorithms based on data aggregation methods. Yay!

2018-10-26 new Ph.D.

Maciej Bartoszuk's Ph.D. defense

My Ph.D. student, Maciej Bartoszuk, has defended his doctoral thesis (cum laude!), A source code similarity assessment system for functional programming languages based on machine learning and data aggregation methods. Congratulations!

2018-07-02 new paper

The efficacy of league formats in ranking teams

The efficacy of league formats in ranking teams has been accepted for publication in Statistical Modelling. Joint work with Jan Lasek.

Abstract. The efficacy of different league formats in ranking teams according to their true latent strength is analysed. To this end, a new approach for estimating attacking and defensive strengths based on the Poisson regression for modelling match outcomes is proposed. Various performance metrics are estimated reflecting the agreement between the teams' latent strength parameters and their final rank in the league table. The tournament designs studied here are used in the majority of European top-tier association football competitions. Based on numerical experiments, it turns out that a two-stage league format comprising a triple round-robin tournament together with an extra single round-robin is the most efficacious setting. In particular, it is the most accurate in selecting the best team as the winner of the league. Its efficacy can be enhanced by setting the number of points allocated for a win to two (instead of the three currently in effect in association football).
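The kind of numerical experiment described in the abstract can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's model: goals scored by team i against team j are assumed to be Poisson with rate exp(attack[i] - defence[j]), home advantage is ignored, the format is a plain multi-round round-robin, and the "best" team is taken to be the one maximizing attack + defence. All function and parameter names are hypothetical.

```python
import math
import random
from itertools import combinations


def poisson(rng, lam):
    """Draw one Poisson(lam) variate using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


def simulate_league(attack, defence, rounds=2, win_points=3, seed=None):
    """Simulate a round-robin league; return team indices, best first.

    Each pair of teams meets `rounds` times. A draw is worth 1 point to
    each side. Ties in points are broken by goal difference only.
    """
    rng = random.Random(seed)
    n = len(attack)
    points = [0] * n
    goal_diff = [0] * n
    for _ in range(rounds):
        for i, j in combinations(range(n), 2):
            goals_i = poisson(rng, math.exp(attack[i] - defence[j]))
            goals_j = poisson(rng, math.exp(attack[j] - defence[i]))
            goal_diff[i] += goals_i - goals_j
            goal_diff[j] += goals_j - goals_i
            if goals_i > goals_j:
                points[i] += win_points
            elif goals_j > goals_i:
                points[j] += win_points
            else:
                points[i] += 1
                points[j] += 1
    return sorted(range(n), key=lambda t: (points[t], goal_diff[t]), reverse=True)


def best_team_win_rate(attack, defence, n_sim=1000, rounds=2, win_points=3):
    """Fraction of simulated seasons won by the (assumed) strongest team."""
    strongest = max(range(len(attack)), key=lambda t: attack[t] + defence[t])
    wins = sum(
        simulate_league(attack, defence, rounds, win_points, seed=s)[0] == strongest
        for s in range(n_sim)
    )
    return wins / n_sim
```

Calling `best_team_win_rate` with `win_points=2` and then `win_points=3` on the same strength parameters gives a rough feel for the comparison made in the abstract, although the outcome depends on the assumed strengths and on this simplified format.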

2018-05-23 software

Python package genieclust 0.1a2 released

An alpha release of the Python package implementing our fast and robust Genie clustering algorithm is now available on PyPI. Check out the GitHub repository for more information and tutorials.

2018-05-11 invited talk

Invited Plenary Lecture @ ISCAMI 2018

Today, at the International Student Conference on Applied Mathematics and Informatics – ISCAMI 2018 held in Malenovice, Czechia, I gave a lecture entitled Clustering on MSTs.

Abstract. Cluster analysis is one of the most commonly applied unsupervised machine learning techniques. Its aim is to automatically discover an underlying structure of a data set represented by a partition of its elements: mutually disjoint and nonempty subsets are determined in such a way that observations within each group are "similar" and entities in distinct clusters "differ" as much as possible from each other.

It turns out that two state-of-the-art clustering algorithms, namely the Genie and HDBSCAN* methods, can be computed based on the minimum spanning tree (MST) of the pairwise dissimilarity graph. Both are resistant to outliers, produce high-quality partitions, and are relatively fast to compute.

The aim of this tutorial is to discuss some key issues of hierarchical clustering and explore their relations with graph and data aggregation theory.
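The link between hierarchical clustering and MSTs can be illustrated with a short sketch. Note that this is plain single linkage (build the MST, then drop the k-1 heaviest edges), which is neither Genie nor HDBSCAN*; it only shows how a partition can be read off a spanning tree. The Euclidean distance and the function names `mst_prim` and `cut_mst` are assumptions made for the example.

```python
import math


def mst_prim(points):
    """MST edges (weight, i, j) of the complete Euclidean graph, in O(n^2)."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n  # distance from each vertex to the current tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # pick the vertex outside the tree that is closest to it
        u = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((best[u], parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return edges


def cut_mst(points, k):
    """Single-linkage partition into k clusters via the MST."""
    n = len(points)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of points")
    kept = sorted(mst_prim(points))[: n - k]  # drop the k-1 heaviest edges
    root = list(range(n))

    def find(i):
        # union-find lookup with path halving
        while root[i] != i:
            root[i] = root[root[i]]
            i = root[i]
        return i

    for _, i, j in kept:
        root[find(i)] = find(j)
    # relabel components as 0, 1, ... in order of first appearance
    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(n)]


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(cut_mst(points, 2))  # [0, 0, 0, 1, 1, 1]
```

Single linkage is known to be sensitive to outliers, since one distant point can claim a cluster of its own. Methods such as Genie and HDBSCAN* start from the same tree but process it differently.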

2018-05-03 software

R package stringi 1.2.2 released

A new major release of the R package stringi is out. Check out the changelog for more information.


* [GENERAL] #193: `stringi` is now bundled with ICU4C 61.1,
which is used on most Windows and OS X builds as well as on *nix systems
not equipped with ICU. However, if C++11 support is disabled,
`stringi` will be built against ICU4C 55.1. The update to ICU brings
Unicode 10.0 support, including new emoji characters.

* [BUGFIX] #288: `stri_match` did not return the correct number of columns
when input was empty.

* [NEW FEATURE] #188: `stri_enc_detect` now returns a list of data frames.

* [NEW FEATURE] #289: `stri_flatten` gained `na_empty` and `omit_empty` arguments.

* [NEW FEATURE] New functions: `stri_remove_empty`, `stri_na2empty`.

* [NEW FEATURE] #285: Coercion from a non-trivial list (one that consists
of atomic vectors, each of length 1) to an atomic vector now issues a warning.

* [WARN] Removed `-Wparentheses` warnings in `icu55/common/cstring.h:38:63`
and `icu55/i18n/windtfmt.cpp` in the ICU4C 55.1 bundle.

2018-04-20 invited workshop

Text Analysis Developers' Workshop 2018 @ NYC

Greetings from the Text Analysis Developers' Workshop 2018 @ New York University! This is a follow-up to the great event held a year ago at the London School of Economics, but with a stronger out-of-R focus (Python included).