2020-07-31 software

Python and R package genieclust 0.9.4

A reimplementation of my robust hierarchical clustering algorithm Genie is now available on PyPI and CRAN. It is now even faster and equipped with many more features, including noise point detection. See the project documentation for more details, benchmarks, and tutorials.
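The actual genieclust implementation is a fast, MST-based one; purely as an illustration (all function names here are hypothetical, and this is not the package's API), the toy pure-Python sketch below captures the core idea behind Genie: agglomerative single linkage, except that once the inequality of cluster sizes, as measured by the Gini index, exceeds a threshold, only the smallest clusters are allowed to take part in the next merge, so outliers are not left behind indefinitely.

```python
def gini(sizes):
    """Gini index of the cluster size distribution (0 = perfectly balanced)."""
    s, n, total = sorted(sizes), len(sizes), sum(sizes)
    if n < 2 or total == 0:
        return 0.0
    return sum(abs(a - b) for a in s for b in s) / (2.0 * n * total)

def genie_sketch(points, n_clusters, g=0.3):
    """Toy single-linkage clustering of 1-D points with a Genie-style
    correction: once the Gini index of the cluster sizes exceeds the
    threshold g, only merges involving a smallest cluster are permitted."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        sizes = [len(c) for c in clusters]
        if gini(sizes) > g:
            smallest = min(sizes)
            allowed = {i for i, s in enumerate(sizes) if s == smallest}
        else:
            allowed = set(range(len(clusters)))
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if i not in allowed and j not in allowed:
                    continue  # neither cluster may merge in this round
                # single linkage: closest pair of points across the clusters
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))
    return clusters
```

For instance, `genie_sketch([0.0, 0.1, 0.2, 5.0, 5.1, 10.0], 3)` recovers the three natural groups, keeping the isolated point `10.0` as its own cluster.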
2020-07-08 new paper

Paper on SimilaR in R Journal

SimilaR: R Code Clone and Plagiarism Detection by Maciej Bartoszuk and me has been accepted for publication in the R Journal.

Abstract. Third-party software for assuring source code quality is becoming increasingly popular. Tools that evaluate the coverage of unit tests, perform static code analysis, or inspect run-time memory use are crucial in the software development life cycle. More sophisticated methods allow for performing meta-analyses of large software repositories, e.g., to discover abstract topics they relate to or common design patterns applied by their developers. They may be useful in gaining a better understanding of the component interdependencies, avoiding cloned code as well as detecting plagiarism in programming classes.

A meaningful measure of similarity of computer programs often forms the basis of such tools. While there are a few noteworthy instruments for similarity assessment, none of them turns out particularly suitable for analysing R code chunks. Existing solutions rely on rather simple techniques and heuristics and fail to provide the user with the kind of sensitivity and specificity required for working with R scripts. In order to fill this gap, we propose a new algorithm based on a Program Dependence Graph, implemented in the SimilaR package. It can serve as a tool not only for improving R code quality but also for detecting plagiarism, even when it has been masked by applying some obfuscation techniques or inserting dead code. We demonstrate its accuracy and efficiency in a real-world case study.
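SimilaR itself operates on Program Dependence Graphs of R functions; as a loose analogy only (not the paper's algorithm, and all names below are made up), the snippet compares two Python code chunks by their multisets of abstract syntax tree node types. Even this crude structural profile is already immune to identifier renaming, one of the obfuscation techniques mentioned above.

```python
import ast
from collections import Counter

def ast_profile(src):
    """Multiset of AST node-type names -- identifier names are ignored,
    so renaming-based obfuscation does not change the profile."""
    return Counter(type(node).__name__ for node in ast.walk(ast.parse(src)))

def similarity(src_a, src_b):
    """Jaccard-style overlap of the two node-type multisets, in [0, 1]."""
    a, b = ast_profile(src_a), ast_profile(src_b)
    inter = sum((a & b).values())
    union = sum((a | b).values())
    return inter / union if union else 1.0

f1 = """
def total(xs):
    s = 0
    for x in xs:
        s += x
    return s
"""

# f2 is f1 with every identifier renamed -- a trivial "plagiarised" clone.
f2 = """
def acc(values):
    r = 0
    for v in values:
        r += v
    return r
"""

# f3 computes something structurally different.
f3 = """
def prod_pairs(xs):
    return [x * y for x in xs for y in xs]
"""
```

Here `similarity(f1, f2)` is exactly 1.0 despite the renaming, while `similarity(f1, f3)` is strictly lower. A Program Dependence Graph, as used by SimilaR, is of course far more discriminative than this node-type histogram.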

2020-06-08 new paper

Paper in PNAS: Three Dimensions of Scientific Impact

In a paper recently published in the Proceedings of the National Academy of Sciences of the United States of America (PNAS) (doi:10.1073/pnas.2001064117; joint work with Grzesiek Siudem, Basia Żogała-Siudem and Ania Cena), we consider the mechanisms behind one’s research success as measured by one’s papers’ citability. By acknowledging that the perceived esteem might be a consequence not only of how valuable one’s works are but also of pure luck, we arrived at a model that can accurately recreate a citation record based on just three parameters: the number of publications, the total number of citations, and the degree of randomness in the citation patterns. As a by-product, we show that a single index will never be able to embrace the complex reality of scientific impact. However, three of them can already provide us with a reliable summary.

Abstract. The growing popularity of bibliometric indexes (whose most famous example is the h index by J. E. Hirsch [J. E. Hirsch, Proc. Natl. Acad. Sci. U.S.A. 102, 16569–16572 (2005)]) is opposed by those claiming that one's scientific impact cannot be reduced to a single number. Some even believe that our complex reality fails to submit to any quantitative description. We argue that neither of the two controversial extremes is true. By assuming that some citations are distributed according to the rich get richer rule (success breeds success, preferential attachment) while some others are assigned totally at random (all in all, a paper needs a bibliography), we have crafted a model that accurately summarizes citation records with merely three easily interpretable parameters: productivity, total impact, and how lucky an author has been so far.
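To illustrate the rich-get-richer-plus-randomness idea from the abstract (this is not the paper's actual model or notation; the function and parameter names are invented for the sketch), here is a toy simulation in which each citation either follows preferential attachment or lands on a paper chosen uniformly at random:

```python
import random

def simulate_citations(n_papers, n_citations, rho, seed=0):
    """Toy citation-record generator: with probability rho a citation
    follows the rich-get-richer rule (chosen proportionally to citations
    already received); otherwise it goes to a uniformly random paper."""
    rng = random.Random(seed)
    cites = [0] * n_papers
    for _ in range(n_citations):
        if rng.random() < rho and sum(cites) > 0:
            # preferential attachment: success breeds success
            paper = rng.choices(range(n_papers), weights=cites)[0]
        else:
            # pure luck: all in all, a paper needs a bibliography
            paper = rng.randrange(n_papers)
        cites[paper] += 1
    return sorted(cites, reverse=True)
```

With `rho` close to 1 a handful of papers hoard most of the citations (a heavy-tailed record); with `rho = 0` the record is nearly flat. The triple (number of papers, total citations, `rho`) thus summarises qualitatively different citation records, echoing the three parameters of the paper's model.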


Benchmark Suite for Clustering Algorithms - Version 1

Let's aggregate, polish and standardise the existing clustering benchmark suites referred to across the machine learning and data mining literature! See our new Benchmark Suite for Clustering Algorithms.
2020-02-23 book draft

Lightweight Machine Learning Classics with R

A first draft of my new textbook Lightweight Machine Learning Classics with R is now available.

About. Explore some of the most fundamental algorithms which have stood the test of time and provide the basis for innovative solutions in data-driven AI. Learn how to use the R language for implementing various stages of data processing and modelling activities. Appreciate mathematics as the universal language for formalising data-intense problems and communicating their solutions. The book is for you if you're yet to be fluent in university-level linear algebra, calculus and probability theory, or you've forgotten all the maths you've ever learned, and are seeking a gentle, yet thorough, introduction to the topic.

2020-02-10 new paper

Genie+OWA: Robustifying Hierarchical Clustering with OWA-based Linkages

Check out our most recent paper (joint work with Anna Cena) on the best hierarchical clustering algorithm in the world – Genie. It will appear in Information Sciences; doi:10.1016/j.ins.2020.02.025.

Abstract. We investigate the application of the Ordered Weighted Averaging (OWA) data fusion operator in agglomerative hierarchical clustering. The examined setting generalises the well-known single, complete and average linkage schemes. It makes it possible to embody expert knowledge in the cluster merge process and provides a much wider range of possible linkages. We analyse various families of weighting functions on numerous benchmark data sets in order to assess their influence on the resulting cluster structure. Moreover, we inspect the correction for the inequality of the cluster size distribution – similar to the one in the Genie algorithm. Our results demonstrate that by robustifying the procedure with the Genie correction, we can obtain a significant performance boost in terms of clustering quality. This is particularly beneficial in the case of the linkages based on the closest distances between clusters, including the single linkage and its "smoothed" counterparts. To explain this behaviour, we propose a new linkage process called three-stage OWA which yields further improvements. This way we confirm the intuition that hierarchical cluster analysis should rather take into account a few nearest neighbours of each point, instead of trying to adapt to their non-local neighbourhood.
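To make the setting concrete, here is a minimal, hypothetical sketch (not the paper's implementation) of an OWA-based linkage: the inter-cluster distance is an OWA of all pairwise point distances, and the classic single, complete and average linkages fall out as special choices of the weighting function.

```python
def owa(values, weights):
    """Ordered Weighted Averaging: sort the values in non-increasing
    order, then take the weighted sum; the weights must sum to 1."""
    return sum(w * v for w, v in zip(weights, sorted(values, reverse=True)))

def owa_linkage(cluster_a, cluster_b, weight_fn):
    """Inter-cluster distance as an OWA of all pairwise point distances
    (1-D points here, for simplicity); weight_fn maps the number of
    pairs m to a weighting vector of length m."""
    dists = [abs(a - b) for a in cluster_a for b in cluster_b]
    return owa(dists, weight_fn(len(dists)))

# The classic linkages as special cases of the weighting function:
single   = lambda m: [0.0] * (m - 1) + [1.0]   # all weight on the minimum
complete = lambda m: [1.0] + [0.0] * (m - 1)   # all weight on the maximum
average  = lambda m: [1.0 / m] * m             # uniform weights
```

For clusters `[0, 1]` and `[3, 7]`, the pairwise distances are `{3, 7, 2, 6}`, so the single, complete and average linkages evaluate to 2, 7 and 4.5 respectively; intermediate weighting vectors interpolate between these extremes, which is exactly the wider range of linkages the abstract refers to.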

2019-12-11 new paper

DC optimization for constructing discrete Sugeno integrals and learning nonadditive measures

We (Gleb Beliakov, Simon James and I) have another paper accepted for publication – this time in the Optimization journal; doi:10.1080/02331934.2019.1705300.

Abstract. Defined solely by means of order-theoretic operations meet (min) and join (max), weighted lattice polynomial functions are particularly useful for modeling data on an ordinal scale. A special case, the discrete Sugeno integral, defined with respect to a nonadditive measure (a capacity), enables accounting for the interdependencies between input variables.

However, until recently, the problem of identifying the fuzzy measure values with respect to various objectives and requirements had not received a great deal of attention. By expressing the learning problem as the difference of convex functions, we are able to apply DC (difference of convex) optimization methods. Here we formulate one of the global optimization steps as a local linear programming problem and investigate the improvement under different conditions.
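For reference, the discrete Sugeno integral mentioned in the abstract is built solely from min and max, which is what makes it suitable for ordinal data. Below is a minimal sketch; the representation choices (a dict keyed by frozensets of input indices for the capacity) are mine, not the paper's.

```python
def sugeno(x, mu):
    """Discrete Sugeno integral of inputs x with respect to a capacity mu:
    a monotone set function with mu(empty set) = 0 and mu(all inputs) = 1,
    given here as a dict keyed by frozensets of input indices.

    Su(x) = max over i of min( x_(i), mu(A_(i)) ), where x_(1) <= ... <= x_(n)
    and A_(i) is the set of indices of the i-th through n-th smallest inputs.
    """
    order = sorted(range(len(x)), key=lambda i: x[i])  # ascending by value
    best = 0.0
    for k, i in enumerate(order):
        upper = frozenset(order[k:])  # indices whose value is >= x[i]
        best = max(best, min(x[i], mu[upper]))
    return best

# A toy capacity on two inputs; the interdependence between the inputs is
# encoded by how mu({0}) and mu({1}) relate to mu({0, 1}).
mu = {frozenset(): 0.0, frozenset({0}): 0.4,
      frozenset({1}): 0.6, frozenset({0, 1}): 1.0}
```

With this capacity, `sugeno([0.2, 0.8], mu)` gives 0.6 and `sugeno([0.9, 0.1], mu)` gives 0.4; on a constant input the integral is idempotent, e.g. `sugeno([1.0, 1.0], mu)` is 1.0. Learning the `mu` values from data, subject to monotonicity, is precisely the fitting problem the paper attacks with DC optimization.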


IEEE WCCI 2020 Special Session - Aggregation Structures: New Trends and Applications

Call for contributions – IEEE World Congress on Computational Intelligence (WCCI) 2020, Glasgow, Scotland – FUZZ-IEEE-6 Special Session on Aggregation Structures: New Trends and Applications; see the call for papers for more details.