2020-07-08 new paper

Paper on SimilaR in R Journal

SimilaR: R Code Clone and Plagiarism Detection by Maciej Bartoszuk and me has been accepted for publication in the R Journal.

Abstract. Third-party software for assuring source code quality is becoming increasingly popular. Tools that evaluate the coverage of unit tests, perform static code analysis, or inspect run-time memory use are crucial in the software development life cycle. More sophisticated methods allow for performing meta-analyses of large software repositories, e.g., to discover abstract topics they relate to or common design patterns applied by their developers. They may be useful in gaining a better understanding of the component interdependencies, avoiding cloned code as well as detecting plagiarism in programming classes.

A meaningful measure of similarity of computer programs often forms the basis of such tools. While there are a few noteworthy instruments for similarity assessment, none of them turns out particularly suitable for analysing R code chunks. Existing solutions rely on rather simple techniques and heuristics and fail to provide users with the kind of sensitivity and specificity required for working with R scripts. In order to fill this gap, we propose a new algorithm based on a Program Dependence Graph, implemented in the SimilaR package. It can serve as a tool not only for improving R code quality but also for detecting plagiarism, even when it has been masked by applying some obfuscation techniques or inserting dead code. We demonstrate its accuracy and efficiency in a real-world case study.

2020-06-08 new paper

Paper in PNAS: Three Dimensions of Scientific Impact

In a paper recently published in the Proceedings of the National Academy of Sciences of the United States of America (PNAS) (doi:10.1073/pnas.2001064117; joint work with Grzesiek Siudem, Basia Żogała-Siudem and Ania Cena), we consider the mechanisms behind one's research success as measured by one's papers' citability. By acknowledging that perceived esteem might be a consequence not only of how valuable one's works are but also of pure luck, we arrived at a model that can accurately recreate a citation record based on just three parameters: the number of publications, the total number of citations, and the degree of randomness in the citation patterns. As a by-product, we show that a single index will never be able to embrace the complex reality of scientific impact. However, three of them can already provide us with a reliable summary.

Abstract. The growing popularity of bibliometric indexes (whose most famous example is the h index by J. E. Hirsch [J. E. Hirsch, Proc. Natl. Acad. Sci. U.S.A. 102, 16569–16572 (2005)]) is opposed by those claiming that one's scientific impact cannot be reduced to a single number. Some even believe that our complex reality fails to submit to any quantitative description. We argue that neither of the two controversial extremes is true. By assuming that some citations are distributed according to the rich get richer rule (success breeds success, preferential attachment) while some others are assigned totally at random (all in all, a paper needs a bibliography), we have crafted a model that accurately summarizes citation records with merely three easily interpretable parameters: productivity, total impact, and how lucky an author has been so far.
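The rich-get-richer-plus-randomness mechanism described in the abstract is easy to simulate. Below is a toy Python sketch of such a process; the function and parameter names (and the particular value of `rho`, the preferential-attachment probability) are mine, chosen for illustration, and this is not the paper's actual model code:

```python
import random

def simulate_citations(n_papers, n_citations, rho, seed=0):
    """Distribute n_citations among n_papers: with probability rho a
    citation follows preferential attachment (rich get richer), and
    otherwise it lands on a uniformly random paper."""
    rng = random.Random(seed)
    cites = [0] * n_papers
    pool = []  # one entry per citation received; sampling from it is preferential
    for _ in range(n_citations):
        if pool and rng.random() < rho:
            target = rng.choice(pool)         # proportional to current citations
        else:
            target = rng.randrange(n_papers)  # purely at random
        cites[target] += 1
        pool.append(target)
    return sorted(cites, reverse=True)

# a citation record of an author with 50 papers, 1000 citations,
# and a fairly high degree of preferentiality
record = simulate_citations(n_papers=50, n_citations=1000, rho=0.8)
```

With `rho` close to 1 the simulated record becomes strongly skewed (a few highly cited papers), while `rho` close to 0 yields a nearly uniform one, which is the intuition behind recovering the "degree of randomness" as the model's third parameter.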


Benchmark Suite for Clustering Algorithms - Version 1

Let's aggregate, polish and standardise the existing clustering benchmark suites referred to across the machine learning and data mining literature! See our new Benchmark Suite for Clustering Algorithms.

2020-02-23 book draft

Lightweight Machine Learning Classics with R

A first draft of my new textbook Lightweight Machine Learning Classics with R is now available.

About. Explore some of the most fundamental algorithms which have stood the test of time and provide the basis for innovative solutions in data-driven AI. Learn how to use the R language for implementing various stages of data processing and modelling activities. Appreciate mathematics as the universal language for formalising data-intense problems and communicating their solutions. The book is for you if you're yet to be fluent with university-level linear algebra, calculus and probability theory or you've forgotten all the maths you've ever learned, and are seeking a gentle, yet thorough, introduction to the topic.

2020-02-10 new paper

Genie+OWA: Robustifying Hierarchical Clustering with OWA-based Linkages

Check out our most recent paper (joint work with Anna Cena) on the best hierarchical clustering algorithm in the world – Genie. It is to appear in Information Sciences; doi:10.1016/j.ins.2020.02.025.

Abstract. We investigate the application of the Ordered Weighted Averaging (OWA) data fusion operator in agglomerative hierarchical clustering. The examined setting generalises the well-known single, complete and average linkage schemes. It allows us to embody expert knowledge in the cluster merge process and provides a much wider range of possible linkages. We analyse various families of weighting functions on numerous benchmark data sets in order to assess their influence on the resulting cluster structure. Moreover, we inspect the correction for the inequality of cluster size distribution -- similar to the one in the Genie algorithm. Our results demonstrate that by robustifying the procedure with the Genie correction, we can obtain a significant performance boost in terms of clustering quality. This is particularly beneficial in the case of the linkages based on the closest distances between clusters, including the single linkage and its "smoothed" counterparts. To explain this behaviour, we propose a new linkage process called three-stage OWA which yields further improvements. This way we confirm the intuition that hierarchical cluster analysis should rather take into account a few nearest neighbours of each point, instead of trying to adapt to their non-local neighbourhood.
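To make the generalisation concrete: an OWA operator applies its weights to the sorted arguments rather than to particular inputs, so single, complete and average linkage each correspond to one weighting vector applied to the pairwise inter-cluster distances. The following Python sketch is purely illustrative (it is not the authors' implementation, and names like `owa_linkage` are mine):

```python
import numpy as np

def owa(values, weights):
    """Ordered Weighted Averaging: weights are matched with the values
    sorted in nonincreasing order, not with particular inputs."""
    v = np.sort(np.asarray(values, dtype=float))[::-1]
    w = np.asarray(weights, dtype=float)
    assert v.shape == w.shape and np.isclose(w.sum(), 1.0)
    return float(v @ w)

def owa_linkage(cluster_a, cluster_b, weights):
    """Inter-cluster distance as an OWA aggregation of all pairwise
    Euclidean distances between points of the two clusters."""
    d = [float(np.linalg.norm(x - y)) for x in cluster_a for y in cluster_b]
    return owa(d, weights)

a = np.array([[0.0, 0.0], [0.0, 1.0]])
b = np.array([[3.0, 0.0], [4.0, 0.0]])
n = len(a) * len(b)
w_single   = np.eye(n)[-1]        # all weight on the smallest distance
w_complete = np.eye(n)[0]         # all weight on the largest distance
w_average  = np.full(n, 1.0 / n)  # uniform weights: average linkage
```

Intermediate weighting vectors interpolate between these classical schemes, which is the "much wider range of possible linkages" the abstract refers to.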

2019-12-11 new paper

DC optimization for constructing discrete Sugeno integrals and learning nonadditive measures

We (Gleb Beliakov, Simon James and I) have another paper accepted for publication – this time in the Optimization journal; doi:10.1080/02331934.2019.1705300.

Abstract. Defined solely by means of the order-theoretic operations meet (min) and join (max), weighted lattice polynomial functions are particularly useful for modeling data on an ordinal scale. A special case, the discrete Sugeno integral, defined with respect to a nonadditive measure (a capacity), enables accounting for the interdependencies between input variables.

However, until recently, the problem of identifying the fuzzy measure values with respect to various objectives and requirements had not received a great deal of attention. By expressing the learning problem as the difference of convex functions, we are able to apply DC (difference of convex) optimization methods. Here we formulate one of the global optimization steps as a local linear programming problem and investigate the improvement under different conditions.
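For reference, the discrete Sugeno integral mentioned in the abstract can be computed using only min and max: sort the inputs, then take S = max_i min(x_(i), mu(A_(i))), where A_(i) collects the indices of the largest remaining inputs. A small Python illustration (the symmetric toy capacity below is mine, chosen only for demonstration):

```python
from itertools import combinations

def sugeno_integral(x, mu):
    """Discrete Sugeno integral of inputs x w.r.t. a capacity mu, given
    as a dict mapping frozensets of indices to values in [0, 1].
    Uses only min/max: S = max_i min(x_(i), mu(A_(i)))."""
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i])  # indices, ascending by value
    best = 0.0
    for k, i in enumerate(order):
        a_k = frozenset(order[k:])  # indices holding the n-k largest inputs
        best = max(best, min(x[i], mu[a_k]))
    return best

# a toy symmetric capacity on 3 criteria: mu(A) = (|A| / 3) ** 2
mu = {frozenset(s): (len(s) / 3) ** 2
      for r in range(4) for s in combinations(range(3), r)}
```

For a symmetric capacity like this one, the integral depends only on the sorted inputs; a genuinely nonadditive `mu` is what lets the integral model interdependencies between particular input variables.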


IEEE WCCI 2020 Special Session - Aggregation Structures: New Trends and Applications

Call for contributions – IEEE World Congress on Computational Intelligence (WCCI) 2020, Glasgow, Scotland — FUZZ-IEEE-6 Special Session on Aggregation Structures: New Trends and Applications; for more details, click here.

2019-11-14 new paper

Robust fitting for the Sugeno integral with respect to general fuzzy measures

The editor of Information Sciences has just let us know that a paper by Gleb Beliakov, Simon James and me will be published in this outlet.

Abstract. The Sugeno integral is an expressive aggregation function with potential applications across a range of decision contexts. Its calculation requires only the lattice minimum and maximum operations, making it particularly suited to ordinal data and robust to scale transformations. However, for practical use in data analysis and prediction, we require efficient methods for learning the associated fuzzy measure. While such methods are well developed for the Choquet integral, the fitting problem is more difficult for the Sugeno integral because it is not amenable to being expressed as a linear combination of weights, and more generally due to plateaus and non-differentiability in the objective function. Previous research has hence focused on heuristic approaches or simplified fuzzy measures. Here we show that the problem of fitting the Sugeno integral to data such that the maximum absolute error is minimized can be solved using an efficient bilevel program. This method can be incorporated into algorithms that learn fuzzy measures with the aim of minimizing the median residual. This equips us with tools that make the Sugeno integral a feasible option in robust data regression and analysis. We provide experimental comparison with a genetic algorithms approach and an example in data analysis.
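The fitting criterion itself – minimise the maximum absolute residual – is easy to state in code. The sketch below brute-forces it over symmetric capacities on a small grid; this is a didactic simplification of the problem (my own, with my own names), not the paper's bilevel programming method, which handles general fuzzy measures efficiently:

```python
import itertools

def sugeno_symmetric(x, w):
    """Sugeno integral w.r.t. a symmetric capacity: w[k] is the measure
    of any set of cardinality k (w[0] = 0 <= w[1] <= ... <= w[n] = 1)."""
    xs = sorted(x)  # ascending
    n = len(xs)
    return max(min(xs[k], w[n - k]) for k in range(n))

def fit_max_error(data, grid):
    """Brute-force search over nondecreasing weight vectors on a grid,
    minimizing the maximum absolute residual (a Chebyshev fit)."""
    n = len(data[0][0])
    best_w, best_err = None, float("inf")
    for levels in itertools.combinations_with_replacement(grid, n - 1):
        w = (0.0,) + levels + (1.0,)
        if any(w[i] > w[i + 1] for i in range(len(w) - 1)):
            continue  # capacities must be monotone
        err = max(abs(sugeno_symmetric(x, y_x := x and w) - y) if False else
                  abs(sugeno_symmetric(x, w) - y) for x, y in data)
        if err < best_err:
            best_w, best_err = w, err
    return best_w, best_err

# toy data generated by the capacity (0, 0.5, 1) on two inputs
data = [([0.2, 0.8], 0.5), ([0.1, 0.3], 0.3), ([0.7, 0.9], 0.7)]
w_hat, err = fit_max_error(data, [i / 4 for i in range(5)])
```

Minimising the maximum (or, as in the paper, the median) of the residuals instead of their sum of squares is what makes the fit robust to outlying observations.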