News

2016-10-14 invited talk

Invited Talk @ European R Users Meeting 2016

Today I gave an invited talk (Genie: A new, fast, and outlier-resistant hierarchical clustering algorithm and its R interface) at the European R Users Meeting that is held in Poznań, Poland.
Abstract. The time needed to apply a hierarchical clustering algorithm is most often dominated by the number of computations of a pairwise dissimilarity measure. Such a constraint, for larger data sets, puts at a disadvantage the use of all the classical linkage criteria but the single linkage one. However, it is known that the single linkage clustering algorithm is very sensitive to outliers, produces highly skewed dendrograms, and therefore usually does not reflect the true underlying data structure - unless the clusters are well-separated.
To overcome its limitations, we proposed a new hierarchical clustering linkage criterion called genie. Namely, our algorithm links two clusters in such a way that a chosen economic inequity measure (e.g., the Gini or Bonferroni index) of the cluster sizes does not increase drastically above a given threshold.
Benchmarks indicate a high practical usefulness of the introduced method: it most often outperforms the Ward or average linkage in terms of the clustering quality while retaining the single linkage speed. The algorithm is easily parallelizable and thus may be run on multiple threads to speed up its execution further on. Its memory overhead is small: there is no need to precompute the complete distance matrix to perform the computations in order to obtain a desired clustering. In this talk we will discuss its reference implementation, included in the genie package for R.
2016-07-06 invited talk

Invited Plenary Talk @ ISAS 2016

Today I gave a plenary talk at the International Symposium on Aggregation and Structures – ISAS 2016, entitled Penalty-based fusion of complex data, computational aspects, and applications.
Abstract. Since the 1980s, studies of aggregation functions most often focus on the construction and formal analysis of diverse ways to summarize numerical lists with elements in some real interval. Quite recently, we also observe an increasing interest in aggregation of and aggregation on generic partially ordered sets.
However, in many practical applications, we have no natural ordering of given data items. Thus, in this talk we review various aggregation methods in spaces equipped merely with a semimetric (distance). These include the concept of such penalty minimizers as the centroid, 1-median, 1-center, medoid, and their generalizations -- all leading to idempotent fusion functions. Special emphasis is placed on procedures to summarize vectors in Rd for d ≥ 2 (e.g., rows in numeric data frames) as well as character strings (e.g., DNA sequences), but of course the list of other interesting domains could go on forever (rankings, graphs, images, time series, and so on).
We discuss some of their formal properties, exact or approximate (if the underlying optimization task is hard) algorithms to compute them and their applications in clustering and classification tasks.
2016-06-07 new paper

Hierarchical Clustering via Penalty-Based Aggregation and the Genie Approach

The following paper has been accepted for publication in Proceedings of MDAI 2016: Gagolewski M., Cena A., Bartoszuk M., Hierarchical Clustering via Penalty-Based Aggregation and the Genie Approach, Lecture Notes in Artificial Intelligence, Springer, 2016.
Abstract. The paper discusses a generalization of the nearest centroid hierarchical clustering algorithm. A first extension deals with the incorporation of generic distance-based penalty minimizers instead of the classical aggregation by means of centroids. Due to that the presented algorithm can be applied in spaces equipped with an arbitrary dissimilarity measure (images, DNA sequences, etc.). Secondly, a correction preventing the formation of clusters of too highly unbalanced sizes is applied: just like in the recently introduced Genie approach, which extends the single linkage scheme, the new method averts a chosen inequity measure (e.g., the Gini-, de Vergottini-, or Bonferroni-index) of cluster sizes from raising above a predefined threshold. Numerous benchmarks indicate that the introduction of such a correction increases the quality of the resulting clusterings.
2016-05-30 software

stringi 1.1.1 released

stringi is among the top 10 most downloaded R packages, providing various string processing facilities. A new release comes with a few bugfixes and new features.
* [BUGFIX] #214: allow a regex pattern like `.*`  to match an empty string.

* [BUGFIX] #210: `stri_replace_all_fixed(c("1", "NULL"), "NULL", NA)`
now results in `c("1", NA)`.

* [NEW FEATURE] #199: `stri_sub<-` now allows for ignoring `NA` locations
(a new `omit_na` argument added).

* [NEW FEATURE] #207: `stri_sub<-` now allows for substring insertions
(via `length=0`).

* [NEW FUNCTION] #124: `stri_subset<-` functions added.

* [NEW FEATURE] #216: `stri_detect`, `stri_subset`, `stri_subset<-` gained
a `negate` argument.

* [NEW FUNCTION] #175: `stri_join_list` concatenates all strings
in a list of character vectors. Useful with, e.g., `stri_extract_all_regex`,
`stri_extract_all_words` etc.
2016-05-09 new paper

Paper on the Genie Clustering Algorithm

The following paper has been accepted for publication in Information Sciences: Gagolewski M., Bartoszuk M., Cena A., Genie: A new, fast, and outlier-resistant hierarchical clustering algorithm, 2016. It describes the Genie algorithm available thru the genie package for R. The article has been assigned DOI of 10.1016/j.ins.2016.05.003.
Abstract. The time needed to apply a hierarchical clustering algorithm is most often dominated by the number of computations of a pairwise dissimilarity measure. Such a constraint, for larger data sets, puts at a disadvantage the use of all the classical linkage criteria but the single linkage one. However, it is known that the single linkage clustering algorithm is very sensitive to outliers, produces highly skewed dendrograms, and therefore usually does not reflect the true underlying data structure – unless the clusters are well-separated. To overcome its limitations, we propose a new hierarchical clustering linkage criterion called Genie. Namely, our algorithm links two clusters in such a way that a chosen economic inequity measure (e.g., the Gini- or Bonferroni-index) of the cluster sizes does not increase drastically above a given threshold. The presented benchmarks indicate a high practical usefulness of the introduced method: it most often outperforms the Ward or average linkage in terms of the clustering quality while retaining the single linkage speed. The Genie algorithm is easily parallelizable and thus may be run on multiple threads to speed up its execution further on. Its memory overhead is small: there is no need to precompute the complete distance matrix to perform the computations in order to obtain a desired clustering. It can be applied on arbitrary spaces equipped with a dissimilarity measure, e.g., on real vectors, DNA or protein sequences, images, rankings, informetric data, etc. A reference implementation of the algorithm has been included in the open source genie package for R.
2016-03-09 new paper

Proc. IPMU'2016: 3 Papers Accepted

Three papers which I co-author: Fitting aggregation functions to data: Part I – Linearization and regularization, Fitting aggregation functions to data: Part II – Idempotentization (co-authors: Maciej Bartoszuk, Gleb Beliakov, Simon James), and Fuzzy k-minpen clustering and k-nearest-minpen classification procedures incorporating generic distance-based penalty minimizers (co-author: Anna Cena) have been accepted for the IPMU 2016 conference.

1st paper:

Abstract. The use of supervised learning techniques for fitting weights and/or generator functions of weighted quasi-arithmetic means – a special class of idempotent and nondecreasing aggregation functions – to empirical data has already been considered in a number of papers. Nevertheless, there are still some important issues that have not been discussed in the literature yet. In the first part of this two-part contribution we deal with the concept of regularization, a quite standard technique from machine learning applied so as to increase the fit quality on test and validation data samples. Due to the constraints on the weighting vector, it turns out that quite different methods can be used in the current framework, as compared to regression models. Moreover, it is worth noting that so far fitting weighted quasi-arithmetic means to empirical data has only been performed approximately, via the so-called linearization technique. In this paper we consider exact solutions to such special optimization tasks and indicate cases where linearization leads to much worse solutions.

Keywords. Aggregation functions, weighted quasi-arithmetic means, least squares fitting, regularization, linearization

2nd paper:

Abstract. The use of supervised learning techniques for fitting weights and/or generator functions of weighted quasi-arithmetic means – a special class of idempotent and nondecreasing aggregation functions – to empirical data has already been considered in a number of papers. Nevertheless, there are still some important issues that have not been discussed in the literature yet. In the second part of this two-part contribution we deal with a quite common situation in which we have inputs coming from different sources, describing a similar phenomenon, but which have not been properly normalized. In such a case, idempotent and nondecreasing functions cannot be used to aggregate them unless proper pre-processing is performed. The proposed idempotization method, based on the notion of B-splines, allows for an automatic calibration of independent variables. The introduced technique is applied in an R source code plagiarism detection system.

Keywords. Aggregation functions, weighted quasi-arithmetic means, least squares fitting, idempotence

3rd paper:

Abstract. We discuss a generalization of the fuzzy (weighted) k-means clustering procedure and point out its relationships with data aggregation in spaces equipped with arbitrary dissimilarity measures. In the proposed setting, a data set partitioning is performed based on the notion of points' proximity to generic distance-based penalty minimizers. Moreover, a new data classification algorithm, resembling the k-nearest neighbors scheme but less computationally and memory demanding, is introduced. Rich examples in complex data domains indicate the usability of the methods and aggregation theory in general.

Keywords. fuzzy k-means algorithm, clustering, classification, fusion functions, penalty minimizers

2016-03-07 software

The genie Package for R

A New, Fast, and Outlier Resistant Hierarchical Clustering Algorithm called Genie is now available via the genie package for R (co-authors: Maciej Bartoszuk and Anna Cena). A detailed description of the algorithm will be available in a forthcoming paper of ours.
2015-12-31 new book

Data Fusion Book Now Available

My book Data Fusion: Theory, Methods, and Applications is now available (click me).
Data Fusion: Theory, Methods, and Applications - cover