2016-11-21 new book

**Programowanie w języku R (2nd Ed., revised and extended)**

The 2nd edition of my R Programming Book
is now available in Polish book stores.

2016-10-28

**EUSFLAT'17 Special Session: Algorithms for Data Aggregation and Fusion**

Call for contributions –
EUSFLAT 2017
*(10th Conference of the European Society for Fuzzy Logic and Technology, Warsaw, Poland)*
Special Session *Algorithms for Data Aggregation and Fusion*.

2016-10-27 new paper

**Penalty-Based and Other Representations of Economic Inequality**

My paper with Gleb Beliakov and Simon James,
entitled *Penalty-based and other representations of economic inequality*,
has been accepted for publication in the *International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems* today.

Abstract. Economic inequality measures are employed as a key component in various socio-demographic indices to capture the disparity between the wealthy and poor. Since their inception, they have also been used as a basis for modelling spread and disparity in other contexts. While recent research has identified that a number of classical inequality and welfare functions can be considered in the framework of OWA operators, here we propose a framework of penalty-based aggregation functions and their associated penalties as measures of inequality.
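
To illustrate the general idea only (this is a generic sketch, not the construction from the paper): a penalty-based aggregation function returns the value that minimises some penalty, and the penalty attained at that minimiser can be read as a measure of spread. The function names below are hypothetical. The sketch assumes two textbook penalties: the sum of squared deviations, minimised by the arithmetic mean, and the sum of absolute deviations, minimised by a median.

```python
from statistics import mean, median


def squared_penalty(xs):
    """Return (minimiser, minimal penalty) for P(x, y) = sum_i (x_i - y)^2."""
    y = mean(xs)  # the arithmetic mean minimises the sum of squared deviations
    return y, sum((x - y) ** 2 for x in xs)


def absolute_penalty(xs):
    """Return (minimiser, minimal penalty) for P(x, y) = sum_i |x_i - y|."""
    y = median(xs)  # a median minimises the sum of absolute deviations
    return y, sum(abs(x - y) for x in xs)


equal_incomes = [10, 10, 10, 10]
unequal_incomes = [1, 1, 1, 37]  # same total, concentrated in one unit

for incomes in (equal_incomes, unequal_incomes):
    print(incomes, squared_penalty(incomes), absolute_penalty(incomes))
```

Both penalties vanish when all incomes are equal and grow as the total becomes more concentrated, which is the behaviour one expects from a measure of disparity.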

2016-10-14 invited talk

**Invited Talk @ European R Users Meeting 2016**

Today I gave an invited talk
(*Genie: A new, fast, and outlier-resistant hierarchical clustering algorithm and its R interface*)
at the European R Users Meeting, held in Poznań, Poland.

Abstract. The time needed to apply a hierarchical clustering algorithm is most often dominated by the number of computations of a pairwise dissimilarity measure. Such a constraint, for larger data sets, puts at a disadvantage the use of all the classical linkage criteria but the single linkage one. However, it is known that the single linkage clustering algorithm is very sensitive to outliers, produces highly skewed dendrograms, and therefore usually does not reflect the true underlying data structure – unless the clusters are well-separated.

To overcome its limitations, we proposed a new hierarchical clustering linkage criterion called Genie. Namely, our algorithm links two clusters in such a way that a chosen economic inequity measure (e.g., the Gini or Bonferroni index) of the cluster sizes does not increase drastically above a given threshold.

Benchmarks indicate a high practical usefulness of the introduced method: it most often outperforms the Ward or average linkage in terms of the clustering quality while retaining the single linkage speed. The algorithm is easily parallelizable and thus may be run on multiple threads to speed up its execution further on. Its memory overhead is small: there is no need to precompute the complete distance matrix to perform the computations in order to obtain a desired clustering. In this talk we will discuss its reference implementation, included in the genie package for R.
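
The inequity measure of cluster sizes mentioned in the abstract can be sketched as follows. This assumes one common normalised form of the Gini index (the sum of pairwise absolute differences divided by (n - 1) times the total); the function name and the threshold value of 0.3 are illustrative choices, not taken from the talk.

```python
from itertools import combinations


def gini_index(sizes):
    """Normalised Gini index of positive cluster sizes: 0 means perfectly balanced."""
    n = len(sizes)
    if n < 2:
        return 0.0
    total_abs_diff = sum(abs(a - b) for a, b in combinations(sizes, 2))
    return total_abs_diff / ((n - 1) * sum(sizes))


balanced = [25, 25, 25, 25]
skewed = [1, 1, 1, 97]  # a few tiny clusters next to one huge cluster

threshold = 0.3  # illustrative value (assumption)
for sizes in (balanced, skewed):
    g = gini_index(sizes)
    status = "above threshold" if g > threshold else "within threshold"
    print(sizes, round(g, 3), status)
```

A partition with a few singletons and one dominant cluster, which is what single linkage tends to produce in the presence of outliers, scores close to 1, whereas equally sized clusters score 0.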

2016-07-06 invited talk

**Invited Plenary Talk @ ISAS 2016**

Today I gave a plenary talk at the International Symposium on Aggregation and Structures – ISAS 2016, entitled
*Penalty-based fusion of complex data, computational aspects, and applications*.

Abstract. Since the 1980s, studies of aggregation functions most often focus on the construction and formal analysis of diverse ways to summarize numerical lists with elements in some real interval. Quite recently, we also observe an increasing interest in aggregation *of* and aggregation *on* generic partially ordered sets.

However, in many practical applications, we have no natural ordering of given data items. Thus, in this talk we review various aggregation methods in spaces equipped merely with a semimetric (distance). These include the concept of such penalty minimizers as the centroid, 1-median, 1-center, medoid, and their generalizations – all leading to idempotent fusion functions. Special emphasis is placed on procedures to summarize vectors in R^{d} for d ≥ 2 (e.g., rows in numeric data frames) as well as character strings (e.g., DNA sequences), but of course the list of other interesting domains could go on forever (rankings, graphs, images, time series, and so on).

We discuss some of their formal properties, exact or approximate (if the underlying optimization task is hard) algorithms to compute them and their applications in clustering and classification tasks.
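
Two of the penalty minimizers named above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the medoid is computed by brute force (quadratic in the number of items), the Hamming distance stands in for an arbitrary semimetric on strings, and all function names and the toy data are hypothetical.

```python
def medoid(items, dist):
    """Return the element of `items` minimising the sum of distances to all items."""
    return min(items, key=lambda c: sum(dist(c, x) for x in items))


def centroid(points):
    """Componentwise mean; it minimises the sum of squared Euclidean distances."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(ch1 != ch2 for ch1, ch2 in zip(a, b))


dna = ["ACGT", "ACGA", "TCGA", "ACCA"]
print(medoid(dna, hamming))            # the string closest to all the others overall
print(centroid([[0, 0], [2, 4]]))      # the mean of two points in the plane
print(medoid(["ACGT"] * 3, hamming))   # idempotence: identical inputs are returned unchanged
```

The medoid needs nothing but a distance function, so the same code applies to rankings, graphs, or any other objects for which `dist` can be evaluated; the centroid additionally requires a vector space structure.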

2016-06-07 new paper

**Hierarchical Clustering via Penalty-Based Aggregation and the Genie Approach**

The following paper has been accepted for publication in Proceedings of MDAI 2016:
**Gagolewski M.**, Cena A., Bartoszuk M.,
Hierarchical Clustering via Penalty-Based Aggregation and the Genie Approach,
*Lecture Notes in Artificial Intelligence*, Springer, 2016.

Abstract. The paper discusses a generalization of the nearest centroid hierarchical clustering algorithm. A first extension deals with the incorporation of generic distance-based penalty minimizers instead of the classical aggregation by means of centroids. Due to that the presented algorithm can be applied in spaces equipped with an arbitrary dissimilarity measure (images, DNA sequences, etc.). Secondly, a correction preventing the formation of clusters of too highly unbalanced sizes is applied: just like in the recently introduced *Genie* approach, which extends the single linkage scheme, the new method averts a chosen inequity measure (e.g., the Gini-, de Vergottini-, or Bonferroni-index) of cluster sizes from raising above a predefined threshold. Numerous benchmarks indicate that the introduction of such a correction increases the quality of the resulting clusterings.

2016-05-30 software

`stringi` is among the top 10 most downloaded `R` packages, providing various string
processing facilities. A new release comes with a few bugfixes
and new features.
* [BUGFIX] #214: allow a regex pattern like `.*` to match an empty string.
* [BUGFIX] #210: `stri_replace_all_fixed(c("1", "NULL"), "NULL", NA)` now results in `c("1", NA)`.
* [NEW FEATURE] #199: `stri_sub<-` now allows for ignoring `NA` locations (a new `omit_na` argument added).
* [NEW FEATURE] #207: `stri_sub<-` now allows for substring insertions (via `length=0`).
* [NEW FUNCTION] #124: `stri_subset<-` functions added.
* [NEW FEATURE] #216: `stri_detect`, `stri_subset`, `stri_subset<-` gained a `negate` argument.
* [NEW FUNCTION] #175: `stri_join_list` concatenates all strings in a list of character vectors. Useful with, e.g., `stri_extract_all_regex`, `stri_extract_all_words` etc.

2016-05-09 new paper

**Paper on the Genie Clustering Algorithm**

The following paper has been accepted for publication in *Information Sciences*:
**Gagolewski M.**, Bartoszuk M., Cena A.,
Genie: A new, fast, and outlier-resistant hierarchical clustering algorithm, 2016.
It describes the *Genie* algorithm available through the `genie` package for `R`. The article has been assigned the DOI 10.1016/j.ins.2016.05.003.

Abstract. The time needed to apply a hierarchical clustering algorithm is most often dominated by the number of computations of a pairwise dissimilarity measure. Such a constraint, for larger data sets, puts at a disadvantage the use of all the classical linkage criteria but the single linkage one. However, it is known that the single linkage clustering algorithm is very sensitive to outliers, produces highly skewed dendrograms, and therefore usually does not reflect the true underlying data structure – unless the clusters are well-separated. To overcome its limitations, we propose a new hierarchical clustering linkage criterion called Genie. Namely, our algorithm links two clusters in such a way that a chosen economic inequity measure (e.g., the Gini- or Bonferroni-index) of the cluster sizes does not increase drastically above a given threshold. The presented benchmarks indicate a high practical usefulness of the introduced method: it most often outperforms the Ward or average linkage in terms of the clustering quality while retaining the single linkage speed. The Genie algorithm is easily parallelizable and thus may be run on multiple threads to speed up its execution further on. Its memory overhead is small: there is no need to precompute the complete distance matrix to perform the computations in order to obtain a desired clustering. It can be applied on arbitrary spaces equipped with a dissimilarity measure, e.g., on real vectors, DNA or protein sequences, images, rankings, informetric data, etc. A reference implementation of the algorithm has been included in the open source genie package for R.
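
The linkage rule described in the abstract can be sketched as a naive agglomerative loop. This is a simplified illustration, not the reference implementation: it assumes that, whenever the Gini index of the cluster sizes exceeds the threshold, candidate merges are restricted to pairs in which at least one cluster has the smallest current size, and it recomputes single-linkage distances by brute force instead of using the fast, memory-efficient scheme of the paper. The names `gini` and `genie_sketch`, the default threshold, and the toy data are all illustrative assumptions.

```python
from itertools import combinations


def gini(sizes):
    """Normalised Gini index of cluster sizes (0 means perfectly balanced)."""
    n = len(sizes)
    if n < 2:
        return 0.0
    return sum(abs(a - b) for a, b in combinations(sizes, 2)) / ((n - 1) * sum(sizes))


def genie_sketch(points, dist, k, threshold=0.3):
    """Merge clusters until k remain; returns lists of indices into `points`."""
    if k < 1:
        raise ValueError("k must be at least 1")
    clusters = [[i] for i in range(len(points))]

    def single_link(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > k:
        sizes = [len(c) for c in clusters]
        pairs = list(combinations(range(len(clusters)), 2))
        if gini(sizes) > threshold:
            # Sizes too unequal: only merges involving a smallest cluster are allowed.
            smallest = min(sizes)
            pairs = [(a, b) for a, b in pairs if smallest in (sizes[a], sizes[b])]
        a, b = min(pairs, key=lambda p: single_link(clusters[p[0]], clusters[p[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so position a is unaffected by the deletion
    return clusters


data = [0, 1, 3, 20, 23, 27]
print(genie_sketch(data, lambda x, y: abs(x - y), k=2))
```

Only a dissimilarity function `dist` is required, so the same loop runs on strings or rankings as well as on numbers. The brute-force distance recomputation keeps the sketch short at the cost of the speed that the abstract attributes to the actual algorithm.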