2021-08-26 new paper

T-norms or t-conorms? How to aggregate similarity degrees for plagiarism detection

A new paper by Maciek Bartoszuk and me is to appear in Knowledge-Based Systems (doi:10.1016/j.knosys.2021.107427).

Abstract. Making correct decisions as to whether code chunks should be considered similar becomes increasingly important in software design and education and not only can improve the quality of computer programs, but also help assure the integrity of student assessments. In this paper we test numerous source code similarity detection tools on pairs of code fragments written in the data science-oriented functional programming language R. Contrary to mainstream approaches, instead of considering symmetric measures of “how much code chunks A and B are similar to each other”, we propose and study the nonsymmetric degrees of inclusion “to what extent A is a subset of B” and “to what degree B is included in A”. Overall, t-norms yield better precision (how many suspicious pairs are actually similar), t-conorms maximise recall (how many similar pairs are successfully retrieved), and custom aggregation functions fitted to training data provide a good balance between the two. Also, we find that program dependence graph-based methods tend to outperform those relying on normalised source code text, tokens, and names of functions invoked.
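
To make the precision/recall trade-off above more tangible, here is a minimal Python sketch of how two nonsymmetric inclusion degrees could be aggregated. The function names, the example degrees, and the 0.5 threshold are my own illustrative assumptions; this is not the code used in the paper.

```python
# Two classic t-norms (conjunction-like) and their dual t-conorms
# (disjunction-like), all acting on degrees in [0, 1].
def t_norm_min(a, b):
    return min(a, b)

def t_norm_product(a, b):
    return a * b

def t_conorm_max(a, b):
    return max(a, b)

def t_conorm_prob_sum(a, b):
    return a + b - a * b

def is_suspicious(a_in_b, b_in_a, aggregate, threshold=0.5):
    """Flag a pair when the aggregated inclusion degree reaches the threshold.

    The threshold value is hypothetical, chosen only for illustration.
    """
    return aggregate(a_in_b, b_in_a) >= threshold

# Hypothetical degrees: A is almost entirely contained in B, but not vice versa.
a_in_b, b_in_a = 0.9, 0.4

for f in (t_norm_min, t_norm_product, t_conorm_max, t_conorm_prob_sum):
    print(f.__name__, round(f(a_in_b, b_in_a), 2), is_suspicious(a_in_b, b_in_a, f))
```

A t-norm never exceeds the smaller of the two degrees, so it flags a pair only when both inclusions are strong (fewer but more reliable hits), whereas a t-conorm is at least as large as the greater degree, so one strong inclusion suffices (more pairs retrieved).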

2021-07-29 software

stringx: Drop-in replacements for base R string functions powered by stringi

English is the native language for only 5% of the World population. Also, only 17% of us can understand this text. Moreover, the Latin alphabet is the main one for merely 36% of the total. The early computer era, now a very long time ago, was dominated by the US. Due to the proliferation of the internet, smartphones, social media, and other technologies and communication platforms, this is no longer the case. The stringx package replaces base R string functions (such as grep(), tolower(), and sprintf()) with ones that fully support the Unicode standards related to natural language processing, fixes some long-standing inconsistencies, and introduces some new, useful features. Thanks to ICU (International Components for Unicode) and stringi, they are fast, reliable, and portable across different platforms. Now available from CRAN.
2021-07-14 software

stringi 1.7.2

Another major update of stringi brings a rewritten version of stri_sprintf, support for custom rule-based transliteration, extraction of named regex capture groups, and many other enhancements.

Changes since v1.6.2:

* [BACKWARD INCOMPATIBILITY] `%s$%` and `%stri$%` now use the new `stri_sprintf`
(see below) function instead of `base::sprintf`.

* [BACKWARD INCOMPATIBILITY, NEW FEATURE] In `stri_sub<-` and `stri_sub_all<-`,
providing a negative `length` from now on does not result in the corresponding
input string being altered.

* [BACKWARD INCOMPATIBILITY, NEW FEATURE] In `stri_sub` and `stri_sub_all`,
negative `length` results in the corresponding output being `NA`
or not extracted at all, depending on the setting of the new argument

and their replacement versions, `pattern` and `value` cannot be longer
than `str` (but now they are recycled if necessary).

* [BACKWARD INCOMPATIBILITY, NEW FEATURE] `stri_sub*` now accept the
`from` argument being a matrix like `cbind(from, length=length)`.
Unnamed columns or any other names are still interpreted as `cbind(from, to)`.
Also, the new argument `use_matrix` can be used to disable
the special treatment of such matrices.

* [DOCUMENTATION] It has been clarified that the syntax of `*_charclass`
(e.g., used in `stri_trim*`) differs slightly from regex character

* [NEW FEATURE] #420: `stri_sprintf` (alias: `stri_string_format`)
is a Unicode-aware replacement for and enhancement of the base `sprintf`:
it adds a customised handling of `NA`s (on demand), computing field size
based on code point width, outputting substrings of at most given width,
variable width and precision (both at the same time), etc. Moreover,
`stri_printf` can be used to display formatted strings conveniently.

* [NEW FEATURE] #153: `stri_match_*_regex` now extract capture group names.

* [NEW FEATURE] #25: `stri_locate_*_regex` now have a new argument,
`capture_groups`, which allows for extracting positions of matches
to parenthesised subexpressions.

* [NEW FEATURE] `stri_locate_*` now have a new argument, `get_length`,
whose setting may result in generating *from-length* matrices
(instead of *from-to* ones).

* [NEW FEATURE] #438: `stri_trans_general` now supports rule-based
as well as reverse-direction transliteration.

* [NEW FEATURE] #434: `stri_datetime_format` and `stri_datetime_parse`
are now vectorised also with respect to the `format` argument.

* [NEW FEATURE] `stri_datetime_fstr` has a new argument, `ignore_special`,
which defaults to `TRUE` for backward compatibility.

* [NEW FEATURE] `stri_datetime_format`, `stri_datetime_add`, and
`stri_datetime_fields` now call `as.POSIXct` more eagerly.

* [NEW FEATURE] `stri_trim*` now have a new argument, `negate`.

* [NEW FEATURE] `stri_replace_rstr` converts `gsub`-style replacement strings
to `stri_replace`-style.

* [INTERNAL] `stri_prepare_arg*` have been refactored; buffer overruns
in the exception handling subsystem are now avoided.

* [BUGFIX] A few functions (`stri_length`, `stri_enc_toutf32`, etc.)
did not throw an exception on an invalid UTF-8
byte sequence (and merely issued a warning instead).

* [BUGFIX] `stri_datetime_fstr` did not honour `NA_character_`
and did not parse format strings such as `"%Y%m%d"` correctly.
It has now been completely rewritten (in C).

* [BUGFIX] `stri_wrap` did not recognise the width of certain Unicode sequences
2021-06-17 software

realtest 0.2.1 on CRAN

An update to realtest is now available.

Changes since v0.1.2:

* [NEW FEATURE] `sides_comparer` is now solely responsible for
defining the semantics of side effect prototypes, therefore
`P` performs only a few non-invasive sanity checks of its arguments.

* [BACKWARD INCOMPATIBILITY] Example comparer `identical_or_TRUE`
is no longer available.

* [BACKWARD INCOMPATIBILITY] `maps_identical_or_TRUE` has been renamed
`sides_similar` and now allows for ignoring the side effects
indicated by the user.

* [BUGFIX] `summary.realtest_results` no longer tries to subset symbols.
2021-06-04 software

realtest: When Expectations Meet Reality: Realistic Unit Testing in R

realtest is a framework for unit testing for realistic minimalists, where we distinguish between expected, acceptable, current, fallback, ideal, or regressive behaviour. It can also be used for monitoring other software projects for changes. Now available on CRAN.
2021-05-27 new paper

Paper on the genieclust Python+R package

genieclust: Fast and robust hierarchical clustering was accepted for publication in SoftwareX (doi:10.1016/j.softx.2021.100722).

Abstract. genieclust is an open source Python and R package that implements the hierarchical clustering algorithm called Genie. This method frequently outperforms other state-of-the-art approaches in terms of clustering quality and speed, supports various distances over dense, sparse, and string data domains, and can be robustified even further with the built-in noise point detector. As domain-independent software, it can be used for solving problems arising in all data-driven research and development activities, including environmental, health, biological, physical, decision, and social sciences as well as technology and engineering. The Python version provides a scikit-learn-compliant API, whereas the R variant is compatible with the classic hclust(). Numerous tutorials, use cases, non-trivial examples, documentation, installation instructions, benchmark results and timings can be found at

2021-05-17 software

stringi 1.6.2

stringi is now shipped with ICU4C 69.1 which supports Unicode 13.0 and CLDR 39.

Changes since v1.5.3:

* [GENERAL] #401: stringi is now bundled with ICU4C 69.1 (upgraded from 61.1),
which is used on most Windows and OS X builds as well as on *nix systems
not equipped with system ICU. However, if the C++11 support is disabled,
stringi will be built against the battle-tested ICU4C 55.1.
The update to ICU brings Unicode 13.0 and CLDR 39 support.

* [BACKWARD INCOMPATIBILITY] In `stri_enc_list()`,
`simplify` now defaults to `TRUE`.

* [DOCUMENTATION] A draft version of a paper on `stringi` is now available at

* [GENERAL] stringi now requires R >= 3.1 (`CXX_STD` of `CXX11` or `CXX1X`).

* [NEW FEATURE] #408: `stri_trans_casefold()` performs case folding;
this is different from case mapping, which is locale-dependent.
Folding makes two pieces of text that differ only in case identical.
This can come in handy when comparing strings.

* [NEW FEATURE] #421: `stri_rank()` ranks strings in a character vector
(e.g., for ordering data frames with regard to multiple criteria,
the ranks can be passed to `order()`, see #219).

* [NEW FEATURE] #266: `stri_width()` now supports emojis.

* [NEW FEATURE] `%s$%` and `%stri$%` are now vectorised with respect to
both arguments.

* [NEW FEATURE] #425: The outputs of `stri_enc_list()`, `stri_locale_list()`,
`stri_timezone_list()`, and `stri_trans_list()` are now sorted.

* [NEW FEATURE] #428: In `stri_flatten`, `na_empty=NA` now omits missing values.

* [BUILD TIME] #431: Pre-4.9.0 GCC has `::max_align_t`,
but not `std::max_align_t`, added a (possible) workaround, see the INSTALL

* [BUGFIX] `stri_sort_key()` now outputs `bytes`-encoded strings.

* [BUGFIX] #415: `locale=''` was not equivalent to `locale=NULL`
in `stri_opts_collator()`.

* [BUGFIX] #354: `ALTREP` `CHARSXP`s were not copied, and thus could have been
garbage collected in the so-called meanwhile (with thanks to @jimhester).

* [INTERNAL] #414: Use `LEVELS(x)` macro instead of accessing `(x)->`
directly (@lukaszdaniel).
2021-04-22 software

genieclust 1.0.0

A maintenance release of the Python and R package genieclust for fast and robust hierarchical clustering with noise point detection is now available on PyPI and CRAN.
2021-02-27 new paper

On the aggregation of compositional data

Raul Pérez-Fernández, Bernard De Baets and I have a new paper accepted for publication in Information Fusion; abstract in the sequel.

Abstract. Compositional data naturally appear in many fields of application. For instance, in chemistry, the relative contributions of different chemical substances to a product are typically described in terms of a compositional data vector. Although the aggregation of compositional data frequently arises in practice, the functions formalizing this process do not fit the standard order-based aggregation framework. This is due to the fact that there is no intuitive order that carries the semantics of the set of compositional data vectors (referred to as the standard simplex). In this paper, we consider the more general betweenness-based aggregation framework that yields a natural definition of an aggregation function for compositional data. The weighted centroid is proved to fit within this definition and discussed to be linked to a very tangible interpretation. Other functions for the aggregation of compositional data are presented and their fit within the proposed definition is discussed.
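
As a small illustration of the weighted centroid mentioned in the abstract, the following Python sketch averages compositional vectors component-wise. The function name and the example data are hypothetical and are not taken from the paper.

```python
def weighted_centroid(compositions, weights):
    """Component-wise weighted arithmetic mean of compositional vectors.

    Each composition is a list of nonnegative parts summing to 1;
    the weights are nonnegative and are normalised by their sum.
    """
    total = sum(weights)
    d = len(compositions[0])
    return [
        sum(w * x[j] for x, w in zip(compositions, weights)) / total
        for j in range(d)
    ]

# Hypothetical relative contributions of three substances in two products.
products = [[0.2, 0.3, 0.5], [0.6, 0.2, 0.2]]
centroid = weighted_centroid(products, weights=[1, 3])
print(centroid, sum(centroid))
```

Because the output is a convex combination of points on the standard simplex, its parts again sum to one, so the result is itself a valid composition.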

2021-02-09 new paper

Hierarchical data fusion processes involving the Möbius representation of capacities

To appear in Fuzzy Sets and Systems — a new paper written together with Gleb Beliakov and Simon James; abstract below.

Abstract. The use of the Choquet integral in data fusion processes allows for the effective modelling of interactions and dependencies between data features or criteria. Its application requires identification of the defining capacity (also known as fuzzy measure) values. The main limiting factor is the complexity of the underlying parameter learning problem, which grows exponentially in the number of variables. However, in practice we may have expert knowledge regarding which of the subsets of criteria interact with each other, and which groups are independent. In this paper we study hierarchical aggregation processes, architecturally similar to feed-forward neural networks, but which allow for the simplification of the fitting problem both in terms of the number of variables and monotonicity constraints. We note that the Möbius representation lets us identify a number of relationships between the overall fuzzy measure and the data pipeline structure. Included in our findings are simplified fuzzy measures that generalise both k-intolerant and k-interactive capacities.
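
For readers unfamiliar with the Möbius representation, below is a minimal Python sketch of the discrete Choquet integral computed in two ways: from the capacity itself and from its Möbius representation. The two-criterion capacity is a made-up example, and the hierarchical constructions studied in the paper are not reproduced here.

```python
from itertools import combinations

def capacity_from_mobius(mobius, n):
    """Recover the capacity: mu(A) is the sum of m(B) over all B contained in A."""
    mu = {}
    for k in range(n + 1):
        for subset in combinations(range(n), k):
            A = frozenset(subset)
            mu[A] = sum(m for B, m in mobius.items() if B <= A)
    return mu

def choquet(x, mu):
    """Choquet integral: increments of the sorted inputs weighted by mu."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    total, prev = 0.0, 0.0
    for pos, i in enumerate(order):
        # Criteria whose values are at least x[i]
        total += (x[i] - prev) * mu[frozenset(order[pos:])]
        prev = x[i]
    return total

def choquet_mobius(x, mobius):
    """Equivalent form: sum over nonempty B of m(B) * min of x over B."""
    return sum(m * min(x[i] for i in B) for B, m in mobius.items() if B)

# Hypothetical Möbius representation for two positively interacting criteria.
mobius = {frozenset({0}): 0.3, frozenset({1}): 0.5, frozenset({0, 1}): 0.2}
mu = capacity_from_mobius(mobius, n=2)
x = [0.6, 0.2]
print(choquet(x, mu), choquet_mobius(x, mobius))
```

A general capacity on n criteria has 2^n values to identify, which is the exponential growth the abstract refers to; forcing some Möbius coefficients to zero (no interaction between certain groups) is one way to shrink the problem.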

2021-01-08 software

Package genieclust 0.9.8 Released

A maintenance release of the R language version of genieclust is now available on CRAN.

Change log:

-   [R] Use `RcppMLPACK` directly; remove dependency on `emstreeR`.

-   [R] Switched to `tinytest` for unit testing.
2020-11-23 new paper

Interpretable sport team rating models based on the gradient descent algorithm

Jan Lasek and I authored a paper that will soon appear in International Journal of Forecasting, where we introduce several new (and efficient) rating models for teams (football/soccer in particular) based on the gradient descent algorithm.

Abstract. We introduce several new sport team rating models based upon the gradient descent algorithm. More precisely, the models can be formulated by maximising the likelihood of match results observed using a single step of this optimisation heuristic. The framework proposed, inspired by the prominent Elo rating system, yields an iterative version of the ordinal logistic regression as well as different variants of the Poisson regression-based models. This construction makes the update equations easy to interpret as well as adjusts ratings once new match results are observed. Thus, it naturally handles temporal changes in team strength. Moreover, a study of association football data indicates that the new models yield more accurate forecasts and are less computationally demanding than corresponding methods that jointly optimise likelihood for the whole set of matches.
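
The idea of updating ratings with a single gradient step can be sketched in Python as follows. This is a deliberately simplified binary (win/loss, no draws) logistic model of my choosing for illustration; the learning rate and function names are assumptions, and the ordinal and Poisson variants from the paper are not shown.

```python
import math

def win_probability(r_home, r_away):
    """Logistic model: probability that the home team wins."""
    return 1.0 / (1.0 + math.exp(-(r_home - r_away)))

def update_ratings(r_home, r_away, home_won, learning_rate=0.1):
    """One gradient ascent step on the log-likelihood of a single match.

    For y in {0, 1} and p = win_probability(...), the derivative of
    y*log(p) + (1-y)*log(1-p) with respect to r_home is (y - p),
    and with respect to r_away it is -(y - p).
    """
    y = 1.0 if home_won else 0.0
    step = learning_rate * (y - win_probability(r_home, r_away))
    return r_home + step, r_away - step

# Two equally rated teams; the home team wins.
ratings = update_ratings(0.0, 0.0, home_won=True)
print(ratings)
```

The update has the familiar Elo shape: the rating change is proportional to the difference between the observed and the predicted result, so surprising outcomes move the ratings more than expected ones.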

2020-11-13 research grant

ARC 2021 Discovery Project

Our (Gleb Beliakov, Simon James, and yours truly) 2021 Discovery Project Beyond black-box models: Interaction in eXplainable Artificial Intelligence has been approved by the Australian Research Council.

Abstract. This project addresses a key issue in automated decision making: explaining how a decision was reached by a computer system to its users. Its aim is to progress towards a new generation of explainable decision models, which would match the performance of current black-box systems while at the same time allow for transparency and detailed interpretation of the underlying logic. This project expects to generate new knowledge in modelling interdependencies of decision criteria using recent advances in the theory of capacities. The expected outcomes are sophisticated but tractable models in which mutual dependencies of decision rules and criteria are treated explicitly and can be thoroughly evaluated.

2020-09-09 software

R Package stringi 1.5.3 Released

A new, major release of my R package stringi brings quite a few new features and bug fixes.

Change log:

* [NEW FEATURE] #400: `%s$%` and `%stri$%` are now binary operators
that call base R's `sprintf()`.

* [NEW FEATURE] #399: The `%s*%` and `%stri*%` operators can be used
in addition to `stri_dup()`, for the very same purpose.

* [NEW FEATURE] #355: `stri_opts_regex()` now accepts the `time_limit` and
`stack_limit` options so as to prevent malformed or malicious regexes
from running for too long.

* [NEW FEATURE] #345: `stri_startswith()` and `stri_endswith()` are now equipped
with the `negate` parameter.

* [NEW FEATURE] #382: Incorrect regexes are now reported to ease debugging.

* [DEPRECATION WARNING] #347: Any unknown option passed to `stri_opts_fixed()`,
`stri_opts_regex()`, `stri_opts_coll()`, and `stri_opts_brkiter()` now
generates a warning. In the future, the `...` parameter will be removed,
so that will be an error.

* [DEPRECATION WARNING] `stri_duplicated()`'s `fromLast` argument
has been renamed `from_last`. `fromLast` is now its alias scheduled
for removal in a future version of the package.

* [DEPRECATION WARNING] `stri_enc_detect2()`
is scheduled for removal in a future version of the package.
Use `stri_enc_detect()` or the more targeted `stri_enc_isutf8()`,
`stri_enc_isascii()`, etc., instead.

* [DEPRECATION WARNING] `stri_read_lines()`,  `stri_write_lines()`,
`stri_read_raw()`: use `con` argument instead of `fname` now.
The argument `fallback_encoding` is scheduled for removal and is no longer
used. `stri_read_lines()` does not support `encoding="auto"` anymore.

* [DEPRECATION WARNING] `nparagraphs` in `stri_rand_lipsum()` has been renamed

* [NEW FEATURE] #398: Alternative, British spelling of function parameters
has been introduced, e.g., `stri_opts_coll()` now supports both
`normalization` and `normalisation`.

* [NEW FEATURE] #393: `stri_read_bin()`, `stri_read_lines()`, and
`stri_write_lines()` are no longer marked as draft API.

* [NEW FEATURE] #187: `stri_read_bin()`, `stri_read_lines()`, and
`stri_write_lines()` now support connection objects as well.

* [NEW FEATURE] #386: New function `stri_sort_key()` for generating
locale-dependent sort keys which can be ordered at the byte level and
return an equivalent ordering to the original string (@DavisVaughan).

* [BUGFIX] #138: `stri_encode()` and `stri_rand_strings()`
now can generate strings of much larger lengths.

* [BUGFIX] `stri_wrap()` did not honour `indent` correctly when
`use_width` was `TRUE`.
2020-09-07 software

Tutorial on stringi

A comprehensive tutorial on the stringi package is now available.
2020-08-17 software

stringi Has a New Website

I have created a new home(page) for my stringi package, see
2020-07-31 software

Python and R package genieclust 0.9.4

A reimplementation of my robust hierarchical clustering algorithm Genie is now available on PyPI and CRAN. Now even faster and equipped with many more features, including noise point detection. See for more details, documentation, benchmarks, and tutorials.
2020-07-08 new paper

Paper on SimilaR in R Journal

SimilaR: R Code Clone and Plagiarism Detection by Maciej Bartoszuk and me has been accepted for publication in the R Journal.

Abstract. Third-party software for assuring source code quality is becoming increasingly popular. Tools that evaluate the coverage of unit tests, perform static code analysis, or inspect run-time memory use are crucial in the software development life cycle. More sophisticated methods allow for performing meta-analyses of large software repositories, e.g., to discover abstract topics they relate to or common design patterns applied by their developers. They may be useful in gaining a better understanding of the component interdependencies, avoiding cloned code as well as detecting plagiarism in programming classes.

A meaningful measure of similarity of computer programs often forms the basis of such tools. While there are a few noteworthy instruments for similarity assessment, none of them turns out particularly suitable for analysing R code chunks. Existing solutions rely on rather simple techniques and heuristics and fail to provide a user with the kind of sensitivity and specificity required for working with R scripts. In order to fill this gap, we propose a new algorithm based on a Program Dependence Graph, implemented in the SimilaR package. It can serve as a tool not only for improving R code quality but also for detecting plagiarism, even when it has been masked by applying some obfuscation techniques or imputing dead code. We demonstrate its accuracy and efficiency in a real-world case study.

2020-06-08 new paper

Paper in PNAS: Three Dimensions of Scientific Impact

In a paper recently published in the Proceedings of the National Academy of Sciences of the United States of America (PNAS) (doi:10.1073/pnas.2001064117; joint work with Grzesiek Siudem, Basia Żogała-Siudem and Ania Cena), we consider the mechanisms behind one’s research success as measured by one’s papers’ citability. By acknowledging the perceived esteem might be a consequence not only of how valuable one’s works are but also of pure luck, we arrived at a model that can accurately recreate a citation record based on just three parameters: the number of publications, the total number of citations, and the degree of randomness in the citation patterns. As a by-product, we show that a single index will never be able to embrace the complex reality of the scientific impact. However, three of them can already provide us with a reliable summary.

Abstract. The growing popularity of bibliometric indexes (whose most famous example is the h index by J. E. Hirsch [J. E. Hirsch, Proc. Natl. Acad. Sci. U.S.A. 102, 16569–16572 (2005)]) is opposed by those claiming that one's scientific impact cannot be reduced to a single number. Some even believe that our complex reality fails to submit to any quantitative description. We argue that neither of the two controversial extremes is true. By assuming that some citations are distributed according to the rich get richer rule (success breeds success, preferential attachment) while some others are assigned totally at random (all in all, a paper needs a bibliography), we have crafted a model that accurately summarizes citation records with merely three easily interpretable parameters: productivity, total impact, and how lucky an author has been so far.
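
To see why a single index can hide a lot, consider this short Python sketch of the h index applied to two made-up citation records. It only illustrates the index itself, not the three-parameter model from the paper.

```python
def h_index(citations):
    """Largest h such that at least h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, c in enumerate(ranked, start=1):
        if c >= rank:
            h = rank
        else:
            break
    return h

# Two hypothetical authors with five papers each.
steady = [10, 8, 5, 4, 3]
skewed = [100, 50, 4, 4, 0]
print(h_index(steady), sum(steady))
print(h_index(skewed), sum(skewed))
```

Both records yield the same h index even though their total citation counts and the shapes of their citation distributions differ considerably.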


Benchmark Suite for Clustering Algorithms - Version 1

Let's aggregate, polish and standardise the existing clustering benchmark suites referred to across the machine learning and data mining literature! See our new Benchmark Suite for Clustering Algorithms.
2020-02-23 book draft

Lightweight Machine Learning Classics with R

A first draft of my new textbook Lightweight Machine Learning Classics with R is now available.

About. Explore some of the most fundamental algorithms which have stood the test of time and provide the basis for innovative solutions in data-driven AI. Learn how to use the R language for implementing various stages of data processing and modelling activities. Appreciate mathematics as the universal language for formalising data-intense problems and communicating their solutions. The book is for you if you're yet to be fluent with university-level linear algebra, calculus and probability theory or you've forgotten all the maths you've ever learned, and are seeking a gentle, yet thorough, introduction to the topic.

2020-02-17 software

R Package stringi 1.4.6 Released

A new bug-fix release of stringi is now on CRAN.

Change log:

* [BACKWARD INCOMPATIBILITY] #369: `stri_c()` now returns an empty string
when input is empty and `collapse` is set.

* [BUGFIX] #370: fixed an issue in `stri_prepare_arg_POSIXct()`
reported by rchk.

* [DOCUMENTATION] #372: documented arguments not in `\usage` in
documentation object `stri_datetime_format`: `...`
2020-02-10 new paper

Genie+OWA: Robustifying Hierarchical Clustering with OWA-based Linkages

Check out our (by Anna Cena and me) most recent paper on the best hierarchical clustering algorithm in the world – Genie. It is going to appear in Information Sciences; doi:10.1016/j.ins.2020.02.025.

Abstract. We investigate the application of the Ordered Weighted Averaging (OWA) data fusion operator in agglomerative hierarchical clustering. The examined setting generalises the well-known single, complete and average linkage schemes. It allows to embody expert knowledge in the cluster merge process and to provide a much wider range of possible linkages. We analyse various families of weighting functions on numerous benchmark data sets in order to assess their influence on the resulting cluster structure. Moreover, we inspect the correction for the inequality of cluster size distribution -- similar to the one in the Genie algorithm. Our results demonstrate that by robustifying the procedure with the Genie correction, we can obtain a significant performance boost in terms of clustering quality. This is particularly beneficial in the case of the linkages based on the closest distances between clusters, including the single linkage and its "smoothed" counterparts. To explain this behaviour, we propose a new linkage process called three-stage OWA which yields further improvements. This way we confirm the intuition that hierarchical cluster analysis should rather take into account a few nearest neighbours of each point, instead of trying to adapt to their non-local neighbourhood.
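
The way OWA operators generalise the classic linkages can be illustrated with a few lines of Python. The helper names and the toy one-dimensional clusters are assumptions made for this sketch; the Genie correction and the three-stage OWA from the paper are not implemented here.

```python
def owa(values, weights):
    """Ordered weighted average: weights are applied to the values
    sorted in ascending order (a convention chosen for this sketch)."""
    return sum(w * v for w, v in zip(weights, sorted(values)))

def owa_linkage(cluster_a, cluster_b, weights):
    """Distance between two clusters of 1-D points as an OWA
    of all pairwise distances."""
    distances = [abs(a - b) for a in cluster_a for b in cluster_b]
    return owa(distances, weights)

A, B = [0.0, 1.0], [3.0, 5.0]  # pairwise distances: 3, 5, 2, 4

single = owa_linkage(A, B, [1, 0, 0, 0])               # smallest distance
complete = owa_linkage(A, B, [0, 0, 0, 1])             # largest distance
average = owa_linkage(A, B, [0.25, 0.25, 0.25, 0.25])  # mean distance
smoothed = owa_linkage(A, B, [0.5, 0.5, 0, 0])         # two closest pairs
print(single, complete, average, smoothed)
```

Putting all the weight on the smallest distance gives single linkage, on the largest gives complete linkage, and uniform weights give average linkage; spreading the weight over a few of the closest pairs gives a "smoothed" single linkage in the spirit of the abstract.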

2020-01-06 software

R Package stringi 1.4.4

stringi 1.4.4 is on its way to CRAN.

Change log:

* [BUGFIX] #348: Avoid copying 0 bytes to a nil-buffer in `stri_sub_all()`.

* [BUGFIX] #362: Removed `configure` variable `CXXCPP` as it is now deprecated.

* [BUGFIX] #318: PROTECTing objects from gcing as reported by `rchk`.

* [BUGFIX] #344, #364: Removed compiler warnings in icu61/common/cstring.h.

* [BUGFIX] #363: Status of `RegexMatcher` is now checked after its use.
2019-12-11 new paper

DC Optimisation for Constructing Discrete Sugeno Integrals and Learning Nonadditive Measures

We (Gleb Beliakov, Simon James and I) have another paper accepted for publication – this time in the Optimization journal; doi:10.1080/02331934.2019.1705300.

Abstract. Defined solely by means of order-theoretic operations meet (min) and join (max), weighted lattice polynomial functions are particularly useful for modeling data on an ordinal scale. A special case, the discrete Sugeno integral, defined with respect to a nonadditive measure (a capacity), enables accounting for the interdependencies between input variables.

However until recently the problem of identifying the fuzzy measure values with respect to various objectives and requirements has not received a great deal of attention. By expressing the learning problem as the difference of convex functions, we are able to apply DC (difference of convex) optimization methods. Here we formulate one of the global optimization steps as a local linear programming problem and investigate the improvement under different conditions.


IEEE WCCI 2020 Special Session - Aggregation Structures: New Trends and Applications

Call for contributions: IEEE World Congress on Computational Intelligence (WCCI) 2020, Glasgow, Scotland, FUZZ-IEEE-6 Special Session on Aggregation Structures: New Trends and Applications.
2019-11-14 new paper

Robust Fitting for the Sugeno Integral with Respect to General Fuzzy Measures

The editor of Information Sciences has just let us know that a paper by Gleb Beliakov, Simon James and me will be published in this outlet.

Abstract. The Sugeno integral is an expressive aggregation function with potential applications across a range of decision contexts. Its calculation requires only the lattice minimum and maximum operations, making it particularly suited to ordinal data and robust to scale transformations. However, for practical use in data analysis and prediction, we require efficient methods for learning the associated fuzzy measure. While such methods are well developed for the Choquet integral, the fitting problem is more difficult for the Sugeno integral because it is not amenable to being expressed as a linear combination of weights, and more generally due to plateaus and non-differentiability in the objective function. Previous research has hence focused on heuristic approaches or simplified fuzzy measures. Here we show that the problem of fitting the Sugeno integral to data such that the maximum absolute error is minimized can be solved using an efficient bilevel program. This method can be incorporated into algorithms that learn fuzzy measures with the aim of minimizing the median residual. This equips us with tools that make the Sugeno integral a feasible option in robust data regression and analysis. We provide experimental comparison with a genetic algorithms approach and an example in data analysis.
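
Since the Sugeno integral needs only minimum and maximum, it fits in a few lines of Python. The fuzzy measure below is a made-up two-criterion example, and the fitting procedure from the paper (the bilevel program) is not shown.

```python
def sugeno(x, mu):
    """Discrete Sugeno integral of x with respect to the fuzzy measure mu.

    mu maps frozensets of criterion indices to values in [0, 1].
    Only comparisons, min, and max are used, so ordinal scales suffice.
    """
    order = sorted(range(len(x)), key=lambda i: x[i])
    return max(
        min(x[i], mu[frozenset(order[pos:])])  # criteria with values >= x[i]
        for pos, i in enumerate(order)
    )

# Hypothetical fuzzy measure on two criteria.
mu = {frozenset({0}): 0.3, frozenset({1}): 0.5, frozenset({0, 1}): 1.0}
x = [0.6, 0.2]
result = sugeno(x, mu)

# Applying the same increasing transformation to the inputs and the measure
# transforms the output in the same way.
phi = lambda t: t ** 2
result_phi = sugeno([phi(v) for v in x], {A: phi(m) for A, m in mu.items()})
print(result, result_phi, phi(result))
```

The second computation illustrates the robustness to scale transformations mentioned in the abstract. The same min/max structure creates plateaus in the output, which is one reason why fitting the measure is harder than for the Choquet integral.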


Deakin University

On the 23rd of September 2019, I commence as a Senior Lecturer in Applied Artificial Intelligence at Deakin University in Melbourne-Burwood, Australia (an Australian senior lecturer is supposed to be equivalent to an associate professor in the US).
2019-09-10 new paper

Constrained Ordered Weighted Averaging Aggregation with Multiple Comonotone Constraints

Lucian Coroianu, Robert Fullér, Simon James, and I have a paper accepted in the Fuzzy Sets and Systems outlet. Abstract below.

Abstract. The constrained ordered weighted averaging (OWA) aggregation problem arises when we aim to maximize or minimize a convex combination of order statistics under linear inequality constraints that act on the variables with respect to their original sources. The standalone approach to optimizing the OWA under constraints is to consider all permutations of the inputs, which becomes quickly infeasible when there are more than a few variables, however in certain cases we can take advantage of the relationships amongst the constraints and the corresponding solution structures. For example, we can consider a land-use allocation satisfaction problem with an auxiliary aim of balancing land-types, whereby the response curves for each species are non-decreasing with respect to the land-types. This results in comonotone constraints, which allow us to drastically reduce the complexity of the problem.
In this paper, we show that if we have an arbitrary number of constraints that are comonotone (i.e., they share the same ordering permutation of the coefficients), then the optimal solution occurs for decreasing components of the solution. After investigating the form of the solution in some special cases and providing theoretical results that shed light on the form of the solution, we detail practical approaches to solving and give real-world examples.

2019-06-17 new PhD

Jan Lasek's PhD defence

My PhD student, Jan Lasek, has successfully defended his doctoral thesis, New data-driven rating systems for association football. :)
2019-06-08 new paper

Aggregation on Ordinal Scales with the Sugeno Integral for Biomedical Applications

Gleb Beliakov, Simon James and I have another paper accepted for publication in Information Sciences. This time we re-write a learning-to-aggregate problem based on the Sugeno integral in a difference-of-convex objective setting. The derived tool is particularly useful when working with ordinal data.

Abstract. The Sugeno integral is a function particularly suited to the aggregation of ordinal inputs. Defined with respect to a fuzzy measure, its ability to account for complementary and redundant relationships between variables brings much potential to the field of biomedicine, where it is common for measurements and patient information to be expressed qualitatively. However, practical applications require well-developed methods for identifying the Sugeno integral's parameters, and this task is not easily expressed using the standard optimisation approaches. Here we formulate the objective function as the difference of two convex functions, which enables the use of specialised numerical methods. Such techniques are compared with other global optimisation frameworks through a number of numerical experiments.

2019-05-14 new paper

New Paper on Information Fusion

A taxonomy of monotonicity properties for the aggregation of multidimensional data, joint work with Raúl Pérez-Fernández and Bernard De Baets, has been accepted for publication in Information Fusion.

Abstract. The property of monotonicity, which requires a function to preserve a given order, has been considered the standard in the aggregation of real numbers for decades. In this paper, we argue that, for the case of multidimensional data, an order-based definition of monotonicity is far too restrictive. We propose several meaningful alternatives to this property not involving the preservation of a given order by returning to its early origins stemming from the field of calculus. Numerous aggregation methods for multidimensional data commonly used by practitioners are studied within our new framework.

2019-03-25 new paper

An Inherent Difficulty in the Aggregation of Multidimensional Data

Accepted for publication in IEEE Transactions on Fuzzy Systems: An inherent difficulty in the aggregation of multidimensional data by Raúl Pérez-Fernández, Bernard De Baets, and me. Abstract below.

Abstract. In the field of information fusion, the problem of data aggregation has been formalized as an order-preserving process that builds upon the property of monotonicity. However, fields such as computational statistics, data analysis and geometry, usually emphasize the role of equivariances to various geometrical transformations in aggregation processes. Admittedly, if we consider a unidimensional data fusion task, both requirements are often compatible with each other. Nevertheless, in this paper we show that, in the multidimensional setting, the only idempotent functions that are monotone and orthogonal equivariant are the over-simplistic weighted centroids. Even more, this result still holds after replacing monotonicity and orthogonal equivariance by the weaker property of orthomonotonicity. This implies that the aforementioned approaches to the aggregation of multidimensional data are irreconcilable, and that, if a weighted centroid is to be avoided, we must choose between monotonicity and a desirable behaviour with regard to orthogonal transformations.

2019-03-13 new paper

Should We Introduce a Dislike Button for Academic Papers?

The abstract of our (by Agnieszka Geras, Grzegorz Siudem, and me) recent paper to be published in Journal of the Association for Information Science and Technology can be found below.

Abstract. On the grounds of the revealed, mutual resemblance between the behaviour of users of Stack Exchange and the dynamics of the citations accumulation process in the scientific community, we tackled an outwardly intractable problem of assessing the impact of introducing "negative" citations.
Although the most frequent reason to cite a paper is to highlight the connection between the two publications, researchers sometimes mention an earlier work to cast a negative light. While computing citation-based scores, for instance the h-index, information about the reason why a paper was mentioned is neglected. Therefore it can be questioned whether these indices describe scientific achievements accurately.
In this contribution we shed insight into the problem of "negative" citations, analysing data from Stack Exchange and, to draw more universal conclusions, we derive an approximation of citations scores. Here we show that the quantified influence of introducing negative citations is of lesser importance and that they could be used as an indicator of where attention of scientific community is allocated.

2019-03-12 software

R Package stringi 1.4.3

This month's new release of the R package stringi brings significant improvements in the way substring extraction tasks are performed.

Change-log since v1.3.1:

* [NEW FEATURE] #30: New function `stri_sub_all()` - a version of
`stri_sub()` accepting list `from`/`to`/`length` arguments for extracting
multiple substrings from each string in a character vector.

* [NEW FEATURE] #30: New function `stri_sub_all<-()` (and its `%<%`-friendly
version, `stri_sub_replace_all()`) - for replacing multiple substrings
with corresponding replacement strings.

* [NEW FEATURE] In `stri_sub_replace()`, `value` parameter
has a new alias, `replacement`.

* [NEW FEATURE] New convenience functions based on `stri_remove_empty()`:
`stri_omit_empty_na()`, `stri_remove_empty_na()`, `stri_omit_empty()`,
and also `stri_remove_na()`, `stri_omit_na()`.

* [BUGFIX] #343: `stri_trans_char()` did not yield correct results
for overlapping pattern and replacement strings.

* [WARNFIX] #205: `` is now included in the source bundle.
2019-03-08 software

R Package agop 0.2-2

A long-overdue release of the R package agop is now available on CRAN. See below for more details.


0.2-2 /2019-03-05/

* [IMPORTANT CHANGE] All functions dealing with binary relations now are
named like `rel_*`. Moreover, `de_transitive()` has been renamed

* [IMPORTANT CHANGE] The definition of `owa()`, `owmax()`, and `owmin()`
is now consistent with that of (Grabisch et al., 2009), i.e., uses
nondecreasing vectors, and not nonincreasing ones.

* [NEW FUNCTIONS] `rel_closure_reflexive()`, `rel_reduction_reflexive()`,
`rel_is_symmetric()`, `rel_closure_symmetric()`, `rel_is_irreflexive()`,
`rel_is_asymmetric()`, `rel_is_antisymmetric()`, `rel_is_cyclic()`, etc.,
modify given adjacency matrices representing binary relations over
finite sets.

* [NEW FUNCTIONS] some predefined fuzzy logic connectives have been added,
e.g., `tnorm_minimum()`, `tnorm_drastic()`, `tnorm_product()`,
`tnorm_lukasiewicz()`, `tnorm_fodor()`, `tconorm_minimum()`,
`tconorm_drastic()`, `tconorm_product()`, `tconorm_lukasiewicz()`,
`tconorm_fodor()`, `fnegation_classic()`, `fnegation_minimal()`,
`fnegation_maximal()`, `fnegation_yager()`, `fimplication_minimal()`,
`fimplication_maximal()`, `fimplication_kleene()`,
`fimplication_lukasiewicz()`, `fimplication_reichenbach()`,
`fimplication_fodor()`, `fimplication_goguen()`, `fimplication_goedel()`,
`fimplication_rescher()`, `fimplication_weber()`, `fimplication_yager()`.

* [NEW FUNCTION] `check_comonotonicity()` determines if two vectors are

* [NEW FUNCTIONS] `pord_spread()`, `pord_spreadsym()`, `pord_nd()` -
example preorders on sets of vectors.

* [NEW FEATURE] `plot_producer()` gained a new argument: `a`.

* [BUGFIX] `rel_closure_transitive()` - a resulting matrix
was not necessarily transitive.

* [BUGFIX] `prepare_arg_numeric_sorted` (internal, C++) did not sort
some vectors.

* [BUGFIX] All built-in aggregation functions now throw an error on empty vectors.

* [INFO] The package no longer depends on the `Matrix` package.
The `igraph` package is only suggested.

* [INFO] Most of the functions are now implemented in C++.
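
For readers unfamiliar with the connectives and the OWA convention mentioned in the change-log above, here is a minimal Python sketch of the underlying definitions. It is not the agop API: the function names are hypothetical and merely mirror the mathematics. The OWA operator applies the weights to the inputs sorted in nondecreasing order and raises an error on empty input, in the spirit of the entries above.

```python
def tnorm_lukasiewicz(x, y):
    """Lukasiewicz t-norm on [0, 1]: max(0, x + y - 1)."""
    return max(0.0, x + y - 1.0)


def tconorm_lukasiewicz(x, y):
    """Lukasiewicz t-conorm (dual of the t-norm): min(1, x + y)."""
    return min(1.0, x + y)


def tnorm_product(x, y):
    """Product t-norm: x * y."""
    return x * y


def tconorm_product(x, y):
    """Probabilistic sum, the t-conorm dual to the product t-norm."""
    return x + y - x * y


def owa(x, w):
    """Ordered weighted average: w[0] multiplies the smallest input,
    w[-1] the largest one (nondecreasing ordering)."""
    if len(x) == 0:
        raise ValueError("empty input vector")
    if len(x) != len(w):
        raise ValueError("x and w must be of the same length")
    return sum(wi * xi for wi, xi in zip(w, sorted(x)))
```

With this ordering, `owa(x, [1, 0, 0])` returns the minimum of three inputs and `owa(x, [0, 0, 1])` the maximum.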
2019-03-01 new paper

Penalty-based Data Aggregation in Real Normed Vector Spaces

To be published in the Proceedings of the AGOP'19 conference: joint work with Lucian Coroianu entitled Penalty-based data aggregation in real normed vector spaces. Abstract below.

Abstract. The problem of penalty-based data aggregation in generic real normed vector spaces is studied. Some existence and uniqueness results are indicated. Moreover, various properties of the aggregation functions are considered.

2019-02-14 software

R Package stringi 1.3.1

A new major release of the R package stringi (one of the most often downloaded extensions on CRAN) is available. Check out the change-log for more information.


* [BACKWARD INCOMPATIBILITY] #335: A fix to #314 (by design) prevented the use
of the system ICU if the library had been compiled with `U_CHARSET_IS_UTF8=1`.
However, this is the default setting in `libicu`>=61. From now on, in such
cases the system ICU is used more eagerly, but `stri_enc_set()` issues
a warning stating that the default (UTF-8) encoding cannot be changed.

* [NEW FEATURE] #232: All `stri_detect_*` functions now have the `max_count`
argument that allows for, e.g., stopping at first pattern occurrence.

* [NEW FEATURE] #338: `stri_sub_replace()` is now an alias for `stri_sub<-()`
which makes it much more easily pipable (@yutannihilation, @BastienFR).

* [NEW FEATURE] #334: Added missing `icudt61b.dat` to support big-endian
platforms (thanks to Dimitri John Ledkov @xnox).

* [BUGFIX] #296: Out-of-the box build used to fail on CentOS 6, upgraded
`./configure` to `--disable-cxx11` more eagerly at an early stage.

* [BUGFIX] #341: Fixed possible buffer overflows when calling `strncpy()`
from within ICU 61.

* [BUGFIX] #325: Made `./configure` more portable so that it works
under `/bin/dash` now.

* [BUGFIX] #319: Fixed overflow in `stri_rand_shuffle()`.

* [BUGFIX] #337: Empty search patterns in search functions (e.g.,
`stri_split_regex()` and `stri_count_fixed()`) used to raise
too many warnings.
2019-02-14 new paper

Piecewise Linear Approximation of Fuzzy Numbers: Algorithms, Arithmetic Operations and Stability of Characteristics

A paper by me, Lucian Coroianu and Przemyslaw Grzegorzewski entitled Piecewise linear approximation of fuzzy numbers: algorithms, arithmetic operations and stability of characteristics, has been accepted for publication in Soft Computing.

Abstract. The problem of the piecewise linear approximation of fuzzy numbers giving outputs nearest to the inputs with respect to the Euclidean metric is discussed. The results given in Coroianu et al. (Fuzzy Sets Syst 233:26–51, 2013) for the 1-knot fuzzy numbers are generalized for arbitrary n-knot (n>=2) piecewise linear fuzzy numbers. Some results on the existence and properties of the approximation operator are proved. Then, the stability of some fuzzy number characteristics under approximation as the number of knots tends to infinity is considered. Finally, a simulation study concerning the computer implementations of arithmetic operations on fuzzy numbers is provided. Suggested concepts are illustrated by examples and algorithms ready for the practical use. This way, we throw a bridge between theory and applications as the latter ones are so desired in real-world problems.

2019-01-16 new paper

Supervised Learning to Aggregate Data with the Sugeno Integral

Supervised Learning to Aggregate Data with the Sugeno Integral, co-authored by Simon James and Gleb Beliakov, will appear in IEEE Trans. Fuzzy Systems.

Abstract. The problem of learning symmetric capacities (or fuzzy measures) from data is investigated toward applications in data analysis and prediction as well as decision making. Theoretical results regarding the solution minimizing the mean absolute error are exploited to develop an exact branch-refine-and-bound-type algorithm for fitting Sugeno integrals (weighted lattice polynomial functions, max-min operators) with respect to symmetric capacities. The proposed method turns out to be particularly suitable for acting on ordinal data. In addition to providing a model that can be used for the general data regression task, the results can be used, among others, to calibrate generalized h-indices to bibliometric data.
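
To make the fitted object concrete, the following Python sketch evaluates (it does not fit) a discrete Sugeno integral with respect to a symmetric capacity. The function name and the argument layout are illustrative assumptions, not code from the paper: `v[k]` is taken to be the measure of any subset of cardinality `k`, with `v[0] = 0`, `v[n] = 1`, and `v` nondecreasing.

```python
def sugeno_symmetric(x, v):
    """Discrete Sugeno integral of x w.r.t. a symmetric capacity.

    x -- n inputs (e.g., in [0, 1]);
    v -- n + 1 values, v[k] being the measure of any k-element subset.
    Computes max_i min(x_(i), v[n - i + 1]) with x_(1) <= ... <= x_(n).
    """
    n = len(x)
    if n == 0 or len(v) != n + 1:
        raise ValueError("need n >= 1 inputs and n + 1 capacity values")
    xs = sorted(x)  # nondecreasing order statistics
    # xs[i] is the (i+1)-th smallest input; at least n - i inputs are >= it
    return max(min(xs[i], v[n - i]) for i in range(n))
```

Choosing `v = [0, 1, 1, 1]` recovers the maximum of three inputs and `v = [0, 0, 0, 1]` the minimum, which shows the max-min (weighted lattice polynomial) nature of the operator.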

2018-12-11 new PhD

Anna Cena's PhD Defence

My PhD student, Anna Cena, has defended her doctoral thesis, Adaptive hierarchical clustering algorithms based on data aggregation methods. Yay!
2018-10-26 new PhD

Maciej Bartoszuk's PhD Defence

My PhD student, Maciej Bartoszuk, has defended his doctoral thesis (cum laude!), A source code similarity assessment system for functional programming languages based on machine learning and data aggregation methods. Congratulations!
2018-07-02 new paper

The Efficacy of League Formats in Ranking Teams

The efficacy of league formats in ranking teams has been accepted for publication in Statistical Modelling. Joint work with Jan Lasek.

Abstract. The efficacy of different league formats in ranking teams according to their true latent strength is analysed. To this end, a new approach for estimating attacking and defensive strengths based on the Poisson regression for modelling match outcomes is proposed. Various performance metrics are estimated reflecting the agreement between latent teams' strength parameters and their final rank in the league table. The tournament designs studied here are used in the majority of European top-tier association football competitions. Based on numerical experiments, it turns out that a two-stage league format comprising the three round-robin tournament together with an extra single round-robin is the most efficacious setting. In particular, it is the most accurate in selecting the best team as the winner of the league. Its efficacy can be enhanced by setting the number of points allocated for a win to two (instead of three that is currently in effect in association football).

2017-05-23 software

Python Package genieclust 0.1a2

An alpha release of the Python package implementing our fast and robust Genie clustering algorithm is now available on PyPI. Check out the GitHub repository for more information and tutorials.
2018-05-11 invited talk

Invited Plenary Lecture @ ISCAMI 2018

Today, at the International Student Conference on Applied Mathematics and Informatics – ISCAMI 2018 held in Malenovice, Czechia, I gave a lecture entitled Clustering on MSTs.

Abstract. Cluster analysis is one of the most commonly applied unsupervised machine learning techniques. Its aim is to automatically discover an underlying structure of a data set represented by a partition of its elements: mutually disjoint and nonempty subsets are determined in such a way that observations within each group are “similar” and entities in distinct clusters “differ” as much as possible from each other.

It turns out that two state-of-the-art clustering algorithms -- namely the Genie and HDBSCAN* methods -- can be computed based on the minimum spanning tree (MST) of the pairwise dissimilarity graph. Both of them are not only resistant to outliers and produce high-quality partitions, but also are relatively fast to compute.

The aim of this tutorial is to discuss some key issues of hierarchical clustering and explore their relations with graph and data aggregation theory.
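
As a rough illustration of how a partition can be read off a minimum spanning tree, here is a Python sketch of the simplest MST-based method: single linkage, obtained by dropping the heaviest MST edges. This is neither Genie nor HDBSCAN*; the function names, the Euclidean distance, and the quadratic-time Prim algorithm are all assumptions made for brevity.

```python
import math


def mst_prim(points):
    """MST edges (weight, i, j) of the complete Euclidean graph; O(n^2) Prim."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known connection of each point to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # pick the not-yet-included point that is closest to the tree
        u = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((best[u], parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return edges


def mst_clusters(points, k):
    """Single-linkage partition into k clusters: keep the n - k lightest MST edges."""
    n = len(points)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of points")
    kept = sorted(mst_prim(points))[: n - k]
    label = list(range(n))  # union-find forest

    def find(i):
        while label[i] != i:
            label[i] = label[label[i]]  # path halving
            i = label[i]
        return i

    for _, i, j in kept:
        label[find(i)] = find(j)
    roots = [find(i) for i in range(n)]
    ids = {r: c for c, r in enumerate(dict.fromkeys(roots))}
    return [ids[r] for r in roots]
```

Methods such as Genie differ in which MST edges they consume and in what order, but they can reuse the same tree once it has been computed.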

2017-05-03 software

R Package stringi 1.2.2

A new major release of the R package stringi is out. Check out the change-log for more information.


* [GENERAL] #193: `stringi` is now bundled with ICU4C 61.1,
which is used on most Windows and OS X builds as well as on *nix systems
not equipped with ICU. However, if the C++11 support is disabled,
stringi will be built against ICU4C 55.1. The update to ICU brings
Unicode 10.0 support, including new emoji characters.

* [BUGFIX] #288: stri_match did not return the correct number of columns
when input was empty.

* [NEW FEATURE] #188: `stri_enc_detect` now returns a list of data frames.

* [NEW FEATURE] #289: `stri_flatten` gained `na_empty` and `omit_empty` arguments.

* [NEW FEATURE] New functions: `stri_remove_empty`, `stri_na2empty`.

* [NEW FEATURE] #285: Coercion from a non-trivial list (one that consists
of atomic vectors, each of length 1) to an atomic vector now issues a warning.

* [WARN] Removed `-Wparentheses` warnings in `icu55/common/cstring.h:38:63`
and `icu55/i18n/windtfmt.cpp` in the ICU4C 55.1 bundle.
2018-04-20 invited workshop

Text Analysis Developers' Workshop 2018 @ NYC

Greetings from the Text Analysis Developers' Workshop 2018 @ New York University! This is a follow-up of the great event held a year ago at the London School of Economics, but with a stronger out-of-R focus (Python included).

Associate Professor @ Polish Academy of Sciences

I have been promoted to associate professor at the Polish Academy of Sciences (at the Systems Research Institute).

MADAM Seminar: Aggregation through the poset glass (Raúl Pérez-Fernández)

On March 28, 2018 at the MADAM (Methods for Analysis of Data: Algorithms and Modeling) seminar, Dr Raúl Pérez-Fernández (Ghent University) will give a talk on the need for Aggregation 2.0.

Abstract. The aggregation of several objects into a single one is a common study subject in mathematics. Unfortunately, whereas practitioners often need to deal with the aggregation of many different types of objects (rankings, graphs, strings, etc.), the current theory of aggregation is mostly developed for dealing with the aggregation of values in a poset. In this presentation, we will reflect on the limitations of this poset-based theory of aggregation and “jump through the poset glass”. On the other side, we will not find Wonderland, but, instead, we will find more questions than answers. Indeed, a new theory of aggregation is being born, and we will need to work together on this reboot for years to come.


MADAM Seminar: Should we introduce a ‘dislike’ button for papers? (Agnieszka Geras)

On March 21, 2018 at the MADAM (Methods for Analysis of Data: Algorithms and Modeling) seminar, Ms. Agnieszka Geras (PhD student @ FMIS WUT) will present her recent results concerning analysis and modeling of Stack Exchange sites.

Abstract. Citations scores and the h-index are basic tools used for measuring the quality of scientific work. Nonetheless, while evaluating academic achievements one rarely takes into consideration for what reason the paper was mentioned by another author - whether in order to highlight the connection between their work or to bring to the reader’s attention any mistakes or flaws. In my talk I will shed some insight into the problem of “negative” citations analyzing data from the Stack Exchange and using the proposed agent-based model. Joint work with Marek Gągolewski and Grzegorz Siudem.

2018-02-24 new paper

Least median of squares (LMS) and least trimmed squares (LTS) fitting for the weighted arithmetic mean

A paper entitled Least median of squares (LMS) and least trimmed squares (LTS) fitting for the weighted arithmetic mean (joint work with Gleb Beliakov and Simon James) has been accepted for publication in the Proceedings of the IPMU 2018 conference.
Abstract. We look at different approaches to learning the weights of the weighted arithmetic mean such that the median residual or sum of the smallest half of squared residuals is minimized. The more general problem of multivariate regression has been well studied in the statistical literature; however, in the case of aggregation functions we have the restriction on the weights and the domain is usually restricted so that ‘outliers’ may not be arbitrarily large. A number of algorithms are compared in terms of accuracy and speed. Our results can be extended to other aggregation functions.
2018-02-02 invited talk

Invited Plenary Lecture @ FSTA 2018

Today I gave a lecture at the 14th International Conference of Fuzzy Set Theory and Applications – FSTA 2018 held in Liptovský Ján, Slovak Republic.

Abstract. Hirsch's h-index is perhaps the most popular citation-based measure of scientific excellence. Many of its natural generalizations can be expressed as simple functions of some discrete Sugeno integrals.

In this talk we shall review some less-known results concerning various stochastic properties of the discrete Sugeno integral with respect to a symmetric normalized capacity, i.e., weighted lattice polynomial functions of real-valued random variables -- both in i.i.d. (independent and identically distributed) and non-i.i.d. (with some dependence structure) cases. For instance, we will be interested in investigating their exact and asymptotic distributions. Based on these, we can, among others, show that the h-index is a consistent estimator of some natural probability distribution's location characteristic. Moreover, we can derive a statistical test to verify whether the difference between two h-indices (say, h'=7 vs. h''=10 in cases where both authors published 40 papers) is actually significant.

What is more, we shall discuss some agent-based models that describe the processes generating citation networks based on, e.g., the preferential attachment (“rich gets richer”) rule. Due to such an approach, we are able to simulate a scientist's activity and then estimate the expected values for the h-index and similar functions based on very simple sample statistics, such as the total number of citations and the total number of publications. Such results can help explain what the h-index really measures.
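
As a reminder of the quantity discussed in this abstract, here is a small Python sketch of the h-index, computed both directly and in the max-min form that links it to the discrete Sugeno integral. The function names are illustrative only, and the citation counts are assumed to be nonnegative integers.

```python
def h_index(citations):
    """Largest h such that at least h papers have at least h citations each."""
    h = 0
    for i, c in enumerate(sorted(citations, reverse=True), start=1):
        if c >= i:
            h = i
        else:
            break
    return h


def h_index_maxmin(citations):
    """The same value written as max_i min(c_(i), i), c_(1) >= c_(2) >= ..."""
    cs = sorted(citations, reverse=True)
    return max((min(c, i) for i, c in enumerate(cs, start=1)), default=0)
```

For the citation record `[10, 8, 5, 4, 3]` both functions return 4: four papers have at least four citations each, but there are not five papers with at least five.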


MADAM Seminar: Measuring the efficacy of league formats in ranking football teams (Jan Lasek)

On January 5, 2018 at the MADAM (Methods for Analysis of Data: Algorithms and Modeling) seminar, Mr Jan Lasek (PhD student @ ICS PAS) will discuss various issues concerning the efficacy of league formats in ranking football (soccer) teams.

Abstract. Choosing between different tournament designs based on their accuracy in ranking teams is an important topic in football since many domestic championships underwent changes in recent years. In particular, the transformation of Ekstraklasa -- the top-tier football competition in Poland -- is a topic receiving much attention from the organizing body of the competition, participating football clubs as well as supporters. In this presentation we will discuss the problem of measuring the accuracy of different league formats in ranking teams. We will present various models for rating teams that will be next used to simulate a number of tournaments to evaluate their efficacy, for example, by measuring the probability that the best team wins. Finally, we will discuss several other aspects of league formats including the influence of the number of points allocated for a win on the final league standings.


Associate Professor @ WUT

I have been promoted to associate professor at the Faculty of Mathematics and Information Science, Warsaw University of Technology.

MADAM Seminar: How accidental scientific success is? (Grzegorz Siudem)

On November 24, 2017 at the MADAM (Methods for Analysis of Data: Algorithms and Modeling) seminar, Dr Grzegorz Siudem (Faculty of Physics, Warsaw University of Technology) will discuss a new agent-based model for citation networks.

Abstract. Since the classic work of de Sola Price the rich-gets-richer rule is well known as the most important mechanism governing the citation network dynamics. (Un-)Fortunately, it is not sufficient to explain every aspect of bibliometric data. Using the proposed agent-based model for the bibliometric networks we will shed some light on the problem and try to answer the important question stated in the title. Joint work with A. Cena, M. Gagolewski and B. Żogała-Siudem.

2017-04-07 software

stringi 1.1.6 released

Another release of the stringi package for R is on CRAN. The package is one of the most downloaded R extensions and provides a rich set of string processing procedures.


* [WINDOWS SPECIFIC] #270: Strings marked with `latin1` encoding
are now converted internally to UTF-8 using the WINDOWS-1252 codec.
This fixes problems with - among others - displaying the Euro sign.

* [NEW FEATURE] #263: Add support for custom rule-based break iteration,
see `?stri_opts_brkiter`.

* [NEW FEATURE] #267: `omit_na=TRUE` in `stri_sub<-` now ignores missing values
in any of the arguments provided.

* [BUGFIX] fixed unPROTECTed variable names and stack imbalances
as reported by rchk.
2017-10-24 software

TurtleGraphics v1.0-7

A bugfix release of the TurtleGraphics package for R is now available for download from CRAN.


Today I have been awarded a habilitation degree, thesis title: New algorithms for data aggregation and analysis: construction, properties, and applications.

Research Visit @ Deakin University

From July 17 until August 8, 2017 I will be visiting Dr Simon James, Prof. Gleb Beliakov, Dr Tim Wilkin and their colleagues at the School of Information Technology, Deakin University in Burwood, Victoria, Australia. The support by the SEBE Researcher in Residence 2017 Program from Deakin University is fully acknowledged.
2017-07-06 new paper

Measuring Traffic Congestion

Measuring traffic congestion: An approach based on learning weighted inequality, spread and aggregation indices from comparison data has been accepted for publication in Applied Soft Computing. Assigned DOI is 10.1016/j.asoc.2017.07.014. Simon James did wonderful work leading this research project. The paper was written in collaboration with researchers from Deakin University, namely: Gleb Beliakov, Shannon Pace, Nicola Pastorello, Elodie Thilliez, and Rajesh Vasa.
Abstract. As cities increase in size, governments and councils face the problem of designing infrastructure and approaches to traffic management that alleviate congestion. The problem of objectively measuring congestion involves taking into account not only the volume of traffic moving throughout a network, but also the inequality or spread of this traffic over major and minor intersections. For modelling such data, we investigate the use of weighted congestion indices based on various aggregation and spread functions. We formulate the weight learning problem for comparison data and use real traffic data obtained from a medium-sized Australian city to evaluate their usefulness.
2017-06-21 invited talk

Invited Tutorial @ AGOP 2017

Today I gave a tutorial at the 9th International Summer School on Aggregation Operators – AGOP 2017 held at University of Skövde, Sweden.

Abstract. Aggregation theory classically deals with functions to summarize a sequence of numeric values, e.g., in the unit interval. Since the notion of componentwise monotonicity plays a key role in many situations, there is an increasingly growing interest in methods that act on diverse ordered structures.

However, as far as the definition of a mean or an averaging function is concerned, the internality (or at least idempotence) property seems to be of a relatively higher importance than the monotonicity condition. In particular, the Bajraktarević means or the mode are among some well-known non-monotone means.

The concept of a penalty-based function was first investigated by Yager in 1993. In such a framework, we are interested in minimizing the amount of "disagreement" between the inputs and the output being computed; the corresponding aggregation functions are at least idempotent and express many existing means in an intuitive and attractive way.

In this talk I focus on the notion of penalty-based aggregation of sequences of points in R^d, this time for some d≥1. I review three noteworthy subclasses of penalty functions: componentwise extensions of unidimensional ones, those constructed upon pairwise distances between observations, and those defined by measuring the so-called data depth. Then, I discuss their formal properties, which are particularly useful from the perspective of data analysis, e.g., different possible generalizations of internality or equivariances to various geometric transforms. I also point out the difficulties with extending some notions that are key in classical aggregation theory, like the monotonicity property.

2017-05-23 new paper

EUSFLAT'17: Fitting symmetric fuzzy measures for discrete Sugeno integration

A paper by Simon James and me entitled Fitting symmetric fuzzy measures for discrete Sugeno integration has been accepted for publication in the Proceedings of the EUSFLAT conference.
2017-04-20 invited workshop

rOpenSci Text Workshop

This week I'm at the rOpenSci Text Workshop organized by Ken Benoit from the London School of Economics and Political Science. This workshop is designed to bring together the R text package developers' community, to share experiences and knowledge, and hopefully to foster cooperation.
2017-04-07 software

stringi 1.1.5 released

Another bugfix release of the stringi package for R is on its way to CRAN. The package provides powerful string processing facilities to R users and developers and is ranked as one of the most often downloaded R extensions.


* [GENERAL] `stringi` now requires ICU4C >= 52.

* [GENERAL] `stringi` now requires R >= 2.14.

* [BUGFIX] Fixed errors pointed out by `clang-UBSAN` in `stri_brkiter.h`.

* [BUILD TIME] #238, #220: Try "standard" ICU4C build flags if a call
to `pkg-config` fails.

* [BUILD TIME] #258: Use `CXX11` instead of `CXX1X` on R >= 3.4.

* [BUILD TIME, BUGFIX] #254: `dir.exists()` is R >= 3.2.
2017-03-21 software

stringi 1.1.3 released

I have submitted a new (bugfix) release of the stringi package to CRAN.


* [REMOVE DEPRECATED] `stri_install_check()` and `stri_install_icudt()`
marked as deprecated in `stringi` 0.5-5 are no longer being exported.

* [BUGFIX] #227: Incorrect behavior of `stri_sub()` and `stri_sub<-()`
if the empty string was the result.

* [BUILD TIME] #231: The `./configure` (*NIX only) script now reads the
following environment variables: `STRINGI_CFLAGS`, `STRINGI_CPPFLAGS`,
see `INSTALL` for more information.

* [BUILD TIME] #253: call to `R_useDynamicSymbols` added.

* [BUILD TIME] #230: icudt is now being downloaded by
`./configure` (*NIX only) *before* building.

* [BUILD TIME] #242: `_COUNT/_LIMIT` enum constants have been deprecated
as of ICU 58.2, stringi code has been upgraded accordingly.
2017-03-15 new paper

FUZZ-IEEE'17: Two Papers Accepted

Two papers I co-author have been accepted for publication in Proceedings of the FUZZ-IEEE'17 conference that will be held in Naples, Italy.
  • Bartoszuk M., Gagolewski M., Binary aggregation functions in software plagiarism detection
  • Cena A., Gagolewski M., OWA-based linkage and the Genie correction for hierarchical clustering
2016-12-12 new paper

Penalty-Based Aggregation of Multidimensional Data

My paper Penalty-Based Aggregation of Multidimensional Data has been accepted for publication in Fuzzy Sets and Systems (Special Issue on Aggregation Functions).
Abstract. Research in aggregation theory is nowadays still mostly focused on algorithms to summarize tuples consisting of observations in some real interval or of diverse general ordered structures. Of course, in practice of information processing many other data types between these two extreme cases are worth inspecting. This contribution deals with the aggregation of lists of data points in R^d for arbitrary d≥1. Even though particular functions aiming to summarize multidimensional data have been discussed by researchers in data analysis, computational statistics and geometry, there is clearly a need to provide a comprehensive and unified model in which their properties like equivariances to geometric transformations, internality, and monotonicity may be studied at an appropriate level of generality. The proposed penalty-based approach serves as a common framework for all idempotent information aggregation methods, including componentwise functions, pairwise distance minimizers, and data depth-based medians. It also allows for deriving many new practically useful tools.
2016-11-21 new book

Przetwarzanie i analiza danych w języku Python

My book on Python for Data Processing and Analysis is now available in Polish book stores.
2016-11-21 new book

Programowanie w języku R (2nd Ed., revised and extended)

The 2nd edition of my R Programming Book is now available in Polish book stores.

Eusflat'17 Special Session:
Algorithms for Data Aggregation and Fusion

Call for contributions – EUSFLAT 2017 (10th Conference of the European Society for Fuzzy Logic and Technology, Warsaw, Poland) Special Session Algorithms for Data Aggregation and Fusion.
2016-10-27 new paper

Penalty-Based and Other Representations of Economic Inequality

My paper with Gleb Beliakov and Simon James, entitled Penalty-based and other representations of economic inequality, has been accepted for publication in International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems today.
Abstract. Economic inequality measures are employed as a key component in various socio-demographic indices to capture the disparity between the wealthy and poor. Since their inception, they have also been used as a basis for modelling spread and disparity in other contexts. While recent research has identified that a number of classical inequality and welfare functions can be considered in the framework of OWA operators, here we propose a framework of penalty-based aggregation functions and their associated penalties as measures of inequality.
2016-10-14 invited talk

Invited Talk @ European R Users Meeting 2016

Today I gave an invited talk (Genie: A new, fast, and outlier-resistant hierarchical clustering algorithm and its R interface) at the European R Users Meeting that is held in Poznań, Poland.
Abstract. The time needed to apply a hierarchical clustering algorithm is most often dominated by the number of computations of a pairwise dissimilarity measure. Such a constraint, for larger data sets, puts at a disadvantage the use of all the classical linkage criteria but the single linkage one. However, it is known that the single linkage clustering algorithm is very sensitive to outliers, produces highly skewed dendrograms, and therefore usually does not reflect the true underlying data structure - unless the clusters are well-separated.
To overcome its limitations, we proposed a new hierarchical clustering linkage criterion called genie. Namely, our algorithm links two clusters in such a way that a chosen economic inequity measure (e.g., the Gini or Bonferroni index) of the cluster sizes does not increase drastically above a given threshold.
Benchmarks indicate a high practical usefulness of the introduced method: it most often outperforms the Ward or average linkage in terms of the clustering quality while retaining the single linkage speed. The algorithm is easily parallelizable and thus may be run on multiple threads to speed up its execution further on. Its memory overhead is small: there is no need to precompute the complete distance matrix to perform the computations in order to obtain a desired clustering. In this talk we will discuss its reference implementation, included in the genie package for R.
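To make the role of the inequity measure more concrete, here is a minimal sketch in R of the normalised Gini index applied to a vector of cluster sizes. The helper name `gini_index` and the example sizes are assumptions made for illustration; this is not the code used inside the genie package.

```r
# Normalised Gini index of a vector of nonnegative cluster sizes:
# 0 means all clusters are equally large, 1 means one cluster holds everything.
# (Illustrative helper; not the genie package's internal implementation.)
gini_index <- function(sizes) {
    n <- length(sizes)
    stopifnot(n >= 2, all(sizes >= 0), sum(sizes) > 0)
    diffs <- abs(outer(sizes, sizes, "-"))  # all pairwise |c_i - c_j|
    sum(diffs[upper.tri(diffs)]) / ((n - 1) * sum(sizes))
}

balanced <- c(25, 25, 25, 25)
skewed   <- c(1, 1, 1, 97)   # one giant cluster plus three singletons

gini_index(balanced)  # 0
gini_index(skewed)    # 0.96
```

A highly skewed partition, like the ones single linkage tends to produce, scores close to 1, which is the kind of situation the threshold is meant to prevent.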
2016-07-06 invited talk

Invited Plenary Talk @ ISAS 2016

Today I gave a plenary talk at the International Symposium on Aggregation and Structures – ISAS 2016, entitled Penalty-based fusion of complex data, computational aspects, and applications.
Abstract. Since the 1980s, studies of aggregation functions most often focus on the construction and formal analysis of diverse ways to summarize numerical lists with elements in some real interval. Quite recently, we also observe an increasing interest in aggregation of and aggregation on generic partially ordered sets.
However, in many practical applications, we have no natural ordering of given data items. Thus, in this talk we review various aggregation methods in spaces equipped merely with a semimetric (distance). These include the concept of such penalty minimizers as the centroid, 1-median, 1-center, medoid, and their generalizations -- all leading to idempotent fusion functions. Special emphasis is placed on procedures to summarize vectors in R^d for d ≥ 2 (e.g., rows in numeric data frames) as well as character strings (e.g., DNA sequences), but of course the list of other interesting domains could go on forever (rankings, graphs, images, time series, and so on).
We discuss some of their formal properties, exact or approximate (if the underlying optimization task is hard) algorithms to compute them and their applications in clustering and classification tasks.
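As a small illustration of two of the penalty minimizers mentioned above, the following R sketch computes a medoid of a set of strings and a centroid of a set of points. It relies only on base R; the helper name `medoid_index` and the toy data are assumptions for the example.

```r
# Medoid: the observed item that minimises the sum of distances to all others.
# It needs only a dissimilarity matrix, so no ordering of the data is required.
# (Hypothetical helper for illustration.)
medoid_index <- function(d) {
    d <- as.matrix(d)
    which.min(rowSums(d))
}

# Character strings compared with the Levenshtein distance (utils::adist)
dna <- c("ACGT", "ACGA", "ACCT", "TTTT")
dna[medoid_index(utils::adist(dna))]   # "ACGT"

# Vectors in R^d: the centroid (componentwise mean) minimises
# the sum of squared Euclidean distances to the points
X <- rbind(c(0, 0), c(2, 0), c(1, 3))
colMeans(X)                            # 1 1
```

Both results are idempotent in the sense used in the abstract: aggregating several copies of the same item returns that item.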
2016-06-07 new paper

Hierarchical Clustering via Penalty-Based Aggregation and the Genie Approach

The following paper has been accepted for publication in Proceedings of MDAI 2016: Gagolewski M., Cena A., Bartoszuk M., Hierarchical Clustering via Penalty-Based Aggregation and the Genie Approach, Lecture Notes in Artificial Intelligence, Springer, 2016.
Abstract. The paper discusses a generalization of the nearest centroid hierarchical clustering algorithm. A first extension deals with the incorporation of generic distance-based penalty minimizers instead of the classical aggregation by means of centroids. Owing to this, the presented algorithm can be applied in spaces equipped with an arbitrary dissimilarity measure (images, DNA sequences, etc.). Secondly, a correction preventing the formation of clusters of too highly unbalanced sizes is applied: just like in the recently introduced Genie approach, which extends the single linkage scheme, the new method averts a chosen inequity measure (e.g., the Gini-, de Vergottini-, or Bonferroni-index) of cluster sizes from rising above a predefined threshold. Numerous benchmarks indicate that the introduction of such a correction increases the quality of the resulting clusterings.
2016-05-30 software

stringi 1.1.1 released

stringi is among the top 10 most downloaded R packages, providing various string processing facilities. A new release comes with a few bugfixes and new features.
* [BUGFIX] #214: allow a regex pattern like `.*`  to match an empty string.

* [BUGFIX] #210: `stri_replace_all_fixed(c("1", "NULL"), "NULL", NA)`
now results in `c("1", NA)`.

* [NEW FEATURE] #199: `stri_sub<-` now allows for ignoring `NA` locations
(a new `omit_na` argument added).

* [NEW FEATURE] #207: `stri_sub<-` now allows for substring insertions
(via `length=0`).

* [NEW FUNCTION] #124: `stri_subset<-` functions added.

* [NEW FEATURE] #216: `stri_detect`, `stri_subset`, `stri_subset<-` gained
a `negate` argument.

* [NEW FUNCTION] #175: `stri_join_list` concatenates all strings
in a list of character vectors. Useful with, e.g., `stri_extract_all_regex`,
`stri_extract_all_words` etc.
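
A few of the changes listed above can be tried out as follows. This is a short sketch that assumes a stringi version of at least 1.1.1 is installed; the input strings are made up for the example.

```r
library(stringi)

# #210: NA as a replacement now yields NA for the matching element
stri_replace_all_fixed(c("1", "NULL"), "NULL", NA)   # "1" NA

# #216: negate=TRUE inverts the match
x <- c("apple", "banana", "cherry")
stri_detect_fixed(x, "an", negate = TRUE)            # TRUE FALSE TRUE
stri_subset_fixed(x, "an", negate = TRUE)            # "apple" "cherry"

# #175: concatenate the strings within each list element
stri_join_list(list(c("a", "b"), "c"))               # "ab" "c"
```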
2016-05-09 new paper

Paper on the Genie Clustering Algorithm

The following paper has been accepted for publication in Information Sciences: Gagolewski M., Bartoszuk M., Cena A., Genie: A new, fast, and outlier-resistant hierarchical clustering algorithm, 2016. It describes the Genie algorithm available through the genie package for R. The article has been assigned the DOI 10.1016/j.ins.2016.05.003.
Abstract. The time needed to apply a hierarchical clustering algorithm is most often dominated by the number of computations of a pairwise dissimilarity measure. Such a constraint, for larger data sets, puts at a disadvantage the use of all the classical linkage criteria but the single linkage one. However, it is known that the single linkage clustering algorithm is very sensitive to outliers, produces highly skewed dendrograms, and therefore usually does not reflect the true underlying data structure – unless the clusters are well-separated. To overcome its limitations, we propose a new hierarchical clustering linkage criterion called Genie. Namely, our algorithm links two clusters in such a way that a chosen economic inequity measure (e.g., the Gini- or Bonferroni-index) of the cluster sizes does not increase drastically above a given threshold. The presented benchmarks indicate a high practical usefulness of the introduced method: it most often outperforms the Ward or average linkage in terms of the clustering quality while retaining the single linkage speed. The Genie algorithm is easily parallelizable and thus may be run on multiple threads to speed up its execution further on. Its memory overhead is small: there is no need to precompute the complete distance matrix to perform the computations in order to obtain a desired clustering. It can be applied on arbitrary spaces equipped with a dissimilarity measure, e.g., on real vectors, DNA or protein sequences, images, rankings, informetric data, etc. A reference implementation of the algorithm has been included in the open source genie package for R.
2016-03-09 new paper

Proc. IPMU'2016: 3 Papers Accepted

Three papers which I co-author: Fitting aggregation functions to data: Part I – Linearization and regularization, Fitting aggregation functions to data: Part II – Idempotization (co-authors: Maciej Bartoszuk, Gleb Beliakov, Simon James), and Fuzzy k-minpen clustering and k-nearest-minpen classification procedures incorporating generic distance-based penalty minimizers (co-author: Anna Cena) have been accepted for the IPMU 2016 conference.

1st paper:

Abstract. The use of supervised learning techniques for fitting weights and/or generator functions of weighted quasi-arithmetic means – a special class of idempotent and nondecreasing aggregation functions – to empirical data has already been considered in a number of papers. Nevertheless, there are still some important issues that have not been discussed in the literature yet. In the first part of this two-part contribution we deal with the concept of regularization, a quite standard technique from machine learning applied so as to increase the fit quality on test and validation data samples. Due to the constraints on the weighting vector, it turns out that quite different methods can be used in the current framework, as compared to regression models. Moreover, it is worth noting that so far fitting weighted quasi-arithmetic means to empirical data has only been performed approximately, via the so-called linearization technique. In this paper we consider exact solutions to such special optimization tasks and indicate cases where linearization leads to much worse solutions.

Keywords. Aggregation functions, weighted quasi-arithmetic means, least squares fitting, regularization, linearization
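
For readers unfamiliar with the class of functions being fitted: a weighted quasi-arithmetic mean has the form φ⁻¹(Σ wᵢ φ(xᵢ)), with nonnegative weights summing to 1 and a continuous, strictly monotone generator φ. The R sketch below only evaluates such a mean for given weights and generator; it does not reproduce the fitting, regularization, or linearization procedures of the paper, and the helper name `wqa_mean` is an assumption.

```r
# Weighted quasi-arithmetic mean: phi_inv(sum_i w_i * phi(x_i)).
# w must be nonnegative and sum to 1; phi is a continuous, strictly monotone
# generator. (Hypothetical helper, evaluation only; no fitting is done here.)
wqa_mean <- function(x, w, phi = identity, phi_inv = identity) {
    stopifnot(length(x) == length(w), all(w >= 0), abs(sum(w) - 1) < 1e-9)
    phi_inv(sum(w * phi(x)))
}

x <- c(1, 4)
w <- c(0.5, 0.5)
wqa_mean(x, w)                                  # arithmetic mean: 2.5
wqa_mean(x, w, phi = log, phi_inv = exp)        # geometric mean: 2 (up to rounding)
wqa_mean(c(3, 3), w, phi = log, phi_inv = exp)  # idempotence: 3 (up to rounding)
```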

2nd paper:

Abstract. The use of supervised learning techniques for fitting weights and/or generator functions of weighted quasi-arithmetic means – a special class of idempotent and nondecreasing aggregation functions – to empirical data has already been considered in a number of papers. Nevertheless, there are still some important issues that have not been discussed in the literature yet. In the second part of this two-part contribution we deal with a quite common situation in which we have inputs coming from different sources, describing a similar phenomenon, but which have not been properly normalized. In such a case, idempotent and nondecreasing functions cannot be used to aggregate them unless proper pre-processing is performed. The proposed idempotization method, based on the notion of B-splines, allows for an automatic calibration of independent variables. The introduced technique is applied in an R source code plagiarism detection system.

Keywords. Aggregation functions, weighted quasi-arithmetic means, least squares fitting, idempotence

3rd paper:

Abstract. We discuss a generalization of the fuzzy (weighted) k-means clustering procedure and point out its relationships with data aggregation in spaces equipped with arbitrary dissimilarity measures. In the proposed setting, a data set partitioning is performed based on the notion of points' proximity to generic distance-based penalty minimizers. Moreover, a new data classification algorithm, resembling the k-nearest neighbors scheme but less computationally and memory demanding, is introduced. Rich examples in complex data domains indicate the usability of the methods and aggregation theory in general.

Keywords. fuzzy k-means algorithm, clustering, classification, fusion functions, penalty minimizers

2016-03-07 software

The genie Package for R

A New, Fast, and Outlier Resistant Hierarchical Clustering Algorithm called Genie is now available via the genie package for R (co-authors: Maciej Bartoszuk and Anna Cena). A detailed description of the algorithm will be available in a forthcoming paper of ours.
2015-12-31 new book

Data Fusion Book Now Available

My book Data Fusion: Theory, Methods, and Applications is now available (click me).
Data Fusion: Theory, Methods, and Applications - cover
2015-12-13 new paper

Accepted Paper in IEEE TFS

A short paper entitled H-index and other Sugeno integrals: Some defects and their compensation, by Radko Mesiar and Marek Gagolewski, has been accepted for publication in IEEE Transactions on Fuzzy Systems.
Abstract: The famous Hirsch index was introduced just ca. 10 years ago. Despite that, it is already widely used in many decision making tasks, like in evaluation of individual scientists, research grant allocation, or even production planning. It is known that the h-index is related to the discrete Sugeno integral and the Ky Fan metric introduced in the 1940s. The aim of this paper is to propose a few modifications of this index as well as other fuzzy integrals -- also on bounded chains -- that lead to better discrimination of some types of data that are to be aggregated. All of the suggested compensation methods try to retain the simplicity of the original measure.
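The following R sketch shows the h-index in its usual counting form and in an equivalent max-min form, which is what links it to the Sugeno integral. The function names and citation counts are assumptions for illustration; the modified indices proposed in the paper are not reproduced here.

```r
# h-index: the largest h such that at least h papers have >= h citations each.
h_index <- function(citations) {
    x <- sort(citations, decreasing = TRUE)
    sum(x >= seq_along(x))
}

# The same value written as a max-min expression:
# max over i of min(i, i-th largest citation count).
h_index_maxmin <- function(citations) {
    x <- sort(citations, decreasing = TRUE)
    max(0, pmin(seq_along(x), x))
}

cit <- c(10, 8, 5, 4, 3)
h_index(cit)         # 4
h_index_maxmin(cit)  # 4

# Very different records can share the same value:
h_index(c(4, 4, 4, 4))          # 4
h_index(c(100, 100, 100, 100))  # 4
```

The last two calls show one kind of tie that a more discriminating index would be expected to break.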
2015-12-01 new paper

Accepted Paper in European Physical Journal B

Agent-based model for the h-index – Exact solution by Żogała-Siudem B., Siudem G., Cena A., and Gagolewski M. has been accepted for publication in European Physical Journal B (assigned doi:10.1140/epjb/e2015-60757-1).
Abstract: Hirsch's h-index is perhaps the most popular citation-based measure of scientific excellence. In 2013 G. Ionescu and B. Chopard proposed an agent-based model for this index to describe a publications and citations generation process in an abstract scientific community. With such an approach one can simulate a single scientist's activity, and by extension investigate the whole community of researchers. Even though this approach predicts quite well the h-index from bibliometric data, only a solution based on simulations was given. In this paper, we complete their results with exact, analytic formulas. What is more, due to our exact solution we are able to simplify the Ionescu-Chopard model which allows us to obtain a compact formula for the h-index. Moreover, a simulation study designed to compare both, approximated and exact, solutions is included. The last part of this paper presents evaluation of the obtained results on a real-world data set.

IPMU 2016 Special Session:
Computational Aspects of Data Aggregation and Complex Data Fusion

We are happy to invite you to submit your contribution(s) to the special session entitled Computational Aspects of Data Aggregation and Complex Data Fusion within the 16th International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems (IPMU 2016) that will be held on June 20-24, 2016 in Eindhoven, The Netherlands.

Important dates:

  • Paper submission: January 8, 2016
  • Notification of acceptance/rejection: March 1, 2016
  • Camera-ready papers: March 31, 2016

The proceedings of IPMU 2016 will be published in Communications in Computer and Information Science (CCIS) with Springer. Papers must be prepared in the LNCS/CCIS one-column page format. The length of papers is 12 pages in this special LaTeX2e format. For the details of submission click here.

Please feel free to disseminate this information to other researchers that may potentially be interested in the session. For the details on the Session click here.
2015-10-22 software

stringi 1.0-1 Now on CRAN

Notable changes since v0.5-2:

* [GENERAL] #88: C++ API is now available for use in, e.g., Rcpp packages, see for an example.

* [BUGFIX] #183: Floating point exception raised in `stri_sub()` and
`stri_sub<-()` when `to` or `length` was a zero-length numeric vector.

* [BUGFIX] #180: `stri_c()` warned incorrectly (recycling rule) when using more
than 2 elements.
2015-09-23 new paper

Accepted Paper in Journal of Applied Statistics

"How to improve a team's position in the FIFA ranking – A simulation study" by Lasek J., Szlavik Z., Gagolewski M., and Bhulai S. has been accepted for publication in Journal of Applied Statistics.
Abstract: In this paper, we study the efficacy of the official ranking for international football teams compiled by FIFA, the body governing football competition around the globe. We present strategies for improving a team's position in the ranking. By combining several statistical techniques we derive an objective function in a decision problem of optimal scheduling of future matches. The presented results display how a team's position can be improved. Along the way, we compare the official procedure to the famous Elo rating system. Although it originates from chess, it has been successfully tailored to ranking football teams as well.
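For context, the classical Elo update mentioned in the abstract can be sketched in a few lines of R. This is the textbook chess formulation with an arbitrarily chosen K-factor; the football-specific variants and the FIFA procedure compared in the paper are not reproduced, and the function names are assumptions.

```r
# Expected score of team A against team B under the standard Elo model
elo_expected <- function(r_a, r_b) 1 / (1 + 10^((r_b - r_a) / 400))

# Rating update after one match; score is 1 (win), 0.5 (draw), or 0 (loss).
# k = 20 is an arbitrary illustrative choice.
elo_update <- function(r_a, r_b, score, k = 20) {
    r_a + k * (score - elo_expected(r_a, r_b))
}

elo_expected(1500, 1500)    # 0.5
elo_update(1500, 1500, 1)   # 1510
elo_expected(1900, 1500)    # about 0.909
```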
2015-09-18 award

Scholarship for Outstanding Young Scientists

I am happy to announce that I have been awarded a scholarship for outstanding young scientists from the Ministry of Science and Higher Education, Republic of Poland (36 months). According to the Ministry, scholarships are awarded to scientists below the age of 35, who conduct high-quality research and have impressive scientific achievements. Here is the complete list of laureates (in Polish).
2015-06-22 software

stringi 0.5-2 Now on CRAN

A new release of the stringi package is available on CRAN. As of now, about 850 CRAN packages depend (either directly or recursively) on stringi. And quite recently, the package got listed among the top downloaded R extensions.

Notable changes since v0.4-1:

* [BACKWARD INCOMPATIBILITY] The second argument to `stri_pad_*()` has
been renamed `width`.

* [GENERAL] #69: `stringi` is now bundled with ICU4C 55.1.

* [NEW FUNCTIONS] `stri_extract_*_boundaries()` extract text between text
boundaries.

* [NEW FUNCTION] #46: `stri_trans_char()` is a `stringi`-flavoured
`chartr()` equivalent.

* [NEW FUNCTION] #8: `stri_width()` approximates the *width* of a string
in a more Unicodish fashion than `nchar(..., "width")`

* [NEW FEATURE] #149: `stri_pad()` and `stri_wrap()` are now by default based on
code point widths instead of the number of code points. Moreover, the default
behavior of `stri_wrap()` is now such that it does not get rid
of non-breaking, zero width, etc. spaces.

* [NEW FEATURE] #133: `stri_wrap()` silently allows for `width <= 0`
(for compatibility with `strwrap()`).

* [NEW FEATURE] #139: `stri_wrap()` gained a new argument: `whitespace_only`.

* [NEW FUNCTIONS] #137: date-time formatting/parsing:
    * `stri_timezone_list()` - lists all known time zone identifiers
    * `stri_timezone_set()`, `stri_timezone_get()` - manage current default time zone
    * `stri_timezone_info()` - basic information on a given time zone
    * `stri_datetime_symbols()` - localizable date-time formatting data
    * `stri_datetime_fstr()` - convert a `strptime`-like format string
    to an ICU date/time format string
    * `stri_datetime_format()` - convert date/time to string
    * `stri_datetime_parse()` - convert string to date/time object
    * `stri_datetime_create()` - construct date-time objects
    from numeric representations
    * `stri_datetime_now()` - return current date-time
    * `stri_datetime_fields()` - get values for date-time fields
    * `stri_datetime_add()` - add specific number of date-time units
    to a date-time object

* [GENERAL] #144: Performance improvements in handling ASCII strings
(these affect `stri_sub()`, `stri_locate()` and other string index-based
operations).

* [GENERAL] #143: Searching for short fixed patterns (`stri_*_fixed()`) now
relies on the current `libC`'s implementation of `strchr()` and `strstr()`.
This is very fast e.g. on `glibc` utilizing the `SSE2/3/4` instruction set.

* [BUILD TIME] #141: a local copy of `icudt*.zip` may be used on package
install; see the `INSTALL` file for more information.

* [BUILD TIME] #165: the `./configure` option `--disable-icu-bundle`
forces the use of system ICU when building the package.

* [BUGFIX] locale specifiers are now normalized in a more intelligent way:
e.g. `@calendar=gregorian` expands to `DEFAULT_LOCALE@calendar=gregorian`.

* [BUGFIX] #134: `stri_extract_all_words()` did not accept `simplify=NA`.

* [BUGFIX] #132: incorrect behavior in `stri_locate_regex()` for matches
of zero lengths

* [BUGFIX] stringr/#73: `stri_wrap()` returned `CHARSXP` instead of `STRSXP`
on empty string input with `simplify=FALSE` argument.

* [BUGFIX] #164: `libicu-dev` usage used to fail on Ubuntu
(`LIBS` shall be passed after `LDFLAGS` and the list of `.o` files).

* [BUGFIX] #168: Build now fails if `icudt` is not available.

* [BUGFIX] #135: C++11 is now used by default (see the `INSTALL` file,
however) to build `stringi` from sources. This is because ICU4C uses the
`long long` type which is not part of the C++98 standard.

* [BUGFIX] #154: Dates and other objects with a custom class attribute
were not coerced to the character type correctly.

* [BUGFIX] Force ICU `u_init()` call on `stringi` dynlib load.

* [BUGFIX] #157: many overfull hboxes in the package PDF manual have been
corrected.
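
Some of the user-visible changes in this release can be checked with a short R session such as the one below. It is a sketch assuming stringi 0.5-2 or later; the input strings are made up for the example.

```r
library(stringi)

# The second argument of stri_pad_*() is now called `width`
stri_pad_left("7", width = 3, pad = "0")   # "007"

# stri_trans_char(): a chartr()-like character-by-character translation
stri_trans_char("id.123", ".", "_")        # "id_123"

# stri_width(): display width rather than the number of code points
stri_length("\u4e2d\u6587")                # 2 code points
stri_width("\u4e2d\u6587")                 # 4 columns (wide CJK characters)
```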
2015-04-30 software

stringr Now Powered by stringi

I'm happy to announce that starting from the 1.0.0 release, the stringr package for R is now powered by stringi. For more details, see here.
2015-04-17 new paper

AGOP 2015: Two Papers Accepted

Two papers which I authored have been accepted for the AGOP 2015 workshop in Katowice, Poland.
  • Cena A., Gagolewski M., Aggregation and soft clustering of informetric data
  • Gagolewski M., Some issues in aggregation of multidimensional data

Postdoctoral Research Visit @ IRAFM in Ostrava

Today I start a 2-month research visit at the Institute for Research and Applications of Fuzzy Modeling, University of Ostrava, Czech Republic. My stay is supported by the European Union European Social Fund, Project UDA-POKL.04.01.01-00-051/10-00 Information technologies: Research and their interdisciplinary applications.
2015-03-25 new paper

IFSA-EUSFLAT 2015: 4 Papers Accepted

Four papers which I author or coauthor have been accepted for the IFSA-EUSFLAT 2015 conference in Gijon, Spain.

  • Cena A., Gagolewski M., A K-means-like algorithm for informetric data clustering
  • Bartoszuk M., Gagolewski M., Detecting similarity of R functions via a fusion of multiple heuristic methods
  • Gagolewski M., Lasek J., Learning experts' preferences from informetric data
  • Gagolewski M., Normalized WDpWAM and WDpOWA spread measures
2015-02-06 new paper

Accepted Paper in Journal of Informetrics

The paper Cena A., Gagolewski M., Mesiar R., Problems and challenges of information resources producers' clustering, Journal of Informetrics, 2015, doi:10.1016/j.joi.2015.02.005, has been accepted for publication.

Abstract: Classically, unsupervised machine learning techniques are applied on data sets with fixed number of attributes (variables). However, many problems encountered in the field of informetrics face us with the need to extend these kinds of methods in a way such that they may be computed over a set of nonincreasingly ordered vectors of unequal lengths. Thus, in this paper, some new dissimilarity measures (metrics) are introduced and studied. Owing to that we may use i.a. hierarchical clustering algorithms in order to determine an input data set's partition consisting of sets of producers that are homogeneous not only with respect to the quality of information resources, but also their quantity.
2014-12-14 software

stringi_0.4-1 Released

Yet another official release of the stringi package for R is on CRAN now. This time we particularly focused on a better compatibility of stringi with stringr.

Notable changes since v0.3-1:

* [IMPORTANT CHANGE] `n_max` argument in `stri_split_*()` has been renamed `n`.

* [IMPORTANT CHANGE] `simplify=FALSE` in `stri_extract_all_*()` and
`stri_split_*()` now calls `stri_list2matrix()` with `fill=""`.
`fill=NA_character_` may be obtained by using `simplify=NA`.

* [IMPORTANT CHANGE, NEW FUNCTIONS] #120: `stri_extract_words` has been
renamed `stri_extract_all_words` and `stri_locate_boundaries` -
`stri_locate_all_boundaries` as well as `stri_locate_words` -
`stri_locate_all_words`. New functions are now available:
`stri_locate_first_boundaries`, `stri_locate_last_boundaries`,
`stri_locate_first_words`, `stri_locate_last_words`,
`stri_extract_first_words`, `stri_extract_last_words`.

* [IMPORTANT CHANGE] #111: `opts_regex`, `opts_collator`, `opts_fixed`, and
`opts_brkiter` can now be supplied individually via `...`.
In other words, you may now simply call e.g.
`stri_detect_regex(str, pattern, case_insensitive=TRUE)` instead of
`stri_detect_regex(str, pattern, opts_regex=stri_opts_regex(case_insensitive=TRUE))`.

* [NEW FEATURE] #110: Fixed pattern search engine's settings can
now be supplied via `opts_fixed` argument in `stri_*_fixed()`,
see `stri_opts_fixed()`. A simple (not suitable for natural language
processing) yet very fast `case_insensitive` pattern matching can be
performed now. `stri_extract_*_fixed` is again available.

* [NEW FEATURE] #23: `stri_extract_all_fixed`, `stri_count`, and
`stri_locate_all_fixed` may now also look for overlapping pattern
matches, see `?stri_opts_fixed`.

* [NEW FEATURE] #129: `stri_match_*_regex` gained a `cg_missing` argument.

* [NEW FEATURE] #117: `stri_extract_all_*()`, `stri_locate_all_*()`,
`stri_match_all_*()` gained a new argument: `omit_no_match`.
Setting it to `TRUE` makes these functions compatible with their
`stringr` equivalents.

* [NEW FEATURE] #118: `stri_wrap()` gained `indent`, `exdent`, `initial`,
and `prefix` arguments. Moreover Knuth's dynamic word wrapping algorithm
now assumes that the cost of printing the last line is zero, see #128.

* [NEW FEATURE] #122: `stri_subset()` gained an `omit_na` argument.

* [NEW FEATURE] `stri_list2matrix()` gained an `n_min` argument.

* [NEW FEATURE] #126: `stri_split()` now is also able to act
just like `stringr::str_split_fixed()`.

* [NEW FEATURE] #119: `stri_split_boundaries()` now has
`n`, `tokens_only`, and `simplify` arguments. Additionally,
`stri_extract_all_words()` is now equipped with a `simplify` arg.

* [NEW FEATURE] #116: `stri_paste()` gained a new argument:
`ignore_null`. Setting it to `TRUE` makes this function more compatible
with `paste()`.

* [NEW FEATURE] #114: `stri_paste()`: `ignore_null` arg has been added.

* [OTHER] #123: `useDynLib` is used to speed up symbol look-up in
the compiled dynamic library.

* [BUGFIX]  #94: Run-time errors on Solaris caused by setting
`-DU_DISABLE_RENAMING=1` -- memory allocation errors in i.a. ICU's
UnicodeString. This setting also caused some ABSan sanity check
failures within ICU code.
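
The snippet below illustrates several of the changes from this release. It is a sketch assuming stringi 0.4-1 or later, with made-up inputs.

```r
library(stringi)

# #111: engine options can be passed directly via `...`
stri_detect_regex("ABC", "abc", case_insensitive = TRUE)   # TRUE

# #110: fast (not linguistically aware) case-insensitive fixed search
stri_detect_fixed("ABC", "abc", case_insensitive = TRUE)   # TRUE

# #23: overlapping matches of a fixed pattern
stri_count_fixed("aaaa", "aa")                             # 2
stri_count_fixed("aaaa", "aa", overlap = TRUE)             # 3

# #117: stringr-like result for strings with no match
stri_extract_all_regex(c("a1", "b"), "[0-9]", omit_no_match = TRUE)
# list("1", character(0))

# `n` (formerly `n_max`) limits the number of pieces
stri_split_fixed("a,b,c", ",", n = 2)                      # list(c("a", "b,c"))
```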
2014-11-22 grant

Research Project 2014/13/D/HS4/01700 (NCN)

My research project Construction and analysis of methods of information resources producers' quality management will receive funding from the National Science Centre (NCN), Poland via the SONATA funding scheme (host institution=Systems Research Institute, Polish Academy of Sciences, role=principal investigator, years=2015-2017).
2014-11-06 software

stringi_0.3-1 Released

The third official release of the stringi package for R is on CRAN now.

Notable changes since v0.2-5:

* [IMPORTANT CHANGE] #87: `%>%` overlapped with the pipe operator from
the `magrittr` package; now each operator like `%>%` has been renamed `%s>%`.

* [IMPORTANT CHANGE] #108: Now the BreakIterator (for text boundary analysis)
may be better controlled via `stri_opts_brkiter()` (see options `type`
and `locale` which aim to replace now-removed `boundary` and `locale` parameters
to `stri_locate_boundaries`, `stri_split_boundaries`, `stri_trans_totitle`,
`stri_extract_words`, `stri_locate_words`).

* [NEW FUNCTIONS] #109: `stri_count_boundaries` and `stri_count_words`
count the number of text boundaries in a string.

* [NEW FUNCTIONS] #41: `stri_startswith_*` and `stri_endswith_*`
determine whether a string starts or ends with a given pattern.

* [NEW FEATURE] #102: `stri_replace_all_*` gained a `vectorize_all` parameter,
which defaults to TRUE for backward compatibility.

* [NEW FUNCTION] #91: `stri_subset_*`, a convenient and more efficient
substitute for `str[stri_detect_*(str, ...)]`, added.

* [NEW FEATURE] #100: `stri_split_fixed`, `stri_split_charclass`,
`stri_split_regex`, `stri_split_coll` gained a `tokens_only` parameter,
which defaults to `FALSE` for backward compatibility.

* [NEW FUNCTION] #105: `stri_list2matrix` converts lists of atomic vectors
to character matrices, useful in connection with `stri_split`
and `stri_extract`.

* [NEW FEATURE] #107: `stri_split_*` now allow setting an `omit_empty=NA` argument.

* [NEW FEATURE] #106: `stri_split` and `stri_extract_all` gained a `simplify`
argument (if `TRUE`, then `stri_list2matrix(..., byrow=TRUE)`
is called on the resulting list).

* [NEW FUNCTION] #77: `stri_rand_lipsum` generates
(pseudo)random dummy *lorem ipsum* text.

* [NEW FEATURE] #98: `stri_trans_totitle` gained an `opts_brkiter`
parameter; it indicates which ICU BreakIterator should be used when
performing case mapping.

* [NEW FEATURE] `stri_wrap` gained a new parameter: `normalize`.

* [BUGFIX] #86: `stri_*_fixed`, `stri_*_coll`, and `stri_*_regex` could
give incorrect results if one of the search strings was of length 0.

* [BUGFIX] #99: `stri_replace_all` did not use the `replacement` arg.

* [BUGFIX] #94: `R CMD check` should no longer fail if `icudt` download failed.

* [BUGFIX] #112: Some of the objects were not PROTECTed from
being garbage collected, which might have caused spontaneous SEGFAULTS.

* [BUGFIX] Some collator's options were not passed correctly to ICU services.

* [BUGFIX] Memory leaks detected by
`valgrind --tool=memcheck --leak-check=full` have been removed.

* [DOCUMENTATION] Significant extensions/clean ups in the stringi manual.

Refer to NEWS for a complete list of changes, new features and bug fixes.
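
A quick way to get a feel for the new functions is the R session sketched below. It assumes stringi 0.3-1 or later; all inputs are made up for the example.

```r
library(stringi)

# #87: comparison operators no longer clash with magrittr's %>%
"b" %s>% "a"                                       # TRUE

# #41: prefix and suffix tests
stri_startswith_fixed("stringi", "str")            # TRUE
stri_endswith_fixed("stringi", "gi")               # TRUE

# #91: keep only the strings that match
stri_subset_regex(c("a1", "b", "c3"), "[0-9]")     # "a1" "c3"

# #102: apply all pattern/replacement pairs to each string in turn
stri_replace_all_fixed("abc", c("a", "b"), c("1", "2"),
                       vectorize_all = FALSE)      # "12c"

# #109: count words
stri_count_words("Lorem ipsum dolor")              # 3

# #105: list of vectors -> character matrix (short rows are padded with NA)
m <- stri_list2matrix(list(c("a", "b"), "c"), byrow = TRUE)
m
```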


Advanced Data Analysis Software Development with R (e-learning @ ICS PAS)

My Advanced Data Analysis Software Development with R e-learning course has just started. It is run on the educational platform of the Interdisciplinary PhD studies program hosted by the Institute of Computer Science, Polish Academy of Sciences. The project is co-financed by the Human Capital Operational Programme, European Social Fund. Batch 02 of the course will start in February/March 2014.
2014-10-01 software

FuzzyNumbers_0.3-5 Now Available

A maintenance release of the FuzzyNumbers package for R is now available on CRAN. CHANGELOG:
* added proper import directives in NAMESPACE
* piecewiseLinearApproximation: method="ApproximateNearestEuclidean"
no longer accepted; use "NearestEuclidean" instead.
* package vignette now in the vignettes/ directory.
2014-08-22 new paper

Spread Measures and Their Relation to Aggregation Functions – Accepted Paper

The paper Gagolewski M., Spread measures and their relation to aggregation functions has just been accepted for publication in European Journal of Operational Research.
Abstract: The theory of aggregation most often deals with measures of central tendency. However, sometimes a very different kind of a numeric vector's synthesis into a single number is required. In this paper we introduce a class of mathematical functions which aim to measure spread or scatter of one-dimensional quantitative data. The proposed definition serves as a common, abstract framework for measures of absolute spread known from statistics, exploratory data analysis and data mining, e.g. the sample variance, standard deviation, range, interquartile range (IQR), median absolute deviation (MAD), etc. Additionally, we develop new measures of experts' opinions diversity or consensus in group decision making problems. We investigate some properties of spread measures, show how they are related to aggregation functions, and indicate their new potentially fruitful application areas.
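The classical measures named in the abstract are all available in base R. The sketch below checks two properties commonly expected of measures of absolute spread, namely that they vanish on constant vectors and do not change when the data are shifted. These checks are an illustration chosen here, not the formal definition given in the paper, and the sample vector is made up.

```r
# A few classical measures of absolute spread from base R (stats)
spread <- list(
    variance = var,
    sd       = sd,
    range    = function(x) diff(range(x)),
    IQR      = IQR,
    MAD      = mad
)

x <- c(1, 2, 4, 7, 11)

# All of them vanish on a constant vector ...
at_const <- sapply(spread, function(f) f(rep(5, 4)))
at_const

# ... and are unaffected by shifting the data
at_x       <- sapply(spread, function(f) f(x))
at_shifted <- sapply(spread, function(f) f(x + 10))
all.equal(at_x, at_shifted)   # TRUE
```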
2014-06-17 new paper

IEEE IS'14 Proceedings Paper Accepted

Accepted for publication in Proc. IEEE IS 2014: Gagolewski M., Lasek J., The use of fuzzy relations in the assessment of information resources producers' performance.
2014-05-26 new paper

SMPS'14 Proceedings Paper Accepted

Accepted for publication in Proc. SMPS 2014: Gagolewski M., Sugeno integral-based confidence intervals for the theoretical h-index.
2014-05-14 software

stringi_0.2-3 Released

The second official release of the stringi package for R is on CRAN now.

Notable changes since v0.1-25:

* [IMPORTANT CHANGE] stri_cmp* now do not allow for passing opts_collator=NA.
From now on, stri_cmp_eq, stri_cmp_neq, and the new operators
%===%, %!==%, %stri===%, and %stri!==% are locale-independent operations,
which are based on code point comparisons. New functions stri_cmp_equiv
and stri_cmp_nequiv (and from now on also %==%, %!=%, %stri==%,
and %stri!=%) test for canonical equivalence.

* [IMPORTANT CHANGE] stri_*_fixed search functions now perform
a locale-independent exact (bytewise, of course after conversion to UTF-8)
pattern search. All the Collator-based, locale-dependent search routines
are now available via stri_*_coll. The reason for this is that
ICU USearch has currently very poor performance and in many search tasks
in fact it is sufficient to do exact pattern matching.

* [IMPORTANT CHANGE] stri_enc_nf* and stri_enc_isnf* function families
have been renamed to stri_trans_nf* and stri_trans_isnf*, respectively.
This is because they deal with text transforming, and not with character
encoding. Moreover, all such operations may be performed by
ICU's Transliterator (see below).

* [IMPORTANT CHANGE] stri_*_charclass search functions now
rely solely on ICU's UnicodeSet patterns. All previously accepted
charclass identifiers became invalid. However, new patterns
should now be more familiar to the users (they are regex-like).
Moreover, we observe a very nice performance gain.

* [IMPORTANT CHANGE] stri_sort now does not include NAs
in output vectors by default, for compatibility with sort().
Moreover, currently none of the input vector's attributes are preserved.

* [NEW FUNCTIONS] stri_trans_general and stri_trans_list give access
to ICU's Transliterator, which may be used to perform very general
text transforms.

* [NEW FUNCTION] stri_split_boundaries utilizes ICU's BreakIterator
to split strings at specific text boundaries. Moreover,
stri_locate_boundaries indicates positions of these boundaries.

* [NEW FUNCTION] stri_extract_words uses ICU's BreakIterator to
extract all words from a text. Additionally, stri_locate_words
locates start and end positions of words in a text.

* [NEW FUNCTIONS] stri_pad, stri_pad_left, stri_pad_right, and stri_pad_both
pad a string with a specific code point.

* [NEW FUNCTION] stri_wrap breaks paragraphs of text into lines.
Two algorithms (greedy and minimal-raggedness) are available.

* [NEW FUNCTION] stri_unique extracts unique elements from
a character vector.

* [NEW FUNCTIONS] stri_duplicated and stri_duplicated_any
determine duplicate elements in a character vector.

* [NEW FUNCTION] stri_replace_na replaces NAs in a character vector
with a given string, useful for emulating e.g. R's paste() behavior.

* [NEW FUNCTION] stri_rand_shuffle generates a random permutation
of code points in a string.

* [NEW FUNCTION] stri_rand_strings generates random strings.

* [NEW FUNCTIONS] New functions and binary operators for string comparison:
stri_cmp_eq, stri_cmp_neq, stri_cmp_lt, stri_cmp_le, stri_cmp_gt,
stri_cmp_ge, %==%, %!=%, %<%, %<=%, %>%, %>=%.

* [NEW FUNCTION] stri_enc_mark reads declared encodings of character strings
as seen by stringi.

* [NEW FUNCTION] stri_enc_tonative(str) is an alias to
stri_encode(str, NULL, NULL).

* [NEW FEATURE] stri_order and stri_sort now have an additional argument
`na_last` (defaults to TRUE and NA, respectively).

* [NEW FEATURE] stri_replace_all_charclass now has `merge` arg
(defaults to FALSE for backward-compatibility). It may be used
to e.g. replace sequences of white spaces with a single space.

* [NEW FEATURE] stri_enc_toutf8 now has a new `validate` arg (defaults
to FALSE for backward-compatibility). It may be used in a (rare) case
in which a user wants to fix an invalid UTF-8 byte sequence.
stri_length (among others) now detects invalid UTF-8 byte sequences.

* [NEW FEATURE] All binary operators %???% now also have aliases %stri???%.

* stri_*_fixed now use a tweaked Knuth-Morris-Pratt search algorithm,
which improves the search performance drastically.

* Significant performance improvements in stri_join, stri_flatten,
stri_cmp, stri_trans_to*, and others.
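To illustrate the kind of exact, bytewise search that the stri_*_fixed entries above refer to, here is a textbook Knuth-Morris-Pratt sketch in Python. It is not stringi's implementation (the "tweaked" variant is not described here), and the names `kmp_failure` and `kmp_find_all` are hypothetical. Both strings are converted to UTF-8 first, so the reported positions are byte offsets.

```python
def kmp_failure(pattern: bytes) -> list[int]:
    """fail[i] = length of the longest proper prefix of pattern[:i + 1]
    that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_find_all(text: str, pattern: str) -> list[int]:
    """Return UTF-8 byte offsets of all (possibly overlapping) occurrences."""
    t, p = text.encode("utf-8"), pattern.encode("utf-8")
    if not p:
        return []  # an empty pattern matches nothing in this sketch
    fail = kmp_failure(p)
    hits, k = [], 0
    for i, byte in enumerate(t):
        # On a mismatch, fall back in the pattern instead of rewinding the text.
        while k > 0 and byte != p[k]:
            k = fail[k - 1]
        if byte == p[k]:
            k += 1
        if k == len(p):
            hits.append(i - k + 1)
            k = fail[k - 1]
    return hits


if __name__ == "__main__":
    print(kmp_find_all("abababa", "aba"))
```

The text is scanned once from left to right and never re-read, so the search takes time linear in the combined length of the text and the pattern.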

Refer to NEWS for a complete list of changes, new features and bug fixes.
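The difference between a code point comparison and a test for canonical equivalence, mentioned in the first changelog entry, can be sketched in Python with the standard unicodedata module. This is only an illustration of the two notions, not of stringi's API, and the helper names are hypothetical. By the Unicode definition, two strings are canonically equivalent when their NFD normal forms are identical.

```python
import unicodedata


def codepoint_equal(a: str, b: str) -> bool:
    """Locale-independent comparison of the code point sequences."""
    return a == b


def canonically_equivalent(a: str, b: str) -> bool:
    """Compare the canonical decompositions (NFD) of both strings."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


if __name__ == "__main__":
    precomposed = "\u0105"   # LATIN SMALL LETTER A WITH OGONEK
    decomposed = "a\u0328"   # "a" followed by COMBINING OGONEK
    print(codepoint_equal(precomposed, decomposed))
    print(canonically_equivalent(precomposed, decomposed))
```

Both strings render identically, yet they consist of different code points, so only the second test treats them as equal.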

2014-04-02 new paper

Paper on OM3 Operators Accepted in FSS

A paper by A. Cena and me has been accepted for publication in Fuzzy Sets and Systems (doi:10.1016/j.fss.2014.04.001). It is a significantly extended version of our AGOP'2013 contributions entitled "OM3: Ordered maxitive, minitive, and modular aggregation operators – axiomatic and probabilistic properties in an arity-monotonic setting."
2014-03-12 software

stringi_0.1-25 Now on CRAN

The initial release of the stringi package for R is now available on CRAN. stringi is THE R package for very fast, correct, consistent and convenient string/text processing in each locale and any native character encoding. The use of the ICU library gives R users a platform-independent set of functions known to Java, Perl, Python, PHP, and Ruby programmers. The package’s API was inspired by Hadley Wickham’s stringr package. See the on-line manual for more information.
2014-03-11 new paper

IPMU 2014: Two Papers Accepted

The following papers have been accepted for publication in Proc. IPMU 2014:
  • Bartoszuk M., Gagolewski M., A fuzzy R code similarity detection algorithm.
  • Coroianu L., Gagolewski M., Grzegorzewski P., Adabitabar Firozja M., Houlari T., Piecewise linear approximation of fuzzy numbers preserving the support and core.
The conference proceedings will be published in Springer-Verlag's Communications in Computer and Information Science series.
2014-03-04 new book

Programowanie w Języku R [Programming in R]

My R programming book (in Polish), Programowanie w języku R, is now available in bookstores. I recommend it not only as bedtime reading!
2014-01-03 software

FuzzyNumbers_0.3-3 Released

A new version of the FuzzyNumbers package for R is now available on CRAN.
** FuzzyNumbers Package CHANGELOG **


0.3-3 /2014-01-03/

* piecewiseLinearApproximation() now supports new method="SupportCorePreserving",
see  Coroianu L., Gagolewski M., Grzegorzewski P., Adabitabar Firozja M.,
Houlari T., Piecewise Linear Approximation of Fuzzy Numbers Preserving
the Support and Core, 2014 (submitted for publication).

* piecewiseLinearApproximation() now does not fail on exceptions thrown
by integrate(); fallback=Newton-Cotes formula.

* Removed `Suggests` dependency: testthat tests now available for developers
via the FuzzyNumbers github repository.

* Package manual has been corrected and extended.

* Package vignette is now only available
online at
2013-12-07 new paper

Accepted Paper on Applications of Monotone Measures and Universal Integrals

The paper Gagolewski M., Mesiar R., Monotone measures and universal integrals in a uniform framework for the scientific impact assessment problem has just been accepted for publication in Information Sciences (doi:10.1016/j.ins.2013.12.004).
Abstract: The Choquet, Sugeno, and Shilkret integrals with respect to monotone measures, as well as their generalization – the universal integral, stand for a useful tool in decision support systems. In this paper we propose a general construction method for aggregation operators that may be used in assessing output of scientists. We show that the most often currently used indices of bibliometric impact, like Hirsch's h, Woeginger's w, Egghe's g, Kosmulski's MAXPROD, and similar constructions, may be obtained by means of our framework. Moreover, the model easily leads to some new, very interesting functions.
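For readers unfamiliar with the indices named in the abstract, the following Python sketch computes three of them from a citation record using their commonly cited definitions. It does not implement the universal-integral construction proposed in the paper, the function names are hypothetical, and the g-index is restricted here to values not exceeding the number of papers.

```python
from itertools import accumulate


def h_index(citations):
    """Largest h such that at least h papers have at least h citations each."""
    c = sorted(citations, reverse=True)
    return sum(1 for i, ci in enumerate(c, start=1) if ci >= i)


def g_index(citations):
    """Largest g such that the g most cited papers have at least g**2
    citations in total (sketch: g is capped at the number of papers)."""
    c = sorted(citations, reverse=True)
    totals = accumulate(c)
    return max((g for g, s in enumerate(totals, start=1) if s >= g * g), default=0)


def maxprod(citations):
    """Maximum of i * c_i, where c_i is the i-th largest citation count."""
    c = sorted(citations, reverse=True)
    return max((i * ci for i, ci in enumerate(c, start=1)), default=0)


if __name__ == "__main__":
    record = [10, 8, 5, 4, 3]
    print(h_index(record), g_index(record), maxprod(record))
```

All three functions first sort the record in nonincreasing order and then aggregate it together with the paper ranks, which hints at why they can be treated within one common framework.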

AGOP 2015 Website Launched

8th International Summer School on Aggregation Operators - AGOP 2015 website has been launched.

SMPS 2014 Website Launched

7th International Conference "Soft Methods in Probability and Statistics" - SMPS 2014 website has been launched.
2013-08-20 software

stringi_0.1-9 Now Available

stringi is THE R package for correct, fast, and simple string processing in each locale and native charset. Another alpha release (for testing purposes) can be automatically downloaded by calling in R:

source('') # Message from the future: the link is outdated

The auto-installer gives access to a Windows i386/x64 build for R 3.0 or allows building the package from sources on Linux or MacOS.

UPDATE@2013-11-13. Version 0.1-10 now available. Includes some bugfixes. Moreover, on Linux/UNIX ./configure now first tries to read build settings from pkg-config (as the usage of icu-config is deprecated).

UPDATE@2013-11-16. Version 0.1-11 now available. ICU4C is now statically linked on Windows, so there is no need to download any additional libraries – a binary version is now available for R 2.15.X and 3.0.X. Moreover, on platforms where packages are built from sources, the ./configure script now tries to find ICU4C automagically.

UPDATE@2013-11-21. Build of version 0.1-11 now available for OS X (x64) and R 3.0. Have fun.

UPDATE@2014-02-15. Version 0.1-20 (source and Win_build only) now available. Now it does not depend on any external ICU library (the library source code is included).

2013-07-05 software

stringi: THE String Processing Package for R **alpha release**

stringi is THE R package for correct, fast, and simple string processing in each locale and native charset. ICU API bindings offer a rich variety of platform-, system-locale-, and native-charset-independent functions known from other programming languages, like Java, Perl, Python and PHP.

The alpha release (for testing purposes) is available here (includes Windows i386/x64 build for R 3.0). Any comments and suggestions are welcome!

2013-07-02 new paper

"Scientific Impact Assessment Cannot Be Fair" Accepted for Publication

My paper "Scientific Impact Assessment Cannot Be Fair" has just been accepted for publication in Journal of Informetrics. Bibliography entry:

Gagolewski M., Scientific Impact Assessment Cannot be Fair, Journal of Informetrics 7(4), 2013, pp. 792-802.

Abstract: In this paper we deal with the problem of aggregating numeric sequences of arbitrary length that represent e.g. citation records of scientists. Impact functions are the aggregation operators that express as a single number not only the quality of individual publications, but also their author's productivity.
We examine some fundamental properties of these aggregation tools. It turns out that each impact function which always gives indisputable valuations must necessarily be trivial. Moreover, it is shown that for any set of citation records in which none is dominated by the other, we may construct an impact function that gives any a priori-established authors' ordering. Theoretically then, there is considerable room for manipulation in the hands of decision makers.
We also discuss the differences between the impact function-based and the multicriteria decision making-based approach to scientific quality management, and study how the introduction of new properties of impact functions affects the assessment process. We argue that simple mathematical tools like the h- or g-index (as well as other bibliometric impact indices) may not necessarily be a good choice when it comes to assessing scientific achievements.
2013-06-23 software

FuzzyNumbers-0.3-1 Released

A new version of the FuzzyNumbers package for R has just been submitted to the CRAN archive.
** FuzzyNumbers Package CHANGELOG **


0.3-1 /2013-06-23/

* piecewiseLinearApproximation() - general case (any knot.n)
for method="NearestEuclidean" now available.
Thus, method="ApproximateNearestEuclidean" is now deprecated.

* New binary arithmetic operators, especially
for PiecewiseLinearFuzzyNumbers: +, -, *, /

* New method: fapply() - applies a function on a PLFN
using the extension principle

* New methods: as.character(); also used by show().
This function can also generate LaTeX code defining the FN
(toLaTeX arg thanks to Jan Caha).

* as.FuzzyNumber(), as.TriangularFuzzyNumber(), as.PowerFuzzyNumber(), and
as.PiecewiseLinearFuzzyNumber() are now S4 methods,
and can be called on objects of type numeric, as well as on
various FNs

* piecewiseLinearApproximation() and as.PiecewiseLinearFuzzyNumber()
argument `knot.alpha` now defaults to equally distributed knots
(via given `knot.n`). If `knot.n` is missing, then it is guessed
from `knot.alpha`.

* PiecewiseLinearFuzzyNumber() now accepts missing `a1`, `a2`, `a3`, `a4`,
and `knot.left`, `knot.right` of length `knot.n`+2. Moreover, if `knot.n`
is not given, then it is guessed from length(knot.left).
If `knot.alpha` is missing, then the knots will be equally distributed
on the interval [0,1].

* alphacut() now always returns a named two-column matrix.
evaluate() returns a named vector.

* New function: TriangularFuzzyNumber - returns a TrapezoidalFuzzyNumber.

* Function renamed: convert.side to convertSide, convert.alpha
to convertAlpha, approx.invert to approxInvert

* Added a call to setGeneric("plot", function(x, y, ...) ...
to avoid a warning on install

* The FuzzyNumbers Tutorial has been properly included
as the package's vignette

* DiscontinuousFuzzyNumber class has been marked as **EXPERIMENTAL**
in the manual

* Man pages extensively updated

* FuzzyNumbers devel repo moved to GitHub
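The alpha-cuts and arithmetic operators mentioned in the changelog above can be illustrated with a small Python sketch for trapezoidal fuzzy numbers. It is not the FuzzyNumbers API: the class `Trapezoid` and its methods are hypothetical, only addition is shown, and the parametrisation (support [a1, a4], core [a2, a3], linear sides) is an assumption of this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trapezoid:
    """Trapezoidal fuzzy number with support [a1, a4] and core [a2, a3]."""
    a1: float
    a2: float
    a3: float
    a4: float

    def __post_init__(self):
        if not (self.a1 <= self.a2 <= self.a3 <= self.a4):
            raise ValueError("expected a1 <= a2 <= a3 <= a4")

    def alphacut(self, alpha):
        """Return the interval (lower, upper) of the alpha-cut, 0 <= alpha <= 1."""
        if not 0 <= alpha <= 1:
            raise ValueError("alpha must be in [0, 1]")
        lower = self.a1 + alpha * (self.a2 - self.a1)
        upper = self.a4 - alpha * (self.a4 - self.a3)
        return (lower, upper)

    def __add__(self, other):
        """Sum of two trapezoids: the four parameters add up component-wise."""
        return Trapezoid(self.a1 + other.a1, self.a2 + other.a2,
                         self.a3 + other.a3, self.a4 + other.a4)


if __name__ == "__main__":
    A = Trapezoid(0, 1, 2, 4)
    B = Trapezoid(1, 2, 3, 4)
    print((A + B).alphacut(0.5))
```

For every alpha, the alpha-cut of the sum equals the interval sum of the operands' alpha-cuts, which is how addition behaves under the extension principle for such numbers.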
2013-06-05 new book

Book on R in Press

My book Programowanie w języku R. Analiza danych, obliczenia, symulacje (R Programming: Data Analysis, Computing & Simulation) has been accepted for publication by Wydawnictwo Naukowe PWN. It will appear in February 2014.
2013-04-03 award

START 2013 Scholarship

I've been granted the Foundation for Polish Science (FNP) Scholarship for “young, talented researchers” ‐ START 2013 Program.
2013-04-02 new paper

AGOP'13: Two Papers Accepted

Our two papers have just been accepted for this year's AGOP Workshop. The proceedings will appear in Springer's Advances in Intelligent and Soft Computing Series.
  • Cena A., Gągolewski M., OM3: ordered maxitive, minitive, and modular aggregation operators – Part I: Axiomatic analysis under arity-dependence, In: Bustince H. et al (Eds.), Aggregation Functions in Theory and in Practise (AISC 228), Springer-Verlag, Heidelberg, 2013, pp. 93-103.
    Abstract: Recently, a very interesting relation between symmetric minitive, maxitive, and modular aggregation operators has been shown. It turns out that the intersection between any pair of the mentioned classes is the same. This result introduces what we here propose to call the OM3 operators. In the first part of our contribution on the analysis of the OM3 operators we study some properties that may be useful when aggregating input vectors of varying lengths. In Part II we will perform a thorough simulation study of the impact of input vectors' calibration on the aggregation results.
  • Cena A., Gągolewski M., OM3: ordered maxitive, minitive, and modular aggregation operators – Part II: A simulation study, In: Bustince H. et al (Eds.), Aggregation Functions in Theory and in Practise (AISC 228), Springer-Verlag, Heidelberg, 2013, pp. 105-115.
    Abstract: This article is a second part of the contribution on the analysis of the recently-proposed class of symmetric maxitive, minitive and modular aggregation operators. Recent results (Gagolewski, Mesiar, 2012) indicated some unstable behavior of the generalized h-index, which is a particular instance of OM3, in case of input data transformation. The study was performed on a small, carefully selected real-world data set. Here we conduct some experiments to examine these phenomena more extensively.

Postdoctoral Research Visit @ Slovak University of Technology

For the next 4 months I will be working under the kind supervision of Prof. Radko Mesiar at the Department of Mathematics, Slovak University of Technology in Bratislava, Slovakia. My research visit is supported by the European Union European Social Fund, Project UDA-POKL.04.01.01-00-051/10-00 Information technologies: Research and their interdisciplinary applications.


The news section history begins.