59 publications in total, including:

- 19 journal papers,
- 27 papers in proceedings of international conferences,
- 1 research monograph,
- 3 textbooks, and
- 2 edited volumes.

My current h-index = 9 (Web of Science) / 8 (Scopus) / 11 (Google Scholar).

ORCID = 0000-0003-0637-6028

ResearcherID = C-3575-2012

My Erdős number = 4 (R. Mesiar - I. Assani - R.D. Mauldin - P. Erdős
*or* L. Coroianu - S.G. Gal - J. Szabados - P. Erdős).

*My publication list is also available in
BibTeX format.*

1.

1.

`Python` (`Python`)

2.

`R`. Analiza danych, obliczenia, symulacje
(`R` Programming. Data Analysis. Computing. Simulations)

3.

Grzegorzewski P., **Gagolewski M.**, Bobecka-Wesołowska K.,
*Wnioskowanie statystyczne z wykorzystaniem środowiska `R`*
(*Statistical Inference in `R`*),
Biuro ds. Projektu „Program Rozwojowy Politechniki Warszawskiej”,
2014, 183 pp. isbn:978-83-93-72601-1

1.

Ferraro M.B., Giordani P., Vantaggi B.,
**Gagolewski M.**, Gil M.Á., Grzegorzewski P.,
Hryniewicz O. (Eds.),
*Soft Methods for Data Science*
(*Advances in Intelligent Systems and Computing* **456**), Springer, 2017, 535 pp. doi:10.1007/978-3-319-42972-4 isbn:978-3-319-42971-7

2.

Grzegorzewski P., **Gagolewski M.**,
Hryniewicz O., Gil M.Á. (Eds.),
*Strengthening Links Between Data Analysis and Soft Computing*
(*Advances in Intelligent Systems and Computing* **315**), Springer, 2015, 294 pp. doi:10.1007/978-3-319-10765-3 isbn:978-3-319-10764-6

1.

Lasek J., **Gagolewski M.**,
The efficacy of league formats in ranking teams,
*Statistical Modelling*, 2018, in press. doi:10.1177/1471082X18798426

**Abstract.**
The efficacy of different league formats in ranking teams according to their true latent strength is analysed.
To this end, a new approach for estimating attacking and defensive strengths based on the Poisson regression for modelling match outcomes is proposed. Various performance metrics are estimated reflecting the agreement between latent teams' strength parameters and their final rank in the league table. The tournament designs studied here are used
in the majority of European top-tier association football competitions.
Based on numerical experiments, it turns out that a two-stage league format
comprising a three round-robin tournament together with an extra single
round-robin is the most efficacious setting.
In particular, it is the most accurate in selecting the best team as the winner of the league.
Its efficacy can be enhanced by setting the number of points allocated for a win to two
(instead of three that is currently in effect in association football).

**Keywords.** association football, league formats, rankings, rating systems, simulation, tournament design

2.

Beliakov G., **Gagolewski M.**,
James S., Pace S., Pastorello N., Thilliez E., Vasa R.,
Measuring traffic congestion: An approach based on learning weighted inequality, spread and aggregation indices from comparison data,
*Applied Soft Computing* **67**, 2018, pp. 910-919. doi:10.1016/j.asoc.2017.07.014

**Abstract.**
As cities increase in size, governments and councils face the problem of
designing infrastructure and approaches to traffic management that alleviate
congestion. The problem of objectively measuring congestion involves taking
into account not only the volume of traffic moving throughout a network, but
also the inequality or spread of this traffic over major and minor intersections.
For modelling such data, we investigate the use of weighted congestion indices
based on various aggregation and spread functions. We formulate the weight
learning problem for comparison data and use real traffic data obtained from
a medium-sized Australian city to evaluate their usefulness.

**Keywords.** aggregation functions, inequality indices, spread measures,
learning weights, traffic analysis

3.

**Abstract.**
Research in aggregation theory is nowadays still mostly focused on algorithms
to summarize tuples consisting of observations in some real interval
or of diverse general ordered structures. Of course, in practice
of information processing many other data types between these
two extreme cases are worth inspecting. This contribution deals with
the aggregation of lists of data points in **R**^{d} for arbitrary d≥1.
Even though particular functions aiming to summarize multidimensional data
have been discussed by researchers in data analysis,
computational statistics and geometry, there is clearly a need to provide
a comprehensive and unified model in which their properties
like equivariances to geometric transformations, internality, and monotonicity
may be studied at an appropriate level of generality.
The proposed penalty-based approach
serves as a common framework for all idempotent information aggregation
methods, including componentwise functions,
pairwise distance minimizers, and data depth-based medians. It also
allows for deriving many new practically useful tools.

**Keywords.** multidimensional data aggregation, penalty functions, data depth, centroid, median

4.

Beliakov G., **Gagolewski M.**, James S.,
Penalty-based and other representations of economic inequality,
*International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems*
**24**(Suppl. 1), 2016, pp. 1-23. doi:10.1142/S0218488516400018

**Abstract.**
Economic inequality measures are employed as a key component in various
socio-demographic indices to capture the disparity between the wealthy and poor.
Since their inception, they have also been used as a basis for modelling
spread and disparity in other contexts. While recent research has identified
that a number of classical inequality and welfare functions can be considered
in the framework of OWA operators, here we propose a framework of
penalty-based aggregation functions and their associated penalties as
measures of inequality.

**Keywords.** penalty functions, aggregation functions, inequality indices, spread measures

5.

**Abstract.**
The time needed to apply a hierarchical clustering algorithm is most often
dominated by the number of computations of a pairwise dissimilarity measure.
Such a constraint, for larger data sets, puts at a disadvantage the use of
all the classical linkage criteria but the single linkage one. However, it
is known that the single linkage clustering algorithm is very sensitive to
outliers, produces highly skewed dendrograms, and therefore usually does not
reflect the true underlying data structure – unless the clusters are well-separated.
To overcome its limitations, we propose a new hierarchical clustering linkage
criterion called Genie. Namely, our algorithm links two clusters in such a
way that a chosen economic inequity measure (e.g., the Gini- or Bonferroni-index)
of the cluster sizes does not increase drastically above a given threshold.
The presented benchmarks indicate a high practical usefulness of the introduced
method: it most often outperforms the Ward or average linkage in terms of the
clustering quality while retaining the single linkage speed. The Genie
algorithm is easily parallelizable and thus may be run on multiple threads
to speed up its execution further on. Its memory overhead is small: there
is no need to precompute the complete distance matrix to perform the
computations in order to obtain a desired clustering. It can be applied
on arbitrary spaces equipped with a dissimilarity measure, e.g., on real
vectors, DNA or protein sequences, images, rankings, informetric data,
etc. A reference implementation of the algorithm has been included
in the open source `genie` package for `R`.

**Keywords.** hierarchical clustering, single linkage, inequity measures, Gini-index
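
As a rough illustration of the quantity that the linkage criterion monitors, the sketch below computes a normalized Gini index of a list of cluster sizes. The function name `gini_index` and the particular normalization (dividing by n-1 times the total, so that the value lies in [0, 1]) are assumptions made for this example; the `genie` package may compute the index differently.

```python
from itertools import combinations


def gini_index(sizes):
    """Normalized Gini index of nonnegative sizes (0 means perfectly balanced)."""
    n = len(sizes)
    total = sum(sizes)
    if n < 2 or total == 0:
        return 0.0
    # Sum of absolute differences over all unordered pairs of sizes.
    pairwise = sum(abs(a - b) for a, b in combinations(sizes, 2))
    # Hypothetical normalization: scale so that the maximum is 1.
    return pairwise / ((n - 1) * total)
```

Under this normalization, four equally sized clusters give 0, while a partition such as `[97, 1, 1, 1]` gives a value close to 1, which is the kind of imbalance a threshold on the index is meant to prevent.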

6.

Mesiar R., **Gagolewski M.**,
H-index and other Sugeno integrals: Some defects and their compensation,
*IEEE Transactions on Fuzzy Systems* **24**(6), 2016, pp. 1668-1672. doi:10.1109/TFUZZ.2016.2516579

**Abstract.**
The famous Hirsch index was introduced only ca. 10 years ago.
Despite that, it is already widely used in many decision making
tasks, such as the evaluation of individual scientists, research
grant allocation, or even production planning.
It is known that the h-index is related to the discrete
Sugeno integral and the Ky Fan metric introduced in the 1940s.
The aim of this paper is to propose a few modifications of this index
as well as other fuzzy integrals – also on bounded chains – that lead
to better discrimination of some types of data that are to be aggregated.
All of the suggested compensation methods try to retain the simplicity
of the original measure.

**Keywords.** h-index, Sugeno integral, Ky Fan metric, Shilkret integral, decomposition integrals
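
The relation between the h-index and the discrete Sugeno integral mentioned in the abstract can be checked numerically. The sketch below is only an illustration for integer citation counts: it compares the usual definition of the h-index with the max-min formula of the Sugeno integral taken with respect to the counting measure. The function names are hypothetical and the code is not taken from the paper.

```python
def h_index(citations):
    """Largest h such that at least h items have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    # The condition c >= i holds for a prefix of the ranking, so counting works.
    return sum(1 for i, c in enumerate(ranked, start=1) if c >= i)


def sugeno_counting(citations):
    """Max over i of min(i-th largest value, i): Sugeno integral, counting measure."""
    ranked = sorted(citations, reverse=True)
    return max((min(c, i) for i, c in enumerate(ranked, start=1)), default=0)
```

For integer inputs such as `[10, 8, 5, 4, 3]`, both functions return the same number.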

7.

Lasek J., Szlavik Z., **Gagolewski M.**, Bhulai S.,
How to improve a team's position in the FIFA ranking – A simulation study, *Journal of Applied Statistics*
**43**(7), 2016, pp. 1349-1368. doi:10.1080/02664763.2015.1100593

**Abstract.** In this paper, we study the efficacy of the official ranking for international football teams compiled by FIFA,
the body governing football competition around the globe. We present strategies for improving a team's position in the ranking.
By combining several statistical techniques we derive an objective function in a decision problem of optimal
scheduling of future matches. The presented results display how a team's position can be improved.
Along the way, we compare the official procedure to the famous Elo rating system. Although it originates
from chess, it has been successfully tailored to ranking football teams as well.

**Keywords.** association football, FIFA ranking, prediction models, Monte Carlo simulations, optimal schedule, team rankings
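
For readers unfamiliar with the Elo system mentioned in the abstract, a minimal sketch of its standard chess formulation follows. The scale of 400 and the factor `k=20` are conventional illustrative values, assumed here; the football-specific variant studied in the paper may use different constants and extra adjustments.

```python
def elo_expected(rating_a, rating_b, scale=400.0):
    """Expected score of A against B under the logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


def elo_update(rating_a, rating_b, score_a, k=20.0):
    """Return updated ratings; score_a is 1 for a win, 0.5 for a draw, 0 for a loss."""
    expected_a = elo_expected(rating_a, rating_b)
    delta = k * (score_a - expected_a)
    # The update is zero-sum: what A gains, B loses.
    return rating_a + delta, rating_b - delta
```

With equal ratings the expected score is 0.5, so a win moves each rating by half of `k` in opposite directions.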

8.

Żogała-Siudem B., Siudem G.,
Cena A., **Gagolewski M.**,
Agent-based model for the h-index – Exact solution,
*European Physical Journal B* **89**:21, 2016.
doi:10.1140/epjb/e2015-60757-1

**Abstract.**
Hirsch’s h-index is perhaps the most popular citation-based measure of
scientific excellence. In 2013, Ionescu and Chopard proposed an agent-based
model describing a process for generating publications and citations in an
abstract scientific community [G. Ionescu, B. Chopard, Eur. Phys. J. B 86,
426 (2013)]. Within such a framework, one may simulate a scientist’s activity,
and – by extension – investigate the whole community of researchers.
Even though the Ionescu and Chopard model predicts the h-index quite well,
the authors provided a solution based solely on simulations. In this paper,
we complete their results with exact, analytic formulas. What is more, by
considering a simplified version of the Ionescu-Chopard model, we obtained
a compact, easy to compute formula for the h-index. The derived approximate
and exact solutions are investigated on simulated and real-world data sets.

**Keywords.** Statistical and nonlinear physics, preferential attachment rule, h-index

9.

Cena A., **Gagolewski M.**, Mesiar R.,
Problems and challenges of information resources producers' clustering, *Journal of Informetrics* **9**(2),
2015, pp. 273–284. doi:10.1016/j.joi.2015.02.005

**Abstract.** Classically, unsupervised machine learning techniques are applied
to data sets with a fixed number of attributes (variables).
However, many problems encountered in the field of informetrics
confront us with the need to extend these kinds of methods so that they may
be computed over a set of nonincreasingly ordered vectors of unequal lengths.
Thus, in this paper, some new dissimilarity measures (metrics)
are introduced and studied.
Owing to that, we may use, among others, hierarchical clustering algorithms
in order to determine a partition of an input data set
consisting of sets of producers that are homogeneous not only with respect to
the quality of information resources, but also their quantity.

**Keywords.** aggregation, hierarchical clustering, distance, metric

10.

**Abstract.** The theory of aggregation most often deals with measures of central tendency.
However, sometimes a very different kind of a numeric vector's synthesis into a
single number is required. In this paper we introduce a class of mathematical functions
which aim to measure spread or scatter of one-dimensional quantitative data.
The proposed definition serves as a common, abstract framework for measures of
absolute spread known from statistics, exploratory data analysis and data mining,
e.g. the sample variance, standard deviation, range, interquartile range (IQR),
median absolute deviation (MAD), etc. Additionally, we develop new measures
of experts' opinions diversity or consensus in group decision making problems.
We investigate some properties of spread measures, show how they are related to
aggregation functions, and indicate their new potentially fruitful application areas.

**Keywords.** Group decisions and negotiations, aggregation, spread, deviation, variance
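
Some of the classical measures of absolute spread listed in the abstract can be computed with the Python standard library, as in the sketch below. The helper names `value_range` and `mad` are made up for this example, and the MAD is left unscaled (no consistency constant), which is an assumption.

```python
import statistics


def value_range(x):
    """Range: difference between the largest and the smallest observation."""
    return max(x) - min(x)


def mad(x):
    """Median absolute deviation from the median (unscaled)."""
    m = statistics.median(x)
    return statistics.median([abs(v - m) for v in x])


def spread_summary(x):
    """A few measures of absolute spread for a sample with at least two values."""
    return {
        "variance": statistics.variance(x),  # sample variance (n - 1 denominator)
        "sd": statistics.stdev(x),
        "range": value_range(x),
        "mad": mad(x),
    }
```

Each of these returns zero for a constant vector and is unaffected by adding the same constant to every observation, which is what separates them from measures of central tendency.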

11.

Cena A., **Gagolewski M.**,
OM3: Ordered maxitive, minitive, and modular aggregation operators
– axiomatic and probabilistic properties in an arity-monotonic setting, *Fuzzy Sets and Systems* **264**,
2015, pp. 138-159. doi:10.1016/j.fss.2014.04.001

**Abstract.** The recently-introduced OM3 aggregation operators fulfill three
appealing properties: they are simultaneously minitive, maxitive, and modular.
Among the instances of OM3 operators we find e.g. OWMax and OWMin operators,
the famous Hirsch's h-index and all its natural generalizations.

In this paper the basic axiomatic and probabilistic properties
of extended, i.e. in an arity-dependent setting,
OM3 aggregation operators are studied.
We illustrate the difficulties one is inevitably faced with when
trying to combine the quality and quantity of numeric items
into a single number. The discussion on such aggregation methods
is particularly important in the information resources producers assessment problem,
which aims to reduce the negative effects of information overload.
It turns out that the Hirsch-like indices of impact
do not fulfill a set of very important properties, which puts the sensibility of their
practical usage into question.
Moreover, thanks to the probabilistic analysis of the operators in an i.i.d. model,
we may better understand the relationship between the aggregated items' quality and
their producers' productivity.

**Keywords.** Aggregation; ordered modularity, maxitivity and minitivity;
arity-monotonicity; impact assessment; Hirsch's h-index; informetrics

12.

**Abstract.** The Choquet, Sugeno, and Shilkret integrals
with respect to monotone measures,
as well as their generalization
– the universal integral, are useful tools in decision support systems.
In this paper we propose a general construction method for aggregation
operators that may be used in assessing output of scientists.
We show that the most often currently used indices of bibliometric impact,
like Hirsch's h, Woeginger's w, Egghe's g, Kosmulski's MAXPROD,
and similar constructions, may be obtained by means of our framework.
Moreover, the model easily leads to some new, very interesting functions.

**Keywords.** Choquet, Sugeno, Shilkret, universal integral;
monotone measures;
aggregation;
indices of scientific impact,
bibliometrics;
h-index, w-index, g-index, MAXPROD-index

13.

Coroianu L.,
**Gagolewski M.**, Grzegorzewski P.,
Nearest piecewise linear approximation of fuzzy numbers,
*Fuzzy Sets and Systems* **233**, 2013, pp. 26-51. doi:10.1016/j.fss.2013.02.005

**Abstract.** The problem of the nearest approximation of fuzzy numbers
by piecewise linear 1-knot fuzzy numbers is discussed. By using 1-knot
fuzzy numbers one may obtain approximations which are simple enough and
flexible to reconstruct the input fuzzy concepts under study. They might
be also perceived as a generalization of the trapezoidal approximations.
Moreover, these approximations possess some desirable properties.
Apart from theoretical considerations approximation algorithms
that can be applied in practice are also given.

**Keywords.** Approximation of fuzzy numbers; Fuzzy number; Piecewise linear approximation

14.

**Abstract.** In this paper we deal with the problem of
aggregating numeric sequences of arbitrary length that represent
e.g. citation records of scientists. Impact functions are the aggregation operators that express as a
single number not only the quality of individual publications, but also their author's productivity.

We examine some fundamental properties of these aggregation tools. It turns out that each impact
function which always gives indisputable valuations must necessarily be trivial.
Moreover, it is shown that for any set of citation records in which none is dominated by the other, we
may construct an impact function that gives any a priori-established authors' ordering. Theoretically
then, there is considerable room for manipulation in the hands of decision makers.

We also discuss the differences between the impact function-based and the multicriteria decision
making-based approach to scientific quality management, and study how the introduction of new
properties of impact functions affects the assessment process. We argue that simple mathematical
tools like the h- or g-index (as well as other bibliometric impact indices) may not necessarily
be a good choice when it comes to assessing scientific achievements.

**Keywords.** Impact functions;
aggregation;
decision making;
reference modeling;
Hirsch's h-index;
scientometrics;
bibliometrics

15.

**Abstract.** In this paper the relationship between symmetric minitive,
maxitive, and modular aggregation operators is considered. It is shown
that the intersection between any two of the three discussed classes
is the same. Moreover, the intersection is explicitly characterized.

It turns out that the intersection contains families of aggregation
operators such as OWMax, OWMin, and many generalizations of the
widely-known Hirsch’s h-index, often applied in scientific quality control.

**Keywords.** Aggregation operators; OWMax; OMA; OWA; Hirsch’s h-index;
Scientometrics

**Comments.** Later we proposed that the symmetric minitive,
maxitive, and modular aggregation operators may be called the OM3 agops,
see (Cena A., Gagolewski M.,
*OM3: ordered maxitive, minitive, and modular
aggregation operators – Part I: Axiomatic analysis under arity-dependence*, 2013).

16.

**Abstract.** The process of assessing individual authors should
rely upon a proper aggregation of reliable and valid papers’ quality
metrics. Citations are merely one possible way to measure appreciation
of publications. In this study we propose some new, SJR- and SNIP-based
indicators, which not only take into account the broadly conceived
popularity of a paper (manifested by the number of citations),
but also other factors like its potential, or the quality of papers
that cite a given publication. We explore the relation and correlation
between different metrics and study how they affect the values of
a real-valued generalized h-index calculated for 11 prominent
scientometricians. We note that the h-index is a very unstable
impact function, highly sensitive to the scaling of input elements.
Our analysis is not only of theoretical significance: data scaling
is often performed to normalize citations across disciplines.
Uncontrolled application of this operation may lead to unfair and
biased (toward some groups) decisions. This puts the validity of
authors assessment and ranking using the h-index into question.
Obviously, a good impact function to be used in practice
should not be as sensitive to changes in the input
data as the analyzed one.

**Keywords.** Aggregation operators; Impact functions; Hirsch's h-index;
Quality control; Scientometrics; Bibliometrics; SJR;
SNIP; Scopus; CITAN; R

**Comments.** An empirical paper. The ideas presented here
were later explored more thoroughly in (Cena A., Gagolewski M.,
*OM3: ordered maxitive, minitive, and modular
aggregation operators – Part II: A simulation study*, 2013).

17.

**Abstract.** In this paper CITAN, the CITation ANalysis package for
the R statistical computing environment, is introduced. The main aim of the
software is to support bibliometricians with a tool for preprocessing
and cleaning bibliographic data retrieved from SciVerse Scopus and
for calculating the most popular indices of scientific impact.

To show the practical usability of the package, an exemplary assessment
of authors publishing in the fields of scientometrics and
webometrics is performed.

**Keywords.** Data analysis software; Quality control in science;
Citation analysis; Bibliometrics; Hirsch's h index;
Egghe's g index; SciVerse Scopus

18.

**Abstract.** A class of arity-monotonic aggregation operators,
called impact functions, is proposed. This family of operators forms
a theoretical framework for the so-called Producer Assessment Problem,
which includes the scientometric task of fair and objective assessment
of scientists using the number of citations received by their publications.

The impact function output values are analyzed under right-censored
and dynamically changing input data. The qualitative possibilistic
approach is used to describe this kind of uncertainty.
It leads to intuitive graphical interpretations and may
be easily applied for practical purposes.

The discourse is illustrated by a family of aggregation operators
generalizing the well-known Ordered Weighted Maximum (OWMax)
and the Hirsch h-index.

**Keywords.** Aggregation operators; Possibility theory; S-statistics; h-index; OWMax

**Comments.** In this paper the class of effort-dominating impact functions
has also been introduced. I have shown later (see Gagolewski M.,
*On the Relation Between Effort-Dominating and Symmetric Minitive Aggregation Operators*, 2012)
that all such aggregation operators are symmetric minitive.

19.

**Abstract.** Two broad classes of scientific impact indices
are proposed and their properties – both theoretical and practical –
are discussed. These new classes were obtained as a geometric
generalization of the well-known tools applied in scientometrics,
like Hirsch’s h-index, Woeginger’s w-index, and Kosmulski’s Maxprod.
It is shown how to apply the suggested indices for estimation of
the shape of the citation function or the total number of citations
of an individual. Additionally, a new efficient and simple O(log n)
algorithm for computing the h-index is given.

**Keywords.** Hirsch's h-index, citation analysis, scientific impact indices

**Comments.** I have shown later (see Gagolewski M.,
*On the Relation Between Effort-Dominating and Symmetric Minitive Aggregation Operators*, 2012)
that the r_{p}-indices are symmetric minitive.
Moreover, we have found that there exists an O(n log n) algorithm
for determining l_{p} (see Gagolewski M., Dębski M., Nowakiewicz M.,
*Efficient Algorithm for Computing Certain Graph-Based Monotone Integrals: the l_{p}-Indices*, 2013).
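
The abstract above mentions an O(log n) algorithm for the h-index. One standard way to reach logarithmic time, sketched below, is a binary search over a citation sequence that is already sorted nonincreasingly; this precondition is an assumption of the example, and the code is not necessarily the algorithm given in the paper.

```python
def h_index_sorted(citations_desc):
    """h-index of a nonincreasingly sorted sequence, using O(log n) comparisons."""
    lo, hi = 0, len(citations_desc)
    # Invariant: the answer lies in [lo, hi]. The test
    # "the mid-th largest value is at least mid" is true for small mid
    # and false for large mid, so bisection applies.
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if citations_desc[mid - 1] >= mid:
            lo = mid
        else:
            hi = mid - 1
    return lo
```

Sorting an unsorted record first would cost O(n log n), so the logarithmic bound only applies when the ordered sequence is available up front.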

1.

Beliakov G., **Gagolewski M.**, James S.,
*Least median of squares (LMS) and least trimmed squares (LTS) fitting for the weighted arithmetic mean*,
In: Medina J. et al. (Eds.), *Information Processing and Management of Uncertainty in Knowledge-Based Systems. Theory and Foundations*
(*Communications in Computer and Information Science* **854**),
Springer, 2018, pp. 367-378. doi:10.1007/978-3-319-91476-3_31

**Abstract.** We look at different approaches to learning the weights of the
weighted arithmetic mean such that the median residual or sum of the
smallest half of squared residuals is minimized. The more general problem
of multivariate regression has been well studied in the statistical literature;
however, in the case of aggregation functions we have a restriction on
the weights and the domain is usually restricted so that ‘outliers’ may
not be arbitrarily large. A number of algorithms are compared in terms
of accuracy and speed. Our results can be extended to other aggregation
functions.

**Keywords.** aggregation, LMS fitting, LTS fitting, approximation
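
To make the two fitting criteria concrete, the sketch below evaluates them for a given weighting vector of the weighted arithmetic mean. It only computes the objectives and does not perform the minimization. The function names, and the choice of the "smallest half" as the ceiling of n/2 residuals, are assumptions made for this illustration.

```python
import statistics


def weighted_mean(x, w):
    """Weighted arithmetic mean; w is assumed nonnegative and summing to 1."""
    return sum(wi * xi for wi, xi in zip(w, x))


def squared_residuals(X, y, w):
    """Squared differences between the fitted means and the observed outputs."""
    return [(weighted_mean(x, w) - yi) ** 2 for x, yi in zip(X, y)]


def lms_objective(X, y, w):
    """Least median of squares criterion: median of the squared residuals."""
    return statistics.median(squared_residuals(X, y, w))


def lts_objective(X, y, w):
    """Least trimmed squares criterion: sum of the smallest half of squared residuals."""
    r = sorted(squared_residuals(X, y, w))
    h = (len(r) + 1) // 2  # hypothetical trimming size: ceil(n / 2)
    return sum(r[:h])
```

For data in which a single observation is an outlier, both criteria can be exactly zero for the weights that fit the remaining observations, whereas the ordinary sum of squared residuals stays positive.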

2.

**Abstract.** The Sugeno integral has numerous successful applications,
including but not limited to the areas of decision making, preference modeling,
and bibliometrics. Despite this, the current state of the development of usable
algorithms for numerically fitting the underlying discrete fuzzy measure based
on a sample of prototypical values – even in the simplest possible case, i.e.,
assuming the symmetry of the capacity – is yet to reach a satisfactory level.
Thus, the aim of this paper is to present some results and observations
concerning this class of data approximation problems.

**Keywords.** Sugeno integral, aggregation functions, machine learning, regression, approximation

3.

Bartoszuk M., **Gagolewski M.**,
*Binary aggregation functions in software plagiarism detection*,
In: *Proc. FUZZ-IEEE'17*, IEEE, 2017, no. 8015582. doi:10.1109/FUZZ-IEEE.2017.8015582

**Abstract.** Supervised learning is of key interest in data science.
Even though there exist many approaches to solving, among others,
classification as well as ordinal and standard regression tasks,
most of them output models that do not possess useful formal properties,
like nondecreasingness in each independent variable, idempotence,
symmetry, etc. This makes them difficult to interpret and analyze.
For instance, it might be impossible to determine the importances of
individual features or to assess the effects of increasing the values
of predictors on the behavior of a chosen response variable. Such
properties are especially important in software plagiarism detection,
where we are faced with combining the degree to which
a code chunk A is similar to (or contained in) B with the degree to which
B is similar to A. Therefore, in this paper we consider a new method
for fitting B-spline tensor product-based aggregation functions to
empirical data. An empirical study indicates a highly competitive
performance of the resulting models. Additionally, they possess an
intuitive interpretation which is highly desirable for end-users.

4.

Cena A., **Gagolewski M.**,
*OWA-based linkage and the Genie correction for hierarchical clustering*,
In: *Proc. FUZZ-IEEE'17*, IEEE, 2017, no. 8015652. doi:10.1109/FUZZ-IEEE.2017.8015652

**Abstract.** In this paper we thoroughly investigate various OWA-based
linkages in hierarchical clustering on numerous benchmark data sets.
The inspected setting generalizes the well-known single, complete,
and average linkage schemes, among others. The incorporation of
weights into the cluster merge procedure creates an opportunity
to make use of experts' knowledge about a particular data domain
so as to generate partitions of a given data set that better
reflect the true underlying cluster structure. Moreover, we
introduce a correction for the inequality of cluster size distribution
— similar to the one proposed in our recently introduced Genie algorithm
— which results in a significant performance boost in terms of clustering quality.
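
To illustrate how an OWA-based linkage can generalize the single, complete, and average schemes, here is a minimal Python sketch. The function name `owa_linkage`, the use of the Euclidean distance, and the ascending sort order are assumptions made for this illustration only; they are not taken from the paper.

```python
import math

def owa_linkage(A, B, weights):
    """Hypothetical helper: OWA-aggregated dissimilarity between clusters A and B.

    A and B are sequences of points (tuples of floats); weights has one
    entry per pair (a, b) and is applied to the pairwise distances
    sorted in ascending order (an assumed convention).
    """
    # all pairwise Euclidean distances between the two clusters, ascending
    d = sorted(math.dist(a, b) for a in A for b in B)
    if len(weights) != len(d):
        raise ValueError("need one weight per pair of points")
    return sum(w * x for w, x in zip(weights, d))

A = [(0.0, 0.0), (1.0, 0.0)]
B = [(3.0, 0.0), (5.0, 0.0)]
single = owa_linkage(A, B, [1, 0, 0, 0])     # all weight on the smallest distance
complete = owa_linkage(A, B, [0, 0, 0, 1])   # all weight on the largest distance
average = owa_linkage(A, B, [0.25] * 4)      # uniform weights
```

With these toy clusters the pairwise distances are 2, 3, 4, and 5, so the three weighting vectors reproduce the minimum, the maximum, and the arithmetic mean of the distances, respectively.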

5.

**Abstract.**
The paper discusses a generalization of the nearest centroid hierarchical
clustering algorithm. A first extension deals with the incorporation
of generic distance-based penalty minimizers instead of the classical
aggregation by means of centroids. As a result, the presented algorithm can be applied
in spaces equipped with an arbitrary dissimilarity measure (images, DNA sequences, etc.).
Secondly, a correction preventing the formation
of clusters of too highly unbalanced sizes is applied: just like in the
recently introduced *Genie* approach, which extends the single linkage scheme,
the new method prevents a chosen inequity measure (e.g., the Gini-, de Vergottini-,
or Bonferroni-index) of cluster sizes from rising above a predefined
threshold. Numerous benchmarks indicate that the introduction of such
a correction increases the quality of the resulting clusterings.

**Keywords.** hierarchical clustering, aggregation, centroid, Gini-index, Genie algorithm
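
As a rough illustration of what an inequity measure of cluster sizes looks like, the sketch below computes a normalized Gini index. The function name `gini_index` and the normalization by (n - 1) times the total are assumptions chosen so that the result lies in [0, 1]; the paper's exact formulation and its threshold rule are not reproduced here.

```python
def gini_index(sizes):
    """Hypothetical helper: normalized Gini index of a list of cluster sizes.

    Returns 0 for perfectly balanced sizes and approaches 1 when one
    cluster holds almost all the points (assumed normalization).
    """
    n = len(sizes)
    if n < 2:
        return 0.0
    # sum of absolute differences over all unordered pairs of clusters
    pair_diffs = sum(abs(a - b) for i, a in enumerate(sizes) for b in sizes[i + 1:])
    return pair_diffs / ((n - 1) * sum(sizes))

balanced = gini_index([25, 25, 25, 25])   # all clusters equal in size
skewed = gini_index([1, 1, 1, 97])        # one dominant cluster
```

A correction of the kind described above would compare such a value with a predefined threshold before allowing a merge.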

6.

Bartoszuk M., Beliakov G., **Gagolewski M.**, James S.,
*Fitting aggregation functions to data: Part I – Linearization and regularization*,
In: Carvalho J.P. et al. (Eds.),
*Information Processing and Management of Uncertainty in Knowledge-Based Systems*,
Part II (*Communications in Computer and Information Science* **611**),
Springer, 2016, pp. 767-779. doi:10.1007/978-3-319-40581-0_62

**Abstract.**
The use of supervised learning techniques for fitting weights and/or generator
functions of weighted quasi-arithmetic means – a special class of idempotent
and nondecreasing aggregation functions – to empirical data has already been
considered in a number of papers.
Nevertheless, there are still some important issues that have not been
discussed in the literature yet. In the first part of this two-part contribution
we deal with the concept of regularization, a quite standard technique from machine learning
applied so as to increase the fit quality on test and validation data samples.
Due to the constraints on the weighting vector,
it turns out that quite different methods can be used in the current framework, as
compared to regression models.
Moreover, it is worth noting that so far fitting weighted
quasi-arithmetic means to empirical data has only been performed
approximately, via the so-called linearization technique.
In this paper we consider exact solutions to such special optimization tasks
and indicate cases where linearization leads to much worse solutions.

**Keywords.** Aggregation functions, weighted quasi-arithmetic means,
least squares fitting, regularization, linearization

7.

Bartoszuk M., Beliakov G., **Gagolewski M.**, James S.,
*Fitting aggregation functions to data: Part II – Idempotization*,
In: Carvalho J.P. et al. (Eds.),
*Information Processing and Management of Uncertainty in Knowledge-Based Systems*,
Part II (*Communications in Computer and Information Science* **611**),
Springer, 2016, pp. 780-789. doi:10.1007/978-3-319-40581-0_63

**Abstract.**
The use of supervised learning techniques for fitting weights and/or generator
functions of weighted quasi-arithmetic means – a special class of idempotent
and nondecreasing aggregation functions – to empirical data has already been
considered in a number of papers.
Nevertheless, there are still some important issues that have not been
discussed in the literature yet. In the second part of this two-part contribution
we deal with a quite common situation in which we have inputs coming from
different sources, describing a similar phenomenon, but which
have not been properly normalized. In such a case,
idempotent and nondecreasing functions cannot be used to aggregate them
unless proper pre-processing is performed.
The proposed idempotization method, based on the notion of B-splines,
allows for an automatic calibration of independent variables.
The introduced technique is applied in an R source code plagiarism
detection system.

**Keywords.** Aggregation functions, weighted quasi-arithmetic means,
least squares fitting, idempotence

8.

Cena A., **Gagolewski M.**,
*Fuzzy k-minpen clustering and k-nearest-minpen classification procedures incorporating generic distance-based penalty minimizers*,
In: Carvalho J.P. et al. (Eds.),
*Information Processing and Management of Uncertainty in Knowledge-Based Systems*,
Part II (*Communications in Computer and Information Science* **611**),
Springer, 2016, pp. 445-456. doi:10.1007/978-3-319-40581-0_36

**Abstract.**
We discuss a generalization of the fuzzy (weighted) k-means clustering procedure
and point out its relationships with data aggregation in spaces equipped with
arbitrary dissimilarity measures. In the proposed setting, a
data set partitioning is performed based on the notion of points' proximity to generic
distance-based penalty minimizers. Moreover, a new data classification algorithm,
resembling the k-nearest neighbors scheme but less computationally and memory
demanding, is introduced. Rich examples in complex data domains
indicate the usability of the methods and aggregation theory in general.

**Keywords.** fuzzy k-means algorithm, clustering, classification, fusion functions, penalty minimizers

9.

Lasek J., **Gagolewski M.**,
*The winning solution to the AAIA'15 Data Mining Competition: Tagging firefighter activities at a fire scene*,
In:
Ganzha M., Maciaszek L., Paprzycki M. (Eds.),
*Proc. FedCSIS'15*, IEEE, 2015, pp. 375-380. doi:10.15439/2015F418

**Abstract.** Multi-sensor based classification of professionals' activities
plays a key role in ensuring the success of their goals. In this paper
we present the winning solution to the *AAIA'15 Tagging Firefighter
Activities at a Fire Scene* data mining competition. The approach
is based on a Random Forest classifier trained on an input data set with
almost 5000 features describing the underlying time series of sensory data.

**Keywords.** Activity tagging, movement tagging, data mining competition, Random Forest model, FFT

10.

Cena A., **Gagolewski M.**,
*A K-means-like algorithm for informetric data clustering*,
In: Alonso J.M., Bustince H., Reformat M. (Eds.),
*Proc. IFSA/EUSFLAT 2015*,
Atlantis Press, 2015, pp. 536-543. doi:10.2991/ifsa-eusflat-15.2015.77

**Abstract.** The K-means algorithm is one of the most often used clustering techniques.
However, when it comes to discovering clusters in informetric data sets
that consist of non-increasingly ordered vectors of not necessarily conforming
lengths, such a method cannot be applied directly.
Hence, in this paper, we propose a K-means-like algorithm
to determine groups of producers that are similar
not only with respect to the quality of information resources they output,
but also their quantity.

**Keywords.** k-means clustering, informetrics, aggregation, impact functions

11.

Bartoszuk M., **Gagolewski M.**,
*Detecting similarity of R functions via a fusion of multiple heuristic methods*,
In: Alonso J.M., Bustince H., Reformat M. (Eds.),
*Proc. IFSA/EUSFLAT 2015*,
Atlantis Press, 2015, pp. 419-426. doi:10.2991/ifsa-eusflat-15.2015.61

**Abstract.** In this paper we describe recent advances in our R code similarity detection algorithm.
We propose a modification of the Program Dependence Graph (PDG) procedure used in the GPLAG system
that better fits the nature of functional programming languages like R.
The major strength of our approach lies in a proper
aggregation of outputs of multiple plagiarism detection methods,
as it is well known that no single technique gives perfect results.
It turns out that the incorporation of the PDG algorithm
significantly improves the recall ratio, i.e., it is better
at indicating true positive cases of plagiarism or code
cloning patterns. The implemented system is available
as a web application at http://SimilaR.Rexamine.com/.

**Keywords.** R, plagiarism and code cloning detection,
fuzzy proximity relations, aggregation,
program dependence graph, t-norms

12.

**Abstract.** In the field of informetrics, agents are often represented
by numeric sequences of not necessarily conforming lengths.
There are numerous aggregation techniques for such sequences,
e.g., the g-index and the h-index, that may be used to compare the output
of pairs of agents. In this paper we address the question of whether such impact
indices may be used to model experts' preferences accurately.

**Keywords.** preference learning, fuzzy relations, informetrics, aggregation, h-index
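
For readers unfamiliar with the two impact indices mentioned above, the following Python sketch computes them from a sequence of citation counts. The function names are hypothetical, and the g-index is restricted here to values not exceeding the sequence length, which is one common convention and an assumption of this sketch.

```python
def h_index(x):
    """Largest h such that at least h elements of x are >= h."""
    s = sorted(x, reverse=True)
    # for a nonincreasing sequence, s[i-1] >= i holds exactly for i = 1..h
    return sum(1 for i, xi in enumerate(s, start=1) if xi >= i)

def g_index(x):
    """Largest g <= len(x) such that the g largest elements sum to >= g**2."""
    s = sorted(x, reverse=True)
    g, total = 0, 0
    for i, xi in enumerate(s, start=1):
        total += xi
        if total >= i * i:
            g = i
    return g

citations = [10, 8, 5, 4, 3]
h = h_index(citations)
g = g_index(citations)
```

Both functions accept sequences of any length, which is what allows agents with different numbers of items to be compared.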

13.

**Abstract.** Aggregation theory often deals with measures of central tendency of quantitative data.
As sometimes a different kind of information fusion is needed,
an axiomatization of spread measures was introduced recently. In this contribution
we explore the properties of WD_{p}WAM and WD_{p}OWA operators,
which are defined as weighted L_{p}-distances to weighted
arithmetic mean and OWA operators, respectively.
In particular, we give forms of vectors that maximize
such fusion functions and thus provide a way to normalize the output value
so that the vector of maximal spread always leads to a fixed outcome, e.g., 1
if all the input elements are in [0,1].
This might be desirable when constructing measures of experts' opinions consistency or diversity
in group decision making problems.

**Keywords.** data fusion, aggregation, spread, deviation, variance, OWA operators
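
A spread measure of the kind described above can be sketched as follows. This is only one plausible reading of "weighted L_{p}-distance to a weighted arithmetic mean": the function name `wdp_wam`, the placement of the second weighting vector `v` inside the sum, and the 1/p root are all assumptions and may differ from the definition used in the paper.

```python
def wdp_wam(x, w, v, p=2.0):
    """Hypothetical sketch: weighted L_p-distance of x to its weighted mean.

    w -- weights of the arithmetic mean (assumed to sum to 1)
    v -- weights of the L_p-distance (assumed nonnegative)
    """
    if not (len(x) == len(w) == len(v)):
        raise ValueError("x, w and v must have the same length")
    mean = sum(wi * xi for wi, xi in zip(w, x))
    return sum(vi * abs(xi - mean) ** p for vi, xi in zip(v, x)) ** (1.0 / p)

constant = wdp_wam([0.3, 0.3, 0.3], [1 / 3] * 3, [1 / 3] * 3)  # no spread
extreme = wdp_wam([0.0, 1.0], [0.5, 0.5], [0.5, 0.5])          # maximal spread on [0, 1]
```

Under these assumptions the two-element vector (0, 1) yields 0.5, so dividing by that value would rescale the measure to reach 1 at the vector of maximal spread, in the spirit of the normalization discussed in the abstract.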

14.

Cena A., **Gagolewski M.**,
*Aggregation and soft clustering of informetric data*,
In: Baczyński M., De Baets B., Mesiar R. (Eds.),
*Proc. 8th International Summer School on Aggregation Operators (AGOP 2015)*,
University of Silesia, 2015, pp. 79-84. isbn:978-83-8012-519-3

**Abstract.** The aim of this contribution is to inspect possible
applications of clustering techniques
computed over a set consisting of nonincreasingly ordered vectors
of possibly nonconforming lengths. Such data sets appear in the field of
informetrics, where one may need to evaluate the quality of information items,
e.g., research papers,
and their producers. In this paper we investigate the notion of cluster centers
as an aggregated representation of all vectors from a given cluster and analyze
them by means of aggregation operators.

**Keywords.** clustering, fuzzy clustering, c-means algorithm, distance, producers assessment problem

15.

**Abstract.** The aggregation theory usually takes an interest in
summarizing a predefined number of points in the real line.
In many applications, like in statistics, data analysis, and mining,
the notion of a mean – a nondecreasing, internal, and symmetric fusion function
– plays a key role. Nevertheless, when it comes to aggregating
a set of points in higher dimensional spaces, the componentwise
extension of monotonicity and internality might not be the best choice.
Instead, the invariance to certain classes of geometric transformations
seems to be crucial in such a case.

**Keywords.** aggregation, centroid, Tukey median, 1-center, 1-median, convex hull, affine invariance, orthogonalization

16.

**Abstract.** The producers assessment problem has many important practical
instances: it is an abstract model for intelligent systems evaluating
e.g. the quality of computer software repositories, web resources,
social networking services, and digital libraries. Each producer's
performance is determined according not only to the overall quality
of the items he/she outputted, but also to the number of such items
(which may be different for each agent).

Recent theoretical results indicate that the use of aggregation
operators in the process of ranking and evaluating producers
may not necessarily lead to fair and plausible outcomes. Therefore,
to overcome some weaknesses of the most often applied approach,
in this preliminary study we encourage the use of a fuzzy preference
relation-based setting and indicate why it may provide better
control over the assessment process.

**Keywords.** fuzzy relations, preference modeling, producers assessment problem, StackOverflow, bibliometrics, h-index

17.

**Abstract.** Sugeno integral-based confidence intervals for the theoretical
h-index of a fixed-length sequence of i.i.d. random variables are derived.
They are compared with other estimators of such a distribution characteristic
in a Pareto i.i.d. model. It turns out that in the first case we obtain
much wider intervals. It seems to be due to the fact that a Sugeno integral,
which may be applied on any ordinal scale, is known to ignore too
much information from cardinal-scale data being aggregated.

**Keywords.** h-index, Sugeno integral, confidence interval, Pareto distribution

18.

Bartoszuk M., **Gagolewski M.**,
*A fuzzy R code similarity detection algorithm*,
In: Laurent A. et al. (Eds.), *Information Processing and Management of Uncertainty in Knowledge-Based Systems*,
Part III (*Communications in Computer and Information Science* **444**), Springer, 2014, pp. 21-30. doi:10.1007/978-3-319-08852-5_3

**Abstract.** R is a programming language and software environment
for performing statistical computations
and applying data analysis that increasingly gains popularity
among practitioners and scientists. In this paper we present
a preliminary version of a system to detect pairs of similar R code blocks
among a given set of routines, which is based on a proper aggregation of the output of
three different [0,1]-valued (fuzzy) proximity degree estimation algorithms.
Its analysis on empirical data indicates that the system may in the future be successfully applied in practice,
e.g., to detect plagiarism among students' homework submissions or to perform an analysis
of code recycling or code cloning in R's open source packages repositories.

**Keywords.** R, plagiarism detection, code cloning, fuzzy similarity measures

19.

Coroianu L., **Gagolewski M.**,
Grzegorzewski P., Adabitabar Firozja M., Houlari T.,
*Piecewise linear approximation of fuzzy numbers preserving the support and core*,
In: Laurent A. et al. (Eds.), *Information Processing and Management of Uncertainty in Knowledge-Based Systems*,
Part II (*Communications in Computer and Information Science* **443**), Springer, 2014, pp. 244-254. doi:10.1007/978-3-319-08855-6_25

**Abstract.** A reasonable approximation of a fuzzy number should have a simple
membership function, be close to the input fuzzy number, and should
preserve some of its important characteristics. In this
paper we suggest approximating a
fuzzy number by a piecewise linear 1-knot fuzzy number which is
the closest one to the input fuzzy number among all piecewise
linear 1-knot fuzzy numbers having the same core and the same
support as the input. We discuss the existence of the approximation
operator, show algorithms ready for practical
use, and illustrate the considered concepts by examples. It turns out that
such an approximation task may be problematic.

**Keywords.** Approximation of fuzzy numbers, core, fuzzy number,
piecewise linear approximation, support

20.

Cena A., **Gagolewski M.**,
*OM3: Ordered maxitive, minitive, and modular
aggregation operators – Part I: Axiomatic analysis under arity-dependence*,
In: Bustince H. et al. (Eds.),
*Aggregation Functions in Theory and in Practise*
(*Advances in Intelligent Systems and Computing* **228**), Springer, 2013, pp. 93-103. doi:10.1007/978-3-642-39165-1_13

**Abstract.** Recently, a very interesting relation between symmetric
minitive, maxitive, and modular aggregation operators has been shown.
It turns out that the intersection between any pair of the mentioned
classes is the same. This result introduces what we here propose
to call the OM3 operators. In the first part of our contribution
on the analysis of the OM3 operators we study some properties that
may be useful when aggregating input vectors of varying lengths.
In Part II we will perform a thorough simulation study of the
impact of input vectors’ calibration on the aggregation results.

21.

Cena A., **Gagolewski M.**,
*OM3: Ordered maxitive, minitive, and modular
aggregation operators – Part II: A simulation study*,
In: Bustince H. et al. (Eds.),
*Aggregation Functions in Theory and in Practise*
(*Advances in Intelligent Systems and Computing* **228**), Springer, 2013,
pp. 105-115. doi:10.1007/978-3-642-39165-1_14

**Abstract.** This article is the second part of the contribution
on the analysis of the recently-proposed class of symmetric maxitive,
minitive and modular aggregation operators. Recent results
(Gagolewski, Mesiar, 2012) indicated some unstable behavior
of the generalized h-index, which is a particular instance of
OM3, in case of input data transformation. The study was performed
on a small, carefully selected real-world data set.
Here we conduct some experiments to examine this phenomenon more extensively.

22.

**Abstract.** In this paper we discuss the construction of a new
parametric statistical hypothesis test for the equality of
probability distributions. The test is based on the difference
between Hirsch’s h-indices of two equal-length i.i.d. random
samples. For the sake of illustration, we analyze its power
in case of Pareto-distributed input data. It turns out that
the test is very conservative and has wide acceptance regions,
which puts in question the appropriateness of the h-index usage
in scientific quality control and decision making.

23.

**Abstract.** In this paper the recently introduced class of
effort-dominating impact functions is examined. It turns out
that each effort-dominating aggregation operator not only has a
very intuitive interpretation, but also is symmetric minitive, and
therefore may be expressed as a so-called quasi-I-statistic, which
generalizes the well-known OWMin operator.

These aggregation operators may be used e.g. in the Producer Assessment
Problem whose most important instance is the scientometric/bibliometric
issue of fair scientists’ ranking by means of the number of citations
received by their papers.

24.

**Abstract.** Two classes of aggregation functions: L-statistics
and S-statistics and their generalizations called quasi-L-statistics
and quasi-S-statistics are considered. Some interesting characterizations
of these families of operators are given. The aforementioned functions
are useful for various applications. In particular, they are very helpful
for modeling the so-called Producer Assessment Problem.

25.

**Abstract.** Some statistical properties of the so-called S-statistics,
which generalize the ordered weighted maximum aggregation operators,
are considered. In particular, the asymptotic normality of S-statistics
is proved and some possible applications in estimation problems are suggested.

26.

**Abstract.** A class of extended aggregation operators, called impact
functions, is proposed and their basic properties are examined.
Some important classes of functions like generalized ordered weighted
averaging (OWA) and ordered weighted maximum (OWMax) operators
are considered. The general idea is illustrated by the Producer
Assessment Problem which includes the scientometric problem of
rating scientists based on the number of citations received by
their publications. An interesting characterization of the well
known h-index is given.

27.

**Abstract.** The problem of measuring scientific impact is considered. A class
of so-called p-sphere (r_{p}) indices, which generalize the well-known
Hirsch index, is used to construct a possibility measure of
scientific impact. This measure might be treated as a
starting point for prediction of future index values or for dealing
with right-censored bibliometric data.

1.

Lasek J., **Gagolewski M.**,
*Estimation of tournament metrics for association football league formats*,
In: *Selected problems in information technologies (Proc. ITRIA'15 vol. 2)*,
Institute of Computer Science, Polish Academy of Sciences,
2015, pp. 67-78.

2.

Cena A., **Gagolewski M.**,
*Clustering and aggregation of informetric data sets*,
In: *Computational methods in data analysis (Proc. ITRIA'15 vol. 1)*,
Institute of Computer Science, Polish Academy of Sciences,
2015, pp. 5-26. isbn:978-83-63159-22-1

3.

**Abstract.** The Choquet, Sugeno and Shilkret integrals with respect to monotone measures
are useful tools in decision support systems.
In this paper we propose a new class of graph-based integrals
that generalize these three operations.
Then, an efficient linear-time
algorithm for computing their special case,
that is l_{p}-indices, 1≤p<∞, is presented.
The algorithm is based on R.L. Graham's routine for determining
the convex hull of a finite planar set.

**Keywords.** Monotone measures, Choquet, Sugeno and Shilkret integral,
l_{p}-index, convex hull, Graham's scan, scientific impact indices

4.

Rowiński T., **Gagolewski M.**,
*Internet a kryzys*,
In: Jankowska M., Starzomska M. (Eds.),
*Kryzys: Pułapka czy szansa?*, WN Akapit, 2011,
pp. 211-224. isbn:978-83-609-5885-8

5.

6.

7.

Rowiński T., **Gagolewski M.**,
Preferencje i postawy wobec pomocy online,
*Studia Psychologica UKSW* **7**, 2007, pp. 195-210.