1.

Siudem G., Żogała-Siudem B.,
Cena A., **Gagolewski M.**,
Three dimensions of scientific impact,
*Proceedings of the National Academy of Sciences of the
United States of America (PNAS)* **117**(25), 2020, pp. 13896-13900. doi:10.1073/pnas.2001064117

**Abstract.**
The growing popularity of bibliometric indexes
(whose most famous example is the h index by J. E. Hirsch
[J. E. Hirsch, Proc. Natl. Acad. Sci. U.S.A. 102, 16569–16572 (2005)])
is opposed by those claiming that one's scientific impact cannot be reduced
to a single number. Some even believe that our complex reality fails
to submit to any quantitative description. We argue that neither of
the two controversial extremes is true. By assuming that some citations
are distributed according to the rich get richer rule (success breeds
success, preferential attachment) while some others are assigned totally
at random (all in all, a paper needs a bibliography), we have crafted
a model that accurately summarizes citation records with merely
three easily interpretable parameters: productivity, total impact,
and how lucky an author has been so far.

**Keywords.** science of science, scientometrics, bibliometric indexes, rich get richer

2.

Cena A., **Gagolewski M.**,
Genie+OWA: Robustifying Hierarchical Clustering with OWA-based Linkages,
*Information Sciences* **520**, 2020, pp. 324-336. doi:10.1016/j.ins.2020.02.025

**Abstract.**
We investigate the application of the Ordered Weighted
Averaging (OWA) data fusion operator in agglomerative hierarchical
clustering. The examined setting generalises the well-known single,
complete and average linkage schemes. It allows to embody expert
knowledge in the cluster merge process and to provide a much wider
range of possible linkages. We analyse various families of weighting
functions on numerous benchmark data sets in order to assess their
influence on the resulting cluster structure. Moreover, we inspect
the correction for the inequality of cluster size distribution --
similar to the one in the Genie algorithm. Our results demonstrate
that by robustifying the procedure with the Genie correction,
we can obtain a significant performance boost in terms of clustering
quality. This is particularly beneficial in the case of the linkages
based on the closest distances between clusters, including the single
linkage and its "smoothed" counterparts. To explain this behaviour,
we propose a new linkage process called three-stage OWA which yields
further improvements. This way we confirm the intuition that
hierarchical cluster analysis should rather take into account
a few nearest neighbours of each point, instead of trying to adapt
to their non-local neighbourhood.

**Keywords.** hierarchical clustering, OWA, data fusion, aggregation, Genie

3.

Pérez-Fernández R., De Baets B., **Gagolewski M.**,
A taxonomy of monotonicity properties for the aggregation of multidimensional data,
*Information Fusion* **52**, 2019, pp. 322-334. doi:10.1016/j.inffus.2019.05.006

**Abstract.**
The property of monotonicity, which requires a function to preserve a given order,
has been considered the standard in the aggregation of real numbers for decades.
In this paper, we argue that, for the case of multidimensional data,
an order-based definition of monotonicity is far too restrictive.
We propose several meaningful alternatives to this property not involving
the preservation of a given order by returning to its early origins stemming
from the field of calculus. Numerous aggregation methods for multidimensional
data commonly used by practitioners are studied within our new framework.

**Keywords.** monotonicity, aggregation, multidimensional data, centroid, spatial median

4.

**Abstract.**
The problem of learning symmetric capacities (or fuzzy measures)
from data is investigated toward applications in data analysis
and prediction as well as decision making. Theoretical results
regarding the solution minimizing the mean absolute error
are exploited to develop an exact branch-refine-and-bound-type algorithm
for fitting Sugeno integrals (weighted lattice polynomial functions,
max-min operators) with respect to symmetric capacities.
The proposed method turns out to be particularly suitable for acting
on ordinal data. In addition to providing a model that can be used
for the general data regression task, the results can be used,
among others, to calibrate generalized h-indices to bibliometric data.

**Keywords.** weight learning, ordinal data fitting, fuzzy measures, Sugeno integral, lattice polynomials, h-index

5.

**Abstract.**
The time needed to apply a hierarchical clustering algorithm is most often
dominated by the number of computations of a pairwise dissimilarity measure.
Such a constraint, for larger data sets, puts at a disadvantage the use of
all the classical linkage criteria but the single linkage one. However, it
is known that the single linkage clustering algorithm is very sensitive to
outliers, produces highly skewed dendrograms, and therefore usually does not
reflect the true underlying data structure – unless the clusters are well-separated.
To overcome its limitations, we propose a new hierarchical clustering linkage
criterion called Genie. Namely, our algorithm links two clusters in such a
way that a chosen economic inequity measure (e.g., the Gini- or Bonferroni-index)
of the cluster sizes does not increase drastically above a given threshold.
The presented benchmarks indicate a high practical usefulness of the introduced
method: it most often outperforms the Ward or average linkage in terms of the
clustering quality while retaining the single linkage speed. The Genie
algorithm is easily parallelizable and thus may be run on multiple threads
to speed up its execution further on. Its memory overhead is small: there
is no need to precompute the complete distance matrix to perform the
computations in order to obtain a desired clustering. It can be applied
on arbitrary spaces equipped with a dissimilarity measure, e.g., on real
vectors, DNA or protein sequences, images, rankings, informetric data,
etc. A reference implementation of the algorithm has been included
in the open source `genie` package for `R`.

**Keywords.** hierarchical clustering, single linkage, inequity measures, Gini-index

74 publications in total, including:

- 32 journal papers,
- 34 papers in proceedings of international conferences,
- 5 research monographs and textbooks,
- 3 edited volumes.

My current h-index = 14 (Google Scholar) / 10 (Scopus) / 10 (Web of Science).

ORCID = 0000-0003-0637-6028

My Erdős number = 4 (R. Mesiar - I. Assani - R.D. Mauldin - P. Erdős
*or* G. Siudem - T. Prellberg - P.J. Cameron - P. Erdős
*or* L. Coroianu - S.G. Gal - J. Szabados - P. Erdős).

My complete publication list is available in BibTeX format.

*Taken into account is the year when each bibliography entry was accepted for publication.*


1.

Siudem G., Żogała-Siudem B.,
Cena A., **Gagolewski M.**,
Three dimensions of scientific impact,
*Proceedings of the National Academy of Sciences of the
United States of America (PNAS)* **117**(25), 2020, pp. 13896-13900. doi:10.1073/pnas.2001064117

**Abstract.**
The growing popularity of bibliometric indexes
(whose most famous example is the h index by J. E. Hirsch
[J. E. Hirsch, Proc. Natl. Acad. Sci. U.S.A. 102, 16569–16572 (2005)])
is opposed by those claiming that one's scientific impact cannot be reduced
to a single number. Some even believe that our complex reality fails
to submit to any quantitative description. We argue that neither of
the two controversial extremes is true. By assuming that some citations
are distributed according to the rich get richer rule (success breeds
success, preferential attachment) while some others are assigned totally
at random (all in all, a paper needs a bibliography), we have crafted
a model that accurately summarizes citation records with merely
three easily interpretable parameters: productivity, total impact,
and how lucky an author has been so far.

**Keywords.** science of science, scientometrics, bibliometric indexes, rich get richer

2.

Bartoszuk M., **Gagolewski M.**,
SimilaR: R Code Clone and Plagiarism Detection,
*R Journal*, 2020, in press.

**Abstract.**
Third-party software for assuring source code quality is becoming increasingly
popular. Tools that evaluate the coverage of unit tests,
perform static code analysis, or inspect run-time memory use
are crucial in the software development life cycle.
More sophisticated methods allow for performing meta-analyses
of large software repositories, e.g., to discover abstract topics they relate to
or common design patterns applied by their developers.
They may be useful in gaining a better understanding of the component
interdependencies, avoiding cloned code as well as detecting plagiarism
in programming classes.
A meaningful measure of similarity of computer programs often forms
the basis of such tools. While there are a few noteworthy instruments
for similarity assessment, none of them turns out particularly suitable
for analysing R code chunks. Existing solutions rely on rather simple
techniques and heuristics and fail to provide a user with
the kind of sensitivity and specificity required for working with R scripts.
In order to fill this gap, we propose a new algorithm
based on a Program Dependence Graph, implemented in the SimilaR
package. It can serve as a tool not only for improving R code quality
but also for detecting plagiarism, even when it has been masked
by applying some obfuscation techniques or imputing dead code.
We demonstrate its accuracy and efficiency in a real-world case study.

**Keywords.** plagiarism detection, R, code clones

3.

Cena A., **Gagolewski M.**,
Genie+OWA: Robustifying Hierarchical Clustering with OWA-based Linkages,
*Information Sciences* **520**, 2020, pp. 324-336. doi:10.1016/j.ins.2020.02.025

**Abstract.**
We investigate the application of the Ordered Weighted
Averaging (OWA) data fusion operator in agglomerative hierarchical
clustering. The examined setting generalises the well-known single,
complete and average linkage schemes. It allows to embody expert
knowledge in the cluster merge process and to provide a much wider
range of possible linkages. We analyse various families of weighting
functions on numerous benchmark data sets in order to assess their
influence on the resulting cluster structure. Moreover, we inspect
the correction for the inequality of cluster size distribution --
similar to the one in the Genie algorithm. Our results demonstrate
that by robustifying the procedure with the Genie correction,
we can obtain a significant performance boost in terms of clustering
quality. This is particularly beneficial in the case of the linkages
based on the closest distances between clusters, including the single
linkage and its "smoothed" counterparts. To explain this behaviour,
we propose a new linkage process called three-stage OWA which yields
further improvements. This way we confirm the intuition that
hierarchical cluster analysis should rather take into account
a few nearest neighbours of each point, instead of trying to adapt
to their non-local neighbourhood.

**Keywords.** hierarchical clustering, OWA, data fusion, aggregation, Genie
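As a rough illustration of the OWA-based linkage idea described in the abstract above, the following Python sketch shows how particular weight vectors recover the single, complete, and average linkages. It is not the authors' implementation: the names `owa`, `owa_linkage`, `single`, `complete`, and `average`, the use of the Euclidean distance, and the convention of sorting the pairwise distances in ascending order are all assumptions made here.

```python
from itertools import product
from math import dist


def owa(values, weights):
    """Ordered weighted average: weights are applied to the values sorted ascendingly."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * v for w, v in zip(weights, sorted(values)))


def owa_linkage(cluster_a, cluster_b, make_weights):
    """Distance between two clusters as an OWA of all pairwise point distances.

    make_weights(m) is a hypothetical helper returning m nonnegative
    weights that sum to 1.
    """
    d = [dist(p, q) for p, q in product(cluster_a, cluster_b)]
    return owa(d, make_weights(len(d)))


def single(m):
    # All weight on the smallest pairwise distance.
    return [1.0] + [0.0] * (m - 1)


def complete(m):
    # All weight on the largest pairwise distance.
    return [0.0] * (m - 1) + [1.0]


def average(m):
    # Uniform weights give the arithmetic mean of the distances.
    return [1.0 / m] * m


a = [(0.0, 0.0), (0.0, 1.0)]
b = [(3.0, 0.0), (4.0, 0.0)]
print(owa_linkage(a, b, single), owa_linkage(a, b, complete), owa_linkage(a, b, average))
```

Other weight vectors, for example ones that spread the weight over a few of the smallest distances, would give "smoothed" variants of the single linkage in the sense mentioned in the abstract.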

4.

5.

Pérez-Fernández R., De Baets B., **Gagolewski M.**,
A taxonomy of monotonicity properties for the aggregation of multidimensional data,
*Information Fusion* **52**, 2019, pp. 322-334. doi:10.1016/j.inffus.2019.05.006

**Abstract.**
The property of monotonicity, which requires a function to preserve a given order,
has been considered the standard in the aggregation of real numbers for decades.
In this paper, we argue that, for the case of multidimensional data,
an order-based definition of monotonicity is far too restrictive.
We propose several meaningful alternatives to this property not involving
the preservation of a given order by returning to its early origins stemming
from the field of calculus. Numerous aggregation methods for multidimensional
data commonly used by practitioners are studied within our new framework.

**Keywords.** monotonicity, aggregation, multidimensional data, centroid, spatial median

6.

**Abstract.**
The problem of learning symmetric capacities (or fuzzy measures)
from data is investigated toward applications in data analysis
and prediction as well as decision making. Theoretical results
regarding the solution minimizing the mean absolute error
are exploited to develop an exact branch-refine-and-bound-type algorithm
for fitting Sugeno integrals (weighted lattice polynomial functions,
max-min operators) with respect to symmetric capacities.
The proposed method turns out to be particularly suitable for acting
on ordinal data. In addition to providing a model that can be used
for the general data regression task, the results can be used,
among others, to calibrate generalized h-indices to bibliometric data.

**Keywords.** weight learning, ordinal data fitting, fuzzy measures, Sugeno integral, lattice polynomials, h-index
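To make the link between Sugeno integrals with respect to symmetric capacities and h-type indices mentioned above more concrete, here is a minimal Python sketch. It only evaluates the integral and does not attempt the fitting algorithm from the paper. The function names and the representation of the capacity as a list `v`, where `v[k-1]` is the value assigned to any set of cardinality `k`, are assumptions made for this example. The h-index is obtained by taking `v[k-1] = k`, that is, by working on the scale of nonnegative integers rather than on [0, 1].

```python
def sugeno_symmetric(x, v):
    """Discrete Sugeno integral of x with respect to a symmetric capacity.

    v[k-1] is the (assumed) capacity of any set of cardinality k and
    should be nondecreasing in k. Only min and max are used, so the
    function is meaningful for ordinal data.
    """
    if len(x) != len(v):
        raise ValueError("x and v must have the same length")
    x_desc = sorted(x, reverse=True)
    # The k largest inputs form a set of cardinality k.
    return max(min(xi, vi) for xi, vi in zip(x_desc, v))


def h_index(citations):
    """The h-index as a special case with v = (1, 2, ..., n)."""
    n = len(citations)
    if n == 0:
        return 0
    return sugeno_symmetric(citations, list(range(1, n + 1)))


print(h_index([10, 8, 5, 4, 3]))
print(sugeno_symmetric([0.2, 0.9, 0.5], [0.3, 0.6, 1.0]))
```

Choosing a different nondecreasing vector `v` yields a generalized h-index; calibrating such a vector to data is the learning problem the abstract refers to.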

7.

Beliakov G., **Gagolewski M.**, James S.,
DC optimization for constructing discrete Sugeno integrals and learning nonadditive measures,
*Optimization*, 2019, in press. doi:10.1080/02331934.2019.1705300

**Abstract.**
Defined solely by means of order-theoretic operations meet (min) and join (max), weighted lattice polynomial functions are particularly useful for modeling data on an ordinal scale. A special case, the discrete Sugeno integral, defined with respect to a nonadditive measure (a capacity), enables accounting for the interdependencies between input variables.

However, until recently, the problem of identifying the fuzzy measure values with respect to various objectives and requirements has not received a great deal of attention. By expressing the learning problem as the difference of convex functions, we are able to apply DC (difference of convex) optimization methods. Here we formulate one of the global optimization steps as a local linear programming problem and investigate the improvement under different conditions.

**Keywords.** Aggregation functions, nonadditive measures, Sugeno integral, capacities, DC optimization

8.

**Abstract.**
In the field of information fusion, the problem of data aggregation has been formalized as an order-preserving process that builds upon the property of monotonicity. However, fields such as computational statistics, data analysis and geometry, usually emphasize the role of equivariances to various geometrical transformations in aggregation processes. Admittedly, if we consider a unidimensional data fusion task, both requirements are often compatible with each other. Nevertheless, in this paper we show that, in the multidimensional setting, the only idempotent functions that are monotone and orthogonal equivariant are the over-simplistic weighted centroids. Even more, this result still holds after replacing monotonicity and orthogonal equivariance by the weaker property of orthomonotonicity. This implies that the aforementioned approaches to the aggregation of multidimensional data are irreconcilable, and that, if a weighted centroid is to be avoided, we must choose between monotonicity and a desirable behaviour with regard to orthogonal transformations.

**Keywords.** multidimensional data aggregation, monotonicity, orthogonal equivariance, centroid

9.

Geras A., Siudem G., **Gagolewski M.**,
Should we introduce a dislike button for academic papers?,
*Journal of the Association for Information Science and Technology* **71**(2), 2020, pp. 221-229. doi:10.1002/ASI.24231

**Abstract.**
On the grounds of the revealed, mutual resemblance between the behaviour
of users of Stack Exchange and the dynamics of the citations accumulation
process in the scientific community, we tackled an outwardly
intractable problem of assessing the impact of introducing "negative" citations.

Although the most frequent reason to cite a paper is to highlight the
connection between the two publications, researchers sometimes mention
an earlier work to cast a negative light. While computing citation-based scores,
for instance the h-index, information about the reason why a paper was mentioned
is neglected. Therefore it can be questioned whether these indices describe
scientific achievements accurately.

In this contribution we shed insight into the problem of "negative" citations,
analysing data from Stack Exchange and, to draw more universal conclusions,
we derive an approximation of citations scores. Here we show that the quantified
influence of introducing negative citations is of lesser importance and
that they could be used as an indicator of
where the attention of the scientific community is allocated.

**Keywords.** citation analysis, the Hirsch index, negative citations, research evaluation, science of science

10.

Beliakov G., **Gagolewski M.**, James S.,
Robust fitting for the Sugeno integral with respect to general fuzzy measures,
*Information Sciences* **514**, 2020, pp. 449-461. doi:10.1016/j.ins.2019.11.024

**Abstract.**
The Sugeno integral is an expressive aggregation function with potential applications across a range of decision contexts. Its calculation requires only the lattice minimum and maximum operations, making it particularly suited to ordinal data and robust to scale transformations. However, for practical use in data analysis and prediction, we require efficient methods for learning the associated fuzzy measure. While such methods are well developed for the Choquet integral, the fitting problem is more difficult for the Sugeno integral because it is not amenable to being expressed as a linear combination of weights, and more generally due to plateaus and non-differentiability in the objective function. Previous research has hence focused on heuristic approaches or simplified fuzzy measures. Here we show that the problem of fitting the Sugeno integral to data such that the maximum absolute error is minimized can be solved using an efficient bilevel program. This method can be incorporated into algorithms that learn fuzzy measures with the aim of minimizing the median residual. This equips us with tools that make the Sugeno integral a feasible option in robust data regression and analysis. We provide experimental comparison with a genetic algorithms approach and an example in data analysis.

**Keywords.** Sugeno integral, fuzzy measure, parameter learning, aggregation functions

11.

Beliakov G., **Gagolewski M.**, James S.,
Aggregation on ordinal scales with the Sugeno integral for biomedical applications,
*Information Sciences* **501**, 2019, pp. 377-387. doi:10.1016/j.ins.2019.06.023

**Abstract.**
The Sugeno integral is a function particularly suited to the aggregation of ordinal inputs.
Defined with respect to a fuzzy measure, its ability to account for complementary and
redundant relationships between variables brings much potential to the field of biomedicine,
where it is common for measurements and patient information to be expressed qualitatively.
However, practical applications require well-developed methods for identifying the Sugeno integral's
parameters, and this task is not easily expressed using the standard optimisation approaches.
Here we formulate the objective function as the difference of two convex functions, which enables
the use of specialised numerical methods. Such techniques are compared with other global
optimisation frameworks through a number of numerical experiments.

**Keywords.** aggregation functions, fuzzy measures, Sugeno integral, capacities

12.

Coroianu L.,
**Gagolewski M.**, Grzegorzewski P.,
Piecewise linear approximation of fuzzy numbers: algorithms, arithmetic operations and stability of characteristics,
*Soft Computing* **23**(19), 2019, pp. 9491-9505. doi:10.1007/s00500-019-03800-2

**Abstract.**
The problem of the piecewise linear approximation of fuzzy
numbers giving outputs nearest to the inputs with respect to the
Euclidean metric is discussed. The results given in Coroianu et al.
(Fuzzy Sets Syst 233:26–51, 2013) for the 1-knot fuzzy numbers
are generalized for arbitrary n-knot (n ≥ 2) piecewise linear fuzzy numbers.
Some results on the existence and properties of the approximation operator are proved. Then, the stability of some fuzzy number characteristics under approximation as the number of knots tends to infinity is considered. Finally, a simulation study concerning the computer implementations of arithmetic operations on fuzzy numbers is provided. Suggested concepts are illustrated by examples and algorithms ready for practical use. This way, we throw a bridge between theory and applications as the latter ones are so desired in real-world problems.

**Keywords.** Approximation of fuzzy numbers,
Calculations on fuzzy numbers,
Characteristics of fuzzy numbers,
Fuzzy number,
Piecewise linear approximation

13.

Coroianu L., Fullér R., **Gagolewski M.**, James S.,
Constrained Ordered Weighted averaging aggregation with multiple comonotone constraints,
*Fuzzy Sets and Systems* **395**, 2020, pp. 21-39. doi:10.1016/j.fss.2019.09.006

**Abstract.**
The constrained ordered weighted averaging (OWA) aggregation problem
arises when we aim to maximize or minimize a convex combination of order
statistics under linear inequality constraints that act on the variables with
respect to their original sources. The standalone approach to optimizing
the OWA under constraints is to consider all permutations of the inputs,
which quickly becomes infeasible when there are more than a few variables;
however, in certain cases we can take advantage of the relationships amongst
the constraints and the corresponding solution structures. For example, we
can consider a land-use allocation satisfaction problem with an auxiliary aim
of balancing land-types, whereby the response curves for each species are
non-decreasing with respect to the land-types. This results in comonotone
constraints, which allow us to drastically reduce the complexity of the problem.

In this paper, we show that if we have an arbitrary number of constraints
that are comonotone (i.e., they share the same ordering permutation of the
coefficients), then the optimal solution occurs for decreasing components of
the solution. After investigating the form of the solution in some special cases
and providing theoretical results that shed light on the form of the solution,
we detail practical approaches to solving and give real-world examples.

**Keywords.** Multiple criteria evaluation; Ordered weighted averaging;
Constrained OWA aggregation; Ecology; Work allocation

14.

Coroianu L., **Gagolewski M.**,
*Penalty-based data aggregation in real normed vector spaces*,
In: Halaš R. et al. (Eds.), *New Trends in Aggregation Theory*
(*Advances in Intelligent Systems and Computing* **981**),
Springer, 2019, pp. 160-171. doi:10.1007/978-3-030-19494-9_15

**Abstract.** The problem of penalty-based data aggregation in generic real normed vector
spaces is studied. Some existence and uniqueness results are indicated.
Moreover, various properties of the aggregation functions are considered.

**Keywords.** penalty-based aggregation, prototype learning, means, averages, and medians, vector spaces, Fermat-Weber problem

15.

Halaš R., **Gagolewski M.**, Mesiar R. (Eds.),
*New Trends in Aggregation Theory*
(*Advances in Intelligent Systems and Computing* **981**),
Springer, 2019, 348 pp. doi:10.1007/978-3-030-19494-9 isbn:978-3-030-19493-2

16.

Lasek J., **Gagolewski M.**,
The efficacy of league formats in ranking teams,
*Statistical Modelling* **18**(5-6), 2018, pp. 411-435. doi:10.1177/1471082X18798426

**Abstract.**
The efficacy of different league formats in ranking teams according to their true latent strength is analysed.
To this end, a new approach for estimating attacking and defensive strengths based on the Poisson regression for modelling match outcomes is proposed. Various performance metrics are estimated reflecting the agreement between latent teams' strength parameters and their final rank in the league table. The tournament designs studied here are used
in the majority of European top-tier association football competitions.
Based on numerical experiments, it turns out that a two-stage league format
comprising the three round-robin tournament together with an extra single
round-robin is the most efficacious setting.
In particular, it is the most accurate in selecting the best team as the winner of the league.
Its efficacy can be enhanced by setting the number of points allocated for a win to two
(instead of the three currently in effect in association football).

**Keywords.** association football, league formats, rankings, rating systems, simulation, tournament design

17.

Beliakov G., **Gagolewski M.**, James S.,
*Least median of squares (LMS) and least trimmed squares (LTS) fitting for the weighted arithmetic mean*,
In: Medina J. et al. (Eds.), *Information Processing and Management of Uncertainty in Knowledge-Based Systems. Theory and Foundations*
(*Communications in Computer and Information Science* **854**),
Springer, 2018, pp. 367-378. doi:10.1007/978-3-319-91476-3_31

**Abstract.** We look at different approaches to learning the weights of the
weighted arithmetic mean such that the median residual or sum of the
smallest half of squared residuals is minimized. The more general problem
of multivariate regression has been well studied in the statistical literature;
however, in the case of aggregation functions, we have a restriction on
the weights, and the domain is usually restricted so that ‘outliers’ may
not be arbitrarily large. A number of algorithms are compared in terms
of accuracy and speed. Our results can be extended to other aggregation
functions.

**Keywords.** aggregation, LMS fitting, LTS fitting, approximation

18.

Beliakov G., **Gagolewski M.**,
James S., Pace S., Pastorello N., Thilliez E., Vasa R.,
Measuring traffic congestion: An approach based on learning weighted inequality, spread and aggregation indices from comparison data,
*Applied Soft Computing* **67**, 2018, pp. 910-919. doi:10.1016/j.asoc.2017.07.014

**Abstract.**
As cities increase in size, governments and councils face the problem of
designing infrastructure and approaches to traffic management that alleviate
congestion. The problem of objectively measuring congestion involves taking
into account not only the volume of traffic moving throughout a network, but
also the inequality or spread of this traffic over major and minor intersections.
For modelling such data, we investigate the use of weighted congestion indices
based on various aggregation and spread functions. We formulate the weight
learning problem for comparison data and use real traffic data obtained from
a medium-sized Australian city to evaluate their usefulness.

**Keywords.** aggregation functions, inequality indices, spread measures,
learning weights, traffic analysis

19.

**Abstract.** The Sugeno integral has numerous successful applications,
including but not limited to the areas of decision making, preference modeling,
and bibliometrics. Despite this, the current state of the development of usable
algorithms for numerically fitting the underlying discrete fuzzy measure based
on a sample of prototypical values – even in the simplest possible case, i.e.,
assuming the symmetry of the capacity – is yet to reach a satisfactory level.
Thus, the aim of this paper is to present some results and observations
concerning this class of data approximation problems.

**Keywords.** Sugeno integral, aggregation functions, machine learning, regression, approximation

20.

Bartoszuk M., **Gagolewski M.**,
*Binary aggregation functions in software plagiarism detection*,
In: *Proc. FUZZ-IEEE'17*, IEEE, 2017, no. 8015582. doi:10.1109/FUZZ-IEEE.2017.8015582

**Abstract.** Supervised learning is of key interest in data science.
Even though there exist many approaches to solving, among others,
classification as well as ordinal and standard regression tasks,
most of them output models that do not possess useful formal properties,
like nondecreasingness in each independent variable, idempotence,
symmetry, etc. This makes them difficult to interpret and analyze.
For instance, it might be impossible to determine the importances of
individual features or to assess the effects of increasing the values
of predictors on the behavior of a chosen response variable. Such
properties are especially important in software plagiarism detection,
where we are faced with combining the degree to which
a code chunk A is similar to (or contained in) B with the degree to which
B is similar to A. Therefore, in this paper we consider a new method
for fitting B-spline tensor product-based aggregation functions to
empirical data. An empirical study indicates a highly competitive
performance of the resulting models. Additionally, they possess an
intuitive interpretation which is highly desirable for end-users.

21.

Cena A., **Gagolewski M.**,
*OWA-based linkage and the Genie correction for hierarchical clustering*,
In: *Proc. FUZZ-IEEE'17*, IEEE, 2017, no. 8015652. doi:10.1109/FUZZ-IEEE.2017.8015652

**Abstract.** In this paper we thoroughly investigate various OWA-based
linkages in hierarchical clustering on numerous benchmark data sets.
The inspected setting generalizes the well-known single, complete,
and average linkage schemes, among others. The incorporation of
weights into the cluster merge procedure creates an opportunity
to make use of experts' knowledge about a particular data domain
so as to generate partitions of a given data set that better
reflect the true underlying cluster structure. Moreover, we
introduce a correction for the inequality of cluster size distribution
— similar to the one proposed in our recently introduced Genie algorithm
— which results in a significant performance boost in terms of clustering quality.

22.

**Abstract.**
Research in aggregation theory is nowadays still mostly focused on algorithms
to summarize tuples consisting of observations in some real interval
or of diverse general ordered structures. Of course, in practice
of information processing many other data types between these
two extreme cases are worth inspecting. This contribution deals with
the aggregation of lists of data points in **R**^{d} for arbitrary d≥1.
Even though particular functions aiming to summarize multidimensional data
have been discussed by researchers in data analysis,
computational statistics and geometry, there is clearly a need to provide
a comprehensive and unified model in which their properties
like equivariances to geometric transformations, internality, and monotonicity
may be studied at an appropriate level of generality.
The proposed penalty-based approach
serves as a common framework for all idempotent information aggregation
methods, including componentwise functions,
pairwise distance minimizers, and data depth-based medians. It also
allows for deriving many new practically useful tools.

**Keywords.** multidimensional data aggregation, penalty functions, data depth, centroid, median

23.

**Abstract.**
The time needed to apply a hierarchical clustering algorithm is most often
dominated by the number of computations of a pairwise dissimilarity measure.
Such a constraint, for larger data sets, puts at a disadvantage the use of
all the classical linkage criteria but the single linkage one. However, it
is known that the single linkage clustering algorithm is very sensitive to
outliers, produces highly skewed dendrograms, and therefore usually does not
reflect the true underlying data structure – unless the clusters are well-separated.
To overcome its limitations, we propose a new hierarchical clustering linkage
criterion called Genie. Namely, our algorithm links two clusters in such a
way that a chosen economic inequity measure (e.g., the Gini- or Bonferroni-index)
of the cluster sizes does not increase drastically above a given threshold.
The presented benchmarks indicate a high practical usefulness of the introduced
method: it most often outperforms the Ward or average linkage in terms of the
clustering quality while retaining the single linkage speed. The Genie
algorithm is easily parallelizable and thus may be run on multiple threads
to speed up its execution further. Its memory overhead is small: there
is no need to precompute the complete distance matrix
in order to obtain a desired clustering. It can be applied
in arbitrary spaces equipped with a dissimilarity measure, e.g., on real
vectors, DNA or protein sequences, images, rankings, informetric data,
etc. A reference implementation of the algorithm has been included
in the open source `genie` package for `R`.

**Keywords.** hierarchical clustering, single linkage, inequity measures, Gini-index
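
The inequity check mentioned in the abstract can be sketched as follows. This is only an illustration, assuming one common normalized form of the Gini index (0 for perfectly balanced cluster sizes, 1 when a single cluster holds all points); the threshold value `0.3` and the function names are assumptions made for the example, and the linkage procedure itself is not reproduced.

```python
def gini_index(sizes):
    """Normalized Gini index of cluster sizes (assumed form, values in [0, 1])."""
    n = len(sizes)
    total = sum(sizes)
    if n < 2 or total == 0:
        return 0.0
    # Sum of absolute differences over all unordered pairs of clusters.
    pairwise = sum(
        abs(sizes[i] - sizes[j]) for i in range(n) for j in range(i + 1, n)
    )
    return pairwise / ((n - 1) * total)

def exceeds_threshold(sizes, threshold=0.3):
    """True if the cluster size distribution is too unequal (illustrative threshold)."""
    return gini_index(sizes) > threshold

balanced = [5, 5, 5, 5]
skewed = [17, 1, 1, 1]
print(gini_index(balanced), exceeds_threshold(balanced))
print(gini_index(skewed), exceeds_threshold(skewed))
```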

24.

Beliakov G., **Gagolewski M.**, James S.,
Penalty-based and other representations of economic inequality,
*International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems*
**24**(Suppl. 1), 2016, pp. 1-23. doi:10.1142/S0218488516400018

**Abstract.**
Economic inequality measures are employed as a key component in various
socio-demographic indices to capture the disparity between the wealthy and poor.
Since their inception, they have also been used as a basis for modelling
spread and disparity in other contexts. While recent research has identified
that a number of classical inequality and welfare functions can be considered
in the framework of OWA operators, here we propose a framework of
penalty-based aggregation functions and their associated penalties as
measures of inequality.

**Keywords.** penalty functions, aggregation functions, inequality indices, spread measures

25.

**Abstract.**
The paper discusses a generalization of the nearest centroid hierarchical
clustering algorithm. A first extension deals with the incorporation
of generic distance-based penalty minimizers instead of the classical
aggregation by means of centroids. Owing to that, the presented algorithm can be applied
in spaces equipped with an arbitrary dissimilarity measure (images, DNA sequences, etc.).
Secondly, a correction preventing the formation
of clusters of too highly unbalanced sizes is applied: just like in the
recently introduced *Genie* approach, which extends the single linkage scheme,
the new method averts a chosen inequity measure (e.g., the Gini-, de Vergottini-,
or Bonferroni-index) of cluster sizes from rising above a predefined
threshold. Numerous benchmarks indicate that the introduction of such
a correction increases the quality of the resulting clusterings.

**Keywords.** hierarchical clustering, aggregation, centroid, Gini-index, Genie algorithm

26.

Bartoszuk M., Beliakov G., **Gagolewski M.**, James S.,
*Fitting aggregation functions to data: Part I – Linearization and regularization*,
In: Carvalho J.P. et al. (Eds.),
*Information Processing and Management of Uncertainty in Knowledge-Based Systems*,
Part II (*Communications in Computer and Information Science* **611**),
Springer, 2016, pp. 767-779. doi:10.1007/978-3-319-40581-0_62

**Abstract.**
The use of supervised learning techniques for fitting weights and/or generator
functions of weighted quasi-arithmetic means – a special class of idempotent
and nondecreasing aggregation functions – to empirical data has already been
considered in a number of papers.
Nevertheless, there are still some important issues that have not been
discussed in the literature yet. In the first part of this two-part contribution
we deal with the concept of regularization, a quite standard technique from machine learning
applied so as to increase the fit quality on test and validation data samples.
Due to the constraints on the weighting vector,
it turns out that quite different methods can be used in the current framework, as
compared to regression models.
Moreover, it is worth noting that so far fitting weighted
quasi-arithmetic means to empirical data has only been performed
approximately, via the so-called linearization technique.
In this paper we consider exact solutions to such special optimization tasks
and indicate cases where linearization leads to much worse solutions.

**Keywords.** Aggregation functions, weighted quasi-arithmetic means,
least squares fitting, regularization, linearization
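
For readers unfamiliar with the class of functions being fitted, a weighted quasi-arithmetic mean with generator φ and weights w (nonnegative, summing to 1) is φ⁻¹(Σ wᵢ φ(xᵢ)). The sketch below only evaluates such a mean for three standard generators; it does not implement the fitting, regularization, or linearization discussed in the paper, and the function name is our own.

```python
import math

def weighted_quasi_arithmetic_mean(x, w, phi, phi_inv):
    """Evaluate phi_inv(sum_i w_i * phi(x_i)) for a weighting vector w."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    if any(wi < 0 for wi in w) or not math.isclose(sum(w), 1.0):
        raise ValueError("weights must be nonnegative and sum to 1")
    return phi_inv(sum(wi * phi(xi) for wi, xi in zip(w, x)))

x = [1.0, 4.0]
w = [0.5, 0.5]

# Generator phi(t) = t gives the weighted arithmetic mean.
arithmetic = weighted_quasi_arithmetic_mean(x, w, lambda t: t, lambda t: t)
# Generator phi(t) = log(t) gives the weighted geometric mean.
geometric = weighted_quasi_arithmetic_mean(x, w, math.log, math.exp)
# Generator phi(t) = 1/t gives the weighted harmonic mean.
harmonic = weighted_quasi_arithmetic_mean(x, w, lambda t: 1 / t, lambda t: 1 / t)

print(arithmetic, geometric, harmonic)
```

The constraints on `w` checked above are the ones that make regularization in this setting differ from ordinary regression.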

27.

Bartoszuk M., Beliakov G., **Gagolewski M.**, James S.,
*Fitting aggregation functions to data: Part II – Idempotization*,
In: Carvalho J.P. et al. (Eds.),
*Information Processing and Management of Uncertainty in Knowledge-Based Systems*,
Part II (*Communications in Computer and Information Science* **611**),
Springer, 2016, pp. 780-789. doi:10.1007/978-3-319-40581-0_63

**Abstract.**
The use of supervised learning techniques for fitting weights and/or generator
functions of weighted quasi-arithmetic means – a special class of idempotent
and nondecreasing aggregation functions – to empirical data has already been
considered in a number of papers.
Nevertheless, there are still some important issues that have not been
discussed in the literature yet. In the second part of this two-part contribution
we deal with a quite common situation in which we have inputs coming from
different sources, describing a similar phenomenon, but which
have not been properly normalized. In such a case,
idempotent and nondecreasing functions cannot be used to aggregate them
unless proper pre-processing is performed.
The proposed idempotization method, based on the notion of B-splines,
allows for an automatic calibration of independent variables.
The introduced technique is applied in an R source code plagiarism
detection system.

**Keywords.** Aggregation functions, weighted quasi-arithmetic means,
least squares fitting, idempotence

28.

Cena A., **Gagolewski M.**,
*Fuzzy k-minpen clustering and k-nearest-minpen classification procedures incorporating generic distance-based penalty minimizers*,
In: Carvalho J.P. et al. (Eds.),
*Information Processing and Management of Uncertainty in Knowledge-Based Systems*,
Part II (*Communications in Computer and Information Science* **611**),
Springer, 2016, pp. 445-456. doi:10.1007/978-3-319-40581-0_36

**Abstract.**
We discuss a generalization of the fuzzy (weighted) k-means clustering procedure
and point out its relationships with data aggregation in spaces equipped with
arbitrary dissimilarity measures. In the proposed setting, a
data set partitioning is performed based on the notion of points' proximity to generic
distance-based penalty minimizers. Moreover, a new data classification algorithm,
resembling the k-nearest neighbors scheme but less computationally and memory
demanding, is introduced. Rich examples in complex data domains
indicate the usability of the methods and aggregation theory in general.

**Keywords.** fuzzy k-means algorithm, clustering, classification, fusion functions, penalty minimizers

29.

`Python`

`Python`

30.

Ferraro M.B., Giordani P., Vantaggi B.,
**Gagolewski M.**, Gil M.Á., Grzegorzewski P.,
Hryniewicz O. (Eds.),
*Soft Methods for Data Science*
(*Advances in Intelligent Systems and Computing* **456**), Springer, 2017, 535 pp. doi:10.1007/978-3-319-42972-4 isbn:978-3-319-42971-7

31.

Mesiar R., **Gagolewski M.**,
H-index and other Sugeno integrals: Some defects and their compensation,
*IEEE Transactions on Fuzzy Systems* **24**(6), 2016, pp. 1668-1672. doi:10.1109/TFUZZ.2016.2516579

**Abstract.**
The famous Hirsch index was introduced only ca. 10 years ago.
Despite that, it is already widely used in many decision making
tasks, like in evaluation of individual scientists, research
grant allocation, or even production planning.
It is known that the h-index is related to the discrete
Sugeno integral and the Ky Fan metric introduced in the 1940s.
The aim of this paper is to propose a few modifications of this index
as well as other fuzzy integrals – also on bounded chains – that lead
to better discrimination of some types of data that are to be aggregated.
All of the suggested compensation methods try to retain the simplicity
of the original measure.

**Keywords.** h-index, Sugeno integral, Ky Fan metric, Shilkret integral, decomposition integrals
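
The link between the h-index and the discrete Sugeno integral can be made concrete with a max-min formula: for citation counts sorted nonincreasingly, h = maxᵢ min(x₍ᵢ₎, i). The sketch below computes the plain h-index this way; it is an illustration of that standard identity and does not implement the modified indices proposed in the paper.

```python
def h_index(citations):
    """h-index via the max-min (Sugeno-like) formula: max_i min(x_(i), i)."""
    x = sorted(citations, reverse=True)
    return max((min(xi, i) for i, xi in enumerate(x, start=1)), default=0)

print(h_index([10, 8, 5, 4, 3]))
print(h_index([25, 8, 5, 3, 3]))
```

Note that the two records above differ considerably in total citations yet get close values, which is the kind of weak discrimination that motivates compensation methods.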

32.

Lasek J., Szlavik Z., **Gagolewski M.**, Bhulai S.,
How to improve a team's position in the FIFA ranking – A simulation study, *Journal of Applied Statistics*
**43**(7), 2016, pp. 1349-1368. doi:10.1080/02664763.2015.1100593

**Abstract.** In this paper, we study the efficacy of the official ranking for international football teams compiled by FIFA,
the body governing football competition around the globe. We present strategies for improving a team's position in the ranking.
By combining several statistical techniques we derive an objective function in a decision problem of optimal
scheduling of future matches. The presented results display how a team's position can be improved.
Along the way, we compare the official procedure to the famous Elo rating system. Although it originates
from chess, it has been successfully tailored to ranking football teams as well.

**Keywords.** association football, FIFA ranking, prediction models, Monte Carlo simulations, optimal schedule, team rankings
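
For reference, the basic Elo update that the abstract alludes to can be sketched as below. This is the textbook chess formulation; the value `k=20.0` is an illustrative assumption, and the football-specific tailoring studied in the paper is not reproduced here.

```python
def elo_expected(r_a, r_b, scale=400.0):
    """Expected score of side A against side B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / scale))

def elo_update(r_a, r_b, score_a, k=20.0):
    """Return updated ratings; score_a is 1 (win), 0.5 (draw), or 0 (loss) for A."""
    delta = k * (score_a - elo_expected(r_a, r_b))
    return r_a + delta, r_b - delta

# Two equally rated teams; team A wins.
new_a, new_b = elo_update(1500.0, 1500.0, 1.0)
print(new_a, new_b)
```

The update is zero-sum: whatever one side gains, the other loses.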

33.

Żogała-Siudem B., Siudem G.,
Cena A., **Gagolewski M.**,
Agent-based model for the h-index – Exact solution,
*European Physical Journal B* **89**:21, 2016.
doi:10.1140/epjb/e2015-60757-1

**Abstract.**
Hirsch’s h-index is perhaps the most popular citation-based measure of
scientific excellence. In 2013, Ionescu and Chopard proposed an agent-based
model describing a process for generating publications and citations in an
abstract scientific community [G. Ionescu, B. Chopard, Eur. Phys. J. B 86,
426 (2013)]. Within such a framework, one may simulate a scientist’s activity,
and – by extension – investigate the whole community of researchers.
Even though the Ionescu and Chopard model predicts the h-index quite well,
the authors provided a solution based solely on simulations. In this paper,
we complete their results with exact, analytic formulas. What is more, by
considering a simplified version of the Ionescu-Chopard model, we obtained
a compact, easy to compute formula for the h-index. The derived approximate
and exact solutions are investigated on simulated and real-world data sets.

**Keywords.** Statistical and nonlinear physics, preferential attachment rule, h-index

34.

Cena A., **Gagolewski M.**, Mesiar R.,
Problems and challenges of information resources producers' clustering, *Journal of Informetrics* **9**(2),
2015, pp. 273–284. doi:10.1016/j.joi.2015.02.005

**Abstract.** Classically, unsupervised machine learning techniques are applied
on data sets with a fixed number of attributes (variables).
However, many problems encountered in the field of informetrics
require extending such methods so that they may
be computed over a set of nonincreasingly ordered vectors of unequal lengths.
Thus, in this paper, some new dissimilarity measures (metrics)
are introduced and studied.
Owing to that, we may use, among others, hierarchical clustering algorithms
in order to determine an input data set's partition
consisting of sets of producers that are homogeneous not only with respect to
the quality of information resources, but also their quantity.

**Keywords.** aggregation, hierarchical clustering, distance, metric

35.

Lasek J., **Gagolewski M.**,
*The winning solution to the AAIA'15 Data Mining Competition: Tagging firefighter activities at a fire scene*,
In:
Ganzha M., Maciaszek L., Paprzycki M. (Eds.),
*Proc. FedCSIS'15*, IEEE, 2015, pp. 375-380. doi:10.15439/2015F418

**Abstract.** Multi-sensor based classification of professionals' activities
plays a key role in ensuring the success of their goals. In this paper
we present the winning solution to the *AAIA'15 Tagging Firefighter
Activities at a Fire Scene* data mining competition. The approach
is based on a Random Forest classifier trained on an input data set with
almost 5000 features describing the underlying time series of sensory data.

**Keywords.** Activity tagging, movement tagging, data mining competition, Random Forest model, FFT

36.

Cena A., **Gagolewski M.**,
*A K-means-like algorithm for informetric data clustering*,
In: Alonso J.M., Bustince H., Reformat M. (Eds.),
*Proc. IFSA/EUSFLAT 2015*,
Atlantis Press, 2015, pp. 536-543. doi:10.2991/ifsa-eusflat-15.2015.77

**Abstract.** The K-means algorithm is one of the most often used clustering techniques.
However, when it comes to discovering clusters in informetric data sets
that consist of non-increasingly ordered vectors of not necessarily conforming
lengths, such a method cannot be applied directly.
Hence, in this paper, we propose a K-means-like algorithm
to determine groups of producers that are similar
not only with respect to the quality of information resources they output,
but also their quantity.

**Keywords.** k-means clustering, informetrics, aggregation, impact functions

37.

Bartoszuk M., **Gagolewski M.**,
*Detecting similarity of R functions via a fusion of multiple heuristic methods*,
In: Alonso J.M., Bustince H., Reformat M. (Eds.),
*Proc. IFSA/EUSFLAT 2015*,
Atlantis Press, 2015, pp. 419-426. doi:10.2991/ifsa-eusflat-15.2015.61

**Abstract.** In this paper we describe recent advances in our R code similarity detection algorithm.
We propose a modification of the Program Dependence Graph (PDG) procedure used in the GPLAG system
that better fits the nature of functional programming languages like R.
The major strength of our approach lies in a proper
aggregation of outputs of multiple plagiarism detection methods,
as it is well known that no single technique gives perfect results.
It turns out that the incorporation of the PDG algorithm
significantly improves the recall ratio, i.e. it is better
in indicating true positive cases of plagiarism or code
cloning patterns. The implemented system is available
as a web application at http://SimilaR.Rexamine.com/.

**Keywords.** R, plagiarism and code cloning detection,
fuzzy proximity relations, aggregation,
program dependence graph, t-norms

38.

**Abstract.** In the field of informetrics, agents are often represented
by numeric sequences of not necessarily conforming lengths.
There are numerous aggregation techniques for such sequences,
e.g., the g-index or the h-index, that may be used to compare the output
of pairs of agents. In this paper we address the question of whether such impact
indices may be used to model experts' preferences accurately.

**Keywords.** preference learning, fuzzy relations, informetrics, aggregation, h-index

39.

**Abstract.** Aggregation theory often deals with measures of central tendency of quantitative data.
As sometimes a different kind of information fusion is needed,
an axiomatization of spread measures was introduced recently. In this contribution
we explore the properties of WD_{p}WAM and WD_{p}OWA operators,
which are defined as weighted L_{p}-distances to weighted
arithmetic mean and OWA operators, respectively.
In particular, we give forms of vectors that maximize
such fusion functions and thus provide a way to normalize the output value
so that the vector of maximal spread always leads to a fixed outcome, e.g., 1
if all the input elements are in [0,1].
This might be desirable when constructing measures of experts' opinions consistency or diversity
in group decision making problems.

**Keywords.** data fusion, aggregation, spread, deviation, variance, OWA operators

40.

Cena A., **Gagolewski M.**,
*Aggregation and soft clustering of informetric data*,
In: Baczyński M., De Baets B., Mesiar R. (Eds.),
*Proc. 8th International Summer School on Aggregation Operators (AGOP 2015)*,
University of Silesia, 2015, pp. 79-84. isbn:978-83-8012-519-3

**Abstract.** The aim of this contribution is to inspect possible
applications of clustering techniques
computed over a set consisting of nonincreasingly ordered vectors
of possibly nonconforming lengths. Such data sets appear in the field of
informetrics, where one may need to evaluate the quality of information items,
e.g., research papers,
and their producers. In this paper we investigate the notion of cluster centers
as an aggregated representation of all vectors from a given cluster and analyze
them by means of aggregation operators.

**Keywords.** clustering, fuzzy clustering, c-means algorithm, distance, producers assessment problem

41.

**Abstract.** The aggregation theory usually takes an interest in
summarizing a predefined number of points in the real line.
In many applications, like in statistics, data analysis, and mining,
the notion of a mean – a nondecreasing, internal, and symmetric fusion function
– plays a key role. Nevertheless, when it comes to aggregating
a set of points in higher dimensional spaces, the componentwise
extension of monotonicity and internality might not be the best choice.
Instead, the invariance to certain classes of geometric transformations
seems to be crucial in such a case.

**Keywords.** aggregation, centroid, Tukey median, 1-center, 1-median, convex hull, affine invariance, orthogonalization

42.

Lasek J., **Gagolewski M.**,
*Estimation of tournament metrics for association football league formats*,
In: *Selected problems in information technologies (Proc. ITRIA'15 vol. 2)*,
Institute of Computer Science, Polish Academy of Sciences,
2015, pp. 67-78.

43.

Cena A., **Gagolewski M.**,
*Clustering and aggregation of informetric data sets*,
In: *Computational methods in data analysis (Proc. ITRIA'15 vol. 1)*,
Institute of Computer Science, Polish Academy of Sciences,
2015, pp. 5-26. isbn:978-83-63159-22-1

44.

45.

**Abstract.** The theory of aggregation most often deals with measures of central tendency.
However, sometimes a very different kind of a numeric vector's synthesis into a
single number is required. In this paper we introduce a class of mathematical functions
which aim to measure spread or scatter of one-dimensional quantitative data.
The proposed definition serves as a common, abstract framework for measures of
absolute spread known from statistics, exploratory data analysis and data mining,
e.g. the sample variance, standard deviation, range, interquartile range (IQR),
median absolute deviation (MAD), etc. Additionally, we develop new measures
of experts' opinions diversity or consensus in group decision making problems.
We investigate some properties of spread measures, show how they are related to
aggregation functions, and indicate their new potentially fruitful application areas.

**Keywords.** Group decisions and negotiations, aggregation, spread, deviation, variance
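
The classical spread measures listed in the abstract can be computed with the Python standard library as sketched below. The two properties checked at the end (invariance to translation, and a value of zero for constant vectors) are typical of measures of absolute spread; treating them as representative of the paper's axioms is our assumption. The helper names and the sample data are hypothetical.

```python
from statistics import median, quantiles, stdev, variance

def value_range(x):
    """Range: difference between the largest and the smallest observation."""
    return max(x) - min(x)

def iqr(x):
    """Interquartile range based on the 'inclusive' quartile definition."""
    q1, _, q3 = quantiles(x, n=4, method="inclusive")
    return q3 - q1

def mad(x):
    """Median absolute deviation from the median (unscaled)."""
    m = median(x)
    return median([abs(xi - m) for xi in x])

measures = {
    "variance": variance,  # sample variance (n - 1 in the denominator)
    "sd": stdev,
    "range": value_range,
    "IQR": iqr,
    "MAD": mad,
}

x = [2.0, 4.0, 4.0, 5.0, 7.0]
shifted = [xi + 10.0 for xi in x]
constant = [3.0, 3.0, 3.0]

for name, f in measures.items():
    # Spread of x, of its translation, and of a constant vector.
    print(name, f(x), f(shifted), f(constant))
```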

46.

Cena A., **Gagolewski M.**,
OM3: Ordered maxitive, minitive, and modular aggregation operators
– axiomatic and probabilistic properties in an arity-monotonic setting, *Fuzzy Sets and Systems* **264**,
2015, pp. 138-159. doi:10.1016/j.fss.2014.04.001

**Abstract.** The recently-introduced OM3 aggregation operators fulfill three
appealing properties: they are simultaneously minitive, maxitive, and modular.
Among the instances of OM3 operators we find e.g. OWMax and OWMin operators,
the famous Hirsch's h-index and all its natural generalizations.

In this paper the basic axiomatic and probabilistic properties
of extended, i.e. in an arity-dependent setting,
OM3 aggregation operators are studied.
We illustrate the difficulties one is inevitably faced with when
trying to combine the quality and quantity of numeric items
into a single number. The discussion on such aggregation methods
is particularly important in the information resources producers assessment problem,
which aims to reduce the negative effects of information overload.
It turns out that the Hirsch-like indices of impact
do not fulfill a set of very important properties, which puts the sensibility of their
practical usage into question.
Moreover, thanks to the probabilistic analysis of the operators in an i.i.d. model,
we may better understand the relationship between the aggregated items' quality and
their producers' productivity.

**Keywords.** Aggregation; ordered modularity, maxitivity and minitivity;
arity-monotonicity; impact assessment; Hirsch's h-index; informetrics

47.

**Abstract.** The producers assessment problem has many important practical
instances: it is an abstract model for intelligent systems evaluating
e.g. the quality of computer software repositories, web resources,
social networking services, and digital libraries. Each producer's
performance is determined according not only to the overall quality
of the items he/she outputted, but also to the number of such items
(which may be different for each agent).

Recent theoretical results indicate that the use of aggregation
operators in the process of ranking and evaluating producers
may not necessarily lead to fair and plausible outcomes. Therefore,
to overcome some weaknesses of the most often applied approach,
in this preliminary study we encourage the use of a fuzzy preference
relation-based setting and indicate why it may provide better
control over the assessment process.

**Keywords.** fuzzy relations, preference modeling, producers assessment problem, StackOverflow, bibliometrics, h-index

48.

**Abstract.** Sugeno integral-based confidence intervals for the theoretical
h-index of a fixed-length sequence of i.i.d. random variables are derived.
They are compared with other estimators of such a distribution characteristic
in a Pareto i.i.d. model. It turns out that in the first case we obtain
much wider intervals. It seems to be due to the fact that a Sugeno integral,
which may be applied on any ordinal scale, is known to ignore too
much information from cardinal-scale data being aggregated.

**Keywords.** h-index, Sugeno integral, confidence interval, Pareto distribution

49.

Bartoszuk M., **Gagolewski M.**,
*A fuzzy R code similarity detection algorithm*,
In: Laurent A. et al. (Eds.), *Information Processing and Management of Uncertainty in Knowledge-Based Systems*,
Part III (*Communications in Computer and Information Science* **444**), Springer, 2014, pp. 21-30. doi:10.1007/978-3-319-08852-5_3

**Abstract.** R is a programming language and software environment
for performing statistical computations
and applying data analysis that increasingly gains popularity
among practitioners and scientists. In this paper we present
a preliminary version of a system to detect pairs of similar R code blocks
among a given set of routines, which is based on a proper aggregation of the output of
three different [0,1]-valued (fuzzy) proximity degree estimation algorithms.
Its analysis on empirical data indicates that the system may in the future be successfully applied in practice
in order e.g. to detect plagiarism among students' homework submissions or to perform an analysis
of code recycling or code cloning in R's open source packages repositories.

**Keywords.** R, plagiarism detection, code cloning, fuzzy similarity measures

50.

Coroianu L., **Gagolewski M.**,
Grzegorzewski P., Adabitabar Firozja M., Houlari T.,
*Piecewise linear approximation of fuzzy numbers preserving the support and core*,
In: Laurent A. et al. (Eds.), *Information Processing and Management of Uncertainty in Knowledge-Based Systems*,
Part II (*Communications in Computer and Information Science* **443**), Springer, 2014, pp. 244-254. doi:10.1007/978-3-319-08855-6_25

**Abstract.** A reasonable approximation of a fuzzy number should have a simple
membership function, be close to the input fuzzy number, and should
preserve some of its important characteristics. In this
paper we suggest approximating a
fuzzy number by a piecewise linear 1-knot fuzzy number which is
the closest one to the input fuzzy number among all piecewise
linear 1-knot fuzzy numbers having the same core and the same
support as the input. We discuss the existence of the approximation
operator, show algorithms ready for practical
use, and illustrate the considered concepts by examples. It turns out that
such an approximation task may be problematic.

**Keywords.** Approximation of fuzzy numbers, core, fuzzy number,
piecewise linear approximation, support

51.

`R`. Analiza danych, obliczenia, symulacje
(`R` Programming. Data Analysis. Computing. Simulations)

52.

Grzegorzewski P., **Gagolewski M.**, Bobecka-Wesołowska K.,
*Wnioskowanie statystyczne z wykorzystaniem środowiska* `R`
(*Statistical Inference in* `R`),
Politechnika Warszawska,
2014, 183 pp. isbn:978-83-93-72601-1

53.

Grzegorzewski P., **Gagolewski M.**,
Hryniewicz O., Gil M.Á. (Eds.),
*Strengthening Links Between Data Analysis and Soft Computing*
(*Advances in Intelligent Systems and Computing* **315**), Springer, 2015, 294 pp. doi:10.1007/978-3-319-10765-3 isbn:978-3-319-10764-6

54.

**Abstract.** The Choquet, Sugeno, and Shilkret integrals
with respect to monotone measures,
as well as their generalization
– the universal integral, are useful tools in decision support systems.
In this paper we propose a general construction method for aggregation
operators that may be used in assessing output of scientists.
We show that the most often currently used indices of bibliometric impact,
like Hirsch's h, Woeginger's w, Egghe's g, Kosmulski's MAXPROD,
and similar constructions, may be obtained by means of our framework.
Moreover, the model easily leads to some new, very interesting functions.

**Keywords.** Choquet, Sugeno, Shilkret, universal integral;
monotone measures;
aggregation;
indices of scientific impact,
bibliometrics;
h-index, w-index, g-index, MAXPROD-index
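
To make the indices named in the abstract concrete, the sketch below computes three of them directly from their usual textbook definitions; it does not use the integral-based construction proposed in the paper. We assume citation counts are sorted nonincreasingly and, as a simplification, that the g-index cannot exceed the number of papers.

```python
def g_index(citations):
    """Egghe's g: largest g such that the top g papers have at least g^2 citations in total."""
    x = sorted(citations, reverse=True)
    g, running = 0, 0
    for i, xi in enumerate(x, start=1):
        running += xi
        if running >= i * i:
            g = i
    return g

def w_index(citations):
    """Woeginger's w: largest w such that the i-th best paper has >= w - i + 1 citations, i = 1..w."""
    x = sorted(citations, reverse=True)
    best = 0
    for w in range(1, len(x) + 1):
        # x[i] is the (i + 1)-th best paper, so the bound w - (i + 1) + 1 equals w - i.
        if all(x[i] >= w - i for i in range(w)):
            best = w
    return best

def maxprod_index(citations):
    """Kosmulski's MAXPROD: maximum of i * x_(i) over all ranks i."""
    x = sorted(citations, reverse=True)
    return max((i * xi for i, xi in enumerate(x, start=1)), default=0)

record = [10, 8, 5, 4, 3]
print(g_index(record), w_index(record), maxprod_index(record))
```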

55.

Coroianu L.,
**Gagolewski M.**, Grzegorzewski P.,
Nearest piecewise linear approximation of fuzzy numbers,
*Fuzzy Sets and Systems* **233**, 2013, pp. 26-51. doi:10.1016/j.fss.2013.02.005

**Abstract.** The problem of the nearest approximation of fuzzy numbers
by piecewise linear 1-knot fuzzy numbers is discussed. By using 1-knot
fuzzy numbers one may obtain approximations which are simple enough and
flexible to reconstruct the input fuzzy concepts under study. They might
also be perceived as a generalization of the trapezoidal approximations.
Moreover, these approximations possess some desirable properties.
Apart from theoretical considerations, approximation algorithms
that can be applied in practice are also given.

**Keywords.** Approximation of fuzzy numbers; Fuzzy number; Piecewise linear approximation

56.

**Abstract.** In this paper we deal with the problem of
aggregating numeric sequences of arbitrary length that represent
e.g. citation records of scientists. Impact functions are the aggregation operators that express as a
single number not only the quality of individual publications, but also their author's productivity.

We examine some fundamental properties of these aggregation tools. It turns out that each impact
function which always gives indisputable valuations must necessarily be trivial.
Moreover, it is shown that for any set of citation records in which none is dominated by the other, we
may construct an impact function that gives any a priori-established authors' ordering. Theoretically
then, there is considerable room for manipulation in the hands of decision makers.

We also discuss the differences between the impact function-based and the multicriteria decision
making-based approach to scientific quality management, and study how the introduction of new
properties of impact functions affects the assessment process. We argue that simple mathematical
tools like the h- or g-index (as well as other bibliometric impact indices) may not necessarily
be a good choice when it comes to assessing scientific achievements.

**Keywords.** Impact functions;
aggregation;
decision making;
preference modeling;
Hirsch's h-index;
scientometrics;
bibliometrics

57.

Cena A., **Gagolewski M.**,
*OM3: Ordered maxitive, minitive, and modular
aggregation operators – Part I: Axiomatic analysis under arity-dependence*,
In: Bustince H. et al. (Eds.),
*Aggregation Functions in Theory and in Practise*
(*Advances in Intelligent Systems and Computing* **228**), Springer, 2013, pp. 93-103. doi:10.1007/978-3-642-39165-1_13

**Abstract.** Recently, a very interesting relation between symmetric
minitive, maxitive, and modular aggregation operators has been shown.
It turns out that the intersection between any pair of the mentioned
classes is the same. This result introduces what we here propose
to call the OM3 operators. In the first part of our contribution
on the analysis of the OM3 operators we study some properties that
may be useful when aggregating input vectors of varying lengths.
In Part II we will perform a thorough simulation study of the
impact of input vectors’ calibration on the aggregation results.

58.

Cena A., **Gagolewski M.**,
*OM3: Ordered maxitive, minitive, and modular
aggregation operators – Part II: A simulation study*,
In: Bustince H. et al. (Eds.),
*Aggregation Functions in Theory and in Practise*
(*Advances in Intelligent Systems and Computing* **228**), Springer, 2013,
pp. 105-115. doi:10.1007/978-3-642-39165-1_14

**Abstract.** This article is the second part of the contribution
on the analysis of the recently-proposed class of symmetric maxitive,
minitive and modular aggregation operators. Recent results
(Gagolewski, Mesiar, 2012) indicated some unstable behavior
of the generalized h-index, which is a particular instance of
OM3, in case of input data transformation. The study was performed
on a small, carefully selected real-world data set.
Here we conduct some experiments to examine this phenomenon more extensively.

59.

**Abstract.** The Choquet, Sugeno and Shilkret integrals with respect to monotone measures
are useful tools in decision support systems.
In this paper we propose a new class of graph-based integrals
that generalize these three operations.
Then, an efficient linear-time
algorithm for computing their special case,
that is l_{p}-indices, 1≤p<∞, is presented.
The algorithm is based on R.L. Graham's routine for determining
the convex hull of a finite planar set.

**Keywords.** Monotone measures, Choquet, Sugeno and Shilkret integral,
l_{p}-index, convex hull, Graham's scan, scientific impact indices

60.

**Abstract.** In this paper the relationship between symmetric minitive,
maxitive, and modular aggregation operators is considered. It is shown
that the intersection between any two of the three discussed classes
is the same. Moreover, the intersection is explicitly characterized.

It turns out that the intersection contains families of aggregation
operators such as OWMax, OWMin, and many generalizations of the
widely-known Hirsch’s h-index, often applied in scientific quality control.

**Keywords.** Aggregation operators; OWMax; OMA; OWA; Hirsch’s h-index;
Scientometrics

**Comments.** Later we proposed that the symmetric minitive,
maxitive, and modular aggregation operators may be called the OM3 agops,
see (Cena A., Gagolewski M.,
*OM3: ordered maxitive, minitive, and modular
aggregation operators – Part I: Axiomatic analysis under arity-dependence*, 2013).

61.

**Abstract.** The process of assessing individual authors should
rely upon a proper aggregation of reliable and valid papers’ quality
metrics. Citations are merely one possible way to measure appreciation
of publications. In this study we propose some new, SJR- and SNIP-based
indicators, which not only take into account the broadly conceived
popularity of a paper (manifested by the number of citations),
but also other factors like its potential, or the quality of papers
that cite a given publication. We explore the relation and correlation
between different metrics and study how they affect the values of
a real-valued generalized h-index calculated for 11 prominent
scientometricians. We note that the h-index is a very unstable
impact function, highly sensitive to the scaling of input elements.
Our analysis is not only of theoretical significance: data scaling
is often performed to normalize citations across disciplines.
Uncontrolled application of this operation may lead to unfair and
biased (toward some groups) decisions. This puts the validity of
authors' assessment and ranking using the h-index into question.
Obviously, a good impact function to be used in practice
should not be as sensitive to changes in the input
data as the analyzed one.

**Keywords.** Aggregation operators; Impact functions; Hirsch's h-index;
Quality control; Scientometrics; Bibliometrics; SJR;
SNIP; Scopus; CITAN; R

**Comments.** An empirical paper. The ideas presented here
were later explored more thoroughly in (Cena A., Gagolewski M.,
*OM3: ordered maxitive, minitive, and modular
aggregation operators – Part II: A simulation study*, 2013).

62.

**Abstract.** In this paper we discuss the construction of a new
parametric statistical hypothesis test for the equality of
probability distributions. The test is based on the difference
between Hirsch’s h-indices of two equal-length i.i.d. random
samples. For the sake of illustration, we analyze its power
in the case of Pareto-distributed input data. It turns out that
the test is very conservative and has wide acceptance regions,
which puts in question the appropriateness of the h-index usage
in scientific quality control and decision making.

63.

**Abstract.** In this paper the recently introduced class of
effort-dominating impact functions is examined. It turns out
that each effort-dominating aggregation operator not only has a
very intuitive interpretation, but also is symmetric minitive, and
therefore may be expressed as a so-called quasi-I-statistic, which
generalizes the well-known OWMin operator.

These aggregation operators may be used e.g. in the Producer Assessment
Problem whose most important instance is the scientometric/bibliometric
issue of fair scientists’ ranking by means of the number of citations
received by their papers.

64.

**Abstract.** In this paper CITAN, the CITation ANalysis package for
the R statistical computing environment, is introduced. The main aim of the
software is to support bibliometricians with a tool for preprocessing
and cleaning bibliographic data retrieved from SciVerse Scopus and
for calculating the most popular indices of scientific impact.

To show the practical usability of the package, an exemplary assessment
of authors publishing in the fields of scientometrics and
webometrics is performed.

**Keywords.** Data analysis software; Quality control in science;
Citation analysis; Bibliometrics; Hirsch's h index;
Egghe's g index; SciVerse Scopus

65.

**Abstract.** A class of arity-monotonic aggregation operators,
called impact functions, is proposed. This family of operators forms
a theoretical framework for the so-called Producer Assessment Problem,
which includes the scientometric task of fair and objective assessment
of scientists using the number of citations received by their publications.

The impact function output values are analyzed under right-censored
and dynamically changing input data. The qualitative possibilistic
approach is used to describe this kind of uncertainty.
It leads to intuitive graphical interpretations and may
be easily applied for practical purposes.

The discourse is illustrated by a family of aggregation operators
generalizing the well-known Ordered Weighted Maximum (OWMax)
and the Hirsch h-index.

**Keywords.** Aggregation operators; Possibility theory; S-statistics; h-index; OWMax

**Comments.** In this paper the class of effort-dominating impact functions
has also been introduced. I have shown later (see Gagolewski M.,
*On the Relation Between Effort-Dominating and Symmetric Minitive Aggregation Operators*, 2012)
that all such aggregation operators are symmetric minitive.

66.

**Abstract.** Two classes of aggregation functions: L-statistics
and S-statistics and their generalizations called quasi-L-statistics
and quasi-S-statistics are considered. Some interesting characterizations
of these families of operators are given. The aforementioned functions
are useful for various applications. In particular, they are very helpful
for modeling the so-called Producer Assessment Problem.

67.

Rowiński T., **Gagolewski M.**,
*Internet a kryzys*,
In: Jankowska M., Starzomska M. (Eds.),
*Kryzys: Pułapka czy szansa?*, WN Akapit, 2011,
pp. 211-224. isbn:978-83-609-5885-8

68.

69.

**Abstract.** Some statistical properties of the so-called S-statistics,
which generalize the ordered weighted maximum aggregation operators,
are considered. In particular, the asymptotic normality of S-statistics
is proved and some possible applications in estimation problems are suggested.

70.

**Abstract.** A class of extended aggregation operators, called impact
functions, is proposed and their basic properties are examined.
Some important classes of functions like generalized ordered weighted
averaging (OWA) and ordered weighted maximum (OWMax) operators
are considered. The general idea is illustrated by the Producer
Assessment Problem which includes the scientometric problem of
rating scientists based on the number of citations received by
their publications. An interesting characterization of the well
known h-index is given.

71.

**Abstract.** Two broad classes of scientific impact indices
are proposed and their properties – both theoretical and practical –
are discussed. These new classes were obtained as a geometric
generalization of well-known tools applied in scientometrics,
like Hirsch’s h-index, Woeginger’s w-index, and Kosmulski’s Maxprod.
It is shown how to apply the suggested indices for estimation of
the shape of the citation function or the total number of citations
of an individual. Additionally, a new efficient and simple O(log n)
algorithm for computing the h-index is given.

**Keywords.** Hirsch's h-index, citation analysis, scientific impact indices

**Comments.** I have shown later (see Gagolewski M.,
*On the Relation Between Effort-Dominating and Symmetric Minitive Aggregation Operators*, 2012)
that the r_{p}-indices are symmetric minitive.
Moreover, we have found that there exists an O(n log n) algorithm
for determining l_{p} (see Gagolewski M., Dębski M., Nowakiewicz M.,
*Efficient Algorithm for Computing Certain Graph-Based Monotone Integrals: the l_{p}-Indices*, 2013).
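
To illustrate the O(log n) claim made in the abstract above, here is a minimal Python sketch. It assumes the citation counts are already sorted in non-increasing order and uses a plain binary search; it is not claimed to be the algorithm from the paper, and the function name `h_index_sorted` is hypothetical.

```python
def h_index_sorted(citations):
    """Return the h-index of a list of citation counts.

    Assumption: `citations` is sorted in non-increasing order.
    The predicate citations[i] >= i + 1 is then true on a prefix
    of the list and false afterwards, so binary search applies.
    """
    lo, hi = 0, len(citations)
    while lo < hi:
        mid = (lo + hi) // 2
        if citations[mid] >= mid + 1:
            lo = mid + 1   # paper mid+1 still has at least mid+1 citations
        else:
            hi = mid       # predicate fails here and for all later papers
    return lo              # number of papers satisfying the predicate


print(h_index_sorted([10, 8, 5, 4, 3]))
```

Each iteration halves the search interval, so the number of comparisons grows logarithmically with the number of papers. Sorting an unsorted record beforehand would of course dominate the cost.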

72.

73.

**Abstract.** The problem of measuring scientific impact is considered. A class
of so-called p-sphere (r_{p}) indices, which generalize the well-known
Hirsch index, is used to construct a possibility measure of
scientific impact. This measure might be treated as a
starting point for prediction of future index values or for dealing
with right-censored bibliometric data.

74.

Rowiński T., **Gagolewski M.**,
Preferencje i postawy wobec pomocy online,
*Studia Psychologica UKSW* **7**, 2007, pp. 195-210.

1.

Siudem G., Żogała-Siudem B.,
Cena A., **Gagolewski M.**,
Three dimensions of scientific impact,
*Proceedings of the National Academy of Sciences of the
United States of America (PNAS)* **117**(25), 2020, pp. 13896-13900. doi:10.1073/pnas.2001064117

**Abstract.**
The growing popularity of bibliometric indexes
(whose most famous example is the h index by J. E. Hirsch
[J. E. Hirsch, Proc. Natl. Acad. Sci. U.S.A. 102, 16569–16572 (2005)])
is opposed by those claiming that one's scientific impact cannot be reduced
to a single number. Some even believe that our complex reality fails
to submit to any quantitative description. We argue that neither of
the two controversial extremes is true. By assuming that some citations
are distributed according to the rich get richer rule (success breeds
success, preferential attachment) while some others are assigned totally
at random (all in all, a paper needs a bibliography), we have crafted
a model that accurately summarizes citation records with merely
three easily interpretable parameters: productivity, total impact,
and how lucky an author has been so far.

**Keywords.** science of science, scientometrics, bibliometric indexes, rich get richer

2.

Bartoszuk M., **Gagolewski M.**,
SimilaR: R Code Clone and Plagiarism Detection,
*R Journal*, 2020, in press.

**Abstract.**
Third-party software for assuring source code quality is becoming increasingly
popular. Tools that evaluate the coverage of unit tests,
perform static code analysis, or inspect run-time memory use
are crucial in the software development life cycle.
More sophisticated methods allow for performing meta-analyses
of large software repositories, e.g., to discover abstract topics they relate to
or common design patterns applied by their developers.
They may be useful in gaining a better understanding of the component
interdependencies, avoiding cloned code as well as detecting plagiarism
in programming classes.
A meaningful measure of similarity of computer programs often forms
the basis of such tools. While there are a few noteworthy instruments
for similarity assessment, none of them turns out particularly suitable
for analysing R code chunks. Existing solutions rely on rather simple
techniques and heuristics and fail to provide a user with
the kind of sensitivity and specificity required for working with R scripts.
In order to fill this gap, we propose a new algorithm
based on a Program Dependence Graph, implemented in the SimilaR
package. It can serve as a tool not only for improving R code quality
but also for detecting plagiarism, even when it has been masked
by applying some obfuscation techniques or imputing dead code.
We demonstrate its accuracy and efficiency in a real-world case study.

**Keywords.** plagiarism detection, R, code clones

3.

Cena A., **Gagolewski M.**,
Genie+OWA: Robustifying Hierarchical Clustering with OWA-based Linkages,
*Information Sciences* **520**, 2020, pp. 324-336. doi:10.1016/j.ins.2020.02.025

**Abstract.**
We investigate the application of the Ordered Weighted
Averaging (OWA) data fusion operator in agglomerative hierarchical
clustering. The examined setting generalises the well-known single,
complete and average linkage schemes. It allows one to embody expert
knowledge in the cluster merge process and to provide a much wider
range of possible linkages. We analyse various families of weighting
functions on numerous benchmark data sets in order to assess their
influence on the resulting cluster structure. Moreover, we inspect
the correction for the inequality of cluster size distribution --
similar to the one in the Genie algorithm. Our results demonstrate
that by robustifying the procedure with the Genie correction,
we can obtain a significant performance boost in terms of clustering
quality. This is particularly beneficial in the case of the linkages
based on the closest distances between clusters, including the single
linkage and its "smoothed" counterparts. To explain this behaviour,
we propose a new linkage process called three-stage OWA which yields
further improvements. This way we confirm the intuition that
hierarchical cluster analysis should rather take into account
a few nearest neighbours of each point, instead of trying to adapt
to their non-local neighbourhood.

**Keywords.** hierarchical clustering, OWA, data fusion, aggregation, Genie
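
The statement that OWA-based linkages generalise the single, complete, and average linkage schemes can be illustrated with a small Python sketch. It assumes that the weights are applied to the pairwise distances sorted in increasing order (conventions differ between texts); the function name `owa` and the sample distances are hypothetical and do not come from the paper.

```python
def owa(values, weights):
    """Ordered weighted average of `values`.

    Assumption: weights[0] is applied to the smallest value,
    weights[-1] to the largest one.
    """
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    ordered = sorted(values)
    return sum(w * v for w, v in zip(weights, ordered))


# Hypothetical distances between all pairs of points from two clusters.
distances = [2.0, 5.0, 3.0, 6.0]
n = len(distances)

single = owa(distances, [1.0] + [0.0] * (n - 1))    # all weight on the minimum
complete = owa(distances, [0.0] * (n - 1) + [1.0])  # all weight on the maximum
average = owa(distances, [1.0 / n] * n)             # uniform weights

print(single, complete, average)
```

Other weighting vectors interpolate between these three special cases, which is where the wider range of linkages comes from.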

4.

Pérez-Fernández R., De Baets B., **Gagolewski M.**,
A taxonomy of monotonicity properties for the aggregation of multidimensional data,
*Information Fusion* **52**, 2019, pp. 322-334. doi:10.1016/j.inffus.2019.05.006

**Abstract.**
The property of monotonicity, which requires a function to preserve a given order,
has been considered the standard in the aggregation of real numbers for decades.
In this paper, we argue that, for the case of multidimensional data,
an order-based definition of monotonicity is far too restrictive.
We propose several meaningful alternatives to this property not involving
the preservation of a given order by returning to its early origins stemming
from the field of calculus. Numerous aggregation methods for multidimensional
data commonly used by practitioners are studied within our new framework.

**Keywords.** monotonicity, aggregation, multidimensional data, centroid, spatial median

5.

**Abstract.**
The problem of learning symmetric capacities (or fuzzy measures)
from data is investigated toward applications in data analysis
and prediction as well as decision making. Theoretical results
regarding the solution minimizing the mean absolute error
are exploited to develop an exact branch-refine-and-bound-type algorithm
for fitting Sugeno integrals (weighted lattice polynomial functions,
max-min operators) with respect to symmetric capacities.
The proposed method turns out to be particularly suitable for acting
on ordinal data. In addition to providing a model that can be used
for the general data regression task, the results can be used,
among others, to calibrate generalized h-indices to bibliometric data.

**Keywords.** weight learning, ordinal data fitting, fuzzy measures, Sugeno integral, lattice polynomials, h-index

6.

Beliakov G., **Gagolewski M.**, James S.,
DC optimization for constructing discrete Sugeno integrals and learning nonadditive measures,
*Optimization*, 2019, in press. doi:10.1080/02331934.2019.1705300

**Abstract.**
Defined solely by means of order-theoretic operations meet (min) and join (max), weighted lattice polynomial functions are particularly useful for modeling data on an ordinal scale. A special case, the discrete Sugeno integral, defined with respect to a nonadditive measure (a capacity), enables accounting for the interdependencies between input variables.

However, until recently, the problem of identifying the fuzzy measure values with respect to various objectives and requirements has not received a great deal of attention. By expressing the learning problem as the difference of convex functions, we are able to apply DC (difference of convex) optimization methods. Here we formulate one of the global optimization steps as a local linear programming problem and investigate the improvement under different conditions.

**Keywords.** Aggregation functions, nonadditive measures, Sugeno integral, capacities, DC optimization

7.

**Abstract.**
In the field of information fusion, the problem of data aggregation has been formalized as an order-preserving process that builds upon the property of monotonicity. However, fields such as computational statistics, data analysis, and geometry usually emphasize the role of equivariances to various geometrical transformations in aggregation processes. Admittedly, if we consider a unidimensional data fusion task, both requirements are often compatible with each other. Nevertheless, in this paper we show that, in the multidimensional setting, the only idempotent functions that are monotone and orthogonal equivariant are the over-simplistic weighted centroids. Moreover, this result still holds after replacing monotonicity and orthogonal equivariance by the weaker property of orthomonotonicity. This implies that the aforementioned approaches to the aggregation of multidimensional data are irreconcilable, and that, if a weighted centroid is to be avoided, we must choose between monotonicity and a desirable behaviour with regard to orthogonal transformations.

**Keywords.** multidimensional data aggregation, monotonicity, orthogonal equivariance, centroid

8.

Geras A., Siudem G., **Gagolewski M.**,
Should we introduce a dislike button for academic papers?,
*Journal of the Association for Information Science and Technology* **71**(2), 2020, pp. 221-229. doi:10.1002/ASI.24231

**Abstract.**
On the grounds of the revealed, mutual resemblance between the behaviour
of users of Stack Exchange and the dynamics of the citations accumulation
process in the scientific community, we tackled an outwardly
intractable problem of assessing the impact of introducing "negative" citations.

Although the most frequent reason to cite a paper is to highlight the
connection between the two publications, researchers sometimes mention
an earlier work to cast a negative light. While computing citation-based scores,
for instance the h-index, information about the reason why a paper was mentioned
is neglected. Therefore it can be questioned whether these indices describe
scientific achievements accurately.

In this contribution we shed light on the problem of "negative" citations,
analysing data from Stack Exchange and, to draw more universal conclusions,
we derive an approximation of citations scores. Here we show that the quantified
influence of introducing negative citations is of lesser importance and
that they could be used as an indicator of
where the attention of the scientific community is allocated.

**Keywords.** citation analysis, the Hirsch index, negative citations, research evaluation, science of science

9.

Beliakov G., **Gagolewski M.**, James S.,
Robust fitting for the Sugeno integral with respect to general fuzzy measures,
*Information Sciences* **514**, 2020, pp. 449-461. doi:10.1016/j.ins.2019.11.024

**Abstract.**
The Sugeno integral is an expressive aggregation function with potential applications across a range of decision contexts. Its calculation requires only the lattice minimum and maximum operations, making it particularly suited to ordinal data and robust to scale transformations. However, for practical use in data analysis and prediction, we require efficient methods for learning the associated fuzzy measure. While such methods are well developed for the Choquet integral, the fitting problem is more difficult for the Sugeno integral because it is not amenable to being expressed as a linear combination of weights, and more generally due to plateaus and non-differentiability in the objective function. Previous research has hence focused on heuristic approaches or simplified fuzzy measures. Here we show that the problem of fitting the Sugeno integral to data such that the maximum absolute error is minimized can be solved using an efficient bilevel program. This method can be incorporated into algorithms that learn fuzzy measures with the aim of minimizing the median residual. This equips us with tools that make the Sugeno integral a feasible option in robust data regression and analysis. We provide experimental comparison with a genetic algorithms approach and an example in data analysis.

**Keywords.** Sugeno integral, fuzzy measure, parameter learning, aggregation functions

10.

Beliakov G., **Gagolewski M.**, James S.,
Aggregation on ordinal scales with the Sugeno integral for biomedical applications,
*Information Sciences* **501**, 2019, pp. 377-387. doi:10.1016/j.ins.2019.06.023

**Abstract.**
The Sugeno integral is a function particularly suited to the aggregation of ordinal inputs.
Defined with respect to a fuzzy measure, its ability to account for complementary and
redundant relationships between variables brings much potential to the field of biomedicine,
where it is common for measurements and patient information to be expressed qualitatively.
However, practical applications require well-developed methods for identifying the Sugeno integral's
parameters, and this task is not easily expressed using the standard optimisation approaches.
Here we formulate the objective function as the difference of two convex functions, which enables
the use of specialised numerical methods. Such techniques are compared with other global
optimisation frameworks through a number of numerical experiments.

**Keywords.** aggregation functions, fuzzy measures, Sugeno integral, capacities

11.

Coroianu L.,
**Gagolewski M.**, Grzegorzewski P.,
Piecewise linear approximation of fuzzy numbers: algorithms, arithmetic operations and stability of characteristics,
*Soft Computing* **23**(19), 2019, pp. 9491-9505. doi:10.1007/s00500-019-03800-2

**Abstract.**
The problem of the piecewise linear approximation of fuzzy
numbers giving outputs nearest to the inputs with respect to the
Euclidean metric is discussed. The results given in Coroianu et al.
(Fuzzy Sets Syst 233:26–51, 2013) for the 1-knot fuzzy numbers
are generalized for arbitrary n-knot (n≥2) piecewise linear fuzzy numbers.
Some results on the existence and properties of the approximation operator are proved. Then, the stability of some fuzzy number characteristics under approximation as the number of knots tends to infinity is considered. Finally, a simulation study concerning the computer implementations of arithmetic operations on fuzzy numbers is provided. The suggested concepts are illustrated by examples and algorithms ready for practical use. This way, we build a bridge between theory and applications, as the latter are much desired in real-world problems.

**Keywords.** Approximation of fuzzy numbers,
Calculations on fuzzy numbers,
Characteristics of fuzzy numbers,
Fuzzy number,
Piecewise linear approximation

12.

Coroianu L., Fullér R., **Gagolewski M.**, James S.,
Constrained Ordered Weighted averaging aggregation with multiple comonotone constraints,
*Fuzzy Sets and Systems* **395**, 2020, pp. 21-39. doi:10.1016/j.fss.2019.09.006

**Abstract.**
The constrained ordered weighted averaging (OWA) aggregation problem
arises when we aim to maximize or minimize a convex combination of order
statistics under linear inequality constraints that act on the variables with
respect to their original sources. The standalone approach to optimizing
the OWA under constraints is to consider all permutations of the inputs,
which quickly becomes infeasible when there are more than a few variables;
however, in certain cases we can take advantage of the relationships amongst
the constraints and the corresponding solution structures. For example, we
can consider a land-use allocation satisfaction problem with an auxiliary aim
of balancing land-types, whereby the response curves for each species are
non-decreasing with respect to the land-types. This results in comonotone
constraints, which allow us to drastically reduce the complexity of the problem.

In this paper, we show that if we have an arbitrary number of constraints
that are comonotone (i.e., they share the same ordering permutation of the
coefficients), then the optimal solution occurs for decreasing components of
the solution. After investigating the form of the solution in some special cases
and providing theoretical results that shed light on the form of the solution,
we detail practical approaches to solving and give real-world examples.

**Keywords.** Multiple criteria evaluation; Ordered weighted averaging;
Constrained OWA aggregation; Ecology; Work allocation

13.

Lasek J., **Gagolewski M.**,
The efficacy of league formats in ranking teams,
*Statistical Modelling* **18**(5-6), 2018, pp. 411-435. doi:10.1177/1471082X18798426

**Abstract.**
The efficacy of different league formats in ranking teams according to their true latent strength is analysed.
To this end, a new approach for estimating attacking and defensive strengths based on Poisson regression for modelling match outcomes is proposed. Various performance metrics reflecting the agreement between teams' latent strength parameters and their final rank in the league table are estimated. The tournament designs studied here are used
in the majority of European top-tier association football competitions.
Based on numerical experiments, it turns out that a two-stage league format
comprising a triple round-robin tournament together with an extra single
round-robin is the most efficacious setting.
In particular, it is the most accurate in selecting the best team as the winner of the league.
Its efficacy can be enhanced by setting the number of points allocated for a win to two
(instead of the three currently in effect in association football).

**Keywords.** association football, league formats, rankings, rating systems, simulation, tournament design

14.

Beliakov G., **Gagolewski M.**,
James S., Pace S., Pastorello N., Thilliez E., Vasa R.,
Measuring traffic congestion: An approach based on learning weighted inequality, spread and aggregation indices from comparison data,
*Applied Soft Computing* **67**, 2018, pp. 910-919. doi:10.1016/j.asoc.2017.07.014

**Abstract.**
As cities increase in size, governments and councils face the problem of
designing infrastructure and approaches to traffic management that alleviate
congestion. The problem of objectively measuring congestion involves taking
into account not only the volume of traffic moving throughout a network, but
also the inequality or spread of this traffic over major and minor intersections.
For modelling such data, we investigate the use of weighted congestion indices
based on various aggregation and spread functions. We formulate the weight
learning problem for comparison data and use real traffic data obtained from
a medium-sized Australian city to evaluate their usefulness.

**Keywords.** aggregation functions, inequality indices, spread measures,
learning weights, traffic analysis

15.

**Abstract.**
Research in aggregation theory is nowadays still mostly focused on algorithms
to summarize tuples consisting of observations in some real interval
or of diverse general ordered structures. Of course, in the practice
of information processing, many other data types between these
two extreme cases are worth inspecting. This contribution deals with
the aggregation of lists of data points in **R**^{d} for arbitrary d≥1.
Even though particular functions aiming to summarize multidimensional data
have been discussed by researchers in data analysis,
computational statistics and geometry, there is clearly a need to provide
a comprehensive and unified model in which their properties
like equivariances to geometric transformations, internality, and monotonicity
may be studied at an appropriate level of generality.
The proposed penalty-based approach
serves as a common framework for all idempotent information aggregation
methods, including componentwise functions,
pairwise distance minimizers, and data depth-based medians. It also
allows for deriving many new practically useful tools.

**Keywords.** multidimensional data aggregation, penalty functions, data depth, centroid, median

16.

**Abstract.**
The time needed to apply a hierarchical clustering algorithm is most often
dominated by the number of computations of a pairwise dissimilarity measure.
Such a constraint, for larger data sets, puts at a disadvantage the use of
all the classical linkage criteria but the single linkage one. However, it
is known that the single linkage clustering algorithm is very sensitive to
outliers, produces highly skewed dendrograms, and therefore usually does not
reflect the true underlying data structure – unless the clusters are well-separated.
To overcome its limitations, we propose a new hierarchical clustering linkage
criterion called Genie. Namely, our algorithm links two clusters in such a
way that a chosen economic inequity measure (e.g., the Gini- or Bonferroni-index)
of the cluster sizes does not increase drastically above a given threshold.
The presented benchmarks indicate a high practical usefulness of the introduced
method: it most often outperforms the Ward or average linkage in terms of the
clustering quality while retaining the single linkage speed. The Genie
algorithm is easily parallelizable and thus may be run on multiple threads
to speed up its execution further. Its memory overhead is small: there
is no need to precompute the complete distance matrix to perform the
computations in order to obtain a desired clustering. It can be applied
on arbitrary spaces equipped with a dissimilarity measure, e.g., on real
vectors, DNA or protein sequences, images, rankings, informetric data,
etc. A reference implementation of the algorithm has been included
in the open source `genie` package for `R`.

**Keywords.** hierarchical clustering, single linkage, inequity measures, Gini-index
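
The inequity check on cluster sizes described in the abstract above can be sketched as follows. This is a minimal Python illustration, assuming one common normalised form of the Gini index (the sum of absolute pairwise differences divided by (n-1) times the total); it covers only the inequity measure and the threshold test, not the full clustering algorithm. The names `gini_index` and `exceeds_threshold` and the sample threshold are hypothetical.

```python
def gini_index(sizes):
    """Normalised Gini index of a list of cluster sizes.

    Returns 0.0 when all clusters have equal size and 1.0 when a
    single cluster holds everything (the other sizes being zero).
    """
    n = len(sizes)
    total = sum(sizes)
    if n < 2 or total == 0:
        return 0.0
    pairwise = sum(
        abs(sizes[i] - sizes[j])
        for i in range(n)
        for j in range(i + 1, n)
    )
    return pairwise / ((n - 1) * total)


def exceeds_threshold(sizes, threshold):
    """True when the cluster size distribution is more unequal than allowed."""
    return gini_index(sizes) > threshold


print(gini_index([5, 5, 5, 5]))                # perfectly balanced clusters
print(exceeds_threshold([1, 1, 1, 97], 0.3))   # one dominant cluster
```

A linkage rule of this kind would consult such a check before each merge and, when the threshold is exceeded, restrict which clusters may be merged next; how exactly this is done is specified in the paper, not in this sketch.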

17.

Beliakov G., **Gagolewski M.**, James S.,
Penalty-based and other representations of economic inequality,
*International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems*
**24**(Suppl. 1), 2016, pp. 1-23. doi:10.1142/S0218488516400018

**Abstract.**
Economic inequality measures are employed as a key component in various
socio-demographic indices to capture the disparity between the wealthy and poor.
Since their inception, they have also been used as a basis for modelling
spread and disparity in other contexts. While recent research has identified
that a number of classical inequality and welfare functions can be considered
in the framework of OWA operators, here we propose a framework of
penalty-based aggregation functions and their associated penalties as
measures of inequality.

**Keywords.** penalty functions, aggregation functions, inequality indices, spread measures

18.

Mesiar R., **Gagolewski M.**,
H-index and other Sugeno integrals: Some defects and their compensation,
*IEEE Transactions on Fuzzy Systems* **24**(6), 2016, pp. 1668-1672. doi:10.1109/TFUZZ.2016.2516579

**Abstract.**
The famous Hirsch index was introduced just ca. 10 years ago.
Despite that, it is already widely used in many decision making
tasks, such as the evaluation of individual scientists, research
grant allocation, or even production planning.
It is known that the h-index is related to the discrete
Sugeno integral and the Ky Fan metric introduced in the 1940s.
The aim of this paper is to propose a few modifications of this index
as well as other fuzzy integrals – also on bounded chains – that lead
to better discrimination of some types of data that are to be aggregated.
All of the suggested compensation methods try to retain the simplicity
of the original measure.

**Keywords.** h-index, Sugeno integral, Ky Fan metric, Shilkret integral, decomposition integrals
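
The relation between the h-index and the discrete Sugeno integral mentioned above can be illustrated by the max-min form of the index: for citation counts sorted non-increasingly, the h-index equals the maximum over i of min(i, x_(i)). The Python sketch below assumes non-negative integer citation counts; the function name `h_index_maxmin` is hypothetical, and none of the modifications proposed in the paper are implemented here.

```python
def h_index_maxmin(citations):
    """h-index computed as max over i of min(i, i-th largest citation count).

    Assumption: citation counts are non-negative integers.
    This is the max-min (Sugeno-type) form with the counting measure.
    """
    ordered = sorted(citations, reverse=True)
    return max(
        (min(i, c) for i, c in enumerate(ordered, start=1)),
        default=0,  # empty citation record
    )


print(h_index_maxmin([3, 10, 4, 8, 5]))
```

Because only `min` and `max` are used, the value depends on the order of the inputs rather than on their arithmetic, which is what makes such functions natural on ordinal scales.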

19.

Lasek J., Szlavik Z., **Gagolewski M.**, Bhulai S.,
How to improve a team's position in the FIFA ranking – A simulation study, *Journal of Applied Statistics*
**43**(7), 2016, pp. 1349-1368. doi:10.1080/02664763.2015.1100593

**Abstract.** In this paper, we study the efficacy of the official ranking for international football teams compiled by FIFA,
the body governing football competition around the globe. We present strategies for improving a team's position in the ranking.
By combining several statistical techniques we derive an objective function in a decision problem of optimal
scheduling of future matches. The presented results display how a team's position can be improved.
Along the way, we compare the official procedure to the famous Elo rating system. Although it originates
from chess, it has been successfully tailored to ranking football teams as well.

**Keywords.** association football, FIFA ranking, prediction models, Monte Carlo simulations, optimal schedule, team rankings

20.

Żogała-Siudem B., Siudem G.,
Cena A., **Gagolewski M.**,
Agent-based model for the h-index – Exact solution,
*European Physical Journal B* **89**:21, 2016.
doi:10.1140/epjb/e2015-60757-1

**Abstract.**
Hirsch’s h-index is perhaps the most popular citation-based measure of
scientific excellence. In 2013, Ionescu and Chopard proposed an agent-based
model describing a process for generating publications and citations in an
abstract scientific community [G. Ionescu, B. Chopard, Eur. Phys. J. B 86,
426 (2013)]. Within such a framework, one may simulate a scientist’s activity,
and – by extension – investigate the whole community of researchers.
Even though the Ionescu and Chopard model predicts the h-index quite well,
the authors provided a solution based solely on simulations. In this paper,
we complete their results with exact, analytic formulas. What is more, by
considering a simplified version of the Ionescu-Chopard model, we obtained
a compact, easy to compute formula for the h-index. The derived approximate
and exact solutions are investigated on simulated and real-world data sets.

**Keywords.** Statistical and nonlinear physics, preferential attachment rule, h-index

21.

Cena A., **Gagolewski M.**, Mesiar R.,
Problems and challenges of information resources producers' clustering, *Journal of Informetrics* **9**(2),
2015, pp. 273–284. doi:10.1016/j.joi.2015.02.005

**Abstract.** Classically, unsupervised machine learning techniques are applied
to data sets with a fixed number of attributes (variables).
However, many problems encountered in the field of informetrics
require extending such methods so that they may
be computed over a set of nonincreasingly ordered vectors of unequal lengths.
Thus, in this paper, some new dissimilarity measures (metrics)
are introduced and studied.
Thanks to them, we may use, among others, hierarchical clustering algorithms
in order to determine an input data set's partition
consisting of sets of producers that are homogeneous not only with respect to
the quality of information resources, but also their quantity.

**Keywords.** aggregation, hierarchical clustering, distance, metric

22.

**Abstract.** The theory of aggregation most often deals with measures of central tendency.
However, sometimes a very different kind of a numeric vector's synthesis into a
single number is required. In this paper we introduce a class of mathematical functions
which aim to measure spread or scatter of one-dimensional quantitative data.
The proposed definition serves as a common, abstract framework for measures of
absolute spread known from statistics, exploratory data analysis and data mining,
e.g. the sample variance, standard deviation, range, interquartile range (IQR),
median absolute deviation (MAD), etc. Additionally, we develop new measures
of experts' opinions diversity or consensus in group decision making problems.
We investigate some properties of spread measures, show how they are related to
aggregation functions, and indicate their new potentially fruitful application areas.

**Keywords.** Group decisions and negotiations, aggregation, spread, deviation, variance
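
The classical measures of absolute spread listed in the abstract can be computed with the Python standard library, as in the sketch below. The function name `spread_summary`, the unscaled form of the MAD, and the "inclusive" quartile convention are choices made for this illustration only; the paper's abstract framework is not reproduced here.

```python
import statistics


def spread_summary(x):
    """Return several classical measures of absolute spread for a sample x."""
    q1, _, q3 = statistics.quantiles(x, n=4, method="inclusive")
    med = statistics.median(x)
    return {
        "variance": statistics.variance(x),  # sample variance (n - 1 denominator)
        "sd": statistics.stdev(x),           # sample standard deviation
        "range": max(x) - min(x),
        "iqr": q3 - q1,                      # interquartile range
        "mad": statistics.median([abs(v - med) for v in x]),  # unscaled MAD
    }
```

Unlike measures of central tendency, all of these are unaffected by adding the same constant to every observation, and all equal zero for a constant vector.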

23.

Cena A., **Gagolewski M.**,
OM3: Ordered maxitive, minitive, and modular aggregation operators
– axiomatic and probabilistic properties in an arity-monotonic setting, *Fuzzy Sets and Systems* **264**,
2015, pp. 138-159. doi:10.1016/j.fss.2014.04.001

**Abstract.** The recently-introduced OM3 aggregation operators fulfill three
appealing properties: they are simultaneously minitive, maxitive, and modular.
Among the instances of OM3 operators we find e.g. OWMax and OWMin operators,
the famous Hirsch's h-index and all its natural generalizations.

In this paper the basic axiomatic and probabilistic properties
of extended, i.e. in an arity-dependent setting,
OM3 aggregation operators are studied.
We illustrate the difficulties one is inevitably faced with when
trying to combine the quality and quantity of numeric items
into a single number. The discussion on such aggregation methods
is particularly important in the information resources producers assessment problem,
which aims to reduce the negative effects of information overload.
It turns out that the Hirsch-like indices of impact
do not fulfill a set of very important properties, which puts the sensibility of their
practical usage into question.
Moreover, thanks to the probabilistic analysis of the operators in an i.i.d. model,
we may better understand the relationship between the aggregated items' quality and
their producers' productivity.

**Keywords.** Aggregation; ordered modularity, maxitivity and minitivity;
arity-monotonicity; impact assessment; Hirsch's h-index; informetrics

24.

**Abstract.** The Choquet, Sugeno, and Shilkret integrals
with respect to monotone measures,
as well as their generalization
– the universal integral, constitute a useful tool in decision support systems.
In this paper we propose a general construction method for aggregation
operators that may be used in assessing the output of scientists.
We show that the most often currently used indices of bibliometric impact,
like Hirsch's h, Woeginger's w, Egghe's g, Kosmulski's MAXPROD,
and similar constructions, may be obtained by means of our framework.
Moreover, the model easily leads to some new, very interesting functions.

**Keywords.** Choquet, Sugeno, Shilkret, universal integral;
monotone measures;
aggregation;
indices of scientific impact,
bibliometrics;
h-index, w-index, g-index, MAXPROD-index
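
As a point of reference, two of the indices named in the abstract can be computed directly from a citation record. The sketch below uses their commonly cited elementary definitions (with the g-index capped at the number of papers), not the universal-integral construction proposed in the paper; the function names are illustrative.

```python
from itertools import accumulate


def g_index(citations):
    """Largest g <= n such that the g most cited papers total at least g**2 citations."""
    ordered = sorted(citations, reverse=True)
    g = 0
    for i, total in enumerate(accumulate(ordered), start=1):
        if total >= i * i:
            g = i
    return g


def maxprod_index(citations):
    """Maximum over i of i * x_(i), where x_(i) is the i-th largest citation count."""
    ordered = sorted(citations, reverse=True)
    return max((i * c for i, c in enumerate(ordered, start=1)), default=0)
```

For the record `[10, 8, 5, 4, 3]`, the cumulative sums are 10, 18, 23, 27, 30, so the g-index is 5, while the products `i * x_(i)` are 10, 16, 15, 16, 15, so MAXPROD is 16.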

25.

Coroianu L.,
**Gagolewski M.**, Grzegorzewski P.,
Nearest piecewise linear approximation of fuzzy numbers,
*Fuzzy Sets and Systems* **233**, 2013, pp. 26-51. doi:10.1016/j.fss.2013.02.005

**Abstract.** The problem of the nearest approximation of fuzzy numbers
by piecewise linear 1-knot fuzzy numbers is discussed. By using 1-knot
fuzzy numbers one may obtain approximations which are simple enough yet
flexible enough to reconstruct the input fuzzy concepts under study. They might
also be perceived as a generalization of the trapezoidal approximations.
Moreover, these approximations possess some desirable properties.
Apart from theoretical considerations, approximation algorithms
that can be applied in practice are also given.

**Keywords.** Approximation of fuzzy numbers; Fuzzy number; Piecewise linear approximation

26.

**Abstract.** In this paper we deal with the problem of
aggregating numeric sequences of arbitrary length that represent
e.g. citation records of scientists. Impact functions are the aggregation operators that express as a
single number not only the quality of individual publications, but also their author's productivity.

We examine some fundamental properties of these aggregation tools. It turns out that each impact
function which always gives indisputable valuations must necessarily be trivial.
Moreover, it is shown that for any set of citation records in which none is dominated by the other, we
may construct an impact function that gives any a priori-established authors' ordering. Theoretically
then, there is considerable room for manipulation in the hands of decision makers.

We also discuss the differences between the impact function-based and the multicriteria decision
making-based approach to scientific quality management, and study how the introduction of new
properties of impact functions affects the assessment process. We argue that simple mathematical
tools like the h- or g-index (as well as other bibliometric impact indices) may not necessarily
be a good choice when it comes to assessing scientific achievements.

**Keywords.** Impact functions;
aggregation;
decision making;
reference modeling;
Hirsch's h-index;
scientometrics;
bibliometrics

27.

**Abstract.** In this paper the relationship between symmetric minitive,
maxitive, and modular aggregation operators is considered. It is shown
that the intersection between any two of the three discussed classes
is the same. Moreover, the intersection is explicitly characterized.

It turns out that the intersection contains families of aggregation
operators such as OWMax, OWMin, and many generalizations of the
widely-known Hirsch’s h-index, often applied in scientific quality control.

**Keywords.** Aggregation operators; OWMax; OMA; OWA; Hirsch’s h-index;
Scientometrics

**Comments.** Later we proposed that the symmetric minitive,
maxitive, and modular aggregation operators may be called the OM3 agops,
see (Cena A., Gagolewski M.,
*OM3: ordered maxitive, minitive, and modular
aggregation operators – Part I: Axiomatic analysis under arity-dependence*, 2013).

28.

**Abstract.** The process of assessing individual authors should
rely upon a proper aggregation of reliable and valid papers’ quality
metrics. Citations are merely one possible way to measure appreciation
of publications. In this study we propose some new, SJR- and SNIP-based
indicators, which not only take into account the broadly conceived
popularity of a paper (manifested by the number of citations),
but also other factors like its potential, or the quality of papers
that cite a given publication. We explore the relation and correlation
between different metrics and study how they affect the values of
a real-valued generalized h-index calculated for 11 prominent
scientometricians. We note that the h-index is a very unstable
impact function, highly sensitive to the scaling of input elements.
Our analysis is not only of theoretical significance: data scaling
is often performed to normalize citations across disciplines.
Uncontrolled application of this operation may lead to unfair and
biased (toward some groups) decisions. This puts the validity of
authors assessment and ranking using the h-index into question.
Obviously, a good impact function to be used in practice
should not be as sensitive to changes in the input
data as the analyzed one.

**Keywords.** Aggregation operators; Impact functions; Hirsch's h-index;
Quality control; Scientometrics; Bibliometrics; SJR;
SNIP; Scopus; CITAN; R

**Comments.** An empirical paper. The ideas presented here
were later explored more thoroughly in (Cena A., Gagolewski M.,
*OM3: ordered maxitive, minitive, and modular
aggregation operators – Part II: A simulation study*, 2013).

29.

**Abstract.** In this paper CITAN, the CITation ANalysis package for
the R statistical computing environment, is introduced. The main aim of the
software is to support bibliometricians with a tool for preprocessing
and cleaning bibliographic data retrieved from SciVerse Scopus and
for calculating the most popular indices of scientific impact.

To show the practical usability of the package, an exemplary assessment
of authors publishing in the fields of scientometrics and
webometrics is performed.

**Keywords.** Data analysis software; Quality control in science;
Citation analysis; Bibliometrics; Hirsch's h index;
Egghe's g index; SciVerse Scopus

30.

**Abstract.** A class of arity-monotonic aggregation operators,
called impact functions, is proposed. This family of operators forms
a theoretical framework for the so-called Producer Assessment Problem,
which includes the scientometric task of fair and objective assessment
of scientists using the number of citations received by their publications.

The impact function output values are analyzed under right-censored
and dynamically changing input data. The qualitative possibilistic
approach is used to describe this kind of uncertainty.
It leads to intuitive graphical interpretations and may
be easily applied for practical purposes.

The discourse is illustrated by a family of aggregation operators
generalizing the well-known Ordered Weighted Maximum (OWMax)
and the Hirsch h-index.

**Keywords.** Aggregation operators; Possibility theory; S-statistics; h-index; OWMax

**Comments.** In this paper the class of effort-dominating impact functions
has also been introduced. I have shown later (see Gagolewski M.,
*On the Relation Between Effort-Dominating and Symmetric Minitive Aggregation Operators*, 2012)
that all such aggregation operators are symmetric minitive.

31.

**Abstract.** Two broad classes of scientific impact indices
are proposed and their properties – both theoretical and practical –
are discussed. These new classes were obtained as a geometric
generalization of the well-known tools applied in scientometrics,
like Hirsch’s h-index, Woeginger’s w-index, and Kosmulski’s MAXPROD.
It is shown how to apply the suggested indices for estimation of
the shape of the citation function or the total number of citations
of an individual. Additionally, a new efficient and simple O(log n)
algorithm for computing the h-index is given.

**Keywords.** Hirsch's h-index, citation analysis, scientific impact indices

**Comments.** I have shown later (see Gagolewski M.,
*On the Relation Between Effort-Dominating and Symmetric Minitive Aggregation Operators*, 2012)
that the r_{p}-indices are symmetric minitive.
Moreover, we have found that there exists an O(n log n) algorithm
for determining l_{p} (see Gagolewski M., Dębski M., Nowakiewicz M.,
*Efficient Algorithm for Computing Certain Graph-Based Monotone Integrals: the l_{p}-Indices*, 2013).
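
To illustrate why an O(log n) computation of the h-index is possible at all, here is a minimal binary-search sketch. It assumes the citation counts are already sorted nonincreasingly and is not claimed to be the algorithm given in the paper; the function name is hypothetical.

```python
def h_index_sorted(citations):
    """h-index of a citation record sorted in nonincreasing order.

    The predicate citations[i - 1] >= i holds for all i <= h and fails for
    all i > h, so the largest i satisfying it can be found by bisection.
    """
    lo, hi = 0, len(citations)
    while lo < hi:
        mid = (lo + hi + 1) // 2        # candidate value of h
        if citations[mid - 1] >= mid:   # the mid-th paper has at least mid citations
            lo = mid
        else:
            hi = mid - 1
    return lo
```

If the input is not sorted, sorting it first dominates the cost at O(n log n).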

32.

Rowiński T., **Gagolewski M.**,
Preferencje i postawy wobec pomocy online,
*Studia Psychologica UKSW* **7**, 2007, pp. 195-210.

1.

Coroianu L., **Gagolewski M.**,
*Penalty-based data aggregation in real normed vector spaces*,
In: Halaš R. et al. (Eds.), *New Trends in Aggregation Theory*
(*Advances in Intelligent Systems and Computing* **981**),
Springer, 2019, pp. 160-171. doi:10.1007/978-3-030-19494-9_15

**Abstract.** The problem of penalty-based data aggregation in generic real normed vector
spaces is studied. Some existence and uniqueness results are indicated.
Moreover, various properties of the aggregation functions are considered.

**Keywords.** penalty-based aggregation, prototype learning, means, averages, and medians, vector spaces, Fermat-Weber problem

2.

Beliakov G., **Gagolewski M.**, James S.,
*Least median of squares (LMS) and least trimmed squares (LTS) fitting for the weighted arithmetic mean*,
In: Medina J. et al. (Eds.), *Information Processing and Management of Uncertainty in Knowledge-Based Systems. Theory and Foundations*
(*Communications in Computer and Information Science* **854**),
Springer, 2018, pp. 367-378. doi:10.1007/978-3-319-91476-3_31

**Abstract.** We look at different approaches to learning the weights of the
weighted arithmetic mean such that the median residual or sum of the
smallest half of squared residuals is minimized. The more general problem
of multivariate regression has been well studied in the statistical literature;
however, in the case of aggregation functions, there is a restriction on
the weights and the domain is usually restricted so that ‘outliers’ may
not be arbitrarily large. A number of algorithms are compared in terms
of accuracy and speed. Our results can be extended to other aggregation
functions.

**Keywords.** aggregation, LMS fitting, LTS fitting, approximation

3.

**Abstract.** The Sugeno integral has numerous successful applications,
including but not limited to the areas of decision making, preference modeling,
and bibliometrics. Despite this, the current state of the development of usable
algorithms for numerically fitting the underlying discrete fuzzy measure based
on a sample of prototypical values – even in the simplest possible case, i.e.,
assuming the symmetry of the capacity – is yet to reach a satisfactory level.
Thus, the aim of this paper is to present some results and observations
concerning this class of data approximation problems.

**Keywords.** Sugeno integral, aggregation functions, machine learning, regression, approximation

4.

Bartoszuk M., **Gagolewski M.**,
*Binary aggregation functions in software plagiarism detection*,
In: *Proc. FUZZ-IEEE'17*, IEEE, 2017, no. 8015582. doi:10.1109/FUZZ-IEEE.2017.8015582

**Abstract.** Supervised learning is of key interest in data science.
Even though there exist many approaches to solving, among others,
classification as well as ordinal and standard regression tasks,
most of them output models that do not possess useful formal properties,
like nondecreasingness in each independent variable, idempotence,
symmetry, etc. This makes them difficult to interpret and analyze.
For instance, it might be impossible to determine the importances of
individual features or to assess the effects of increasing the values
of predictors on the behavior of a chosen response variable. Such
properties are especially important in software plagiarism detection,
where we are faced with the combination of the degree to which
a code chunk A is similar to (or contained in) B and the degree to which
B is similar to A. Therefore, in this paper we consider a new method
for fitting B-spline tensor product-based aggregation functions to
empirical data. An empirical study indicates a highly competitive
performance of the resulting models. Additionally, they possess an
intuitive interpretation which is highly desirable for end-users.

5.

Cena A., **Gagolewski M.**,
*OWA-based linkage and the Genie correction for hierarchical clustering*,
In: *Proc. FUZZ-IEEE'17*, IEEE, 2017, no. 8015652. doi:10.1109/FUZZ-IEEE.2017.8015652

**Abstract.** In this paper we thoroughly investigate various OWA-based
linkages in hierarchical clustering on numerous benchmark data sets.
The inspected setting generalizes the well-known single, complete,
and average linkage schemes, among others. The incorporation of
weights into the cluster merge procedure creates an opportunity
to make use of experts' knowledge about a particular data domain
so as to generate partitions of a given data set that better
reflect the true underlying cluster structure. Moreover, we
introduce a correction for the inequality of cluster size distribution
— similar to the one proposed in our recently introduced Genie algorithm
— which results in a significant performance boost in terms of clustering quality.
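
The way an OWA-based linkage generalizes the single, complete, and average schemes can be sketched as follows. This is an illustrative toy, assuming one-dimensional points and an absolute-difference distance by default; the function names and weight generators are hypothetical, and the Genie correction is not included.

```python
def owa(values, weights):
    """Ordered weighted average: weights are applied to values sorted decreasingly."""
    ordered = sorted(values, reverse=True)
    return sum(w * v for w, v in zip(weights, ordered))


def single_weights(m):
    return [0.0] * (m - 1) + [1.0]   # all weight on the smallest distance


def complete_weights(m):
    return [1.0] + [0.0] * (m - 1)   # all weight on the largest distance


def average_weights(m):
    return [1.0 / m] * m             # uniform weights


def owa_linkage(cluster_a, cluster_b, weights_fn, dist=lambda a, b: abs(a - b)):
    """Distance between two clusters as an OWA of all pairwise point distances."""
    pairwise = [dist(a, b) for a in cluster_a for b in cluster_b]
    return owa(pairwise, weights_fn(len(pairwise)))
```

For clusters `[0, 1]` and `[3, 5]` the pairwise distances are 3, 5, 2, 4, so the three weight choices return 2 (single), 5 (complete), and 3.5 (average); other weighting vectors interpolate between these extremes.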

6.

**Abstract.**
The paper discusses a generalization of the nearest centroid hierarchical
clustering algorithm. A first extension deals with the incorporation
of generic distance-based penalty minimizers instead of the classical
aggregation by means of centroids. Thanks to this, the presented algorithm can be applied
in spaces equipped with an arbitrary dissimilarity measure (images, DNA sequences, etc.).
Secondly, a correction preventing the formation
of clusters of too highly unbalanced sizes is applied: just like in the
recently introduced *Genie* approach, which extends the single linkage scheme,
the new method prevents a chosen inequity measure (e.g., the Gini-, de Vergottini-,
or Bonferroni-index) of cluster sizes from rising above a predefined
threshold. Numerous benchmarks indicate that the introduction of such
a correction increases the quality of the resulting clusterings.

**Keywords.** hierarchical clustering, aggregation, centroid, Gini-index, Genie algorithm
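
The inequity of cluster sizes that the correction monitors can be quantified, for example, with a normalized Gini index. The sketch below uses one common normalization (so that the value is 1 when a single cluster holds all points); it is an assumption made for illustration and does not implement the thresholding rule of the clustering algorithm itself.

```python
def gini_index(sizes):
    """Normalized Gini index of nonnegative cluster sizes, a value in [0, 1]."""
    n = len(sizes)
    total = sum(sizes)
    if n < 2 or total == 0:
        return 0.0  # degenerate cases: nothing to compare
    # Sum of absolute differences over all unordered pairs of clusters.
    pairwise = sum(
        abs(sizes[i] - sizes[j]) for i in range(n) for j in range(i + 1, n)
    )
    return pairwise / ((n - 1) * total)
```

Perfectly balanced sizes such as `[5, 5, 5]` give 0, whereas `[1, 1, 10]` gives 0.75, signalling a strongly unbalanced partition.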

7.

Bartoszuk M., Beliakov G., **Gagolewski M.**, James S.,
*Fitting aggregation functions to data: Part I – Linearization and regularization*,
In: Carvalho J.P. et al. (Eds.),
*Information Processing and Management of Uncertainty in Knowledge-Based Systems*,
Part II (*Communications in Computer and Information Science* **611**),
Springer, 2016, pp. 767-779. doi:10.1007/978-3-319-40581-0_62

**Abstract.**
The use of supervised learning techniques for fitting weights and/or generator
functions of weighted quasi-arithmetic means – a special class of idempotent
and nondecreasing aggregation functions – to empirical data has already been
considered in a number of papers.
Nevertheless, there are still some important issues that have not been
discussed in the literature yet. In the first part of this two-part contribution
we deal with the concept of regularization, a quite standard technique from machine learning
applied so as to increase the fit quality on test and validation data samples.
Due to the constraints on the weighting vector,
it turns out that quite different methods can be used in the current framework, as
compared to regression models.
Moreover, it is worth noting that so far fitting weighted
quasi-arithmetic means to empirical data has only been performed
approximately, via the so-called linearization technique.
In this paper we consider exact solutions to such special optimization tasks
and indicate cases where linearization leads to much worse solutions.

**Keywords.** Aggregation functions, weighted quasi-arithmetic means,
least squares fitting, regularization, linearization
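
For orientation, a weighted quasi-arithmetic mean with generator φ and weights w is φ⁻¹(Σ wᵢ φ(xᵢ)). The sketch below only evaluates such a mean for given weights; it does not perform the fitting, regularization, or linearization discussed in the paper, and the function name and argument layout are hypothetical.

```python
import math


def weighted_quasi_arithmetic_mean(x, w, phi, phi_inv):
    """Evaluate phi_inv(sum_i w[i] * phi(x[i])) for nonnegative weights summing to 1."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    if any(wi < 0 for wi in w) or not math.isclose(sum(w), 1.0):
        raise ValueError("weights must be nonnegative and sum to 1")
    return phi_inv(sum(wi * phi(xi) for wi, xi in zip(w, x)))
```

With the identity generator this reduces to the weighted arithmetic mean; with `math.log` and `math.exp` it yields the weighted geometric mean (for positive inputs).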

8.

Bartoszuk M., Beliakov G., **Gagolewski M.**, James S.,
*Fitting aggregation functions to data: Part II – Idempotization*,
In: Carvalho J.P. et al. (Eds.),
*Information Processing and Management of Uncertainty in Knowledge-Based Systems*,
Part II (*Communications in Computer and Information Science* **611**),
Springer, 2016, pp. 780-789. doi:10.1007/978-3-319-40581-0_63

**Abstract.**
The use of supervised learning techniques for fitting weights and/or generator
functions of weighted quasi-arithmetic means – a special class of idempotent
and nondecreasing aggregation functions – to empirical data has already been
considered in a number of papers.
Nevertheless, there are still some important issues that have not been
discussed in the literature yet. In the second part of this two-part contribution
we deal with a quite common situation in which we have inputs coming from
different sources, describing a similar phenomenon, but which
have not been properly normalized. In such a case,
idempotent and nondecreasing functions cannot be used to aggregate them
unless proper pre-processing is performed.
The proposed idempotization method, based on the notion of B-splines,
allows for an automatic calibration of independent variables.
The introduced technique is applied in an R source code plagiarism
detection system.

**Keywords.** Aggregation functions, weighted quasi-arithmetic means,
least squares fitting, idempotence

9.

Cena A., **Gagolewski M.**,
*Fuzzy k-minpen clustering and k-nearest-minpen classification procedures incorporating generic distance-based penalty minimizers*,
In: Carvalho J.P. et al. (Eds.),
*Information Processing and Management of Uncertainty in Knowledge-Based Systems*,
Part II (*Communications in Computer and Information Science* **611**),
Springer, 2016, pp. 445-456. doi:10.1007/978-3-319-40581-0_36

**Abstract.**
We discuss a generalization of the fuzzy (weighted) k-means clustering procedure
and point out its relationships with data aggregation in spaces equipped with
arbitrary dissimilarity measures. In the proposed setting, a
data set partitioning is performed based on the notion of points' proximity to generic
distance-based penalty minimizers. Moreover, a new data classification algorithm,
resembling the k-nearest neighbors scheme but less computationally and memory
demanding, is introduced. Rich examples in complex data domains
indicate the usability of the methods and aggregation theory in general.

**Keywords.** fuzzy k-means algorithm, clustering, classification, fusion functions, penalty minimizers

10.

Lasek J., **Gagolewski M.**,
*The winning solution to the AAIA'15 Data Mining Competition: Tagging firefighter activities at a fire scene*,
In:
Ganzha M., Maciaszek L., Paprzycki M. (Eds.),
*Proc. FedCSIS'15*, IEEE, 2015, pp. 375-380. doi:10.15439/2015F418

**Abstract.** Multi-sensor based classification of professionals' activities
plays a key role in ensuring that their goals are achieved. In this paper
we present the winning solution to the *AAIA'15 Tagging Firefighter
Activities at a Fire Scene* data mining competition. The approach
is based on a Random Forest classifier trained on an input data set with
almost 5000 features describing the underlying time series of sensory data.

**Keywords.** Activity tagging, movement tagging, data mining competition, Random Forest model, FFT

11.

Cena A., **Gagolewski M.**,
*A K-means-like algorithm for informetric data clustering*,
In: Alonso J.M., Bustince H., Reformat M. (Eds.),
*Proc. IFSA/EUSFLAT 2015*,
Atlantis Press, 2015, pp. 536-543. doi:10.2991/ifsa-eusflat-15.2015.77

**Abstract.** The K-means algorithm is one of the most often used clustering techniques.
However, when it comes to discovering clusters in informetric data sets
that consist of non-increasingly ordered vectors of not necessarily conforming
lengths, such a method cannot be applied directly.
Hence, in this paper, we propose a K-means-like algorithm
to determine groups of producers that are similar
not only with respect to the quality of information resources they output,
but also their quantity.

**Keywords.** k-means clustering, informetrics, aggregation, impact functions

12.

Bartoszuk M., **Gagolewski M.**,
*Detecting similarity of R functions via a fusion of multiple heuristic methods*,
In: Alonso J.M., Bustince H., Reformat M. (Eds.),
*Proc. IFSA/EUSFLAT 2015*,
Atlantis Press, 2015, pp. 419-426. doi:10.2991/ifsa-eusflat-15.2015.61

**Abstract.** In this paper we describe recent advances in our R code similarity detection algorithm.
We propose a modification of the Program Dependence Graph (PDG) procedure used in the GPLAG system
that better fits the nature of functional programming languages like R.
The major strength of our approach lies in a proper
aggregation of outputs of multiple plagiarism detection methods,
as it is well known that no single technique gives perfect results.
It turns out that the incorporation of the PDG algorithm
significantly improves the recall ratio, i.e. it is better
in indicating true positive cases of plagiarism or code
cloning patterns. The implemented system is available
as a web application at http://SimilaR.Rexamine.com/.

**Keywords.** R, plagiarism and code cloning detection,
fuzzy proximity relations, aggregation,
program dependence graph, t-norms

13.

**Abstract.** In the field of informetrics, agents are often represented
by numeric sequences of not necessarily conforming lengths.
There are numerous aggregation techniques of such sequences,
e.g., the g-index, the h-index, that may be used to compare the output
of pairs of agents. In this paper we address the question of whether such impact
indices may be used to model experts' preferences accurately.

**Keywords.** preference learning, fuzzy relations, informetrics, aggregation, h-index

14.

**Abstract.** Aggregation theory often deals with measures of central tendency of quantitative data.
As sometimes a different kind of information fusion is needed,
an axiomatization of spread measures was introduced recently. In this contribution
we explore the properties of WD_{p}WAM and WD_{p}OWA operators,
which are defined as weighted L_{p}-distances to weighted
arithmetic mean and OWA operators, respectively.
In particular, we give forms of vectors that maximize
such fusion functions and thus provide a way to normalize the output value
so that the vector of maximal spread always leads to a fixed outcome, e.g., 1
if all the input elements are in [0,1].
This might be desirable when constructing measures of experts' opinions consistency or diversity
in group decision making problems.

**Keywords.** data fusion, aggregation, spread, deviation, variance, OWA operators

15.

Cena A., **Gagolewski M.**,
*Aggregation and soft clustering of informetric data*,
In: Baczyński M., De Baets B., Mesiar R. (Eds.),
*Proc. 8th International Summer School on Aggregation Operators (AGOP 2015)*,
University of Silesia, 2015, pp. 79-84. isbn:978-83-8012-519-3

**Abstract.** The aim of this contribution is to inspect possible
applications of clustering techniques
computed over a set consisting of nonincreasingly ordered vectors
of possibly nonconforming lengths. Such data sets appear in the field of
informetrics, where one may need to evaluate the quality of information items,
e.g., research papers,
and their producers. In this paper we investigate the notion of cluster centers
as an aggregated representation of all vectors from a given cluster and analyze
them by means of aggregation operators.

**Keywords.** clustering, fuzzy clustering, c-means algorithm, distance, producers assessment problem

16.

**Abstract.** The aggregation theory usually takes an interest in
summarizing a predefined number of points in the real line.
In many applications, like in statistics, data analysis, and mining,
the notion of a mean – a nondecreasing, internal, and symmetric fusion function
– plays a key role. Nevertheless, when it comes to aggregating
a set of points in higher dimensional spaces, the componentwise
extension of monotonicity and internality might not be the best choice.
Instead, the invariance to certain classes of geometric transformations
seems to be crucial in such a case.

**Keywords.** aggregation, centroid, Tukey median, 1-center, 1-median, convex hull, affine invariance, orthogonalization

17.

Lasek J., **Gagolewski M.**,
*Estimation of tournament metrics for association football league formats*,
In: *Selected problems in information technologies (Proc. ITRIA'15 vol. 2)*,
Institute of Computer Science, Polish Academy of Sciences,
2015, pp. 67-78.

18.

Cena A., **Gagolewski M.**,
*Clustering and aggregation of informetric data sets*,
In: *Computational methods in data analysis (Proc. ITRIA'15 vol. 1)*,
Institute of Computer Science, Polish Academy of Sciences,
2015, pp. 5-26. isbn:978-83-63159-22-1

19.

**Abstract.** The producers assessment problem has many important practical
instances: it is an abstract model for intelligent systems evaluating
e.g. the quality of computer software repositories, web resources,
social networking services, and digital libraries. Each producer's
performance is determined according not only to the overall quality
of the items he/she outputted, but also to the number of such items
(which may be different for each agent).

Recent theoretical results indicate that the use of aggregation
operators in the process of ranking and evaluating producers
may not necessarily lead to fair and plausible outcomes. Therefore,
to overcome some weaknesses of the most often applied approach,
in this preliminary study we encourage the use of a fuzzy preference
relation-based setting and indicate why it may provide better
control over the assessment process.

**Keywords.** fuzzy relations, preference modeling, producers assessment problem, StackOverflow, bibliometrics, h-index

20.

**Abstract.** Sugeno integral-based confidence intervals for the theoretical
h-index of a fixed-length sequence of i.i.d. random variables are derived.
They are compared with other estimators of such a distribution characteristic
in a Pareto i.i.d. model. It turns out that in the first case we obtain
much wider intervals. It seems to be due to the fact that a Sugeno integral,
which may be applied on any ordinal scale, is known to ignore too
much information from cardinal-scale data being aggregated.

**Keywords.** h-index, Sugeno integral, confidence interval, Pareto distribution

21.

Bartoszuk M., **Gagolewski M.**,
*A fuzzy R code similarity detection algorithm*,
In: Laurent A. et al. (Eds.), *Information Processing and Management of Uncertainty in Knowledge-Based Systems*,
Part III (*Communications in Computer and Information Science* **444**), Springer, 2014, pp. 21-30. doi:10.1007/978-3-319-08852-5_3

**Abstract.** R is a programming language and software environment
for performing statistical computations
and applying data analysis that increasingly gains popularity
among practitioners and scientists. In this paper we present
a preliminary version of a system to detect pairs of similar R code blocks
among a given set of routines, which is based on a proper aggregation of the output of
three different [0,1]-valued (fuzzy) proximity degree estimation algorithms.
Its analysis on empirical data indicates that the system may in the future be successfully applied in practice
in order to, e.g., detect plagiarism among students' homework submissions or to perform an analysis
of code recycling or code cloning in R's open source package repositories.

**Keywords.** R, plagiarism detection, code cloning, fuzzy similarity measures

22.

Coroianu L., **Gagolewski M.**,
Grzegorzewski P., Adabitabar Firozja M., Houlari T.,
*Piecewise linear approximation of fuzzy numbers preserving the support and core*,
In: Laurent A. et al. (Eds.), *Information Processing and Management of Uncertainty in Knowledge-Based Systems*,
Part II (*Communications in Computer and Information Science* **443**), Springer, 2014, pp. 244-254. doi:10.1007/978-3-319-08855-6_25

**Abstract.** A reasonable approximation of a fuzzy number should have a simple
membership function, be close to the input fuzzy number, and should
preserve some of its important characteristics. In this
paper we suggest approximating a
fuzzy number by a piecewise linear 1-knot fuzzy number which is
the closest one to the input fuzzy number among all piecewise
linear 1-knot fuzzy numbers having the same core and the same
support as the input. We discuss the existence of the approximation
operator, show algorithms ready for the practical
use and illustrate the considered concepts by examples. It turns out that
such an approximation task may be problematic.

**Keywords.** Approximation of fuzzy numbers, core, fuzzy number,
piecewise linear approximation, support

23.

Cena A., **Gagolewski M.**,
*OM3: Ordered maxitive, minitive, and modular
aggregation operators – Part I: Axiomatic analysis under arity-dependence*,
In: Bustince H. et al. (Eds.),
*Aggregation Functions in Theory and in Practise*
(*Advances in Intelligent Systems and Computing* **228**), Springer, 2013, pp. 93-103. doi:10.1007/978-3-642-39165-1_13

**Abstract.** Recently, a very interesting relation between symmetric
minitive, maxitive, and modular aggregation operators has been shown.
It turns out that the intersection between any pair of the mentioned
classes is the same. This result introduces what we here propose
to call the OM3 operators. In the first part of our contribution
on the analysis of the OM3 operators we study some properties that
may be useful when aggregating input vectors of varying lengths.
In Part II we will perform a thorough simulation study of the
impact of input vectors’ calibration on the aggregation results.

24.

Cena A., **Gagolewski M.**,
*OM3: Ordered maxitive, minitive, and modular
aggregation operators – Part II: A simulation study*,
In: Bustince H. et al. (Eds.),
*Aggregation Functions in Theory and in Practise*
(*Advances in Intelligent Systems and Computing* **228**), Springer, 2013,
pp. 105-115. doi:10.1007/978-3-642-39165-1_14

**Abstract.** This article is the second part of the contribution
on the analysis of the recently proposed class of symmetric maxitive,
minitive and modular aggregation operators. Recent results
(Gagolewski, Mesiar, 2012) indicated some unstable behavior
of the generalized h-index, which is a particular instance of
OM3, under input data transformations. That study was performed
on a small, carefully selected real-world data set.
Here we conduct some experiments to examine this phenomenon more extensively.

25.

**Abstract.** In this paper we discuss the construction of a new
parametric statistical hypothesis test for the equality of
probability distributions. The test is based on the difference
between Hirsch’s h-indices of two equal-length i.i.d. random
samples. For the sake of illustration, we analyze its power
in the case of Pareto-distributed input data. It turns out that
the test is very conservative and has wide acceptance regions,
which calls into question the appropriateness of using the h-index
in scientific quality control and decision making.

26.

**Abstract.** The Choquet, Sugeno and Shilkret integrals with respect to monotone measures
are useful tools in decision support systems.
In this paper we propose a new class of graph-based integrals
that generalize these three operations.
Then, an efficient linear-time
algorithm for computing their special case,
that is l_{p}-indices, 1≤p<∞, is presented.
The algorithm is based on R.L. Graham's routine for determining
the convex hull of a finite planar set.

**Keywords.** Monotone measures, Choquet, Sugeno and Shilkret integral,
l_{p}-index, convex hull, Graham's scan, scientific impact indices

27.

**Abstract.** In this paper the recently introduced class of
effort-dominating impact functions is examined. It turns out
that each effort-dominating aggregation operator not only has a
very intuitive interpretation, but also is symmetric minitive, and
therefore may be expressed as a so-called quasi-I-statistic, which
generalizes the well-known OWMin operator.

These aggregation operators may be used, e.g., in the Producer Assessment
Problem, whose most important instance is the scientometric/bibliometric
issue of fairly ranking scientists by means of the number of citations
received by their papers.

28.

**Abstract.** Two classes of aggregation functions, L-statistics
and S-statistics, and their generalizations, called quasi-L-statistics
and quasi-S-statistics, are considered. Some interesting characterizations
of these families of operators are given. The aforementioned functions
are useful in various applications. In particular, they are very helpful
for modeling the so-called Producer Assessment Problem.

29.

Rowiński T., **Gagolewski M.**,
*Internet a kryzys*,
In: Jankowska M., Starzomska M. (Eds.),
*Kryzys: Pułapka czy szansa?*, WN Akapit, 2011,
pp. 211-224. isbn:978-83-609-5885-8

30.

31.

**Abstract.** Some statistical properties of the so-called S-statistics,
which generalize the ordered weighted maximum aggregation operators,
are considered. In particular, the asymptotic normality of S-statistics
is proved and some possible applications in estimation problems are suggested.

32.

**Abstract.** A class of extended aggregation operators, called impact
functions, is proposed and their basic properties are examined.
Some important classes of functions, like generalized ordered weighted
averaging (OWA) and ordered weighted maximum (OWMax) operators,
are considered. The general idea is illustrated by the Producer
Assessment Problem, which includes the scientometric problem of
rating scientists based on the number of citations received by
their publications. An interesting characterization of the
well-known h-index is given.

33.

34.

**Abstract.** The problem of measuring scientific impact is considered. A class
of so-called p-sphere (r_{p}) indices, which generalize the well-known
Hirsch index, is used to construct a possibility measure of
scientific impact. This measure might be treated as a
starting point for prediction of future index values or for dealing
with right-censored bibliometric data.

1.

2.

3.

`Python` (`Python`)

4.

`R`. Analiza danych, obliczenia, symulacje
(`R` Programming. Data Analysis. Computing. Simulations)

5.

Grzegorzewski P., **Gagolewski M.**, Bobecka-Wesołowska K.,
*Wnioskowanie statystyczne z wykorzystaniem środowiska `R`*
(*Statistical Inference in `R`*),
Politechnika Warszawska,
2014, 183 pp. isbn:978-83-93-72601-1

1.

Halaš R., **Gagolewski M.**, Mesiar R. (Eds.),
*New Trends in Aggregation Theory*
(*Advances in Intelligent Systems and Computing* **981**),
Springer, 2019, 348 pp. doi:10.1007/978-3-030-19494-9 isbn:978-3-030-19493-2

2.

Ferraro M.B., Giordani P., Vantaggi B.,
**Gagolewski M.**, Gil M.Á., Grzegorzewski P.,
Hryniewicz O. (Eds.),
*Soft Methods for Data Science*
(*Advances in Intelligent Systems and Computing* **456**), Springer, 2017, 535 pp. doi:10.1007/978-3-319-42972-4 isbn:978-3-319-42971-7

3.

Grzegorzewski P., **Gagolewski M.**,
Hryniewicz O., Gil M.Á. (Eds.),
*Strengthening Links Between Data Analysis and Soft Computing*
(*Advances in Intelligent Systems and Computing* **315**), Springer, 2015, 294 pp. doi:10.1007/978-3-319-10765-3 isbn:978-3-319-10764-6