Publications

Summary

53 publications in total, including:

  • 1 research monograph,
  • 3 textbooks,
  • 2 edited volumes,
  • 17 journal papers, and
  • 23 papers in proceedings of international conferences.

My current h-index = 7 (Web of Science) / 7 (Scopus) / 9 (Google Scholar) / 5 (Scopus, but without self-citations).

ORCID = 0000-0003-0637-6028

ResearcherID = C-3575-2012

ResearchGate profile

Kudos profile

My Erdős number = 4 (R. Mesiar - I. Assani - R.D. Mauldin - P. Erdős or L. Coroianu - S.G. Gal - J. Szabados - P. Erdős).

My publication list is also available in BibTeX format.

Research Monographs

1.
Gagolewski M., Data Fusion: Theory, Methods, and Applications, Institute of Computer Science, Polish Academy of Sciences, 2015, 290 pp. isbn:978-83-63159-20-7

Textbooks

1.
Gagolewski M., Bartoszuk M., Cena A., Przetwarzanie i analiza danych w języku Python (Data Processing and Analysis in Python), Wydawnictwo Naukowe PWN, 2016, 369 pp. isbn:978-83-01-18940-2
2.
Gagolewski M., Programowanie w języku R. Analiza danych, obliczenia, symulacje (R Programming. Data Analysis. Computing. Simulations), Wydawnictwo Naukowe PWN; 1st ed. – 2014, 509 pp.; 2nd ed. – 2016, 550 pp. isbn:978-83-01-18939-6
3.
Grzegorzewski P., Gagolewski M., Bobecka-Wesołowska K., Wnioskowanie statystyczne z wykorzystaniem środowiska R (Statistical Inference in R), Biuro ds. Projektu „Program Rozwojowy Politechniki Warszawskiej”, 2014, 183 pp. isbn:978-83-93-72601-1

Edited Volumes

1.
Ferraro M.B., Giordani P., Vantaggi B., Gagolewski M., Gil M.Á., Grzegorzewski P., Hryniewicz O. (Eds.), Soft Methods for Data Science (Advances in Intelligent Systems and Computing 456), Springer, 2017, 535 pp. doi:10.1007/978-3-319-42972-4 isbn:978-3-319-42971-7
2.
Grzegorzewski P., Gagolewski M., Hryniewicz O., Gil M.Á. (Eds.), Strengthening Links Between Data Analysis and Soft Computing (Advances in Intelligent Systems and Computing 315), Springer, 2015, 294 pp. doi:10.1007/978-3-319-10765-3 isbn:978-3-319-10764-6

Articles in Journals

1.
Gagolewski M., Penalty-based aggregation of multidimensional data, Fuzzy Sets and Systems, 2016. (accepted for publication) doi:10.1016/j.fss.2016.12.009

Abstract. Research in aggregation theory is nowadays still mostly focused on algorithms to summarize tuples consisting of observations in some real interval or of diverse general ordered structures. Of course, in practice of information processing many other data types between these two extreme cases are worth inspecting. This contribution deals with the aggregation of lists of data points in Rd for arbitrary d≥1. Even though particular functions aiming to summarize multidimensional data have been discussed by researchers in data analysis, computational statistics and geometry, there is clearly a need to provide a comprehensive and unified model in which their properties like equivariances to geometric transformations, internality, and monotonicity may be studied at an appropriate level of generality. The proposed penalty-based approach serves as a common framework for all idempotent information aggregation methods, including componentwise functions, pairwise distance minimizers, and data depth-based medians. It also allows for deriving many new practically useful tools.

Keywords. multidimensional data aggregation, penalty functions, data depth, centroid, median

2.
Beliakov G., Gagolewski M., James S., Penalty-based and other representations of economic inequality, International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 24(Suppl. 1), 2016, pp. 1-23. doi:10.1142/S0218488516400018

Abstract. Economic inequality measures are employed as a key component in various socio-demographic indices to capture the disparity between the wealthy and poor. Since their inception, they have also been used as a basis for modelling spread and disparity in other contexts. While recent research has identified that a number of classical inequality and welfare functions can be considered in the framework of OWA operators, here we propose a framework of penalty-based aggregation functions and their associated penalties as measures of inequality.

Keywords. penalty functions, aggregation functions, inequality indices, spread measures

3.
Gagolewski M., Bartoszuk M., Cena A., Genie: A new, fast, and outlier-resistant hierarchical clustering algorithm, Information Sciences 363, 2016, pp. 8-23. doi:10.1016/j.ins.2016.05.003

Abstract. The time needed to apply a hierarchical clustering algorithm is most often dominated by the number of computations of a pairwise dissimilarity measure. Such a constraint, for larger data sets, puts at a disadvantage the use of all the classical linkage criteria but the single linkage one. However, it is known that the single linkage clustering algorithm is very sensitive to outliers, produces highly skewed dendrograms, and therefore usually does not reflect the true underlying data structure – unless the clusters are well-separated. To overcome its limitations, we propose a new hierarchical clustering linkage criterion called Genie. Namely, our algorithm links two clusters in such a way that a chosen economic inequity measure (e.g., the Gini- or Bonferroni-index) of the cluster sizes does not increase drastically above a given threshold. The presented benchmarks indicate a high practical usefulness of the introduced method: it most often outperforms the Ward or average linkage in terms of the clustering quality while retaining the single linkage speed. The Genie algorithm is easily parallelizable and thus may be run on multiple threads to speed up its execution further on. Its memory overhead is small: there is no need to precompute the complete distance matrix to perform the computations in order to obtain a desired clustering. It can be applied on arbitrary spaces equipped with a dissimilarity measure, e.g., on real vectors, DNA or protein sequences, images, rankings, informetric data, etc. A reference implementation of the algorithm has been included in the open source genie package for R.

Keywords. hierarchical clustering, single linkage, inequity measures, Gini-index
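
The inequity measure of cluster sizes mentioned in the abstract can be illustrated with a short sketch. It computes one common normalized form of the Gini index; the function name `gini_index` is hypothetical, and this is not the reference implementation from the genie package.

```python
def gini_index(sizes):
    """Normalized Gini index of a list of nonnegative cluster sizes.

    Returns 0.0 when all sizes are equal and 1.0 when a single cluster
    holds everything (all other sizes are zero).
    """
    n = len(sizes)
    total = sum(sizes)
    if n < 2 or total == 0:
        return 0.0
    # Sum of absolute differences over all unordered pairs of clusters.
    pairwise = sum(abs(sizes[i] - sizes[j])
                   for i in range(n) for j in range(i + 1, n))
    # Assumed normalization: divide by (n - 1) times the total size.
    return pairwise / ((n - 1) * total)
```

A clustering routine could compare this value against a user-chosen threshold before each merge; how the comparison affects the choice of clusters to link is specified in the paper, not here.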

4.
Mesiar R., Gagolewski M., H-index and other Sugeno integrals: Some defects and their compensation, IEEE Transactions on Fuzzy Systems 24(6), 2016, pp. 1668-1672. doi:10.1109/TFUZZ.2016.2516579

Abstract. The famous Hirsch index was introduced only ca. 10 years ago. Despite that, it is already widely used in many decision making tasks, like in evaluation of individual scientists, research grant allocation, or even production planning. It is known that the h-index is related to the discrete Sugeno integral and the Ky Fan metric introduced in the 1940s. The aim of this paper is to propose a few modifications of this index as well as other fuzzy integrals – also on bounded chains – that lead to better discrimination of some types of data that are to be aggregated. All of the suggested compensation methods try to retain the simplicity of the original measure.

Keywords. h-index, Sugeno integral, Ky Fan metric, Shilkret integral, decomposition integrals

5.
Lasek J., Szlavik Z., Gagolewski M., Bhulai S., How to improve a team's position in the FIFA ranking – A simulation study, Journal of Applied Statistics 43(7), 2016, pp. 1349-1368. doi:10.1080/02664763.2015.1100593

Abstract. In this paper, we study the efficacy of the official ranking for international football teams compiled by FIFA, the body governing football competition around the globe. We present strategies for improving a team's position in the ranking. By combining several statistical techniques we derive an objective function in a decision problem of optimal scheduling of future matches. The presented results display how a team's position can be improved. Along the way, we compare the official procedure to the famous Elo rating system. Although it originates from chess, it has been successfully tailored to ranking football teams as well.

Keywords. association football, FIFA ranking, prediction models, Monte Carlo simulations, optimal schedule, team rankings

6.
Żogała-Siudem B., Siudem G., Cena A., Gagolewski M., Agent-based model for the h-index – Exact solution, European Physical Journal B 89:21, 2016. doi:10.1140/epjb/e2015-60757-1

Abstract. Hirsch’s h-index is perhaps the most popular citation-based measure of scientific excellence. In 2013, Ionescu and Chopard proposed an agent-based model describing a process for generating publications and citations in an abstract scientific community [G. Ionescu, B. Chopard, Eur. Phys. J. B 86, 426 (2013)]. Within such a framework, one may simulate a scientist’s activity, and – by extension – investigate the whole community of researchers. Even though the Ionescu and Chopard model predicts the h-index quite well, the authors provided a solution based solely on simulations. In this paper, we complete their results with exact, analytic formulas. What is more, by considering a simplified version of the Ionescu-Chopard model, we obtained a compact, easy to compute formula for the h-index. The derived approximate and exact solutions are investigated on simulated and real-world data sets.

Keywords. Statistical and nonlinear physics, preferential attachment rule, h-index

7.
Cena A., Gagolewski M., Mesiar R., Problems and challenges of information resources producers' clustering, Journal of Informetrics 9(2), 2015, pp. 273–284. doi:10.1016/j.joi.2015.02.005

Abstract. Classically, unsupervised machine learning techniques are applied on data sets with a fixed number of attributes (variables). However, many problems encountered in the field of informetrics confront us with the need to extend these kinds of methods in a way such that they may be computed over a set of nonincreasingly ordered vectors of unequal lengths. Thus, in this paper, some new dissimilarity measures (metrics) are introduced and studied. Owing to that, we may use, among others, hierarchical clustering algorithms in order to determine an input data set's partition consisting of sets of producers that are homogeneous not only with respect to the quality of information resources, but also their quantity.

Keywords. aggregation, hierarchical clustering, distance, metric

8.
Gagolewski M., Spread measures and their relation to aggregation functions, European Journal of Operational Research 241(2), 2015, pp. 469-477. doi:10.1016/j.ejor.2014.08.034

Abstract. The theory of aggregation most often deals with measures of central tendency. However, sometimes a very different kind of a numeric vector's synthesis into a single number is required. In this paper we introduce a class of mathematical functions which aim to measure spread or scatter of one-dimensional quantitative data. The proposed definition serves as a common, abstract framework for measures of absolute spread known from statistics, exploratory data analysis and data mining, e.g. the sample variance, standard deviation, range, interquartile range (IQR), median absolute deviation (MAD), etc. Additionally, we develop new measures of experts' opinions diversity or consensus in group decision making problems. We investigate some properties of spread measures, show how they are related to aggregation functions, and indicate their new potentially fruitful application areas.

Keywords. Group decisions and negotiations, aggregation, spread, deviation, variance
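
As a point of reference, the classical measures of absolute spread listed in the abstract can be computed with Python's standard library. This is only a sketch of those well-known statistics, not of the paper's axiomatic framework; the function name `spread_summary` is hypothetical, and the MAD is left unscaled.

```python
import statistics

def spread_summary(xs):
    """Return a few classical measures of absolute spread for a numeric sample."""
    med = statistics.median(xs)
    # Quartile cut points, using the default method of statistics.quantiles.
    q1, _, q3 = statistics.quantiles(xs, n=4)
    return {
        "variance": statistics.variance(xs),  # sample variance (divisor n - 1)
        "sd": statistics.stdev(xs),           # sample standard deviation
        "range": max(xs) - min(xs),
        "iqr": q3 - q1,
        "mad": statistics.median([abs(x - med) for x in xs]),
    }
```

All of these values are unaffected, up to floating-point error, by adding the same constant to every observation, which is what separates them from measures of central tendency.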

9.
Cena A., Gagolewski M., OM3: Ordered maxitive, minitive, and modular aggregation operators – axiomatic and probabilistic properties in an arity-monotonic setting, Fuzzy Sets and Systems 264, 2015, pp. 138-159. doi:10.1016/j.fss.2014.04.001

Abstract. The recently-introduced OM3 aggregation operators fulfill three appealing properties: they are simultaneously minitive, maxitive, and modular. Among the instances of OM3 operators we find e.g. OWMax and OWMin operators, the famous Hirsch's h-index and all its natural generalizations.
In this paper the basic axiomatic and probabilistic properties of extended, i.e. in an arity-dependent setting, OM3 aggregation operators are studied. We illustrate the difficulties one is inevitably faced with when trying to combine the quality and quantity of numeric items into a single number. The discussion on such aggregation methods is particularly important in the information resources producers assessment problem, which aims to reduce the negative effects of information overload. It turns out that the Hirsch-like indices of impact do not fulfill a set of very important properties, which puts the sensibility of their practical usage into question. Moreover, thanks to the probabilistic analysis of the operators in an i.i.d. model, we may better understand the relationship between the aggregated items' quality and their producers' productivity.

Keywords. Aggregation; ordered modularity, maxitivity and minitivity; arity-monotonicity; impact assessment; Hirsch's h-index; informetrics

10.
Gagolewski M., Mesiar R., Monotone measures and universal integrals in a uniform framework for the scientific impact assessment problem, Information Sciences 263, 2014, pp. 166-174. doi:10.1016/j.ins.2013.12.004

Abstract. The Choquet, Sugeno, and Shilkret integrals with respect to monotone measures, as well as their generalization – the universal integral, stand for a useful tool in decision support systems. In this paper we propose a general construction method for aggregation operators that may be used in assessing output of scientists. We show that the most often currently used indices of bibliometric impact, like Hirsch's h, Woeginger's w, Egghe's g, Kosmulski's MAXPROD, and similar constructions, may be obtained by means of our framework. Moreover, the model easily leads to some new, very interesting functions.

Keywords. Choquet, Sugeno, Shilkret, universal integral; monotone measures; aggregation; indices of scientific impact, bibliometrics; h-index, w-index, g-index, MAXPROD-index
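
For orientation, three of the indices named in the abstract can be sketched directly from their usual definitions. The code below does not reproduce the integral-based construction proposed in the paper; the function names are hypothetical, and the g-index is the basic variant bounded by the number of papers.

```python
def h_index(citations):
    """Largest h such that at least h papers have at least h citations each."""
    xs = sorted(citations, reverse=True)
    # For a nonincreasing sequence, x_i >= i holds exactly for i = 1, ..., h.
    return sum(1 for i, x in enumerate(xs, start=1) if x >= i)

def g_index(citations):
    """Largest g such that the top g papers have at least g**2 citations in total."""
    xs = sorted(citations, reverse=True)
    g, running = 0, 0
    for i, x in enumerate(xs, start=1):
        running += x
        if running >= i * i:
            g = i
    return g

def maxprod_index(citations):
    """Maximum of i * x_i over the nonincreasingly sorted citation counts."""
    xs = sorted(citations, reverse=True)
    return max((i * x for i, x in enumerate(xs, start=1)), default=0)
```

For the citation record [10, 8, 5, 4, 3], these functions return 4, 5, and 16, respectively.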

11.
Coroianu L., Gagolewski M., Grzegorzewski P., Nearest piecewise linear approximation of fuzzy numbers, Fuzzy Sets and Systems 233, 2013, pp. 26-51. doi:10.1016/j.fss.2013.02.005

Abstract. The problem of the nearest approximation of fuzzy numbers by piecewise linear 1-knot fuzzy numbers is discussed. By using 1-knot fuzzy numbers one may obtain approximations which are simple enough and flexible to reconstruct the input fuzzy concepts under study. They might be also perceived as a generalization of the trapezoidal approximations. Moreover, these approximations possess some desirable properties. Apart from theoretical considerations approximation algorithms that can be applied in practice are also given.

Keywords. Approximation of fuzzy numbers; Fuzzy number; Piecewise linear approximation

12.
Gagolewski M., Scientific impact assessment cannot be fair, Journal of Informetrics 7(4), 2013, pp. 792-802. doi:10.1016/j.joi.2013.07.001

Abstract. In this paper we deal with the problem of aggregating numeric sequences of arbitrary length that represent e.g. citation records of scientists. Impact functions are the aggregation operators that express as a single number not only the quality of individual publications, but also their author's productivity.
We examine some fundamental properties of these aggregation tools. It turns out that each impact function which always gives indisputable valuations must necessarily be trivial. Moreover, it is shown that for any set of citation records in which none is dominated by the other, we may construct an impact function that gives any a priori-established authors' ordering. Theoretically then, there is considerable room for manipulation in the hands of decision makers.
We also discuss the differences between the impact function-based and the multicriteria decision making-based approach to scientific quality management, and study how the introduction of new properties of impact functions affects the assessment process. We argue that simple mathematical tools like the h- or g-index (as well as other bibliometric impact indices) may not necessarily be a good choice when it comes to assessing scientific achievements.

Keywords. Impact functions; aggregation; decision making; reference modeling; Hirsch's h-index; scientometrics; bibliometrics

13.
Gagolewski M., On the relationship between symmetric maxitive, minitive, and modular aggregation operators, Information Sciences 221, 2013, pp. 170-180. doi:10.1016/j.ins.2012.09.005

Abstract. In this paper the relationship between symmetric minitive, maxitive, and modular aggregation operators is considered. It is shown that the intersection between any two of the three discussed classes is the same. Moreover, the intersection is explicitly characterized.
It turns out that the intersection contains families of aggregation operators such as OWMax, OWMin, and many generalizations of the widely-known Hirsch’s h-index, often applied in scientific quality control.

Keywords. Aggregation operators; OWMax; OMA; OWA; Hirsch’s h-index; Scientometrics

Comments. Later we proposed that the symmetric minitive, maxitive, and modular aggregation operators may be called the OM3 agops, see (Cena A., Gagolewski M., OM3: ordered maxitive, minitive, and modular aggregation operators – Part I: Axiomatic analysis under arity-dependence, 2013).

14.
Gagolewski M., Mesiar R., Aggregating different paper quality measures with a generalized h-index, Journal of Informetrics 6(4), 2012, pp. 566-579. doi:10.1016/j.joi.2012.05.001

Abstract. The process of assessing individual authors should rely upon a proper aggregation of reliable and valid papers’ quality metrics. Citations are merely one possible way to measure appreciation of publications. In this study we propose some new, SJR- and SNIP-based indicators, which not only take into account the broadly conceived popularity of a paper (manifested by the number of citations), but also other factors like its potential, or the quality of papers that cite a given publication. We explore the relation and correlation between different metrics and study how they affect the values of a real-valued generalized h-index calculated for 11 prominent scientometricians. We note that the h-index is a very unstable impact function, highly sensitive to scaling of the input elements. Our analysis is not only of theoretical significance: data scaling is often performed to normalize citations across disciplines. Uncontrolled application of this operation may lead to unfair and biased (toward some groups) decisions. This puts the validity of authors' assessment and ranking using the h-index into question. Obviously, a good impact function to be used in practice should not be as sensitive to changes in input data as the analyzed one.

Keywords. Aggregation operators; Impact functions; Hirsch's h-index; Quality control; Scientometrics; Bibliometrics; SJR; SNIP; Scopus; CITAN; R

Comments. An empirical paper. The ideas presented here were later explored more thoroughly in (Cena A., Gagolewski M., OM3: ordered maxitive, minitive, and modular aggregation operators – Part II: A simulation study, 2013).

15.
Gagolewski M., Bibliometric impact assessment with R and the CITAN package, Journal of Informetrics 5(4), 2011, pp. 678-692. doi:10.1016/j.joi.2011.06.006

Abstract. In this paper CITAN, the CITation ANalysis package for R statistical computing environment, is introduced. The main aim of the software is to support bibliometricians with a tool for preprocessing and cleaning bibliographic data retrieved from SciVerse Scopus and for calculating the most popular indices of scientific impact.
To show the practical usability of the package, an exemplary assessment of authors publishing in the fields of scientometrics and webometrics is performed.

Keywords. Data analysis software; Quality control in science; Citation analysis; Bibliometrics; Hirsch's h index; Egghe's g index; SciVerse Scopus

16.
Gagolewski M., Grzegorzewski P., Possibilistic analysis of arity-monotonic aggregation operators and its relation to bibliometric impact assessment of individuals, International Journal of Approximate Reasoning 52(9), 2011, pp. 1312-1324. doi:10.1016/j.ijar.2011.01.010

Abstract. A class of arity-monotonic aggregation operators, called impact functions, is proposed. This family of operators forms a theoretical framework for the so-called Producer Assessment Problem, which includes the scientometric task of fair and objective assessment of scientists using the number of citations received by their publications.
The impact function output values are analyzed under right-censored and dynamically changing input data. The qualitative possibilistic approach is used to describe this kind of uncertainty. It leads to intuitive graphical interpretations and may be easily applied for practical purposes.
The discourse is illustrated by a family of aggregation operators generalizing the well-known Ordered Weighted Maximum (OWMax) and the Hirsch h-index.

Keywords. Aggregation operators; Possibility theory; S-statistics; h-index; OWMax

Comments. In this paper the class of effort-dominating impact functions has also been introduced. I have shown later (see Gagolewski M., On the Relation Between Effort-Dominating and Symmetric Minitive Aggregation Operators, 2012) that all such aggregation operators are symmetric minitive.

17.
Gagolewski M., Grzegorzewski P., A geometric approach to the construction of scientific impact indices, Scientometrics 81(3), 2009, pp. 617-634. doi:10.1007/s11192-008-2253-y

Abstract. Two broad classes of scientific impact indices are proposed and their properties – both theoretical and practical – are discussed. These new classes were obtained as a geometric generalization of the well-known tools applied in scientometrics, like Hirsch’s h-index, Woeginger’s w-index and Kosmulski’s Maxprod. It is shown how to apply the suggested indices for estimation of the shape of the citation function or the total number of citations of an individual. Additionally, a new efficient and simple O(log n) algorithm for computing the h-index is given.

Keywords. Hirsch's h-index, citation analysis, scientific impact indices

Comments. I have shown later (see Gagolewski M., On the Relation Between Effort-Dominating and Symmetric Minitive Aggregation Operators, 2012) that the rp-indices are symmetric minitive. Moreover, we have found that there exists an O(n log n) algorithm for determining lp (see Gagolewski M., Dębski M., Nowakiewicz M., Efficient Algorithm for Computing Certain Graph-Based Monotone Integrals: the lp-Indices, 2013).
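
To illustrate how an O(log n) computation of the h-index is possible at all, here is a minimal binary-search sketch. It assumes the citation counts are already sorted nonincreasingly, and it is not necessarily the algorithm given in the paper; the function name `h_index_sorted` is hypothetical.

```python
def h_index_sorted(xs):
    """h-index of a nonincreasingly sorted list of nonnegative numbers.

    Uses binary search, so only O(log n) elements are inspected.
    """
    lo, hi = 0, len(xs)  # invariant: the h-index lies in [lo, hi]
    while lo < hi:
        mid = (lo + hi + 1) // 2  # candidate value of h (1-based position)
        if xs[mid - 1] >= mid:
            lo = mid              # at least mid papers have >= mid citations
        else:
            hi = mid - 1          # the mid-th paper has too few citations
    return lo
```

The search is valid because, for a nonincreasing sequence, the condition x_i >= i holds for the first h positions and fails for all later ones. For [10, 8, 5, 4, 3] the function returns 4.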

Papers in Edited Volumes and Proceedings

1.
Gagolewski M., Cena A., Bartoszuk M., Hierarchical clustering via penalty-based aggregation and the Genie approach, In: Torra V. et al. (Eds.), Modeling Decisions for Artificial Intelligence (Lecture Notes in Artificial Intelligence 9880), Springer, 2016, pp. 191-202. doi:10.1007/978-3-319-45656-0_16

Abstract. The paper discusses a generalization of the nearest centroid hierarchical clustering algorithm. A first extension deals with the incorporation of generic distance-based penalty minimizers instead of the classical aggregation by means of centroids. Due to that, the presented algorithm can be applied in spaces equipped with an arbitrary dissimilarity measure (images, DNA sequences, etc.). Secondly, a correction preventing the formation of clusters of too highly unbalanced sizes is applied: just like in the recently introduced Genie approach, which extends the single linkage scheme, the new method averts a chosen inequity measure (e.g., the Gini-, de Vergottini-, or Bonferroni-index) of cluster sizes from rising above a predefined threshold. Numerous benchmarks indicate that the introduction of such a correction increases the quality of the resulting clusterings.

Keywords. hierarchical clustering, aggregation, centroid, Gini-index, Genie algorithm

2.
Bartoszuk M., Beliakov G., Gagolewski M., James S., Fitting aggregation functions to data: Part I – Linearization and regularization, In: Carvalho J.P. et al. (Eds.), Information Processing and Management of Uncertainty in Knowledge-Based Systems, Part II (Communications in Computer and Information Science 611), Springer, 2016, pp. 767-779. doi:10.1007/978-3-319-40581-0_62

Abstract. The use of supervised learning techniques for fitting weights and/or generator functions of weighted quasi-arithmetic means – a special class of idempotent and nondecreasing aggregation functions – to empirical data has already been considered in a number of papers. Nevertheless, there are still some important issues that have not been discussed in the literature yet. In the first part of this two-part contribution we deal with the concept of regularization, a quite standard technique from machine learning applied so as to increase the fit quality on test and validation data samples. Due to the constraints on the weighting vector, it turns out that quite different methods can be used in the current framework, as compared to regression models. Moreover, it is worth noting that so far fitting weighted quasi-arithmetic means to empirical data has only been performed approximately, via the so-called linearization technique. In this paper we consider exact solutions to such special optimization tasks and indicate cases where linearization leads to much worse solutions.

Keywords. Aggregation functions, weighted quasi-arithmetic means, least squares fitting, regularization, linearization

3.
Bartoszuk M., Beliakov G., Gagolewski M., James S., Fitting aggregation functions to data: Part II – Idempotization, In: Carvalho J.P. et al. (Eds.), Information Processing and Management of Uncertainty in Knowledge-Based Systems, Part II (Communications in Computer and Information Science 611), Springer, 2016, pp. 780-789. doi:10.1007/978-3-319-40581-0_63

Abstract. The use of supervised learning techniques for fitting weights and/or generator functions of weighted quasi-arithmetic means – a special class of idempotent and nondecreasing aggregation functions – to empirical data has already been considered in a number of papers. Nevertheless, there are still some important issues that have not been discussed in the literature yet. In the second part of this two-part contribution we deal with a quite common situation in which we have inputs coming from different sources, describing a similar phenomenon, but which have not been properly normalized. In such a case, idempotent and nondecreasing functions cannot be used to aggregate them unless proper pre-processing is performed. The proposed idempotization method, based on the notion of B-splines, allows for an automatic calibration of independent variables. The introduced technique is applied in an R source code plagiarism detection system.

Keywords. Aggregation functions, weighted quasi-arithmetic means, least squares fitting, idempotence

4.
Cena A., Gagolewski M., Fuzzy k-minpen clustering and k-nearest-minpen classification procedures incorporating generic distance-based penalty minimizers, In: Carvalho J.P. et al. (Eds.), Information Processing and Management of Uncertainty in Knowledge-Based Systems, Part II (Communications in Computer and Information Science 611), Springer, 2016, pp. 445-456. doi:10.1007/978-3-319-40581-0_36

Abstract. We discuss a generalization of the fuzzy (weighted) k-means clustering procedure and point out its relationships with data aggregation in spaces equipped with arbitrary dissimilarity measures. In the proposed setting, a data set partitioning is performed based on the notion of points' proximity to generic distance-based penalty minimizers. Moreover, a new data classification algorithm, resembling the k-nearest neighbors scheme but less computationally and memory demanding, is introduced. Rich examples in complex data domains indicate the usability of the methods and aggregation theory in general.

Keywords. fuzzy k-means algorithm, clustering, classification, fusion functions, penalty minimizers

5.
Lasek J., Gagolewski M., The winning solution to the AAIA'15 Data Mining Competition: Tagging firefighter activities at a fire scene, In: Ganzha M., Maciaszek L., Paprzycki M. (Eds.), Proc. FedCSIS'15, IEEE, 2015, pp. 375-380. doi:10.15439/2015F418

Abstract. Multi-sensor based classification of professionals' activities plays a key role in ensuring that their goals are achieved. In this paper we present the winning solution to the AAIA'15 Tagging Firefighter Activities at a Fire Scene data mining competition. The approach is based on a Random Forest classifier trained on an input data set with almost 5000 features describing the underlying time series of sensory data.

Keywords. Activity tagging, movement tagging, data mining competition, Random Forest model, FFT

6.
Cena A., Gagolewski M., A K-means-like algorithm for informetric data clustering, In: Alonso J.M., Bustince H., Reformat M. (Eds.), Proc. IFSA/EUSFLAT 2015, Atlantis Press, 2015, pp. 536-543. doi:10.2991/ifsa-eusflat-15.2015.77

Abstract. The K-means algorithm is one of the most often used clustering techniques. However, when it comes to discovering clusters in informetric data sets that consist of non-increasingly ordered vectors of not necessarily conforming lengths, such a method cannot be applied directly. Hence, in this paper, we propose a K-means-like algorithm to determine groups of producers that are similar not only with respect to the quality of information resources they output, but also their quantity.

Keywords. k-means clustering, informetrics, aggregation, impact functions

7.
Bartoszuk M., Gagolewski M., Detecting similarity of R functions via a fusion of multiple heuristic methods, In: Alonso J.M., Bustince H., Reformat M. (Eds.), Proc. IFSA/EUSFLAT 2015, Atlantis Press, 2015, pp. 419-426. doi:10.2991/ifsa-eusflat-15.2015.61

Abstract. In this paper we describe recent advances in our R code similarity detection algorithm. We propose a modification of the Program Dependence Graph (PDG) procedure used in the GPLAG system that better fits the nature of functional programming languages like R. The major strength of our approach lies in a proper aggregation of outputs of multiple plagiarism detection methods, as it is well known that no single technique gives perfect results. It turns out that the incorporation of the PDG algorithm significantly improves the recall ratio, i.e. it is better in indicating true positive cases of plagiarism or code cloning patterns. The implemented system is available as web application at http://SimilaR.Rexamine.com/.

Keywords. R, plagiarism and code cloning detection, fuzzy proximity relations, aggregation, program dependence graph, t-norms

8.
Gagolewski M., Lasek J., Learning experts' preferences from informetric data, In: Alonso J.M., Bustince H., Reformat M. (Eds.), Proc. IFSA/EUSFLAT 2015, Atlantis Press, 2015, pp. 484-491. doi:10.2991/ifsa-eusflat-15.2015.70

Abstract. In the field of informetrics, agents are often represented by numeric sequences of not necessarily conforming lengths. There are numerous aggregation techniques for such sequences, e.g., the g-index and the h-index, that may be used to compare the output of pairs of agents. In this paper we address the question of whether such impact indices may be used to model experts' preferences accurately.

Keywords. preference learning, fuzzy relations, informetrics, aggregation, h-index

9.
Gagolewski M., Normalized WDpWAM and WDpOWA spread measures, In: Alonso J.M., Bustince H., Reformat M. (Eds.), Proc. IFSA/EUSFLAT 2015, Atlantis Press, 2015, pp. 210-216. doi:10.2991/ifsa-eusflat-15.2015.32

Abstract. Aggregation theory often deals with measures of central tendency of quantitative data. As sometimes a different kind of information fusion is needed, an axiomatization of spread measures was introduced recently. In this contribution we explore the properties of WDpWAM and WDpOWA operators, which are defined as weighted Lp-distances to weighted arithmetic mean and OWA operators, respectively. In particular, we give forms of vectors that maximize such fusion functions and thus provide a way to normalize the output value so that the vector of maximal spread always leads to a fixed outcome, e.g., 1 if all the input elements are in [0,1]. This might be desirable when constructing measures of experts' opinions consistency or diversity in group decision making problems.

Keywords. data fusion, aggregation, spread, deviation, variance, OWA operators
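
A spread measure of the kind described in the abstract above might be sketched as follows. This is an assumption-laden illustration, not the paper's formulation: it takes WDpWAM to be the weighted Lp-distance between the input vector and its weighted arithmetic mean, with hypothetical weight vectors `w` (for the distance) and `v` (for the mean). Normalization is done here by brute force over the vertices of [0,1]^n, where the maximum of this convex function is attained for p >= 1, rather than by the closed-form maximizers derived in the paper.

```python
from itertools import product


def weighted_mean(x, v):
    """Weighted arithmetic mean; the weights v are assumed to sum to 1."""
    return sum(vi * xi for vi, xi in zip(v, x))


def wdp_wam(x, w, v, p=2.0):
    """Weighted Lp-distance of x to its weighted arithmetic mean (assumed form)."""
    m = weighted_mean(x, v)
    return sum(wi * abs(xi - m) ** p for wi, xi in zip(w, x)) ** (1.0 / p)


def normalized_wdp_wam(x, w, v, p=2.0):
    """Rescale so that a vector of maximal spread in [0,1]^n yields 1.

    The maximum is searched over the 2^n vertices of the unit hypercube,
    which is feasible only for small n.
    """
    max_spread = max(wdp_wam(vertex, w, v, p)
                     for vertex in product((0.0, 1.0), repeat=len(x)))
    return wdp_wam(x, w, v, p) / max_spread


weights = [0.5, 0.5]
print(wdp_wam([0.0, 1.0], weights, weights))
print(normalized_wdp_wam([0.0, 1.0], weights, weights))
```

The vertex search is exponential in the number of inputs, which is why explicit forms of the maximizing vectors, as discussed in the paper, matter in practice.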

10.
Cena A., Gagolewski M., Aggregation and soft clustering of informetric data, In: Baczyński M., De Baets B., Mesiar R. (Eds.), Proc. 8th International Summer School on Aggregation Operators (AGOP 2015), University of Silesia, 2015, pp. 79-84. isbn:978-83-8012-519-3

Abstract. The aim of this contribution is to inspect possible applications of clustering techniques computed over a set consisting of nonincreasingly ordered vectors of possibly nonconforming lengths. Such data sets appear in the field of informetrics, where one may need to evaluate the quality of information items, e.g., research papers, and their producers. In this paper we investigate the notion of cluster centers as an aggregated representation of all vectors from a given cluster and analyze them by means of aggregation operators.

Keywords. clustering, fuzzy clustering, c-means algorithm, distance, producers assessment problem

11.
Gagolewski M., Some issues in aggregation of multidimensional data, In: Baczyński M., De Baets B., Mesiar R. (Eds.), Proc. 8th International Summer School on Aggregation Operators (AGOP 2015), University of Silesia, 2015, pp. 127-132. isbn:978-83-8012-519-3

Abstract. Aggregation theory is usually concerned with summarizing a predefined number of points on the real line. In many applications, such as statistics, data analysis, and data mining, the notion of a mean – a nondecreasing, internal, and symmetric fusion function – plays a key role. Nevertheless, when it comes to aggregating a set of points in higher dimensional spaces, the componentwise extension of monotonicity and internality might not be the best choice. Instead, the invariance to certain classes of geometric transformations seems to be crucial in such a case.

Keywords. aggregation, centroid, Tukey median, 1-center, 1-median, convex hull, affine invariance, orthogonalization

12.
Gagolewski M., Lasek J., The use of fuzzy relations in the assessment of information resources producers' performance, In: Filev D. et al. (Eds.), Proc. 7th IEEE International Conference Intelligent Systems IS'2014, Vol. 2: Tools, Architectures, Systems, Applications (Advances in Intelligent Systems and Computing 323), Springer, 2015, pp. 289-300. doi:10.1007/978-3-319-11310-4_25

Abstract. The producers assessment problem has many important practical instances: it is an abstract model for intelligent systems evaluating e.g. the quality of computer software repositories, web resources, social networking services, and digital libraries. Each producer's performance is determined according not only to the overall quality of the items he/she outputted, but also to the number of such items (which may be different for each agent).
Recent theoretical results indicate that the use of aggregation operators in the process of ranking and evaluating producers may not necessarily lead to fair and plausible outcomes. Therefore, to overcome some weaknesses of the most often applied approach, in this preliminary study we encourage the use of a fuzzy preference relation-based setting and indicate why it may provide better control over the assessment process.

Keywords. fuzzy relations, preference modeling, producers assessment problem, StackOverflow, bibliometrics, h-index

13.
Gagolewski M., Sugeno integral-based confidence intervals for the theoretical h-index, In: Grzegorzewski P. et al. (Eds.), Strengthening Links Between Data Analysis and Soft Computing (Advances in Intelligent Systems and Computing 315), Springer, 2015, pp. 233-240. doi:10.1007/978-3-319-10765-3_28

Abstract. Sugeno integral-based confidence intervals for the theoretical h-index of a fixed-length sequence of i.i.d. random variables are derived. They are compared with other estimators of such a distribution characteristic in a Pareto i.i.d. model. It turns out that in the first case we obtain much wider intervals. It seems to be due to the fact that a Sugeno integral, which may be applied on any ordinal scale, is known to ignore too much information from cardinal-scale data being aggregated.

Keywords. h-index, Sugeno integral, confidence interval, Pareto distribution

14.
Bartoszuk M., Gagolewski M., A fuzzy R code similarity detection algorithm, In: Laurent A. et al. (Eds.), Information Processing and Management of Uncertainty in Knowledge-Based Systems, Part III (Communications in Computer and Information Science 444), Springer, 2014, pp. 21-30. doi:10.1007/978-3-319-08852-5_3

Abstract. R is a programming language and software environment for performing statistical computations and applying data analysis that is increasingly gaining popularity among practitioners and scientists. In this paper we present a preliminary version of a system to detect pairs of similar R code blocks among a given set of routines, which is based on a proper aggregation of the output of three different [0,1]-valued (fuzzy) proximity degree estimation algorithms. Its analysis on empirical data indicates that the system may in the future be successfully applied in practice, e.g., to detect plagiarism among students' homework submissions or to analyze code recycling or code cloning in R's open source package repositories.

Keywords. R, plagiarism detection, code cloning, fuzzy similarity measures

15.
Coroianu L., Gagolewski M., Grzegorzewski P., Adabitabar Firozja M., Houlari T., Piecewise linear approximation of fuzzy numbers preserving the support and core, In: Laurent A. et al. (Eds.), Information Processing and Management of Uncertainty in Knowledge-Based Systems, Part II (Communications in Computer and Information Science 443), Springer, 2014, pp. 244-254. doi:10.1007/978-3-319-08855-6_25

Abstract. A reasonable approximation of a fuzzy number should have a simple membership function, be close to the input fuzzy number, and preserve some of its important characteristics. In this paper we suggest approximating a fuzzy number by a piecewise linear 1-knot fuzzy number which is the closest one to the input fuzzy number among all piecewise linear 1-knot fuzzy numbers having the same core and the same support as the input. We discuss the existence of the approximation operator, show algorithms ready for practical use, and illustrate the considered concepts by examples. It turns out that such an approximation task may be problematic.

Keywords. Approximation of fuzzy numbers, core, fuzzy number, piecewise linear approximation, support

16.
Cena A., Gagolewski M., OM3: Ordered maxitive, minitive, and modular aggregation operators – Part I: Axiomatic analysis under arity-dependence, In: Bustince H. et al. (Eds.), Aggregation Functions in Theory and in Practise (Advances in Intelligent Systems and Computing 228), Springer, 2013, pp. 93-103. doi:10.1007/978-3-642-39165-1_13

Abstract. Recently, a very interesting relation between symmetric minitive, maxitive, and modular aggregation operators has been shown. It turns out that the intersection between any pair of the mentioned classes is the same. This result introduces what we here propose to call the OM3 operators. In the first part of our contribution on the analysis of the OM3 operators we study some properties that may be useful when aggregating input vectors of varying lengths. In Part II we will perform a thorough simulation study of the impact of input vectors’ calibration on the aggregation results.

17.
Cena A., Gagolewski M., OM3: Ordered maxitive, minitive, and modular aggregation operators – Part II: A simulation study, In: Bustince H. et al. (Eds.), Aggregation Functions in Theory and in Practise (Advances in Intelligent Systems and Computing 228), Springer, 2013, pp. 105-115. doi:10.1007/978-3-642-39165-1_14

Abstract. This article is the second part of the contribution on the analysis of the recently proposed class of symmetric maxitive, minitive and modular aggregation operators. Recent results (Gagolewski, Mesiar, 2012) indicated some unstable behavior of the generalized h-index, which is a particular instance of OM3, in case of input data transformation. The study was performed on a small, carefully selected real-world data set. Here we conduct some experiments to examine this phenomenon more extensively.

18.
Gagolewski M., Statistical hypothesis test for the difference between Hirsch indices of two Pareto-distributed random samples, In: Kruse R. et al. (Eds.), Synergies of Soft Computing and Statistics for Intelligent Data Analysis (Advances in Intelligent Systems and Computing 190), Springer, 2013, pp. 359-367. doi:10.1007/978-3-642-33042-1_39

Abstract. In this paper we discuss the construction of a new parametric statistical hypothesis test for the equality of probability distributions. The test is based on the difference between Hirsch’s h-indices of two equal-length i.i.d. random samples. For the sake of illustration, we analyze its power in case of Pareto-distributed input data. It turns out that the test is very conservative and has wide acceptance regions, which calls into question the appropriateness of the h-index usage in scientific quality control and decision making.

19.
Gagolewski M., On the relation between effort-dominating and symmetric minitive aggregation operators, In: Greco S. et al. (Eds.), Advances in Computational Intelligence, Part III (Communications in Computer and Information Science 299), Springer, 2012, pp. 276-285. doi:10.1007/978-3-642-31718-7_29

Abstract. In this paper the recently introduced class of effort-dominating impact functions is examined. It turns out that each effort-dominating aggregation operator not only has a very intuitive interpretation, but also is symmetric minitive, and therefore may be expressed as a so-called quasi-I-statistic, which generalizes the well-known OWMin operator.
These aggregation operators may be used e.g. in the Producer Assessment Problem whose most important instance is the scientometric/bibliometric issue of fair scientists’ ranking by means of the number of citations received by their papers.

20.
Gagolewski M., Grzegorzewski P., Axiomatic characterizations of (quasi-) L-statistics and S-statistics and the Producer Assessment Problem, In: Galichet S., Montero J., Mauris G. (Eds.), Proc. EUSFLAT/LFA 2011, Atlantis Press, 2011, pp. 53-58. doi:10.2991/eusflat.2011.112

Abstract. Two classes of aggregation functions: L-statistics and S-statistics and their generalizations called quasi-L-statistics and quasi-S-statistics are considered. Some interesting characterizations of these families of operators are given. The aforementioned functions are useful for various applications. In particular, they are very helpful for modeling the so-called Producer Assessment Problem.

21.
Gagolewski M., Grzegorzewski P., S-Statistics and their basic properties, In: Borgelt C. et al. (Eds.), Combining Soft Computing and Statistical Methods in Data Analysis (Advances in Intelligent and Soft Computing 77), Springer, 2010, pp. 281-288. doi:10.1007/978-3-642-14746-3_35

Abstract. Some statistical properties of the so-called S-statistics, which generalize the ordered weighted maximum aggregation operators, are considered. In particular, the asymptotic normality of S-statistics is proved and some possible applications in estimation problems are suggested.

22.
Gagolewski M., Grzegorzewski P., Arity-monotonic extended aggregation operators, In: Hüllermeier E., Kruse R., Hoffmann F. (Eds.), Information Processing and Management of Uncertainty in Knowledge-Based Systems (Communications in Computer and Information Science 80), Springer, 2010, pp. 693-702. doi:10.1007/978-3-642-14055-6_73

Abstract. A class of extended aggregation operators, called impact functions, is proposed and their basic properties are examined. Some important classes of functions like generalized ordered weighted averaging (OWA) and ordered weighted maximum (OWMax) operators are considered. The general idea is illustrated by the Producer Assessment Problem which includes the scientometric problem of rating scientists based on the number of citations received by their publications. An interesting characterization of the well-known h-index is given.

23.
Gagolewski M., Grzegorzewski P., Possible and necessary h-indices, In: Carvalho J.P. et al. (Eds.), Proc. IFSA/EUSFLAT 2009, 2009, pp. 1691-1695. isbn:978-989-95079-6-8

Abstract. The problem of measuring scientific impact is considered. A class of so-called p-sphere (rp) indices, which generalize the well-known Hirsch index, is used to construct a possibility measure of scientific impact. This measure might be treated as a starting point for prediction of future index values or for dealing with right-censored bibliometric data.

Other Peer-Reviewed Papers

1.
Lasek J., Gagolewski M., Estimation of tournament metrics for association football league formats, In: Selected problems in information technologies (Proc. ITRIA'15 vol. 2), Institute of Computer Science, Polish Academy of Sciences, 2015, pp. 67-78.
2.
Cena A., Gagolewski M., Clustering and aggregation of informetric data sets, In: Computational methods in data analysis (Proc. ITRIA'15 vol. 1), Institute of Computer Science, Polish Academy of Sciences, 2015, pp. 5-26. isbn:978-83-63159-22-1
3.
Gagolewski M., Dębski M., Nowakiewicz M., Efficient algorithm for computing certain graph-based monotone integrals: the lp-indices, In: Mesiar R., Bacigal T. (Eds.), Proc. Uncertainty Modelling, STU Bratislava, 2013, pp. 17-23.

Abstract. The Choquet, Sugeno and Shilkret integrals with respect to monotone measures are useful tools in decision support systems. In this paper we propose a new class of graph-based integrals that generalize these three operations. Then, an efficient linear-time algorithm for computing their special case, that is lp-indices, 1≤p<∞, is presented. The algorithm is based on R.L. Graham's routine for determining the convex hull of a finite planar set.

Keywords. Monotone measures, Choquet, Sugeno and Shilkret integral, lp-index, convex hull, Graham's scan, scientific impact indices

4.
Rowiński T., Gagolewski M., Internet a kryzys (The Internet and Crisis), In: Jankowska M., Starzomska M. (Eds.), Kryzys: Pułapka czy szansa? (Crisis: A Trap or an Opportunity?), WN Akapit, 2011, pp. 211-224. isbn:978-83-609-5885-8
5.
Gagolewski M., Grzegorzewski P., Metody i problemy naukometrii (Methods and Problems of Scientometrics), In: Rowiński T., Tadeusiewicz R. (Eds.), Psychologia i informatyka. Synergia i kontradykcje (Psychology and Computer Science. Synergies and Contradictions), Wyd. UKSW, Warszawa, 2010, pp. 103-125. isbn:978-83-707-2679-9
6.
Gagolewski M., Grzegorzewski P., O pewnym uogólnieniu indeksu Hirscha (On a Certain Generalization of the Hirsch Index), In: Kawalec P., Lipski P. (Eds.), Kadry i infrastruktura nowoczesnej nauki: teoria i praktyka (Human Resources and Infrastructure of Modern Science: Theory and Practice), Proc. 1st Intl. Conf. Zarządzanie Nauką, 2009, Vol. 2, pp. 15-29. isbn:978-83-61671-12-1
7.
Rowiński T., Gagolewski M., Preferencje i postawy wobec pomocy online (Preferences and Attitudes Towards Online Help), Studia Psychologica UKSW 7, 2007, pp. 195-210.