Ordered by Type
This list is also available in BibTeX format.
Abstract. We investigate the application of the Ordered Weighted Averaging (OWA) data fusion operator in agglomerative hierarchical clustering. The examined setting generalises the well-known single, complete and average linkage schemes. It allows expert knowledge to be embodied in the cluster merge process and provides a much wider range of possible linkages. We analyse various families of weighting functions on numerous benchmark data sets in order to assess their influence on the resulting cluster structure. Moreover, we inspect the correction for the inequality of cluster size distribution, similar to the one in the Genie algorithm. Our results demonstrate that by robustifying the procedure with the Genie correction, we can obtain a significant performance boost in terms of clustering quality. This is particularly beneficial in the case of the linkages based on the closest distances between clusters, including the single linkage and its "smoothed" counterparts. To explain this behaviour, we propose a new linkage process called three-stage OWA which yields further improvements. This way we confirm the intuition that hierarchical cluster analysis should rather take into account a few nearest neighbours of each point, instead of trying to adapt to their non-local neighbourhood.
Keywords. hierarchical clustering, OWA, data fusion, aggregation, Genie
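The way an OWA-based linkage generalises the single, complete, and average linkages can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the Euclidean distance is assumed, and the weights are applied to the pairwise distances sorted in nondecreasing order (other ordering conventions exist).

```python
import math
from itertools import product


def owa(values, weights):
    """Ordered weighted average; weights act on values sorted nondecreasingly (assumed convention)."""
    ordered = sorted(values)
    if len(ordered) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * v for w, v in zip(weights, ordered))


def owa_linkage(cluster_a, cluster_b, weight_fn):
    """Dissimilarity between two clusters: OWA of all pairwise Euclidean distances."""
    dists = [math.dist(p, q) for p, q in product(cluster_a, cluster_b)]
    return owa(dists, weight_fn(len(dists)))


# Hypothetical weight generators recovering three classical linkages.
def single_weights(n):
    return [1.0] + [0.0] * (n - 1)   # all weight on the smallest distance


def complete_weights(n):
    return [0.0] * (n - 1) + [1.0]   # all weight on the largest distance


def average_weights(n):
    return [1.0 / n] * n             # equal weights


a = [(0.0, 0.0), (0.0, 1.0)]
b = [(3.0, 0.0), (3.0, 1.0)]
print(owa_linkage(a, b, single_weights))
print(owa_linkage(a, b, complete_weights))
print(owa_linkage(a, b, average_weights))
```

Any other weighting vector summing to one gives an intermediate linkage, for example one that spreads the weight over a few of the smallest distances.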
Defined solely by means of order-theoretic operations meet (min) and join (max), weighted lattice polynomial functions are particularly useful for modeling data on an ordinal scale. A special case, the discrete Sugeno integral, defined with respect to a nonadditive measure (a capacity), enables accounting for the interdependencies between input variables.
However, until recently, the problem of identifying the fuzzy measure values with respect to various objectives and requirements has not received a great deal of attention. By expressing the learning problem as the difference of convex functions, we are able to apply DC (difference of convex) optimization methods. Here we formulate one of the global optimization steps as a local linear programming problem and investigate the improvement under different conditions.
Keywords. Aggregation functions, nonadditive measures, Sugeno integral, capacities, DC optimization
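For readers unfamiliar with the discrete Sugeno integral, the following minimal Python sketch evaluates it using only min and max. It assumes inputs on the unit interval and a capacity stored as a dictionary keyed by frozensets of indices; this representation and the function name are illustrative choices, not part of the paper.

```python
def sugeno_integral(x, capacity):
    """Discrete Sugeno integral of x with respect to a capacity.

    capacity maps frozensets of input indices to values in [0, 1]; it is
    assumed to be monotone, with the full index set mapped to 1.
    """
    order = sorted(range(len(x)), key=lambda i: x[i])  # indices by increasing x
    result = 0.0
    for k, i in enumerate(order):
        upper_set = frozenset(order[k:])  # indices whose value is >= x[i]
        result = max(result, min(x[i], capacity[upper_set]))
    return result


# A toy capacity on two inputs (values are made up for illustration).
mu = {
    frozenset({0}): 0.3,
    frozenset({1}): 0.6,
    frozenset({0, 1}): 1.0,
}
print(sugeno_integral((0.8, 0.4), mu))
print(sugeno_integral((0.2, 0.9), mu))
```

Because the capacity values for singletons need not add up to the value of their union, the model can express interaction between the inputs.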
The constrained ordered weighted averaging (OWA) aggregation problem
arises when we aim to maximize or minimize a convex combination of order
statistics under linear inequality constraints that act on the variables with
respect to their original sources. The standalone approach to optimizing
the OWA under constraints is to consider all permutations of the inputs,
which quickly becomes infeasible when there are more than a few variables.
However, in certain cases we can take advantage of the relationships amongst
the constraints and the corresponding solution structures. For example, we
can consider a land-use allocation satisfaction problem with an auxiliary aim
of balancing land-types, whereby the response curves for each species are
non-decreasing with respect to the land-types. This results in comonotone
constraints, which allow us to drastically reduce the complexity of the problem.
In this paper, we show that if we have an arbitrary number of constraints that are comonotone (i.e., they share the same ordering permutation of the coefficients), then the optimal solution occurs for decreasing components of the solution. After investigating the form of the solution in some special cases and providing theoretical results that shed light on the form of the solution, we detail practical approaches to solving and give real-world examples.
Keywords. Multiple criteria evaluation; Ordered weighted averaging; Constrained OWA aggregation; Ecology; Work allocation
Abstract. In the field of information fusion, the problem of data aggregation has been formalized as an order-preserving process that builds upon the property of monotonicity. However, fields such as computational statistics, data analysis and geometry usually emphasize the role of equivariances to various geometrical transformations in aggregation processes. Admittedly, if we consider a unidimensional data fusion task, both requirements are often compatible with each other. Nevertheless, in this paper we show that, in the multidimensional setting, the only idempotent functions that are monotone and orthogonal equivariant are the over-simplistic weighted centroids. Moreover, this result still holds after replacing monotonicity and orthogonal equivariance by the weaker property of orthomonotonicity. This implies that the aforementioned approaches to the aggregation of multidimensional data are irreconcilable, and that, if a weighted centroid is to be avoided, we must choose between monotonicity and a desirable behaviour with regard to orthogonal transformations.
Keywords. multidimensional data aggregation, monotonicity, orthogonal equivariance, centroid
On the grounds of the revealed, mutual resemblance between the behaviour
of users of Stack Exchange and the dynamics of the citations accumulation
process in the scientific community, we tackled an outwardly
intractable problem of assessing the impact of introducing "negative" citations.
Although the most frequent reason to cite a paper is to highlight the connection between the two publications, researchers sometimes mention an earlier work to cast a negative light. While computing citation-based scores, for instance the h-index, information about the reason why a paper was mentioned is neglected. Therefore it can be questioned whether these indices describe scientific achievements accurately.
In this contribution we shed light on the problem of "negative" citations by analysing data from Stack Exchange and, to draw more universal conclusions, we derive an approximation of citation scores. Here we show that the quantified influence of introducing negative citations is of lesser importance and that they could be used as an indicator of where the attention of the scientific community is allocated.
Keywords. citation analysis, the Hirsch index, negative citations, research evaluation, science of science
Abstract. The Sugeno integral is an expressive aggregation function with potential applications across a range of decision contexts. Its calculation requires only the lattice minimum and maximum operations, making it particularly suited to ordinal data and robust to scale transformations. However, for practical use in data analysis and prediction, we require efficient methods for learning the associated fuzzy measure. While such methods are well developed for the Choquet integral, the fitting problem is more difficult for the Sugeno integral because it is not amenable to being expressed as a linear combination of weights, and more generally due to plateaus and non-differentiability in the objective function. Previous research has hence focused on heuristic approaches or simplified fuzzy measures. Here we show that the problem of fitting the Sugeno integral to data such that the maximum absolute error is minimized can be solved using an efficient bilevel program. This method can be incorporated into algorithms that learn fuzzy measures with the aim of minimizing the median residual. This equips us with tools that make the Sugeno integral a feasible option in robust data regression and analysis. We provide experimental comparison with a genetic algorithms approach and an example in data analysis.
Keywords. Sugeno integral, fuzzy measure, parameter learning, aggregation functions
Abstract. The problem of learning symmetric capacities (or fuzzy measures) from data is investigated toward applications in data analysis and prediction as well as decision making. Theoretical results regarding the solution minimizing the mean absolute error are exploited to develop an exact branch-refine-and-bound-type algorithm for fitting Sugeno integrals (weighted lattice polynomial functions, max-min operators) with respect to symmetric capacities. The proposed method turns out to be particularly suitable for acting on ordinal data. In addition to providing a model that can be used for the general data regression task, the results can be used, among others, to calibrate generalized h-indices to bibliometric data.
Keywords. weight learning, ordinal data fitting, fuzzy measures, Sugeno integral, lattice polynomials, h-index
Abstract. The property of monotonicity, which requires a function to preserve a given order, has been considered the standard in the aggregation of real numbers for decades. In this paper, we argue that, for the case of multidimensional data, an order-based definition of monotonicity is far too restrictive. We propose several meaningful alternatives to this property not involving the preservation of a given order by returning to its early origins stemming from the field of calculus. Numerous aggregation methods for multidimensional data commonly used by practitioners are studied within our new framework.
Keywords. monotonicity, aggregation, multidimensional data, centroid, spatial median
Abstract. The Sugeno integral is a function particularly suited to the aggregation of ordinal inputs. Defined with respect to a fuzzy measure, its ability to account for complementary and redundant relationships between variables brings much potential to the field of biomedicine, where it is common for measurements and patient information to be expressed qualitatively. However, practical applications require well-developed methods for identifying the Sugeno integral's parameters, and this task is not easily expressed using the standard optimisation approaches. Here we formulate the objective function as the difference of two convex functions, which enables the use of specialised numerical methods. Such techniques are compared with other global optimisation frameworks through a number of numerical experiments.
Keywords. aggregation functions, fuzzy measures, Sugeno integral, capacities
Abstract. The problem of the piecewise linear approximation of fuzzy numbers giving outputs nearest to the inputs with respect to the Euclidean metric is discussed. The results given in Coroianu et al. (Fuzzy Sets Syst 233:26–51, 2013) for the 1-knot fuzzy numbers are generalized for arbitrary n-knot (n≥2) piecewise linear fuzzy numbers. Some results on the existence and properties of the approximation operator are proved. Then, the stability of some fuzzy number characteristics under approximation as the number of knots tends to infinity is considered. Finally, a simulation study concerning the computer implementations of arithmetic operations on fuzzy numbers is provided. Suggested concepts are illustrated by examples and algorithms ready for practical use. This way, we build a bridge between theory and applications, as the latter are so desired in real-world problems.
Keywords. Approximation of fuzzy numbers, Calculations on fuzzy numbers, Characteristics of fuzzy numbers, Fuzzy number, Piecewise linear approximation
Abstract. The efficacy of different league formats in ranking teams according to their true latent strength is analysed. To this end, a new approach for estimating attacking and defensive strengths based on the Poisson regression for modelling match outcomes is proposed. Various performance metrics are estimated reflecting the agreement between latent teams' strength parameters and their final rank in the league table. The tournament designs studied here are used in the majority of European top-tier association football competitions. Numerical experiments indicate that a two-stage league format consisting of the three round-robin tournament together with an extra single round-robin is the most efficacious setting. In particular, it is the most accurate in selecting the best team as the winner of the league. Its efficacy can be enhanced by setting the number of points allocated for a win to two (instead of the three currently in effect in association football).
Keywords. association football, league formats, rankings, rating systems, simulation, tournament design
Abstract. As cities increase in size, governments and councils face the problem of designing infrastructure and approaches to traffic management that alleviate congestion. The problem of objectively measuring congestion involves taking into account not only the volume of traffic moving throughout a network, but also the inequality or spread of this traffic over major and minor intersections. For modelling such data, we investigate the use of weighted congestion indices based on various aggregation and spread functions. We formulate the weight learning problem for comparison data and use real traffic data obtained from a medium-sized Australian city to evaluate their usefulness.
Keywords. aggregation functions, inequality indices, spread measures, learning weights, traffic analysis
Abstract. Research in aggregation theory is nowadays still mostly focused on algorithms to summarize tuples consisting of observations in some real interval or of diverse general ordered structures. Of course, in the practice of information processing many other data types between these two extreme cases are worth inspecting. This contribution deals with the aggregation of lists of data points in R^d for arbitrary d≥1. Even though particular functions aiming to summarize multidimensional data have been discussed by researchers in data analysis, computational statistics and geometry, there is clearly a need to provide a comprehensive and unified model in which their properties like equivariances to geometric transformations, internality, and monotonicity may be studied at an appropriate level of generality. The proposed penalty-based approach serves as a common framework for all idempotent information aggregation methods, including componentwise functions, pairwise distance minimizers, and data depth-based medians. It also allows for deriving many new practically useful tools.
Keywords. multidimensional data aggregation, penalty functions, data depth, centroid, median
Abstract. Economic inequality measures are employed as a key component in various socio-demographic indices to capture the disparity between the wealthy and poor. Since their inception, they have also been used as a basis for modelling spread and disparity in other contexts. While recent research has identified that a number of classical inequality and welfare functions can be considered in the framework of OWA operators, here we propose a framework of penalty-based aggregation functions and their associated penalties as measures of inequality.
Keywords. penalty functions, aggregation functions, inequality indices, spread measures
The time needed to apply a hierarchical clustering algorithm is most often
dominated by the number of computations of a pairwise dissimilarity measure.
Such a constraint, for larger data sets, puts at a disadvantage the use of
all the classical linkage criteria but the single linkage one. However, it
is known that the single linkage clustering algorithm is very sensitive to
outliers, produces highly skewed dendrograms, and therefore usually does not
reflect the true underlying data structure – unless the clusters are well-separated.
To overcome its limitations, we propose a new hierarchical clustering linkage
criterion called Genie. Namely, our algorithm links two clusters in such a
way that a chosen economic inequity measure (e.g., the Gini- or Bonferroni-index)
of the cluster sizes does not increase drastically above a given threshold.
The presented benchmarks indicate a high practical usefulness of the introduced
method: it most often outperforms the Ward or average linkage in terms of the
clustering quality while retaining the single linkage speed. The Genie
algorithm is easily parallelizable and thus may be run on multiple threads
to further speed up its execution. Its memory overhead is small: there
is no need to precompute the complete distance matrix to perform the
computations in order to obtain a desired clustering. It can be applied
on arbitrary spaces equipped with a dissimilarity measure, e.g., on real
vectors, DNA or protein sequences, images, rankings, informetric data,
etc. A reference implementation of the algorithm has been included
in the open source genie package for R.
Keywords. hierarchical clustering, single linkage, inequity measures, Gini-index
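The inequity check at the heart of this criterion can be illustrated with a short Python sketch. It computes one common normalised form of the Gini index for a list of cluster sizes and compares it with a threshold. The normalisation, the function names, and the default threshold value are assumptions made for illustration; the sketch does not reproduce the genie package.

```python
def gini_index(sizes):
    """Normalised Gini index of cluster sizes: 0 for equal sizes, 1 for maximal inequity."""
    k = len(sizes)
    total = sum(sizes)
    if k < 2 or total == 0:
        return 0.0
    pairwise = sum(
        abs(sizes[i] - sizes[j]) for i in range(k) for j in range(i + 1, k)
    )
    return pairwise / ((k - 1) * total)


def exceeds_threshold(sizes, threshold=0.3):
    """True if the cluster size distribution is more unequal than allowed (hypothetical default)."""
    return gini_index(sizes) > threshold


print(gini_index([5, 5, 5]))    # perfectly balanced clusters
print(gini_index([10, 1, 1]))   # one dominant cluster
print(exceeds_threshold([10, 1, 1]))
```

A linkage procedure could consult such a check before each merge and restrict the candidate merges whenever the threshold is exceeded.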
Abstract. The famous Hirsch index was introduced just ca. 10 years ago. Despite that, it is already widely used in many decision making tasks, like in evaluation of individual scientists, research grant allocation, or even production planning. It is known that the h-index is related to the discrete Sugeno integral and the Ky Fan metric introduced in the 1940s. The aim of this paper is to propose a few modifications of this index as well as other fuzzy integrals – also on bounded chains – that lead to better discrimination of some types of data that are to be aggregated. All of the suggested compensation methods try to retain the simplicity of the original measure.
Keywords. h-index, Sugeno integral, Ky Fan metric, Shilkret integral, decomposition integrals
Abstract. In this paper, we study the efficacy of the official ranking for international football teams compiled by FIFA, the body governing football competition around the globe. We present strategies for improving a team's position in the ranking. By combining several statistical techniques we derive an objective function in a decision problem of optimal scheduling of future matches. The presented results display how a team's position can be improved. Along the way, we compare the official procedure to the famous Elo rating system. Although it originates from chess, it has been successfully tailored to ranking football teams as well.
Keywords. association football, FIFA ranking, prediction models, Monte Carlo simulations, optimal schedule, team rankings
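For reference, the basic Elo update mentioned above can be written as follows. This is the textbook chess-style formula only; the K-factor of 20 is a placeholder, and football-specific adjustments (for instance for goal difference or home advantage) are deliberately left out, so the sketch should not be read as the variant used in the paper.

```python
def elo_expected(rating_a, rating_b):
    """Expected score of team A against team B under the standard Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a, rating_b, score_a, k=20.0):
    """Return updated ratings; score_a is 1 for a win of A, 0.5 for a draw, 0 for a loss."""
    expected_a = elo_expected(rating_a, rating_b)
    delta = k * (score_a - expected_a)
    # The update is zero-sum: what A gains, B loses.
    return rating_a + delta, rating_b - delta


print(elo_update(1500.0, 1500.0, 1.0))
print(elo_expected(1600.0, 1400.0))
```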
Abstract. Hirsch’s h-index is perhaps the most popular citation-based measure of scientific excellence. In 2013, Ionescu and Chopard proposed an agent-based model describing a process for generating publications and citations in an abstract scientific community [G. Ionescu, B. Chopard, Eur. Phys. J. B 86, 426 (2013)]. Within such a framework, one may simulate a scientist’s activity, and – by extension – investigate the whole community of researchers. Even though the Ionescu and Chopard model predicts the h-index quite well, the authors provided a solution based solely on simulations. In this paper, we complete their results with exact, analytic formulas. What is more, by considering a simplified version of the Ionescu-Chopard model, we obtained a compact, easy to compute formula for the h-index. The derived approximate and exact solutions are investigated on simulated and real-world data sets.
Keywords. Statistical and nonlinear physics, preferential attachment rule, h-index
Abstract. Classically, unsupervised machine learning techniques are applied to data sets with a fixed number of attributes (variables). However, many problems encountered in the field of informetrics require extending such methods so that they can be computed over a set of nonincreasingly ordered vectors of unequal lengths. Thus, in this paper, some new dissimilarity measures (metrics) are introduced and studied. Owing to that, we may use, among others, hierarchical clustering algorithms to determine a partition of an input data set consisting of sets of producers that are homogeneous not only with respect to the quality of information resources, but also their quantity.
Keywords. aggregation, hierarchical clustering, distance, metric
Abstract. The theory of aggregation most often deals with measures of central tendency. However, sometimes a very different kind of a numeric vector's synthesis into a single number is required. In this paper we introduce a class of mathematical functions which aim to measure spread or scatter of one-dimensional quantitative data. The proposed definition serves as a common, abstract framework for measures of absolute spread known from statistics, exploratory data analysis and data mining, e.g. the sample variance, standard deviation, range, interquartile range (IQR), median absolute deviation (MAD), etc. Additionally, we develop new measures of experts' opinions diversity or consensus in group decision making problems. We investigate some properties of spread measures, show how they are related to aggregation functions, and indicate their new potentially fruitful application areas.
Keywords. Group decisions and negotiations, aggregation, spread, deviation, variance
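The classical spread measures listed in the abstract can be computed with the Python standard library, as in the sketch below. The helper names are hypothetical, and the IQR relies on the default quantile method of `statistics.quantiles`, which is only one of several conventions. The example also shows two properties one would naturally expect from a measure of absolute spread: it vanishes on a constant vector and does not change when all observations are shifted by the same amount.

```python
import statistics


def value_range(x):
    return max(x) - min(x)


def iqr(x):
    """Interquartile range using the default method of statistics.quantiles."""
    q1, _, q3 = statistics.quantiles(x, n=4)
    return q3 - q1


def mad(x):
    """Median absolute deviation from the median (no consistency constant applied)."""
    m = statistics.median(x)
    return statistics.median([abs(v - m) for v in x])


data = [1, 2, 3, 4, 100]
shifted = [v + 10 for v in data]
for name, fn in [("variance", statistics.variance), ("range", value_range),
                 ("IQR", iqr), ("MAD", mad)]:
    print(name, fn(data), fn(shifted), fn([5, 5, 5, 5]))
```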
Abstract. The recently-introduced OM3 aggregation operators fulfill three
appealing properties: they are simultaneously minitive, maxitive, and modular.
Among the instances of OM3 operators we find e.g. OWMax and OWMin operators,
the famous Hirsch's h-index and all its natural generalizations.
In this paper the basic axiomatic and probabilistic properties of extended, i.e. in an arity-dependent setting, OM3 aggregation operators are studied. We illustrate the difficulties one is inevitably faced with when trying to combine the quality and quantity of numeric items into a single number. The discussion on such aggregation methods is particularly important in the information resources producers assessment problem, which aims to reduce the negative effects of information overload. It turns out that the Hirsch-like indices of impact do not fulfill a set of very important properties, which puts the sensibility of their practical usage into question. Moreover, thanks to the probabilistic analysis of the operators in an i.i.d. model, we may better understand the relationship between the aggregated items' quality and their producers' productivity.
Keywords. Aggregation; ordered modularity, maxitivity and minitivity; arity-monotonicity; impact assessment; Hirsch's h-index; informetrics
Abstract. The Choquet, Sugeno, and Shilkret integrals with respect to monotone measures, as well as their generalization – the universal integral, constitute a useful tool in decision support systems. In this paper we propose a general construction method for aggregation operators that may be used in assessing output of scientists. We show that the most often currently used indices of bibliometric impact, like Hirsch's h, Woeginger's w, Egghe's g, Kosmulski's MAXPROD, and similar constructions, may be obtained by means of our framework. Moreover, the model easily leads to some new, very interesting functions.
Keywords. Choquet, Sugeno, Shilkret, universal integral; monotone measures; aggregation; indices of scientific impact, bibliometrics; h-index, w-index, g-index, MAXPROD-index
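The four indices named in the abstract can be computed directly from their usual textbook definitions, as sketched below. The sketch does not use the integral-based construction proposed in the paper; the g-index is capped at the number of papers, which is one of two common conventions.

```python
def h_index(citations):
    """Largest h such that at least h papers have at least h citations each."""
    x = sorted(citations, reverse=True)
    return sum(1 for i, c in enumerate(x, start=1) if c >= i)


def g_index(citations):
    """Largest g (at most the number of papers) with the top g papers totalling >= g^2 citations."""
    x = sorted(citations, reverse=True)
    g, total = 0, 0
    for i, c in enumerate(x, start=1):
        total += c
        if total >= i * i:
            g = i
    return g


def w_index(citations):
    """Largest w such that the i-th best paper has at least w - i + 1 citations, i = 1..w."""
    x = sorted(citations, reverse=True)
    w = 0
    for k in range(1, len(x) + 1):
        # 0-based position i corresponds to the (i+1)-th paper: need x[i] >= k - i.
        if all(x[i] >= k - i for i in range(k)):
            w = k
    return w


def maxprod_index(citations):
    """Maximum of i times the citation count of the i-th best paper."""
    x = sorted(citations, reverse=True)
    return max((i * c for i, c in enumerate(x, start=1)), default=0)


record = [10, 8, 5, 4, 3]
print(h_index(record), g_index(record), w_index(record), maxprod_index(record))
```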
Abstract. The problem of the nearest approximation of fuzzy numbers by piecewise linear 1-knot fuzzy numbers is discussed. By using 1-knot fuzzy numbers one may obtain approximations which are simple yet flexible enough to reconstruct the input fuzzy concepts under study. They might be also perceived as a generalization of the trapezoidal approximations. Moreover, these approximations possess some desirable properties. Apart from theoretical considerations, approximation algorithms that can be applied in practice are also given.
Keywords. Approximation of fuzzy numbers; Fuzzy number; Piecewise linear approximation
Abstract. In this paper we deal with the problem of
aggregating numeric sequences of arbitrary length that represent
e.g. citation records of scientists. Impact functions are the aggregation operators that express as a
single number not only the quality of individual publications, but also their author's productivity.
We examine some fundamental properties of these aggregation tools. It turns out that each impact function which always gives indisputable valuations must necessarily be trivial. Moreover, it is shown that for any set of citation records in which none is dominated by the other, we may construct an impact function that gives any a priori-established authors' ordering. Theoretically then, there is considerable room for manipulation in the hands of decision makers.
We also discuss the differences between the impact function-based and the multicriteria decision making-based approach to scientific quality management, and study how the introduction of new properties of impact functions affects the assessment process. We argue that simple mathematical tools like the h- or g-index (as well as other bibliometric impact indices) may not necessarily be a good choice when it comes to assessing scientific achievements.
Keywords. Impact functions; aggregation; decision making; reference modeling; Hirsch's h-index; scientometrics; bibliometrics
Abstract. In this paper the relationship between symmetric minitive,
maxitive, and modular aggregation operators is considered. It is shown
that the intersection between any two of the three discussed classes
is the same. Moreover, the intersection is explicitly characterized.
It turns out that the intersection contains families of aggregation operators such as OWMax, OWMin, and many generalizations of the widely-known Hirsch’s h-index, often applied in scientific quality control.
Keywords. Aggregation operators; OWMax; OMA; OWA; Hirsch’s h-index; Scientometrics
Comments. Later we proposed that the symmetric minitive, maxitive, and modular aggregation operators may be called the OM3 agops, see (Cena A., Gagolewski M., OM3: ordered maxitive, minitive, and modular aggregation operators – Part I: Axiomatic analysis under arity-dependence, 2013).
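To make the connection with the h-index concrete, the sketch below implements OWMax and OWMin under one common convention (inputs sorted nonincreasingly, weights paired by position) and recovers the h-index as the OWMax operator with weights 1, 2, ..., n. The convention and function names are assumptions for illustration; the paper's own definitions may order inputs or weights differently.

```python
def owmax(x, weights):
    """Ordered weighted maximum: max over i of min(i-th largest input, i-th weight)."""
    ordered = sorted(x, reverse=True)
    return max(min(v, w) for v, w in zip(ordered, weights))


def owmin(x, weights):
    """Ordered weighted minimum: min over i of max(i-th largest input, i-th weight)."""
    ordered = sorted(x, reverse=True)
    return min(max(v, w) for v, w in zip(ordered, weights))


def h_index_via_owmax(citations):
    """h-index as OWMax with weights 1, 2, ..., n."""
    n = len(citations)
    return owmax(citations, list(range(1, n + 1))) if n else 0


record = [3, 10, 4, 8, 5]
print(h_index_via_owmax(record))
print(owmin(record, [0] * len(record)))  # zero weights reduce OWMin to the minimum
```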
Abstract. The process of assessing individual authors should rely upon a proper aggregation of reliable and valid papers’ quality metrics. Citations are merely one possible way to measure appreciation of publications. In this study we propose some new, SJR- and SNIP-based indicators, which not only take into account the broadly conceived popularity of a paper (manifested by the number of citations), but also other factors like its potential, or the quality of papers that cite a given publication. We explore the relation and correlation between different metrics and study how they affect the values of a real-valued generalized h-index calculated for 11 prominent scientometricians. We note that the h-index is a very unstable impact function, highly sensitive to the scaling of input elements. Our analysis is not only of theoretical significance: data scaling is often performed to normalize citations across disciplines. Uncontrolled application of this operation may lead to unfair and biased (toward some groups) decisions. This puts the validity of authors assessment and ranking using the h-index into question. Obviously, a good impact function to be used in practice should not be as sensitive to changes in input data as the analyzed one.
Keywords. Aggregation operators; Impact functions; Hirsch's h-index; Quality control; Scientometrics; Bibliometrics; SJR; SNIP; Scopus; CITAN; R
Comments. An empirical paper. The ideas presented here were later explored more thoroughly in (Cena A., Gagolewski M., OM3: ordered maxitive, minitive, and modular aggregation operators – Part II: A simulation study, 2013).
Abstract. In this paper CITAN, the CITation ANalysis package for
the R statistical computing environment, is introduced. The main aim of the
software is to support bibliometricians with a tool for preprocessing
and cleaning bibliographic data retrieved from SciVerse Scopus and
for calculating the most popular indices of scientific impact.
To show the practical usability of the package, an exemplary assessment of authors publishing in the fields of scientometrics and webometrics is performed.
Keywords. Data analysis software; Quality control in science; Citation analysis; Bibliometrics; Hirsch's h index; Egghe's g index; SciVerse Scopus
Abstract. A class of arity-monotonic aggregation operators,
called impact functions, is proposed. This family of operators forms
a theoretical framework for the so-called Producer Assessment Problem,
which includes the scientometric task of fair and objective assessment
of scientists using the number of citations received by their publications.
The impact function output values are analyzed under right-censored and dynamically changing input data. The qualitative possibilistic approach is used to describe this kind of uncertainty. It leads to intuitive graphical interpretations and may be easily applied for practical purposes.
The discourse is illustrated by a family of aggregation operators generalizing the well-known Ordered Weighted Maximum (OWMax) and the Hirsch h-index.
Keywords. Aggregation operators; Possibility theory; S-statistics; h-index; OWMax
Comments. In this paper the class of effort-dominating impact functions has also been introduced. I have shown later (see Gagolewski M., On the Relation Between Effort-Dominating and Symmetric Minitive Aggregation Operators, 2012) that all such aggregation operators are symmetric minitive.
Abstract. Two broad classes of scientific impact indices are proposed and their properties – both theoretical and practical – are discussed. These new classes were obtained as a geometric generalization of the well-known tools applied in scientometrics, like Hirsch’s h-index, Woeginger’s w-index and Kosmulski’s MAXPROD. It is shown how to apply the suggested indices for estimation of the shape of the citation function or the total number of citations of an individual. Additionally, a new efficient and simple O(log n) algorithm for computing the h-index is given.
Keywords. Hirsch's h-index, citation analysis, scientific impact indices
Comments. I have shown later (see Gagolewski M., On the Relation Between Effort-Dominating and Symmetric Minitive Aggregation Operators, 2012) that the rp-indices are symmetric minitive. Moreover, we have found that there exists an O(n log n) algorithm for determining lp (see Gagolewski M., Dębski M., Nowakiewicz M., Efficient Algorithm for Computing Certain Graph-Based Monotone Integrals: the lp-Indices, 2013).
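A logarithmic-time computation of the h-index is possible when the citation counts are already sorted nonincreasingly, because the condition "the i-th paper has at least i citations" holds for a prefix of positions and fails afterwards. The binary search below is a generic sketch of this idea under that sortedness assumption; it is not claimed to be the algorithm given in the paper.

```python
def h_index_sorted(x):
    """h-index of a nonincreasingly sorted sequence x, found by binary search."""
    lo, hi = 0, len(x)  # invariant: the answer lies in [lo, hi]
    while lo < hi:
        mid = (lo + hi + 1) // 2      # candidate value of h, rounded up
        if x[mid - 1] >= mid:         # the mid-th best paper has >= mid citations
            lo = mid
        else:
            hi = mid - 1
    return lo


print(h_index_sorted([10, 8, 5, 4, 3]))
print(h_index_sorted([5, 5, 5, 5, 5]))
print(h_index_sorted([]))
```

Each iteration halves the search interval, so only about log2(n) elements are inspected.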
Data Processing and Analysis in Python, Wydawnictwo Naukowe PWN, 2016, 369 pp. isbn:978-83-01-18940-2
R Programming. Data Analysis. Computing. Simulations, Wydawnictwo Naukowe PWN; 1st ed. – 2014, 509 pp.; 2nd ed. – 2016, 550 pp. isbn:978-83-01-18939-6
Statistical Inference in R, Politechnika Warszawska, 2014, 183 pp. isbn:978-83-93-72601-1
Abstract. The problem of penalty-based data aggregation in generic real normed vector spaces is studied. Some existence and uniqueness results are indicated. Moreover, various properties of the aggregation functions are considered.
Keywords. penalty-based aggregation, prototype learning, means, averages, and medians, vector spaces, Fermat-Weber problem
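As a concrete instance of penalty-based aggregation, the Fermat-Weber point (geometric median) minimises the sum of Euclidean distances to the data points. The sketch below uses the classical Weiszfeld iteration, which is a standard method and not necessarily the one analysed in the paper. The function names, the iteration limit, and the tolerance are illustrative, and the handling of an iterate that coincides with a data point is a simple heuristic (such points are skipped) without the safeguards a rigorous implementation would need.

```python
import math


def total_distance(points, y):
    """Penalty to be minimised: sum of Euclidean distances from y to all points."""
    return sum(math.dist(p, y) for p in points)


def geometric_median(points, iterations=500, eps=1e-12):
    """Approximate Fermat-Weber point via Weiszfeld's iteration, started at the centroid."""
    dim = len(points[0])
    y = [sum(p[k] for p in points) / len(points) for k in range(dim)]
    for _ in range(iterations):
        numerator = [0.0] * dim
        denominator = 0.0
        for p in points:
            d = math.dist(p, y)
            if d < eps:
                continue  # heuristic: skip a point coinciding with the iterate
            for k in range(dim):
                numerator[k] += p[k] / d
            denominator += 1.0 / d
        if denominator == 0.0:
            return y  # all points coincide with the current iterate
        y_new = [v / denominator for v in numerator]
        if math.dist(y_new, y) < eps:
            return y_new
        y = y_new
    return y


pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
centroid = [sum(p[k] for p in pts) / len(pts) for k in range(2)]
median = geometric_median(pts)
print(total_distance(pts, centroid), total_distance(pts, median))
```

Replacing the sum of distances by the sum of squared distances would give the centroid instead, which illustrates how the choice of penalty determines the aggregation function.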
Abstract. We look at different approaches to learning the weights of the weighted arithmetic mean such that the median residual or sum of the smallest half of squared residuals is minimized. The more general problem of multivariate regression has been well studied in statistical literature. However, in the case of aggregation functions, the weights are restricted and the domain is usually bounded, so that ‘outliers’ may not be arbitrarily large. A number of algorithms are compared in terms of accuracy and speed. Our results can be extended to other aggregation functions.
Keywords. aggregation, LMS fitting, LTS fitting, approximation
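The two robust objectives mentioned in the abstract can be evaluated for a given weighting vector as sketched below. Only the objective functions are shown, not an optimisation algorithm. The function names are hypothetical, the median is taken over absolute residuals, and "the smallest half" is interpreted as the ceil(n/2) smallest squared residuals, which is one of several conventions.

```python
import statistics


def weighted_mean(x, w):
    """Weighted arithmetic mean; w is assumed nonnegative and summing to one."""
    return sum(wi * xi for wi, xi in zip(w, x))


def residuals(inputs, targets, w):
    return [weighted_mean(x, w) - y for x, y in zip(inputs, targets)]


def lms_objective(inputs, targets, w):
    """Least median of squares type criterion: median absolute residual."""
    return statistics.median([abs(r) for r in residuals(inputs, targets, w)])


def lts_objective(inputs, targets, w):
    """Least trimmed squares criterion: sum of the ceil(n/2) smallest squared residuals."""
    squared = sorted(r * r for r in residuals(inputs, targets, w))
    half = (len(squared) + 1) // 2
    return sum(squared[:half])


# Three observations fit the equal-weights model exactly; the fourth is an outlier.
X = [(0.0, 1.0), (1.0, 0.0), (0.5, 0.5), (1.0, 1.0)]
y = [0.5, 0.5, 0.5, 0.0]
w = (0.5, 0.5)
print(lms_objective(X, y, w), lts_objective(X, y, w))
print(sum(r * r for r in residuals(X, y, w)))  # ordinary sum of squares, for comparison
```

Both robust criteria ignore the outlying observation, whereas the ordinary sum of squared residuals is dominated by it.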
Abstract. The Sugeno integral has numerous successful applications, including but not limited to the areas of decision making, preference modeling, and bibliometrics. Despite this, the current state of the development of usable algorithms for numerically fitting the underlying discrete fuzzy measure based on a sample of prototypical values – even in the simplest possible case, i.e., assuming the symmetry of the capacity – is yet to reach a satisfactory level. Thus, the aim of this paper is to present some results and observations concerning this class of data approximation problems.
Keywords. Sugeno integral, aggregation functions, machine learning, regression, approximation
Abstract. Supervised learning is of key interest in data science. Even though there exist many approaches to solving, among others, classification as well as ordinal and standard regression tasks, most of them output models that do not possess useful formal properties, like nondecreasingness in each independent variable, idempotence, symmetry, etc. This makes them difficult to interpret and analyze. For instance, it might be impossible to determine the importance of individual features or to assess the effects of increasing the values of predictors on the behavior of a chosen response variable. Such properties are especially important in software plagiarism detection, where we need to combine the degree to which a code chunk A is similar to (or contained in) B with the degree to which B is similar to A. Therefore, in this paper we consider a new method for fitting B-spline tensor product-based aggregation functions to empirical data. An empirical study indicates a highly competitive performance of the resulting models. Additionally, they possess an intuitive interpretation, which is highly desirable for end-users.
Abstract. In this paper we thoroughly investigate various OWA-based linkages in hierarchical clustering on numerous benchmark data sets. The inspected setting generalizes the well-known single, complete, and average linkage schemes, among others. The incorporation of weights into the cluster merge procedure creates an opportunity to make use of experts' knowledge about a particular data domain so as to generate partitions of a given data set that better reflect the true underlying cluster structure. Moreover, we introduce a correction for the inequality of cluster size distribution — similar to the one proposed in our recently introduced Genie algorithm — which results in a significant performance boost in terms of clustering quality.
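The way an OWA-based linkage generalizes the single, complete, and average linkages can be illustrated with a minimal Python sketch. It assumes Euclidean distances and the convention that the OWA operator sorts its inputs in non-increasing order; the function names (owa, owa_linkage, single_weights, etc.) are hypothetical and are not taken from the paper.

```python
import numpy as np

def owa(values, weights):
    """Ordered weighted average: sort values non-increasingly, then take a weighted sum."""
    v = np.sort(np.asarray(values, dtype=float))[::-1]
    w = np.asarray(weights, dtype=float)
    if v.shape != w.shape:
        raise ValueError("values and weights must have the same length")
    return float(np.dot(v, w))

def owa_linkage(A, B, weight_fn):
    """OWA of all pairwise Euclidean distances between the points of clusters A and B."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    # Pairwise distances via broadcasting, flattened to a vector of length |A|*|B|.
    d = np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)).ravel()
    return owa(d, weight_fn(d.size))

def single_weights(m):
    """All weight on the smallest distance (last position after sorting)."""
    w = np.zeros(m)
    w[-1] = 1.0
    return w

def complete_weights(m):
    """All weight on the largest distance (first position after sorting)."""
    w = np.zeros(m)
    w[0] = 1.0
    return w

def average_weights(m):
    """Equal weights: the arithmetic mean of all pairwise distances."""
    return np.full(m, 1.0 / m)

# Two small illustrative clusters in the plane (made-up data).
A = [[0.0, 0.0], [0.0, 1.0]]
B = [[3.0, 0.0], [4.0, 0.0]]

for name, fn in [("single", single_weights),
                 ("complete", complete_weights),
                 ("average", average_weights)]:
    print(name, owa_linkage(A, B, fn))
```

Any other weighting vector that is nonnegative and sums to one yields an intermediate linkage, e.g., one that spreads the weight over a few of the smallest distances.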
Abstract. The paper discusses a generalization of the nearest centroid hierarchical clustering algorithm. A first extension deals with the incorporation of generic distance-based penalty minimizers instead of the classical aggregation by means of centroids. As a result, the presented algorithm can be applied in spaces equipped with an arbitrary dissimilarity measure (images, DNA sequences, etc.). Secondly, a correction preventing the formation of clusters of overly unbalanced sizes is applied: just like in the recently introduced Genie approach, which extends the single linkage scheme, the new method prevents a chosen inequity measure (e.g., the Gini-, de Vergottini-, or Bonferroni-index) of cluster sizes from rising above a predefined threshold. Numerous benchmarks indicate that the introduction of such a correction increases the quality of the resulting clusterings.
Keywords. hierarchical clustering, aggregation, centroid, Gini-index, Genie algorithm
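As a rough illustration of an inequity measure of cluster sizes, the following Python sketch computes a normalized Gini index, i.e., the sum of absolute pairwise differences divided by (n-1) times the total. This particular normalization is an assumption made for the example, and the function name gini_index is hypothetical; the paper may define the measure differently.

```python
def gini_index(sizes):
    """Normalized Gini index of cluster sizes: 0 for perfectly balanced sizes,
    approaching 1 when one cluster dominates all the others."""
    x = sorted(sizes)
    n = len(x)
    total = sum(x)
    if n < 2 or total == 0:
        return 0.0
    # For ascending x (0-based i), the sum over pairs of |x_i - x_j|
    # equals sum_i (2*i - n + 1) * x_i.
    pair_sum = sum((2 * i - n + 1) * xi for i, xi in enumerate(x))
    return pair_sum / ((n - 1) * total)

print(gini_index([5, 5, 5]))   # balanced cluster sizes
print(gini_index([1, 1, 10]))  # one dominating cluster
```

A correction of the kind described above would compare such a value with a predefined threshold before accepting a merge.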
Abstract. The use of supervised learning techniques for fitting weights and/or generator functions of weighted quasi-arithmetic means – a special class of idempotent and nondecreasing aggregation functions – to empirical data has already been considered in a number of papers. Nevertheless, there are still some important issues that have not been discussed in the literature yet. In the first part of this two-part contribution we deal with the concept of regularization, a quite standard technique from machine learning applied so as to increase the fit quality on test and validation data samples. Due to the constraints on the weighting vector, it turns out that quite different methods can be used in the current framework, as compared to regression models. Moreover, it is worth noting that so far fitting weighted quasi-arithmetic means to empirical data has only been performed approximately, via the so-called linearization technique. In this paper we consider exact solutions to such special optimization tasks and indicate cases where linearization leads to much worse solutions.
Keywords. Aggregation functions, weighted quasi-arithmetic means, least squares fitting, regularization, linearization
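For readers unfamiliar with the object being fitted, a weighted quasi-arithmetic mean applies a generator function to each input, averages the results with the given weights, and maps the result back with the inverse of the generator. The Python sketch below only evaluates such a mean for given weights and generator; it does not implement the fitting, regularization, or linearization discussed in the paper, and all names in it are hypothetical.

```python
import math

def weighted_quasi_arithmetic_mean(x, w, phi, phi_inv):
    """Compute phi_inv(sum_i w_i * phi(x_i)) for nonnegative weights summing to 1."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    if any(wi < 0 for wi in w) or abs(sum(w) - 1.0) > 1e-9:
        raise ValueError("weights must be nonnegative and sum to 1")
    return phi_inv(sum(wi * phi(xi) for xi, wi in zip(x, w)))

x = [1.0, 4.0]
w = [0.5, 0.5]

# Identity generator: the weighted arithmetic mean.
arithmetic = weighted_quasi_arithmetic_mean(x, w, lambda t: t, lambda t: t)
# Logarithmic generator: the weighted geometric mean (requires positive inputs).
geometric = weighted_quasi_arithmetic_mean(x, w, math.log, math.exp)

print(arithmetic, geometric)
```

The constraints checked at the top of the function (nonnegativity and summing to one) are the constraints on the weighting vector that make the fitting task differ from ordinary regression.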
Abstract. The use of supervised learning techniques for fitting weights and/or generator functions of weighted quasi-arithmetic means – a special class of idempotent and nondecreasing aggregation functions – to empirical data has already been considered in a number of papers. Nevertheless, there are still some important issues that have not been discussed in the literature yet. In the second part of this two-part contribution we deal with a quite common situation in which we have inputs coming from different sources, describing a similar phenomenon, but which have not been properly normalized. In such a case, idempotent and nondecreasing functions cannot be used to aggregate them unless proper pre-processing is performed. The proposed idempotization method, based on the notion of B-splines, allows for an automatic calibration of independent variables. The introduced technique is applied in an R source code plagiarism detection system.
Keywords. Aggregation functions, weighted quasi-arithmetic means, least squares fitting, idempotence
Abstract. We discuss a generalization of the fuzzy (weighted) k-means clustering procedure and point out its relationships with data aggregation in spaces equipped with arbitrary dissimilarity measures. In the proposed setting, a data set partitioning is performed based on the notion of points' proximity to generic distance-based penalty minimizers. Moreover, a new data classification algorithm, resembling the k-nearest neighbors scheme but less computationally and memory demanding, is introduced. Rich examples in complex data domains indicate the usability of the methods and aggregation theory in general.
Keywords. fuzzy k-means algorithm, clustering, classification, fusion functions, penalty minimizers
Abstract. Multi-sensor based classification of professionals' activities plays a key role in ensuring the success of their goals. In this paper we present the winning solution to the AAIA'15 Tagging Firefighter Activities at a Fire Scene data mining competition. The approach is based on a Random Forest classifier trained on an input data set with almost 5000 features describing the underlying time series of sensory data.
Keywords. Activity tagging, movement tagging, data mining competition, Random Forest model, FFT
Abstract. The K-means algorithm is one of the most often used clustering techniques. However, when it comes to discovering clusters in informetric data sets that consist of non-increasingly ordered vectors of not necessarily conforming lengths, such a method cannot be applied directly. Hence, in this paper, we propose a K-means-like algorithm to determine groups of producers that are similar not only with respect to the quality of information resources they output, but also their quantity.
Keywords. k-means clustering, informetrics, aggregation, impact functions
Abstract. In this paper we describe recent advances in our R code similarity detection algorithm. We propose a modification of the Program Dependence Graph (PDG) procedure used in the GPLAG system that better fits the nature of functional programming languages like R. The major strength of our approach lies in a proper aggregation of outputs of multiple plagiarism detection methods, as it is well known that no single technique gives perfect results. It turns out that the incorporation of the PDG algorithm significantly improves the recall ratio, i.e., it is better at indicating true positive cases of plagiarism or code cloning patterns. The implemented system is available as a web application at http://SimilaR.Rexamine.com/.
Keywords. R, plagiarism and code cloning detection, fuzzy proximity relations, aggregation, program dependence graph, t-norms
Abstract. In the field of informetrics, agents are often represented by numeric sequences of not necessarily conforming lengths. There are numerous techniques for aggregating such sequences, e.g., the g-index and the h-index, that may be used to compare the output of pairs of agents. In this paper we address the question of whether such impact indices may be used to model experts' preferences accurately.
Keywords. preference learning, fuzzy relations, informetrics, aggregation, h-index
Abstract. Aggregation theory often deals with measures of central tendency of quantitative data. As sometimes a different kind of information fusion is needed, an axiomatization of spread measures was introduced recently. In this contribution we explore the properties of WDpWAM and WDpOWA operators, which are defined as weighted Lp-distances to weighted arithmetic mean and OWA operators, respectively. In particular, we give forms of vectors that maximize such fusion functions and thus provide a way to normalize the output value so that the vector of maximal spread always leads to a fixed outcome, e.g., 1 if all the input elements are in [0,1]. This might be desirable when constructing measures of experts' opinions consistency or diversity in group decision making problems.
Keywords. data fusion, aggregation, spread, deviation, variance, OWA operators
Abstract. The aim of this contribution is to inspect possible applications of clustering techniques computed over a set consisting of nonincreasingly ordered vectors of possibly nonconforming lengths. Such data sets appear in the field of informetrics, where one may need to evaluate the quality of information items, e.g., research papers, and their producers. In this paper we investigate the notion of cluster centers as an aggregated representation of all vectors from a given cluster and analyze them by means of aggregation operators.
Keywords. clustering, fuzzy clustering, c-means algorithm, distance, producers assessment problem
Abstract. The aggregation theory usually takes an interest in summarizing a predefined number of points in the real line. In many applications, like in statistics, data analysis, and mining, the notion of a mean – a nondecreasing, internal, and symmetric fusion function – plays a key role. Nevertheless, when it comes to aggregating a set of points in higher dimensional spaces, the componentwise extension of monotonicity and internality might not be the best choice. Instead, the invariance to certain classes of geometric transformations seems to be crucial in such a case.
Keywords. aggregation, centroid, Tukey median, 1-center, 1-median, convex hull, affine invariance, orthogonalization
Abstract. The producers assessment problem has many important practical instances: it is an abstract model for intelligent systems evaluating e.g. the quality of computer software repositories, web resources, social networking services, and digital libraries. Each producer's performance is determined according not only to the overall quality of the items he/she outputted, but also to the number of such items (which may be different for each agent).
Recent theoretical results indicate that the use of aggregation operators in the process of ranking and evaluating producers may not necessarily lead to fair and plausible outcomes. Therefore, to overcome some weaknesses of the most often applied approach, in this preliminary study we encourage the use of a fuzzy preference relation-based setting and indicate why it may provide better control over the assessment process.
Keywords. fuzzy relations, preference modeling, producers assessment problem, StackOverflow, bibliometrics, h-index
Abstract. Sugeno integral-based confidence intervals for the theoretical h-index of a fixed-length sequence of i.i.d. random variables are derived. They are compared with other estimators of such a distribution characteristic in a Pareto i.i.d. model. It turns out that in the first case we obtain much wider intervals. It seems to be due to the fact that a Sugeno integral, which may be applied on any ordinal scale, is known to ignore too much information from cardinal-scale data being aggregated.
Keywords. h-index, Sugeno integral, confidence interval, Pareto distribution
Abstract. R is a programming language and software environment for performing statistical computations and applying data analysis that is increasingly gaining popularity among practitioners and scientists. In this paper we present a preliminary version of a system to detect pairs of similar R code blocks among a given set of routines, which is based on a proper aggregation of the output of three different [0,1]-valued (fuzzy) proximity degree estimation algorithms. Its analysis on empirical data indicates that the system may in the future be successfully applied in practice in order e.g. to detect plagiarism among students' homework submissions or to perform an analysis of code recycling or code cloning in R's open source package repositories.
Keywords. R, plagiarism detection, code cloning, fuzzy similarity measures
Abstract. A reasonable approximation of a fuzzy number should have a simple membership function, be close to the input fuzzy number, and should preserve some of its important characteristics. In this paper we suggest approximating a fuzzy number by a piecewise linear 1-knot fuzzy number which is the closest one to the input fuzzy number among all piecewise linear 1-knot fuzzy numbers having the same core and the same support as the input. We discuss the existence of the approximation operator, show algorithms ready for practical use, and illustrate the considered concepts by examples. It turns out that such an approximation task may be problematic.
Keywords. Approximation of fuzzy numbers, core, fuzzy number, piecewise linear approximation, support
Abstract. Recently, a very interesting relation between symmetric minitive, maxitive, and modular aggregation operators has been shown. It turns out that the intersection between any pair of the mentioned classes is the same. This result introduces what we here propose to call the OM3 operators. In the first part of our contribution on the analysis of the OM3 operators we study some properties that may be useful when aggregating input vectors of varying lengths. In Part II we will perform a thorough simulation study of the impact of input vectors’ calibration on the aggregation results.
Abstract. This article is the second part of the contribution on the analysis of the recently proposed class of symmetric maxitive, minitive and modular aggregation operators. Recent results (Gagolewski, Mesiar, 2012) indicated some unstable behavior of the generalized h-index, which is a particular instance of OM3, in case of input data transformation. The study was performed on a small, carefully selected real-world data set. Here we conduct some experiments to examine this phenomenon more extensively.
Abstract. In this paper we discuss the construction of a new parametric statistical hypothesis test for the equality of probability distributions. The test is based on the difference between Hirsch’s h-indices of two equal-length i.i.d. random samples. For the sake of illustration, we analyze its power in case of Pareto-distributed input data. It turns out that the test is very conservative and has wide acceptance regions, which puts in question the appropriateness of the h-index usage in scientific quality control and decision making.
Abstract. The Choquet, Sugeno and Shilkret integrals with respect to monotone measures are useful tools in decision support systems. In this paper we propose a new class of graph-based integrals that generalize these three operations. Then, an efficient linear-time algorithm for computing their special case, that is lp-indices, 1≤p<∞, is presented. The algorithm is based on R.L. Graham's routine for determining the convex hull of a finite planar set.
Keywords. Monotone measures, Choquet, Sugeno and Shilkret integral, lp-index, convex hull, Graham's scan, scientific impact indices
Abstract. In this paper the recently introduced class of effort-dominating impact functions is examined. It turns out that each effort-dominating aggregation operator not only has a very intuitive interpretation, but also is symmetric minitive, and therefore may be expressed as a so-called quasi-I-statistic, which generalizes the well-known OWMin operator.
These aggregation operators may be used e.g. in the Producer Assessment Problem whose most important instance is the scientometric/bibliometric issue of fair scientists’ ranking by means of the number of citations received by their papers.
Abstract. Two classes of aggregation functions: L-statistics and S-statistics and their generalizations called quasi-L-statistics and quasi-S-statistics are considered. Some interesting characterizations of these families of operators are given. The aforementioned functions are useful for various applications. In particular, they are very helpful for modeling the so-called Producer Assessment Problem.
Abstract. Some statistical properties of the so-called S-statistics, which generalize the ordered weighted maximum aggregation operators, are considered. In particular, the asymptotic normality of S-statistics is proved and some possible applications in estimation problems are suggested.
Abstract. A class of extended aggregation operators, called impact functions, is proposed and their basic properties are examined. Some important classes of functions like generalized ordered weighted averaging (OWA) and ordered weighted maximum (OWMax) operators are considered. The general idea is illustrated by the Producer Assessment Problem which includes the scientometric problem of rating scientists based on the number of citations received by their publications. An interesting characterization of the well-known h-index is given.
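To make the h-index concrete, the short Python sketch below computes it in a max-min form: the largest value of min(i, c_i) over the citation counts c_1 >= c_2 >= ... sorted non-increasingly. This is one well-known way of writing the index and is not necessarily the characterization given in the paper; the function name h_index is hypothetical.

```python
def h_index(citations):
    """Largest h such that at least h items have at least h citations each,
    computed as max_i min(i, c_i) over non-increasingly sorted counts."""
    x = sorted(citations, reverse=True)
    return max((min(i, c) for i, c in enumerate(x, start=1)), default=0)

# Made-up citation counts for illustration.
print(h_index([10, 8, 5, 4, 3]))
```

Because the function sorts its input first and accepts a list of any length, it behaves like the impact functions described above: it is symmetric and can compare producers with different numbers of items.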
Abstract. The problem of measuring scientific impact is considered. A class of so-called p-sphere (rp) indices, which generalize the well-known Hirsch index, is used to construct a possibility measure of scientific impact. This measure might be treated as a starting point for prediction of future index values or for dealing with right-censored bibliometric data.