Data Fusion
Theory, Methods, and Applications

Research Monograph


Author: Marek Gagolewski
Publisher: Institute of Computer Science, Polish Academy of Sciences
Reviewers: Gleb Beliakov, Radko Mesiar

This publication is issued as part of the project Information technologies: Research and their interdisciplinary applications, objective 4.1 of the Human Capital Operational Program, agreement no. UDA-POKL.04.01.01-00-051/10-00. It is co-financed by the European Union from the resources of the European Social Fund.

Due to the above, the publication is distributed free of charge.

Appropriate fusion of large, complex data sets is indispensable in the information era. Even having to deal with just a few records forces the human brain to look for patterns in the data and to form an overall picture of it, instead of conceiving reality as a set of individual entities, which would be much more difficult to process and analyze. Quite similarly, using appropriate methods to reduce the information overload on a computer may not only increase the quality of the results, but also significantly decrease an algorithm's run time.

It is well known that information systems relying on a single information source (e.g., measurements gathered from one sensor, the opinions of a single authoritative decision maker, the outputs of one and only one machine learning algorithm, or the answers of an individual survey respondent) are most often neither accurate nor reliable.

The theory of aggregation is a relatively new research field, even though various particular methods for data fusion were already known to and used by ancient mathematicians. Since the 1980s, studies of aggregation functions have most often focused on the construction and formal, mathematical analysis of diverse ways to summarize numerical lists with elements in some real interval [a,b]. This covers different kinds of broadly conceived means, fuzzy logic connectives (t-norms, fuzzy implications), as well as copulas. More recently, we have observed an increasing interest in aggregation on partially ordered sets – in particular, on ordinal (linguistic) scales.
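For instance, a quasi-arithmetic mean summarizes a numerical list by applying a strictly monotone generator φ, averaging the transformed values, and mapping the result back via φ⁻¹; the arithmetic, geometric, and harmonic means all arise this way. A minimal Python sketch of this construction (illustrative only – the book's own listings are in R):

```python
import math

def qamean(x, phi, phi_inv):
    """Quasi-arithmetic mean: phi_inv of the arithmetic mean of phi(x_i)."""
    return phi_inv(sum(phi(xi) for xi in x) / len(x))

x = [1.0, 2.0, 4.0]
amean = qamean(x, lambda t: t,       lambda t: t)        # arithmetic mean
gmean = qamean(x, math.log,          math.exp)           # geometric mean
hmean = qamean(x, lambda t: 1.0 / t, lambda t: 1.0 / t)  # harmonic mean
```

For x = [1, 2, 4] the three calls yield 7/3, 2, and 12/7, respectively: each generator induces a different averaging behavior within the same framework.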

During AGOP 2013 – the International Summer School on Aggregation Operators held in Pamplona, Spain – Prof. Bernard De Baets in his plenary lecture pointed out the need to conduct research on the so-called Aggregation 2.0. Of course, Aggregation 2.0 does not aim to replace or in any way depreciate the very successful and important classical aggregation field, but rather to attract investigators' attention to new, more complex domains, most of which cannot be properly handled without computational methods. From this perspective, data fusion tools may be embedded in larger, more complicated information processing systems and thus studied as their key components.

Proper complex data fusion has been of interest to many researchers in diverse fields, including computational statistics, computational geometry, bioinformatics, machine learning, pattern recognition, quality management, engineering, statistics, finance, and economics. Let us note that it plays a crucial role in:

  • a synthetic description of data processes or whole domains,
  • creation of rule bases for approximate reasoning tasks,
  • consensus reaching and selection of the optimal strategy in decision support systems,
  • missing value imputation,
  • data deduplication and consolidation,
  • record linkage across heterogeneous databases,
  • construction of automated data segmentation algorithms (compare, e.g., the k-means and hierarchical clustering algorithms).
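Regarding the last item: cluster prototypes are themselves aggregates. For instance, the centroid used by k-means is the point that minimizes the total squared Euclidean distance to the inputs – a penalty-based view of fusion. A minimal, illustrative Python sketch:

```python
def centroid(points):
    """Componentwise mean of d-dimensional points: the minimizer of the
    sum of squared Euclidean distances (penalty-based fusion)."""
    n, d = len(points), len(points[0])
    return tuple(sum(p[j] for p in points) / n for j in range(d))

pts = [(0.0, 0.0), (2.0, 0.0), (1.0, 3.0)]
c = centroid(pts)  # (1.0, 1.0)
```

Replacing the squared-distance penalty with other dissimilarities yields different fusion functions, e.g., the 1-median or the medoid.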

We observe that many useful machine learning methods are based on a proper aggregation of information entities. In particular, the class of ensemble methods for classification is very successful in practice because of the assumption that no single "weak" classifier can perform as well as the whole group. Interestingly, many of the winning solutions to data mining competitions on Kaggle and similar platforms are somehow based on random forests and similar algorithms. What is more, neural networks – universal approximators – and other deep learning tools can be understood as hierarchies of individual fusion functions; thus, they can be conceived of as kinds of aggregation techniques as well. We should also mention that appropriate data fusion is crucial to business enterprises. For numerous reasons, companies are rarely eager to sell large parts of the data sets they possess to their clients. Instead, only carefully preprocessed and aggregated data models are delivered to the customers.
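At prediction time, the ensemble idea mentioned above amounts to a very simple fusion function: a plurality vote over the labels returned by the individual classifiers. A minimal Python sketch (the class labels are hypothetical):

```python
from collections import Counter

def majority_vote(labels):
    """Fuse class labels from several weak classifiers by plurality vote."""
    return Counter(labels).most_common(1)[0][0]

# Hypothetical outputs of five weak classifiers on a single sample:
fused = majority_vote(["spam", "ham", "spam", "spam", "ham"])  # "spam"
```

More refined ensembles replace the plain vote with weighted or trainable aggregation of the individual outputs, which is exactly where fusion functions enter the picture.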

This monograph aims to integrate the spread-out results from different domains using the methodology of the well-established classical aggregation framework, to introduce researchers and practitioners to Aggregation 2.0, as well as to point out the challenges and interesting directions for further research.

Table of Contents (Overview)

. Preface
. Notation convention and R basics
1. Aggregation of univariate data
2. Aggregation of multivariate data
3. Aggregation of strings
4. Aggregation of other data types
5. Numerical characteristics of objects
. Listings
. References
. Index

Table of Contents (Detailed)

. Preface
. Notation convention and R basics
1. Aggregation of univariate data
1.1. Preliminaries
1.2. Properties of fusion functions
1.2.1. Nondecreasingness and preservation of end points
1.2.2. Idempotence and internality
1.2.3. Conjunctivity and disjunctivity
1.2.4. Symmetry. Permutations of inputs
1.2.5. Continuity and convexity
1.2.6. Equivariance to translation and scaling
1.2.7. Additivity
1.2.8. Other types of monotonicity
1.3. Construction methods
1.3.1. Compositions and transforms of fusion functions
1.3.1.A. φ-isomorphisms: Quasi-arithmetic means
1.3.1.B. Weighting: Weighted quasi-arithmetic means
1.3.1.C. Symmetrization: OWA operators
1.3.1.D. Hierarchies of fusion functions
1.3.2. Monotone measures and integrals
1.3.3. Penalty-based aggregation functions
1.4. Extended aggregation functions
1.4.1. Weighting
1.4.2. Arity-dependent vs arity-free properties
1.4.3. Some arity-dependent properties
1.5. Choosing an aggregation method (I): Desired properties
1.5.1. Internal functions
1.5.2. Conjunctive and disjunctive functions
1.5.3. Mixed, non-aggregation, and other functions
1.5.4. Andness, orness, and other numerical characteristics
1.6. Choosing an aggregation method (II): Fitting to data
1.6.1. Fitting weighted arithmetic means
1.6.1.A. Least squares fitting
1.6.1.B. Least absolute deviation fitting
1.6.1.C. Least Chebyshev metric fitting
1.6.2. Preservation of output rankings
1.6.2.A. LAD fit with P being the L1 norm
1.6.2.B. LSE fit with P being the squared L2 norm
1.6.3. Regularization
1.6.4. Fitting weights of weighted quasi-arithmetic means
1.6.4.A. LSE fit of WQAMean weights
1.6.4.B. LAD fit of WQAMean weights
1.6.5. Fitting weighted power means
1.6.6. Determining generator functions of quasi-arithmetic means
1.6.7. A note on hierarchies of quasi-arithmetic means
1.7. Aggregation on bounded posets
1.7.1. Basic order theory concepts
1.7.2. Aggregation functions on bounded posets
1.7.3. Classes of fusion functions
1.7.4. Idempotent fusion functions
1.7.5. Lattice polynomial functions
1.8. Aggregation on a nominal scale
2. Aggregation of multivariate data
2.1. Aggregation of real vectors
2.2. Equivariance to geometric transforms
2.2.1. Translation and scale equivariance
2.2.2. Orthogonal equivariance
2.2.3. Equivariance to similarity transforms
2.2.4. Affine equivariance
2.3. Idempotence, internality, and weak monotonicity
2.4. Data depth, corresponding medians, and ordering of inputs
2.4.1. Tukey's halfplane location depth and median
2.4.2. Liu's simplicial depth and median
2.4.3. Oja's depth and median
2.4.4. Other depth notions
2.4.5. Symmetrization of fusion functions
2.5. Penalty-based fusion functions
2.5.1. 1-median
2.5.2. Medoid
2.5.3. Centroid
2.5.4. 1-center
2.5.5. A more general framework
2.6. Aggregation on product lattices
2.6.1. Cartesian product
2.6.2. Penalty-based aggregation on product lattices
2.6.3. Conjunctive, disjunctive, and averaging functions
2.6.4. Other orders on product lattices
2.7. Aggregation of character sequences
2.7.1. Median
2.7.2. Center
3. Aggregation of strings
3.1. Orders in the space of strings
3.1.1. Lexicographic order
3.1.2. α-, β-, and informetric orderings
3.1.3. Aggregation methods
3.2. Aggregation of informetric data
3.2.1. Metrics on the space of numeric strings
3.2.2. Centroid
3.2.3. 1-Median
3.3. Aggregation of character strings
3.3.1. Dissimilarity measures of character strings
3.3.1.A. Edit-based distances
3.3.1.B. Q-gram-based distances
3.3.1.C. Other string metrics
3.3.2. Median strings and a strings' centroid
3.3.3. Closest strings
4. Aggregation of other data types
4.1. Directional data
4.2. Aggregation of real intervals
4.3. Aggregation of fuzzy numbers
4.4. Aggregation of random variables
4.5. Aggregation of graphs and relations
4.6. Aggregation in finite semimetric spaces
4.7. Aggregation of heterogeneous data
5. Numerical characteristics of objects
5.1. Characteristics of probability distributions
5.1.1. Measures of location
5.1.2. Measures of dispersion
5.1.3. Point estimation
5.2. Spread measures
5.2.1. Measures of absolute spread for unidimensional data
5.2.2. Measures of relative spread
5.2.3. Spread measures for multidimensional data
5.3. Consensus, inequality, and other measures
5.4. Impact functions for informetric data
5.4.1. Impact functions generated by universal integrals
5.4.2. Properties of impact functions
5.5. Characteristics of fusion functions
5.5.1. Orness and related measures
5.5.2. Weighting vector's entropy
5.5.3. Breakdown points and values
5.6. Characteristics of fuzzy numbers
5.7. Checksums
. Listings
. References
. Index