My current research interests include, but are not limited to:
See also: My Academic CV and my Publication List.
MADAM: Methods for Analysis of Data – Algorithms and Modelling
@ Warsaw University of Technology
Oct. 2017 | Postdoctoral degree (Dr habil.) in Computer Science (Data aggregation and analysis); Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland |
Dec. 2011 | PhD in Computer Science (Data aggregation); Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland |
June 2008 | MSc in Computer Science (with honours; AI & computer graphics); Faculty of Mathematics and Information Science, Warsaw University of Technology, Poland |
Sep. 2019 – | School of Information Technology, Deakin University, Melbourne, VIC, Australia: Senior Lecturer in Applied Artificial Intelligence (09.2019 - ); Deputy Course (Program) Director for BSc in Applied Artificial Intelligence (09.2019 - ) |
Oct. 2008 – Sep. 2019 | Faculty of Mathematics and Information Science, Warsaw University of Technology, Poland: Associate Professor in Data Science (01.2018 - 09.2019); Supervisor of the Data Science Course (Program) (01.2018 - 09.2019); Deputy Course (Program) Director for BSc and MSc in Data Science (10.2016 - 09.2019); Assistant Professor (04.2012 - 12.2017); Teaching and Research Assistant (09.2008 - 02.2012) |
July 2008 – Oct. 2019 | Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland: Associate Professor (04.2018 - 08.2019); Assistant Professor (02.2012 - 03.2018); Research Assistant (07.2008 - 01.2012) |
July 2014 – July 2019 | Data Science Retreat, Berlin, Germany: Python, R and data science trainer and mentor (19 batches) |
July – Aug. 2017 | School of Information Technology, Deakin University, Melbourne, VIC, Australia; supported by the SEBE Researcher in Residence Program 2017, Deakin University |
Apr. – June 2015 | Institute for Research and Applications of Fuzzy Modeling, University of Ostrava, Czechia; supported by the European Union European Social Fund, Project UDA-POKL.04.01.01-00-051/10-00 "Information technologies: Research and their interdisciplinary applications" |
Mar. – June 2013 | Department of Mathematics, Slovak University of Technology, Bratislava, Slovakia; supported by the European Union European Social Fund, Project UDA-POKL.04.01.01-00-051/10-00 "Information technologies: Research and their interdisciplinary applications" |
I was the supervisor of the following PhD students:
I am currently the supervisor/scientific adviser of the following PhD students:
I was a reviewer of research project proposals for:
I served as a reviewer of PhD theses of:
I wrote 220 publication reviews, including 180 peer-reviews for the following international journals:
and 40 for international conferences (IFSA/EUSFLAT 2009, IPMU 2010, IPMU 2012, SMPS 2014, EUSFLAT 2015, IPMU 2016, ISAS 2016, SMPS 2016, EUSFLAT 2017, IFSA/SCIS 2017, EUSFLAT 2019).
Abstract. Cluster analysis is one of the most commonly applied unsupervised machine
learning techniques. Its aim is to automatically discover an underlying
structure of a data set represented by a partition of its elements:
mutually disjoint and nonempty subsets are determined in such a way
that observations within each group are "similar" and entities in distinct
clusters "differ" as much as possible from each other.
It turns out that two state-of-the-art clustering algorithms, namely
the Genie and HDBSCAN* methods, can be computed based on the minimum spanning
tree (MST) of the pairwise dissimilarity graph. Both are resistant to
outliers, produce high-quality partitions, and are
relatively fast to compute.
The aim of this tutorial is to discuss some key issues of hierarchical
clustering and explore their relations with graph and data aggregation theory.
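To illustrate how a partition can be read off a minimum spanning tree, here is a minimal sketch of the simplest MST-based method, single linkage: build the MST of the complete Euclidean graph and drop its k-1 heaviest edges. It is neither Genie nor HDBSCAN*; the function names (`mst_prim`, `single_linkage_from_mst`) and the choice of the Euclidean distance are assumptions made for this example only.

```python
import math

def mst_prim(points):
    """Return the MST edges (weight, i, j) of the complete Euclidean graph (Prim's algorithm)."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n      # cheapest known connection of each vertex to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # pick the vertex outside the tree that is cheapest to attach
        u = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((best[u], parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return edges

def single_linkage_from_mst(points, k):
    """Keep the n-k lightest MST edges; the connected components are the k clusters."""
    n = len(points)
    kept = sorted(mst_prim(points))[: n - k]
    root = list(range(n))      # union-find forest over the points

    def find(i):
        while root[i] != i:
            root[i] = root[root[i]]   # path halving
            i = root[i]
        return i

    for _, i, j in kept:
        root[find(i)] = find(j)
    labels = {}
    # relabel the component representatives as 0, 1, 2, ... in order of appearance
    return [labels.setdefault(find(i), len(labels)) for i in range(n)]

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(single_linkage_from_mst(points, k=2))
```

The quadratic-time Prim variant is used here because the dissimilarity graph is complete; all n(n-1)/2 distances are computed on the fly, so no distance matrix needs to be stored.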
Abstract.
Hirsch's h-index is perhaps the most popular citation-based measure
of scientific excellence.
Many of its natural generalizations
can be expressed as simple functions of some discrete Sugeno integrals.
In this talk we shall review some less-known results concerning various stochastic properties
of the discrete Sugeno integral with respect to a symmetric normalized capacity,
i.e., weighted lattice polynomial functions of real-valued random variables
-- both in i.i.d. (independent and identically distributed) and non-i.i.d. (with some dependence structure)
cases. For instance, we will be interested in investigating
their exact and asymptotic distributions.
Based on these, we can, among other things, show that the h-index is a consistent
estimator of some natural probability distribution's location characteristic.
Moreover, we can derive a statistical test to verify whether
the difference between two h-indices (say, h'=7 vs. h''=10 in cases where both authors
published 40 papers) is actually significant.
Furthermore, we shall discuss some agent-based models that describe the processes
generating citation networks based on, e.g., the preferential attachment
("rich get richer") rule.
Thanks to such an approach, we are able to simulate a scientist's activity
and then estimate the expected values for the h-index and similar functions
based on very simple sample statistics, such as the total number of citations
and the total number of publications.
Such results can help explain what the h-index really measures.
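For readers unfamiliar with the index, the following minimal sketch computes it in two equivalent ways: by the usual definition (the largest h such that h papers have at least h citations each) and by a max-min formula over the citation counts sorted in decreasing order, which is the kind of form that connects the h-index to Sugeno-type integrals. The function names and the sample data are made up for this example.

```python
def h_index(citations):
    """Largest h such that at least h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, c in enumerate(ranked, start=1):
        if c >= rank:
            h = rank
        else:
            break
    return h

def h_index_maxmin(citations):
    """Same value written as max over i of min(i, x_(i)), with x sorted decreasingly."""
    ranked = sorted(citations, reverse=True)
    return max((min(i, c) for i, c in enumerate(ranked, start=1)), default=0)

record = [10, 8, 5, 4, 3]   # hypothetical citation counts of five papers
print(h_index(record), h_index_maxmin(record))
```

Both functions agree on nonnegative integer inputs; the second one makes the lattice-polynomial (max-min) structure explicit.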
Abstract.
Aggregation theory classically deals with functions to summarize
a sequence of numeric values, e.g., in the unit interval.
Since the notion of componentwise monotonicity plays a key role in many
situations, there is a growing interest in methods
that act on diverse ordered structures.
However, as far as the definition of a mean or an averaging function
is concerned, the internality (or at least idempotence) property
seems to be relatively more important than the monotonicity condition.
In particular, the Bajraktarević means or the mode are among some
well-known non-monotone means.
The concept of a penalty-based function
was first investigated by Yager in 1993.
In such a framework, we are interested in minimizing the amount of "disagreement"
between the inputs and the output being computed;
the corresponding aggregation functions are at least idempotent
and express many existing means in an intuitive and attractive way.
In this talk I focus on the notion of penalty-based
aggregation of sequences of points in R^{d}, this time for some d≥1.
I review three noteworthy subclasses of penalty functions:
componentwise extensions of unidimensional ones, those constructed upon
pairwise distances between observations, and those defined by
measuring the so-called data depth.
Then, I discuss their formal properties, which are particularly
useful from the perspective of data analysis, e.g., different possible
generalizations of internality or equivariances to various geometric transforms.
I also point out the difficulties with extending some notions that are key
in classical aggregation theory, like the monotonicity property.
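As a small illustration of the first subclass (componentwise extensions), the sketch below assumes two standard penalties: the sum of squared Euclidean distances, minimized by the componentwise arithmetic mean (the centroid), and the sum of Manhattan distances, minimized by the componentwise median. The helper names and the sample points are hypothetical.

```python
from statistics import mean, median

def componentwise(agg, points):
    """Apply a unidimensional aggregation function to each coordinate separately."""
    return tuple(agg(coords) for coords in zip(*points))

def penalty_sq_euclidean(y, points):
    """Total disagreement between the output y and the inputs: sum of squared Euclidean distances."""
    return sum(sum((a - b) ** 2 for a, b in zip(y, p)) for p in points)

def penalty_manhattan(y, points):
    """Total disagreement measured by the sum of Manhattan (L1) distances."""
    return sum(sum(abs(a - b) for a, b in zip(y, p)) for p in points)

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0), (2.0, 3.0)]
centroid = componentwise(mean, points)     # minimizes penalty_sq_euclidean
cw_median = componentwise(median, points)  # minimizes penalty_manhattan
print(centroid, penalty_sq_euclidean(centroid, points))
print(cw_median, penalty_manhattan(cw_median, points))
```

Both outputs are idempotent (a sample of identical points is mapped to that point), while the single outlying observation pulls the centroid much further than it pulls the componentwise median.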
Abstract. Since the 1980s, studies of aggregation functions most often
focus on the construction and formal analysis of diverse ways
to summarize numerical lists with elements in some real interval.
More recently, there has also been an increasing interest in aggregation
of, and aggregation on, generic partially ordered sets.
However, in many practical applications, we have no natural ordering
of given data items. Thus, in this talk we review
various aggregation methods in spaces equipped merely with
a semimetric (distance).
These include the concept of such penalty minimizers as the centroid,
1-median, 1-center, medoid, and their generalizations
-- all leading to idempotent fusion functions.
Special emphasis is placed on procedures to summarize vectors
in R^{d} for d ≥ 2 (e.g., rows in numeric data frames)
as well as character strings (e.g., DNA sequences),
but of course the list of other interesting domains
could go on forever (rankings, graphs, images, time series, and so on).
We discuss some of their formal properties, exact
or approximate (if the underlying optimization task is hard)
algorithms to compute them and their applications
in clustering and classification tasks.
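One of the penalty minimizers listed above, the medoid, needs nothing but a distance, so it applies directly to character strings. The sketch below assumes the Levenshtein (edit) distance and uses made-up DNA-like strings; `levenshtein` and `medoid` are names chosen for this example.

```python
def levenshtein(s, t):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        cur = [i]
        for j, b in enumerate(t, start=1):
            cur.append(min(prev[j] + 1,              # delete a
                           cur[j - 1] + 1,           # insert b
                           prev[j - 1] + (a != b)))  # substitute (free if equal)
        prev = cur
    return prev[-1]

def medoid(items, dist):
    """Input element minimizing the sum of distances to all the inputs."""
    return min(items, key=lambda y: sum(dist(y, x) for x in items))

sequences = ["ACGT", "ACGA", "ACGT", "TCGT"]   # hypothetical sample
print(medoid(sequences, levenshtein))
```

Because the search is restricted to the observed items, the result is always one of the inputs, and the procedure is idempotent by construction. It costs a quadratic number of distance evaluations, which is where approximate algorithms become relevant for larger samples.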
R package stringi, Text Analysis Developers' Workshop, New York University, New York City, NY, US, Apr. 20-21, 2018.
R package stringi, Text Analysis R Developers' Workshop, London School of Economics, London, England, Apr. 21-22, 2017.
Genie: A new, fast, and outlier-resistant hierarchical clustering algorithm and its R interface, European R Users Meeting, Poznań, Poland, Oct. 12-14, 2016.
Abstract. The time needed to apply a hierarchical clustering
algorithm is most often dominated by the number of computations of
a pairwise dissimilarity measure. For larger data sets, such a constraint
puts all the classical linkage criteria except
the single linkage one at a disadvantage. However, it is known that the single linkage
clustering algorithm is very sensitive to outliers, produces highly skewed
dendrograms, and therefore usually does not reflect the true underlying
data structure - unless the clusters are well-separated.
To overcome its limitations, we proposed a new hierarchical clustering
linkage criterion called *Genie* (Gagolewski, Bartoszuk, Cena, 2016).
Namely, our algorithm links two clusters in such a way that a chosen economic
inequity measure (e.g., the Gini or Bonferroni index) of the cluster sizes
does not increase drastically above a given threshold.
Benchmarks indicate a high practical usefulness of the introduced method:
it most often outperforms the Ward or average linkage in terms of the
clustering quality while retaining the single linkage speed. The algorithm
is easily parallelizable and thus may be run on multiple threads to speed
up its execution further. Its memory overhead is small: there is no
need to precompute the complete distance matrix
in order to obtain a desired clustering.
In this talk we will discuss its reference implementation, included in the
*genie* package for R.
Keywords. hierarchical clustering, single linkage, inequity measures, Gini index
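To make the role of the inequity measure concrete, the sketch below computes a normalized Gini index of a vector of cluster sizes and compares it with a threshold. This is only an illustration of the quantity being monitored, not the reference implementation from the genie package; the normalization (division by (n-1) times the total) and the threshold value 0.3 are assumptions made for this example.

```python
def gini_index(sizes):
    """Normalized Gini index: 0 for equal sizes, 1 when one cluster holds everything."""
    n = len(sizes)
    total = sum(sizes)
    if n < 2 or total == 0:
        return 0.0
    pairwise = sum(abs(sizes[i] - sizes[j])
                   for i in range(n) for j in range(i + 1, n))
    return pairwise / ((n - 1) * total)

THRESHOLD = 0.3   # assumed value, for illustration only

for sizes in ([5, 5, 5, 5], [6, 5, 5, 4], [1, 1, 1, 17]):
    g = gini_index(sizes)
    verdict = "too unequal" if g > THRESHOLD else "acceptable"
    print(sizes, round(g, 3), verdict)
```

A single-linkage-style merge that keeps absorbing points into one large cluster drives this index up quickly, which is the situation the linkage criterion described above is designed to prevent.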
Abstract. We will examine the very fundamental properties of impact functions, that is, aggregation operators that may be used, e.g., in the assessment of scientists by means of the citations received by their papers. It turns out that each impact function which gives noncontroversial valuations in disputable cases must necessarily be trivial. Moreover, we will show that for any set of authors with ambiguous citation records, we may construct an impact function that gives ANY desired ordering of the authors. Theoretically, then, there is considerable room for manipulation.
^(R|ICU|i18n|regex)$, Seminarium Matematyczne Metody Informatyki, Instytut Matematyki, Uniwersytet Śląski, Katowice, Poland, Apr. 20, 2015.