realtest is now available.
Changes since v0.1.2:
* [NEW FEATURE] `sides_comparer` is now solely responsible for defining the semantics of side effect prototypes; therefore, `P` performs only a few non-invasive sanity checks of its arguments.
* [BACKWARD INCOMPATIBILITY] Example comparer `identical_or_TRUE` is no longer available.
* [BACKWARD INCOMPATIBILITY] `maps_identical_or_TRUE` has been renamed `sides_similar` and now allows for ignoring the side effects indicated by the user.
* [BUGFIX] `summary.realtest_results` no longer tries to subset symbols.
Paper on the genieclust Python+R package
Abstract. genieclust is an open source Python and R package that implements the hierarchical clustering algorithm called Genie. This method frequently outperforms other state-of-the-art approaches in terms of clustering quality and speed, supports various distances over dense, sparse, and string data domains, and can be robustified even further with the built-in noise point detector. As domain-independent software, it can be used for solving problems arising in all data-driven research and development activities, including environmental, health, biological, physical, decision, and social sciences as well as technology and engineering. The Python version provides a scikit-learn-compliant API, whereas the R variant is compatible with the classic hclust(). Numerous tutorials, use cases, non-trivial examples, documentation, installation instructions, benchmark results and timings can be found at https://genieclust.gagolewski.com/.
stringi is now shipped with ICU4C 69.1, which supports Unicode 13.0 and CLDR 39.
Changes since v1.5.3:
* [GENERAL] #401: stringi is now bundled with ICU4C 69.1 (upgraded from 61.1), which is used on most Windows and OS X builds as well as on *nix systems not equipped with system ICU. However, if the C++11 support is disabled, stringi will be built against the battle-tested ICU4C 55.1. The update to ICU brings Unicode 13.0 and CLDR 39 support.
* [BACKWARD INCOMPATIBILITY] In `stri_enc_list()`, `simplify` now defaults to `TRUE`.
* [DOCUMENTATION] A draft version of a paper on `stringi` is now available at https://stringi.gagolewski.com/_static/vignette/stringi.pdf
* [GENERAL] stringi now requires R >= 3.1 (`CXX_STD` of `CXX11` or `CXX1X`).
* [NEW FEATURE] #408: `stri_trans_casefold()` performs case folding; this is different from case mapping, which is locale-dependent. Folding makes two pieces of text that differ only in case identical. This can come in handy when comparing strings.
* [NEW FEATURE] #421: `stri_rank()` ranks strings in a character vector (e.g., for ordering data frames with regard to multiple criteria, the ranks can be passed to `order()`, see #219).
* [NEW FEATURE] #266: `stri_width()` now supports emojis.
* [NEW FEATURE] `%s$%` and `%stri$%` are now vectorised with respect to both arguments.
* [NEW FEATURE] #425: The outputs of `stri_enc_list()`, `stri_locale_list()`, `stri_timezone_list()`, and `stri_trans_list()` are now sorted.
* [NEW FEATURE] #428: In `stri_flatten`, `na_empty=NA` now omits missing values.
* [BUILD TIME] #431: Pre-4.9.0 GCC has `::max_align_t` but not `std::max_align_t`; a (possible) workaround has been added, see the INSTALL file.
* [BUGFIX] `stri_sort_key()` now outputs `bytes`-encoded strings.
* [BUGFIX] #415: `locale=''` was not equivalent to `locale=NULL` in `stri_opts_collator()`.
* [BUGFIX] #354: `ALTREP` `CHARSXP`s were not copied, and thus could have been garbage collected in the so-called meanwhile (with thanks to @jimhester).
* [INTERNAL] #414: Use `LEVELS(x)` macro instead of accessing `(x)->sxpinfo.gp` directly (@lukaszdaniel).
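The distinction between case folding and case mapping drawn in the `stri_trans_casefold()` entry can be illustrated outside R. The sketch below uses Python's built-in `str.casefold()` and `str.lower()` rather than stringi or ICU, so it is only an analogy: Python's `lower()` is not locale-sensitive, and the helper name `equal_ignoring_case` is made up for this example.

```python
def equal_ignoring_case(a: str, b: str) -> bool:
    """Compare two strings after Unicode case folding (Python built-in)."""
    return a.casefold() == b.casefold()


# Lowercasing (a case mapping) keeps the German sharp s unchanged,
# whereas case folding expands it to "ss", so only folding makes
# the two spellings compare as equal.
lower_equal = "Straße".lower() == "STRASSE".lower()
folded_equal = equal_ignoring_case("Straße", "STRASSE")
```

Here `lower_equal` is `False` while `folded_equal` is `True`, which is the kind of case-insensitive comparison that folding is meant to support.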
genieclust for fast and robust hierarchical clustering with noise point detection is now available on PyPI and CRAN.
On the aggregation of compositional data
Abstract. Compositional data naturally appear in many fields of application. For instance, in chemistry, the relative contributions of different chemical substances to a product are typically described in terms of a compositional data vector. Although the aggregation of compositional data frequently arises in practice, the functions formalizing this process do not fit the standard order-based aggregation framework. This is due to the fact that there is no intuitive order that carries the semantics of the set of compositional data vectors (referred to as the standard simplex). In this paper, we consider the more general betweenness-based aggregation framework that yields a natural definition of an aggregation function for compositional data. The weighted centroid is proved to fit within this definition and discussed to be linked to a very tangible interpretation. Other functions for the aggregation of compositional data are presented and their fit within the proposed definition is discussed.
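As a rough illustration of the weighted centroid mentioned in the abstract, the following Python sketch averages compositional vectors (nonnegative components summing to one) with nonnegative weights. The function name `weighted_centroid` and the sample data are assumptions made for this example; the paper's betweenness-based framework itself is not reproduced here.

```python
from typing import List, Sequence


def weighted_centroid(points: Sequence[Sequence[float]],
                      weights: Sequence[float]) -> List[float]:
    """Weighted arithmetic mean of vectors, computed componentwise."""
    if len(points) != len(weights) or not points:
        raise ValueError("need one weight per point and at least one point")
    if any(w < 0 for w in weights) or sum(weights) <= 0:
        raise ValueError("weights must be nonnegative and not all zero")
    total = sum(weights)
    dim = len(points[0])
    return [sum(w * p[j] for w, p in zip(weights, points)) / total
            for j in range(dim)]


# Two hypothetical three-part compositions; the second one counts three times.
compositions = [[0.2, 0.3, 0.5], [0.6, 0.2, 0.2]]
centroid = weighted_centroid(compositions, [1.0, 3.0])
```

Because the result is a convex combination of the inputs, its components are nonnegative and still sum to one, so the centroid remains in the standard simplex.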
Hierarchical data fusion processes involving the Möbius representation of capacities
Abstract. The use of the Choquet integral in data fusion processes allows for the effective modelling of interactions and dependencies between data features or criteria. Its application requires identification of the defining capacity (also known as fuzzy measure) values. The main limiting factor is the complexity of the underlying parameter learning problem, which grows exponentially in the number of variables. However, in practice we may have expert knowledge regarding which of the subsets of criteria interact with each other, and which groups are independent. In this paper we study hierarchical aggregation processes, architecturally similar to feed-forward neural networks, but which allow for the simplification of the fitting problem both in terms of the number of variables and monotonicity constraints. We note that the Möbius representation lets us identify a number of relationships between the overall fuzzy measure and the data pipeline structure. Included in our findings are simplified fuzzy measures that generalise both k-intolerant and k-interactive capacities.
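To make the role of the Möbius representation more concrete, here is a minimal Python sketch that evaluates a discrete Choquet integral from Möbius coefficients, using the standard identity that the integral equals the sum, over nonempty subsets A, of m(A) times the minimum of the inputs indexed by A. The function name, the dictionary layout, and the two-criteria capacity are illustrative assumptions, not the hierarchical models studied in the paper.

```python
from typing import Dict, FrozenSet, Sequence


def choquet_from_mobius(x: Sequence[float],
                        mobius: Dict[FrozenSet[int], float]) -> float:
    """Choquet integral of x given the Moebius coefficients m(A) of a capacity."""
    return sum(m * min(x[i] for i in subset)
               for subset, m in mobius.items() if subset)


# Hypothetical capacity on two criteria: mu({0}) = 0.3, mu({1}) = 0.5,
# mu({0, 1}) = 1, hence m({0, 1}) = 1 - 0.3 - 0.5 = 0.2.
mobius = {
    frozenset({0}): 0.3,
    frozenset({1}): 0.5,
    frozenset({0, 1}): 0.2,
}
value = choquet_from_mobius([0.4, 0.9], mobius)
```

In this example, the positive coefficient on the pair `{0, 1}` rewards inputs that are jointly high. Subsets with a zero coefficient can simply be left out of the dictionary, which is one way in which known independence between groups of criteria reduces the number of parameters.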
Package genieclust 0.9.8 Released
genieclust is now available on CRAN.
Change log:
- [R] Use `RcppMLPACK` directly; remove dependency on `emstreeR`.
- [R] Switched to `tinytest` for unit testing.
Interpretable sport team rating models based on the gradient descent algorithm
Abstract. We introduce several new sport team rating models based upon the gradient descent algorithm. More precisely, the models can be formulated by maximising the likelihood of match results observed using a single step of this optimisation heuristic. The framework proposed, inspired by the prominent Elo rating system, yields an iterative version of the ordinal logistic regression as well as different variants of the Poisson regression-based models. This construction makes the update equations easy to interpret as well as adjusts ratings once new match results are observed. Thus, it naturally handles temporal changes in team strength. Moreover, a study of association football data indicates that the new models yield more accurate forecasts and are less computationally demanding than corresponding methods that jointly optimise likelihood for the whole set of matches.
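The idea of updating ratings with a single gradient step can be sketched with the simplest possible case: a binary logistic model for win/loss outcomes, whose one-step update coincides in form with the classic Elo rule. This is a deliberate simplification and an assumption of this example; the ordinal logistic and Poisson variants from the paper are not implemented, and the names and the learning rate below are hypothetical.

```python
import math
from typing import Tuple


def expected_score(rating_home: float, rating_away: float) -> float:
    """Probability of a home win under a logistic model of the rating difference."""
    return 1.0 / (1.0 + math.exp(-(rating_home - rating_away)))


def update_ratings(rating_home: float, rating_away: float,
                   outcome: float, learning_rate: float = 0.1
                   ) -> Tuple[float, float]:
    """One gradient ascent step on the log-likelihood of a single match.

    `outcome` is 1.0 for a home win and 0.0 for an away win. The gradient
    with respect to the home rating is (outcome - p), and its negation
    for the away rating.
    """
    p = expected_score(rating_home, rating_away)
    step = learning_rate * (outcome - p)
    return rating_home + step, rating_away - step


# Two equally rated teams; the home side wins.
home, away = update_ratings(0.0, 0.0, outcome=1.0, learning_rate=0.1)
```

An unexpected result (a large `outcome - p`) moves the ratings more than an expected one, which is what makes the update equations easy to read.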
Abstract. This project addresses a key issue in automated decision making: explaining how a decision was reached by a computer system to its users. Its aim is to progress towards a new generation of explainable decision models, which would match the performance of current black-box systems while at the same time allow for transparency and detailed interpretation of the underlying logic. This project expects to generate new knowledge in modelling interdependencies of decision criteria using recent advances in the theory of capacities. The expected outcomes are sophisticated but tractable models in which mutual dependencies of decision rules and criteria are treated explicitly and can be thoroughly evaluated.
R Package stringi 1.5.3 Released
stringi brings quite a few new features and bug fixes.
Change log:
* [NEW FEATURE] #400: `%s$%` and `%stri$%` are now binary operators that call base R's `sprintf()`.
* [NEW FEATURE] #399: The `%s*%` and `%stri*%` operators can be used in addition to `stri_dup()`, for the very same purpose.
* [NEW FEATURE] #355: `stri_opts_regex()` now accepts the `time_limit` and `stack_limit` options so as to prevent malformed or malicious regexes from running for too long.
* [NEW FEATURE] #345: `stri_startswith()` and `stri_endswith()` are now equipped with the `negate` parameter.
* [NEW FEATURE] #382: Incorrect regexes are now reported to ease debugging.
* [DEPRECATION WARNING] #347: Any unknown option passed to `stri_opts_fixed()`, `stri_opts_regex()`, `stri_opts_coll()`, and `stri_opts_brkiter()` now generates a warning. In the future, the `...` parameter will be removed, so that will be an error.
* [DEPRECATION WARNING] `stri_duplicated()`'s `fromLast` argument has been renamed `from_last`. `fromLast` is now its alias scheduled for removal in a future version of the package.
* [DEPRECATION WARNING] `stri_enc_detect2()` is scheduled for removal in a future version of the package. Use `stri_enc_detect()` or the more targeted `stri_enc_isutf8()`, `stri_enc_isascii()`, etc., instead.
* [DEPRECATION WARNING] `stri_read_lines()`, `stri_write_lines()`, `stri_read_raw()`: use `con` argument instead of `fname` now. The argument `fallback_encoding` is scheduled for removal and is no longer used. `stri_read_lines()` does not support `encoding="auto"` anymore.
* [DEPRECATION WARNING] `nparagraphs` in `stri_rand_lipsum()` has been renamed `n_paragraphs`.
* [NEW FEATURE] #398: Alternative, British spelling of function parameters has been introduced, e.g., `stri_opts_coll()` now supports both `normalization` and `normalisation`.
* [NEW FEATURE] #393: `stri_read_bin()`, `stri_read_lines()`, and `stri_write_lines()` are no longer marked as draft API.
* [NEW FEATURE] #187: `stri_read_bin()`, `stri_read_lines()`, and `stri_write_lines()` now support connection objects as well.
* [NEW FEATURE] #386: New function `stri_sort_key()` for generating locale-dependent sort keys which can be ordered at the byte level and return an equivalent ordering to the original string (@DavisVaughan).
* [BUGFIX] #138: `stri_encode()` and `stri_rand_strings()` now can generate strings of much larger lengths.
* [BUGFIX] `stri_wrap()` did not honour `indent` correctly when `use_width` was `TRUE`.
stringi package, see stringi.gagolewski.com/.
Python and R package genieclust 0.9.4
Abstract. Third-party software for assuring source code quality is becoming increasingly popular. Tools that evaluate the coverage of unit tests, perform static code analysis, or inspect run-time memory use are crucial in the software development life cycle. More sophisticated methods allow for performing meta-analyses of large software repositories, e.g., to discover abstract topics they relate to or common design patterns applied by their developers. They may be useful in gaining a better understanding of the component interdependencies, avoiding cloned code as well as detecting plagiarism in programming classes.
A meaningful measure of similarity of computer programs often forms the basis of such tools. While there are a few noteworthy instruments for similarity assessment, none of them turns out particularly suitable for analysing R code chunks. Existing solutions rely on rather simple techniques and heuristics and fail to provide a user with the kind of sensitivity and specificity required for working with R scripts. In order to fill this gap, we propose a new algorithm based on a Program Dependence Graph, implemented in the SimilaR package. It can serve as a tool not only for improving R code quality but also for detecting plagiarism, even when it has been masked by applying some obfuscation techniques or imputing dead code. We demonstrate its accuracy and efficiency in a real-world case study.
Paper in PNAS: Three Dimensions of Scientific Impact
Abstract. The growing popularity of bibliometric indexes (whose most famous example is the h index by J. E. Hirsch [J. E. Hirsch, Proc. Natl. Acad. Sci. U.S.A. 102, 16569–16572 (2005)]) is opposed by those claiming that one's scientific impact cannot be reduced to a single number. Some even believe that our complex reality fails to submit to any quantitative description. We argue that neither of the two controversial extremes is true. By assuming that some citations are distributed according to the rich get richer rule (success breeds success, preferential attachment) while some others are assigned totally at random (all in all, a paper needs a bibliography), we have crafted a model that accurately summarizes citation records with merely three easily interpretable parameters: productivity, total impact, and how lucky an author has been so far.
Benchmark Suite for Clustering Algorithms - Version 1
Lightweight Machine Learning Classics with R
About. Explore some of the most fundamental algorithms which have stood the test of time and provide the basis for innovative solutions in data-driven AI. Learn how to use the R language for implementing various stages of data processing and modelling activities. Appreciate mathematics as the universal language for formalising data-intense problems and communicating their solutions. The book is for you if you're yet to be fluent with university-level linear algebra, calculus and probability theory or you've forgotten all the maths you've ever learned, and are seeking a gentle, yet thorough, introduction to the topic.
R Package stringi 1.4.6 Released
stringi is now on CRAN.
Change log:
* [BACKWARD INCOMPATIBILITY] #369: `stri_c()` now returns an empty string when input is empty and `collapse` is set.
* [BUGFIX] #370: fixed an issue in `stri_prepare_arg_POSIXct()` reported by rchk.
* [DOCUMENTATION] #372: documented arguments not in `\usage` in documentation object `stri_datetime_format`: `...`
Genie+OWA: Robustifying Hierarchical Clustering with OWA-based Linkages
Abstract. We investigate the application of the Ordered Weighted Averaging (OWA) data fusion operator in agglomerative hierarchical clustering. The examined setting generalises the well-known single, complete and average linkage schemes. It allows expert knowledge to be embodied in the cluster merge process and provides a much wider range of possible linkages. We analyse various families of weighting functions on numerous benchmark data sets in order to assess their influence on the resulting cluster structure. Moreover, we inspect the correction for the inequality of cluster size distribution, similar to the one in the Genie algorithm. Our results demonstrate that by robustifying the procedure with the Genie correction, we can obtain a significant performance boost in terms of clustering quality. This is particularly beneficial in the case of the linkages based on the closest distances between clusters, including the single linkage and its "smoothed" counterparts. To explain this behaviour, we propose a new linkage process called three-stage OWA which yields further improvements. This way we confirm the intuition that hierarchical cluster analysis should rather take into account a few nearest neighbours of each point, instead of trying to adapt to their non-local neighbourhood.
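The way an OWA operator generalises the single, complete and average linkages can be sketched in a few lines of Python. The convention assumed here is that the weights are applied to the pairwise distances sorted in nondecreasing order; the function names and the toy clusters are made up for this example, and neither the Genie correction nor the three-stage OWA from the paper is implemented.

```python
import math
from itertools import product
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, ...]


def owa(values: Sequence[float], weights: Sequence[float]) -> float:
    """Ordered weighted average: weights applied to values sorted increasingly."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have equal lengths")
    return sum(w * v for w, v in zip(weights, sorted(values)))


def owa_linkage(cluster_a: Sequence[Point], cluster_b: Sequence[Point],
                make_weights: Callable[[int], List[float]]) -> float:
    """OWA of all pairwise Euclidean distances between two clusters."""
    distances = [math.dist(p, q) for p, q in product(cluster_a, cluster_b)]
    return owa(distances, make_weights(len(distances)))


def single(n: int) -> List[float]:
    return [1.0] + [0.0] * (n - 1)    # all weight on the smallest distance


def complete(n: int) -> List[float]:
    return [0.0] * (n - 1) + [1.0]    # all weight on the largest distance


def average(n: int) -> List[float]:
    return [1.0 / n] * n              # uniform weights


a = [(0.0, 0.0), (1.0, 0.0)]
b = [(3.0, 0.0), (5.0, 0.0)]
d_single = owa_linkage(a, b, single)
d_complete = owa_linkage(a, b, complete)
d_average = owa_linkage(a, b, average)
```

Weight vectors that put most of their mass on the first few positions would correspond to the "smoothed" single linkage counterparts mentioned in the abstract.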
stringi 1.4.4 is on its way to CRAN.
Change log:
* [BUGFIX] #348: Avoid copying 0 bytes to a nil-buffer in `stri_sub_all()`.
* [BUGFIX] #362: Removed `configure` variable `CXXCPP` as it is now deprecated.
* [BUGFIX] #318: PROTECTing objects from gcing as reported by `rchk`.
* [BUGFIX] #344, #364: Removed compiler warnings in icu61/common/cstring.h.
* [BUGFIX] #363: Status of `RegexMatcher` is now checked after its use.
DC Optimisation for Constructing Discrete Sugeno Integrals and Learning Nonadditive Measures
Abstract. Defined solely by means of order-theoretic operations meet (min) and join (max), weighted lattice polynomial functions are particularly useful for modeling data on an ordinal scale. A special case, the discrete Sugeno integral, defined with respect to a nonadditive measure (a capacity), enables accounting for the interdependencies between input variables.
However, until recently, the problem of identifying the fuzzy measure values with respect to various objectives and requirements has not received a great deal of attention. By expressing the learning problem as the difference of convex functions, we are able to apply DC (difference of convex) optimization methods. Here we formulate one of the global optimization steps as a local linear programming problem and investigate the improvement under different conditions.
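For readers unfamiliar with the discrete Sugeno integral, the following Python sketch evaluates it using only min and max, following the usual definition: sort the inputs increasingly and take the maximum over positions of the minimum between the input and the capacity of the set of indices holding that input or a larger one. The dictionary-based capacity and its values are assumptions made for this example; the DC optimisation procedure from the paper is not shown.

```python
from typing import Dict, FrozenSet, Sequence


def sugeno_integral(x: Sequence[float],
                    capacity: Dict[FrozenSet[int], float]) -> float:
    """Discrete Sugeno integral of x with respect to a capacity on index sets."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    best = 0.0
    for pos, i in enumerate(order):
        # Indices of the inputs that are at least as large as x[i].
        upper_set = frozenset(order[pos:])
        best = max(best, min(x[i], capacity[upper_set]))
    return best


# Hypothetical capacity on two inputs (monotone, with mu of the full set = 1).
capacity = {
    frozenset({0}): 0.3,
    frozenset({1}): 0.5,
    frozenset({0, 1}): 1.0,
}
s1 = sugeno_integral([0.4, 0.9], capacity)
s2 = sugeno_integral([0.9, 0.4], capacity)
```

Since only comparisons are involved, the result is always either one of the inputs or one of the capacity values, which is why the function suits ordinal scales.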
IEEE WCCI 2020 Special Session - Aggregation Structures: New Trends and Applications
Robust Fitting for the Sugeno Integral with Respect to General Fuzzy Measures
Abstract. The Sugeno integral is an expressive aggregation function with potential applications across a range of decision contexts. Its calculation requires only the lattice minimum and maximum operations, making it particularly suited to ordinal data and robust to scale transformations. However, for practical use in data analysis and prediction, we require efficient methods for learning the associated fuzzy measure. While such methods are well developed for the Choquet integral, the fitting problem is more difficult for the Sugeno integral because it is not amenable to being expressed as a linear combination of weights, and more generally due to plateaus and non-differentiability in the objective function. Previous research has hence focused on heuristic approaches or simplified fuzzy measures. Here we show that the problem of fitting the Sugeno integral to data such that the maximum absolute error is minimized can be solved using an efficient bilevel program. This method can be incorporated into algorithms that learn fuzzy measures with the aim of minimizing the median residual. This equips us with tools that make the Sugeno integral a feasible option in robust data regression and analysis. We provide experimental comparison with a genetic algorithms approach and an example in data analysis.
Constrained Ordered Weighted Averaging Aggregation with Multiple Comonotone Constraints
Abstract. The constrained ordered weighted averaging (OWA) aggregation problem arises when we aim to maximize or minimize a convex combination of order statistics under linear inequality constraints that act on the variables with respect to their original sources. The standalone approach to optimizing the OWA under constraints is to consider all permutations of the inputs, which quickly becomes infeasible when there are more than a few variables; however, in certain cases we can take advantage of the relationships amongst the constraints and the corresponding solution structures. For example, we can consider a land-use allocation satisfaction problem with an auxiliary aim of balancing land-types, whereby the response curves for each species are non-decreasing with respect to the land-types. This results in comonotone constraints, which allow us to drastically reduce the complexity of the problem.
In this paper, we show that if we have an arbitrary number of constraints that are comonotone (i.e., they share the same ordering permutation of the coefficients), then the optimal solution occurs for decreasing components of the solution. After investigating the form of the solution in some special cases and providing theoretical results that shed light on the form of the solution, we detail practical approaches to solving and give real-world examples.
Aggregation on Ordinal Scales with the Sugeno Integral for Biomedical Applications
Abstract. The Sugeno integral is a function particularly suited to the aggregation of ordinal inputs. Defined with respect to a fuzzy measure, its ability to account for complementary and redundant relationships between variables brings much potential to the field of biomedicine, where it is common for measurements and patient information to be expressed qualitatively. However, practical applications require well-developed methods for identifying the Sugeno integral's parameters, and this task is not easily expressed using the standard optimisation approaches. Here we formulate the objective function as the difference of two convex functions, which enables the use of specialised numerical methods. Such techniques are compared with other global optimisation frameworks through a number of numerical experiments.
New Paper on Information Fusion
Abstract. The property of monotonicity, which requires a function to preserve a given order, has been considered the standard in the aggregation of real numbers for decades. In this paper, we argue that, for the case of multidimensional data, an order-based definition of monotonicity is far too restrictive. We propose several meaningful alternatives to this property not involving the preservation of a given order by returning to its early origins stemming from the field of calculus. Numerous aggregation methods for multidimensional data commonly used by practitioners are studied within our new framework.
An Inherent Difficulty in the Aggregation of Multidimensional Data
Abstract. In the field of information fusion, the problem of data aggregation has been formalized as an order-preserving process that builds upon the property of monotonicity. However, fields such as computational statistics, data analysis and geometry, usually emphasize the role of equivariances to various geometrical transformations in aggregation processes. Admittedly, if we consider a unidimensional data fusion task, both requirements are often compatible with each other. Nevertheless, in this paper we show that, in the multidimensional setting, the only idempotent functions that are monotone and orthogonal equivariant are the over-simplistic weighted centroids. Even more, this result still holds after replacing monotonicity and orthogonal equivariance by the weaker property of orthomonotonicity. This implies that the aforementioned approaches to the aggregation of multidimensional data are irreconcilable, and that, if a weighted centroid is to be avoided, we must choose between monotonicity and a desirable behaviour with regard to orthogonal transformations.
Should We Introduce a Dislike Button for Academic Papers?
Abstract. On the grounds of the revealed, mutual resemblance between the behaviour of users of Stack Exchange and the dynamics of the citations accumulation process in the scientific community, we tackled an outwardly intractable problem of assessing the impact of introducing "negative" citations.
Although the most frequent reason to cite a paper is to highlight the connection between the two publications, researchers sometimes mention an earlier work to cast a negative light. While computing citation-based scores, for instance the h-index, information about the reason why a paper was mentioned is neglected. Therefore it can be questioned whether these indices describe scientific achievements accurately.
In this contribution we shed light on the problem of "negative" citations, analysing data from Stack Exchange and, to draw more universal conclusions, we derive an approximation of citation scores. Here we show that the quantified influence of introducing negative citations is of lesser importance and that they could be used as an indicator of where the attention of the scientific community is allocated.
stringi brings significant improvements in the way substring extraction tasks are performed.
Change-log since v1.3.1:
* [NEW FEATURE] #30: New function `stri_sub_all()` - a version of `stri_sub()` accepting list `from`/`to`/`length` arguments for extracting multiple substrings from each string in a character vector.
* [NEW FEATURE] #30: New function `stri_sub_all<-()` (and its `%<%`-friendly version, `stri_sub_replace_all()`) - for replacing multiple substrings with corresponding replacement strings.
* [NEW FEATURE] In `stri_sub_replace()`, `value` parameter has a new alias, `replacement`.
* [NEW FEATURE] New convenience functions based on `stri_remove_empty()`: `stri_omit_empty_na()`, `stri_remove_empty_na()`, `stri_omit_empty()`, and also `stri_remove_na()`, `stri_omit_na()`.
* [BUGFIX] #343: `stri_trans_char()` did not yield correct results for overlapping pattern and replacement strings.
* [WARNFIX] #205: `configure.ac` is now included in the source bundle.
agop is now available on CRAN. See below for more details.
Change-log:
0.2-2 /2019-03-05/
* [IMPORTANT CHANGE] All functions dealing with binary relations now are named like `rel_*`. Moreover, `de_transitive()` has been renamed `rel_reduction_hasse()`.
* [IMPORTANT CHANGE] The definition of `owa()`, `owmax()`, and `owmin()` is now consistent with that of (Grabisch et al., 2009), i.e., uses nondecreasing vectors, and not nonincreasing ones.
* [NEW FUNCTIONS] `rel_closure_reflexive()`, `rel_reduction_reflexive()`, `rel_is_symmetric()`, `rel_closure_symmetric()`, `rel_is_irreflexive()`, `rel_is_asymmetric()`, `rel_is_antisymmetric()`, `rel_is_cyclic()`, etc., modify given adjacency matrices representing binary relations over finite sets.
* [NEW FUNCTIONS] Some predefined fuzzy logic connectives have been added, e.g., `tnorm_minimum()`, `tnorm_drastic()`, `tnorm_product()`, `tnorm_lukasiewicz()`, `tnorm_fodor()`, `tconorm_minimum()`, `tconorm_drastic()`, `tconorm_product()`, `tconorm_lukasiewicz()`, `tconorm_fodor()`, `fnegation_classic()`, `fnegation_minimal()`, `fnegation_maximal()`, `fnegation_yager()`, `fimplication_minimal()`, `fimplication_maximal()`, `fimplication_kleene()`, `fimplication_lukasiewicz()`, `fimplication_reichenbach()`, `fimplication_fodor()`, `fimplication_goguen()`, `fimplication_goedel()`, `fimplication_rescher()`, `fimplication_weber()`, `fimplication_yager()`.
* [NEW FUNCTION] `check_comonotonicity()` determines if two vectors are comonotonic.
* [NEW FUNCTIONS] `pord_spread()`, `pord_spreadsym()`, `pord_nd()` - example preorders on sets of vectors.
* [NEW FEATURE] `plot_producer()` gained a new argument: `a`.
* [BUGFIX] `rel_closure_transitive()` - a resulting matrix was not necessarily transitive.
* [BUGFIX] `prepare_arg_numeric_sorted` (internal, C++) did not sort some vectors.
* [BUGFIX] All built-in aggregation functions now throw an error on empty vectors.
* [INFO] The package no longer depends on the `Matrix` package. The `igraph` package is only suggested.
* [INFO] Most of the functions are now implemented in C++.
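Two of the notions named in this change-log, comonotonic vectors and t-norms, have short textbook definitions that can be sketched in Python. The code below is not agop's implementation and does not claim to match the exact behaviour of `check_comonotonicity()` or the `tnorm_*()` functions (for instance, in how ties or invalid inputs are handled); all Python names are hypothetical.

```python
from itertools import combinations
from typing import Sequence


def are_comonotonic(x: Sequence[float], y: Sequence[float]) -> bool:
    """True if no pair of positions is ordered oppositely in x and in y."""
    if len(x) != len(y):
        raise ValueError("vectors must have equal lengths")
    return all((x[i] - x[j]) * (y[i] - y[j]) >= 0
               for i, j in combinations(range(len(x)), 2))


def tnorm_minimum_sketch(a: float, b: float) -> float:
    """Minimum t-norm on [0, 1]."""
    return min(a, b)


def tnorm_product_sketch(a: float, b: float) -> float:
    """Product t-norm on [0, 1]."""
    return a * b


def tnorm_lukasiewicz_sketch(a: float, b: float) -> float:
    """Lukasiewicz t-norm on [0, 1]."""
    return max(0.0, a + b - 1.0)


same_order = are_comonotonic([1, 2, 3], [10, 10, 20])
opposite_order = are_comonotonic([1, 2, 3], [3, 1, 2])
```

Ties are treated as compatible with any ordering in this sketch, so `[10, 10, 20]` is comonotonic with `[1, 2, 3]`.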
Penalty-based Data Aggregation in Real Normed Vector Spaces
Abstract. The problem of penalty-based data aggregation in generic real normed vector spaces is studied. Some existence and uniqueness results are indicated. Moreover, various properties of the aggregation functions are considered.
stringi (one of the most often downloaded extensions on CRAN) is available. Check out the change-log for more information.
Change-log:
* [BACKWARD INCOMPATIBILITY] #335: A fix to #314 (by design) prevented the use of the system ICU if the library had been compiled with `U_CHARSET_IS_UTF8=1`. However, this is the default setting in `libicu`>=61. From now on, in such cases the system ICU is used more eagerly, but `stri_enc_set()` issues a warning stating that the default (UTF-8) encoding cannot be changed.
* [NEW FEATURE] #232: All `stri_detect_*` functions now have the `max_count` argument that allows for, e.g., stopping at first pattern occurrence.
* [NEW FEATURE] #338: `stri_sub_replace()` is now an alias for `stri_sub<-()` which makes it much more easily pipable (@yutannihilation, @BastienFR).
* [NEW FEATURE] #334: Added missing `icudt61b.dat` to support big-endian platforms (thanks to Dimitri John Ledkov @xnox).
* [BUGFIX] #296: Out-of-the-box build used to fail on CentOS 6, upgraded `./configure` to `--disable-cxx11` more eagerly at an early stage.
* [BUGFIX] #341: Fixed possible buffer overflows when calling `strncpy()` from within ICU 61.
* [BUGFIX] #325: Made `./configure` more portable so that it works under `/bin/dash` now.
* [BUGFIX] #319: Fixed overflow in `stri_rand_shuffle()`.
* [BUGFIX] #337: Empty search patterns in search functions (e.g., `stri_split_regex()` and `stri_count_fixed()`) used to raise too many warnings.
Piecewise Linear Approximation of Fuzzy Numbers: Algorithms, Arithmetic Operations and Stability of Characteristics
Abstract. The problem of the piecewise linear approximation of fuzzy numbers giving outputs nearest to the inputs with respect to the Euclidean metric is discussed. The results given in Coroianu et al. (Fuzzy Sets Syst 233:26–51, 2013) for the 1-knot fuzzy numbers are generalized for arbitrary n-knot (n>=2) piecewise linear fuzzy numbers. Some results on the existence and properties of the approximation operator are proved. Then, the stability of some fuzzy number characteristics under approximation as the number of knots tends to infinity is considered. Finally, a simulation study concerning the computer implementations of arithmetic operations on fuzzy numbers is provided. Suggested concepts are illustrated by examples and algorithms ready for the practical use. This way, we throw a bridge between theory and applications as the latter ones are so desired in real-world problems.
Supervised Learning to Aggregate Data with the Sugeno Integral
Abstract. The problem of learning symmetric capacities (or fuzzy measures) from data is investigated toward applications in data analysis and prediction as well as decision making. Theoretical results regarding the solution minimizing the mean absolute error are exploited to develop an exact branch-refine-and-bound-type algorithm for fitting Sugeno integrals (weighted lattice polynomial functions, max-min operators) with respect to symmetric capacities. The proposed method turns out to be particularly suitable for acting on ordinal data. In addition to providing a model that can be used for the general data regression task, the results can be used, among others, to calibrate generalized h-indices to bibliometric data.
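The link between Sugeno integrals with respect to symmetric capacities and h-type indices can be shown with a short Python sketch. For a symmetric capacity, the integral reduces to the maximum over i of min(i-th largest input, w_i), and choosing w_i = i yields the Hirsch h-index. The function names are assumptions made for this example, and the fitting algorithm proposed in the paper is not reproduced.

```python
from typing import Sequence


def symmetric_sugeno(x: Sequence[float], weights: Sequence[float]) -> float:
    """max_i min(i-th largest element of x, weights[i]); 0 for empty input."""
    ordered = sorted(x, reverse=True)
    return max((min(v, w) for v, w in zip(ordered, weights)), default=0)


def h_index(citations: Sequence[int]) -> int:
    """Hirsch's h-index as a special case with weights 1, 2, ..., n."""
    return symmetric_sugeno(citations, range(1, len(citations) + 1))


h_a = h_index([10, 8, 5, 4, 3])
h_b = h_index([25, 8, 5, 3, 3])
```

Replacing the weights 1, 2, ..., n with another nondecreasing sequence gives a generalised h-index, which is the kind of parameter a fitting procedure would calibrate to data.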
Maciej Bartoszuk's PhD Defence
The Efficacy of League Formats in Ranking Teams
Abstract. The efficacy of different league formats in ranking teams according to their true latent strength is analysed. To this end, a new approach for estimating attacking and defensive strengths based on the Poisson regression for modelling match outcomes is proposed. Various performance metrics are estimated reflecting the agreement between latent teams' strength parameters and their final rank in the league table. The tournament designs studied here are used in the majority of European top-tier association football competitions. Based on numerical experiments, it turns out that a two-stage league format comprising of the three round-robin tournament together with an extra single round-robin is the most efficacious setting. In particular, it is the most accurate in selecting the best team as the winner of the league. Its efficacy can be enhanced by setting the number of points allocated for a win to two (instead of three that is currently in effect in association football).
Python Package genieclust 0.1a2
Invited Plenary Lecture @ ISCAMI 2018
Abstract. Cluster analysis is one of the most commonly applied unsupervised machine learning techniques. Its aim is to automatically discover an underlying structure of a data set represented by a partition of its elements: mutually disjoint and nonempty subsets are determined in such a way that observations within each group are "similar" and entities in distinct clusters "differ" as much as possible from each other.
It turns out that two state-of-the-art clustering algorithms, namely the Genie and HDBSCAN* methods, can be computed based on the minimum spanning tree (MST) of the pairwise dissimilarity graph. Both of them are not only resistant to outliers and able to produce high-quality partitions, but are also relatively fast to compute.
The aim of this tutorial is to discuss some key issues of hierarchical clustering and explore their relations with graph and data aggregation theory.
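To give a feel for MST-based clustering, the Python sketch below builds a Euclidean minimum spanning tree with a simple O(n^2) version of Prim's algorithm and then drops the longest edges to obtain a partition. Note that this yields plain single linkage, which is the simplest MST-based method; it is neither the Genie nor the HDBSCAN* algorithm, and all names and data are made up for this example.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]
Edge = Tuple[float, int, int]  # (length, endpoint, endpoint)


def minimum_spanning_tree(points: Sequence[Point]) -> List[Edge]:
    """Prim's algorithm on the complete Euclidean graph, O(n^2) time."""
    n = len(points)
    in_tree = [False] * n
    best_dist = [math.inf] * n
    best_from = [-1] * n
    best_dist[0] = 0.0
    edges: List[Edge] = []
    for _ in range(n):
        # Pick the point outside the tree that is closest to the tree.
        u = min((i for i in range(n) if not in_tree[i]),
                key=lambda i: best_dist[i])
        in_tree[u] = True
        if best_from[u] >= 0:
            edges.append((best_dist[u], best_from[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best_dist[v]:
                    best_dist[v], best_from[v] = d, u
    return edges


def single_linkage_labels(points: Sequence[Point], n_clusters: int) -> List[int]:
    """Cut the n_clusters - 1 longest MST edges and label the components."""
    n = len(points)
    if not 1 <= n_clusters <= n:
        raise ValueError("n_clusters must be between 1 and the number of points")
    kept = sorted(minimum_spanning_tree(points))[:n - n_clusters]
    parent = list(range(n))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for _, u, v in kept:
        parent[find(u)] = find(v)
    roots: dict = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0)]
labels = single_linkage_labels(points, n_clusters=2)
```

With these five points, the only long MST edge connects the two groups, so cutting it separates the first three points from the last two.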
stringi
is out.
Check out the change-log for more information.
Change-log:
* [GENERAL] #193: `stringi` is now bundled with ICU4C 61.1, which is used on most Windows and OS X builds as well as on *nix systems not equipped with ICU. However, if the C++11 support is disabled, stringi will be built against ICU4C 55.1. The update to ICU brings Unicode 10.0 support, including new emoji characters.
* [BUGFIX] #288: `stri_match` did not return the correct number of columns when input was empty.
* [NEW FEATURE] #188: `stri_enc_detect` now returns a list of data frames.
* [NEW FEATURE] #289: `stri_flatten` gained `na_empty` and `omit_empty` arguments.
* [NEW FEATURE] New functions: `stri_remove_empty`, `stri_na2empty`.
* [NEW FEATURE] #285: Coercion from a non-trivial list (one that consists of atomic vectors, each of length 1) to an atomic vector now issues a warning.
* [WARN] Removed `-Wparentheses` warnings in `icu55/common/cstring.h:38:63` and `icu55/i18n/windtfmt.cpp` in the ICU4C 55.1 bundle.
Text Analysis Developers' Workshop 2018 @ NYC
MADAM Seminar: Aggregation through the poset glass (Raúl Pérez-Fernández)
Abstract. The aggregation of several objects into a single one is a common study subject in mathematics. Unfortunately, whereas practitioners often need to deal with the aggregation of many different types of objects (rankings, graphs, strings, etc.), the current theory of aggregation is mostly developed for dealing with the aggregation of values in a poset. In this presentation, we will reflect on the limitations of this poset-based theory of aggregation and “jump through the poset glass”. On the other side, we will not find Wonderland, but, instead, we will find more questions than answers. Indeed, a new theory of aggregation is being born, and we will need to work together on this reboot for years to come.
MADAM Seminar: Should we introduce a ‘dislike’ button for papers? (Agnieszka Geras)
Abstract. Citation scores and the h-index are basic tools used for measuring the quality of scientific work. Nonetheless, when evaluating academic achievements, one rarely takes into consideration the reason why a paper was mentioned by another author: whether to highlight the connection between their works or to bring to the reader's attention any mistakes or flaws. In my talk I will offer some insight into the problem of "negative" citations by analyzing data from Stack Exchange and using the proposed agent-based model. Joint work with Marek Gągolewski and Grzegorz Siudem.
Least median of squares (LMS) and least trimmed squares (LTS) fitting for the weighted arithmetic mean
Abstract. We look at different approaches to learning the weights of the weighted arithmetic mean such that the median residual or the sum of the smallest half of squared residuals is minimized. The more general problem of multivariate regression has been well studied in the statistical literature; however, in the case of aggregation functions, the weights are constrained and the domain is usually restricted so that 'outliers' may not be arbitrarily large. A number of algorithms are compared in terms of accuracy and speed. Our results can be extended to other aggregation functions.
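The two robust criteria named in the abstract can be written down in a few lines. The sketch below is only an illustration: it assumes that the "smallest half" means the ceil(n/2) smallest squared residuals, and it uses a crude grid search over two nonnegative weights summing to one rather than any of the algorithms compared in the paper.

```python
import math
import statistics

def squared_residuals(X, y, w):
    """Squared errors of the weighted arithmetic mean on each observation."""
    return [(sum(wi * xi for wi, xi in zip(w, row)) - yi) ** 2
            for row, yi in zip(X, y)]

def lts_objective(X, y, w):
    """Least trimmed squares: sum of the ceil(n/2) smallest squared residuals."""
    r = sorted(squared_residuals(X, y, w))
    return sum(r[: math.ceil(len(r) / 2)])

def lms_objective(X, y, w):
    """Least median of squares: median of the squared residuals."""
    return statistics.median(squared_residuals(X, y, w))

def grid_search_two_weights(X, y, objective, steps=100):
    """Brute-force search over weights (t, 1 - t), t on a regular grid."""
    candidates = [(k / steps, 1 - k / steps) for k in range(steps + 1)]
    return min(candidates, key=lambda w: objective(X, y, w))

if __name__ == "__main__":
    # Three observations follow w = (0.25, 0.75); the last one is an outlier.
    X = [(0.0, 1.0), (1.0, 0.0), (0.5, 0.5), (1.0, 1.0)]
    y = [0.75, 0.25, 0.5, 0.0]
    print(grid_search_two_weights(X, y, lts_objective))
    print(grid_search_two_weights(X, y, lms_objective))
```

Because both objectives ignore the largest residuals, the single outlying observation does not pull the fitted weights away from the values that explain the remaining data.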
Invited Plenary Lecture @ FSTA 2018
Abstract. Hirsch's h-index is perhaps the most popular citation-based measure of scientific excellence. Many of its natural generalizations can be expressed as simple functions of some discrete Sugeno integrals.
In this talk we shall review some less-known results concerning various stochastic properties of the discrete Sugeno integral with respect to a symmetric normalized capacity, i.e., weighted lattice polynomial functions of real-valued random variables -- both in i.i.d. (independent and identically distributed) and non-i.i.d. (with some dependence structure) cases. For instance, we will be interested in investigating their exact and asymptotic distributions. Based on these, we can, among others, show that the h-index is a consistent estimator of some natural probability distribution's location characteristic. Moreover, we can derive a statistical test to verify whether the difference between two h-indices (say, h'=7 vs. h''=10 in cases where both authors published 40 papers) is actually significant.
What is more, we shall discuss some agent-based models that describe the processes generating citation networks based on, e.g., the preferential attachment ("rich gets richer") rule. With such an approach, we are able to simulate a scientist's activity and then estimate the expected values for the h-index and similar functions based on very simple sample statistics, such as the total number of citations and the total number of publications. Such results can help explain what the h-index really measures.
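For readers who want the definition in executable form, the h-index of a citation record is the largest h such that at least h papers have at least h citations each. The sketch below computes it in two equivalent ways; the second, a maximum of minima over the ranked record, mirrors the max-min form that links the index to the discrete Sugeno integral. Function names are illustrative.

```python
def h_index(citations):
    """Largest h such that at least h papers have h or more citations."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h

def h_index_max_min(citations):
    """The same value written as max over ranks i of min(i, x_(i))."""
    ranked = sorted(citations, reverse=True)
    return max((min(rank, cites) for rank, cites in enumerate(ranked, start=1)),
               default=0)

if __name__ == "__main__":
    record = [10, 8, 5, 4, 3]
    print(h_index(record), h_index_max_min(record))
```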
MADAM Seminar: Measuring the efficacy of league formats in ranking football teams (Jan Lasek)
Abstract. Choosing between different tournament designs based on their accuracy in ranking teams is an important topic in football, since many domestic championships have undergone changes in recent years. In particular, the transformations of Ekstraklasa -- the top-tier football competition in Poland -- are a topic receiving much attention from the organizing body of the competition, the participating football clubs, and supporters. In this presentation we will discuss the problem of measuring the accuracy of different league formats in ranking teams. We will present various models for rating teams, which will then be used to simulate a number of tournaments and evaluate their efficacy, for example, by measuring the probability that the best team wins. Finally, we will discuss several other aspects of league formats, including the influence of the number of points allocated for a win on the final league standings.
MADAM Seminar: How accidental scientific success is? (Grzegorz Siudem)
Abstract. Since the classic work of de Solla Price, the rich-gets-richer rule has been well known as the most important mechanism governing citation network dynamics. (Un)fortunately, it is not sufficient to explain every aspect of bibliometric data. Using the proposed agent-based model for bibliometric networks, we will shed some light on the problem and try to answer the important question stated in the title. Joint work with A. Cena, M. Gagolewski and B. Żogała-Siudem.
stringi
package for R
is on CRAN. The package is one of the most
downloaded R extensions and provides a rich set of string processing
procedures.
Change-log:
* [WINDOWS SPECIFIC] #270: Strings marked with `latin1` encoding are now converted internally to UTF-8 using the WINDOWS-1252 codec. This fixes problems with - among others - displaying the Euro sign.
* [NEW FEATURE] #263: Add support for custom rule-based break iteration, see `?stri_opts_brkiter`.
* [NEW FEATURE] #267: `omit_na=TRUE` in `stri_sub<-` now ignores missing values in any of the arguments provided.
* [BUGFIX] fixed unPROTECTed variable names and stack imbalances as reported by rchk
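Why the choice of codec matters for the Euro sign can be seen without R. The short Python sketch below (an illustration of the encodings themselves, not of stringi internals) decodes the single byte 0x80 in both ways: ISO-8859-1 maps it to a C1 control character, whereas WINDOWS-1252 maps it to the Euro sign.

```python
raw = b"\x80"  # the byte that Windows-1252 uses for the Euro sign

as_latin1 = raw.decode("latin-1")  # U+0080, an invisible C1 control character
as_cp1252 = raw.decode("cp1252")   # U+20AC, the Euro sign

if __name__ == "__main__":
    print(f"latin-1: U+{ord(as_latin1):04X}")
    print(f"cp1252:  U+{ord(as_cp1252):04X} -> UTF-8 bytes {as_cp1252.encode('utf-8')!r}")
```

On Windows, text labelled as Latin-1 is very often really WINDOWS-1252, which is why converting it with the latter codec yields the expected character.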
Research Visit @ Deakin University
Abstract. As cities increase in size, governments and councils face the problem of designing infrastructure and approaches to traffic management that alleviate congestion. The problem of objectively measuring congestion involves taking into account not only the volume of traffic moving throughout a network, but also the inequality or spread of this traffic over major and minor intersections. For modelling such data, we investigate the use of weighted congestion indices based on various aggregation and spread functions. We formulate the weight learning problem for comparison data and use real traffic data obtained from a medium-sized Australian city to evaluate their usefulness.
Abstract. Aggregation theory classically deals with functions to summarize a sequence of numeric values, e.g., in the unit interval. Since the notion of componentwise monotonicity plays a key role in many situations, there is an increasingly growing interest in methods that act on diverse ordered structures.
However, as far as the definition of a mean or an averaging function is concerned, the internality (or at least idempotence) property seems to be of a relatively higher importance than the monotonicity condition. In particular, the Bajraktarević means or the mode are among some well-known non-monotone means.
The concept of a penalty-based function was first investigated by Yager in 1993. In such a framework, we are interested in minimizing the amount of "disagreement" between the inputs and the output being computed; the corresponding aggregation functions are at least idempotent and express many existing means in an intuitive and attractive way.
In this talk I focus on the notion of penalty-based aggregation of sequences of points in R^{d}, this time for some d≥1. I review three noteworthy subclasses of penalty functions: componentwise extensions of unidimensional ones, those constructed upon pairwise distances between observations, and those defined by measuring the so-called data depth. Then, I discuss their formal properties, which are particularly useful from the perspective of data analysis, e.g., different possible generalizations of internality or equivariances to various geometric transforms. I also point out the difficulties with extending some notions that are key in classical aggregation theory, like the monotonicity property.
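A tiny numerical illustration of the componentwise subclass may help here. In the sketch below (function names are illustrative), the centroid minimizes the sum of squared Euclidean distances to the inputs, while the componentwise median minimizes the sum of L1 distances; both are evaluated on a small planar data set with one distant point.

```python
import math
import statistics

def centroid(points):
    """Componentwise arithmetic mean: minimizer of the squared Euclidean penalty."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def componentwise_median(points):
    """Componentwise median: a minimizer of the L1 (Manhattan) penalty."""
    return tuple(statistics.median(coords) for coords in zip(*points))

def penalty_squared_euclidean(points, y):
    return sum(math.dist(p, y) ** 2 for p in points)

def penalty_l1(points, y):
    return sum(sum(abs(pi - yi) for pi, yi in zip(p, y)) for p in points)

if __name__ == "__main__":
    data = [(0, 0), (2, 0), (0, 2), (10, 10)]
    c, m = centroid(data), componentwise_median(data)
    print("centroid:", c, "squared penalty:", penalty_squared_euclidean(data, c))
    print("median:  ", m, "L1 penalty:     ", penalty_l1(data, m))
```

The distant point (10, 10) drags the centroid to (3, 3), while the componentwise median stays at (1, 1), which shows how the choice of penalty determines the robustness of the resulting fusion function.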
EUSFLAT'17: Fitting symmetric fuzzy measures for discrete Sugeno integration
stringi
package for R
is on its way to CRAN. The package provides powerful
string processing facilities to R users and developers and is ranked as
one of the most often downloaded R
extensions.
Change-log:
* [GENERAL] `stringi` now requires ICU4C >= 52.
* [GENERAL] `stringi` now requires R >= 2.14.
* [BUGFIX] Fixed errors pointed out by `clang-UBSAN` in `stri_brkiter.h`.
* [BUILD TIME] #238, #220: Try "standard" ICU4C build flags if a call to `pkg-config` fails.
* [BUILD TIME] #258: Use `CXX11` instead of `CXX1X` on R >= 3.4.
* [BUILD TIME, BUGFIX] #254: `dir.exists()` is R >= 3.2.
stringi
package to CRAN.
Change-log:
* [REMOVE DEPRECATED] `stri_install_check()` and `stri_install_icudt()` marked as deprecated in `stringi` 0.5-5 are no longer being exported.
* [BUGFIX] #227: Incorrect behavior of `stri_sub()` and `stri_sub<-()` if the empty string was the result.
* [BUILD TIME] #231: The `./configure` (*NIX only) script now reads the following environment variables: `STRINGI_CFLAGS`, `STRINGI_CPPFLAGS`, `STRINGI_CXXFLAGS`, `STRINGI_LDFLAGS`, `STRINGI_LIBS`, `STRINGI_DISABLE_CXX11`, `STRINGI_DISABLE_ICU_BUNDLE`, `STRINGI_DISABLE_PKG_CONFIG`, `PKG_CONFIG`, see `INSTALL` for more information.
* [BUILD TIME] #253: call to `R_useDynamicSymbols` added.
* [BUILD TIME] #230: icudt is now being downloaded by `./configure` (*NIX only) *before* building.
* [BUILD TIME] #242: `_COUNT/_LIMIT` enum constants have been deprecated as of ICU 58.2, stringi code has been upgraded accordingly.
FUZZ-IEEE'17: Two Papers Accepted
Penalty-Based Aggregation of Multidimensional Data
Abstract. Research in aggregation theory is nowadays still mostly focused on algorithms to summarize tuples consisting of observations in some real interval or of diverse general ordered structures. Of course, in the practice of information processing many other data types between these two extreme cases are worth inspecting. This contribution deals with the aggregation of lists of data points in R^{d} for arbitrary d≥1. Even though particular functions aiming to summarize multidimensional data have been discussed by researchers in data analysis, computational statistics and geometry, there is clearly a need to provide a comprehensive and unified model in which their properties like equivariances to geometric transformations, internality, and monotonicity may be studied at an appropriate level of generality. The proposed penalty-based approach serves as a common framework for all idempotent information aggregation methods, including componentwise functions, pairwise distance minimizers, and data depth-based medians. It also allows for deriving many new practically useful tools.
Przetwarzanie i analiza danych w języku Python
Programowanie w języku R (2nd Ed., revised and extended)
Eusflat'17 Special Session:
Algorithms for Data Aggregation and Fusion
Penalty-Based and Other Representations of Economic Inequality
Abstract. Economic inequality measures are employed as a key component in various socio-demographic indices to capture the disparity between the wealthy and poor. Since their inception, they have also been used as a basis for modelling spread and disparity in other contexts. While recent research has identified that a number of classical inequality and welfare functions can be considered in the framework of OWA operators, here we propose a framework of penalty-based aggregation functions and their associated penalties as measures of inequality.
Invited Talk @ European R Users Meeting 2016
Abstract. The time needed to apply a hierarchical clustering algorithm is most often dominated by the number of computations of a pairwise dissimilarity measure. Such a constraint, for larger data sets, puts at a disadvantage the use of all the classical linkage criteria but the single linkage one. However, it is known that the single linkage clustering algorithm is very sensitive to outliers, produces highly skewed dendrograms, and therefore usually does not reflect the true underlying data structure - unless the clusters are well-separated.
To overcome its limitations, we proposed a new hierarchical clustering linkage criterion called genie. Namely, our algorithm links two clusters in such a way that a chosen economic inequity measure (e.g., the Gini or Bonferroni index) of the cluster sizes does not increase drastically above a given threshold.
Benchmarks indicate a high practical usefulness of the introduced method: it most often outperforms the Ward or average linkage in terms of the clustering quality while retaining the single linkage speed. The algorithm is easily parallelizable and thus may be run on multiple threads to speed up its execution further on. Its memory overhead is small: there is no need to precompute the complete distance matrix to perform the computations in order to obtain a desired clustering. In this talk we will discuss its reference implementation, included in the genie package for R.
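The linkage rule described above can be sketched on top of a precomputed minimum spanning tree. The Python code below is a simplified illustration under stated assumptions, not the reference implementation from the genie package: it uses the normalized Gini index of cluster sizes, takes the MST edges as given, and, whenever the index exceeds the threshold, merges along the shortest remaining edge that touches a cluster of the smallest size. The names `gini` and `genie_from_mst` and the default threshold of 0.3 are choices made for this example.

```python
def gini(sizes):
    """Normalized Gini index: 0 for equal sizes, 1 for maximal inequality."""
    n = len(sizes)
    if n < 2:
        return 0.0
    pairwise = sum(abs(a - b) for i, a in enumerate(sizes) for b in sizes[i + 1:])
    return pairwise / ((n - 1) * sum(sizes))

def genie_from_mst(n, mst_edges, n_clusters, threshold=0.3):
    """Merge clusters along MST edges (weight, u, v) with a Gini correction."""
    parent = list(range(n))
    size = [1] * n

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    remaining = sorted(mst_edges)
    clusters = n
    while clusters > n_clusters and remaining:
        sizes = [size[r] for r in {find(i) for i in range(n)}]
        pick = 0  # default: the shortest remaining edge (single linkage step)
        if gini(sizes) > threshold:
            smallest = min(sizes)
            # Force a merge that involves one of the smallest clusters.
            pick = next(k for k, (_, u, v) in enumerate(remaining)
                        if smallest in (size[find(u)], size[find(v)]))
        _, u, v = remaining.pop(pick)
        ru, rv = find(u), find(v)
        parent[ru] = rv
        size[rv] += size[ru]
        clusters -= 1
    return [find(i) for i in range(n)]

if __name__ == "__main__":
    # MST of the points 0, 1, 2, 3, 10, 11 on the real line (a path).
    edges = [(1, 0, 1), (1, 1, 2), (1, 2, 3), (7, 3, 4), (1, 4, 5)]
    print(genie_from_mst(6, edges, n_clusters=2))
```

Setting `threshold=1.0` disables the correction, in which case the procedure reduces to ordinary single linkage.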
Invited Plenary Talk @ ISAS 2016
Abstract. Since the 1980s, studies of aggregation functions most often focus on the construction and formal analysis of diverse ways to summarize numerical lists with elements in some real interval. Quite recently, we also observe an increasing interest in aggregation of and aggregation on generic partially ordered sets.
However, in many practical applications, we have no natural ordering of given data items. Thus, in this talk we review various aggregation methods in spaces equipped merely with a semimetric (distance). These include the concept of such penalty minimizers as the centroid, 1-median, 1-center, medoid, and their generalizations -- all leading to idempotent fusion functions. Special emphasis is placed on procedures to summarize vectors in R^{d} for d ≥ 2 (e.g., rows in numeric data frames) as well as character strings (e.g., DNA sequences), but of course the list of other interesting domains could go on forever (rankings, graphs, images, time series, and so on).
We discuss some of their formal properties, exact or approximate (if the underlying optimization task is hard) algorithms to compute them and their applications in clustering and classification tasks.
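For data with no natural ordering, such as character strings, one of the penalty minimizers listed above is easy to demonstrate: the medoid, i.e., the input element with the smallest total distance to all the others. The sketch below pairs it with the Levenshtein edit distance; both function names are illustrative, and the exhaustive search is quadratic in the number of items.

```python
def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]

def medoid(items, dist):
    """Input element minimizing the sum of distances to all inputs."""
    return min(items, key=lambda c: sum(dist(c, x) for x in items))

if __name__ == "__main__":
    reads = ["ACGT", "ACGA", "ACGT", "TTTT"]
    print(medoid(reads, levenshtein))
```

Since the medoid is always one of the inputs, the resulting fusion function is idempotent by construction, and it requires nothing beyond the ability to evaluate the distance between two objects.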
Hierarchical Clustering via Penalty-Based Aggregation and the Genie Approach
Abstract. The paper discusses a generalization of the nearest centroid hierarchical clustering algorithm. A first extension deals with the incorporation of generic distance-based penalty minimizers instead of the classical aggregation by means of centroids. Owing to this, the presented algorithm can be applied in spaces equipped with an arbitrary dissimilarity measure (images, DNA sequences, etc.). Secondly, a correction preventing the formation of clusters of too highly unbalanced sizes is applied: just like in the recently introduced Genie approach, which extends the single linkage scheme, the new method prevents a chosen inequity measure (e.g., the Gini-, de Vergottini-, or Bonferroni-index) of cluster sizes from rising above a predefined threshold. Numerous benchmarks indicate that the introduction of such a correction increases the quality of the resulting clusterings.
stringi
is among the top
10 most downloaded R
packages, providing various string
processing facilities. A new release comes with a few bugfixes
and new features.
* [BUGFIX] #214: allow a regex pattern like `.*` to match an empty string.
* [BUGFIX] #210: `stri_replace_all_fixed(c("1", "NULL"), "NULL", NA)` now results in `c("1", NA)`.
* [NEW FEATURE] #199: `stri_sub<-` now allows for ignoring `NA` locations (a new `omit_na` argument added).
* [NEW FEATURE] #207: `stri_sub<-` now allows for substring insertions (via `length=0`).
* [NEW FUNCTION] #124: `stri_subset<-` functions added.
* [NEW FEATURE] #216: `stri_detect`, `stri_subset`, `stri_subset<-` gained a `negate` argument.
* [NEW FUNCTION] #175: `stri_join_list` concatenates all strings in a list of character vectors. Useful with, e.g., `stri_extract_all_regex`, `stri_extract_all_words` etc.
Paper on the Genie Clustering Algorithm
genie
package for R
The article has been assigned the DOI 10.1016/j.ins.2016.05.003.
Abstract. The time needed to apply a hierarchical clustering algorithm is most often dominated by the number of computations of a pairwise dissimilarity measure. Such a constraint, for larger data sets, puts at a disadvantage the use of all the classical linkage criteria but the single linkage one. However, it is known that the single linkage clustering algorithm is very sensitive to outliers, produces highly skewed dendrograms, and therefore usually does not reflect the true underlying data structure – unless the clusters are well-separated. To overcome its limitations, we propose a new hierarchical clustering linkage criterion called Genie. Namely, our algorithm links two clusters in such a way that a chosen economic inequity measure (e.g., the Gini- or Bonferroni-index) of the cluster sizes does not increase drastically above a given threshold. The presented benchmarks indicate a high practical usefulness of the introduced method: it most often outperforms the Ward or average linkage in terms of the clustering quality while retaining the single linkage speed. The Genie algorithm is easily parallelizable and thus may be run on multiple threads to speed up its execution further on. Its memory overhead is small: there is no need to precompute the complete distance matrix to perform the computations in order to obtain a desired clustering. It can be applied on arbitrary spaces equipped with a dissimilarity measure, e.g., on real vectors, DNA or protein sequences, images, rankings, informetric data, etc. A reference implementation of the algorithm has been included in the open source genie package for R.
Proc. IPMU'2016: 3 Papers Accepted
1st paper:
Abstract. The use of supervised learning techniques for fitting weights and/or generator functions of weighted quasi-arithmetic means – a special class of idempotent and nondecreasing aggregation functions – to empirical data has already been considered in a number of papers. Nevertheless, there are still some important issues that have not been discussed in the literature yet. In the first part of this two-part contribution we deal with the concept of regularization, a quite standard technique from machine learning applied so as to increase the fit quality on test and validation data samples. Due to the constraints on the weighting vector, it turns out that quite different methods can be used in the current framework, as compared to regression models. Moreover, it is worth noting that so far fitting weighted quasi-arithmetic means to empirical data has only been performed approximately, via the so-called linearization technique. In this paper we consider exact solutions to such special optimization tasks and indicate cases where linearization leads to much worse solutions.
Keywords. Aggregation functions, weighted quasi-arithmetic means, least squares fitting, regularization, linearization
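For reference, a weighted quasi-arithmetic mean with generator φ and weights w is φ⁻¹(Σ wᵢ φ(xᵢ)). The sketch below merely evaluates such a mean for given weights and generator; it does not perform any fitting, regularization, or linearization, and the function name and argument checks are assumptions of this example.

```python
import math

def weighted_quasi_arithmetic_mean(x, w, phi, phi_inv):
    """Evaluate phi_inv(sum_i w_i * phi(x_i)) for nonnegative weights summing to 1."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    if any(wi < 0 for wi in w) or not math.isclose(sum(w), 1.0):
        raise ValueError("weights must be nonnegative and sum to 1")
    return phi_inv(sum(wi * phi(xi) for wi, xi in zip(w, x)))

if __name__ == "__main__":
    identity = lambda t: t
    # phi = identity gives the weighted arithmetic mean.
    print(weighted_quasi_arithmetic_mean([1, 3], [0.25, 0.75], identity, identity))
    # phi = log gives the weighted geometric mean.
    print(weighted_quasi_arithmetic_mean([1, 4], [0.5, 0.5], math.log, math.exp))
```

The constraints on the weighting vector checked here are exactly what distinguishes the fitting task from ordinary regression, where coefficients are unrestricted.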
2nd paper:
Abstract. The use of supervised learning techniques for fitting weights and/or generator functions of weighted quasi-arithmetic means – a special class of idempotent and nondecreasing aggregation functions – to empirical data has already been considered in a number of papers. Nevertheless, there are still some important issues that have not been discussed in the literature yet. In the second part of this two-part contribution we deal with a quite common situation in which we have inputs coming from different sources, describing a similar phenomenon, but which have not been properly normalized. In such a case, idempotent and nondecreasing functions cannot be used to aggregate them unless proper pre-processing is performed. The proposed idempotization method, based on the notion of B-splines, allows for an automatic calibration of independent variables. The introduced technique is applied in an R source code plagiarism detection system.
Keywords. Aggregation functions, weighted quasi-arithmetic means, least squares fitting, idempotence
3rd paper:
Abstract. We discuss a generalization of the fuzzy (weighted) k-means clustering procedure and point out its relationships with data aggregation in spaces equipped with arbitrary dissimilarity measures. In the proposed setting, a data set partitioning is performed based on the notion of points' proximity to generic distance-based penalty minimizers. Moreover, a new data classification algorithm, resembling the k-nearest neighbors scheme but less computationally and memory demanding, is introduced. Rich examples in complex data domains indicate the usability of the methods and aggregation theory in general.
Keywords. fuzzy k-means algorithm, clustering, classification, fusion functions, penalty minimizers
genie
package for R
(co-authors: Maciej Bartoszuk and Anna Cena).
A detailed description of the algorithm will be available in a forthcoming paper of ours.
Data Fusion Book Now Available
Abstract: The famous Hirsch index was introduced only ca. 10 years ago. Despite that, it is already widely used in many decision-making tasks, such as the evaluation of individual scientists, research grant allocation, or even production planning. It is known that the h-index is related to the discrete Sugeno integral and the Ky Fan metric introduced in the 1940s. The aim of this paper is to propose a few modifications of this index as well as other fuzzy integrals -- also on bounded chains -- that lead to better discrimination of some types of data that are to be aggregated. All of the suggested compensation methods try to retain the simplicity of the original measure.
Accepted Paper in European Physical Journal B
Abstract: Hirsch's h-index is perhaps the most popular citation-based measure of scientific excellence. In 2013 G. Ionescu and B. Chopard proposed an agent-based model for this index to describe a publication and citation generation process in an abstract scientific community. With such an approach one can simulate a single scientist's activity, and by extension investigate the whole community of researchers. Even though this approach predicts the h-index from bibliometric data quite well, only a solution based on simulations was given. In this paper, we complete their results with exact, analytic formulas. What is more, due to our exact solution we are able to simplify the Ionescu-Chopard model, which allows us to obtain a compact formula for the h-index. Moreover, a simulation study designed to compare both the approximate and the exact solutions is included. The last part of this paper presents an evaluation of the obtained results on a real-world data set.
IPMU 2016 Special Session:
Computational Aspects of Data Aggregation and Complex Data Fusion
Important dates:
The proceedings of IPMU 2016 will be published in Communications in Computer and Information Science (CCIS) with Springer. Papers must be prepared in the LNCS/CCIS one-column page format. The length of papers is 12 pages in this special LaTeX2e format. For the details of submission click here.
Please feel free to disseminate this information to other researchers who may be interested in the session. For the details on the Session click here.
Notable changes since v0.5-2:
* [GENERAL] #88: C++ API is now available for use in, e.g., Rcpp packages, see https://github.com/Rexamine/ExampleRcppStringi for an example.
* [BUGFIX] #183: Floating point exception raised in `stri_sub()` and `stri_sub<-()` when `to` or `length` was a zero-length numeric vector.
* [BUGFIX] #180: `stri_c()` warned incorrectly (recycling rule) when using more than 2 elements.
Accepted Paper in Journal of Applied Statistics
Abstract: In this paper, we study the efficacy of the official ranking for international football teams compiled by FIFA, the body governing football competition around the globe. We present strategies for improving a team's position in the ranking. By combining several statistical techniques we derive an objective function in a decision problem of optimal scheduling of future matches. The presented results display how a team's position can be improved. Along the way, we compare the official procedure to the famous Elo rating system. Although it originates from chess, it has been successfully tailored to ranking football teams as well.
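For orientation, the classical Elo update mentioned in the abstract can be written as follows. This is the standard chess formulation with a logistic expectation on a 400-point scale; the K-factor of 20 is an arbitrary choice for this sketch, and football-specific variants (e.g., with goal-difference or match-importance multipliers) are deliberately left out.

```python
def elo_expected(rating_a, rating_b):
    """Expected score of A against B under the classical Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))

def elo_update(rating_a, rating_b, score_a, k=20.0):
    """Return updated ratings; score_a is 1 (win), 0.5 (draw), or 0 (loss)."""
    expected_a = elo_expected(rating_a, rating_b)
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

if __name__ == "__main__":
    print(elo_expected(1600, 1500))
    print(elo_update(1500, 1500, score_a=1))
```

Unlike a procedure that only accumulates points, the update depends on the strength of the opponent: beating a much higher-rated team moves the ratings far more than beating a weaker one.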
Scholarship for Outstanding Young Scientists
stringi
package is available on CRAN. As of now, about 850 CRAN packages depend (either directly or recursively) on stringi. Quite recently, the package was also listed among the top downloaded R extensions.
Notable changes since v0.4-1:
* [BACKWARD INCOMPATIBILITY] The second argument to `stri_pad_*()` has been renamed `width`.
* [GENERAL] #69: `stringi` is now bundled with ICU4C 55.1.
* [NEW FUNCTIONS] `stri_extract_*_boundaries()` extract text between text boundaries.
* [NEW FUNCTION] #46: `stri_trans_char()` is a `stringi`-flavoured `chartr()` equivalent.
* [NEW FUNCTION] #8: `stri_width()` approximates the *width* of a string in a more Unicodish fashion than `nchar(..., "width")`.
* [NEW FEATURE] #149: `stri_pad()` and `stri_wrap()` are now based by default on code point widths instead of the number of code points. Moreover, the default behavior of `stri_wrap()` is now such that it does not get rid of non-breaking, zero width, etc. spaces.
* [NEW FEATURE] #133: `stri_wrap()` silently allows for `width <= 0` (for compatibility with `strwrap()`).
* [NEW FEATURE] #139: `stri_wrap()` gained a new argument: `whitespace_only`.
* [NEW FUNCTIONS] #137: date-time formatting/parsing:
    * `stri_timezone_list()` - lists all known time zone identifiers
    * `stri_timezone_set()`, `stri_timezone_get()` - manage current default time zone
    * `stri_timezone_info()` - basic information on a given time zone
    * `stri_datetime_symbols()` - localizable date-time formatting data
    * `stri_datetime_fstr()` - convert a `strptime`-like format string to an ICU date/time format string
    * `stri_datetime_format()` - convert date/time to string
    * `stri_datetime_parse()` - convert string to date/time object
    * `stri_datetime_create()` - construct date-time objects from numeric representations
    * `stri_datetime_now()` - return current date-time
    * `stri_datetime_fields()` - get values for date-time fields
    * `stri_datetime_add()` - add specific number of date-time units to a date-time object
* [GENERAL] #144: Performance improvements in handling ASCII strings (these affect `stri_sub()`, `stri_locate()` and other string index-based operations).
* [GENERAL] #143: Searching for short fixed patterns (`stri_*_fixed()`) now relies on the current `libC`'s implementation of `strchr()` and `strstr()`. This is very fast e.g. on `glibc` utilizing the `SSE2/3/4` instruction set.
* [BUILD TIME] #141: a local copy of `icudt*.zip` may be used on package install; see the `INSTALL` file for more information.
* [BUILD TIME] #165: the `./configure` option `--disable-icu-bundle` forces the use of system ICU when building the package.
* [BUGFIX] locale specifiers are now normalized in a more intelligent way: e.g. `@calendar=gregorian` expands to `DEFAULT_LOCALE@calendar=gregorian`.
* [BUGFIX] #134: `stri_extract_all_words()` did not accept `simplify=NA`.
* [BUGFIX] #132: incorrect behavior in `stri_locate_regex()` for matches of zero lengths.
* [BUGFIX] stringr/#73: `stri_wrap()` returned `CHARSXP` instead of `STRSXP` on empty string input with `simplify=FALSE` argument.
* [BUGFIX] #164: `libicu-dev` usage used to fail on Ubuntu (`LIBS` shall be passed after `LDFLAGS` and the list of `.o` files).
* [BUGFIX] #168: Build now fails if `icudt` is not available.
* [BUGFIX] #135: C++11 is now used by default (see the `INSTALL` file, however) to build `stringi` from sources. This is because ICU4C uses the `long long` type which is not part of the C++98 standard.
* [BUGFIX] #154: Dates and other objects with a custom class attribute were not coerced to the character type correctly.
* [BUGFIX] Force ICU `u_init()` call on `stringi` dynlib load.
* [BUGFIX] #157: many overfull hboxes in the package PDF manual have been corrected.
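The difference between counting code points and estimating display width, which underlies the `stri_width()`, `stri_pad()`, and `stri_wrap()` entries above, can be illustrated outside of R. The Python sketch below is a rough analogue built on the standard `unicodedata` module; it is not stringi's algorithm, and the rule set (wide and fullwidth characters count as 2, combining marks and format characters as 0, everything else as 1) is a simplifying assumption.

```python
import unicodedata

def display_width(text):
    """Rough estimate of the number of terminal columns occupied by text."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch) or unicodedata.category(ch) in ("Mn", "Me", "Cf"):
            continue  # combining marks and format characters take no column
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width

if __name__ == "__main__":
    for s in ["abc", "日本語", "e\u0301"]:
        print(repr(s), "code points:", len(s), "width:", display_width(s))
```

Padding or wrapping by such a width, rather than by the number of code points, keeps columns aligned when the text mixes Latin, CJK, and accented characters.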
stringr Now Powered by stringi
AGOP 2015: Two Papers Accepted
Postdoctoral Research Visit @ IRAFM in Ostrava
IFSA-EUSFLAT 2015: 4 Papers Accepted
Four papers which I author or coauthor have been accepted for the IFSA-EUSFLAT 2015 conference in Gijon, Spain.
Accepted Paper in Journal of Informetrics
Cena A., Gagolewski M., Mesiar R., Problems and challenges of information resources producers' clustering, Journal of Informetrics, 2015, doi:10.1016/j.joi.2015.02.005; has been accepted for publication.
Abstract: Classically, unsupervised machine learning techniques are applied to data sets with a fixed number of attributes (variables). However, many problems encountered in the field of informetrics confront us with the need to extend these kinds of methods in such a way that they may be computed over a set of nonincreasingly ordered vectors of unequal lengths. Thus, in this paper, some new dissimilarity measures (metrics) are introduced and studied. Owing to that, we may use, among others, hierarchical clustering algorithms in order to determine an input data set's partition consisting of sets of producers that are homogeneous not only with respect to the quality of information resources, but also their quantity.
Notable changes since v0.3-1:
* [IMPORTANT CHANGE] `n_max` argument in `stri_split_*()` has been renamed `n`.
* [IMPORTANT CHANGE] `simplify=TRUE` in `stri_extract_all_*()` and `stri_split_*()` now calls `stri_list2matrix()` with `fill=""`. `fill=NA_character_` may be obtained by using `simplify=NA`.
* [IMPORTANT CHANGE, NEW FUNCTIONS] #120: `stri_extract_words` has been renamed `stri_extract_all_words`, `stri_locate_boundaries` has been renamed `stri_locate_all_boundaries`, and `stri_locate_words` has been renamed `stri_locate_all_words`. New functions are now available: `stri_locate_first_boundaries`, `stri_locate_last_boundaries`, `stri_locate_first_words`, `stri_locate_last_words`, `stri_extract_first_words`, `stri_extract_last_words`.
* [IMPORTANT CHANGE] #111: `opts_regex`, `opts_collator`, `opts_fixed`, and `opts_brkiter` can now be supplied individually via `...`. In other words, you may now simply call e.g. `stri_detect_regex(str, pattern, case_insensitive=TRUE)` instead of `stri_detect_regex(str, pattern, opts_regex=stri_opts_regex(case_insensitive=TRUE))`.
* [NEW FEATURE] #110: Fixed pattern search engine's settings can now be supplied via the `opts_fixed` argument in `stri_*_fixed()`, see `stri_opts_fixed()`. A simple (not suitable for natural language processing) yet very fast `case_insensitive` pattern matching can now be performed. `stri_extract_*_fixed` is available again.
* [NEW FEATURE] #23: `stri_extract_all_fixed`, `stri_count`, and `stri_locate_all_fixed` may now also look for overlapping pattern matches, see `?stri_opts_fixed`.
* [NEW FEATURE] #129: `stri_match_*_regex` gained a `cg_missing` argument.
* [NEW FEATURE] #117: `stri_extract_all_*()`, `stri_locate_all_*()`, `stri_match_all_*()` gained a new argument: `omit_no_match`. Setting it to `TRUE` makes these functions compatible with their `stringr` equivalents.
* [NEW FEATURE] #118: `stri_wrap()` gained `indent`, `exdent`, `initial`, and `prefix` arguments. Moreover, Knuth's dynamic word wrapping algorithm now assumes that the cost of printing the last line is zero, see #128.
* [NEW FEATURE] #122: `stri_subset()` gained an `omit_na` argument.
* [NEW FEATURE] `stri_list2matrix()` gained an `n_min` argument.
* [NEW FEATURE] #126: `stri_split()` is now also able to act just like `stringr::str_split_fixed()`.
* [NEW FEATURE] #119: `stri_split_boundaries()` now has `n`, `tokens_only`, and `simplify` arguments. Additionally, `stri_extract_all_words()` is now equipped with a `simplify` arg.
* [NEW FEATURE] #116: `stri_paste()` gained a new argument: `ignore_null`. Setting it to `TRUE` makes this function more compatible with `paste()`.
* [NEW FEATURE] #114: `stri_paste()`: `ignore_null` arg has been added.
* [OTHER] #123: `useDynLib` is used to speed up symbol look-up in the compiled dynamic library.
* [BUGFIX] #94: Run-time errors on Solaris caused by setting `-DU_DISABLE_RENAMING=1` -- memory allocation errors in i.a. ICU's UnicodeString. This setting also caused some ABSan sanity check failures within ICU code.
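A few of the changes listed above can be tried out as follows. This is only a minimal sketch, assuming a reasonably recent `stringi` is installed; the input strings and variable names are made up for illustration.

```r
library(stringi)

x <- c("Apple pie", "banana split", NA, "cherry")

# Engine options can be passed directly via `...` (#111) ...
direct <- stri_detect_regex(x, "^a", case_insensitive = TRUE)
# ... which is shorthand for building the options list explicitly.
verbose <- stri_detect_regex(x, "^a",
    opts_regex = stri_opts_regex(case_insensitive = TRUE))

# `n` (formerly `n_max`) caps the number of tokens; the last one keeps the rest.
first_two <- stri_split_fixed("a,b,c,d", ",", n = 2)

# simplify = TRUE pads the resulting matrix with "", simplify = NA pads with NA.
padded_empty <- stri_split_fixed(c("a,b,c", "d"), ",", simplify = TRUE)
padded_na <- stri_split_fixed(c("a,b,c", "d"), ",", simplify = NA)

# omit_no_match = TRUE yields character(0) instead of NA when nothing matches (#117).
no_match_na <- stri_extract_all_regex("abc", "[0-9]")
no_match_empty <- stri_extract_all_regex("abc", "[0-9]", omit_no_match = TRUE)

# ignore_null = TRUE makes stri_paste() skip zero-length arguments (#116).
pasted <- stri_paste("a", NULL, "b", ignore_null = TRUE)

print(direct)
print(first_two)
print(padded_empty)
print(pasted)
```

Passing options through `...` keeps one-off calls short, while an explicit `stri_opts_regex()` object remains handy when the same settings are reused across many calls.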
Research Project 2014/13/D/HS4/01700 (NCN)
Notable changes since v0.2-5:
* [IMPORTANT CHANGE] #87: `%>%` overlapped with the pipe operator from the `magrittr` package; now each operator like `%>%` has been renamed `%s>%`.
* [IMPORTANT CHANGE] #108: Now the BreakIterator (for text boundary analysis) may be better controlled via `stri_opts_brkiter()` (see options `type` and `locale`, which aim to replace the now-removed `boundary` and `locale` parameters to `stri_locate_boundaries`, `stri_split_boundaries`, `stri_trans_totitle`, `stri_extract_words`, `stri_locate_words`).
* [NEW FUNCTIONS] #109: `stri_count_boundaries` and `stri_count_words` count the number of text boundaries in a string.
* [NEW FUNCTIONS] #41: `stri_startswith_*` and `stri_endswith_*` determine whether a string starts or ends with a given pattern.
* [NEW FEATURE] #102: `stri_replace_all_*` gained a `vectorize_all` parameter, which defaults to `TRUE` for backward compatibility.
* [NEW FUNCTION] #91: `stri_subset_*`, a convenient and more efficient substitute for `str[stri_detect_*(str, ...)]`, added.
* [NEW FEATURE] #100: `stri_split_fixed`, `stri_split_charclass`, `stri_split_regex`, `stri_split_coll` gained a `tokens_only` parameter, which defaults to `FALSE` for backward compatibility.
* [NEW FUNCTION] #105: `stri_list2matrix` converts lists of atomic vectors to character matrices, useful in connection with `stri_split` and `stri_extract`.
* [NEW FEATURE] #107: `stri_split_*` now allow setting an `omit_empty=NA` argument.
* [NEW FEATURE] #106: `stri_split` and `stri_extract_all` gained a `simplify` argument (if `TRUE`, then `stri_list2matrix(..., byrow=TRUE)` is called on the resulting list).
* [NEW FUNCTION] #77: `stri_rand_lipsum` generates (pseudo)random dummy *lorem ipsum* text.
* [NEW FEATURE] #98: `stri_trans_totitle` gained an `opts_brkiter` parameter; it indicates which ICU BreakIterator should be used when performing case mapping.
* [NEW FEATURE] `stri_wrap` gained a new parameter: `normalize`.
* [BUGFIX] #86: `stri_*_fixed`, `stri_*_coll`, and `stri_*_regex` could give incorrect results if one of the search strings was of length 0.
* [BUGFIX] #99: `stri_replace_all` did not use the `replacement` arg.
* [BUGFIX] #94: `R CMD check` should no longer fail if `icudt` download failed.
* [BUGFIX] #112: Some of the objects were not PROTECTed from being garbage collected, which might have caused spontaneous SEGFAULTS.
* [BUGFIX] Some collator's options were not passed correctly to ICU services.
* [BUGFIX] Memory leaks detected by `valgrind --tool=memcheck --leak-check=full` have been removed.
* [DOCUMENTATION] Significant extensions/clean ups in the stringi manual.
Refer to NEWS for a complete list of changes, new features and bug fixes.
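Some of the functions and parameters introduced in this release can be illustrated as below. The snippet is a rough sketch rather than an excerpt from the package manual; it assumes an installed `stringi`, and all inputs and variable names are invented for the example.

```r
library(stringi)

x <- c("stringi", "stringr", "base")

# stri_startswith_* / stri_endswith_* test string prefixes and suffixes (#41).
starts <- stri_startswith_fixed(x, "str")
ends <- stri_endswith_fixed(x, "i")

# stri_subset_* is a shortcut for x[stri_detect_*(x, ...)] (#91).
kept <- stri_subset_fixed(x, "str")

# stri_count_words counts words as determined by ICU's BreakIterator (#109).
n_words <- stri_count_words("The quick brown fox")

# The comparison operators now carry an `s` prefix, e.g. %s<% (#87).
is_before <- "a" %s<% "b"

# tokens_only = TRUE drops the unsplit remainder when `n` is given (#100).
tokens <- stri_split_fixed("a_b_c_d", "_", n = 2, tokens_only = TRUE)

# stri_list2matrix pads shorter vectors with NA by default (#105).
m <- stri_list2matrix(list(c("a", "b"), "c"), byrow = TRUE)

# vectorize_all = FALSE applies all the replacements to each string in turn (#102).
sequential <- stri_replace_all_fixed("The quick brown fox",
    c("quick", "brown"), c("slow", "red"), vectorize_all = FALSE)

print(kept)
print(m)
print(sequential)
```

With the default `vectorize_all=TRUE`, the last call would instead return two strings, one per pattern/replacement pair.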
Advanced Data Analysis Software Development with R (e-learning @ ICS PAS)
FuzzyNumbers_0.3-5 Now Available
* Added proper import directives in NAMESPACE.
* piecewiseLinearApproximation: method="ApproximateNearestEuclidean" is no longer accepted; use "NearestEuclidean" instead.
* The package vignette is now in the vignettes/ directory.
Spread Measures and Their Relation to Aggregation Functions – Accepted Paper
Abstract: The theory of aggregation most often deals with measures of central tendency. However, sometimes a very different kind of a numeric vector's synthesis into a single number is required. In this paper we introduce a class of mathematical functions which aim to measure spread or scatter of one-dimensional quantitative data. The proposed definition serves as a common, abstract framework for measures of absolute spread known from statistics, exploratory data analysis and data mining, e.g. the sample variance, standard deviation, range, interquartile range (IQR), median absolute deviation (MAD), etc. Additionally, we develop new measures of experts' opinions diversity or consensus in group decision making problems. We investigate some properties of spread measures, show how are they related to aggregation functions, and indicate their new potentially fruitful application areas.
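The classical measures of absolute spread named in the abstract are all available in base R. The sketch below merely computes them on a made-up sample and checks two properties they share (they are zero on a constant vector and unaffected by shifting the data); it is not the paper's formal definition, and the helper `spread_summary` is hypothetical.

```r
# Hypothetical helper: classical measures of absolute spread from base R/stats.
spread_summary <- function(x) {
    c(
        variance = var(x),
        sd = sd(x),
        range = diff(range(x)),
        iqr = IQR(x),
        mad = mad(x, constant = 1)  # raw median absolute deviation, no rescaling
    )
}

x <- c(2, 4, 4, 5, 9)  # made-up sample

original <- spread_summary(x)
shifted <- spread_summary(x + 100)            # translating the data
constant <- spread_summary(rep(3, length(x))) # no variability at all

print(original)
print(all.equal(original, shifted))
print(constant)
```

By contrast, a measure of central tendency such as `mean(x)` moves by 100 under the same shift, which is one way to see why spread calls for a different class of functions.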
Notable changes since v0.1-25:
* [IMPORTANT CHANGE] stri_cmp* no longer allow passing opts_collator=NA. From now on, stri_cmp_eq, stri_cmp_neq, and the new operators %===%, %!==%, %stri===%, and %stri!==% are locale-independent operations, which are based on code point comparisons. New functions stri_cmp_equiv and stri_cmp_nequiv (and from now on also %==%, %!=%, %stri==%, and %stri!=%) test for canonical equivalence.
* [IMPORTANT CHANGE] stri_*_fixed search functions now perform a locale-independent exact (bytewise, of course after conversion to UTF-8) pattern search. All the Collator-based, locale-dependent search routines are now available via stri_*_coll. The reason for this is that ICU USearch currently has very poor performance and in many search tasks it is in fact sufficient to do exact pattern matching.
* [IMPORTANT CHANGE] stri_enc_nf* and stri_enc_isnf* function families have been renamed to stri_trans_nf* and stri_trans_isnf*, respectively. This is because they deal with text transforming, and not with character encoding. Moreover, all such operations may be performed by ICU's Transliterator (see below).
* [IMPORTANT CHANGE] stri_*_charclass search functions now rely solely on ICU's UnicodeSet patterns. All previously accepted charclass identifiers became invalid. However, new patterns should now be more familiar to the users (they are regex-like). Moreover, we observe a very nice performance gain.
* [IMPORTANT CHANGE] stri_sort now does not include NAs in output vectors by default, for compatibility with sort(). Moreover, currently none of the input vector's attributes are preserved.
* [NEW FUNCTION] stri_trans_general and stri_trans_list give access to ICU's Transliterator, which may be used to perform very general text transforms.
* [NEW FUNCTION] stri_split_boundaries utilizes ICU's BreakIterator to split strings at specific text boundaries. Moreover, stri_locate_boundaries indicates positions of these boundaries.
* [NEW FUNCTION] stri_extract_words uses ICU's BreakIterator to extract all words from a text. Additionally, stri_locate_words locates start and end positions of words in a text.
* [NEW FUNCTION] stri_pad, stri_pad_left, stri_pad_right, stri_pad_both pad a string with a specific code point.
* [NEW FUNCTION] stri_wrap breaks paragraphs of text into lines. Two algorithms (greedy and minimal-raggedness) are available.
* [NEW FUNCTION] stri_unique extracts unique elements from a character vector.
* [NEW FUNCTIONS] stri_duplicated and stri_duplicated_any determine duplicate elements in a character vector.
* [NEW FUNCTION] stri_replace_na replaces NAs in a character vector with a given string, useful for emulating e.g. R's paste() behavior.
* [NEW FUNCTION] stri_rand_shuffle generates a random permutation of code points in a string.
* [NEW FUNCTION] stri_rand_strings generates random strings.
* [NEW FUNCTIONS] New functions and binary operators for string comparison: stri_cmp_eq, stri_cmp_neq, stri_cmp_lt, stri_cmp_le, stri_cmp_gt, stri_cmp_ge, %==%, %!=%, %<%, %<=%, %>%, %>=%.
* [NEW FUNCTION] stri_enc_mark reads declared encodings of character strings as seen by stringi.
* [NEW FUNCTION] stri_enc_tonative(str) is an alias to stri_encode(str, NULL, NULL).
* [NEW FEATURE] stri_order and stri_sort now have an additional argument `na_last` (defaults to TRUE and NA, respectively).
* [NEW FEATURE] stri_replace_all_charclass now has a `merge` arg (defaults to FALSE for backward-compatibility). It may be used to e.g. replace sequences of white spaces with a single space.
* [NEW FEATURE] stri_enc_toutf8 now has a new `validate` arg (defaults to FALSE for backward-compatibility). It may be used in a (rare) case in which a user wants to fix an invalid UTF-8 byte sequence. stri_length (among others) now detects invalid UTF-8 byte sequences.
* [NEW FEATURE] All binary operators %???% now also have aliases %stri???%.
* stri_*_fixed now use a tweaked Knuth-Morris-Pratt search algorithm, which improves the search performance drastically.
* Significant performance improvements in stri_join, stri_flatten, stri_cmp, stri_trans_to*, and others.
Refer to NEWS for a complete list of changes, new features and bug fixes.
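The difference between code point equality and canonical equivalence, together with a few of the new helpers, can be sketched as follows. This is an illustrative example only: it assumes a recent `stringi`, uses the `%stri==%` operator alias (the short operator names were changed in a later release), and all variable names are made up.

```r
library(stringi)

precomposed <- "\u0105"  # LATIN SMALL LETTER A WITH OGONEK
decomposed <- "a\u0328"  # "a" followed by COMBINING OGONEK

# stri_cmp_eq compares code points, so the two spellings differ ...
same_code_points <- stri_cmp_eq(precomposed, decomposed)
# ... whereas stri_cmp_equiv and %stri==% test for canonical equivalence.
equivalent <- stri_cmp_equiv(precomposed, decomposed)
equivalent_op <- precomposed %stri==% decomposed

# stri_pad_left pads a string up to a given width with a chosen code point.
padded <- stri_pad_left("7", 3, "0")

# merge = TRUE replaces each run of matching code points with one replacement.
squeezed <- stri_replace_all_charclass("a   b \t c", "\\p{WHITE_SPACE}", " ",
    merge = TRUE)

# Duplicate detection, NA replacement, and NA-dropping sort.
v <- c("a", "b", "a")
uniq <- stri_unique(v)
dups <- stri_duplicated(v)
first_dup <- stri_duplicated_any(v)  # index of the first duplicate, 0 if none
no_na <- stri_replace_na(c("a", NA), "missing")
sorted <- stri_sort(c("b", NA, "a"))

print(c(same_code_points, equivalent, equivalent_op))
print(squeezed)
print(sorted)
```

Code point comparison is the cheaper choice when the text is known to be normalized already; the equivalence test is the safer one for input of unknown origin.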
Paper on OM3 Operators Accepted in FSS
IPMU 2014: Two Papers Accepted
Programowanie w Języku R [Programming in R]
** FuzzyNumbers Package CHANGELOG **
***************************************************************************
0.3-3 /2014-01-03/
* piecewiseLinearApproximation() now supports the new method="SupportCorePreserving", see Coroianu L., Gagolewski M., Grzegorzewski P., Adabitabar Firozja M., Houlari T., Piecewise Linear Approximation of Fuzzy Numbers Preserving the Support and Core, 2014 (submitted for publication).
* piecewiseLinearApproximation() no longer fails on exceptions thrown by integrate(); it falls back to a Newton-Cotes formula.
* Removed the `Suggests` dependency: testthat tests are now available to developers via the FuzzyNumbers GitHub repository.
* Package manual has been corrected and extended.
* Package vignette is now only available online at http://FuzzyNumbers.rexamine.com.
Accepted Paper on Applications of Monotone Measures and Universal Integrals
Abstract: The Choquet, Sugeno, and Shilkret integrals with respect to monotone measures, as well as their generalization – the universal integral, stand for a useful tool in decision support systems. In this paper we propose a general construction method for aggregation operators that may be used in assessing output of scientists. We show that the most often currently used indices of bibliometric impact, like Hirsch's h, Woeginger's w, Egghe's g, Kosmulski's MAXPROD, and similar constructions, may be obtained by means of our framework. Moreover, the model easily leads to some new, very interesting functions.
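To make the indices named in the abstract concrete, here is a small base R sketch that computes the h-index, the g-index, and MAXPROD for a made-up citation record. It also evaluates the max-min (Sugeno-type) and max-product (Shilkret-type) formulas with respect to the counting measure, which for integer citation counts reproduce h and MAXPROD, respectively. The function names are hypothetical, the g-index variant used here is capped at the number of papers, and none of this is the paper's general construction.

```r
# All helpers below are illustrative; x is a vector of citation counts.
sorted_desc <- function(x) sort(x, decreasing = TRUE)

# Hirsch's h: the largest h such that h papers have at least h citations each.
h_index <- function(x) {
    x <- sorted_desc(x)
    sum(x >= seq_along(x))
}

# Egghe's g (capped at the number of papers): the largest g such that
# the g most cited papers have at least g^2 citations in total.
g_index <- function(x) {
    x <- sorted_desc(x)
    sum(cumsum(x) >= seq_along(x)^2)
}

# Kosmulski's MAXPROD: max over i of i times the i-th largest value.
maxprod <- function(x) {
    x <- sorted_desc(x)
    max(x * seq_along(x))
}

# Max-min formula w.r.t. the counting measure (a Sugeno-type integral).
sugeno_counting <- function(x) {
    x <- sorted_desc(x)
    max(pmin(x, seq_along(x)))
}

citations <- c(3, 10, 0, 5, 8, 4)  # made-up citation record

print(h_index(citations))
print(g_index(citations))
print(maxprod(citations))
print(sugeno_counting(citations))
```

Swapping the counting measure or the way the sorted values are combined with it gives other indices of the same family, which is the kind of flexibility the abstract alludes to.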
stringi is THE R package for correct, fast, and simple string processing in each locale and native charset. Another alpha release (for testing purposes) can be automatically downloaded by calling in R:
source('http://stringi.rexamine.com/install.R') # Message from the future: the link is outdated
The auto-installer gives access to a Windows i386/x64 build for R 3.0 or allows building the package from sources on Linux or MacOS.
UPDATE@2013-11-13. Version 0.1-10 now available.
Includes some bugfixes. Moreover, on Linux/UNIX, `./configure` now first tries to read build settings from `pkg-config` (as the usage of `icu-config` is deprecated).
UPDATE@2013-11-16. Version 0.1-11 now available.
ICU4C is now statically linked on Windows, so there is no need to download any additional libraries; a binary version is now available for R 2.15.X and 3.0.X. Moreover, on platforms where packages are built from sources, the `./configure` script now tries to find ICU4C automagically.
UPDATE@2013-11-21. Build of version 0.1-11 now available for OS X (x64) and R 3.0. Have fun.
UPDATE@2014-02-15. Version 0.1-20 (source and Win_build only) now available. Now it does not depend on any external ICU library (the library source code is included).
stringi: THE String Processing Package for R **alpha release**
The alpha release (for testing purposes) is available here (includes Windows i386/x64 build for R 3.0). Any comments and suggestions are welcome!
"Scientific Impact Assessment Cannot Be Fair" Accepted for Publication
Gagolewski M., Scientific Impact Assessment Cannot be Fair, Journal of Informetrics 7(4), 2013, pp. 792-802.
Abstract: In this paper we deal with the problem of aggregating numeric sequences of arbitrary length that represent e.g. citation records of scientists. Impact functions are the aggregation operators that express as a single number not only the quality of individual publications, but also their author's productivity.
We examine some fundamental properties of these aggregation tools. It turns out that each impact function which always gives indisputable valuations must necessarily be trivial. Moreover, it is shown that for any set of citation records in which none is dominated by the other, we may construct an impact function that gives any a priori-established authors' ordering. Theoretically then, there is considerable room for manipulation in the hands of decision makers.
We also discuss the differences between the impact function-based and the multicriteria decision making-based approach to scientific quality management, and study how the introduction of new properties of impact functions affects the assessment process. We argue that simple mathematical tools like the h- or g-index (as well as other bibliometric impact indices) may not necessarily be a good choice when it comes to assessing scientific achievements.
** FuzzyNumbers Package CHANGELOG **
***************************************************************************
0.3-1 /2013-06-23/
* piecewiseLinearApproximation() - general case (any knot.n) for method="NearestEuclidean" now available. Thus, method="ApproximateNearestEuclidean" is now deprecated.
* New binary arithmetic operators, especially for PiecewiseLinearFuzzyNumbers: +, -, *, /
* New method: fapply() - applies a function on a PLFN using the extension principle.
* New methods: as.character(); also used by show(). This function can also generate LaTeX code defining the FN (toLaTeX arg, thanks to Jan Caha).
* as.FuzzyNumber(), as.TriangularFuzzyNumber(), as.PowerFuzzyNumber(), and as.PiecewiseLinearFuzzyNumber() are now S4 methods, and can be called on objects of type numeric, as well as on various FNs.
* piecewiseLinearApproximation() and as.PiecewiseLinearFuzzyNumber() argument `knot.alpha` now defaults to equally distributed knots (via given `knot.n`). If `knot.n` is missing, then it is guessed from `knot.alpha`.
* PiecewiseLinearFuzzyNumber() now accepts missing `a1`, `a2`, `a3`, `a4`, and `knot.left`, `knot.right` of length `knot.n`+2. Moreover, if `knot.n` is not given, then it is guessed from length(knot.left). If `knot.alpha` is missing, then the knots will be equally distributed on the interval [0,1].
* alphacut() now always returns a named two-column matrix. evaluate() returns a named vector.
* New function: TriangularFuzzyNumber - returns a TrapezoidalFuzzyNumber.
* Functions renamed: convert.side to convertSide, convert.alpha to convertAlpha, approx.invert to approxInvert.
* Added a call to setGeneric("plot", function(x, y, ...) ... to avoid a warning on install.
* The FuzzyNumbers Tutorial has been properly included as the package's vignette.
* DiscontinuousFuzzyNumber class has been marked as **EXPERIMENTAL** in the manual.
* Man pages extensively updated.
* FuzzyNumbers devel repo moved to GitHub.
Abstract: Recently, a very interesting relation between symmetric minitive, maxitive, and modular aggregation operators has been shown. It turns out that the intersection between any pair of the mentioned classes is the same. This result introduces what we here propose to call the OM3 operators. In the first part of our contribution on the analysis of the OM3 operators we study some properties that may be useful when aggregating input vectors of varying lengths. In Part II we will perform a thorough simulation study of the impact of input vectors' calibration on the aggregation results.
Abstract: This article is a second part of the contribution on the analysis of the recently-proposed class of symmetric maxitive, minitive and modular aggregation operators. Recent results (Gagolewski, Mesiar, 2012) indicated some unstable behavior of the generalized h-index, which is a particular instance of OM3, in case of input data transformation. The study was performed on a small, carefully selected real-world data set. Here we conduct some experiments to examine these phenomena more extensively.
Postdoctoral Research Visit @ Slovak University of Technology