Abstract: The theory of aggregation most often deals with measures of central tendency. However, sometimes a very different kind of synthesis of a numeric vector into a single number is required. In this paper we introduce a class of mathematical functions which aim to measure the spread or scatter of one-dimensional quantitative data. The proposed definition serves as a common, abstract framework for measures of absolute spread known from statistics, exploratory data analysis, and data mining, e.g. the sample variance, standard deviation, range, interquartile range (IQR), and median absolute deviation (MAD). Additionally, we develop new measures of the diversity or consensus of experts' opinions in group decision making problems. We investigate some properties of spread measures, show how they are related to aggregation functions, and indicate new, potentially fruitful application areas.
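The classical measures named in the abstract are all available in base R, so they can be compared side by side. The sketch below is only an illustration: the data vector and the helper name `spread_summary` are made up for this example, and the shift check shows one property these particular measures share, not the definition proposed in the paper.

```r
# Hypothetical helper: a few classical measures of absolute spread
spread_summary <- function(x) {
  c(
    variance = var(x),               # sample variance (denominator n - 1)
    sd       = sd(x),                # sample standard deviation
    range    = diff(range(x)),       # max(x) - min(x)
    iqr      = IQR(x),               # interquartile range (default quantile type)
    mad      = mad(x, constant = 1)  # raw median absolute deviation, no rescaling
  )
}

# Illustrative data with one outlier (made up for this example)
x <- c(1, 2, 3, 4, 100)
spread <- spread_summary(x)

# Shifting every observation by a constant leaves each of these measures unchanged
spread_shifted <- spread_summary(x + 10)
print(spread)
print(isTRUE(all.equal(spread, spread_shifted)))
```

Note how the single outlier inflates the variance, standard deviation and range, while the IQR and MAD barely react to it.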
Notable changes since v0.1-25:
* [IMPORTANT CHANGE] stri_cmp* no longer accept opts_collator=NA. From now on, stri_cmp_eq, stri_cmp_neq, and the new operators %===%, %!==%, %stri===%, and %stri!==% are locale-independent operations based on code point comparisons. The new functions stri_cmp_equiv and stri_cmp_nequiv (and from now on also %==%, %!=%, %stri==%, and %stri!=%) test for canonical equivalence.
* [IMPORTANT CHANGE] stri_*_fixed search functions now perform a locale-independent exact pattern search (bytewise, after conversion to UTF-8). All the Collator-based, locale-dependent search routines are now available via stri_*_coll. The reason is that ICU USearch currently performs very poorly, and in many search tasks exact pattern matching is in fact sufficient.
* [IMPORTANT CHANGE] The stri_enc_nf* and stri_enc_isnf* function families have been renamed to stri_trans_nf* and stri_trans_isnf*, respectively, because they deal with text transformation and not with character encoding. Moreover, all such operations may be performed by ICU's Transliterator (see below).
* [IMPORTANT CHANGE] stri_*_charclass search functions now rely solely on ICU's UnicodeSet patterns. All previously accepted charclass identifiers have become invalid. However, the new patterns should be more familiar to users (they are regex-like). Moreover, we observe a very nice performance gain.
* [IMPORTANT CHANGE] stri_sort no longer includes NAs in output vectors by default, for compatibility with sort(). Moreover, currently none of the input vector's attributes are preserved.
* [NEW FUNCTIONS] stri_trans_general and stri_trans_list give access to ICU's Transliterator, which may be used to perform very general text transforms.
* [NEW FUNCTION] stri_split_boundaries utilizes ICU's BreakIterator to split strings at specific text boundaries. Moreover, stri_locate_boundaries indicates the positions of these boundaries.
* [NEW FUNCTION] stri_extract_words uses ICU's BreakIterator to extract all words from a text. Additionally, stri_locate_words locates the start and end positions of words in a text.
* [NEW FUNCTIONS] stri_pad, stri_pad_left, stri_pad_right, and stri_pad_both pad a string with a specific code point.
* [NEW FUNCTION] stri_wrap breaks paragraphs of text into lines. Two algorithms (greedy and minimal-raggedness) are available.
* [NEW FUNCTION] stri_unique extracts unique elements from a character vector.
* [NEW FUNCTIONS] stri_duplicated and stri_duplicated_any determine duplicate elements in a character vector.
* [NEW FUNCTION] stri_replace_na replaces NAs in a character vector with a given string, which is useful e.g. for emulating R's paste() behavior.
* [NEW FUNCTION] stri_rand_shuffle generates a random permutation of code points in a string.
* [NEW FUNCTION] stri_rand_strings generates random strings.
* [NEW FUNCTIONS] New functions and binary operators for string comparison: stri_cmp_eq, stri_cmp_neq, stri_cmp_lt, stri_cmp_le, stri_cmp_gt, stri_cmp_ge, %==%, %!=%, %<%, %<=%, %>%, %>=%.
* [NEW FUNCTION] stri_enc_mark reads the declared encodings of character strings as seen by stringi.
* [NEW FUNCTION] stri_enc_tonative(str) is an alias for stri_encode(str, NULL, NULL).
* [NEW FEATURE] stri_order and stri_sort now have an additional argument `na_last` (defaults to TRUE and NA, respectively).
* [NEW FEATURE] stri_replace_all_charclass now has a `merge` argument (defaults to FALSE for backward compatibility). It may be used e.g. to replace sequences of white spaces with a single space.
* [NEW FEATURE] stri_enc_toutf8 now has a new `validate` argument (defaults to FALSE for backward compatibility). It may be used in the (rare) case in which a user wants to fix an invalid UTF-8 byte sequence. stri_length (among others) now detects invalid UTF-8 byte sequences.
* [NEW FEATURE] All binary operators %???% now also have aliases %stri???%.
* stri_*_fixed now use a tweaked Knuth-Morris-Pratt search algorithm, which improves the search performance drastically.
* Significant performance improvements in stri_join, stri_flatten, stri_cmp, stri_trans_to*, and others.
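The difference between code point comparison, canonical equivalence, and bytewise fixed search described above is easiest to see on a character that has both a precomposed and a decomposed form. The following is a minimal sketch, assuming the stringi package is installed; the variable names are illustrative only, and the exact result of the collator-based call may depend on the ICU version and collator options in use.

```r
library(stringi)

# "a with ogonek" in two canonically equivalent forms
precomposed <- "\u0105"    # single code point U+0105
decomposed  <- "a\u0328"   # U+0061 followed by combining ogonek U+0328

# Code point comparison: the code point sequences differ
eq_codepoints <- stri_cmp_eq(precomposed, decomposed)

# Canonical equivalence: the two forms are treated as the same text
eq_canonical <- stri_cmp_equiv(precomposed, decomposed)

# Exact (bytewise) search does not find the precomposed form in the decomposed one
found_fixed <- stri_detect_fixed(decomposed, precomposed)

# After NFC normalization (formerly stri_enc_nfc) the code points agree
eq_after_nfc <- stri_cmp_eq(stri_trans_nfc(decomposed), precomposed)

# Collator-based search is locale-sensitive; with primary strength, case
# differences are expected to be ignored
found_coll <- stri_detect_coll("Hello World", "WORLD",
                               opts_collator = stri_opts_collator(strength = 1))

print(c(eq_codepoints, eq_canonical, found_fixed, eq_after_nfc))
```

In short, stri_*_fixed is the fast choice when the pattern and the text are known to be in the same normalization form, while stri_*_coll remains available when linguistic matching is needed.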
Refer to NEWS for a complete list of changes, new features and bug fixes.
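A few of the new functions and arguments listed above can be tried out as follows. This is a short sketch, assuming the stringi package is installed; the input strings are made up for illustration, and arguments are passed positionally where their names may differ between releases.

```r
library(stringi)

# Pad a string on the left with a given code point up to a minimal width
padded <- stri_pad_left("7", 3, "0")

# Unique and duplicated elements of a character vector
uniq <- stri_unique(c("a", "b", "a"))
dups <- stri_duplicated(c("a", "b", "a"))

# Replace missing values with a given string
no_na <- stri_replace_na(c("a", NA), "<missing>")

# merge = TRUE replaces each run of white spaces with a single replacement
squeezed <- stri_replace_all_charclass("a  b \t c", "\\p{WHITE_SPACE}", " ",
                                       merge = TRUE)

# NAs are dropped by default; na_last = TRUE keeps them at the end
sorted_default <- stri_sort(c("b", NA, "a"))
sorted_with_na <- stri_sort(c("b", NA, "a"), na_last = TRUE)

# A general ICU transform: strip diacritics from Latin letters
ascii <- stri_trans_general("\u0105", "Latin-ASCII")

print(list(padded, uniq, dups, no_na, squeezed,
           sorted_default, sorted_with_na, ascii))
```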