The stringi package is available on CRAN. As of now, about 850 CRAN packages depend (either directly or recursively) on
stringi, and the package was recently listed among the most downloaded R extensions.
Notable changes since v0.4-1:
* [BACKWARD INCOMPATIBILITY] The second argument to `stri_pad_*()` has been renamed `width`.
* [GENERAL] #69: `stringi` is now bundled with ICU4C 55.1.
* [NEW FUNCTIONS] `stri_extract_*_boundaries()` extract text between text boundaries.
* [NEW FUNCTION] #46: `stri_trans_char()` is a `stringi`-flavoured `chartr()` equivalent.
* [NEW FUNCTION] #8: `stri_width()` approximates the *width* of a string in a more Unicodish fashion than `nchar(..., "width")`.
* [NEW FEATURE] #149: `stri_pad()` and `stri_wrap()` are now by default based on code point widths instead of the number of code points. Moreover, the default behavior of `stri_wrap()` is now such that it does not get rid of non-breaking, zero width, etc. spaces.
* [NEW FEATURE] #133: `stri_wrap()` silently allows for `width <= 0` (for compatibility with `strwrap()`).
* [NEW FEATURE] #139: `stri_wrap()` gained a new argument: `whitespace_only`.
* [NEW FUNCTIONS] #137: date-time formatting/parsing:
    * `stri_timezone_list()` - lists all known time zone identifiers
    * `stri_timezone_set()`, `stri_timezone_get()` - manage current default time zone
    * `stri_timezone_info()` - basic information on a given time zone
    * `stri_datetime_symbols()` - localizable date-time formatting data
    * `stri_datetime_fstr()` - convert a `strptime`-like format string to an ICU date/time format string
    * `stri_datetime_format()` - convert date/time to string
    * `stri_datetime_parse()` - convert string to date/time object
    * `stri_datetime_create()` - construct date-time objects from numeric representations
    * `stri_datetime_now()` - return current date-time
    * `stri_datetime_fields()` - get values for date-time fields
    * `stri_datetime_add()` - add specific number of date-time units to a date-time object
* [GENERAL] #144: Performance improvements in handling ASCII strings (these affect `stri_sub()`, `stri_locate()` and other string index-based operations).
* [GENERAL] #143: Searching for short fixed patterns (`stri_*_fixed()`) now relies on the current `libC`'s implementation of `strchr()` and `strstr()`. This is very fast e.g. on `glibc` utilizing the `SSE2/3/4` instruction set.
* [BUILD TIME] #141: a local copy of `icudt*.zip` may be used on package install; see the `INSTALL` file for more information.
* [BUILD TIME] #165: the `./configure` option `--disable-icu-bundle` forces the use of system ICU when building the package.
* [BUGFIX] locale specifiers are now normalized in a more intelligent way: e.g. `@calendar=gregorian` expands to `DEFAULT_LOCALE@calendar=gregorian`.
* [BUGFIX] #134: `stri_extract_all_words()` did not accept `simplify=NA`.
* [BUGFIX] #132: incorrect behavior in `stri_locate_regex()` for matches of zero lengths.
* [BUGFIX] stringr/#73: `stri_wrap()` returned `CHARSXP` instead of `STRSXP` on empty string input with `simplify=FALSE` argument.
* [BUGFIX] #164: `libicu-dev` usage used to fail on Ubuntu (`LIBS` shall be passed after `LDFLAGS` and the list of `.o` files).
* [BUGFIX] #168: Build now fails if `icudt` is not available.
* [BUGFIX] #135: C++11 is now used by default (see the `INSTALL` file, however) to build `stringi` from sources. This is because ICU4C uses the `long long` type which is not part of the C++98 standard.
* [BUGFIX] #154: Dates and other objects with a custom class attribute were not coerced to the character type correctly.
* [BUGFIX] Force ICU `u_init()` call on `stringi` dynlib load.
* [BUGFIX] #157: many overfull hboxes in the package PDF manual have been corrected.
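
A few of the changes listed above can be tried out as follows. This is only a minimal sketch, assuming a `stringi` version that includes these features is installed; the input strings and variable names are made up for illustration.

```r
library(stringi)

# stri_pad_*(): the second argument is now called `width`
padded_left <- stri_pad_left("abc", width = 6)             # "   abc"
padded_both <- stri_pad_both("abc", width = 7, pad = "*")  # "**abc**"

# stri_trans_char(): chartr()-like translation of single characters
translated <- stri_trans_char("banana", "an", "01")        # "b01010"

# stri_width(): approximate display width; the two CJK ideographs
# below are 2 code points, but each occupies two columns
w_ascii <- stri_width("abc")                               # 3
w_cjk   <- stri_width("\u4f60\u597d")                      # 4

# date-time helpers: convert a strptime-like format string to an ICU
# pattern and format a date-time built from numeric fields
icu_format <- stri_datetime_fstr("%Y-%m-%d")
some_day   <- stri_datetime_create(2015, 4, 1)
stri_datetime_format(some_day, icu_format)
```

Note how `stri_width()` differs from a plain code point count (`stri_length()`), which is what the new default behavior of `stri_pad()` and `stri_wrap()` builds upon.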
Four papers that I authored or coauthored have been accepted for the IFSA-EUSFLAT 2015 conference in Gijón, Spain.
The paper Cena A., Gagolewski M., Mesiar R., Problems and challenges of information resources producers' clustering, Journal of Informetrics, 2015, doi:10.1016/j.joi.2015.02.005, has been accepted for publication.
Abstract: Classically, unsupervised machine learning techniques are applied on data sets with fixed number of attributes (variables). However, many problems encountered in the field of informetrics face us with the need to extend these kinds of methods in a way such that they may be computed over a set of nonincreasingly ordered vectors of unequal lengths. Thus, in this paper, some new dissimilarity measures (metrics) are introduced and studied. Owing to that we may use i.a. hierarchical clustering algorithms in order to determine an input data set's partition consisting of sets of producers that are homogeneous not only with respect to the quality of information resources, but also their quantity.
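
To give a rough feeling for the setting described in the abstract, here is a minimal base R sketch. Please note that the `dissim()` function below is a made-up stand-in (Euclidean distance between zero-padded vectors plus a penalty `p` for the difference in lengths) and is not claimed to be any of the measures studied in the paper; the `producers` data are hypothetical as well.

```r
# Hypothetical producers: nonincreasingly ordered vectors of unequal lengths
producers <- list(
  A = c(10, 7, 3),
  B = c(9, 8, 2, 1),
  C = c(2, 1),
  D = c(3)
)

# Illustrative dissimilarity (an assumption, NOT the paper's measure):
# zero-pad the shorter vector, take the Euclidean distance, and add
# a penalty p for each unit of difference in lengths
dissim <- function(x, y, p = 1) {
  n  <- max(length(x), length(y))
  xp <- c(x, rep(0, n - length(x)))
  yp <- c(y, rep(0, n - length(y)))
  sqrt(sum((xp - yp)^2)) + p * abs(length(x) - length(y))
}

# Pairwise dissimilarity matrix
k <- length(producers)
dmat <- matrix(0, k, k, dimnames = list(names(producers), names(producers)))
for (i in seq_len(k)) {
  for (j in seq_len(k)) {
    dmat[i, j] <- dissim(producers[[i]], producers[[j]])
  }
}

# Hierarchical clustering on the custom dissimilarities
hc     <- hclust(as.dist(dmat), method = "complete")
groups <- cutree(hc, k = 2)
print(groups)
```

In this toy example, the length penalty makes the dissimilarity sensitive to the quantity of resources and not only to their quality, which is the kind of behavior the abstract asks for.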
Notable changes since v0.3-1:
* [IMPORTANT CHANGE] `n_max` argument in `stri_split_*()` has been renamed `n`.
* [IMPORTANT CHANGE] `simplify=TRUE` in `stri_extract_all_*()` and `stri_split_*()` now calls `stri_list2matrix()` with `fill=""`. `fill=NA_character_` may be obtained by using `simplify=NA`.
* [IMPORTANT CHANGE, NEW FUNCTIONS] #120: `stri_extract_words` has been renamed `stri_extract_all_words` and `stri_locate_boundaries` - `stri_locate_all_boundaries` as well as `stri_locate_words` - `stri_locate_all_words`. New functions are now available: `stri_locate_first_boundaries`, `stri_locate_last_boundaries`, `stri_locate_first_words`, `stri_locate_last_words`, `stri_extract_first_words`, `stri_extract_last_words`.
* [IMPORTANT CHANGE] #111: `opts_regex`, `opts_collator`, `opts_fixed`, and `opts_brkiter` can now be supplied individually via `...`. In other words, you may now simply call e.g. `stri_detect_regex(str, pattern, case_insensitive=TRUE)` instead of `stri_detect_regex(str, pattern, opts_regex=stri_opts_regex(case_insensitive=TRUE))`.
* [NEW FEATURE] #110: Fixed pattern search engine's settings can now be supplied via `opts_fixed` argument in `stri_*_fixed()`, see `stri_opts_fixed()`. A simple (not suitable for natural language processing) yet very fast `case_insensitive` pattern matching can be performed now. `stri_extract_*_fixed` is again available.
* [NEW FEATURE] #23: `stri_extract_all_fixed`, `stri_count`, and `stri_locate_all_fixed` may now also look for overlapping pattern matches, see `?stri_opts_fixed`.
* [NEW FEATURE] #129: `stri_match_*_regex` gained a `cg_missing` argument.
* [NEW FEATURE] #117: `stri_extract_all_*()`, `stri_locate_all_*()`, `stri_match_all_*()` gained a new argument: `omit_no_match`. Setting it to `TRUE` makes these functions compatible with their `stringr` equivalents.
* [NEW FEATURE] #118: `stri_wrap()` gained `indent`, `exdent`, `initial`, and `prefix` arguments. Moreover, Knuth's dynamic word wrapping algorithm now assumes that the cost of printing the last line is zero, see #128.
* [NEW FEATURE] #122: `stri_subset()` gained an `omit_na` argument.
* [NEW FEATURE] `stri_list2matrix()` gained an `n_min` argument.
* [NEW FEATURE] #126: `stri_split()` now is also able to act just like `stringr::str_split_fixed()`.
* [NEW FEATURE] #119: `stri_split_boundaries()` now has `n`, `tokens_only`, and `simplify` arguments. Additionally, `stri_extract_all_words()` is now equipped with `simplify` arg.
* [NEW FEATURE] #116: `stri_paste()` gained a new argument: `ignore_null`. Setting it to `TRUE` makes this function more compatible with `paste()`.
* [NEW FEATURE] #114: `stri_paste()`: `ignore_null` arg has been added.
* [OTHER] #123: `useDynLib` is used to speed up symbol look-up in the compiled dynamic library.
* [BUGFIX] #94: Run-time errors on Solaris caused by setting `-DU_DISABLE_RENAMING=1` -- memory allocation errors in i.a. ICU's UnicodeString. This setting also caused some ABSan sanity check failures within ICU code.
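
Some of the interface changes listed above are perhaps easiest to grasp by example. Below is a short sketch, assuming a `stringi` version with these features is installed; the sample strings and variable names are hypothetical.

```r
library(stringi)

x <- c("Apple pie", "banana", NA)

# #111: search engine options may be passed directly via `...`
short_call <- stri_detect_regex(x, "^a", case_insensitive = TRUE)
long_call  <- stri_detect_regex(x, "^a",
  opts_regex = stri_opts_regex(case_insensitive = TRUE))
identical(short_call, long_call)  # both give TRUE, FALSE, NA

# #23: overlapping matches of a fixed pattern
n_default <- stri_count_fixed("aaaa", "aa")                  # 2
n_overlap <- stri_count_fixed("aaaa", "aa", overlap = TRUE)  # 3

# `n` (formerly `n_max`) limits the number of tokens;
# the last token holds the remainder of the string
parts <- stri_split_fixed("a,b,c,d", ",", n = 2)

# simplify=TRUE pads the resulting matrix with "", simplify=NA with NA
m_empty <- stri_split_fixed(c("a,b,c", "d"), ",", simplify = TRUE)
m_na    <- stri_split_fixed(c("a,b,c", "d"), ",", simplify = NA)

# #116: ignore_null=TRUE treats NULL like paste() does
pasted <- stri_paste("a", NULL, "b", ignore_null = TRUE)     # "ab"
```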