Software & Data

Free (Libre) and Open Source Software

See also: my profiles on GitHub and StackOverflow

Featured: genieclust Package for Python and R

Fast and Robust Hierarchical Clustering with Noise Point Detection
Genie finds meaningful clusters and is fast even on large data sets.

Featured: stringi Package for R

THE String Processing Package for R
stringi (pronounced “stringy”, IPA [strinɡi]) is THE R package for very fast, portable, correct, consistent, and convenient string/text processing in any locale or character encoding. It is one of the most frequently downloaded packages on CRAN.

Other R Packages

stringx: Drop-in Replacements for Base String Functions Powered by stringi

Documentation and tutorials | CRAN | GitHub

realtest: When Expectations Meet Reality: Realistic Unit Testing in R

Documentation and tutorials | CRAN | GitHub

TurtleGraphics: Learn R Programming While Having a Jolly Time

(maintained by Barbara Żogała-Siudem)

More info | CRAN | GitHub | Tutorial [CRAN] | Online manual

agop: Aggregation Operators and Preordered Sets in R

CRAN | GitHub | Online manual

genie: A New, Fast, and Outlier Resistant Hierarchical Clustering Algorithm (superseded by genieclust)

More info | CRAN | GitHub | Online manual | Paper #1 | Paper #2

SimilaR: R Source Code Similarity Evaluation

(maintained by Maciej Bartoszuk)

CRAN | GitHub | Paper

FuzzyNumbers: Tools to Deal with Fuzzy Numbers in R

CRAN | GitHub | Tutorial [CRAN]

CITAN: CITation ANalysis Toolpack [deprecated]

CRAN | GitHub

Batteries of Benchmark Data

Datasets For Teaching and Training

See comment lines for a detailed description of each dataset.

How To Access

In R:

airlines <- read.csv("nycflights13_airlines.csv.gz", comment.char="#")
head(airlines)
##   carrier                     name
## 1      9E        Endeavor Air Inc.
## 2      AA   American Airlines Inc.
## 3      AS     Alaska Airlines Inc.
## 4      B6          JetBlue Airways
## 5      DL     Delta Air Lines Inc.
## 6      EV ExpressJet Airlines Inc.

In Python:

import pandas as pd
airlines = pd.read_csv("nycflights13_airlines.csv.gz",
    comment="#", compression="gzip")
airlines.head()

To print comment lines, call, e.g.:

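import gzip
# The comment lines form a header at the top of the file:
# print them and stop at the first line that does not start with "#".
with gzip.open("nycflights13_airlines.csv.gz", "rt") as f:
    for line in f:
        x = line.strip()
        if not x.startswith("#"):
            break
        print(x)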
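
If both the description and the data are needed at once, the two steps can be combined. The following is only a minimal sketch that uses the standard library alone; the helper name split_header_and_rows is made up for this illustration and is not part of any package. Unlike pandas with comment="#", it treats a line as a comment only when the line itself starts with "#".

```python
import csv
import gzip
import io


def split_header_and_rows(path, comment="#"):
    """Return (comment_lines, rows) for a gzipped CSV file.

    Hypothetical helper: lines starting with `comment` are collected
    as the dataset description; all remaining lines are parsed as CSV,
    with the first of them giving the column names.
    """
    comments, data = [], []
    with gzip.open(path, "rt", encoding="utf-8", newline="") as f:
        for line in f:
            if line.startswith(comment):
                comments.append(line.rstrip("\r\n"))
            else:
                data.append(line)
    # Each row becomes a dict mapping column name -> string value.
    rows = list(csv.DictReader(io.StringIO("".join(data))))
    return comments, rows
```

For instance, comments, rows = split_header_and_rows("nycflights13_airlines.csv.gz") would give the description lines in comments and one dictionary per airline in rows. All values stay as strings; no type conversion is attempted.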

Own Datasets

travel.stackexchange.com Data Dumps (simplified)

Licensed under CC BY-SA 3.0; see readme.txt for more details.

nycflights13

Hadley Wickham's nycflights13-0.2.1 (licensed under CC0, gzipped) – on-time data for all flights that departed NYC (i.e., JFK, LGA, or EWR) in 2013:

All the logs are available on the website of the US Department of Transportation. Arunkumar Srinivasan’s GitHub repository provides some nice R code for accessing the 2014 data.

babynames

Hadley Wickham's babynames-0.2.1 (licensed under CC0, gzipped) – US Baby Names 1880-2014:

fueleconomy

Hadley Wickham's fueleconomy-0.1 (licensed under CC0, gzipped) – fuel economy data from the EPA, 1985-2015:

nasaweather

Hadley Wickham's nasaweather-0.1 (licensed under GPL-3, gzipped):

R Built-ins

The following datasets are included in the datasets package for GNU R:

From Other Sources

Other — Links

Aggregates — Links

Misc

  1. My textbook Lightweight Machine Learning Classics with R (book draft, 2020)
  2. My textbook on R programming:
    Programowanie w języku R (2nd ed., PWN, Warszawa, 2016)
  3. My textbook on Python for data processing and analysis:
    Przetwarzanie i analiza danych w języku Python (1st ed., PWN, Warszawa, 2016)
  4. My textbook on statistical inference with R:
    Wnioskowanie statystyczne z wykorzystaniem środowiska R (1st ed., PR PW, Warszawa, 2014)
  5. Google Summer of Code 2016 Mentor;
    Project: RE2 Regular Expressions in R;
    Student: Qin Wenfeng
  6. StackOverflow Academic Research Partnership Program – Supervisor of a research task related to quantitative determinants of the popularity of online content (2019)
  7. My StackOverflow profile
  8. My GitHub profile

My Skills

"If you can implement something, this means you understand it."

Nowadays I develop most of my software with:

  • Python 3 and Cython
  • C++11 and C
  • R (preferably with Rcpp)
  • TeX (duh!)

I have some past experience in: PHP, Java, Julia, Mathematica, Maxima, HTML/CSS/JavaScript, SAS, Matlab/Scilab, C#, Lisp/Scheme, x86 assembly, Fortran, Visual Basic, VBA, C64 Basic, Pascal, ObjectPascal/Delphi, VHDL, Sinclair (ZX Spectrum) Logo, etc.

Libs: ICU, Theano, TensorFlow, Boost, OpenMP, MPI, CGAL, Qt, Gtk, SDL, OpenGL, Rcpp, etc.

I've been programming computers since the age of 7. My first computer was the C64. Here's a memento of my 2017 visit to the Computerspielemuseum in Berlin:

Playing with a C64 in the Computerspielemuseum, Berlin