Software & Data

Open Source Software

profile for gagolews at Stack Overflow, Q&A for professional and enthusiast programmers

Python Packages

genieclust: The Genie++ Hierarchical Clustering Algorithm with Noise Points Detection

Documentation & tutorials | PyPI | GitHub | Paper on the Genie algorithm

Featured R Packages

stringi: THE String Processing Package for R

(one of the most often downloaded packages on CRAN)

Documentation & tutorials | CRAN | GitHub

genieclust: The Genie++ Hierarchical Clustering Algorithm with Noise Points Detection

Documentation & tutorials | CRAN | GitHub | Paper on the Genie algorithm

Other R Packages

TurtleGraphics: Learn R Programming While Having a Jolly Time

(maintained by Barbara Żogała-Siudem)

More info | CRAN | GitHub | Tutorial [CRAN] | Online manual

agop: Aggregation Operators and Preordered Sets in R

CRAN | GitHub | Online manual

genie: A New, Fast, and Outlier Resistant Hierarchical Clustering Algorithm (superseded by genieclust)

More info | CRAN | GitHub | Online manual | Paper #1 | Paper #2

SimilaR: R Source Code Similarity Evaluation

(maintained by Maciej Bartoszuk)

CRAN | GitHub

FuzzyNumbers: Tools to Deal with Fuzzy Numbers in R

CRAN | GitHub | Tutorial [CRAN]

CITAN: CITation ANalysis Toolpack [deprecated]

CRAN | GitHub

Batteries of Benchmark Data

Datasets For Teaching and Training

See comment lines for a detailed description of each dataset.

How To Access

In R:

airlines <- read.csv("nycflights13_airlines.csv.gz", comment.char="#")
head(airlines)
##   carrier                     name
## 1      9E        Endeavor Air Inc.
## 2      AA   American Airlines Inc.
## 3      AS     Alaska Airlines Inc.
## 4      B6          JetBlue Airways
## 5      DL     Delta Air Lines Inc.
## 6      EV ExpressJet Airlines Inc.

In Python:

import pandas as pd
airlines = pd.read_csv("nycflights13_airlines.csv.gz",
    comment="#", compression="gzip")
airlines.head()

To print the leading comment lines, run, e.g.:

import gzip
with gzip.open("nycflights13_airlines.csv.gz", "rt") as f:
    for line in f:
        line = line.strip()
        # the comment header ends at the first line not starting with "#"
        if not line.startswith("#"):
            break
        print(line)
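
If both the description and the data are needed in one pass, the two steps above can be combined. The following is only a minimal sketch using the Python standard library; the helper name split_header_and_rows is hypothetical (it is not part of any package listed here), and it assumes that comment lines start with "#" and that no quoted field spans several lines.

```python
import csv
import gzip


def split_header_and_rows(path, comment_char="#"):
    """Return (comment_lines, rows) read from a gzipped CSV file.

    Hypothetical helper: every line starting with `comment_char` is
    treated as a comment; all remaining lines are parsed as CSV, with
    the first of them giving the column names.
    """
    comments, data_lines = [], []
    with gzip.open(path, "rt", encoding="utf-8", newline="") as f:
        for line in f:
            if line.lstrip().startswith(comment_char):
                comments.append(line.strip())
            else:
                data_lines.append(line)
    # csv.DictReader accepts any iterable of lines, not only file objects
    rows = list(csv.DictReader(data_lines))
    return comments, rows
```

For instance, comments, rows = split_header_and_rows("nycflights13_airlines.csv.gz") should give the dataset description in comments and one dictionary per airline in rows, provided the file is in the working directory.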

Own Datasets

travel.stackexchange.com Data Dumps (simplified)

Licensed under CC BY-SA 3.0; see readme.txt for more details.

nycflights13

Hadley Wickham's nycflights13-0.2.1 (licensed under CC0, gzipped) – on-time data for all flights that departed NYC (i.e., JFK, LGA, or EWR) in 2013:

All the logs are available on the website of the US Department of Transportation. Arunkumar Srinivasan's GitHub repository gives some nice R code to access the 2014 data.

babynames

Hadley Wickham's babynames-0.2.1 (licensed under CC0, gzipped) – US Baby Names 1880-2014:

fueleconomy

Hadley Wickham's fueleconomy-0.1 (licensed under CC0, gzipped) – fuel economy data from the EPA, 1985-2015:

nasaweather

Hadley Wickham's nasaweather-0.1 (licensed under GPL-3, gzipped):

R Built-ins

The following datasets are included in the datasets package for GNU R:

From Other Sources

Other — Links

Aggregates — Links

Misc

  1. My Textbook Lightweight Machine Learning Classics with R (book draft, 2020)
  2. My Textbook on R programming:
    Programowanie w języku R (2nd ed., PWN, Warszawa, 2016)
  3. My Textbook on Python for data processing and analysis:
    Przetwarzanie i analiza danych w języku Python (1st ed., PWN, Warszawa, 2016)
  4. My Textbook on Statistical inference with R:
    Wnioskowanie statystyczne z wykorzystaniem środowiska R (1st ed., PR PW, Warszawa, 2014)
  5. Google Summer of Code 2016 Mentor;
    Project: RE2 Regular Expressions in R;
    Student: Qin Wenfeng
  6. My StackOverflow profile

  7. My GitHub profile

My Skills

"If you can implement something, this means you understand it."

Nowadays I develop most of my software with:

  • Python 3 and Cython
  • C++11 and C
  • R (preferably with Rcpp)
  • TeX (duh!)

I have some past experience in: PHP, Java, Julia, Mathematica, Maxima, HTML/CSS/JavaScript, SAS, Matlab/Scilab, C#, Lisp/Scheme, x86 assembly, Fortran, Visual Basic, VBA, C64 Basic, Pascal, ObjectPascal/Delphi, VHDL, Sinclair (ZX Spectrum) Logo, etc.

Libs: ICU, Theano, TensorFlow, Boost, OpenMP, MPI, CGAL, Qt, Gtk, SDL, OpenGL, Rcpp, etc.

I've been programming computers since the age of 7. My first computer was the C64. Here's a trace of my 2017 visit to the Computerspielemuseum in Berlin:

Playing with a C64 in the Computerspielemuseum, Berlin