Advanced Algorithms and Data Structures for Data Science¶

aka Computational Data Science


MSc studies in Data Science, Faculty of Mathematics and Information Science, Warsaw University of Technology

This course covers methods for data analysis and mining, statistics, machine learning, and artificial intelligence from an algorithmic perspective. We will introduce, analyse, and implement key algorithms and data structures used for finding similar groups of objects, building decision trees, learning parameters of artificial neural networks, processing time series, and many more. We will focus not only on implementing re-usable algorithms from scratch (acquiring programming skills needed to implement any algorithm when a ready-to-use implementation of a method is unavailable, its high-level R/Python prototype is too slow to run, or a customised modification thereof is required), but also calling methods from existing libraries (standing on the shoulders of giants, appreciating the effort of the open-source community). As a by-product, we will better understand the implementation of statistical software packages (R, Python with NumPy and Pandas, etc.).

Prerequisites:

  • Structured data processing (R and Python),

  • Data structures and algorithms (sorting and searching),

  • Numerical methods,

  • Introduction to machine learning

Schedule¶

Winter semester 2024/2025:

  • Lectures: Wednesdays, 16:15–18:00, 103 MiNI (classical, blackboard-based lectures; students are expected to participate actively in the classes, take notes, discuss/brainstorm all ideas, and overall happily relish all the food for thought mindfully seasoned and served by yours truly)

  • Pracs: Thursdays, 10:15–12:00, 219 MiNI (practical implementation, testing, and application of the algorithms presented in the lectures or described in the literature; discussion of the obtained results)

Lecture

Pracs

  1. 2024-10-02

  1. 2024-10-03

  1. 2024-10-09

  • Introduction to the C programming language (cont’d):

    • unions, enums, and preprocessor #defines

    • static arrays

    • pointers

    • dynamic memory allocation

  • References: see Week 1

  1. 2024-10-10

  1. 2024-10-16

  • static functions and variables

  • Representation of characters and strings

  • Representation of signed and unsigned integers

  1. 2024-10-17

  • Implementing basic algorithms for processing numerical arrays using the R/C API (cont’d):

    • mysum for integer and logical vectors too

    • writing unit tests (stopifnot)

    • basic benchmarking (microbenchmark)

    • mysqrt using Rf_allocVector + (UN)PROTECT

  1. 2024-10-23

  • Sorting algorithms at a glance:

    • total preorders and the corresponding equivalence relations

    • stability of algorithms

    • example applications in data science – algorithms that benefit from knowing that data are sorted: computing quantiles, ECDFs, mode(s), determining outliers, implementing unique (with elements sorted or in the order of appearance), duplicated, findInterval (binary search), match (and hence SQL-like joins, GROUP BYs – computing aggregates in groups), rank (with different tie breaking rules), finding the inverse of a permutation,

    • C’s qsort

  • References:

    • (!) T.H. Cormen, C.E. Leiserson, R.L. Rivest, C. Stein, Introduction to Algorithms, MIT Press and McGraw-Hill, 2022, Chapter 2

  • Further reading:

    • R.J. Hyndman, Y. Fan, Sample quantiles in statistical packages, American Statistician, 50(4), 1996, 361–365. DOI: 10.2307/2684934. errata

    • polars, dplyr, data.table (they have fast-track algorithms for sorted data)

  1. 2024-10-24

  • Programming Assignment 1 (10 pts max.)

    Practice by implementing R/C versions of the following R functions (and similar ones): abs, log, +, *, == (binary operators as two-argument functions myadd, mymultiply, myequal), is.na, is.unsorted, all, any, mean, var, cumsum, head, rev, rep, seq, etc.

  1. 2024-10-30

  1. 2024-10-31

  1. 2024-11-06

  • Sorting algorithms (cont’d):

    • non-comparison sorting: the case of small integers (factors/categorical data): count and radix sort

  • Beyond fixed-size arrays

    • Allocating too much: Can you afford some wastefulness? If so, just do it and don’t bother

    • Growable (dynamic) arrays. realloc. Amortised analysis

    • (*) Bonus: Are contiguous arrays really contiguous and do they have to fit in memory?: Virtual vs physical addresses, paging, swapping, memory-mapped I/O

  • References:

    • (!) T.H. Cormen, C.E. Leiserson, R.L. Rivest, C. Stein, Introduction to Algorithms, MIT Press and McGraw-Hill, 2022, Chapter 8 (Sorting in Linear Time), Section 17.4 (Dynamic Tables)

  • Further reading:

  1. 2024-11-07

  • Programming Assignment 2 (10 pts max.)

    Practice by implementing Python/C versions of the following NumPy functions (and similar ones): +, *, == (binary operators as two-argument functions myadd, mymultiply, myequal), abs, log, repeat, tile, arange, linspace, logspace, mean, std, cumsum, diff, min, argmin, allclose, unique, sort, argsort, and scipy.stats.rankdata.

  1. 2024-11-13

  • Beyond fixed-size arrays (cont’d)

    • Circular buffers

    • Recursive data types: Linked lists, rooted trees (binary search trees, decision trees, …), and other graphs

    • Arrays of arrays (ragged arrays, e.g., data frames) or arrays of linked lists (e.g., directed graphs)

  • References:

    • (!) T.H. Cormen, C.E. Leiserson, R.L. Rivest, C. Stein, Introduction to Algorithms, MIT Press and McGraw-Hill, 2022, Section 8.4 (Bucket Sort), Chapter 10 (Elementary Data Structures), Chapter 22 (Elementary Graph Algorithms)

—

  1. 2024-11-20

  • Beyond fixed-size arrays (cont’d)

    • Hash tables

    • Dense matrices (Fortran/C order)

    • Strided arrays

    • Sparse vectors

    • Sparse matrices

    • Representing data frames

  • References:

    • (!) T.H. Cormen, C.E. Leiserson, R.L. Rivest, C. Stein, Introduction to Algorithms, MIT Press and McGraw-Hill, 2022, Chapter 11 (Hash Tables),

  • (*) Further reading:

    • Python’s dict; see also this

    • PHP’s hashtables

    • R’s environments

    • D.E. Knuth, The Art of Computer Programming, Volume 3, Section 6.4 (Hashing)

  1. 2024-11-21

  • Processing matrices using the R/C API

  • Processing matrices using the NumPy/C API

  1. 2024-11-27

  • Clustering algorithms:

    • Optimising the within-cluster sum of squares is (NP) hard

    • (Steepest ascent) hill climbing

    • Random restarts

    • Tabu-like search

  1. 2024-11-28

  • Speeding up matrix multiplication:

    • Locality of reference

    • Setting compiler flags; the optimising compiler

    • R’s %*%: options("matmul")

    • Replacing the default BLAS

  1. 2024-12-04

  • Clustering algorithms (cont’d)

    • k-means

    • fuzzy (probabilistic/soft) k-means

  • Further reading:

    • J. Bezdek, R. Ehrlich, W. Full, FCM: The fuzzy c-means clustering algorithm, Computers & Geosciences 10, 1984, 191–203

  1. 2024-12-05

  1. 2024-12-11

  • Written Assignment 1 (20 pts max.)

    Material: Weeks 1–8 (inclusive)

  1. 2024-12-12

  1. 2024-12-18

  • Clustering algorithms (cont’d)

    • Disjoint sets

    • Minimum spanning trees

    • Single linkage

    • Other agglomerative algorithms: NN chains

    • The Ward linkage vs WCSS

    • Plotting dendrograms

    • Genie

  • Further reading:

    • (!) D. MĂĽllner, Modern hierarchical, agglomerative clustering algorithms, arXiV/1109.2378, 2011

    • (!) T.H. Cormen, C.E. Leiserson, R.L. Rivest, C. Stein, Introduction to Algorithms, MIT Press and McGraw-Hill, 2022, Chapter 21 (Data Structures for Disjoint Sets), Chapter 23 (Minimum Spanning Trees)

    • C.F. Olson, Parallel algorithms for hierarchical clustering, Parallel Computing 21, 1995, 1313–1325

    • M. Gagolewski, genieclust: Fast and robust hierarchical clustering, SoftwareX 15, 2021, 100722 https://genieclust.gagolewski.com/

    • M. Gagolewski, A framework for benchmarking clustering algorithms, SoftwareX 20, 2022, 101270 https://clustering-benchmarks.gagolewski.com/

    • F. Murtagh, P. Legendre, Ward’s Hierarchical Agglomerative Clustering Method: Which Algorithms Implement Ward’s Criterion? Journal of Classification 31, 2014, 274–295

  1. 2024-12-19

  • Programming Assignment 3 (15 pts max.)

    Practice by implementing matrix operations in NumPy/C and R/C. Basic mastery of OpenMP will also be required.

—

  1. 2025-01-02

—

  1. 2025-01-09

  1. 2025-01-15

  • Finding nearby (similar points): k-nearest neighbour and range queries

  • Applications in classification, regression, clustering, recommender systems

  • k-d trees (multidimensional binary trees)

  • Curse of dimensionality

  • Approximate NNs

  • Further reading:

  1. 2025-01-16

  • Programming Assignment 4 (15 pts max.)

    Practice by implementing matrix operations in NumPy/C and R/C. Basic mastery of OpenMP will also be required.

    Make sure you can solve the previous programming assignment in full.

  1. 2025-01-22

  • Matrix algebra libraries: BLAS and LAPACK (see OpenBLAS)

  • General design patterns for vectorising scalar operations: map, reduce, window functions, etc.

  • SIMD, GPUs, eager evaluation vs just-in-time compilation: LLVM, Julia, Theano, TensorFlow, and stuff

  • Further reading:

  1. 2025-01-23

  1. 2025-01-29

  • Written Assignment 2 (30 pts max.)

—

General Topics Covered¶

The initial list of topics (and the order of items) was provisional. In 2024/2025, we have covered the following.

  1. Introduction to the C programming language. Calling C from R and Python.

  2. Working with arrays in C. Dynamic memory allocation. Sorting algorithms and hash tables refresher. File I/O. Processing (large) data streams. Pseudorandom number generation and Monte Carlo methods.

  3. Dense and sparse representations of matrices. Calling functions from shared libraries: GNU GSL, LAPACK, BLAS, etc. Applications of the QR and SVD decompositions in data analysis.

  4. Minimum spanning trees, disjoint-sets data structures, implementing agglomerative hierarchical clustering algorithms.

  5. Implementing linked data structures in C (linked lists, binary search trees, B-trees, etc.). Decision trees (ID3, CART, boosting, bagging, random forests).

  6. Exact and approximate near-neighbour/spatial search (kd-trees, vp-trees, LSH) with applications in k-nn classification and regression and DBSCAN and spectral clustering.

  7. One (NP-hard) objective function and many heuristic solvers: the within-cluster sum of squares and many variants of the K-means, Ward linkage, hill climbing, tabu search, evolutionary algorithms etc.

  8. Logistic regression and other feed-forward artificial neural networks: parameter learning via gradient descent and other optimisers, back-propagation, automatic differentiation (autograd).

  9. Fourier transforms of signals/time series with FFT.

Assessment Methods and Regulations¶

There are two written (Ă  20 and 30 pts) and four programming assignments (Ă  10, 10, 15, and 15 pts max, respectively).

During the written assignments, electronic devices must not be used. However, a single A4-sized sheet of hand-written notes can be brought along.

The programming assignments will be solved on the laboratory (Linux; not: own) devices. Only an explicitly specified list of online resources (e.g., official documentation of some APIs) can be accessed. In particular, the use of Gen-AI-based services (ChatGPT etc.) is strictly forbidden. The failure to comply with these rules is a breach of academic integrity (cheating) and results in a fail mark.

Class attendance is compulsory. Five pts will be deducted from the final result for each unexcused absence.

The final grade will reflect the extent to which your written and programming assessment tasks meet the specified quality criteria. The total result of ≤50 pts translates to the 2.0 grade; (50,60] – 3.0; (60,70] – 3.5; (70,80] – 4.0; (80,90] – 4,5; and >90 – 5.0.

References¶

The list is non-exhaustive. More references will be provided in the lectures (also: see above).

  1. T.H. Cormen, C.E. Leiserson, R.L. Rivest, C. Stein, Introduction to Algorithms, MIT Press and McGraw-Hill, 2022

  2. W.H. Press, S.A. Teukolsky, W.T. Vetterling, B.P. Flannery, Numerical Recipes. The Art of Scientific Computing, 3rd ed., Cambridge University Press, 2007

  3. S. Marsland, Machine Learning: An Algorithmic Perspective, Chapman&Hall/CRC, 2015

  4. M. Gagolewski, Deep R Programming, 2024

  5. M. Gagolewski, Minimalist Data Wrangling with Python, 2024

  6. D. Goldberg, What every computer scientist should know about floating-point arithmetic, ACM Computing Surveys 21(1), 1991, 5–48

  7. N.J. Higham, Accuracy and Stability of Numerical Algorithms, SIAM, 2002

  8. G.H. Golub, C.F. Van Loan, Matrix Computations, The Johns Hopkins University Press, 2013

  9. NIST Digital Library of Mathematical Functions

  10. (*) D.E. Knuth, The Art of Computer Programming, Vols. 1–4B, Addison-Wesley, 2023

  11. T. Rothwell, J. Youngman, and others, The GNU C Reference Manual (skip the parts devoted to the “GNU extensions”)

  12. The GNU C Language Manual (skip the parts devoted to the “GNU extensions”)

  13. The GNU C Library

  14. J. Gustedt, Modern C, Manning, 2019

  15. R.J. Hyndman, Y. Fan, Sample quantiles in statistical packages, American Statistician, 50(4), 1996, 361–365. DOI: 10.2307/2684934.

  16. (*) B.W. Kernighan, D.M. Ritchie, The C Programming Language, Prentice Hall, 1988

  17. (*) J. Arndt, Matters Computational: Ideas, Algorithms, Source Code, Springer, 2011

  18. (*) Programming Languages – C. International Standard ISO/IEC 9899:2023, draft

Source code of:

  1. R (mirror)

  2. Python

  3. NumPy

  4. SciPy

  5. Pandas

  6. scikit-learn

  7. data.table

  8. dplyr

  9. GNU GSL

Software¶

Programming languages: C, C++ (gcc/clang), R, Python (CPython, Cython), Fortran

A GNU/Linux operating system with root access