Advanced Algorithms and Data Structures for Data ScienceÂ¶
MSc studies in Data Science, Faculty of Mathematics and Information Science, Warsaw University of Technology
This course covers methods for data analysis and mining, statistics, machine learning, and artificial intelligence from an algorithmic perspective. We will introduce, analyse, and implement key algorithms and data structures used for finding similar groups of objects, building decision trees, learning parameters of artificial neural networks, processing time series, and many more. We will focus not only on implementing reusable algorithms from scratch (acquiring programming skills needed to implement any algorithm when a readytouse implementation of a method is unavailable, its highlevel R/Python prototype is too slow to run, or a customised modification thereof is required),Â but also calling methods from existing libraries (standing on the shoulders of giants, appreciating the effort of the opensource community). As a byproduct, we will better understand the implementation of statistical software packages (R, Python with numpy and pandas,Â etc.).
Prerequisites: Structured Data Processing (R and Python), Data Structures and Algorithms (sorting and searching), Numerical Methods, Introduction to Machine Learning
ScheduleÂ¶
Winter semester 2024/2025:
Lectures: Wednesdays, 16:15â€“18:00, 103 MiNI (classical, blackboardbased lectures; students are expected to participate actively in the classes, take notes, discuss/brainstorm all ideas, and overall happily relish the food for thought carefully prepared by yours truly)
Pracs: Thursdays, 10:15â€“12:00, 219 MiNI (practical implementation, testing, and application of the algorithms presented in the lectures or described in the literature; discussion of the obtained results)
Lecture 
Pracs 














â€” 










â€” 

â€” 






â€” 
General Topics CoveredÂ¶
The list (and the order of items) is provisional.
Introduction to the C programming language. Calling C from R and Python.
Working with arrays in C. Dynamic memory allocation. Sorting algorithms and hash tables refresher. File I/O. Processing (large) data streams. Pseudorandom number generation and Monte Carlo methods
Dense and sparse representations of matrices. Calling functions from shared libraries: GNU GSL, LAPACK, BLAS, etc. Applications of the QR and SVD decompositions in data analysis.
Minimum spanning trees, disjointsets data structures, implementing agglomerative hierarchical clustering algorithms.
Implementing linked data structures in C (linked lists, binary search trees, Btrees, etc.). Decision trees (ID3, CART, boosting, bagging, random forests).
Exact and approximate nearneighbour/spatial search (kd, vptrees, LSH) with applications in knn classification and regression and DBSCAN clustering.
One (NPhard) objective function and many heuristic solvers: the withincluster sum of squares and many variants of the Kmeans, Ward linkage, hill climbing, tabu search, evolutionary algorithms etc.
Logistic regression and other feedforward artificial neural networks: parameter learning via gradient descent and other optimisers, backpropagation, automatic differentiation (autograd).
Fourier transforms of time series with FFT.
Assessment Methods and RegulationsÂ¶
There are two written (Ă 20 and 30 pts) and four programming assignments (Ă 10, 10, 15, and 15 pts max, respectively).
During the written assignments, electronic devices must not be used. However, a single A4sized sheet of handwritten notes can be brought along.
The programming assignments will be solved on the laboratory (Linux; not: own) devices. Only an explicitly specified list of online resources (e.g., official documentation of some APIs) can be accessed. In particular, the use of GenAIbased services (ChatGPT etc.) is strictly forbidden. The failure to comply with these rules is a breach of academic integrity (cheating) and results in a fail mark.
Class attendance is compulsory. Five pts will be deducted from the final result for each unexcused absence.
The final grade will reflect the extent to which your written and programming assessment tasks meet the specified quality criteria. The total result of â‰¤50 pts translates to the 2.0 grade; (50,60] â€“ 3.0; (60,70] â€“ 3.5; (70,80] â€“ 4.0; (80,90] â€“ 4,5; and >90 â€“ 5.0.
ReferencesÂ¶
The list is nonexhaustive. More references will be provided in the lectures.
T.H. Cormen, C.E. Leiserson, R.L. Rivest, C. Stein, Introduction to Algorithms, MIT Press and McGrawHill, 2022
W.H. Press, S.A. Teukolsky, W.T. Vetterling, B.P. Flannery, Numerical Recipes. The Art of Scientific Computing, 3rd ed., Cambridge University Press, 2007
S. Marsland, Machine Learning: An Algorithmic Perspective, Chapman&Hall/CRC, 2015
M. Gagolewski, Deep R Programming, 2024 (Chapter 14)
D. Goldberg, What every computer scientist should know about floatingpoint arithmetic, ACM Computing Surveys 21(1), 1991, 5â€“48
N.J. Higham, Accuracy and Stability of Numerical Algorithms, SIAM, 2002
(*) D.E. Knuth, The Art of Computer Programming, Vols. 1â€“4B, AddisonWesley, 2023
T. Rothwell, J. Youngman, and others, The GNU C Reference Manual (skip the parts devoted to the â€śGNU extensionsâ€ť)
The GNU C Language Manual (skip the parts devoted to the â€śGNU extensionsâ€ť)
J. Gustedt, Modern C, Manning, 2019
(*) B.W. Kernighan, D.M. Ritchie, The C Programming Language, Prentice Hall, 1988
(*) Programming Languages â€“ C. International Standard ISO/IEC 9899:2023, draft
Source code of:
SoftwareÂ¶
Programming languages: C, C++ (gcc/clang), R, Python (CPython, Cython), Fortran
A GNU/Linux operating system with root access