Open Thesis Topics in Data Management for Data Science @ Graz University of Technology

The following list shows selected open thesis topics that can be started immediately. However, our group is generally interested in better systems support for the entire end-to-end data science lifecycle and is thus always open to discussing additional topics.

#1: Recycling Intermediates in Feature Selection and Hyper-parameter Optimization Workloads

Category: Bachelor/Master

Abstract: Building a robust machine learning (ML) pipeline for a given application use case often involves experimenting with different algorithms and feature selection strategies, as well as dedicated hyper-parameter optimization. Algorithms for feature selection and hyper-parameter optimization are themselves iterative, repeatedly invoking the same ML algorithms with overlapping subsets of features and different parameters. In this context, commonly used libraries of ML algorithms are treated as black boxes, which leads to redundant computation and thus unnecessarily slows down the exploration process. Inspired by work on recycling intermediates in query processing, this topic aims to explore opportunities for recycling intermediates in linear algebra programs, specifically in feature selection and hyper-parameter optimization workloads. The key ideas are to track the lineage of intermediates (operations and non-determinism), store these intermediates in a bounded-size, in-memory cache, and reuse them for redundant operations. Major challenges include the efficient tracking of local and distributed operations, effective reuse of partial intermediates via compensation plans, and effective eviction policies. All implementation work and experiments can be done in SystemDS, but alternative system proposals are possible.
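
Illustrative sketch (a minimal Python/NumPy analogue with hypothetical names, not the SystemDS implementation): intermediates are keyed by their lineage, stored in a bounded in-memory cache, and reused when the same operation recurs across feature subsets or hyper-parameter configurations.

    import numpy as np

    class LineageCache:
        # Bounded in-memory cache keyed by the lineage of an operation.
        def __init__(self, max_entries=128):
            self.max_entries = max_entries
            self.cache = {}   # lineage key -> cached intermediate
            self.order = []   # simple FIFO eviction; cost-based eviction is a core challenge

        def get_or_compute(self, key, compute_fn):
            if key in self.cache:
                return self.cache[key]        # reuse instead of recompute
            value = compute_fn()
            if len(self.order) >= self.max_entries:
                del self.cache[self.order.pop(0)]
            self.cache[key] = value
            self.order.append(key)
            return value

    cache = LineageCache()

    def gram(X, cols):
        # lineage key: operation, input identity, and column subset (deterministic op)
        key = ("t(X) %*% X", id(X), tuple(cols))
        return cache.get_or_compute(key, lambda: X[:, cols].T @ X[:, cols])

    X = np.random.rand(1000, 20)
    G1 = gram(X, range(10))   # computed once
    G2 = gram(X, range(10))   # reused from the lineage cache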


#2: An Experimental Analysis of Auto Differentiation Techniques

Category: Bachelor

Abstract: The training of deep neural networks still predominantly relies on general-purpose optimizers like mini-batch Stochastic Gradient Descent (SGD), or extensions like SGD-Nesterov, SGD-momentum, RMSprop, Adam, and Adagrad, which share the same computational characteristics. Given a batch of training examples, we make a forward pass through the network, followed by a backward pass to derive the gradients for the next optimizer update. To simplify the exploration of new architectures, deep learning frameworks like TensorFlow or PyTorch all provide means of automatic differentiation (AD) for automatically deriving the backward pass from the forward pass. This topic aims to implement state-of-the-art AD techniques in a common framework (preferably SystemDS, which does not yet support AD), and to compare these techniques in terms of performance, memory requirements, and computational flexibility. In case this comparison reveals major shortcomings, there is the possibility of a follow-up master thesis on extending these techniques.
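
To make the forward/backward idea concrete, here is a minimal sketch of scalar reverse-mode AD in Python (purely illustrative; it is not the SystemDS, TensorFlow, or PyTorch implementation, and real systems operate on tensors and record a full computation graph or tape):

    class Var:
        # A scalar value that records how it was computed (parents + local gradients).
        def __init__(self, value, parents=()):
            self.value = value
            self.parents = parents   # pairs of (parent Var, local partial derivative)
            self.grad = 0.0

        def __add__(self, other):
            return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

        def __mul__(self, other):
            return Var(self.value * other.value,
                       [(self, other.value), (other, self.value)])

        def backward(self, seed=1.0):
            # chain rule: propagate the upstream gradient to all parents
            # (a real implementation processes nodes in reverse topological order)
            self.grad += seed
            for parent, local in self.parents:
                parent.backward(seed * local)

    # forward pass: y = w * x + b; backward pass derives dy/dw, dy/dx, dy/db
    w, x, b = Var(2.0), Var(3.0), Var(1.0)
    y = w * x + b
    y.backward()
    print(w.grad, x.grad, b.grad)   # 3.0 2.0 1.0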


#3: Resource-Constrained Operator Scheduling for Model Scoring

Category: Bachelor/Master

Abstract: An often overlooked aspect of ML systems is the low-level scheduling of physical operators, which affects the memory requirements for live intermediates in complex linear algebra programs. All existing systems follow data dependencies, but order instructions either in a breadth-first or depth-first manner, sometimes combined with runtime approaches such as ready queues and adaptive recomputation. Especially in ML scoring scenarios, memory requirements can become the dominating factor due to concurrent scoring streams or limited device memory. In contrast to traditional programming language compilers, ML systems are able to infer the sizes of intermediates (dimensions and sparsity), which allows for more intelligent scheduling decisions. Accordingly, this topic aims to devise novel scheduling schemes for minimal memory requirements, and for minimal runtime subject to a given memory budget. Potential lossless program transformations include DAG-level operator ordering, code motion across control flow, redundant computation, and update-in-place decisions. Besides classic combinatorial search and pruning techniques, we are also interested in exploring deep reinforcement learning for adaptively learning the search policy.
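
As a toy illustration of why instruction ordering matters (a hypothetical six-operator DAG with assumed intermediate sizes, not a SystemDS plan), the following sketch enumerates valid topological orders and picks the one with minimal peak memory for live intermediates:

    from itertools import permutations

    # hypothetical DAG: node -> (output size in MB, list of input nodes)
    dag = {
        "X": (800, []),          # large input matrix
        "A": (100, ["X"]),
        "B": (100, ["X"]),
        "C": (  1, ["A"]),
        "D": (  1, ["B"]),
        "R": (  1, ["C", "D"]),  # final result
    }
    consumers = {n: [c for c in dag if n in dag[c][1]] for n in dag}

    def is_valid(order):
        seen = set()
        for n in order:
            if any(dep not in seen for dep in dag[n][1]):
                return False
            seen.add(n)
        return True

    def peak_memory(order):
        pos = {n: i for i, n in enumerate(order)}
        live, peak = {}, 0
        for i, n in enumerate(order):
            live[n] = dag[n][0]                      # allocate the output of n
            peak = max(peak, sum(live.values()))
            for inp in dag[n][1]:                    # free inputs after their last consumer ran
                if all(pos[c] <= i for c in consumers[inp]):
                    live.pop(inp, None)
        return peak

    best = min((o for o in permutations(dag) if is_valid(o)), key=peak_memory)
    print(best, peak_memory(best))   # a depth-first-like order: 901 MB peak instead of 1000 MB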


#4: An Experimental Evaluation of Label Generation Techniques

Category: Bachelor

Abstract: Learning complex machine learning (ML) models often requires large amounts of labeled data, which can be costly to acquire in the absence of application feedback loops. Recently, we see a trend toward data programming, or weak supervision in general, to automatically generate noisy labels for large, unlabeled datasets as a means of pretraining, for example, deep neural networks. Common techniques include traditional boosting, label generation from conflicting labeling functions, the automatic generation of such labeling functions, and domain-specific object composition (e.g., placing a known object on a random background). This topic aims to conduct a systematic comparison of these techniques on different datasets and ML algorithms in order to gain a better understanding of the settings under which these techniques yield overall improvements in model accuracy.
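
For illustration, a minimal sketch of the labeling-function idea (hypothetical spam-detection functions, aggregated by a simple majority vote; actual data programming systems additionally learn the accuracies and correlations of the labeling functions):

    # Three hypothetical labeling functions: each returns +1 (spam), -1 (not spam), or 0 (abstain).
    def lf_contains_link(x):   return 1 if "http" in x else 0
    def lf_contains_offer(x):  return 1 if "free" in x or "winner" in x else 0
    def lf_short_message(x):   return -1 if len(x.split()) > 5 else 0

    lfs = [lf_contains_link, lf_contains_offer, lf_short_message]

    def weak_label(x):
        votes = [lf(x) for lf in lfs if lf(x) != 0]
        if not votes:
            return None                      # no labeling function fired
        return 1 if sum(votes) > 0 else -1   # majority vote over noisy, conflicting votes

    unlabeled = ["You are a winner, click http://spam.example",
                 "Can we move the meeting to Friday afternoon instead?"]
    print([weak_label(x) for x in unlabeled])   # [1, -1] -> noisy labels for pretraining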


#5: An Experimental Analysis of Sparse Tensor Formats on CPUs and GPUs

Category: Bachelor

Abstract: Sparse matrices are ubiquitous in ML workloads for natural language processing, graph analytics, recommender systems, and scientific computation, but also as a result of data preprocessing and filtering. Therefore, various ML systems provide sparse matrix representations and automatically dispatch sparse operations. In contrast, modern deep learning frameworks like TensorFlow or PyTorch still have very limited support for sparse tensors. This topic aims at a systematic evaluation of sparse tensor formats and operations in existing ML systems, and vendor-provided BLAS/DNN libraries. In detail, this entails an up-to-date survey of sparsity support in different ML systems, a benchmark for common data characteristics and operations (in the context of traditional ML and deep neural networks), as well as the implementation of selected formats and operations for both CPU and GPU devices. Relevant formats include COO, extensions of CSR/CSC for tensors, as well as the more recent CSF (compressed sparse fiber) format. Especially for GPU devices, additional extensions are likely necessary to cope with the irregularities of sparse tensors. Ultimately, the results of this thesis should guide the design and implementation of sparse tensor support and give robust decision rules for the selection of formats based on data characteristics.
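
As a small plain-Python reference for two of these formats (illustrative only; actual kernels are implemented in native CPU code or CUDA), the following compares COO and CSR matrix-vector multiplication; CSF generalizes this compressed-pointer nesting to higher-order tensors:

    import numpy as np

    A = np.array([[0., 2., 0.],
                  [3., 0., 4.],
                  [0., 0., 5.]])
    x = np.array([1., 2., 3.])

    # COO: parallel arrays of (row, col, value) triples
    rows, cols = np.nonzero(A)
    vals = A[rows, cols]
    y_coo = np.zeros(A.shape[0])
    for r, c, v in zip(rows, cols, vals):
        y_coo[r] += v * x[c]

    # CSR: row pointers delimit the column-index/value range of each row
    ptr = np.concatenate(([0], np.cumsum(np.bincount(rows, minlength=A.shape[0]))))
    y_csr = np.zeros(A.shape[0])
    for r in range(A.shape[0]):
        for k in range(ptr[r], ptr[r + 1]):
            y_csr[r] += vals[k] * x[cols[k]]

    print(y_coo, y_csr)   # both equal A @ x = [4., 15., 15.]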


#6: Efficient Matrix-Matrix Multiplication on Heterogeneous, Losslessly Compressed Matrices

Category: Bachelor/Master

Abstract: Matrix compression is a widely used technique to speed up machine learning (ML) workloads. This includes lossy compression such as quantization (which reduces the value domain) and sparsification (which reduces the number of non-zero values), as commonly used for deep neural networks, as well as lossless compression via compressed linear algebra (CLA). CLA compresses matrices column-wise with lightweight database compression techniques (e.g., dictionary coding, RLE, and OLE, as well as column co-coding), and performs memory-bandwidth-bound linear algebra operations such as matrix-vector multiplications directly over the compressed representation. The reduced size allows fitting larger matrices into local or distributed memory and thus improves performance for datasets that exceed available memory. This topic aims to extend this CLA framework with efficient matrix-matrix multiplications. So far, these operations are realized by repeated matrix-vector multiplications, which lack cache blocking. However, column co-coding also reduces the number of required floating point operations, which makes CLA interesting for compute-intensive matrix-matrix operations. The major challenge is an efficient blocking scheme for these heterogeneously encoded column groups, as well as for sparse encodings such as RLE and OLE.
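
A minimal sketch of the underlying CLA idea for a single dictionary-coded, co-coded column group (illustrative Python, not the SystemDS data structures): the small dictionary of distinct value tuples is multiplied once and the pre-aggregated results are scattered via the row mapping, so the floating point operations scale with the number of distinct tuples rather than the number of rows.

    import numpy as np

    # hypothetical co-coded column group over columns [0, 1]: a dictionary of
    # distinct value pairs plus a per-row mapping into that dictionary
    dictionary = np.array([[1.0, 0.5],
                           [2.0, 0.5],
                           [1.0, 1.5]])            # 3 distinct tuples
    mapping = np.array([0, 0, 1, 2, 1, 0, 2, 1])   # 8 rows reference the dictionary

    v = np.array([3.0, 4.0])                       # slice of the vector for columns [0, 1]

    # matrix-vector over the compressed group:
    # 1) multiply each distinct tuple once (FLOPs ~ #distinct tuples * #cols)
    pre = dictionary @ v
    # 2) scatter the pre-aggregated values according to the row mapping
    y_group = pre[mapping]

    # verification against the uncompressed columns; for matrix-matrix multiplication,
    # the same pre-aggregation applies per column of the right-hand-side matrix
    uncompressed = dictionary[mapping]
    print(np.allclose(y_group, uncompressed @ v))   # True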


#7: LLVM-Based Code Generation Backend for Linear Algebra Programs on CPUs and GPUs

Category: Bachelor/Master

Abstract: There are many opportunities for fused operators (composed of chains of basic operators) in ML workloads. These include avoiding unnecessary intermediates (whose allocation and write is often the bottleneck), scan sharing, and the exploitation of sparsity across operations. Accordingly, numerous operator fusion and code generation frameworks have been introduced in recent years. SystemDS inherits its Java-based code generation framework from Apache SystemML, which supports the exact, cost-based optimization of operator fusion plans for DAGs of linear algebra operations, as well as local and distributed operations over dense, sparse, and compressed data. This topic aims to extend this framework with an additional LLVM-based native compiler backend for both CPUs and GPUs. In detail, this entails additional LLVM compiler infrastructure, code templates for LLVM, native vector libraries, as well as dedicated tuning of data transfers and individual kernels. Additional (but optional) challenges include efficient GPU code generation for sparse and compressed matrices.
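
To illustrate what a fused operator buys (a hand-written Python analogue, not actual code generator output): for an expression like sum(X * Y), the fused variant makes a single pass with a scalar accumulator instead of materializing the full intermediate for X * Y; an LLVM backend would emit vectorized native code with this loop structure rather than generated Java source.

    import numpy as np

    X = np.random.rand(100, 100)
    Y = np.random.rand(100, 100)

    # unfused: X * Y allocates and writes a full 100x100 intermediate before the sum
    unfused = np.sum(X * Y)

    # fused "cell" operator: one pass over the inputs, no intermediate allocation
    def fused_sum_mult(X, Y):
        acc = 0.0
        for i in range(X.shape[0]):
            for j in range(X.shape[1]):
                acc += X[i, j] * Y[i, j]
        return acc

    print(np.isclose(unfused, fused_sum_mult(X, Y)))   # True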


#8: Heterogeneous Quantization for Lossy Data Compression in ML Model Training

Category: Bachelor/Master


#9: Efficient, Distributed Outlier Detection for Point and Sequence Anomalies

Category: Bachelor/Master