Data Integration and Large-scale Analysis WiSe2025/26
(VL/UE, 41112 Data Integration and Large-scale Analysis)

DIA is a 6 ECTS module, applicable to the bachelor's and master's programs in computer science, computer engineering, information systems management, and electrical engineering, as well as to the study areas data and software engineering, cognitive systems, and distributed systems and networks. This course covers major data integration architectures, key techniques for data integration and cleaning, as well as methods for large-scale, i.e., distributed, data storage and analysis.


Lectures

In detail, the course covers the following topics, which also reflect the course calendar. All slides will be made available prior to the individual lectures, which take place Thursdays 4pm-6pm (start 4.15pm) in BH-N 243 and virtually via Zoom (link). Furthermore, we offer weekly office hours, which take place Wednesdays 5pm-6pm virtually via Zoom (call-in: office hour, starting Oct 23). Lecture attendance is optional, and videos of the recorded Zoom sessions will be made available a few days after the individual lectures.

A: Data Integration and Preparation

  • 01 Introduction and Overview [Oct 16]
  • 02 Data Warehousing, ETL, and SQL/OLAP [Oct 23]
  • 03 Replication and Message-oriented Middleware [Oct 30]
  • 04 Schema Matching and Mapping [Nov 06]
  • 05 Entity Linking and Deduplication [Nov 13]
  • 06 Data Cleaning and Data Fusion [Nov 20]
  • 07 Data Provenance and Data Catalogs [Nov 27]

B: Large-Scale Data Management and Analysis

  • 08 Cloud Computing Fundamentals [Dec 04]
  • 09 Cloud Resource Management and Scheduling [Dec 11]
  • 10 Distributed Data Storage [Dec 18]
  • 11 Distributed, Data-Parallel Computation [Jan 15]
  • 12 Distributed Stream Processing [Jan 22]
  • 13 Distributed Machine Learning Systems [Jan 29]


Project / Exercises

The lectures are accompanied by mandatory programming projects (to the extent of 3 ECTS, i.e., roughly 80-90 working hours), preferably in Apache SystemDS (an open source ML system for the end-to-end data science lifecycle) or DAPHNE (an open and extensible system infrastructure for integrated data analysis pipelines).
A list of project proposals and details on an alternative exercise (streaming full text search) are available here:

  • Apache SystemDS: Student Projects
  • Alternative Exercise: TBD (last update: TBD)


Organization

  • Lecturer: Univ.-Prof. Dr.-Ing. Matthias Boehm, DAMS
  • Teaching Assistant: Carlos E. Muniz Cuza, DAMS
  • Final written exams: Feb 05, 4pm; Feb 12, 4pm; and Mar 12, 4pm; additional oral exam slots on demand, e.g., for international students
  • Grading: 100% final exam, project as prerequisite (up to 5 extra points)