Data Integration and Large-scale Analysis WiSe2023/24
(VL/UE, 41112 Data Integration and Large-scale Analysis)

DIA is a 6 ECTS module, applicable to the bachelor and master study courses computer science, computer engineering, information systems management, and electrical engineering, as well as the study areas data and software engineering, cognitive systems, and distributed systems and networks. This course covers major data integration architectures, key techniques for data integration and cleaning, as well as methods for large-scale, i.e., distributed, data storage and analysis.


In detail, the course covers the following topics, which also reflect the course calendar. All slides will be made available prior to the individual lectures, which take place Thursdays 4pm-6pm (start 4.15pm) in H 0107 and virtually via Zoom (link). Furthermore, we offer weekly office hours, which take place Wednesdays 4.30pm-6pm in TEL 0811 and virtually via Zoom (call-in: office hour, starting Nov 01). Lecture attendance is optional, and videos of the recorded Zoom sessions will be made available a few days after the individual lectures.

A: Data Integration and Preparation

  • 01 Introduction and Overview [Oct 19, pdf, pptx, mp4]
  • 02 Data Warehousing, ETL, and SQL/OLAP [Oct 26 (virtual only), pdf, pptx, mp4 (part1), mp4 (part2)]
  • 03 Replication and Message-oriented Middleware [Nov 02, pdf, pptx, mp4]
  • 04 Schema Matching and Mapping [Nov 09 (virtual only), pdf, pptx, mp4]
  • 05 Entity Linking and Deduplication [Nov 16, pdf, pptx, mp4]
  • 06 Data Cleaning and Data Fusion [Nov 23, pdf, pptx, mp4]
  • 07 Data Provenance and Data Catalogs [Nov 30, pdf, pptx, mp4]

B: Large-Scale Data Management and Analysis

Project / Exercises

The lectures are accompanied by mandatory programming projects (to the extent of 3 ECTS, i.e., roughly 80-90 working hours), preferably in Apache SystemDS (an open-source ML system for the end-to-end data science lifecycle) or DAPHNE (an open and extensible system infrastructure for integrated data analysis pipelines).
A list of project proposals and details on alternative exercises (local and distributed entity resolution pipeline) are available here:


  • Lecturer: Univ.-Prof. Dr.-Ing. Matthias Boehm, DAMS
  • Teaching Assistant: M.Tech. Arnab Phani, DAMS
  • Final written exams (preliminary dates): Feb 08, 4pm (pdf) and Feb 15, 4pm (pdf); additional oral exam slots on demand, e.g., for international students
  • Grading: 100% final exam, project as prerequisite (up to 5 extra points)