Data Integration and Large-scale Analysis WiSe2024/25
(VL/UE, 41112 Data Integration and Large-scale Analysis)

DIA is a 6 ECTS module, applicable to the bachelor's and master's programs in computer science, computer engineering, information systems management, and electrical engineering, as well as the study areas data and software engineering, cognitive systems, and distributed systems and networks. This course covers major data integration architectures, key techniques for data integration and cleaning, as well as methods for large-scale, i.e., distributed, data storage and analysis.


Lectures

In detail, the course covers the following topics, which also reflect the course calendar. All slides will be made available prior to the individual lectures, which take place Thursdays, 4pm-6pm (start 4.15pm), in H 0107 and virtually via Zoom (link). Furthermore, we offer weekly office hours, which take place Mondays, 10am-12pm, virtually via Zoom (call-in: office hour, starting Oct 21). Lecture attendance is optional, and videos of the recorded Zoom sessions will be made available a few days after the individual lectures.

A: Data Integration and Preparation

  • 01 Introduction and Overview [Oct 17]
  • 02 Data Warehousing, ETL, and SQL/OLAP [Oct 24]
  • 03 Replication and Message-oriented Middleware [Oct 31]
  • 04 Schema Matching and Mapping [Nov 07]
  • 05 Entity Linking and Deduplication [Nov 14]
  • 06 Data Cleaning and Data Fusion [Nov 21]
  • 07 Data Provenance and Data Catalogs [Nov 28]

B: Large-Scale Data Management and Analysis

  • 08 Cloud Computing Fundamentals [Dec 05]
  • 09 Cloud Resource Management and Scheduling [Dec 12]
  • 10 Distributed Data Storage [Dec 19]
  • 11 Distributed, Data-Parallel Computation [Jan 09]
  • 12 Distributed Stream Processing [Jan 16]
  • 13 Distributed Machine Learning Systems [Jan 23]


Project / Exercises

The lectures are accompanied by mandatory programming projects (to the extent of 3 ECTS, i.e., roughly 80-90 working hours), preferably in Apache SystemDS (an open-source ML system for the end-to-end data science lifecycle) or DAPHNE (an open and extensible system infrastructure for integrated data analysis pipelines).
A list of project proposals and details on an alternative exercise (TBD) are available here:


Organization

  • Lecturer: Univ.-Prof. Dr.-Ing. Matthias Boehm, DAMS
  • Teaching Assistant: M.Tech. Arnab Phani, DAMS
  • Final written exams (preliminary dates): Feb 06, 4pm, and Feb 13, 4pm; additional oral exam slots on demand, e.g., for international students
  • Grading: 100% final exam, project as prerequisite (up to 5 extra points)