Data Integration and Large-Scale Analysis WS2019/20
(VU, 706.520 Data Integration and Large-Scale Analysis)

DIA is a 5 ECTS bachelor and master course, applicable to the bachelor programs Computer Science and Software Engineering and Management, as well as to the Computer Science master catalog 'Knowledge Technologies'. This course covers major data integration architectures, key techniques for data integration and cleaning, as well as methods for large-scale (i.e., distributed) data storage and analysis.


Lectures

In detail, the course covers the following topics, which also reflect the course calendar. All slides will be made available prior to the individual lectures.

A: Data Integration and Preparation

  • 01 Introduction and Overview [Oct 04, pdf, pptx]
  • 02 Data Warehousing, ETL, and SQL/OLAP [Oct 11, pdf, pptx]
  • 03 Message-oriented Middleware, EAI, and Replication [Oct 18, pdf, pptx]
  • 04 Schema Matching and Mapping [Oct 25, pdf, pptx]
  • 05 Entity Linking and Deduplication [Nov 08, pdf, pptx]
  • 06 Data Cleaning and Data Fusion [Nov 15, pdf, pptx]
  • 07 Data Provenance and Blockchain [Nov 22, pdf, pptx]

B: Large-Scale Data Management and Analysis

  • 08 Cloud Computing Fundamentals [Dec 06, pdf, pptx]
  • 09 Cloud Resource Management and Scheduling [Dec 13, pdf, pptx]
  • 10 Distributed Data Storage [Jan 10, pdf, pptx]
  • 11 Distributed, Data-Parallel Computation [Jan 17, pdf, pptx]
  • 12 Distributed Stream Processing [Jan 24, pdf, pptx]
  • 13 Distributed Machine Learning Systems [Jan 31, pdf, pptx]


Exercises

The lectures are accompanied by mandatory exercises that leverage open source tools and systems (to the extent of 2 ECTS, i.e., roughly 50 working hours). As an alternative to the exercises, students may choose to do a programming project in SystemDS (an open source ML system for the end-to-end data science lifecycle).


Organization

  • Lecturer: Univ.-Prof. Dr.-Ing. Matthias Boehm, ISDS
  • Teaching Assistant: M.Sc. Shafaq Siddiqi, ISDS
  • Final oral exams: Feb 03 - 05, 2020
  • Grading: 40% project, 60% final exam