Topic: Multi-Tenant Data Quality Scoring and Validation at Scale

Basic data

Title Multi-Tenant Data Quality Scoring and Validation at Scale
Description

Goals:

  • Understand the intricacies of Baqend's data processing pipeline, particularly the tracking data collected by the Speed Kit technology and other data sources like CDN logs
  • Analyze the tracking and CDN data schemas to group attributes based on shared validation semantics, such as performance timers and aggregation dimensions
  • Identify potential anomalies within these groups to quantify the current level of data quality and to enable alerting based on it, focusing on aspects like valid data values, missing values, and specific data issues
  • Consider the challenges of multi-tenancy and the diverse nature of customer websites, ensuring that data quality is assessed effectively for each individual customer
  • Lay the groundwork for a continuous monitoring system that can assess data quality in real-time, even if the initial implementation is batch-based or focuses on a specific subset of the data

Description:
Ensuring the quality of collected data is paramount for any organization, especially when this data plays a crucial role in monitoring operations and optimizing products. Baqend, with its Speed Kit technology, accelerates websites and relies heavily on tracking data to ensure optimal performance. However, the diverse nature of this data, combined with the challenges of multi-tenancy, makes assessing its quality a complex task.

This project aims to establish a foundational data quality scoring mechanism for Baqend. By analyzing the tracking and CDN data schemas, the goal is to group attributes that likely share validation semantics. This grouping will then serve as a basis for identifying potential anomalies that can indicate issues with data quality. Given the diverse nature of Baqend's data, which spans multiple tables with numerous attributes of varying data types, and the unique characteristics of each customer website, this task requires a nuanced approach.
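
To make the idea of grouping attributes by shared validation semantics more concrete, the following is a minimal sketch of a per-tenant quality score. It is not Baqend's implementation: the attribute names (`ttfb_ms`, `load_ms`, `browser`, `country`), the `tenant` field, the sanity cap of 600,000 ms, and the scoring formula (share of checked values that pass) are all assumptions chosen for illustration.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, Iterable, List, Tuple


def is_valid_timer(value: Any) -> bool:
    """Performance timers: non-negative numbers below an assumed sanity cap (ms)."""
    is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
    return is_number and 0 <= value <= 600_000


def is_valid_dimension(value: Any) -> bool:
    """Aggregation dimensions: non-empty strings."""
    return isinstance(value, str) and value.strip() != ""


# Hypothetical attribute groups; each group shares one validator.
GROUPS: Dict[str, Tuple[List[str], Callable[[Any], bool]]] = {
    "performance_timer": (["ttfb_ms", "load_ms"], is_valid_timer),
    "aggregation_dimension": (["browser", "country"], is_valid_dimension),
}


def score_by_tenant(records: Iterable[Dict[str, Any]]) -> Dict[str, float]:
    """Return, per tenant, the fraction of checked attribute values that are valid.

    Missing attributes count as invalid, so the score also reflects completeness.
    """
    passed: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for record in records:
        tenant = record["tenant"]
        for attributes, validator in GROUPS.values():
            for attribute in attributes:
                total[tenant] += 1
                if validator(record.get(attribute)):
                    passed[tenant] += 1
    return {tenant: passed[tenant] / total[tenant] for tenant in total}


# Made-up sample records for two tenants.
records = [
    {"tenant": "shop-a", "ttfb_ms": 120, "load_ms": 900, "browser": "Firefox", "country": "DE"},
    {"tenant": "shop-a", "ttfb_ms": -5, "load_ms": 1100, "browser": "", "country": "DE"},
    {"tenant": "shop-b", "ttfb_ms": 80, "load_ms": None, "browser": "Chrome"},
]

print(score_by_tenant(records))  # {'shop-a': 0.75, 'shop-b': 0.5}
```

Scoring each tenant separately keeps a small customer with broken tracking from being hidden behind the volume of larger ones. A realistic system would likely need tenant-specific thresholds rather than one global cap.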

While the long-term vision is to have a continuous monitoring system that can instantly flag data quality issues and ideally also pinpoint possible causes/solutions, the scope of this thesis will focus on laying the groundwork for such a system. This might involve focusing on a specific subset of the data, analyzing data from a particular timeframe, or implementing a batch-based analysis rather than a real-time one.

In the long term, we seek to provide Baqend with a robust mechanism to ensure the integrity and quality of its collected data, paving the way for more advanced, real-time monitoring systems in the future.

Resources (Mandatory)
Please check out the following resources before we meet to discuss a potential topic for your thesis:
  • CodeTalks 2023 – Data Validation at Scale: Managing Data Quality in Complex Data Pipelines: Video (coming soon)
  • VLDB 2022 – Beaconnect: Continuous Web Performance A/B-Testing at Scale: Video / Paper
  • Flink Forward 2021 – Batching was Yesterday! Real-Time Tracking For 100+ Million Visitors: Video
  • ICDE 2020 – Speed Kit: A Polyglot & GDPR-Compliant Approach For Caching Personalized Content: Video / Paper
Home institution Department of Computer Science
Type of work practical / application-oriented
Thesis type Bachelor or Master
Author Prof. Dr. Wolfram Wingerath
Status available
Task description
Prerequisites
depends on the topic
Created 20.08.2023

Study data

Divisions
  • Information Systems - Data Science
Degree programmes
  • Master of Education (Haupt- und Realschule) Computer Science
  • Single-Subject Bachelor Business Informatics
  • Master of Education (Gymnasium) Computer Science
  • Master Computer Science
  • Master Applied Economics and Data Science
  • Single-Subject Bachelor Computer Science
  • Master of Education (Business Education) Computer Science
  • Master Business Informatics
Assigned courses
Contact persons