Putting Data Quality in Context

PhD candidate Daniele Foroni

22 October 2019

Location: Polo Ferrari 1 - Via Sommarive 5, Povo (TN) - Room Garda
Time: 13:00

Abstract of Dissertation

Data quality is a well-established research field that aims to estimate the quality of data. For quite some time, the research community has studied the different aspects of data quality and developed ways to find and clean dirty and incomplete data. In particular, it has focused on computing a number of data characteristics as a means of quantifying different quality dimensions, such as freshness, consistency, accuracy, number of duplicates, or completeness. However, the proposed approaches lack an in-depth view of the quality of the data. Most works have focused on efficient and effective ways to identify and clean data inconsistencies while largely ignoring the task the data is to be used for. As a result, they may neglect cleaning that the task actually needs while prioritizing the repair of errors that are not an issue for it. For streaming data, moreover, the concept of quality shifts slightly, since it depends more on the results produced from the data than on the actual input data.

Hence, in the context of data quality, we focus on three main challenges, each highlighting one aspect through a use case. First, we concentrate on the TASK that the user wants to apply over the data, providing a solution that prioritizes cleaning algorithms to improve the task results. Second, the focus is on the USER, who defines a metric to optimize for a streaming application, and we dynamically scale the resources used by the application to fit the user's goal. Third, the DATA is at the center, and we present a solution for entity matching that measures a profile of the data and uses it to retrieve the similarity metric that gives the best results for such data.

The first work puts data quality in the context of the task that is applied to the data. We introduce F4U (which stands for FITNESS FOR USE), a framework for measuring how fit a dataset is for the intended task, that is, how well the dataset leads to good results for that task. In this system, we take a dataset and perform a systematic noise generation that creates a set of noisy instances from it. Then, we apply the user-given task to the original dataset and to these noisy instances, and we measure the effect of the noise as the distance between the results obtained on the noisy instances and those obtained on the original dataset. This distance lets the user analyze which kind of noise affects the task results the most, which in turn enables a prioritization of the cleaning and repairing algorithms to apply to the original data to improve the results. Other works aim at identifying the most relevant data cleaning tools for a dataset, but our work is the first to do so by optimizing the results of the task the user has in mind.
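
The noise-and-compare idea described above can be illustrated with a minimal Python sketch. It is not the F4U implementation: the two noise generators, the toy task (a mean over numeric values), the absolute-difference distance, and all function names are assumptions chosen only to make the loop concrete.

```python
import random
import statistics


def inject_missing(values, rate, rng):
    """Hypothetical noise: replace a fraction of the values with None."""
    return [None if rng.random() < rate else v for v in values]


def inject_outliers(values, rate, rng, scale=10.0):
    """Hypothetical noise: multiply a fraction of the values by `scale`."""
    return [v * scale if rng.random() < rate else v for v in values]


def mean_task(values):
    """Toy user task (an assumption): mean of the non-missing values."""
    present = [v for v in values if v is not None]
    return statistics.fmean(present)


def rank_noise_by_impact(dataset, task, generators, rate=0.2, trials=30, seed=0):
    """Rank noise types by how far they move the task result from the baseline."""
    rng = random.Random(seed)
    baseline = task(dataset)
    impact = {}
    for name, generate in generators.items():
        # Distance between the result on each noisy instance and the baseline.
        distances = [
            abs(task(generate(dataset, rate, rng)) - baseline)
            for _ in range(trials)
        ]
        impact[name] = statistics.fmean(distances)
    # The most harmful noise type comes first.
    return sorted(impact.items(), key=lambda item: item[1], reverse=True)


dataset = [float(x) for x in range(1, 101)]
ranking = rank_noise_by_impact(
    dataset,
    mean_task,
    {"missing": inject_missing, "outliers": inject_outliers},
)
for name, distance in ranking:
    print(f"{name}: mean distance from baseline = {distance:.2f}")
```

For this toy task, outliers move the mean far more than missing values do, so under these assumptions a repair step that targets outliers would be prioritized over imputation.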

The second work treats data quality in a streaming context as a goal-oriented analysis for the given task. Streaming data is known to have different requirements from relational data, and in this context data is considered of high quality if it is processed according to the user's needs. Hence, we build MOIRA, a tool on top of Apache Flink that adapts the resources needed by a query to optimize the goal metric defined by the user. This optimization improves performance with respect to the goal metric defined for the given query. Before a query is executed, we perform a static analysis that generates an improved query plan, which scales the resources differently to improve performance with respect to the user-defined goal. The plan is then submitted to Flink, while a monitoring system collects information about the cluster and the running application. Based on these collected metrics, the system systematically creates a new query plan and checks whether deploying it would improve performance with respect to the user's goal metric.
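
The monitor, re-plan, and check cycle described above can be sketched in plain Python. This sketch does not use the Flink API and does not reproduce MOIRA's cost model: the cost function, the greedy slot allocation, the metric names (`rate`, `capacity`), and the 10% redeployment threshold are all hypothetical.

```python
def estimated_cost(plan, metrics):
    """Hypothetical goal metric: summed operator utilization (lower is better)."""
    return sum(
        metrics[op]["rate"] / (plan[op] * metrics[op]["capacity"])
        for op in plan
    )


def propose_plan(metrics, total_slots):
    """Greedy sketch: give each spare slot to the operator that gains the most."""
    plan = {op: 1 for op in metrics}
    for _ in range(total_slots - len(plan)):
        def gain(op):
            bigger = {**plan, op: plan[op] + 1}
            return estimated_cost(plan, metrics) - estimated_cost(bigger, metrics)
        plan[max(plan, key=gain)] += 1
    return plan


def should_redeploy(current, candidate, metrics, min_gain=0.1):
    """Redeploy only if the estimated relative improvement is large enough."""
    current_cost = estimated_cost(current, metrics)
    candidate_cost = estimated_cost(candidate, metrics)
    return (current_cost - candidate_cost) / current_cost >= min_gain


# Metrics as a monitoring system might report them (made-up numbers):
# records per second arriving at each operator, and per-instance capacity.
metrics = {
    "source": {"rate": 1000, "capacity": 1000},
    "join": {"rate": 1000, "capacity": 250},
    "sink": {"rate": 500, "capacity": 1000},
}
current = {"source": 2, "join": 2, "sink": 2}
candidate = propose_plan(metrics, total_slots=6)
print(candidate, should_redeploy(current, candidate, metrics))
```

The threshold reflects one possible design choice: redeploying a running streaming job has a cost of its own, so a new plan is only worth deploying when its estimated benefit is clearly larger than a marginal gain.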

In the third work, the focus is on the data itself: we address a well-known problem, entity matching. We propose a framework that gains insight into the data by computing a dataset profile. The profile helps to understand what kind of data the system is analyzing, so that it can apply the similarity metric that best fits the data. The system has an offline and an online phase. In the offline phase, the system trains its model to find duplicates on incoming datasets for which the matching tuples are known, and it computes the profile of each dataset by measuring the accuracy of the results under multiple similarity metrics. This knowledge is used in the online phase, where the system divides the records into portions, minimizing the distance between the profile of each portion and the already computed profiles that are known to lead to good results.
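
The offline and online phases described above can be sketched as follows. This is only an illustration under stated assumptions: the profile features (mean record length and mean token count), the two candidate metrics, the 0.7 match threshold, and all names are hypothetical and are not taken from the dissertation.

```python
import math
from difflib import SequenceMatcher


def profile(records):
    """Hypothetical profile: (mean length in characters, mean token count)."""
    n = len(records)
    return (
        sum(len(r) for r in records) / n,
        sum(len(r.split()) for r in records) / n,
    )


def jaccard(a, b):
    """Token-set overlap; insensitive to word order."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0


def edit_ratio(a, b):
    """Character-level similarity; tolerant of small typos."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


METRICS = {"jaccard": jaccard, "edit_ratio": edit_ratio}


def accuracy(pairs, metric, threshold=0.7):
    """Fraction of labelled pairs (a, b, is_match) classified correctly."""
    correct = sum((metric(a, b) >= threshold) == is_match for a, b, is_match in pairs)
    return correct / len(pairs)


def train(labelled_datasets):
    """Offline phase: store (profile, best metric name) for each labelled dataset."""
    knowledge = []
    for pairs in labelled_datasets:
        records = [s for a, b, _ in pairs for s in (a, b)]
        best = max(METRICS, key=lambda name: accuracy(pairs, METRICS[name]))
        knowledge.append((profile(records), best))
    return knowledge


def choose_metric(knowledge, records):
    """Online phase: reuse the metric of the nearest stored profile."""
    p = profile(records)
    _, name = min(knowledge, key=lambda entry: math.dist(entry[0], p))
    return name


names = [
    ("Jon Smith", "John Smith", True),
    ("Ann Lee", "Anne Lee", True),
    ("Bob Ray", "Tim Fox", False),
]
products = [
    ("red cotton shirt size large", "shirt cotton red large size", True),
    ("blue denim jeans slim fit", "slim fit jeans blue denim", True),
    ("red cotton shirt size large", "blue denim jeans slim fit", False),
]
knowledge = train([names, products])
print(choose_metric(knowledge, ["Kate Bush", "Cate Bush", "Tom Hale"]))
```

In this toy setup, short records with typos end up associated with the character-level metric, while longer records with reordered words end up associated with the token-based one, so a new portion of records is routed to whichever metric worked for the most similar known profile.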

Contact: ict.school [at] unitn.it (ICT International Doctoral School)