Module 5 of 15 · 📖 4 min read · ⏱ 30 min total

FI-DPA 05 Measuring and Ensuring Data Quality (EN)

Table of Contents (6 sections)

Concepts and Background
Architecture Diagram
Practical Steps
Common Pitfalls
Further Resources
Knowledge Check

FI-DPA 05 Measuring and Ensuring Data Quality

Data quality is the foundation for reliable analyses and well-founded decisions in companies. In this module, you will learn methods for the systematic evaluation and assurance of data quality based on key criteria such as completeness, accuracy, and consistency. You will gain practical knowledge in data profiling and use the Great Expectations framework to automatically monitor and ensure data quality.

Concepts and Background

Completeness: Evaluates whether all expected data is present. Missing values can lead to incomplete analyses and distorted results.
Accuracy: Checks whether values are correct, that is, whether they match the real-world facts they describe. Inaccurate data leads to false conclusions and decisions.
Consistency: Ensures that data matches across different systems or datasets. Inconsistencies can lead to duplicates and contradictory information.
Data Profiling: A systematic process for examining the characteristics of data holdings to understand structure, content, and quality.
Great Expectations: An open-source framework for creating, validating, and documenting data quality expectations. Run regularly, these validations support continuous monitoring.
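The three criteria above can be turned into simple ratios. The following is a minimal sketch using only the Python standard library; the sample records, the function names, and the 0 to 120 age rule are assumptions made for illustration. True accuracy needs a trusted reference to compare against, so the plausibility rule here is only a proxy.

```python
# Hypothetical sample records from two systems; all names and values are illustrative.
crm = [
    {"customer_id": 1, "email": "a@example.com", "age": 34},
    {"customer_id": 2, "email": None, "age": 29},
    {"customer_id": 3, "email": "c@example.com", "age": 210},  # implausible age
]
billing = [
    {"customer_id": 1, "email": "a@example.com"},
    {"customer_id": 3, "email": "c@other.example"},  # differs from the CRM value
]


def completeness(rows, column):
    """Share of rows in which `column` is present and not None."""
    filled = sum(1 for row in rows if row.get(column) is not None)
    return filled / len(rows)


def accuracy(rows, column, is_plausible):
    """Share of non-missing values that pass a plausibility rule (a proxy only)."""
    values = [row[column] for row in rows if row.get(column) is not None]
    return sum(1 for value in values if is_plausible(value)) / len(values)


def consistency(rows_a, rows_b, key, column):
    """Share of keys present in both datasets whose `column` values agree."""
    b_by_key = {row[key]: row for row in rows_b}
    shared = [row for row in rows_a if row[key] in b_by_key]
    agree = sum(1 for row in shared if row[column] == b_by_key[row[key]][column])
    return agree / len(shared)


email_completeness = completeness(crm, "email")
age_accuracy = accuracy(crm, "age", lambda age: 0 <= age <= 120)
email_consistency = consistency(crm, billing, "customer_id", "email")
print(email_completeness, age_accuracy, email_consistency)
```

The functions assume non-empty inputs; a production version would have to decide how to score an empty dataset.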

Architecture Diagram

flowchart LR
    A[Data Source] --> B[Data Profiling]
    B --> C[Great Expectations]
    C --> D[Expectation Definitions]
    C --> E[Data Validation]
    E --> F[Quality Report]
    F --> G[Automated Actions]

Practical Steps

Identify and document data sources. This forms the basis for all subsequent quality analyses.
Perform data profiling using Python libraries to determine statistical metrics, distributions, and anomalies.

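import pandas as pd

# Load the dataset to profile ('datasource.csv' is a placeholder path).
df = pd.read_csv('datasource.csv')
# Summary statistics for the numeric columns: count, mean, std, min, quartiles, max.
print(df.describe())
# Number of missing values per column.
print(df.isnull().sum())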
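Profiling usually goes beyond summary statistics and null counts. As a hedged sketch of the "anomalies" part of this step, the example below counts duplicate rows and flags outliers with the common 1.5 × IQR rule. It uses only the standard library so that it runs without extra dependencies; the inline sample data, the column names, and the `profile` function are assumptions, not part of the module.

```python
import csv
import io
import statistics

# Hypothetical inline sample standing in for 'datasource.csv'.
SAMPLE = """customer_id,amount
1,10.0
2,12.5
2,12.5
3,
4,11.0
5,9.5
6,10.5
7,11.5
8,950.0
"""


def profile(rows):
    """Return basic profile figures for a list of dict rows."""
    amounts = [float(r["amount"]) for r in rows if r["amount"] != ""]
    # Quartiles of the non-missing amounts; the middle cut point is not needed.
    q1, _, q3 = statistics.quantiles(amounts, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "row_count": len(rows),
        "missing_amount": sum(1 for r in rows if r["amount"] == ""),
        # Rows that are exact copies of an earlier row.
        "duplicate_rows": len(rows) - len({tuple(r.items()) for r in rows}),
        # Values outside the 1.5 * IQR fences.
        "amount_outliers": [a for a in amounts if a < low or a > high],
    }


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
report = profile(rows)
print(report)
```

With pandas, comparable figures are available through `df.duplicated().sum()` and `df["amount"].quantile([0.25, 0.75])`.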

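Initialize Great Expectations and set up a data context for your project. The commands below belong to the legacy command-line interface of the 0.x releases; Great Expectations 1.0 and later no longer ship it and create the context from Python with gx.get_context() instead.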

great_expectations init
great_expectations datasource new

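Define expectations for key columns, for example for completeness, data types, or value ranges.

import great_expectations as gx
from great_expectations.core.expectation_configuration import ExpectationConfiguration

# Sketch for the legacy 0.x API (the same generation as the CLI commands above);
# import paths and method names differ in 1.0 and later.
context = gx.get_context()
context.add_or_update_expectation_suite(
    expectation_suite_name="my_expectations",
    expectations=[
        # customer_id must never be missing.
        ExpectationConfiguration(
            expectation_type="expect_column_values_to_not_be_null",
            kwargs={"column": "customer_id"},
        )
    ],
)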
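To make concrete what expectations for completeness, data types, and value ranges actually check, here is a small stand-alone sketch. The expectation names mirror those used by Great Expectations, but the `unexpected_values` evaluator and the sample rows are hypothetical and written for illustration only; this is not how the framework is implemented.

```python
# Illustrative only: a tiny evaluator, not the Great Expectations implementation.
rows = [
    {"customer_id": 1, "age": 34},
    {"customer_id": 2, "age": 151},
    {"customer_id": None, "age": 28},
]

expectations = [
    {"expectation_type": "expect_column_values_to_not_be_null",
     "kwargs": {"column": "customer_id"}},
    {"expectation_type": "expect_column_values_to_be_of_type",
     "kwargs": {"column": "age", "type_": "int"}},
    {"expectation_type": "expect_column_values_to_be_between",
     "kwargs": {"column": "age", "min_value": 0, "max_value": 120}},
]


def unexpected_values(rows, expectation):
    """Return the values in the target column that violate one expectation."""
    kind = expectation["expectation_type"]
    kwargs = expectation["kwargs"]
    values = [row[kwargs["column"]] for row in rows]
    if kind == "expect_column_values_to_not_be_null":
        return [v for v in values if v is None]
    # Type and range checks ignore missing values; the null check covers those.
    present = [v for v in values if v is not None]
    if kind == "expect_column_values_to_be_of_type":
        return [v for v in present if type(v).__name__ != kwargs["type_"]]
    if kind == "expect_column_values_to_be_between":
        return [v for v in present
                if not kwargs["min_value"] <= v <= kwargs["max_value"]]
    raise ValueError(f"unsupported expectation: {kind}")


results = {
    e["expectation_type"]: unexpected_values(rows, e) for e in expectations
}
for name, bad in results.items():
    print(name, "passed" if not bad else f"failed: {bad}")
```

Keeping expectations as data (a type plus keyword arguments) rather than as hard-coded checks is what allows a suite to be stored, versioned, and documented separately from the code that runs it.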

Perform data validation and document the results to identify deviations from defined quality standards.

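# Sketch for the legacy 0.x API: validation runs through a checkpoint that binds a
# batch from "my_datasource" to the "my_expectations" suite. The checkpoint name
# "my_checkpoint" is hypothetical and must have been created beforehand.
validation_result = context.run_checkpoint(checkpoint_name="my_checkpoint")
print(validation_result.success)  # True only if every expectation passed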

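Set up automated workflows for continuous monitoring, for example by running the validation on every pipeline run or on a fixed schedule, so that quality problems surface soon after the data arrives.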
Implement alert mechanisms for critical quality deviations to enable proactive intervention.
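One way to sketch such an alert mechanism is to map the share of failed checks to a severity level and emit a log record that a monitoring system can pick up. The thresholds, the check names, and the `alert_level` and `notify` functions below are assumptions for illustration; how alerts are actually delivered (email, chat, paging) depends on the environment.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("data_quality")

# Hypothetical thresholds: share of failed checks that triggers each level.
WARN_THRESHOLD = 0.0
CRITICAL_THRESHOLD = 0.5


def alert_level(check_results):
    """Map a non-empty dict of {check_name: passed} to a level and failed checks."""
    failed = [name for name, passed in check_results.items() if not passed]
    failure_rate = len(failed) / len(check_results)
    if failure_rate > CRITICAL_THRESHOLD:
        return "critical", failed
    if failure_rate > WARN_THRESHOLD:
        return "warning", failed
    return "ok", failed


def notify(level, failed):
    """Emit a log record; a real setup might page someone or post to a channel."""
    if level == "critical":
        logger.error("Data quality critical, failed checks: %s", failed)
    elif level == "warning":
        logger.warning("Data quality degraded, failed checks: %s", failed)
    else:
        logger.info("All data quality checks passed")


# Hypothetical outcome of one scheduled validation run.
nightly = {
    "customer_id_not_null": True,
    "age_in_range": False,
    "email_format": True,
    "no_duplicates": True,
}
level, failed = alert_level(nightly)
notify(level, failed)
```

Separating the decision (`alert_level`) from the delivery (`notify`) keeps the thresholds testable without sending real notifications.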

Common Pitfalls

Further Resources

Knowledge Check

Four questions for self-assessment, each followed by the correct answer and an explanation.

Which of the following data quality criteria ensures that data matches across different systems?

A) Completeness
B) Consistency
C) Accuracy
D) Validity

Correct Answer: B. Consistency ensures that data matches across different systems or datasets. Completeness refers to the presence of all expected data, accuracy to the correctness of values, and validity to compliance with defined formats and rules.

Which tool is presented in the module as an open-source framework for creating, validating, and documenting data quality expectations?

A) Pandas
B) NumPy
C) Great Expectations
D) SQLAlchemy

Correct Answer: C. Great Expectations is the framework presented in the module for automated monitoring of data quality. Pandas and NumPy are libraries for data manipulation and numerical calculations, and SQLAlchemy is a toolkit for SQL databases.

Which method is described in the module as a systematic process for examining the characteristics of data holdings to understand structure, content, and quality?

A) Data Cleansing
B) Data Profiling
C) Data Modeling
D) Data Aggregation

Correct Answer: B. Data profiling is the systematic process for examining the characteristics of data holdings. Data cleansing refers to the removal of errors, data modeling to structure definition, and data aggregation to the summarization of data.

Which of the following Python libraries is recommended in the module for performing data profiling with statistical metrics and distributions?

A) TensorFlow
B) Matplotlib