Module 5 of 15 · 📖 4 min read · ⏱ 30 min total
FI-DPA 05 Measuring and Ensuring Data Quality
Data quality is the foundation for reliable analyses and well-founded decisions in companies. In this module, you will learn methods for the systematic evaluation and assurance of data quality based on key criteria such as completeness, accuracy, and consistency. You will gain practical knowledge in data profiling and use the Great Expectations framework to automatically monitor and ensure data quality.
Concepts and Background
- Completeness: Evaluates whether all expected data is present. Missing values can lead to incomplete analyses and distorted results.
- Accuracy: Checks whether values are correct and free of errors. Inaccurate data leads to false conclusions and decisions.
- Consistency: Ensures that data matches across different systems or datasets. Inconsistencies can lead to duplicates and contradictory information.
- Data Profiling: A systematic process for examining the characteristics of data holdings to understand structure, content, and quality.
- Great Expectations: An open-source framework for creating, validating, and documenting data quality expectations, which supports continuous monitoring.
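As a rough illustration, each of the three criteria can be turned into a number with pandas. The sketch below uses made-up data: the column names, the second `billing` table, and the 0 to 120 age range are assumptions for the example. The range check is only a plausibility proxy, since measuring accuracy properly requires a trusted reference.

```python
import pandas as pd

# Hypothetical sample data; column names are illustrative only.
crm = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "email": ["a@example.com", None, "c@example.com", "d@example.com"],
    "age": [34, 29, 210, 51],
})
billing = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "email": ["a@example.com", "b@example.com", "c@example.com", "x@example.com"],
})

# Completeness: share of non-missing values per column.
completeness = crm.notna().mean()

# Accuracy (proxy): share of ages inside an assumed plausible range.
accuracy_age = crm["age"].between(0, 120).mean()

# Consistency: share of customers whose email matches in both systems.
merged = crm.merge(billing, on="customer_id", suffixes=("_crm", "_billing"))
consistency_email = (merged["email_crm"] == merged["email_billing"]).mean()

print(completeness)
print("age plausibility:", accuracy_age)
print("email consistency:", consistency_email)
```

Expressing each criterion as a ratio between 0 and 1 makes it easy to track the values over time and to set thresholds.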
Architecture Diagram
flowchart LR
A[Data Source] --> B[Data Profiling]
B --> C[Great Expectations]
C --> D[Expectation Definitions]
C --> E[Data Validation]
E --> F[Quality Report]
F --> G[Automated Actions]
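The flow in the diagram can be sketched without any framework. The following minimal example is not the Great Expectations API. The `Expectation` class, the `validate` function, and the sample columns are hypothetical stand-ins that show how expectation definitions, validation, a report, and a follow-up action fit together.

```python
from dataclasses import dataclass
from typing import Callable

import pandas as pd


@dataclass
class Expectation:
    """A named rule that a DataFrame either satisfies or violates."""
    name: str
    check: Callable[[pd.DataFrame], bool]


def validate(df: pd.DataFrame, expectations: list) -> dict:
    """Run every expectation and collect the outcomes in a small report."""
    results = {e.name: bool(e.check(df)) for e in expectations}
    return {"success": all(results.values()), "results": results}


# Expectation definitions: completeness, data type, and value range.
expectations = [
    Expectation("customer_id is never null",
                lambda df: df["customer_id"].notna().all()),
    Expectation("amount is numeric",
                lambda df: pd.api.types.is_numeric_dtype(df["amount"])),
    Expectation("amount within 0..10000",
                lambda df: df["amount"].between(0, 10_000).all()),
]

# Data source (in-memory stand-in for a real table).
data = pd.DataFrame({"customer_id": [1, 2, None], "amount": [10.0, 250.0, 99.5]})

# Validation and quality report.
report = validate(data, expectations)

# Automated action: here only a message; a real pipeline might stop a load job.
if not report["success"]:
    failed = [name for name, ok in report["results"].items() if not ok]
    print("Quality checks failed:", failed)
```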
Practical Steps
- Identify and document data sources. This forms the basis for all subsequent quality analyses.
- Perform data profiling using Python libraries to determine statistical metrics, distributions, and anomalies.
- Initialize Great Expectations and set up a data context for your project.
- Define expectations for key data, such as for completeness, data types, or value ranges.
- Perform data validation and document the results to identify deviations from defined quality standards.
- Set up automated workflows for continuous monitoring so that quality problems surface soon after new data arrives.
- Implement alert mechanisms for critical quality deviations to enable proactive intervention.
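# Python: profile the data source with pandas.
import pandas as pd

df = pd.read_csv('datasource.csv')
print(df.describe())      # summary statistics for numeric columns
print(df.isnull().sum())  # missing values per column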
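Beyond `describe()` and null counts, a profile often includes data types, distinct counts, duplicate rows, and simple outlier detection. The following is a minimal sketch, assuming a small in-memory table. The helper names `profile` and `iqr_outliers` and the sample columns are illustrative, and the 1.5 × IQR rule is one common convention for flagging anomalies, not the only one.

```python
import pandas as pd


def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row of basic profile metrics per column."""
    rows = []
    for name in df.columns:
        col = df[name]
        numeric = pd.api.types.is_numeric_dtype(col)
        rows.append({
            "column": name,
            "dtype": str(col.dtype),
            "null_fraction": col.isna().mean(),
            "distinct": col.nunique(dropna=True),
            "min": col.min() if numeric else None,
            "max": col.max() if numeric else None,
        })
    return pd.DataFrame(rows).set_index("column")


def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series[(series < q1 - k * iqr) | (series > q3 + k * iqr)]


# Hypothetical sample data with a duplicate row, a missing value, and an outlier.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4, 5],
    "order_total": [20.0, 25.0, 25.0, 22.0, 900.0],
    "country": ["DE", "DE", "DE", None, "AT"],
})

report = profile(df)
duplicate_rows = int(df.duplicated().sum())
outliers = iqr_outliers(df["order_total"])

print(report)
print("duplicate rows:", duplicate_rows)
print("outliers:", outliers.tolist())
```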
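# Shell: initialize a project and register a datasource (legacy CLI of the 0.x releases; removed in 1.0).
great_expectations init
great_expectations datasource new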
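# Python: define an expectation suite and validate data against it.
# Assumes the legacy 0.x API (roughly 0.16 to 0.18) that matches the CLI commands above;
# Great Expectations 1.x changed these calls, so check the docs for your version.
import great_expectations as gx
from great_expectations.core.batch import BatchRequest
from great_expectations.core.expectation_configuration import ExpectationConfiguration

context = gx.get_context()
context.add_or_update_expectation_suite(
    expectation_suite_name="my_expectations",
    expectations=[
        ExpectationConfiguration(
            expectation_type="expect_column_values_to_not_be_null",
            kwargs={"column": "customer_id"},
        )
    ],
)

# The connector and asset names below are assumptions; use the ones from your datasource config.
batch_request = BatchRequest(
    datasource_name="my_datasource",
    data_connector_name="default_inferred_data_connector_name",
    data_asset_name="datasource.csv",
)
validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name="my_expectations",
)
validation_result = validator.validate()
print(validation_result.success)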
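Validation results only help if someone acts on them. The sketch below shows one possible way to turn failed expectations into alerts using the standard `logging` module. It assumes the result has been converted to a plain dictionary with a list of per-expectation `results`; the exact shape depends on the Great Expectations version. `CRITICAL_COLUMNS`, `failed_checks`, and `dispatch_alerts` are hypothetical names.

```python
import logging
from typing import Any

logger = logging.getLogger("data_quality")

# Hypothetical: failures on these columns are treated as critical.
CRITICAL_COLUMNS = {"customer_id"}


def failed_checks(result: dict[str, Any]) -> list[dict[str, Any]]:
    """Extract the expectation type and column of every failed check."""
    failures = []
    for item in result.get("results", []):
        if item.get("success"):
            continue
        config = item.get("expectation_config", {})
        failures.append({
            "expectation": config.get("expectation_type", "unknown"),
            "column": config.get("kwargs", {}).get("column"),
        })
    return failures


def dispatch_alerts(result: dict[str, Any]) -> list[str]:
    """Log every failure and return the messages for critical ones."""
    critical_messages = []
    for failure in failed_checks(result):
        critical = failure["column"] in CRITICAL_COLUMNS
        message = f"{failure['expectation']} failed for column {failure['column']}"
        logger.log(logging.ERROR if critical else logging.WARNING, message)
        if critical:
            critical_messages.append(message)
    return critical_messages


# Assumed result shape, written by hand for the example.
sample_result = {
    "success": False,
    "results": [
        {"success": False,
         "expectation_config": {"expectation_type": "expect_column_values_to_not_be_null",
                                "kwargs": {"column": "customer_id"}}},
        {"success": False,
         "expectation_config": {"expectation_type": "expect_column_values_to_be_between",
                                "kwargs": {"column": "amount", "min_value": 0}}},
        {"success": True,
         "expectation_config": {"expectation_type": "expect_column_values_to_be_unique",
                                "kwargs": {"column": "order_id"}}},
    ],
}

critical_messages = dispatch_alerts(sample_result)
```

In a scheduled job, the returned critical messages could be forwarded to email or a chat channel, while non-critical failures stay in the log for later review.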
Common Pitfalls
Further Resources
- Great Expectations Official Documentation
- Tetrasearch Blog: Data Quality with Great Expectations
- Pandas Documentation for Data Profiling
- The Data Quality Assessment Framework
- Great Expectations Tutorials on GitHub
Knowledge Check
Four questions for self-assessment, each followed by the correct answer and an explanation.
Which of the following data quality criteria ensures that data matches across different systems?
- A) Completeness
- B) Consistency
- C) Accuracy
- D) Validity
Correct Answer: B. Consistency ensures that data matches across different systems or datasets. Completeness refers to the presence of all expected data, accuracy to the correctness of values, and validity is a broader term for compliance with established rules.
Which tool is presented in the module as an open-source framework for creating, validating, and documenting data quality expectations?
- A) Pandas
- B) NumPy
- C) Great Expectations
- D) SQLAlchemy
Correct Answer: C. Great Expectations is the framework presented in the module for automated monitoring of data quality. Pandas and NumPy are libraries for data manipulation and numerical calculations, and SQLAlchemy is a toolkit for SQL databases.
Which method is described in the module as a systematic process for examining the characteristics of data holdings to understand structure, content, and quality?
- A) Data Cleansing
- B) Data Profiling
- C) Data Modeling
- D) Data Aggregation
Correct Answer: B. Data profiling is the systematic process for examining the characteristics of data holdings. Data cleansing refers to the removal of errors, data modeling to structure definition, and data aggregation to the summarization of data.
Which of the following Python libraries is recommended in the module for performing data profiling with statistical metrics and distributions?
- A) TensorFlow
- B) Matplotlib