Module 4 of 15 · 📖 4 min read · ⏱ 30 min total
FI-DPA 04 ETL-ELT-Strecken (EN)
Table of contents (6 sections)
FI-DPA 04 ETL/ELT Pipelines
ETL/ELT processes form the backbone of modern data integration. In this module, you will learn the concepts for extracting, transforming, and loading data between different systems. You will understand how staging areas are used for quality assurance and why idempotency is crucial in data processing. Practical examples with Airflow, dbt, and Talend will round out your knowledge.
You will be able to design, implement, and maintain ETL/ELT pipelines. The focus is on robust, maintainable, and performant solutions for use in SME environments.
Concepts and Background
- ETL (Extract, Transform, Load)
- The classic approach where data is first extracted, then transformed in a separate database, and finally loaded into the target system. The transformation occurs before loading into the target system.
- ELT (Extract, Load, Transform)
- A modern approach where data is loaded directly into the target system (usually a data warehouse) and transformed there. This leverages the scalability of modern data warehouses and enables more flexible transformations.
- Staging Area
- A temporary storage area that serves as an intermediate station for data during the ETL process. Here, data is cleaned, validated, and prepared before being loaded into the target system.
- Idempotency
- A property of operations that can be executed multiple times without changing the result. In the ETL context, this means that executing a pipeline multiple times results in the identical outcome, even if data sources have already been processed.
- Orchestration
- The coordination of various tasks and workflows in an ETL process. Tools like Airflow enable the scheduling, execution, and monitoring of complex data pipelines.
Architecture Diagram
flowchart LR
A[Datenquellen] --> B(Extraktion)
B --> C[Staging-Area]
C --> D[Transformation]
D --> E[Zielsystem]
F[Airflow] --> B
F --> D
G[dbt] --> D
H[Talend] --> B
H --> D
Practical Steps
- Install Apache Airflow with pip in a virtual environment. This ensures a clean separation of projects.
- Configure the Airflow database with PostgreSQL for improved performance and scalability.
- Create a new project in dbt and configure the connection parameters to your staging database.
- Define models in dbt for your transformations using SQL files in the models/ folder.
- Create a DAG in Airflow that calls dbt after the data has been extracted.
- Implement a job component in Talend for data cleaning that removes duplicate entries based on a unique ID.
- Configure retry policies in Airflow for critical tasks to handle transient errors.
- Implement tests in dbt to ensure data quality, e.g., uniqueness of IDs.
- Use the sensor operator in Airflow to wait for new files to appear in a directory.
- Configure a connection to your database in Talend with connection pooling for improved performance.
python -m venv airflow-env
source airflow-env/bin/activate
pip install apache-airflow
export AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:password@localhost/airflow
airflow db init
dbt init my_project
cd my_project
vim profiles.yml
mkdir -p models/staging
cat > models/staging/customers.sql <<EOF
select id, name, email from raw.customers
where email is not null
EOF</code>
from airflow.operators.bash import BashOperator
from airflow import DAG
from datetime import datetime
with DAG('etl_pipeline', start_date=datetime(2023,1,1)) as dag:
run_dbt = BashOperator(
task_id='run_dbt',
bash_command='dbt run'
)
tRowGenerator --component tUniqueRow --component tFilterRow --component tLogRow
run_dbt = BashOperator(
task_id='run_dbt',
bash_command='dbt run',
retries=3,
retry_delay=timedelta(minutes=5)
)
cat > models/staging/schema.yml <<EOF
version: 2
models:
- name: customers
columns:
- name: id
tests:
- unique
EOF</code>
from airflow.sensors.filesystem import FileSensor
wait_for_file = FileSensor(
task_id='wait_for_file',
filepath='/data/incoming/*.csv',
poke_interval=60,
timeout=3600
)
tDBConnection --component tDBOutput --component tDBCommit
Common Pitfalls
Further Resources
Knowledge Check
Four questions for self-assessment. Click on each question to see the correct answer and explanation.
What is the main difference between ETL and ELT?
- A) ETL processes data in the cloud, ELT locally
- B) In ETL, transformation occurs before loading, in ELT after loading
- C) ETL always uses staging areas, ELT does not
- D) ELT is only suitable for Big Data environments
Correct Answer: B. The essential difference lies in the timing of the transformation: ETL transforms data before loading into the target system, while ELT loads data first and then transforms it in the target system.