Skip to content

Module 4 of 15 · 📖 4 min read · ⏱ 30 min total

FI-DPA 04 ETL-ELT-Strecken (EN)

Table of contents (6 sections)
  1. Concepts and Background
  2. Architecture Diagram
  3. Practical Steps
  4. Common Pitfalls
  5. Further Resources
  6. Knowledge Check

FI-DPA 04 ETL/ELT Pipelines

ETL/ELT processes form the backbone of modern data integration. In this module, you will learn the concepts for extracting, transforming, and loading data between different systems. You will understand how staging areas are used for quality assurance and why idempotency is crucial in data processing. Practical examples with Airflow, dbt, and Talend will round out your knowledge.

You will be able to design, implement, and maintain ETL/ELT pipelines. The focus is on robust, maintainable, and performant solutions for use in SME environments.

Concepts and Background

ETL (Extract, Transform, Load)
The classic approach where data is first extracted, then transformed in a separate database, and finally loaded into the target system. The transformation occurs before loading into the target system.
ELT (Extract, Load, Transform)
A modern approach where data is loaded directly into the target system (usually a data warehouse) and transformed there. This leverages the scalability of modern data warehouses and enables more flexible transformations.
Staging Area
A temporary storage area that serves as an intermediate station for data during the ETL process. Here, data is cleaned, validated, and prepared before being loaded into the target system.
Idempotency
A property of operations that can be executed multiple times without changing the result. In the ETL context, this means that executing a pipeline multiple times results in the identical outcome, even if data sources have already been processed.
Orchestration
The coordination of various tasks and workflows in an ETL process. Tools like Airflow enable the scheduling, execution, and monitoring of complex data pipelines.

Architecture Diagram

flowchart LR
    A[Datenquellen] --> B(Extraktion)
    B --> C[Staging-Area]
    C --> D[Transformation]
    D --> E[Zielsystem]
    F[Airflow] --> B
    F --> D
    G[dbt] --> D
    H[Talend] --> B
    H --> D

Practical Steps

  1. Install Apache Airflow with pip in a virtual environment. This ensures a clean separation of projects.
  2. python -m venv airflow-env
    source airflow-env/bin/activate
    pip install apache-airflow
  3. Configure the Airflow database with PostgreSQL for improved performance and scalability.
  4. export AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:password@localhost/airflow
    airflow db init
  5. Create a new project in dbt and configure the connection parameters to your staging database.
  6. dbt init my_project
    cd my_project
    vim profiles.yml
  7. Define models in dbt for your transformations using SQL files in the models/ folder.
  8. mkdir -p models/staging
    cat > models/staging/customers.sql <<EOF
    select id, name, email from raw.customers
    where email is not null
    EOF</code>
  9. Create a DAG in Airflow that calls dbt after the data has been extracted.
  10. from airflow.operators.bash import BashOperator
    from airflow import DAG
    from datetime import datetime
    
    with DAG('etl_pipeline', start_date=datetime(2023,1,1)) as dag:
        run_dbt = BashOperator(
            task_id='run_dbt',
            bash_command='dbt run'
        )
  11. Implement a job component in Talend for data cleaning that removes duplicate entries based on a unique ID.
  12. tRowGenerator --component tUniqueRow --component tFilterRow --component tLogRow
  13. Configure retry policies in Airflow for critical tasks to handle transient errors.
  14. run_dbt = BashOperator(
        task_id='run_dbt',
        bash_command='dbt run',
        retries=3,
        retry_delay=timedelta(minutes=5)
    )
  15. Implement tests in dbt to ensure data quality, e.g., uniqueness of IDs.
  16. cat > models/staging/schema.yml <<EOF
    version: 2
    
    models:
      - name: customers
        columns:
          - name: id
            tests:
              - unique
    EOF</code>
  17. Use the sensor operator in Airflow to wait for new files to appear in a directory.
  18. from airflow.sensors.filesystem import FileSensor
    
    wait_for_file = FileSensor(
        task_id='wait_for_file',
        filepath='/data/incoming/*.csv',
        poke_interval=60,
        timeout=3600
    )
  19. Configure a connection to your database in Talend with connection pooling for improved performance.
  20. tDBConnection --component tDBOutput --component tDBCommit

Common Pitfalls

Further Resources

Knowledge Check

Four questions for self-assessment. Click on each question to see the correct answer and explanation.

What is the main difference between ETL and ELT?
  • A) ETL processes data in the cloud, ELT locally
  • B) In ETL, transformation occurs before loading, in ELT after loading
  • C) ETL always uses staging areas, ELT does not
  • D) ELT is only suitable for Big Data environments

Correct Answer: B. The essential difference lies in the timing of the transformation: ETL transforms data before loading into the target system, while ELT loads data first and then transforms it in the target system.