Module 14 of 15 · 📖 5 min read · ⏱ 30 min total
FI-DPA 14 ML-Pipeline — Data, Training, Evaluation
In this module, you will learn the fundamental concepts and practical steps for creating a machine learning pipeline. You will understand how data is prepared, split into training, validation, and test sets, and how models are trained and evaluated. The focus is on the practical use of scikit-learn and the key metrics for evaluating models.
You will also learn how to avoid common pitfalls in ML projects and how to use methods such as cross-validation and hyperparameter tuning to improve model performance systematically.
Concepts and Background
- Feature Engineering
- The process of transforming and selecting raw features to improve the performance of ML models. This includes normalization, encoding categorical variables, creating new features, and selecting relevant features.
- Train/Validation/Test Split
- The division of the dataset into three disjoint sets: one for training the model, one for tuning hyperparameters and comparing models, and one for the final, unbiased evaluation of model performance.
- Cross-Validation
- A robust method for model evaluation where the dataset is split multiple times into different training and test sets to minimize dependence on a specific split.
- Hyperparameter Tuning
- The process of systematically searching for good settings of a model's hyperparameters, which are fixed before training rather than learned from the data.
- Evaluation Metrics
- Quantitative measures for evaluating the performance of ML models. For classification these include Accuracy, F1-Score (harmonic mean of Precision and Recall), and ROC-AUC (Area Under the ROC Curve); for regression, RMSE (Root Mean Square Error).
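As a small worked illustration of the metrics listed above, the following sketch computes them with scikit-learn and NumPy. The label and score arrays are toy values invented for this example, and the 0.5 decision threshold is an assumption, not something prescribed by the module.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

# Toy binary labels and predicted scores, invented for illustration only
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.45, 0.7, 0.9, 0.6])
y_pred = (y_score >= 0.5).astype(int)  # threshold 0.5 is an assumption

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
# F1 is the harmonic mean of precision and recall
f1_manual = 2 * precision * recall / (precision + recall)

accuracy = accuracy_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_score)  # computed from scores, not hard labels

# RMSE for a toy regression example
y_reg_true = np.array([3.0, 5.0, 2.0])
y_reg_pred = np.array([2.0, 5.0, 4.0])
rmse = np.sqrt(np.mean((y_reg_true - y_reg_pred) ** 2))

print(f"Accuracy={accuracy:.3f}, F1={f1:.3f}, ROC-AUC={auc:.4f}, RMSE={rmse:.3f}")
```

Note that ROC-AUC is calculated from the continuous scores, while Accuracy and F1-Score depend on the chosen threshold.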
Architecture Diagram
flowchart LR
A[Raw Data] --> B[Feature Engineering]
B --> C[Train/Validation/Test Split]
C --> D[Model Training]
D --> E[Hyperparameter Tuning]
E --> F[Model Validation]
F --> G[Model Evaluation]
G --> H[Model Deployment]
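One possible way to express the stages of the diagram in code is to chain preprocessing and model into a single scikit-learn `Pipeline`. The sketch below is a minimal illustration under stated assumptions: the data is synthetic, and the column names `num_a`, `num_b`, `cat`, and `target` are hypothetical. Deployment is left out.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Raw data: small synthetic table with hypothetical column names
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "num_a": rng.normal(size=n),
    "num_b": rng.normal(size=n),
    "cat": rng.choice(["x", "y", "z"], size=n),
})
df["target"] = (df["num_a"] + (df["cat"] == "x") > 0.5).astype(int)

# Split: hold back 20 % as a test set (cross-validation handles validation)
X = df.drop(columns="target")
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Feature engineering: separate preprocessing for numeric and categorical columns
numeric_steps = Pipeline([("impute", SimpleImputer(strategy="median")),
                          ("scale", StandardScaler())])
preprocess = ColumnTransformer([
    ("num", numeric_steps, ["num_a", "num_b"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["cat"]),
])

# Preprocessing and model form one estimator, so each cross-validation fold
# fits the scaler and encoder on its own training part only
pipe = Pipeline([("preprocess", preprocess),
                 ("model", RandomForestClassifier(random_state=0))])

# Hyperparameter tuning and validation via 5-fold cross-validation
search = GridSearchCV(pipe, {"model__max_depth": [None, 5]},
                      cv=5, scoring="accuracy")
search.fit(X_train, y_train)

# Final evaluation on the held-out test set
test_accuracy = search.score(X_test, y_test)
print(search.best_params_, round(test_accuracy, 3))
```

Wrapping the transformers in the pipeline is a design choice that keeps information from the validation folds and the test set out of the fitted preprocessing.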
Practical Steps
- Load and explore data with pandas to gain initial insights into data structure and quality.
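
    import pandas as pd

    # Load the raw data and take a first look at its structure and quality
    data = pd.read_csv('daten.csv')
    print(data.head())
    data.info()  # info() prints its summary itself and returns None

- Perform feature engineering: handle missing values, encode categorical variables, and scale features.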
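
    from sklearn.preprocessing import StandardScaler, OneHotEncoder
    from sklearn.impute import SimpleImputer

    num_cols = ['feature1', 'feature2']

    # Fill missing numeric values (the median is an illustrative choice)
    data[num_cols] = SimpleImputer(strategy='median').fit_transform(data[num_cols])

    # Scale numeric features to zero mean and unit variance
    scaler = StandardScaler()
    data[num_cols] = scaler.fit_transform(data[num_cols])

    # One-hot encode the categorical feature and replace the original column
    encoder = OneHotEncoder(handle_unknown='ignore')
    encoded = encoder.fit_transform(data[['kategorie']]).toarray()
    encoded_cols = encoder.get_feature_names_out(['kategorie'])
    data = pd.concat([data.drop(columns='kategorie'),
                      pd.DataFrame(encoded, columns=encoded_cols, index=data.index)], axis=1)

- Split the data into training, validation, and test sets so that overfitting can be detected on data the model has not seen.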

    from sklearn.model_selection import train_test_split

    X = data.drop('zielvariable', axis=1)
    y = data['zielvariable']

    # First split: 70 % training, 30 % held back
    X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
    # Second split: halve the held-back part into validation and test (15 % each)
    X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

- Train an ML model (e.g., a Random Forest) on the training data.

    from sklearn.ensemble import RandomForestClassifier

    # Baseline model; random_state makes the run reproducible
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)

- Perform hyperparameter tuning with GridSearchCV and cross-validation.

    from sklearn.model_selection import GridSearchCV

    # Candidate values; every combination is evaluated with 5-fold cross-validation
    param_grid = {
        'n_estimators': [50, 100, 200],
        'max_depth': [None, 10, 20, 30]
    }
    grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, scoring='accuracy')
    grid_search.fit(X_train, y_train)  # refits the best configuration on all of X_train

- Evaluate the best model on the validation data and calculate metrics.
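
    from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

    y_pred = grid_search.predict(X_val)
    # ROC-AUC is computed from the predicted probability of the positive class,
    # not from hard labels; this assumes a binary target
    y_proba = grid_search.predict_proba(X_val)[:, 1]
    print(f"Accuracy: {accuracy_score(y_val, y_pred)}")
    print(f"F1-Score: {f1_score(y_val, y_pred)}")
    print(f"ROC-AUC: {roc_auc_score(y_val, y_proba)}")

- Evaluate the final model once on the unused test data to obtain an unbiased performance assessment.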

    # The test set is used only here, after all tuning decisions have been made
    final_predictions = grid_search.predict(X_test)
    print(f"Final Accuracy: {accuracy_score(y_test, final_predictions)}")
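The two-stage split used in the steps above yields roughly 70 % training, 15 % validation, and 15 % test data. The sketch below checks these sizes on synthetic data that stands in for `daten.csv`. The `stratify` argument is an optional addition that is not part of the original steps; it keeps the class proportions similar across the three sets, which can matter for imbalanced targets.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced toy data (roughly 10 % positives), for illustration only
X, y = make_classification(n_samples=1000, n_features=5, weights=[0.9, 0.1],
                           random_state=42)

# First split: 70 % training, 30 % held back
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)
# Second split: halve the held-back part into validation and test
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp)

for name, part in [("train", y_train), ("val", y_val), ("test", y_test)]:
    print(f"{name}: n={len(part)}, positive share={part.mean():.3f}")
```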
Common Pitfalls
Further Resources
- Official scikit-learn User Guide
- Practical Examples and Tutorials from scikit-learn
- Hyperparameter Optimization with scikit-learn
- Feature Engineering Course on Kaggle
- Model Evaluation in scikit-learn
Knowledge Check
Four questions for self-assessment, each followed by the correct answer and an explanation.
What is the main purpose of Feature Engineering in an ML pipeline?
- A) Reducing data size for faster processing
- B) Transforming and selecting features to improve model performance
- C) Fully automating the data process
- D) Eliminating all categorical variables from the dataset
Correct Answer: B. Feature Engineering aims to improve ML model performance through targeted transformation and selection of features, while the other options only represent partial aspects or misinterpretations of this process.
Why is a dataset split into training, validation, and test sets?
- A) To increase the amount of training data available
- B) To ensure the model can be evaluated on unseen data
- C) To reduce computational requirements
- D) To allow for parallel processing of data
Correct Answer: B. The split ensures that the model's performance can be evaluated on data it has never seen during training, providing an unbiased assessment of how well it will perform in real-world scenarios.
What is the primary benefit of using Cross-Validation?
- A) It reduces the time needed for model training
- B) It provides a more robust estimate of model performance
- C) It eliminates the need for a test set
- D) It automatically selects the best model
Correct Answer: B. Cross-Validation provides a more robust estimate of model performance by averaging results across multiple different splits of the data, reducing the impact of how the data is divided.
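A minimal sketch of this averaging, assuming synthetic data from `make_classification` and a Random Forest as the model: `cross_val_score` returns one score per fold, and the mean and standard deviation summarize how much the estimate depends on the split.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic toy data, used only to keep the sketch self-contained
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0)
# 5-fold cross-validation: five train/validate rounds, one score per fold
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")

print("fold scores:", scores.round(3))
print(f"mean={scores.mean():.3f}, std={scores.std():.3f}")
```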
Which metric is most appropriate for evaluating a model on imbalanced data?
- A) Accuracy
- B) F1-Score
- C) Number of features
- D) Training time
Correct Answer: B. For imbalanced data, the F1-Score is more appropriate than Accuracy because it considers both Precision and Recall, providing a better measure of the model's performance on the minority class.
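The difference can be made concrete with a small, invented example: with 95 negatives and 5 positives, a "model" that always predicts the majority class looks strong on Accuracy but scores zero on F1-Score. The labels below are toy values chosen for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Toy labels invented for illustration: 95 negatives, 5 positives
y_true = np.array([0] * 95 + [1] * 5)
# A useless "model" that always predicts the majority class
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
# zero_division=0 returns 0.0 without a warning when no positives are predicted
f1 = f1_score(y_true, y_pred, zero_division=0)

print(f"Accuracy: {accuracy:.2f}")  # 0.95
print(f"F1-Score: {f1:.2f}")        # 0.00
```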