Module 12 of 13 · 📖 4 min read · ⏱ 30 min total
FI-DPA 13 Machine Learning — Fundamentals and Algorithms
Module 13 covers the fundamental concepts of machine learning, including the distinction between supervised and unsupervised learning. You will learn the most important algorithms for regression, classification, and clustering, as well as the bias-variance dilemma and principal component analysis (PCA).
These concepts are demonstrated with typical algorithms such as decision trees, random forests, k-NN, k-means, and PCA. After completing this module, you will be able to evaluate and apply suitable ML methods for a given problem.
Concepts and Background
- Supervised Learning: Supervised learning uses labeled training data, in which each input is paired with the correct output. The goal is to learn a function that predicts the correct output for new, unseen inputs. Classification and regression are the two main forms.
- Unsupervised Learning: Unsupervised learning works with unlabeled data and looks for hidden patterns or structures without being told what the correct output is. Typical applications are clustering and dimensionality reduction.
- Regression: Regression is a form of supervised learning in which the goal is to predict a continuous value. Examples include predicting prices or temperatures.
- Classification: Classification is a form of supervised learning in which each data point is assigned to one of a set of predefined categories. Examples include spam email detection or disease diagnosis.
- Clustering: Clustering is a form of unsupervised learning in which similar data points are grouped into clusters. The goal is to discover structure in the data without predefined labels.
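The difference between these task types can be seen in a minimal sketch with scikit-learn. The synthetic data from make_blobs, the artificial regression target, and the choice of k-NN, linear regression, and k-means are assumptions made for illustration only; the point is which inputs each estimator receives and what kind of output it returns.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data (assumption for illustration): three groups of points in 2D.
X, y = make_blobs(n_samples=150, centers=3, cluster_std=0.6, random_state=0)

# Classification (supervised): the labels y are provided during training.
clf = KNeighborsClassifier(n_neighbors=5).fit(X, y)
predicted_class = clf.predict(X[:1])  # a discrete category

# Regression (supervised): the target is a continuous number.
# The target below is an artificial linear function of the first feature plus noise.
rng = np.random.default_rng(0)
target = 2.0 * X[:, 0] + 1.0 + rng.normal(scale=0.1, size=len(X))
reg = LinearRegression().fit(X[:, [0]], target)
predicted_value = reg.predict(X[:1, [0]])  # a continuous value

# Clustering (unsupervised): only X is passed, no labels at all.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
cluster_ids = km.labels_  # group index per point, discovered from the data
```

Note that the cluster indices returned by k-means are arbitrary: cluster 0 need not correspond to class 0 in y, because the algorithm never sees the labels.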
Practical Steps
- Prepare data: Load your dataset from a suitable format (e.g., CSV), then handle missing values and encode categorical variables. Model quality depends heavily on this preprocessing.
- Split data into training and test sets: Use the train_test_split function from scikit-learn to divide your data into a training set and a test set. Evaluating on data the model has never seen gives an honest estimate of its performance.
- Select and initialize model: Choose an appropriate algorithm for your problem (e.g., RandomForestClassifier for classification) and initialize the model with suitable parameters. The right choice depends heavily on the nature of your data and the problem.
- Train model: Fit the model to the training data by calling its fit method. During this process, the model learns the underlying patterns in the data.
- Evaluate model: Use metrics such as accuracy, precision, or F1-score to measure the model's performance on the test set. This indicates how well the model generalizes to new data.
- Optimize model: Use techniques like GridSearchCV to tune the model's hyperparameters with cross-validation on the training data. Careful tuning can noticeably improve model performance.
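The six steps above can be sketched end to end as follows. This is one possible arrangement, not a prescribed one: the small generated DataFrame stands in for a real CSV file, and the column names (area, region, label), the imputation strategy, and the parameter grid are all hypothetical choices made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# 1. Prepare data. A generated table replaces pd.read_csv(...) here (assumption).
rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({
    "area": rng.normal(100.0, 20.0, n),
    "region": rng.choice(["north", "south", "west"], n),
})
df["label"] = ((df["area"] > 100) | (df["region"] == "south")).astype(int)
df.loc[df.index[::25], "area"] = np.nan  # simulate missing values

X = df[["area", "region"]]
y = df["label"]

# Fill missing numbers with the median and one-hot encode the categorical column.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["area"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["region"]),
])

# 2. Split into training and test sets (stratified to keep class proportions).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# 3. Select and initialize the model; preprocessing is part of the pipeline,
#    so it is fitted on the training data only.
model = Pipeline([
    ("preprocess", preprocess),
    ("forest", RandomForestClassifier(n_estimators=100, random_state=42)),
])

# 4. Train.
model.fit(X_train, y_train)

# 5. Evaluate on the held-out test set.
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# 6. Optimize hyperparameters with cross-validation on the training set.
search = GridSearchCV(
    model,
    param_grid={
        "forest__n_estimators": [50, 100],
        "forest__max_depth": [None, 5],
    },
    cv=5,
    scoring="f1",
)
search.fit(X_train, y_train)
tuned_accuracy = accuracy_score(y_test, search.predict(X_test))
```

Wrapping the preprocessing and the classifier in one Pipeline is a design choice: it keeps statistics such as the imputation median from being computed on test data, and it lets GridSearchCV repeat the preprocessing inside every cross-validation fold.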
Common Pitfalls
Further Resources
- Scikit-learn User Guide - Official Documentation
- Machine Learning by Andrew Ng (Coursera)
- Python Machine Learning Book - 3rd Edition
- TensorFlow Tutorials - Deep Learning with TensorFlow
Knowledge Check
Four questions for self-assessment. Each question is followed by the correct answer and an explanation.
What is the main difference between supervised and unsupervised learning?
- A) Supervised learning always uses neural networks, unsupervised learning does not
- B) Supervised learning requires labeled data, unsupervised learning works with unlabeled data
- C) Supervised learning is always more accurate than unsupervised learning
- D) Supervised learning can only work with numerical data, unsupervised learning can also work with categorical data
Correct Answer: B. The key difference is that supervised learning uses labeled data, while unsupervised learning works without predefined labels. Option A is incorrect because both paradigms include many different algorithms, not only neural networks. Option C does not hold in general because accuracy depends on the problem. Option D is incorrect because both paradigms can work with different data types.
Which category of machine learning does predicting house prices based on features like size, location, and year of construction belong to?
- A) Classification
- B) Clustering
- C) Regression
- D) Principal Component Analysis
Correct Answer: C. Regression predicts continuous values such as prices. Classification is incorrect because it assigns data to discrete categories. Clustering is an unsupervised method, and PCA is used for dimensionality reduction, not prediction.
What problem arises when a machine learning model is too closely fitted to the training data?
- A) Underfitting
- B) Overfitting
- C) The Bias-Variance Dilemma
- D) The Problem of High Dimensionality
Correct Answer: B.