Module 01: Introduction to Machine Learning

1.1 Introduction to Machine Learning
Machine Learning (ML) is the field of study where computers
learn from data to make predictions or decisions without being explicitly
programmed. It powers applications like recommendation systems, fraud
detection, and predictive maintenance.
The concept of ML can be illustrated with the example of predicting the price of a car. The model learns from data consisting of features, such as year and mileage, and the target variable, in this case the car's price, by extracting patterns from the data.
The model is then given new data about cars (features without the target) and predicts their prices (the target).
In summary, ML is a process of extracting patterns from data, which comes in two types:
- features (information about the object) and
- target (the property to predict for unseen objects).
Therefore, new feature values are presented to the model, and it makes predictions from the learned patterns.
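The process above can be sketched with scikit-learn; the feature values and prices below are invented purely for illustration.

```python
# Minimal sketch of the car-price example: learn from features + target,
# then predict the target for unseen feature values.
from sklearn.linear_model import LinearRegression

# Features: [year, mileage]; target: price (hypothetical numbers)
X = [[2015, 60000], [2018, 30000], [2012, 90000], [2020, 15000]]
y = [8000, 15000, 5000, 22000]

model = LinearRegression()
model.fit(X, y)  # extract patterns from the features and target

# Present new feature values (no target) and get a prediction
new_car = [[2017, 45000]]
predicted_price = model.predict(new_car)
print(predicted_price)
```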
1.2 ML vs Rule-Based Systems
- Rule-Based Systems: operate using predefined rules crafted by humans, often leading to rigid, non-scalable solutions. A traditional rule-based spam filter, for example, relies on a set of characteristics (keywords, email length, etc.) to identify an email as spam or not. As spam emails keep changing over time, the system needs constant upgrades, which becomes intractable because of the complexity of maintaining the code as the system grows.
- Machine Learning: Learns patterns from data, allowing
models to generalize better, adapt to new inputs, and improve over time.
ML can be used to solve this problem with the following
steps:
1. Get data
Emails from the user's spam folder and inbox give examples of spam and non-spam.
2. Define and calculate features
Rules/characteristics from rule-based systems can be used as a starting point to define features for the ML model. The value of the target variable for each email can be defined based on where the email was obtained from (spam folder or inbox).
Each email can be encoded (converted) to the values of its features and target.
3. Train and use the model
A machine learning algorithm can then be applied to the encoded emails to build a model that can predict whether a new email is spam or not spam. The predictions are probabilities, and to make a decision it is necessary to define a threshold to classify emails as spam or not spam.
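The three steps above can be sketched as follows; the encoded features and labels are toy values, and a logistic regression model stands in for whichever algorithm is used.

```python
# Toy sketch of the spam workflow: encoded emails -> model -> probabilities
# -> threshold -> decision. All data here is invented for illustration.
from sklearn.linear_model import LogisticRegression

# Each email encoded as [contains_keyword, length_over_limit]; target: 1 = spam
X = [[1, 1], [1, 0], [0, 1], [0, 0], [1, 1], [0, 0]]
y = [1, 1, 0, 0, 1, 0]

model = LogisticRegression()
model.fit(X, y)

# Predictions are probabilities; a threshold turns them into decisions
probs = model.predict_proba([[1, 1], [0, 0]])[:, 1]
threshold = 0.5
decisions = ["spam" if p >= threshold else "not spam" for p in probs]
print(decisions)
```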
1.3 Supervised Machine Learning
Supervised learning is a subset of ML where the model learns from labeled data: in Supervised Machine Learning (SML) there are always labels (targets) associated with the features.
The model is trained on these feature-target pairs and can then make predictions for new features.
Feature matrix (X): made of observations or objects
(rows) and features (columns).
Target variable (y): a vector with the target information
we want to predict. For each row of X there's a value in y.
The model can be represented as a function g that takes the matrix X as input and tries to predict values as close as possible to the targets y. Obtaining the function g is what is called training.
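The notation can be made concrete with NumPy; the values and the weights of the "trained" function g below are made up so that the example works out exactly.

```python
# Illustrating the notation: g takes the feature matrix X and returns
# predictions that should approximate the target vector y.
import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])       # 3 observations (rows), 2 features (columns)
y = np.array([5.0, 11.0, 17.0])  # one target value per row of X

def g(X, w=np.array([1.0, 2.0])):
    # a hypothetical trained model: a simple linear function of the features
    return X @ w

predictions = g(X)
print(predictions)  # [ 5. 11. 17.] -- matches y exactly in this toy case
```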
Types of SML problems
- Regression: predicting continuous values (e.g., house prices, a car's price).
- Classification: predicting categorical labels (e.g., spam vs. non-spam emails).
  - Binary: there are two categories.
  - Multiclass: there are more than two categories.
- Ranking: the output is the top scores associated with corresponding items. It is applied in recommender systems.
In summary, SML is about teaching the model by showing different examples, and the goal is to come up with a function that takes the feature matrix as a parameter and makes predictions as close as possible to the y targets.
1.4 CRISP-DM (Cross-Industry Standard Process for Data Mining)
CRISP-DM is a methodology used in data science and ML
projects, involving six phases:
1. Business understanding
An important question is whether ML is needed for the project at all. The goal of the project has to be measurable.
2. Data understanding
Analyse the available data sources and decide whether more data is required.
3. Data preparation
Clean the data, remove noise by applying pipelines, and convert the data to a tabular format so it can be fed into an ML model.
4. Modelling
Train different models and choose the best one. Based on the results of this step, it may be appropriate to go back and add new features or fix data issues.
5. Evaluation
Measure how well the model performs and whether it solves the business problem.
6. Deployment
Roll the model out to production for all users. Evaluation and deployment often happen together (online evaluation).
1.5 The Modelling Step (Model Selection Process)
Key steps in the modelling process include:
Which model to choose?
- Logistic regression
- Decision tree
- Neural network
- Or many others
The validation dataset is not used in training. There are feature matrices and y vectors for both training and validation datasets. The model is fitted with training data, and it is used to predict the y values of the validation feature matrix. Then, the predicted y values (probabilities) are compared with the actual y values.
Multiple comparisons problem (MCP): just by chance, one model can get lucky and produce good predictions, because all of them are probabilistic.
The test set helps to avoid the MCP: the best model is selected using the training and validation datasets, while the test dataset is used to ensure that the proposed best model really is the best.
1. Split the dataset into training, validation, and test sets, e.g. 60%, 20%, and 20% respectively
2. Train the models
3. Evaluate the models
4. Select the best model
5. Apply the best model to the test dataset
6. Compare the performance metrics of validation and test
NB: It is possible to reuse the validation data. After selecting the best model (step 4), the validation and training datasets can be combined into a single training dataset for the chosen model before testing it on the test set.
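The 60/20/20 split can be sketched with scikit-learn's `train_test_split` applied twice; the dataset here is synthetic and exists only to show the proportions.

```python
# Sketch of a 60/20/20 train/validation/test split on synthetic data.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)  # 50 synthetic observations, 2 features
y = np.arange(50)

# First hold out 20% as the test set, then take 25% of the remainder
# for validation (0.25 * 0.8 = 0.2 of the full dataset).
X_full_train, X_test, y_full_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(
    X_full_train, y_full_train, test_size=0.25, random_state=1)

print(len(X_train), len(X_val), len(X_test))  # 30 10 10
```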
1.6 Setting up the Environment
We need:
- Python 3.11
- NumPy, Pandas and Scikit-Learn (latest available
versions)
- Matplotlib and Seaborn
- Jupyter notebooks (for interactive coding and
documentation)
Create the environment for the course:
- Install Anaconda
- Create the ml-zoomcamp environment:
  conda create -n ml-zoomcamp python=3.11
- Activate the environment:
  conda activate ml-zoomcamp
- Install the libraries:
  conda install numpy pandas scikit-learn seaborn jupyter
1.7 Introduction to NumPy
NumPy is the foundational package for numerical computing in Python, providing support for large, multi-dimensional arrays and matrices, along with a wide range of mathematical operations.
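A few basic operations illustrate the array support described above; the values are arbitrary examples.

```python
# Basic NumPy usage: array creation and vectorized operations.
import numpy as np

a = np.array([1, 2, 3, 4])  # one-dimensional array
b = np.zeros((2, 3))        # 2x3 matrix of zeros

print(a * 2)       # element-wise multiplication: [2 4 6 8]
print(a.mean())    # 2.5
print(b.shape)     # (2, 3)
```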
1.8 Linear Algebra Refresher
Linear algebra is essential in ML for operations involving vectors and matrices. Key concepts include:
- Vector operations
- Multiplication
  - Vector-vector multiplication
  - Matrix-vector multiplication
  - Matrix-matrix multiplication
- Identity matrix
- Inverse
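The operations listed above can all be expressed with NumPy; the vectors and matrices below are small examples chosen for easy checking.

```python
# The linear algebra operations from the refresher, in NumPy.
import numpy as np

u = np.array([1.0, 2.0])
v = np.array([3.0, 4.0])
A = np.array([[1.0, 0.0],
              [0.0, 2.0]])
B = np.array([[2.0, 1.0],
              [1.0, 2.0]])

dot = u.dot(v)            # vector-vector multiplication: 1*3 + 2*4 = 11
Av = A.dot(v)             # matrix-vector multiplication: [3, 8]
AB = A.dot(B)             # matrix-matrix multiplication
I = np.eye(2)             # 2x2 identity matrix
B_inv = np.linalg.inv(B)  # inverse: B @ B_inv gives the identity

print(dot, Av, np.allclose(B.dot(B_inv), I))
```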
1.9 Introduction to Pandas
Pandas is a powerful Python library for data manipulation
and analysis.
It introduces two main data structures:
- Series: one-dimensional labeled arrays.
- DataFrames: two-dimensional, tabular data (like Excel spreadsheets).
Pandas simplifies data cleaning, exploration, and manipulation tasks.
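The two data structures can be sketched as follows; the car data is invented for illustration.

```python
# A Series (one-dimensional) and a DataFrame (two-dimensional, tabular).
import pandas as pd

s = pd.Series([10, 20, 30], name="mileage")  # one-dimensional

df = pd.DataFrame({
    "year": [2015, 2018, 2020],
    "mileage": [60000, 30000, 15000],
    "price": [8000, 15000, 22000],
})  # two-dimensional, tabular

print(df[df["year"] > 2016])  # row filtering
print(df["price"].mean())     # column statistics: 15000.0
```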
1.10 Summary
This module introduced the basics of ML, the differences
between rule-based systems and ML, supervised learning, CRISP-DM, and essential
tools like NumPy and Pandas.
A foundational understanding of these topics is crucial for moving forward in ML.