This note is my beginner-friendly master note for machine learning. I am writing it to connect the big picture of ML with the actual project notes I already have in my brain system.
My goal is not just to memorize model names. I want to understand what machine learning is trying to do, how the major model families differ, what the end-to-end workflow looks like, what the math intuition is, how model validation works, where ethics enters the process, and how all of this connects back to my projects in credit risk, churn, attrition, household classification, and NLP.
The Note at a Glance
This note gives me one connected path through machine learning:
Business problem
→ define target and success metric
→ inspect data and its quality
→ engineer useful features
→ choose a model family
→ split train / validation / test data
→ train the model
→ tune hyperparameters
→ validate properly
→ interpret the results
→ monitor and improve the model after deployment

This note is important because machine learning is not one algorithm. It is a full decision-making workflow.
At the highest level:
- supervised learning learns from labeled examples
- unsupervised learning finds structure without labels
- reinforcement learning learns through reward and penalty
- deep learning uses layered neural networks to learn complex patterns
The rest of the note helps me understand when each category makes sense and how the core models inside them work.
Why This Note Matters for My Other Notes
This note sits above several of my project notes and helps me organize them conceptually:
- in 1_lending_club_credit_risk_masterclass, I use supervised classification for PD, and later extend the framework into LGD, EAD, expected loss, and monitoring
- in 5_bank_churn_neural_networks_masterclass, I move from classical tabular modeling into neural networks, threshold tuning, and recall-focused classification
- in 6_employee_retention_tree_models_masterclass, I use decision trees and random forests, which are core ensemble ideas explained in this note
- in 7_socio_economic_household_classification_masterclass, I use XGBoost, which is one of the most important boosting models in modern tabular ML
- in 8_twitter_sentiment_nlp_masterclass, I work with text features and classification pipelines, which still follow the same train / validate / evaluate logic
- in 2_regression_analysis_masterclass, I build the statistical foundation for linear and logistic regression, which are still essential benchmark models in ML
So this note is like a bridge:
statistics foundation
→ machine learning workflow
→ project-specific application

Part 1 — What Machine Learning Actually Is
1.1 The core idea
Machine learning is a way of teaching computer systems to detect patterns from data and use those patterns to make predictions, classifications, recommendations, or groupings.
Traditional programming often looks like this:
Rules + Input Data → Output

Machine learning flips the logic:
Historical Data + Known Outcomes → Learned Pattern
Learned Pattern + New Data → Predicted Output

So instead of explicitly writing every rule by hand, I give the algorithm examples and let it learn the structure.
1.2 What machine learning is not
Machine learning is not magic.
It is still built on:
- data quality
- assumptions
- mathematical optimization
- evaluation choices
- human judgment
- business constraints
If the data is biased, the model can learn bias.
If the labels are noisy, the model can learn noise.
If I choose the wrong metric, I can optimize the wrong business outcome.
That is why ML is as much about process discipline as it is about algorithms.
1.3 The main learning categories
Supervised learning
In supervised learning, I have:
- input features `X`
- known target labels `y`
The algorithm learns the relationship between them.
Examples:
- default vs non-default
- churn vs non-churn
- fraud vs non-fraud
- sentiment positive vs negative
- revenue prediction
Supervised learning is divided into two broad tasks:
Regression
Used when the outcome is a continuous number.
Examples:
- house price
- sales amount
- recovery rate
- expected wait time
Classification
Used when the outcome is a category or class.
Examples:
- yes / no
- churn / no churn
- spam / not spam
- urban / rural
Unsupervised learning
In unsupervised learning, I do not have target labels.
The algorithm tries to discover hidden structure in the data.
Examples:
- customer segmentation
- grouping similar borrowers
- anomaly detection
- topic grouping
The model is not told the “correct answer.” It is only told to find structure.
Reinforcement learning
In reinforcement learning, an agent learns by interacting with an environment and receiving:
- rewards for good actions
- penalties for bad actions
This is common in:
- robotics
- game playing
- dynamic decision systems
- sequential control problems
This is not the core style of ML used in my current notes, but I should know what it is.
Deep learning
Deep learning is a subfield of machine learning that uses multi-layer neural networks.
It is especially strong in:
- image tasks
- speech tasks
- text tasks
- very large and complex nonlinear pattern recognition
In my own notes, the clearest related project is 5_bank_churn_neural_networks_masterclass, where I use a neural network classifier on structured tabular data.
Part 2 — Types of Data and Why They Matter
Model choice begins with understanding the outcome variable and the predictors.
2.1 Continuous data
Continuous variables can take infinitely many values within a range.
Examples:
- salary
- height
- revenue
- loss amount
These often call for regression models.
2.2 Categorical or discrete data
Categorical variables represent groups or labels.
Examples:
- default / non-default
- churn / stay
- department name
- sentiment class
These often call for classification models or categorical analysis.
2.3 Why this matters
The type of target drives the model family.
Continuous outcome → regression
Categorical outcome → classification
No labels → clustering / unsupervised learning

That is one of the first things I should determine in any ML problem.
Part 3 — The Machine Learning Workflow
A strong ML process is more important than memorizing algorithms.
I can think of it as a practical modeling pipeline.
3.1 Plan
This means defining:
- the business problem
- the target variable
- the success metric
- the modeling constraints
- the deployment context
Examples:
- In churn, maybe recall matters because missing an at-risk customer is expensive.
- In credit risk, missing a bad borrower can be more dangerous than rejecting a good one.
- In HR attrition, interpretability may matter because managers need to understand the drivers.
So before building the model, I must know what “good performance” actually means.
3.2 Analyze
This includes:
- exploratory data analysis
- missing values
- outliers
- class imbalance
- feature relationships
- feature quality
This is where I decide whether my raw data is even ready for modeling.
3.3 Construct
This includes:
- train / validation / test splits
- preprocessing pipeline
- feature engineering
- model selection
- hyperparameter tuning
This is where the model is actually built.
3.4 Execute
This includes:
- evaluation
- interpretation
- communication
- deployment thinking
- monitoring
This is where the model becomes useful to the business.
Part 4 — Feature Engineering: How Data Becomes Model Input
A huge part of ML performance comes from feature engineering.
4.1 Feature selection
Feature selection means choosing which variables should enter the model.
I may remove:
- ID columns
- names
- leakage variables
- columns with almost no information
- columns that would not exist at prediction time
Examples:
- `CustomerId` usually should not drive a churn model
- `Surname` is usually not a stable predictive feature
4.2 Feature transformation
This means changing variables into a more usable form.
Examples:
- standardization
- min-max scaling
- one-hot encoding
- log transformation
- binning
Some algorithms care a lot about scaling. Others do not.
For example:
- Naive Bayes and distance-based methods can benefit from scaled inputs
- tree-based methods usually do not need scaling
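To make this concrete, here is a minimal preprocessing sketch. The column names are hypothetical, not taken from any of my project datasets:

```python
# Hedged sketch: scale numeric features and one-hot encode a categorical one.
# The columns here (salary, tenure, department) are made-up examples.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

df = pd.DataFrame({
    'salary': [42000, 58000, 61000, 39000],
    'tenure': [2, 7, 5, 1],
    'department': ['sales', 'tech', 'tech', 'hr'],
})

preprocess = ColumnTransformer([
    ('num', StandardScaler(), ['salary', 'tenure']),                  # standardization
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['department']),  # one-hot encoding
])

X_ready = preprocess.fit_transform(df)
print(X_ready)
```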
4.3 Feature extraction
This means creating new variables from existing variables.
Examples:
- `tenure / age`
- `debt / income`
- `monthly_hours > threshold`
- interaction terms
This can make the model see structure that was hidden in the raw columns.
4.4 Class imbalance
In many real-world classification problems, one class is much rarer than the other.
Examples:
- fraud is rare
- default is less common than non-default
- churn may be a minority class
If I do nothing, the model may learn to predict the majority class almost all the time.
Common fixes
Downsampling
Reduce the majority class size.
Upsampling
Increase the minority class size by duplication or synthetic generation.
Class weights
Tell the algorithm to penalize mistakes on the minority class more heavily.
Threshold tuning
Keep the model the same, but change the decision threshold.
This matters because sometimes the real issue is not the model family but the classification threshold.
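A minimal sketch of two of these fixes, using a synthetic imbalanced dataset rather than any of my project data:

```python
# Hedged sketch: class weights at fit time, threshold tuning at decision time.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# synthetic data where the positive class is only ~10% of observations
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# fix 1: penalize mistakes on the minority class more heavily
model = LogisticRegression(class_weight='balanced', max_iter=1000)
model.fit(X_train, y_train)

# fix 2: keep the model, change the decision threshold from the default 0.5
proba = model.predict_proba(X_test)[:, 1]
for threshold in [0.5, 0.3]:
    preds = (proba >= threshold).astype(int)
    print(threshold, recall_score(y_test, preds))
```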
Part 5 — Recommendation Systems: A Useful Side Branch
Even though recommendation systems are not the center of my current project notes, I should know the two broad families.
5.1 Content-based filtering
This recommends items based on item attributes.
Example:
- if I like a movie with a certain genre, director, and pace, the system recommends similar movies with similar content attributes
The model focuses on the item’s own features.
5.2 Collaborative filtering
This recommends items based on user behaviour patterns.
Example:
- users who liked what I liked also liked these other items
The model focuses on interaction patterns across many users.
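A toy sketch of the collaborative idea, with a made-up ratings matrix (this illustrates the intuition only, not a production recommender):

```python
# Hedged sketch: user-based collaborative filtering with cosine similarity.
import numpy as np

# rows = users, columns = items; 0 means "not rated yet"
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 4, 1],
    [1, 1, 5, 4],
])

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

target_user, target_item = 0, 2  # predict user 0's rating for item 2

# similarity of the target user to the other users
sims = np.array([cosine(ratings[target_user], ratings[other]) for other in [1, 2]])
neighbour_ratings = ratings[[1, 2], target_item]

# similarity-weighted average of the neighbours' ratings for that item
predicted = sims @ neighbour_ratings / sims.sum()
print(predicted)
```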
Part 6 — Ethics, Bias, and Explainability
This part is extremely important.
6.1 Garbage in, garbage out
If I train on biased, incomplete, or low-quality data, the model can learn those flaws.
Examples:
- historical lending data may reflect past human bias
- hiring data may reflect biased selection processes
- churn labels may be noisy or incomplete
So a model is not automatically objective just because it is mathematical.
6.2 Class imbalance and unfair outcomes
If one class is rare, the model may ignore it.
If a subgroup is underrepresented, the model may perform worse on that subgroup.
So I should not evaluate only overall accuracy. I should also ask:
- who is being missed?
- which subgroup is getting more false positives?
- which subgroup is getting more false negatives?
6.3 Explainability
Some models are easy to explain.
Some models are much harder to explain.
More interpretable
- linear regression
- logistic regression
- small decision trees
Less interpretable
- large random forests
- gradient boosting
- XGBoost
- deep neural networks
This matters because high-stakes domains like finance, healthcare, and regulation often need strong interpretability and documentation.
That is one reason my 1_lending_club_credit_risk_masterclass note uses logistic regression so centrally.
Part 7 — Naive Bayes
Naive Bayes is one of the simplest probabilistic classifiers.
7.1 The core idea
It is based on Bayes’ theorem.
At a high level, Bayes’ theorem updates my belief about a class after seeing evidence.
The model asks:
Given these feature values, how likely is this observation to belong to class A versus class B?
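In symbols, the standard form of Bayes' theorem for a class C given observed features x is:

```latex
P(C \mid x) = \frac{P(x \mid C)\, P(C)}{P(x)}
```

The classifier computes this posterior for each class and predicts the class with the highest value.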
7.2 Why it is called “naive”
It assumes that the predictors are conditionally independent given the class.
That is often unrealistic in the real world.
For example:
- salary and credit score may be related
- balance and utilization may be related
- age and tenure may be related
But Naive Bayes can still perform surprisingly well even when the assumption is not perfectly true.
7.3 Gaussian Naive Bayes
Gaussian Naive Bayes is used when I assume the features follow roughly normal distributions within each class.
The model estimates class-conditional distributions and then applies Bayes’ theorem.
7.4 When it is useful
Naive Bayes is often useful when:
- I want a fast baseline model
- the dataset is not huge
- the features are reasonably well-behaved
- I want a quick benchmark for classification
It is especially well known in text classification.
7.5 Why scaling can matter here
When features exist on very different scales, some algorithms become unstable or hard to compare across dimensions.
Using a method like Min-Max scaling can help bring the features onto a comparable numeric range.
Example:

```python
from sklearn.preprocessing import MinMaxScaler
from sklearn.naive_bayes import GaussianNB

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = GaussianNB()
model.fit(X_train_scaled, y_train)
```

This does not mean scaling is always mandatory for every Naive Bayes setup, but it is often a practical preprocessing step in educational workflows.
Part 8 — Unsupervised Learning with K-Means Clustering
K-Means is one of the most common clustering algorithms.
8.1 The core goal
K-Means tries to divide the data into K clusters such that points inside a cluster are similar to one another and different from points in other clusters.
It is an unsupervised learning method.
That means I do not have a target column telling me the correct cluster.
8.2 The algorithm intuition
The algorithm works like this:
- choose `K` starting centroids
- assign each point to its nearest centroid
- recompute each centroid as the mean of the points in its cluster
- repeat until the centroids stop moving much
That final state is called convergence.
8.3 Why scaling matters a lot here
K-Means is distance-based.
So if one variable is on a huge scale and another is on a tiny scale, the bigger one can dominate the cluster structure.
That is why scaling is usually very important for K-Means.
8.4 How to choose K
This is one of the most common beginner questions.
Inertia
Inertia measures how tightly packed the points are inside their assigned clusters.
Lower inertia is better, but it almost always decreases as I add more clusters.
So inertia alone cannot choose the best K.
Elbow method
I plot inertia for many values of K.
I then look for the “elbow,” where adding another cluster stops giving a major improvement.
That point is often a reasonable compromise between simplicity and fit.
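A minimal elbow-method sketch, assuming `X_scaled` is an already-scaled feature matrix like the one in the skeleton below:

```python
# Hedged sketch: compute inertia across K values and look for the bend.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

inertias = []
k_values = range(1, 11)
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(X_scaled)
    inertias.append(km.inertia_)  # within-cluster sum of squared distances

plt.plot(k_values, inertias, marker='o')
plt.xlabel('K')
plt.ylabel('Inertia')
plt.show()
```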
Silhouette score
The silhouette score measures both:
- how close each point is to its own cluster
- how far it is from other clusters
It ranges roughly from -1 to 1.
- closer to `1` is better
- near `0` means overlap
- below `0` suggests poor assignments
8.5 Small code skeleton
```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

kmeans = KMeans(n_clusters=4, random_state=42)
clusters = kmeans.fit_predict(X_scaled)

score = silhouette_score(X_scaled, clusters)
print(score)
```

Part 9 — Decision Trees
Decision trees are one of the most intuitive ML models.
9.1 Core idea
A decision tree splits the data step by step using features that best separate the outcome.
It behaves like a flowchart:
Is credit score < threshold?
→ yes / no
Is income < threshold?
→ yes / no
...
→ final prediction

9.2 Why they are attractive
Decision trees are attractive because they:
- are easy to visualize
- handle nonlinear relationships
- can capture interactions automatically
- need less preprocessing than many other models
- are often robust to outliers
This is why they were so natural in my 6_employee_retention_tree_models_masterclass note.
9.3 Why they can fail
A single decision tree can overfit very easily.
That means it can memorize noise in the training data and perform badly on new data.
That is why tree-based systems often need hyperparameter control.
9.4 Important hyperparameters
Examples:
- `max_depth`
- `min_samples_split`
- `min_samples_leaf`
- `max_features`
These determine how flexible or constrained the tree becomes.
A shallower tree is simpler and more stable.
A very deep tree can become too specific to the training data.
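A quick sketch of the difference, assuming `X_train`, `y_train`, `X_test`, `y_test` already exist from a prior split:

```python
# Hedged sketch: an unconstrained tree versus a deliberately constrained one.
from sklearn.tree import DecisionTreeClassifier

deep_tree = DecisionTreeClassifier(random_state=42)  # fully grown, can overfit
shallow_tree = DecisionTreeClassifier(
    max_depth=4, min_samples_leaf=5, random_state=42  # simpler, more stable
)

for name, tree in [('deep', deep_tree), ('shallow', shallow_tree)]:
    tree.fit(X_train, y_train)
    # a large gap between train and test scores is a sign of overfitting
    print(name, tree.score(X_train, y_train), tree.score(X_test, y_test))
```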
Part 10 — Hyperparameter Tuning and Grid Search
Hyperparameters are settings chosen before training.
They are not learned directly from the data in the same way model coefficients are.
10.1 Why tuning matters
A model family may be the right choice in principle but perform poorly because its hyperparameters are badly set.
For example:
- a tree may be too deep
- a random forest may use too few trees
- an XGBoost model may learn too aggressively
10.2 Grid search
Grid search means I define a set of hyperparameter values to try, and the computer tests all combinations.
Example idea:
```python
param_grid = {
    'max_depth': [3, 5, 7],
    'min_samples_leaf': [1, 2, 5]
}
```

The model trains and validates every combination, and then I choose the best one according to a chosen metric.
10.3 Why cross-validation is usually paired with grid search
If I tune on a single split, I may overreact to randomness in that split.
Cross-validation gives a more stable performance estimate.
That is why GridSearchCV is so common.
10.4 Code skeleton
```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(random_state=42)

param_grid = {
    'max_depth': [3, 5, 7],
    'min_samples_leaf': [1, 2, 5]
}

grid = GridSearchCV(
    estimator=model,
    param_grid=param_grid,
    cv=5,
    scoring='f1'
)

grid.fit(X_train, y_train)
print(grid.best_params_)
```

Part 11 — Ensemble Learning
Ensemble learning means combining multiple models to create a stronger final predictor.
The intuition is simple:
a group of imperfect models can often outperform a single model if their errors are not identical.
This is sometimes called the "wisdom of the crowd" idea.
There are two major ensemble styles I should know here:
- bagging
- boosting
11.1 Bagging and Random Forest
The core idea
Bagging means building many models on different bootstrap samples and then aggregating their predictions.
A bootstrap sample is created by sampling rows with replacement from the original training set.
So each tree sees a slightly different version of the data.
Random Forest
Random Forest is the most famous bagging model for trees.
It adds two powerful randomness ideas:
- each tree gets a bootstrap sample of rows
- each split only considers a random subset of features
This reduces correlation between trees and lowers variance.
Why Random Forest works well
Random forests are powerful because they:
- reduce overfitting relative to a single tree
- handle nonlinear patterns
- capture interactions
- usually need less feature engineering than linear models
This connects directly to 6_employee_retention_tree_models_masterclass.
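A minimal sketch making the two randomness ideas explicit as settings, assuming `X_train` and `y_train` already exist:

```python
# Hedged sketch: the bagging and feature-subset ideas as explicit parameters.
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(
    n_estimators=300,     # number of trees in the ensemble
    bootstrap=True,       # each tree sees a bootstrap sample of rows
    max_features='sqrt',  # each split considers a random subset of features
    random_state=42,
)
forest.fit(X_train, y_train)
```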
11.2 Boosting
Boosting builds models sequentially, not in parallel.
Each new model tries to learn from the mistakes of the earlier models.
This often creates very strong predictive systems.
AdaBoost
AdaBoost increases the weight of observations that were misclassified earlier.
So later learners pay more attention to hard cases.
The model keeps adjusting itself toward what earlier trees got wrong.
Gradient Boosting
Gradient boosting is a more general and powerful idea.
Instead of simply reweighting observations, it fits new trees to the residual error left by earlier models.
So each new learner is trying to reduce the remaining mistakes.
XGBoost
XGBoost is a highly optimized implementation of gradient boosting.
It is extremely popular for tabular data because it often delivers strong predictive performance.
This connects directly to 7_socio_economic_household_classification_masterclass and also to the credit-risk extension thinking in 1_lending_club_credit_risk_masterclass.
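A hedged XGBoost sketch; the hyperparameter values are illustrative rather than tuned, and it assumes `xgboost` is installed and `X_train`, `y_train` exist:

```python
# Hedged sketch: a basic XGBoost classifier for tabular data.
from xgboost import XGBClassifier

xgb = XGBClassifier(
    n_estimators=500,    # number of boosting rounds
    learning_rate=0.05,  # how aggressively each new tree corrects earlier errors
    max_depth=4,         # depth of each weak learner
    random_state=42,
)
xgb.fit(X_train, y_train)
```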
The tradeoff
Boosting is often more accurate than simpler models, but it is also:
- harder to interpret
- more sensitive to tuning
- more computationally expensive
So stronger raw performance does not automatically mean it is always the best business choice.
Part 12 — Cross-Validation
Cross-validation is one of the most important validation ideas in ML.
12.1 Why a single train-test split is not always enough
Suppose I split the data once and get a performance score.
That score may depend partly on luck:
- maybe the test split was unusually easy
- maybe the validation split was unusually hard
Cross-validation reduces my dependence on one lucky or unlucky split.
12.2 K-fold cross-validation
In K-fold CV:
- I divide the training data into `K` folds
- I train on `K-1` folds
- I validate on the remaining fold
- I repeat until every fold has served as validation once
- I average the results
This gives a more stable estimate of generalization performance.
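A minimal K-fold sketch, assuming `X_train` and `y_train` exist; the model could be any classifier:

```python
# Hedged sketch: 5-fold cross-validation for a stable performance estimate.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X_train, y_train, cv=5, scoring='f1')
print(scores.mean(), scores.std())  # average score and its variability
```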
12.3 Why it matters in interviews
If I say “my model got 96% accuracy,” a strong interviewer may ask:
Was that from one split, or did you cross-validate it?
That question is really about robustness.
Part 13 — Model Saving and Pickling
Sometimes a model takes a long time to train.
That makes it inefficient to retrain from scratch every time I reopen the notebook.
13.1 What pickling does
Pickling saves a trained Python object into a binary file.
That object can later be loaded back into memory.
So instead of retraining the model, I can reuse the saved version.
13.2 Small code skeleton
```python
import pickle

with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)

with open('model.pkl', 'rb') as f:
    loaded_model = pickle.load(f)
```

13.3 Important caution
Pickling is convenient, but I must also keep:
- the preprocessing pipeline
- the training environment assumptions
- the feature order
- the versioning discipline
Saving only the model without the transformations can break deployment.
This connects strongly to the monitoring and pipeline-consistency lessons in 1_lending_club_credit_risk_masterclass.
Part 14 — Feature Importance
Even if a model is complex, I often still want to know which variables matter most.
14.1 Why feature importance helps
Feature importance can help me:
- understand the drivers of predictions
- explain the model to stakeholders
- spot suspicious signals
- reduce noisy variables
14.2 Tree-based importance
Tree-based models often produce feature importance scores based on how much each feature helps reduce impurity or improve splits.
Example:
- income
- tenure
- credit score
- monthly hours
- satisfaction level
These can be ranked by importance.
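A small sketch of that ranking, assuming a fitted tree-based model (such as the random forest sketch in Part 11) and a `feature_names` list matching the training columns; both are assumptions, not fixed objects in my notes:

```python
# Hedged sketch: rank tree-based feature importances.
import pandas as pd

importances = pd.Series(forest.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False))
```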
14.3 Why I should be careful
Feature importance is useful, but it is not the same as causation.
A variable can be important for prediction without being the true real-world cause of the outcome.
That warning is important across all my notes.
Part 15 — Model Evaluation for Classification
Because many of my projects are classification problems, this section matters a lot.
15.1 Confusion matrix
A confusion matrix helps me see:
- true positives
- true negatives
- false positives
- false negatives
This gives a richer view than just accuracy.
15.2 Precision
Precision asks:
Out of everything I predicted as positive, how many were actually positive?
This matters when false positives are costly.
15.3 Recall
Recall asks:
Out of all the actual positives, how many did I successfully capture?
This matters when false negatives are costly.
15.4 Accuracy
Accuracy asks:
Out of all predictions, how many were correct?
This can be misleading under class imbalance.
15.5 F1 score
The F1 score is the harmonic mean of precision and recall.
It is useful when I want a balance between the two.
15.6 ROC and AUC
The ROC curve tracks:
- true positive rate
- false positive rate
across many thresholds.
The AUC summarizes that ranking performance into a single number.
These connect strongly to both 2_regression_analysis_masterclass and 1_lending_club_credit_risk_masterclass.
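A minimal sketch computing these metrics, assuming a fitted classifier `model` and a held-out `X_test`, `y_test`:

```python
# Hedged sketch: the core classification metrics in scikit-learn.
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]  # probabilities needed for AUC

print(confusion_matrix(y_test, y_pred))
print('precision', precision_score(y_test, y_pred))
print('recall   ', recall_score(y_test, y_pred))
print('f1       ', f1_score(y_test, y_pred))
print('accuracy ', accuracy_score(y_test, y_pred))
print('roc auc  ', roc_auc_score(y_test, y_proba))
```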
Part 16 — Connecting the Algorithms to My Projects
This part helps me see machine learning not as isolated theory, but as the common language underneath my notes.
16.1 Lending Club credit risk
In 1_lending_club_credit_risk_masterclass:
- supervised learning is used for default prediction
- logistic regression provides an interpretable scorecard
- validation and monitoring show the full lifecycle
- later extensions bring in LGD, EAD, and expected loss
This is the most complete lifecycle note in my system.
16.2 Bank churn
In 5_bank_churn_neural_networks_masterclass:
- supervised classification is used to predict churn
- threshold tuning matters because recall is important
- the neural network is a more flexible model family than basic linear methods
16.3 Employee retention
In 6_employee_retention_tree_models_masterclass:
- decision trees and random forests become the main learning tools
- feature importance becomes very practical
- hyperparameter control matters for tree stability
16.4 Household classification
In 7_socio_economic_household_classification_masterclass:
- large tabular data is modeled with XGBoost
- boosting becomes the strongest practical ensemble idea
- careful feature handling and evaluation matter a lot
16.5 Twitter sentiment NLP
In 8_twitter_sentiment_nlp_masterclass:
- text becomes numeric features through vectorization
- classification logic still remains the same underneath
- evaluation metrics still depend on the confusion matrix logic
Part 17 — The Math Intuition I Should Know for Interviews
I do not need to derive everything from scratch, but I should understand the broad mathematical ideas.
17.1 Objective functions
Every ML model is trying to optimize something.
Examples:
- linear regression minimizes squared error
- logistic regression maximizes likelihood
- K-Means minimizes within-cluster distances
- decision trees minimize impurity measures like Gini impurity or entropy
- boosting iteratively reduces residual error
That means models differ not only in form, but also in what loss function they are trying to optimize.
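To make a few of these concrete, here are the standard textbook forms (the notation is generic, not tied to my project notes):

```latex
% linear regression: mean squared error over n observations
\text{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2

% logistic regression: maximizing likelihood = minimizing log loss
\text{LogLoss} = -\frac{1}{n}\sum_{i=1}^{n}\big[\, y_i \log \hat{p}_i + (1 - y_i)\log(1 - \hat{p}_i) \,\big]

% K-Means: within-cluster sum of squared distances to centroids \mu_k
\sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2

% decision trees: Gini impurity of a node with class proportions p_k
\text{Gini} = 1 - \sum_{k} p_k^2
```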
17.2 Bias-variance tradeoff
A model can fail in two main ways:
High bias
The model is too simple.
It misses real structure.
This is underfitting.
High variance
The model is too flexible.
It memorizes noise.
This is overfitting.
Many ML techniques are really attempts to balance bias and variance.
Examples:
- pruning trees
- using random forests
- regularization
- cross-validation
- hyperparameter tuning
17.3 Why ensembles often work
Bagging reduces variance.
Boosting reduces bias by correcting systematic error step by step.
That is why ensemble methods are so dominant in practice.
Part 18 — Python Toolkit I Should Know
The practical ML ecosystem usually includes:
- `pandas` for data manipulation
- `numpy` for numerical operations
- `matplotlib` and `seaborn` for visualization
- `scikit-learn` for modeling pipelines, preprocessing, CV, metrics, trees, clustering, and Naive Bayes
- `xgboost` for gradient boosting on tabular data
- `pickle` for model saving
This toolkit appears in different combinations across my project notes.
Part 19 — What I Should Say in Interviews
If I need to explain machine learning simply, I can say:
Machine learning is the process of using data to learn patterns that generalize to new observations. The real work is not just choosing a model. It is defining the target, engineering stable features, validating properly, controlling overfitting, choosing metrics that match the business problem, and making sure the final model is interpretable and robust enough for the use case.
If I need to explain supervised vs unsupervised learning:
Supervised learning uses labeled outcomes to learn prediction or classification. Unsupervised learning has no labels and instead discovers hidden structure, such as clusters or anomalies.
If I need to explain why tree ensembles are powerful:
A single tree is easy to understand but unstable. Ensemble methods improve that by combining many trees. Random forests reduce variance through bagging and random feature selection, while boosting learns sequentially from earlier errors and often achieves stronger predictive performance.
If I need to explain why metrics matter:
Accuracy alone can be misleading, especially under class imbalance. I choose metrics like precision, recall, F1, ROC-AUC, or KS based on the type of business error that matters most.
Part 20 — Final Summary I Want to Retain
What machine learning really is
Machine learning is a structured way of learning patterns from data and using those patterns on new observations.
The big categories
- supervised learning
- unsupervised learning
- reinforcement learning
- deep learning
The practical workflow
Plan
→ Analyze
→ Construct
→ Execute

The important model families in this note
- Naive Bayes
- K-Means
- decision trees
- random forests
- AdaBoost
- gradient boosting
- XGBoost
The practical discipline I should remember
- define the right target
- split data properly
- engineer features carefully
- choose metrics that match the business problem
- tune hyperparameters responsibly
- validate with cross-validation when possible
- do not confuse prediction with causation
- do not confuse importance with causation
- do not ignore ethics, bias, and explainability
The most important mental model
A machine learning project is not:
pick algorithm → get score

It is:
understand the problem
→ understand the data
→ build a defensible pipeline
→ validate honestly
→ communicate clearly
→ improve responsibly

That is the machine learning mindset I want to carry into all my notes and interviews.