Quant OS | Quantitative Finance Knowledge Graph

This note is my standalone master reference for end-to-end quantitative modeling work.
I want one document that helps me think clearly from the moment a business problem starts all the way to model selection, validation, calibration, monitoring, documentation, and governance.

This note is written to be both a training note for a complete beginner and a direct reference note for answering practical questions.

The Big Picture

A modeling problem usually moves through this flow:

Business problem
→ portfolio and target definition
→ data extraction and data quality checks
→ cleaning and preprocessing
→ EDA and feature engineering
→ baseline model
→ candidate models
→ validation and comparison
→ calibration and threshold setting
→ champion vs challenger selection
→ documentation and governance approval
→ deployment
→ monitoring, drift checks, retraining or recalibration

The key idea is that modeling is never just “fit an algorithm.”
A good model is:

solving the right business problem
trained on the right data
statistically sensible
well validated
explainable enough for the use case
documented properly
monitored after deployment

1) Main Portfolio Types in a Bank

At a high level, banks usually work with a few major portfolio families.
The exact naming differs by bank, but the logic is similar.

Retail portfolios

These are high-volume, smaller-ticket exposures.
Examples:

credit cards
personal loans
auto loans
mortgages / home loans
consumer durable loans

Typical data shape:

one row per customer, account, or loan
large number of records
standardized fields
bureau score, income, utilization, delinquencies, balance, tenure, repayment status

Common modeling use cases:

application scorecards
behavioral scorecards
collections models
churn / attrition models
cross-sell / propensity models

SME / small business portfolios

These sit between retail and corporate.
Data is usually semi-structured.

Typical fields:

business vintage
turnover / revenue
bank statement features
bureau and tradeline data
GST / tax variables
collateral details
repayment patterns

Common use cases:

underwriting models
early warning systems
fraud / AML alerting

Corporate / commercial portfolios

These are lower-volume but larger-ticket exposures.
Data is more relationship-driven and less standardized.

Typical fields:

audited financial statements
leverage ratios
liquidity ratios
interest coverage
sector
collateral
covenants
facility type
rating history

Common use cases:

internal rating models
default models
LGD / collateral haircut models
stress testing
limit setting

Treasury / market / trading-related portfolios

These are not classic loan books.
Data often looks like positions, exposures, counterparties, sensitivities, and time series.

Typical use cases:

VaR and stress models
counterparty credit risk
exposure simulation
liquidity analytics

AML / financial crime and sanctions monitoring

This is not always called a “portfolio” in the same sense, but operationally it behaves like one.

Typical data shape:

transaction-level records
customer KYC profile
geography
counterparty details
alert history
case outcomes
SAR / STR indicators

Common use cases:

transaction monitoring
anomaly detection
alert triage
sanctions screening

2) How the Data Usually Looks

The same business problem can appear at different data grains.
That is one of the first things I must identify.

Customer-level data

One row per customer.
Used when the prediction is about the customer.

Examples:

will this customer churn?
will this customer default in the next 12 months?

Common columns:

age
income
geography
tenure
relationship length
bureau variables
balances aggregated across products

Account-level or loan-level data

One row per facility or loan.
Used when the prediction is tied to a specific exposure.

Examples:

probability this loan defaults
loss given default on this facility

Common columns:

sanctioned amount
outstanding balance
interest rate
term
collateral type
DPD bucket
repayment history

Transaction-level data

One row per transaction.
Used in fraud, AML, customer behavior, and payment analytics.

Examples:

suspicious transaction monitoring
merchant fraud detection
spend categorization

Common columns:

transaction amount
time stamp
channel
merchant category
counterparty
location
payment type

Time-series or panel data

Same customer / account observed over time.
This matters because observations are not independent anymore.

Examples:

monthly delinquency behavior
macroeconomic series
daily trading P&L
repeated measurements on the same customer

Unstructured or semi-structured data

Examples:

comments
call notes
emails
document OCR outputs
PDF statements
sanctions names

This often requires NLP, parsing, fuzzy matching, or embedding-based pipelines.

3) End-to-End Workflow of a Typical Model

A model is built to answer a business question, not to “use ML.”

Step 1: Define the business problem

I must answer these first:

What exactly am I predicting?
At what level: customer, account, transaction, or portfolio?
What is the time horizon?
What decision will use this model?
What type of error is more costly?

Examples:

predict default within 12 months
detect suspicious transactions in near real time
estimate expected loss on a portfolio
predict customer attrition in the next quarter

Step 2: Define the target clearly

A weak target creates a weak model.

Questions I must settle:

What does positive class mean?
How is it coded: 1 or 0?
What is the observation window?
What is the performance window?
Is the target observed cleanly or only through a proxy?

Example:

good_bad = 1 for good, 0 for bad
or default_flag = 1 for default, 0 for no default

Step 3: Data sourcing and extraction

Typical sources:

core banking systems
bureau data
CRM
accounting systems
collections systems
payment systems
external files
manual data sources

At this stage I must care about:

data lineage
extraction logic
joins
duplicates from joins
missing keys
stale snapshots
time alignment

Step 4: Data quality checks

Before modeling, I need basic trust in the data.

Typical checks:

record counts
duplicate IDs
null profile
impossible values
inconsistent dates
future leakage
target leakage
category spelling inconsistencies

Step 5: Cleaning and preprocessing

This includes:

missing value handling
duplicate handling
outlier handling
type conversion
date parsing
category standardization
scaling or normalization if needed
encoding categorical variables

Step 6: EDA

EDA is not decoration. It tells me:

distributions
skewness
outliers
missingness pattern
class imbalance
relationships between predictors and target
correlation structure
target rate by segments

Step 7: Feature engineering

Examples:

debt-to-income ratio
utilization ratio
vintage or tenure buckets
delinquency trends
rolling averages
lag features
WoE bins for scorecards
interaction terms
text sentiment or topic features

Step 8: Train / validation / test split

Typical split logic:

train set for fitting
validation set for tuning and model comparison
test set for final honest evaluation

In time-sensitive problems, out-of-time splits are often better than random splits.

Step 9: Build a baseline first

A baseline is a simple benchmark.
Examples:

mean prediction for regression
majority class for classification
logistic regression before XGBoost

A sophisticated model is only useful if it beats a sensible baseline.

Step 10: Train candidate models

Typical candidates:

linear / logistic regression
decision tree
random forest
gradient boosting / XGBoost
ANN if appropriate

Step 11: Validate and compare

I should compare models on:

predictive performance
stability
calibration
interpretability
operational feasibility
fairness or policy constraints if relevant

Step 12: Choose champion and challengers

Champion model = the model selected for deployment or current production use
Challenger model = alternative model(s) tracked against the champion

The champion is not always the most complex model.
A slightly weaker but more stable and interpretable model may be better.

Step 13: Documentation and governance

Typical documentation includes:

business objective
scope and exclusions
target definition
data sources and lineage
sample construction
preprocessing logic
feature list
model methodology
assumptions
limitations
validation results
monitoring plan
override / fallback process

Step 14: Deployment

Deployment means the model is put into a real workflow.
This requires:

frozen preprocessing logic
versioned model artifacts
input validation
output format definition
decision thresholds
fallback rules

Step 15: Monitoring and review

After deployment, I must track:

drift
calibration deterioration
performance deterioration
stability of score distributions
data quality changes
business changes

4) Starting the Model: Data Engineering, Cleaning, and Remediation

This is where many real projects succeed or fail.

Missing values

The first thing is to understand why values are missing.

MCAR — Missing Completely At Random

Missingness has no systematic pattern.
Example: a sensor randomly failed.

Effect:

least dangerous type
simpler imputation is often acceptable

Typical remediation:

median / mean imputation for numeric variables
mode / “missing” category for categoricals
row deletion if missingness is tiny and random

MAR — Missing At Random

Missingness depends on observed variables.
Example: income is missing more often for younger customers.

Effect:

manageable if the observed drivers are known

Typical remediation:

group-wise imputation
model-based imputation
missing indicator variable

MNAR — Missing Not At Random

Missingness depends on the missing value itself or unobserved causes.
Example: high-risk customers deliberately do not disclose income.

Effect:

most dangerous type
missingness itself may carry signal

Typical remediation:

missing bucket as its own category
explicit missing flag
domain-driven treatment
sensitivity analysis

Duplicates

Duplicates can be:

true duplicate records
same customer with multiple accounts
one-to-many join duplication
repeated transactions that are actually valid

Typical remediation:

identify business key first
define deduplication rule clearly
never drop duplicates blindly
check before and after join counts

Outliers

Outliers are not automatically bad.
Sometimes they are genuine business reality.

I should distinguish between:

data errors: impossible ages, negative balances, broken timestamps
real extremes: very high income, very high utilization, unusually large transaction

Typical remediation:

fix obvious data errors if recoverable
cap / winsorize extreme tails
log transform skewed variables
robust scaling if algorithm is scale-sensitive
binning if using scorecards
leave them untouched in tree models if they are real and informative

Data leakage

This is one of the most common model failures.

Leakage happens when I use information that would not be known at prediction time.
Examples:

collection outcome used in underwriting model
future delinquency indicator included in features
vectorizer or scaler fit on full dataset before split

Remediation:

define the prediction date carefully
allow only variables available at that time
fit all preprocessing on train data only
apply to validation and test later

Categorical data issues

Common problems:

spelling variants
rare categories
too many categories
categories appearing only in test / production

Remediation:

standardize spelling
group rare categories into “Other”
one-hot encoding, target encoding, WoE encoding depending on use case
define unseen-category handling

Date and time issues

Common problems:

mixed formats
timezone mismatches
impossible sequences
leakage due to future dates

Useful features from dates:

tenure
age of account
days since last delinquency
month / quarter / seasonality
lagged behavior

Scaling and transformation

Not every model needs scaling.

Usually needed for:

linear / logistic models with regularization
distance-based methods like K-means
neural networks
Naive Bayes in some practical pipelines

Usually not necessary for:

decision trees
random forest
gradient boosting / XGBoost

Common transforms:

standardization: mean 0, variance 1
min-max scaling: 0 to 1 range
log transform for heavily skewed positive variables

5) Homoscedasticity, Heteroscedasticity, and Multicollinearity

These are classical modeling issues, especially important in regression.

Homoscedasticity

This means the variance of the residuals is roughly constant across the fitted values.

Plain language:

the spread of errors stays similar across the prediction range

Good case:

predictions for low and high values have similar error spread

Heteroscedasticity

This means the variance of residuals changes across the fitted values.

Plain language:

the error spread is not constant
often the error fan gets wider as the predicted level rises

Why it matters:

coefficients may still be unbiased in OLS under some conditions
but standard errors become unreliable
confidence intervals and p-values become less trustworthy

How to detect:

residual vs fitted plot shows funnel shape or pattern
formal tests like Breusch–Pagan or White test

Remediation:

log transform the dependent variable or skewed predictors
use weighted least squares if variance structure is known
use robust standard errors
consider a different model family better aligned to the data

Multicollinearity

This means predictors are highly correlated with each other.

Plain language:

the model struggles to separate their individual effects cleanly

Why it matters:

coefficients become unstable
signs can flip unexpectedly
p-values become unreliable
interpretation becomes difficult

How to detect:

correlation matrix
VIF (Variance Inflation Factor)

Typical VIF reading:

around 1: little concern
5+: moderate to high concern
10+: serious concern in many practical settings

Remediation:

drop one of the redundant variables
combine them into a ratio or composite feature
use PCA if interpretability can be sacrificed
use Ridge / Elastic Net when prediction matters more than coefficient interpretation

6) L1, L2, Lasso, Ridge, Elastic Net

These are regularization methods.
Regularization means adding a penalty term so the model does not rely too heavily on coefficients.

Why regularization is needed

Without regularization:

coefficients can become very large
the model may overfit noise
multicollinearity can destabilize coefficients

L1 penalty

L1 adds the sum of absolute values of coefficients to the loss function.

A simplified idea:

\text{Loss} = \text{Original Loss} + \lambda \sum |\beta_j|

Effect:

pushes some coefficients exactly to zero
performs feature selection

This is Lasso.

L2 penalty

L2 adds the sum of squared coefficients.

\text{Loss} = \text{Original Loss} + \lambda \sum \beta_j^2

Effect:

shrinks coefficients toward zero
usually keeps all variables
stabilizes estimation under multicollinearity

This is Ridge.

Elastic Net

Elastic Net combines L1 and L2.

Effect:

can zero out some features like Lasso
also stabilizes correlated features like Ridge

When to use them

many correlated predictors
high-dimensional data
overfitting risk
when I need more robust generalization

Intuition for the tuning parameter (\lambda)

small (\lambda): weak penalty, model behaves closer to unregularized form
large (\lambda): stronger shrinkage, simpler model

I usually tune this using cross-validation.

7) Linear Regression

What it is

Linear regression predicts a continuous outcome.

Example:

expected spend
loss amount
revenue
house price

Model form

y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \epsilon

(y) = dependent variable
(x_j) = predictors
(\beta_j) = coefficients
(\epsilon) = error term

What is the estimator here?

An estimator is the rule used to learn the unknown parameters from data.

For ordinary linear regression, the usual estimator is Ordinary Least Squares (OLS).

OLS chooses coefficients that minimize the sum of squared residuals:

\text{SSE} = \sum (y_i - \hat{y}_i)^2

So the estimator is:

“find the coefficient values that make the squared prediction errors as small as possible”

What residuals mean

Residual = actual value − predicted value

e_i = y_i - \hat{y}_i

Residuals tell me what the model failed to explain.

Main assumptions

linearity
independent observations
homoscedasticity
residual normality for classical inference
low multicollinearity if I want stable coefficient interpretation

What coefficients mean

If all else is fixed, a coefficient tells me the expected change in (y) for a one-unit change in that predictor.

Example:

if (\beta_1 = 2.5), then one-unit increase in (x_1) increases predicted (y) by 2.5 on average, holding other variables fixed

When to use linear regression

target is continuous
relationship is roughly linear or can be linearized
interpretability matters
baseline model is needed

Performance measures

Typical regression metrics:

MAE = average absolute error
MSE = average squared error
RMSE = square root of MSE, in the original unit of the target
R² = fraction of variance explained by the model
Adjusted R² = R² penalized for unnecessary variables

What to look for in diagnostics

residual plots
outliers and influential points
heteroscedasticity pattern
multicollinearity
train vs test performance gap

8) Logistic Regression

What it is

Logistic regression predicts the probability of a binary outcome.

Example:

default / non-default
churn / no churn
fraud / non-fraud
suspicious / non-suspicious

Why not use linear regression for a binary target?

Because linear regression can predict values below 0 or above 1, and it does not model class probabilities well.

Logistic model form

Logistic regression models the log-odds of the positive class as a linear function:

\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k

where (p) is the probability of the positive class.

What is the estimator here?

The estimator is usually Maximum Likelihood Estimation (MLE).

MLE chooses coefficients that make the observed outcomes most probable under the model.

In plain language:

OLS minimizes squared errors
MLE maximizes the likelihood of seeing the class labels I actually observed

How coefficients are interpreted

A coefficient changes the log-odds.
Exponentiating a coefficient gives the odds ratio.

positive coefficient → increases odds of positive class
negative coefficient → decreases odds of positive class

Assumptions and practical conditions

correct target coding
independent observations
linearity in the logit, not necessarily in raw probability
limited multicollinearity
enough events in the positive class

When to use logistic regression

binary target
need interpretable baseline
probability output is important
scorecard / policy environment values transparency

Strengths

interpretable
stable baseline
produces probabilities
works well for many tabular problems

Weaknesses

struggles with strongly nonlinear boundaries unless features are engineered
can be affected by multicollinearity and class imbalance
may underperform trees or boosting on complex patterns

9) Bayesian Optimization

What it is

Bayesian optimization is a smart method for tuning hyperparameters when each model training run is expensive.

Where it fits

It fits after I have chosen a model family and before final selection.

Workflow:

Choose model family
→ define hyperparameter search space
→ run optimization method
→ compare best tuned candidate with others

Why not just grid search everything?

Grid search tries many combinations blindly.
Bayesian optimization tries to learn from earlier results and choose promising next combinations.

How it works conceptually

treat validation score as an unknown function of hyperparameters
build a surrogate model of that function
use an acquisition rule to pick the next hyperparameters to try
iterate until budget is exhausted

When it is useful

XGBoost tuning
neural network tuning
expensive pipelines
large hyperparameter spaces

When it is not necessary

very simple model
small search space
quick baseline work

10) Decision Trees

What they are

Decision trees split data recursively into smaller regions.
Each split is based on a feature and a threshold.

Example:

if utilization > 80%, go left
if bureau score < 650, go right

At the end, leaves give predictions.

How they work

At each node, the algorithm chooses the split that gives the biggest purity gain or error reduction.

For classification, common split criteria:

Gini impurity
entropy / information gain

For regression, common split criterion:

reduction in squared error / variance

What is the estimator here?

The estimator is the recursive partitioning rule that chooses splits to minimize impurity or prediction error at each stage.

Important parameters

max_depth
min_samples_split
min_samples_leaf
max_features
pruning-related settings

Strengths

easy to explain
handles nonlinear relationships
handles interactions naturally
little preprocessing needed
insensitive to monotonic transformations

Weaknesses

unstable: small data changes can change the tree a lot
high variance
overfits easily if unrestricted

Assumptions

Trees are far less assumption-heavy than linear models.
They do not require linearity, normality, or homoscedasticity.

11) Ensembles

An ensemble combines multiple models to get a stronger final model.

Why ensembles work

A single model may be noisy, unstable, or too weak.
Combining many models can reduce variance, reduce bias, or both.

Two major ensemble families:

Bagging: many models in parallel, then aggregate
Boosting: models built sequentially, each focusing on previous errors

12) Bagging and Random Forest

This is one of the most important foundations for tabular ML.

Bagging

Bagging = Bootstrap Aggregating.

How it works:

draw many bootstrap samples from the training data
train one model on each sample
aggregate predictions across models

Bootstrap sample means:

sample rows with replacement
some rows appear multiple times
some rows are left out in a given sample

Why bagging helps

It mainly reduces variance.
A single deep tree is unstable.
Averaging many deep trees makes predictions more stable.

Random Forest

Random Forest is bagging applied to decision trees, with an extra source of randomness.

How it works:

build many trees on bootstrap samples
at each split, allow each tree to consider only a random subset of features
aggregate all trees
- classification: majority vote or average probability
- regression: average prediction

Why the random subset of features matters

If all trees always see all features, strong predictors dominate every tree, making trees too similar.
Random feature subsampling decorrelates trees.
That improves the ensemble.

Main parameters

n_estimators
max_depth
min_samples_leaf
min_samples_split
max_features
class_weight for imbalanced problems

Out-of-bag (OOB) idea

Because each bootstrap sample leaves out some observations, those omitted rows can act like a built-in validation set.
That gives OOB error, a useful internal estimate.

Strengths

strong default model for tabular data
handles nonlinearities and interactions
robust to outliers
less preprocessing needed
feature importance available

Weaknesses

less interpretable than linear / single-tree models
can become heavy with many trees
not always best calibrated out of the box
can struggle if the positive class is extremely rare and untreated

When to use Random Forest

tabular structured data
complex nonlinear patterns
many interactions
need a strong, stable benchmark

How to think about it in one sentence

A random forest is a crowd of decorrelated decision trees trained on bootstrap samples, and the final answer is the crowd’s average or vote.

13) Boosting, Gradient Boosting, and XGBoost

Core idea of boosting

Boosting builds models sequentially.
Each new model tries to improve what the previous models got wrong.

Unlike bagging, boosting is not mainly about variance reduction.
It is largely about reducing bias while still controlling variance through regularization.

AdaBoost intuition

start with equal weights on all observations
fit a weak learner
increase weight on wrongly predicted cases
next learner focuses more on difficult observations

Gradient Boosting intuition

Gradient boosting fits new trees to the residual errors or gradients of the current model.

Idea:

model 1 gives a first prediction
compute residuals / gradient information
model 2 learns the remaining error
continue adding small corrective trees

XGBoost

XGBoost is an optimized, regularized, scalable form of gradient boosting.

Why it became so popular:

strong performance on structured tabular data
built-in regularization
efficient computation
handling of missing values
flexible objective functions

Main parameters

n_estimators
learning_rate
max_depth
min_child_weight
subsample
colsample_bytree
gamma
reg_alpha (L1)
reg_lambda (L2)

Important intuition

more trees is not always better
lower learning rate usually means slower but safer learning
deep trees can overfit
subsampling and column subsampling add robustness

Strengths

excellent predictive power
captures nonlinearities and interactions
handles messy tabular data well

Weaknesses

more tuning burden
less transparent than linear models
can overfit if tuned poorly
probability calibration may need extra work

14) PCA and K-Means: Where They Fit

PCA — Principal Component Analysis

PCA is a dimensionality reduction technique.
It transforms correlated variables into a smaller set of orthogonal components.

Why PCA is used

reduce dimensionality
reduce multicollinearity
compress information
make high-dimensional data more manageable

What PCA gives up

Interpretability.
Principal components are combinations of original variables, so they are often harder to explain.

Where PCA fits in workflow

Usually after cleaning and scaling, before modeling.
Most useful when:

many correlated numeric variables
high-dimensional data
prediction matters more than direct interpretability

K-Means clustering

K-Means is an unsupervised algorithm that groups observations into (K) clusters.

How it works:

choose (K)
initialize centroids
assign points to nearest centroid
recompute centroids
repeat until convergence

Where K-Means fits

Not every supervised workflow needs it.
But it is useful for:

customer segmentation
transaction pattern grouping
anomaly investigation support
creating cluster-based features for later supervised models

Important conditions

scale matters a lot
distance drives grouping
outliers can distort centroids

How to choose (K)

elbow method
silhouette score
business interpretability

15) ANN — Brief Foundation

Artificial Neural Networks are flexible nonlinear function approximators.

Basic structure:

input layer
one or more hidden layers
output layer

Each neuron computes a weighted sum, adds a bias, then passes through an activation function.

Why ANNs work:

repeated nonlinear transformations can learn complex patterns

Common strengths:

strong on large, complex datasets
useful for text, image, sequence, and some tabular problems

Common weaknesses:

more data hungry
more tuning burden
less interpretable
can overfit
preprocessing is more important

For many tabular business problems, tree ensembles often outperform or match ANNs with less complexity.

16) Class Imbalance, SMOTE, SHAP, and Other Practical Issues

Class imbalance

This happens when one class is much rarer than the other.

Examples:

fraud is rare
default may be rare
SAR outcomes are rare

Why it matters:

A model can show high accuracy by predicting the majority class most of the time.
That does not mean the model is useful.

Remediation options

1. Better metric choice

Accuracy alone becomes misleading.
Use:

precision
recall
F1
PR-AUC in very imbalanced settings
KS / ROC-AUC depending on context

2. Resampling

undersampling: reduce majority class
oversampling: duplicate minority class
SMOTE: create synthetic minority examples

What SMOTE does

SMOTE = Synthetic Minority Over-sampling Technique.
It generates synthetic minority-class points between existing minority examples.

Good use:

moderate tabular imbalance
training set only

Important caution:

apply only on training data, never before splitting
may amplify noise if minority class is messy

3. Class weights

Tell the algorithm to care more about minority-class errors.
Useful in logistic regression, trees, random forest, XGBoost.

4. Threshold adjustment

Even with the same model probabilities, changing the decision threshold can shift precision and recall.

SHAP

SHAP explains model predictions using Shapley-value logic from cooperative game theory.

What it tells me:

global feature importance
local explanation for one prediction
direction of feature impact

Why it fits in workflow:

after model training
during interpretation and governance
useful for complex models like XGBoost

17) Model Selection: How to Choose the Final Model

Model selection is not “pick the highest score blindly.”

I should evaluate on multiple dimensions.

1. Predictive performance

discrimination
error rate
ranking power
stability on validation / test

2. Calibration

Are predicted probabilities close to actual realized rates?

3. Interpretability

linear / logistic models are easier to explain
trees and boosting may need SHAP or partial dependence tools

4. Robustness and stability

does performance collapse on OOT or shifted data?
is the model sensitive to small changes?

5. Operational feasibility

can it run fast enough?
can required inputs be sourced reliably?
can it be monitored?

6. Governance and policy fit

can documentation be written clearly?
is the model acceptable for the use case?
does the bank need a transparent model for approval or policy reasons?

A final model is often chosen because it balances all these things, not because it wins one metric by a tiny margin.

18) Validation Suite: The Most Important Clarification

This section is the core of classification evaluation.

First big distinction

There are two broad families of classification metrics:

A. Threshold-dependent metrics

These depend on choosing a probability cut-off such as 0.50 or 0.30.
Examples:

confusion matrix
accuracy
precision
recall
F1
TPR
FPR

B. Threshold-independent or ranking-based metrics

These evaluate how well the model ranks positives above negatives across many thresholds.
Examples:

AUROC
Gini
KS

This distinction removes a lot of confusion.

Confusion matrix

For a binary classifier, after picking a threshold:

TP: actual positive, predicted positive
TN: actual negative, predicted negative
FP: actual negative, predicted positive
FN: actual positive, predicted negative

This matrix depends entirely on the chosen threshold.

Accuracy

\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}

Meaning:

what fraction of all predictions were correct?

Problem:

can be misleading under class imbalance

Recall / Sensitivity / TPR

\text{Recall} = \text{TPR} = \frac{TP}{TP + FN}

Meaning:

out of all actual positives, how many did I catch?

Use case intuition:

important when missing a positive is costly
default detection, fraud detection, AML triage often care a lot about recall

Precision

\text{Precision} = \frac{TP}{TP + FP}

Meaning:

out of all predicted positives, how many were truly positive?

Use case intuition:

important when false alarms are expensive
investigator workload, manual review cost, customer friction

F1 score

F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}

Meaning:

harmonic mean of precision and recall
useful when I want balance between catching positives and avoiding false alarms

FPR

\text{FPR} = \frac{FP}{FP + TN}

Meaning:

out of all actual negatives, how many did I wrongly flag as positive?

Specificity / TNR

\text{Specificity} = \text{TNR} = \frac{TN}{TN + FP} = 1 - \text{FPR}

Meaning:

how well do I correctly reject negatives?

19) ROC, AUROC, Gini, and KS

These metrics focus on discrimination or rank ordering.

ROC curve

ROC = Receiver Operating Characteristic curve.
It plots:

TPR / Recall on the y-axis
FPR on the x-axis

Each point comes from a different threshold.

Meaning:

how does recall improve as I tolerate more false positives?

AUROC

AUROC = Area Under the ROC Curve.

Interpretation:

probability that a randomly chosen positive gets a higher score than a randomly chosen negative

Range intuition:

0.5 = random
1.0 = perfect
below 0.5 = worse than random, often indicating reversed ranking or labeling issue

Gini

Gini is a rescaling of AUROC:

\text{Gini} = 2 \times \text{AUROC} - 1

So:

AUROC 0.80 corresponds to Gini 0.60

Why Gini is used:

common in credit scoring and scorecard environments

KS — Kolmogorov-Smirnov statistic

KS measures the maximum separation between cumulative positive and cumulative negative distributions across score cut-offs.

Plain language:

if I sort cases by score, KS asks: where is the biggest gap between how quickly positives accumulate and how quickly negatives accumulate?

Why it matters:

widely used in credit risk
gives a sense of separation power at the best discrimination point

The clean memory trick

Confusion matrix metrics ask: “At my chosen cut-off, how many did I catch or miss?”
ROC / AUROC / Gini / KS ask: “How well does the model rank risky cases above safe cases overall?”

That is the central distinction.

20) Model Calibration

Discrimination and calibration are not the same thing.

Discrimination

Can the model rank positives above negatives?

Calibration

Are the predicted probabilities numerically close to actual event rates?

Example:

if I score 100 customers at PD = 0.20, then roughly 20 of them should default over the defined horizon if the model is well calibrated

Why calibration matters

pricing
expected loss
capital estimation
provisioning
cut-off setting
portfolio forecasting

How to check calibration

calibration plot / reliability curve
compare predicted vs observed event rate by score band or decile
Brier score

How to improve calibration

Platt scaling
isotonic regression
recalibration on newer data
better sample design
probability mapping from raw score to observed rate

A model can have good AUROC but poor calibration.
That means it ranks well, but its raw probabilities are numerically off.

21) Champion and Challenger Models

Champion

The champion is the model currently selected for deployment or active use.

Challenger

A challenger is an alternative model tested against the champion.

Why challenger frameworks matter

prevent complacency
allow structured upgrades
help compare stability over time
useful in governance and model review

How selection usually works

Compare champion and challengers on:

performance
stability
calibration
interpretability
monitoring ease
operational complexity
governance fit

A challenger can win on raw performance but still lose overall if it is unstable, poorly calibrated, or too complex to govern.

22) Monitoring: PSI and CSI

Why monitoring exists

A good model at development time can become less useful later because:

customer mix changes
economic environment changes
policy changes
data pipelines change
feature relationships shift

PSI — Population Stability Index

PSI compares the distribution of a score or variable between a reference sample and a current sample.

Most commonly:

training / development distribution vs current monitoring distribution

Formula by bin:

\text{PSI}_i = (A_i - E_i) \ln\left(\frac{A_i}{E_i}\right)

where:

(E_i) = expected proportion in bin (i)
(A_i) = actual proportion in bin (i)

Total PSI is the sum across bins.

What PSI tells me

whether the score distribution or variable distribution has shifted materially

Typical interpretation bands

Common practical convention:

< 0.10: small shift
0.10 to 0.25: moderate shift, investigate
0.25: major shift, review seriously

These are practical rules of thumb, not universal laws.

CSI — Characteristic Stability Index

CSI uses the same idea as PSI but is applied at the feature level.

Meaning:

PSI often monitors score drift
CSI often monitors input variable drift

Important monitoring principle

Bins must be fixed from the reference population.
I should not redefine bins every month based on current data.
Otherwise I hide the drift.

What PSI and CSI do not tell me

They show distribution drift, not necessarily performance drift.
A shifted population may still be scored correctly.
So I should combine drift metrics with:

actual performance monitoring
calibration monitoring
backtesting if outcomes are available

23) What Else Belongs in Validation and Monitoring

A full validation and monitoring view usually includes:

Development validation

data quality checks
methodology review
assumptions review
code review
performance metrics
sensitivity analysis
limitation analysis

Ongoing monitoring

input drift
score drift
bad rate by band
calibration by band
threshold hit rates
reject / approve rates if applicable
override rates
processing failures

Recalibration vs redevelopment

Recalibration: same core model, adjust mapping / intercept / scale / threshold
Redevelopment: rebuild the model because relationships have changed too much

24) Model Governance, Documentation, and Defense

This is the control layer around the model.

Model governance means

The organization has a formal process for:

model approval
documentation
validation
implementation control
monitoring
change management
periodic review

Typical documentation set

Model development document

Includes:

objective
use case
target definition
sample design
methodology
feature treatment
assumptions
limitations
performance results

Validation document

Includes:

independent challenge or review
replication checks
conceptual soundness
outcome analysis
sensitivity checks
monitoring recommendations

Implementation document

Includes:

code version
pipeline logic
variable mapping
scoring steps
input / output specification

Monitoring document

Includes:

frequency
tracked metrics
thresholds and alerts
escalation process

Model defense means

Being able to explain:

what the model does
why it was built this way
what assumptions it makes
where it works well
where it is limited
how it is monitored

A defensible model is not one that claims to be perfect.
It is one whose behavior, limitations, and controls are clearly understood.

25) A Clean End-to-End Decision Map

When facing a new business problem, I can think in this order:

A. Business and target

what is the decision?
what is the unit of prediction?
what does the target mean?
what horizon matters?

B. Data readiness

do I trust the data?
is there leakage?
what are missingness and duplicates like?
what is the level of class imbalance?

C. Preprocessing and feature design

missing value strategy
outlier strategy
encoding
scaling if needed
feature engineering

D. Baseline and candidate models

linear / logistic baseline
tree model
ensemble model
specialized model if needed

E. Evaluation and selection

threshold metrics
ranking metrics
calibration
business cost of errors
interpretability and governance

F. Deployment and monitoring

champion / challenger
fixed preprocessing
drift monitoring
performance monitoring
recalibration or rebuild triggers

That is the real modeling workflow.

26) Compact Memory Aids

OLS vs MLE

OLS: choose coefficients that minimize squared residuals
MLE: choose coefficients that make observed class labels most likely

Linear vs Logistic

Linear regression: predicts a continuous value
Logistic regression: predicts a probability for a binary class

Bagging vs Boosting

Bagging: many models in parallel, average them, mainly reduces variance
Boosting: models built sequentially, each fixes earlier errors, mainly reduces bias and improves fit

Precision vs Recall

Precision: of the cases I flagged, how many were truly positive?
Recall: of the truly positive cases, how many did I catch?

Confusion metrics vs AUROC/Gini/KS

Confusion metrics: depend on one threshold
AUROC/Gini/KS: assess rank ordering across thresholds

Discrimination vs Calibration

Discrimination: can the model rank-order risk?
Calibration: are the predicted probabilities numerically accurate?

PSI vs CSI

PSI: score or overall distribution drift
CSI: feature-level drift

27) Final Closing Summary

A complete quant modeling workflow is not just algorithm knowledge.
It is the combination of:

business understanding
target definition
data quality discipline
preprocessing decisions
statistical awareness
model choice
validation logic
calibration
monitoring
documentation and governance

The strongest way to think is:

define the business decision
define the target clearly
make the data trustworthy
start simple
compare models properly
choose based on performance and usability
calibrate if probabilities matter
monitor after deployment
document everything well

That is what turns modeling from code into a professional workflow.

Quant Modeling Workflow Master Reference — End-to-End Data, Models, Validation, Monitoring, and Governance

The Big Picture

1) Main Portfolio Types in a Bank

Retail portfolios

SME / small business portfolios

Corporate / commercial portfolios

Treasury / market / trading-related portfolios

AML / financial crime and sanctions monitoring

2) How the Data Usually Looks

Customer-level data

Account-level or loan-level data

Transaction-level data

Time-series or panel data

Unstructured or semi-structured data

3) End-to-End Workflow of a Typical Model

Step 1: Define the business problem

Step 2: Define the target clearly

Step 3: Data sourcing and extraction

Step 4: Data quality checks

Step 5: Cleaning and preprocessing

Step 6: EDA

Step 7: Feature engineering

Step 8: Train / validation / test split

Step 9: Build a baseline first

Step 10: Train candidate models

Step 11: Validate and compare

Step 12: Choose champion and challengers

Step 13: Documentation and governance

Step 14: Deployment

Step 15: Monitoring and review

4) Starting the Model: Data Engineering, Cleaning, and Remediation

Missing values

MCAR — Missing Completely At Random

MAR — Missing At Random

MNAR — Missing Not At Random

Duplicates

Outliers

Data leakage

Categorical data issues

Date and time issues

Scaling and transformation

5) Homoscedasticity, Heteroscedasticity, and Multicollinearity

Homoscedasticity

Heteroscedasticity

Multicollinearity

6) L1, L2, Lasso, Ridge, Elastic Net

Why regularization is needed

L1 penalty

L2 penalty

Elastic Net

When to use them

Intuition for the tuning parameter (\lambda)

7) Linear Regression

What it is

Model form

What is the estimator here?

What residuals mean

Main assumptions

What coefficients mean

When to use linear regression

Performance measures

What to look for in diagnostics

8) Logistic Regression

What it is

Why not use linear regression for a binary target?

Logistic model form

What is the estimator here?

How coefficients are interpreted

Assumptions and practical conditions

When to use logistic regression

Strengths

Weaknesses

9) Bayesian Optimization

What it is

Where it fits

Why not just grid search everything?

How it works conceptually

When it is useful

When it is not necessary

10) Decision Trees