End-To-End
Pipelines
A pipeline is a sequence of data processing steps that are chained together. In machine learning, pipelines are crucial because:
- Data needs multiple transformations before it becomes usable (e.g., missing value imputation, scaling, feature engineering).
- By chaining transformations and model training into a pipeline, you ensure consistency: the same steps are applied to training data and to any future data (see the sketch after this list).
- Pipelines help reduce code duplication, improve readability, and simplify cross-validation.
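As a concrete illustration of chaining transformations and model training, here is a minimal scikit-learn `Pipeline` sketch; the specific steps (median imputation, standard scaling, linear regression) are illustrative choices, not steps prescribed by these notes:

```python
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# Chain preprocessing and model training so the exact same steps are
# applied to the training data and to any future data.
num_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # fill missing values
    ("scale", StandardScaler()),                   # standardize features
    ("model", LinearRegression()),                 # train the estimator
])

# Usage (X_train, y_train, X_new are assumed to exist):
# num_pipeline.fit(X_train, y_train)
# predictions = num_pipeline.predict(X_new)
```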
Key Concepts:
- Asynchronous components: Each component in the pipeline processes data independently and at different times.
- Data stores as interfaces: Components communicate by reading/writing to/from data stores. This separation enhances modularity.
- Resilience: If a component fails, others can still work using cached or last-available data.
- Monitoring is essential: Without it, failures may silently degrade performance as stale or incorrect data propagates.
Framing the Problem
Before you start training a model, you need to define the problem clearly. This is step zero in any machine learning project. Misframing the problem leads to building the wrong solution.
❓ Ask These Questions:
1. What type of learning problem is it?
   - Supervised Learning → you have labeled data (input-output pairs).
     Examples: predicting house prices (regression), email spam detection (classification).
   - Unsupervised Learning → you have no labels, only inputs.
     Examples: customer segmentation (clustering), dimensionality reduction for visualization.
   - Reinforcement Learning → an agent learns by interacting with an environment and receiving rewards.
     Examples: game playing (e.g., AlphaGo), robotics control.
2. What type of supervised task is it?
   - Classification → predicting discrete categories (e.g., spam or not spam).
   - Regression → predicting continuous values (e.g., predicting temperature).
   - Multiclass → one label from many (e.g., digit recognition: 0–9).
   - Multilabel → multiple labels can be true at once (e.g., a movie can be both “comedy” and “drama”).
3. What kind of learning setup fits best?
   - Batch Learning (Offline Learning)
     - Train the model on the full dataset at once.
     - The model is static after training unless you retrain it from scratch.
     - Suitable for large, stable datasets.
   - Online Learning (see the sketch after this list)
     - The model is updated incrementally as new data comes in.
     - Ideal for streaming data or situations where data evolves over time.
     - Useful when computational resources are limited or retraining from scratch is costly.
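A minimal sketch of online learning with scikit-learn’s `SGDRegressor.partial_fit()`; the mini-batches below are synthetic data invented purely for illustration:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(42)
model = SGDRegressor(random_state=42)

# Online learning: update the model incrementally, one mini-batch at a time,
# instead of retraining from scratch on the full dataset.
for _ in range(10):
    X_batch = rng.random((32, 3))                    # pretend this is streaming data
    y_batch = X_batch @ np.array([1.0, 2.0, 3.0])    # synthetic target
    model.partial_fit(X_batch, y_batch)
```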
By answering these questions, you define the type of solution you need, which in turn influences:
- Your data collection strategy
- The algorithms you’ll consider
- How you’ll evaluate performance
- The infrastructure for deployment and retraining
📘 Notation
This section defines several common machine learning notations used throughout the book.
- Let $m$ be the number of instances in the dataset.
  - Example: if the validation set has 2,000 districts, then $m = 2{,}000$.
- Let $\mathbf{x}^{(i)}$ be the feature vector of the $i$th instance (excluding the label).
- Let $y^{(i)}$ be the label (target value) of the $i$th instance.
  - Example: for the housing data, $y^{(i)}$ is the $i$th district’s median house value.
- Let $\mathbf{X}$ be the feature matrix, containing all feature vectors in the dataset:
  - Each row is an instance: the $i$th row equals the transpose of $\mathbf{x}^{(i)}$, written $(\mathbf{x}^{(i)})^\top$.
- Let $h$ be the prediction function (also called a hypothesis).
  - Given an instance’s feature vector $\mathbf{x}^{(i)}$, the system predicts $\hat{y}^{(i)} = h(\mathbf{x}^{(i)})$.
  - The prediction error for that instance is $\hat{y}^{(i)} - y^{(i)}$.
- Let $\mathrm{RMSE}(\mathbf{X}, h)$ be the Root Mean Squared Error cost function evaluated on dataset $\mathbf{X}$ using hypothesis $h$:

  $$\mathrm{RMSE}(\mathbf{X}, h) = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(h(\mathbf{x}^{(i)}) - y^{(i)}\right)^2}$$

✅ Typography Conventions
- Scalars → lowercase italic: $m$, $y^{(i)}$, $h$
- Vectors → lowercase bold: $\mathbf{x}^{(i)}$
- Matrices → uppercase bold: $\mathbf{X}$
📏 Select a Performance Measure: RMSE vs MAE
To evaluate how well a machine learning model performs, we need a way to measure the distance between its predictions and the actual target values. Two common metrics for regression tasks are:
✅ RMSE (Root Mean Squared Error)
- Measures the Euclidean distance between predicted and true values.
- It uses the $\ell_2$ norm:

  $$\mathrm{RMSE}(\mathbf{X}, h) = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(h(\mathbf{x}^{(i)}) - y^{(i)}\right)^2}$$
- Sensitive to outliers: squaring the errors makes large errors count more.
- Preferred when outliers are rare and follow a normal distribution.
✅ MAE (Mean Absolute Error)
- Measures the Manhattan distance between predictions and targets.
- It uses the $\ell_1$ norm:

  $$\mathrm{MAE}(\mathbf{X}, h) = \frac{1}{m}\sum_{i=1}^{m}\left|h(\mathbf{x}^{(i)}) - y^{(i)}\right|$$
- More robust to outliers than RMSE. (It is less affected by extreme values or rare, unusually large/small errors)
- Gives equal weight to all errors, regardless of magnitude.
🔢 General Form: The $\ell_k$ Norm
- For a vector $\mathbf{v}$ with $n$ elements:

  $$\|\mathbf{v}\|_k = \left(|v_1|^k + |v_2|^k + \cdots + |v_n|^k\right)^{1/k}$$

- Special cases:
  - $\ell_0$: number of nonzero elements
  - $\ell_1$: Manhattan norm (used by MAE)
  - $\ell_2$: Euclidean norm (used by RMSE)
  - $\ell_\infty$: maximum absolute value in the vector
🧠 Insight
- Higher $k$ values focus more on large errors and give less weight to small ones.
- Use RMSE when large errors should be penalized more heavily.
- Use MAE when all errors should be treated equally.
📌 Tip: Always plot your error distribution. If you have many outliers or a skewed distribution, MAE might give you a more stable picture.
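For reference, a quick sketch of computing both metrics with scikit-learn (the arrays below are made-up numbers, not data from these notes):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([3.0, 5.0, 2.5, 7.0])   # made-up targets
y_pred = np.array([2.5, 5.0, 4.0, 8.0])   # made-up predictions

mae = mean_absolute_error(y_true, y_pred)             # L1-based
rmse = np.sqrt(mean_squared_error(y_true, y_pred))    # L2-based: square root of MSE
print(f"MAE={mae:.3f}, RMSE={rmse:.3f}")
```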
Check the Assumptions
💡 What Does “More Robust to Outliers” Mean?
When we say a metric is more robust to outliers, we mean:
- It is less affected by extreme values or rare, unusually large/small errors.
- These outliers don’t overly influence the metric’s overall value.
For example:
- RMSE squares the errors, so large errors have a disproportionately big impact.
- MAE treats all errors equally, so it’s more stable when there are extreme predictions.
Use MAE if:
- Your dataset has noise or unpredictable spikes.
- You want to avoid letting a few big errors dominate the evaluation.
📒 Notebooks and Google Colab
What Is a Notebook?
A Jupyter Notebook (used in Google Colab) is an interactive coding environment that combines:
- Code
- Explanatory text (Markdown)
- Outputs like graphs, tables, etc.
You can run cells one at a time, view results immediately, and mix code with documentation, making it great for data exploration, ML experiments, and teaching.
🔗 Google Colab Shortcuts
- Colab is a free, cloud-based Jupyter notebook environment provided by Google.
- It lets you run Python code without any setup on your local machine.
- You can access many useful keyboard shortcuts.
Here’s a full list of Colab/Jupyter shortcuts:
👉 Jupyter Notebook Shortcuts – Towards Data Science
To view shortcuts inside Colab:
- Click on the “Tools” menu
- Select “Keyboard shortcuts”
Or press: Cmd/Ctrl + M + H
💾 Accessing Files in Google Colab
When you run a Colab notebook:
- Once you mount your Google Drive, it appears under:
  `/content/drive/MyDrive`
- To save a model or data file to your Drive:
  `!cp /content/my_model_file /content/drive/MyDrive`
- The `!` at the start tells Colab to run a Linux shell command, not Python code; `cp` is the Linux command for copying files.
🛠️ Colab notebooks run on Linux (Ubuntu) virtual machines, so basic Linux commands work!
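For completeness, a minimal sketch of mounting Google Drive in a Colab cell (this uses Colab’s standard `google.colab.drive` helper; the mount point is the conventional one):

```python
from google.colab import drive

# Prompts for authorization, then makes your Drive available under
# /content/drive (your files live in /content/drive/MyDrive).
drive.mount('/content/drive')
```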
📦 About Imports
In regular Python scripts:
- PEP 8 (Python’s style guide) recommends putting all `import` statements at the top of the file.
But in notebooks:
- It’s common to import packages where they’re needed, inside individual cells.
- This makes your code more modular and easier to follow, especially for tutorials or demos.
Example:
import pandas as pd # common import for data manipulation
This flexibility is part of why notebooks are so popular for experimentation and teaching.
📊 Understanding Percentiles
Percentiles help describe the distribution of a dataset by indicating how data is spread across values.
🔢 Definitions
-
A percentile is the value below which a given percentage of data points fall.
For example:
- The 25th percentile = 25% of the values are less than or equal to this number.
- The 50th percentile = median (middle value)
- The 75th percentile = 75% of values fall below this value.
📘 Terminology
Term | Meaning |
---|---|
25th percentile | First quartile (Q1) |
50th percentile | Median (Second quartile, Q2) |
75th percentile | Third quartile (Q3) |
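A quick sketch of computing these quartiles with NumPy (the values are made up for illustration):

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])    # made-up data
q1, median, q3 = np.percentile(values, [25, 50, 75])
print(q1, median, q3)   # 3.25 5.5 7.75 with NumPy's default interpolation
```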
Why the data has been transformed:
- The dataset has been preprocessed:
  - The `median_income` values are scaled to range between 0.5 and 15, but not exactly:
    - Minimum is 0.4999
    - Maximum is 15.0001
  - These are floating-point approximations and represent:
    - $5,000 → stored as 0.5
    - $30,000 → stored as 3.0
    - $150,000 → capped and stored as 15.0

Transformation | Explanation |
---|---|
Scaled | All income values are divided by 10,000, so income is now expressed in tens of thousands of dollars |
Capped (clipped) | Incomes below ~$5,000 were raised to ~0.5, and incomes above $150,000 were lowered to 15 (15 × 10k) |
📌 What Is Capping?
Capping means limiting values in a dataset so they do not go beyond a certain range.
- If a value is too small, it’s set to a minimum threshold.
- If a value is too large, it’s set to a maximum threshold. This is done to reduce the effect of outliers or extreme values that can distort analysis or model training.
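In practice, capping is usually just a clip to fixed bounds; a minimal sketch using NumPy and the income bounds mentioned above (the input values are made up):

```python
import numpy as np

incomes = np.array([0.2, 1.0, 4.2, 22.0])      # made-up values, in tens of thousands
capped = np.clip(incomes, 0.4999, 15.0001)     # cap to the observed dataset range
print(capped)                                  # [ 0.4999  1.      4.2    15.0001]
```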
✅ Why Use Capping?
Reason | Description |
---|---|
Reduce outlier effect | Models like linear regression can be skewed by extreme values. |
Avoid division errors | In log transforms, very small values can break calculations. |
Keep data in known range | Some ML algorithms assume features fall in a specific range. |
⚠️ A Tradeoff
While capping improves stability, you also lose information about how extreme the original values were. That’s okay if those extremes are noise or rare — but bad if they’re meaningful.
📌 Tip: It’s common to work with engineered or normalized features in ML. Just make sure to understand them before jumping into model training.
Why is skewed data harder for ML algorithms?
a. Many ML models assume or work better with linear or symmetric data.
Let’s take linear regression as an example:
- It assumes a roughly linear relationship between features and target.
- If a feature is right-skewed, that relationship may be nonlinear, distorted, or dominated by outliers.
- Result: The model might overfit the tail or underperform on the dense (important) part of the distribution.
It’s harder to find a clear pattern unless you “pull in” those extreme values.
b. Optimization algorithms struggle
Models like logistic regression or linear SVMs are often trained with gradient-based optimizers, which work best when:
- Features are on similar scales,
- Gradients behave predictably.
Right-skewed features have:
- Long tails with rare, extreme values → they contribute too much to the gradients.
- That throws off optimization, making training slower or unstable.
c. Distance-based models (like k-NN or clustering) become misleading
- Skewed features mean Euclidean distance (how “far” two points are) becomes unreliable.
- A single large feature can dominate the entire distance calculation.
What does “transforming” the data mean?
“You can’t just change the distribution, right?”
In ML, transforming means applying a mathematical function to reshape the values, while preserving their relative order or meaning.
✅ Examples of Transformations
Log transform
- If `x = [1, 10, 100, 1000]`, then `np.log(x)` → `[0, 2.3, 4.6, 6.9]`
- Still increasing, still ordered.
- But now: much less skewed, easier to work with.

Square root
- `np.sqrt([1, 4, 16, 100])` → `[1, 2, 4, 10]`
- Again: relative relationships preserved, but large numbers pulled inward.
Test Set
It may feel premature to set aside part of your data early on, especially before exploring the dataset in detail. However, it’s a critical step for ensuring a reliable machine learning workflow.
Why Set Aside a Test Set Early?
- Your brain is a powerful pattern detector, but that also makes it prone to overfitting.
- If you explore the full dataset (including the test set), you might unconsciously:
- Notice patterns specific to the test set
- Choose or tune models based on those patterns
This can lead to:
- Over-optimistic performance estimates
- A model that looks good in testing, but performs poorly in the real world
This problem is known as data snooping (or data leakage).
Data Snooping (Bias)
Data snooping happens when information from the test set leaks into the model selection or training process. This contaminates your evaluation.
Example:
You might:
- Look at the test set’s distribution
- Notice a correlation
- Decide to use a specific model architecture because of it
Now, when you measure performance on the test set:
- It’s no longer an unseen, independent measure
- Your result is biased — and over-optimistic
✅ Solution: Keep the Test Set Sacred
- Set aside the test set from the very beginning
- Don’t look at it until the very end, after:
- Data exploration
- Feature engineering
- Model selection
- Cross-validation on the training set
Use it only once:
- To evaluate the final model’s generalization performance
📌 Tip: You can also create a validation set (or use cross-validation) during model development, and keep the test set completely untouched until the final evaluation.
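A minimal sketch of carving out a validation set while keeping the test set untouched (the data here is synthetic and the split sizes are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic placeholder data so the example runs on its own.
X = np.random.rand(100, 3)
y = np.random.rand(100)

# First set aside the test set; it stays untouched until final evaluation.
X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Then split a validation set off the remaining training data
# (0.25 of the remaining 80% = 20% of the original data).
X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full, test_size=0.25, random_state=42)
```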
.iloc vs .loc in Pandas
In pandas, both `.iloc` and `.loc` are used to access rows and columns in a DataFrame, but they work differently:
🔢 `.iloc[]` → Integer Location
- Access by position (index number)
- Purely integer-based indexing
Example:
df.iloc[0] # First row
df.iloc[0:3] # First 3 rows
df.iloc[0, 1] # Row 0, column 1
🏷️ `.loc[]` → Label-based Location
- Access by label (row/column names)
- Can use slices, lists, or boolean masks
Example:
df.loc[0] # Row with label/index 0
df.loc[0:3] # Rows from label 0 to 3 (inclusive!)
df.loc[0, 'name'] # Value at row 0, column 'name'
df.loc[df['age'] > 30] # All rows where age > 30
Summary Table
Feature | .iloc[] | .loc[] |
---|---|---|
Access by | Index position | Index/column label |
Returns | Rows/columns by number | Rows/columns by name |
Slice | Excludes end | Includes end |
📌 Tip: Use `.iloc` when you’re working with positions, and `.loc` when you’re working with labels.
🌌 The Answer to Life, the Universe, and Everything
In The Hitchhiker’s Guide to the Galaxy, a group of hyper-intelligent, pan-dimensional beings builds a supercomputer named Deep Thought to calculate:
“The Answer to the Ultimate Question of Life, the Universe, and Everything.”
After seven and a half million years of computation, Deep Thought finally responds:
“42.”
😐 Wait, what?
The answer — “42” — is intentionally absurd and meaningless without the right question.
Deep Thought explains:
“I think the problem, to be quite honest with you, is that you’ve never actually known what the question is.”
So they build an even greater computer (called Earth) to figure out the actual question.
What does it mean?
Douglas Adams has said he picked 42 arbitrarily — it’s just a joke. It satirizes:
- Our obsession with finding deep, cosmic meaning
- The idea that a single number could explain everything
It’s become a cultural meme used to:
- Humorously answer unanswerable questions
- Signal geeky, sci-fi humor
📘 Fun Fact: In ASCII, `*` is the character with code 42, which some interpret as a nod to “everything.”
“The answer to life, the universe, and everything is 42.”
— Deep Thought (and every Hitchhiker’s fan)
Does the ML algorithm have memory or state across runs?
No — by default, it does not.
When you rerun your ML training script, the algorithm starts from scratch. So technically, your model doesn’t remember anything from previous runs — you’re right about that.
✅ So what’s the concern, then?
The concern is you, the developer or analyst.
🎯 The Real Problem: You (and your code) leak test data over time
Even if the ML model doesn’t remember anything across runs:
- You might see parts of the test set on one run, make decisions, change preprocessing, tweak model hyperparameters, and rerun; now your model is indirectly influenced by test set knowledge.
- Over time, you’re accidentally training your model on patterns that leak from the test set, defeating its purpose.
📌 Think of it this way:
You want your test set to simulate real-world, unseen data.
But if it keeps changing randomly every time you rerun your code, you and your model will eventually see the whole dataset.
So the key principle here is:
The test set must remain fixed and untouched across multiple experiments.
This is why we care about repeatable splits.
Why isn’t `np.random.seed(42)` enough?
Setting a random seed is enough for reproducibility — but only as long as your dataset doesn’t change.
Here’s what that means:
✅ What setting the seed does:

```python
np.random.seed(42)
shuffled_indices = np.random.permutation(len(data))
```

This guarantees the same shuffled order of indices every time for the same dataset.
So:
- Your train/test split is the same across runs.
- You’re reproducible.
❌ But the problem is:
If you get new data (say, a newer version of your dataset with more rows), `np.random.permutation()` will shuffle all the data again, and:
- Previously “test” instances may now end up in training, and vice versa.
- The seed doesn’t protect against dataset changes.
This defeats the main goal: to permanently isolate test data even when new data arrives.
✅ So what is the better solution?
Use a deterministic rule based on something that won’t change, like:
- A unique, stable ID (e.g., customer ID, row hash, image filename).
- Apply a function like `hash(id)` and assign an instance to the test set if its hash falls in the bottom X% of hash values.
This way:
- The same data points always end up in the test set.
- New data gets tested only if it’s new.
- You don’t leak info from train to test even when your dataset updates.
Sampling Bias from Coarse Location Info
🔍 What It Means:
The dataset includes location information (like latitude and longitude), but this info is coarse — meaning it’s not very precise.
- Multiple districts share the exact same location values.
A possible implementation using a hash:

```python
from zlib import crc32

import numpy as np

def is_id_in_test_set(identifier, test_ratio):
    # Keep the instance in the test set if its 32-bit hash falls in the
    # lowest `test_ratio` fraction of the hash space.
    return crc32(np.int64(identifier)) < test_ratio * 2**32

def split_data_with_id_hash(data, test_ratio, id_column):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: is_id_in_test_set(id_, test_ratio))
    return data.loc[~in_test_set], data.loc[in_test_set]
```
Creating Unique IDs for Dataset Splitting
- The housing dataset lacks a built-in unique identifier column.
- A simple solution is to use the row index as an ID by resetting the index:

  ```python
  housing_with_id = housing.reset_index()  # adds an `index` column
  train_set, test_set = split_data_with_id_hash(housing_with_id, 0.2, "index")
  ```
- If using the row index as ID:
- Ensure new data is only appended (no deletion or reordering).
- If this isn’t feasible, create an ID from stable features, e.g., combine latitude and longitude:

  ```python
  housing_with_id["id"] = housing["longitude"] * 1000 + housing["latitude"]
  train_set, test_set = split_data_with_id_hash(housing_with_id, 0.2, "id")
  ```
- Using stable features helps maintain consistent IDs over time.
Scikit-learn’s `train_test_split()` does a simple random split in one line:

```python
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)
```
Stratified Sampling and Income Categories
- In surveys, maintaining important population ratios (e.g., 48.9% males) in samples is crucial to avoid bias.
- Stratified sampling divides the population into homogeneous subgroups (strata) and samples the right number from each, ensuring the test set represents the overall population.
- Purely random sampling can lead to skewed samples; for example, a ~10.7% chance of female representation falling outside 48.5–53.5%, biasing results.
🏦 Handling Continuous Variables: Income Categories
- Median income is a key predictor for housing prices, but since it’s continuous, it needs to be binned into categories for stratified sampling.
- Most median incomes cluster between $15,000 and $60,000 (scaled values 1.5 to 6), but some go much higher.
- To avoid bias, strata should be:
- Few enough to have sufficient data in each
- Large enough for meaningful representation
- Example code uses `pd.cut()` to create 5 income categories (see the sketch below):
  - Category 1: 0 to 1.5 (less than $15,000)
  - Category 2: 1.5 to 3, etc.
This allows creating a stratified test set that accurately reflects income distribution in the population.
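A sketch of the binning step with `pd.cut()`, assuming the `housing` DataFrame from these notes is loaded. The first two bin edges (1.5 and 3) come from the categories listed above; the remaining edges (4.5, 6, and infinity) are assumptions filling in the “etc.”:

```python
import numpy as np
import pandas as pd

# Bin the continuous median_income into 5 income categories for stratification.
housing["income_cat"] = pd.cut(
    housing["median_income"],
    bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],   # edges beyond 3.0 are assumed
    labels=[1, 2, 3, 4, 5],
)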
❓ Should We Stratify by Feature?
Stratifying means splitting your dataset so that your train/test (or validation) sets reflect the distribution of important features in the overall data.
When to Stratify:
- If a feature strongly influences the target variable, stratifying on it helps ensure:
- The model sees all important subgroups during training.
- The evaluation on the test set is more reliable and representative.
- Common features to stratify on include:
- Categorical features (e.g., gender, region, class)
- Discretized continuous features (e.g., income categories, age bins)
Discretization means converting continuous values into distinct categories or bins.
When Not to Stratify:
- If the feature is not relevant or weakly correlated with the target, stratifying may add unnecessary complexity.
- If stratification leads to very small strata, it might cause unstable splits.
What Is Stratified Sampling?
Stratified sampling means splitting your dataset in such a way that the distribution of a specific variable (usually categorical) is preserved across the train and test sets.
In other words:
- You group data into strata based on some important feature, like income category or class label.
- Then you sample from each stratum proportionally.
📊 Why do this?
Because random sampling can break the natural distribution of important features — especially in small datasets.
With stratified sampling, your test set keeps the same proportions. For example, with 100 rows (60 low, 30 medium, 10 high income) and an 80% train / 20% test split:
- Train: 48 low, 24 medium, 8 high
- Test: 12 low, 6 medium, 2 high
🔧 When to Use Stratified Sampling?
✅ Use it when:
- You have imbalanced classes (e.g., fraud vs. non-fraud, disease vs. healthy).
- You’re splitting based on categorical features (e.g., education level, income group).
- The target variable’s distribution matters.
🚫 Not useful when:
- The feature has no meaningful groupings,
- Or is continuous and evenly distributed.
Stratified in scikit-learn
```python
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(data, data["income_cat"]):
    strat_train_set = data.iloc[train_index]   # use iloc: the splitter returns positional indices
    strat_test_set = data.iloc[test_index]
```

Here:
- `income_cat` is a categorical version of income (e.g., bucketed into 5 groups).
- The split ensures the same category proportions in train and test.
To generate several different stratified splits (e.g., for repeated evaluation):

```python
from sklearn.model_selection import StratifiedShuffleSplit

splitter = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
strat_splits = []
for train_index, test_index in splitter.split(housing, housing["income_cat"]):
    strat_train_set_n = housing.iloc[train_index]
    strat_test_set_n = housing.iloc[test_index]
    strat_splits.append([strat_train_set_n, strat_test_set_n])
```
Or, for a single stratified split, simply pass `stratify=` to `train_test_split()`:

```python
strat_train_set, strat_test_set = train_test_split(
    housing, test_size=0.2, stratify=housing["income_cat"], random_state=42)
```
⚠️ Neglected but Critical: Test Set Generation
- Generating a proper test set is a crucial step in any machine learning project.
- It is often neglected, but getting this right helps avoid biases and ensures reliable evaluation.
- The concepts learned here will also be important later for cross-validation.
- After preparing the test set correctly, the next stage is exploring the data.
Matplotlib plot options
Argument | Meaning |
---|---|
kind="scatter" | Tells pandas to make a scatter plot, not a line/bar plot. |
x="longitude" | Horizontal axis shows longitude (i.e., east-west location). |
y="latitude" | Vertical axis shows latitude (i.e., north-south location). |
grid=True | Turns on the grid behind the plot (for better readability). |
alpha=0.2 | Transparency (0 = invisible, 1 = opaque). Makes overlapping points visible. |
s=housing["population"] / 100 | Sets bubble size: more population = larger circle. Divided by 100 to scale it down. |
label="population" | Label for the plot (shows in legend — not always effective with scatter). |
c="median_house_value" | Sets color of each point based on house value. |
cmap="jet" | The colormap: “jet” means low values are blue, high values are red. |
colorbar=True | Adds a side bar to explain the color scale (house prices). |
legend=True | Tries to show a legend (usually not very useful here unless you have a categorical label). |
sharex=False | Don’t force axis sharing — only matters in subplot grids. |
figsize=(10, 7) | Width × height of the figure, in inches. |
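Putting those arguments together, the call looks roughly like this (a sketch assuming the `housing` DataFrame and matplotlib are already loaded):

```python
import matplotlib.pyplot as plt

housing.plot(kind="scatter", x="longitude", y="latitude", grid=True,
             alpha=0.2, s=housing["population"] / 100, label="population",
             c="median_house_value", cmap="jet", colorbar=True,
             legend=True, sharex=False, figsize=(10, 7))
plt.show()
```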
`numeric_only=True` in `corr()`

```python
# Compute pairwise correlations over numeric columns only;
# non-numeric (text) columns are excluded.
corr_matrix = housing.corr(numeric_only=True)
```
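A common follow-up is to look at how each numeric attribute correlates with the target (assuming `median_house_value` is the target column, as in the plot above):

```python
# Sort correlations with the target, strongest positive first.
corr_matrix["median_house_value"].sort_values(ascending=False)
```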
📈 Insights from Data Visualization and Correlation
1. Observing Scatter Plots and Data Patterns
- The scatter plot shows a strong positive correlation: as one variable increases, so does the other, with points fairly close to the trend line.
- A price cap at $500,000 is clearly visible as a horizontal line, reflecting a limit in the data.
- Several other horizontal lines appear at price points like $450,000, $350,000, and $280,000, indicating data quirks or artificial limits.
- To prevent the model from simply learning these quirks, consider removing or treating these districts separately.
2. Understanding the Correlation Coefficient
- The correlation coefficient measures linear relationships — how much one variable increases or decreases as the other does.
- It cannot detect nonlinear relationships, which might be important but invisible to this metric.
- Examples show datasets with zero correlation coefficient but clear nonlinear patterns.
- A correlation of ±1 means a perfect linear relationship, but it says nothing about the slope or units.
- Example: Height in inches vs. height in feet has a correlation coefficient of 1.
3. Exploring Attribute Combinations
- Before feeding data into machine learning algorithms, try combining different attributes to discover useful features.
- Feature engineering can improve model performance by capturing more complex patterns.
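For example, a sketch of a few combined attributes; the column names (`total_rooms`, `total_bedrooms`, `households`) are assumed from the California housing dataset used in these notes:

```python
# Ratios often carry more signal than raw counts.
housing["rooms_per_household"] = housing["total_rooms"] / housing["households"]
housing["bedrooms_ratio"] = housing["total_bedrooms"] / housing["total_rooms"]
housing["people_per_household"] = housing["population"] / housing["households"]
```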
📌 Tip: Always visualize your data and understand the limitations of simple statistics like correlation before modeling.
❓ Why Are There Data Quirks (Horizontal Lines) in the Plot?
- The horizontal lines at certain price values (like $450,000, etc.) often happen because of data capping or rounding:
  - The dataset might have an upper limit on housing prices (e.g., any house over $500,000 is recorded as $500,000).
  - Some prices might be rounded or grouped into buckets for privacy or simplification.
- These quirks are artifacts of data collection or preprocessing, not natural variations.
- The problem: models may learn these artificial patterns instead of true underlying relationships, hurting generalization.
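One way to act on this, as suggested above, is to drop the capped districts before training; a sketch (the $500,000 threshold is the cap mentioned in these notes):

```python
# Remove districts whose target was capped, so the model does not
# learn the artificial $500,000 ceiling.
housing_uncapped = housing[housing["median_house_value"] < 500_000]
```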