End-To-End
Pipelines
A pipeline is a sequence of data processing steps that are chained together. In machine learning, pipelines are crucial because:
- Data needs multiple transformations before it becomes usable (e.g., missing value imputation, scaling, feature engineering).
- By chaining transformations and model training into a pipeline, you ensure consistency (the same steps are applied to training and future data).
- Pipelines help reduce code duplication, improve readability, and simplify cross-validation.
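The chaining idea above can be sketched with Scikit-Learn's `Pipeline`. This is a minimal illustration, not a recommended setup: the toy arrays, the step names (`"impute"`, `"scale"`, `"model"`), and the choice of `LinearRegression` are assumptions made for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two features, one missing value in training
X_train = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [4.0, 40.0]])
y_train = np.array([1.0, 2.0, 3.0, 4.0])
X_new = np.array([[np.nan, 25.0], [5.0, 50.0]])  # "future" data

# Each step is fitted on the training data only, then reused as-is on new data
pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("model", LinearRegression()),
])
pipeline.fit(X_train, y_train)

predictions = pipeline.predict(X_new)
stored_medians = pipeline.named_steps["impute"].statistics_
print(stored_medians)  # medians learned from X_train: [ 2.5 30. ]
```

Because the imputer and scaler live inside the pipeline, calling `predict` on new data applies exactly the transformations learned during `fit`.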
Key Concepts:
- Asynchronous components: Each component in the pipeline processes data independently and at different times.
- Data stores as interfaces: Components communicate by reading/writing to/from data stores. This separation enhances modularity.
- Resilience: If a component fails, others can still work using cached or last-available data.
- Monitoring is essential: Without it, failures may silently degrade performance as stale or incorrect data propagates.
Framing the Problem
Before you start training a model, you need to define the problem clearly. This is step zero in any machine learning project. Misframing the problem leads to building the wrong solution.
❓ Ask These Questions:
- What type of learning problem is it?
  - Supervised Learning → You have labeled data (input-output pairs). Examples:
    - Predicting house prices (regression)
    - Email spam detection (classification)
  - Unsupervised Learning → You have no labels, only inputs. Examples:
    - Customer segmentation (clustering)
    - Dimensionality reduction for visualization
  - Reinforcement Learning → An agent learns by interacting with an environment and receiving rewards. Examples:
    - Game playing (e.g., AlphaGo)
    - Robotics control
- What type of supervised task is it?
  - Classification: Predicting discrete categories (e.g., spam or not spam)
  - Regression: Predicting continuous values (e.g., predicting temperature)
  - Multilabel or Multiclass
    - Multiclass → One label from many (e.g., digit recognition: 0–9)
    - Multilabel → Multiple labels can be true at once (e.g., a movie can be both “comedy” and “drama”)
- What kind of learning setup fits best?
  - Batch Learning (Offline Learning)
    - Train the model on the full dataset at once
    - Model is static after training unless you retrain it from scratch
    - Suitable for large, stable datasets
  - Online Learning
    - Model is updated incrementally as new data comes in
    - Ideal for streaming data or situations where data evolves over time
    - Useful when computational resources are limited or retraining from scratch is costly
By answering these questions, you define the type of solution you need, which in turn influences:
- Your data collection strategy
- The algorithms you’ll consider
- How you’ll evaluate performance
- The infrastructure for deployment and retraining
📘 Notation
This section defines several common machine learning notations used throughout the book.
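- Let $m$ be the number of instances in the dataset.
  - Example: If the validation set has 2,000 districts, then $m = 2000$.
- Let $\mathbf{x}^{(i)}$ be the feature vector of the $i$-th instance (excluding the label).
- Let $y^{(i)}$ be the label (target value) of the $i$-th instance.
  - Example: the first instance is described by the pair $(\mathbf{x}^{(1)}, y^{(1)})$.
- Let $\mathbf{X}$ be the feature matrix, containing all feature vectors in the dataset:
  - Each row is an instance: row $i$ of $\mathbf{X}$ is $(\mathbf{x}^{(i)})^\top$.
- Let $h$ be the prediction function (also called a hypothesis).
  - Given an instance’s feature vector $\mathbf{x}^{(i)}$, the system predicts: $\hat{y}^{(i)} = h(\mathbf{x}^{(i)})$
  - Example: If the system predicts $\hat{y}^{(1)}$ for the first instance, then the prediction error is: $\hat{y}^{(1)} - y^{(1)}$
- Let $\text{RMSE}(\mathbf{X}, h)$ be the Root Mean Squared Error cost function evaluated on dataset $\mathbf{X}$ using hypothesis $h$.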
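For reference, here is the cost function written out in full. This assumes the standard definition of RMSE: $m$ is the number of instances, $\mathbf{x}^{(i)}$ and $y^{(i)}$ are the feature vector and label of the $i$-th instance, and $h$ is the hypothesis.

$$
\text{RMSE}(\mathbf{X}, h) = \sqrt{\frac{1}{m} \sum_{i=1}^{m} \left( h(\mathbf{x}^{(i)}) - y^{(i)} \right)^2}
$$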
✅ Typography Conventions
- Scalars → lowercase italic: $m$, $y^{(i)}$, $h$
- Vectors → lowercase bold: $\mathbf{x}^{(i)}$
- Matrices → uppercase bold: $\mathbf{X}$
📏 Select a Performance Measure: RMSE vs MAE
To evaluate how well a machine learning model performs, we need a way to measure the distance between its predictions and the actual target values. Two common metrics for regression tasks are:
✅ RMSE (Root Mean Squared Error)
- Measures the Euclidean distance between predicted and true values.
- It uses the $\ell_2$ norm: $\lVert \mathbf{v} \rVert_2$
- Sensitive to outliers: squaring the errors makes large errors count more.
- Preferred when outliers are exponentially rare, as with bell-shaped (normally distributed) errors.
✅ MAE (Mean Absolute Error)
- Measures the Manhattan distance between predictions and targets.
- It uses the $\ell_1$ norm: $\lVert \mathbf{v} \rVert_1$
- More robust to outliers than RMSE. (It is less affected by extreme values or rare, unusually large/small errors)
- Weights each error in proportion to its magnitude instead of squaring it, so large errors are not amplified.
🔢 General Form: Norm
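- For a vector $\mathbf{v}$ with $n$ elements: $\lVert \mathbf{v} \rVert_k = \left( |v_1|^k + |v_2|^k + \dots + |v_n|^k \right)^{1/k}$
- Special cases:
  - $\ell_0$: number of nonzero elements
  - $\ell_1$: Manhattan norm (MAE)
  - $\ell_2$: Euclidean norm (RMSE)
  - $\ell_\infty$: max absolute value in the vector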
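The general norm and its special cases can be sketched in a few lines of plain Python. The helper names `lk_norm` and `l0_count` and the sample error vector are made up for this illustration.

```python
import math

def lk_norm(v, k):
    """Return the l_k norm of a sequence of numbers (k >= 1, or math.inf)."""
    if math.isinf(k):
        # Limit case: only the largest absolute value matters
        return max(abs(x) for x in v)
    return sum(abs(x) ** k for x in v) ** (1.0 / k)

def l0_count(v):
    """Return the number of nonzero elements (the so-called l_0 'norm')."""
    return sum(1 for x in v if x != 0)

errors = [3.0, -4.0, 0.0]  # hypothetical prediction errors

print(lk_norm(errors, 1))         # 7.0 (Manhattan)
print(lk_norm(errors, 2))         # 5.0 (Euclidean)
print(lk_norm(errors, math.inf))  # 4.0 (max absolute value)
print(l0_count(errors))           # 2
```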
🧠 Insight
- Higher values of the norm index $k$ focus more on large errors and ignore smaller ones.
- Use RMSE when large errors should be penalized more heavily.
- Use MAE when all errors should be treated equally.
📌 Tip: Always plot your error distribution. If you have many outliers or a skewed distribution, MAE might give you a more stable picture.
Check the Assumptions
💡 What Does “More Robust to Outliers” Mean?
When we say a metric is more robust to outliers, we mean:
- It is less affected by extreme values or rare, unusually large/small errors.
- These outliers don’t overly influence the metric’s overall value.
For example:
- RMSE squares the errors, so large errors have a disproportionately big impact.
- MAE treats all errors equally, so it’s more stable when there are extreme predictions.
Use MAE if:
- Your dataset has noise or unpredictable spikes.
- You want to avoid letting a few big errors dominate the evaluation.
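To make the robustness difference concrete, the sketch below compares both metrics on a set of predictions with and without a single large error. The numbers are invented for illustration.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    return math.sqrt(sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error of two equal-length sequences."""
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [100.0, 100.0, 100.0, 100.0, 100.0]
clean = [101.0, 99.0, 101.0, 99.0, 101.0]  # every error has magnitude 1
spiky = [101.0, 99.0, 101.0, 99.0, 150.0]  # same, except one error of 50

rmse_clean, mae_clean = rmse(y_true, clean), mae(y_true, clean)
rmse_spiky, mae_spiky = rmse(y_true, spiky), mae(y_true, spiky)

print(rmse_clean, mae_clean)  # both are 1.0
print(rmse_spiky, mae_spiky)  # RMSE is about 22.4, MAE is about 10.8
```

One outlier raises both metrics, but RMSE grows roughly twice as much as MAE here because the error of 50 is squared before averaging.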
📒 Notebooks and Google Colab
What Is a Notebook?
A Jupyter Notebook (used in Google Colab) is an interactive coding environment that combines:
- Code
- Explanatory text (Markdown)
- Outputs like graphs, tables, etc.
You can run cells one at a time, view results immediately, and mix code with documentation, making it great for data exploration, ML experiments, and teaching.
🔗 Google Colab Shortcuts
- Colab is a free, cloud-based Jupyter notebook environment provided by Google.
- It lets you run Python code without any setup on your local machine.
- You can access many useful keyboard shortcuts.
Here’s a full list of Colab/Jupyter shortcuts:
👉 Jupyter Notebook Shortcuts – Towards Data Science
To view shortcuts inside Colab:
- Click on the “Tools” menu
- Select “Keyboard shortcuts”
Or press: Cmd/Ctrl + M + H
💾 Accessing Files in Google Colab
When you run a Colab notebook:
- Once Google Drive is mounted (for example via `drive.mount('/content/drive')` from the `google.colab` package), it is available under:
/content/drive/MyDrive
- To save a model or data file to your Drive:
!cp /content/my_model_file /content/drive/MyDrive
- The `!` at the start tells Colab to run a Linux shell command, not Python code. `cp` is the command for copying files in Linux.
🛠️ Colab notebooks run on Linux (Ubuntu) virtual machines, so basic Linux commands work!
📦 About Imports
In regular Python scripts:
- It’s recommended by PEP 8 (Python’s style guide) to put all `import` statements at the top of the file.
But in notebooks:
- It’s common to import packages where they’re needed, inside individual cells.
- This makes your code more modular and easier to follow, especially for tutorials or demos.
Example:
import pandas as pd # common import for data manipulation
This flexibility is part of why notebooks are so popular for experimentation and teaching.
📊 Understanding Percentiles
Percentiles help describe the distribution of a dataset by indicating how data is spread across values.
🔢 Definitions
- A percentile is the value below which a given percentage of data points fall.
For example:
- The 25th percentile = 25% of the values are less than or equal to this number.
- The 50th percentile = median (middle value)
- The 75th percentile = 75% of values fall below this value.
📘 Terminology
Term | Meaning |
---|---|
25th percentile | First quartile (Q1) |
50th percentile | Median (Second quartile, Q2) |
75th percentile | Third quartile (Q3) |
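As a quick check of these definitions, the quartiles of a small array can be computed with NumPy. The array of integers 1 to 100 is just an illustrative input, and `np.percentile` uses linear interpolation between data points by default.

```python
import numpy as np

values = np.arange(1, 101)  # the integers 1..100

# Q1, median (Q2), and Q3 in one call
q1, median, q3 = np.percentile(values, [25, 50, 75])
print(q1, median, q3)  # 25.75 50.5 75.25

# Roughly 25% of the values lie at or below Q1
share_below_q1 = np.mean(values <= q1)
print(share_below_q1)  # 0.25
```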
Why the data has been transformed:
- The dataset has been preprocessed:
  - The `median_income` values are scaled to range between 0.5 and 15, but not exactly:
    - Minimum is 0.4999
    - Maximum is 15.0001
- These are floating-point approximations and represent:
  - $5,000 → stored as 0.5
  - $30,000 → stored as 3.0
  - $150,000 → capped and stored as 15.0

Transformation | Explanation |
---|---|
Scaled | All income values are divided by 10,000, so income is now in tens of thousands of dollars |
Capped (clipped) | Incomes above about $150,000 were set to 15 (15 × 10k), and incomes below about $5,000 were set to 0.5 |
📌 What Is Capping?
Capping means limiting values in a dataset so they do not go beyond a certain range.
- If a value is too small, it’s set to a minimum threshold.
- If a value is too large, it’s set to a maximum threshold. This is done to reduce the effect of outliers or extreme values that can distort analysis or model training.
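A minimal sketch of scaling followed by capping, using the 0.5 and 15 thresholds described above. The raw dollar amounts and the helper name `cap` are hypothetical.

```python
def cap(value, lower, upper):
    """Clamp value into the closed range [lower, upper]."""
    return max(lower, min(value, upper))

incomes_usd = [2_000, 30_000, 80_000, 400_000]  # hypothetical raw incomes

scaled = [x / 10_000 for x in incomes_usd]     # now in tens of thousands
capped = [cap(x, 0.5, 15.0) for x in scaled]   # limit to the known range

print(scaled)  # [0.2, 3.0, 8.0, 40.0]
print(capped)  # [0.5, 3.0, 8.0, 15.0]
```

With NumPy or pandas the same operation is usually written with `np.clip` or `Series.clip`.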
✅ Why Use Capping?
Reason | Description |
---|---|
Reduce outlier effect | Models like linear regression can be skewed by extreme values. |
Avoid numerical errors | Log transforms are undefined at zero and blow up near it, so a lower cap keeps inputs in a safe range. |
Keep data in known range | Some ML algorithms assume features fall in a specific range. |
⚠️ A Tradeoff
While capping improves stability, you also lose information about how extreme the original values were. That’s okay if those extremes are noise or rare — but bad if they’re meaningful.
📌 Tip: It’s common to work with engineered or normalized features in ML. Just make sure to understand them before jumping into model training.
Why is skewed data harder for ML algorithms?
a. Many ML models assume or work better with linear or symmetric data.
Let’s take linear regression as an example:
- It assumes a roughly linear relationship between features and target.
- If a feature is right-skewed, that relationship may be nonlinear, distorted, or dominated by outliers.
- Result: The model might overfit the tail or underperform on the dense (important) part of the distribution.
It’s harder to find a clear pattern unless you “pull in” those extreme values.
b. Optimization algorithms struggle
Models like logistic regression or SVM are often trained with gradient-based optimizers, which work best when:
- Features are on similar scales,
- Gradients behave predictably.
Right-skewed features have:
- Long tails with rare, extreme values → they contribute too much to the gradients.
- That throws off optimization, making training slower or unstable.
c. Distance-based models (like k-NN or clustering) become misleading
- Skewed features mean Euclidean distance (how “far” two points are) becomes unreliable.
- A single large feature can dominate the entire distance calculation.
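The dominance effect can be seen with three made-up people described by age and income. The names and values are hypothetical, and the log transform is only one possible remedy.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# (age in years, income in dollars): hypothetical values
alice = (25, 50_000)
bob = (60, 50_500)    # 35 years older, similar income
carol = (26, 52_000)  # 1 year older, slightly higher income

# Raw features: income differences swamp the age differences
raw_ab = euclidean(alice, bob)    # about 501
raw_ac = euclidean(alice, carol)  # about 2000

def with_log_income(person):
    age, income = person
    return (age, math.log(income))

# After a log transform of income, age is no longer drowned out
log_ab = euclidean(with_log_income(alice), with_log_income(bob))    # about 35
log_ac = euclidean(with_log_income(alice), with_log_income(carol))  # about 1
print(raw_ab < raw_ac, log_ac < log_ab)  # True True
```

On the raw features Bob looks closer to Alice than Carol does, even though Carol is nearly the same age. After the transform the ranking flips.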
What does “transforming” the data mean?
“You can’t just change the distribution, right?”
In ML, transforming means applying a mathematical function to reshape the values, while preserving their relative order or meaning.
✅ Examples of Transformations
Log transform
If x = [1, 10, 100, 1000]
np.log(x)
→ [0, 2.3, 4.6, 6.9]
- Still increasing, still ordered.
- But now: much less skewed, easier to work with.
Square root
np.sqrt([1, 4, 16, 100])
→ [1, 2, 4, 10]
Again: relative relationships preserved, but large numbers pulled inward.
Test Set
It may feel premature to set aside part of your data early on, especially before exploring the dataset in detail. However, it’s a critical step for ensuring a reliable machine learning workflow.
Why Set Aside a Test Set Early?
- Your brain is a powerful pattern detector, but that also makes it prone to overfitting.
- If you explore the full dataset (including the test set), you might unconsciously:
- Notice patterns specific to the test set
- Choose or tune models based on those patterns
This can lead to:
- Over-optimistic performance estimates
- A model that looks good in testing, but performs poorly in the real world
This problem is known as data snooping (or data leakage).
Data Snooping (Bias)
Data snooping happens when information from the test set leaks into the model selection or training process. This contaminates your evaluation.
Example:
You might:
- Look at the test set’s distribution
- Notice a correlation
- Decide to use a specific model architecture because of it
Now, when you measure performance on the test set:
- It’s no longer an unseen, independent measure
- Your result is biased — and over-optimistic
✅ Solution: Keep the Test Set Sacred
- Set aside the test set from the very beginning
- Don’t look at it until the very end, after:
- Data exploration
- Feature engineering
- Model selection
- Cross-validation on the training set
Use it only once:
- To evaluate the final model’s generalization performance
📌 Tip: You can also create a validation set (or use cross-validation) during model development, and keep the test set completely untouched until the final evaluation.
.iloc vs .loc in Pandas
In pandas, both `.iloc` and `.loc` are used to access rows and columns in a DataFrame, but they work differently:
🔢 .iloc[] → Integer Location
- Access by position (index number)
- Purely integer-based indexing
Example:
df.iloc[0] # First row
df.iloc[0:3] # First 3 rows
df.iloc[0, 1] # Row 0, column 1
🏷️ .loc[] → Label-based Location
- Access by label (row/column names)
- Can use slices, lists, or boolean masks
Example:
df.loc[0] # Row with label/index 0
df.loc[0:3] # Rows from label 0 to 3 (inclusive!)
df.loc[0, 'name'] # Value at row 0, column 'name'
df.loc[df['age'] > 30] # All rows where age > 30
Summary Table
Feature | .iloc[] | .loc[] |
---|---|---|
Access by | Index position | Index/column label |
Returns | Rows/columns by number | Rows/columns by name |
Slice | Excludes end | Includes end |
📌 Tip: Use `.iloc` when you’re working with position, and `.loc` when you’re working with labels.
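The difference is easiest to see when the index labels are not simply 0, 1, 2, and so on. The small DataFrame below is a made-up example.

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["Ann", "Bob", "Cy", "Di"], "age": [25, 35, 45, 55]},
    index=[10, 20, 30, 40],  # labels differ from positions on purpose
)

first_by_position = df.iloc[0]  # position 0, which has label 10
first_by_label = df.loc[10]     # same row, selected by label

pos_slice = df.iloc[0:2]        # positions 0 and 1 (end excluded)
label_slice = df.loc[10:30]     # labels 10, 20, and 30 (end included)

older = df.loc[df["age"] > 30]  # boolean mask
print(len(pos_slice), len(label_slice))  # 2 3
```

Note that `df.loc[0]` would raise a `KeyError` here, because no row carries the label 0.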
🌌 The Answer to Life, the Universe, and Everything
In The Hitchhiker’s Guide to the Galaxy, a group of hyper-intelligent, pan-dimensional beings builds a supercomputer named Deep Thought to calculate:
“The Answer to the Ultimate Question of Life, the Universe, and Everything.”
After seven and a half million years of computation, Deep Thought finally responds:
“42.”
😐 Wait, what?
The answer — “42” — is intentionally absurd and meaningless without the right question.
Deep Thought explains:
“I think the problem, to be quite honest with you, is that you’ve never actually known what the question is.”
So they build an even greater computer (called Earth) to figure out the actual question.
What does it mean?
Douglas Adams has said he picked 42 arbitrarily — it’s just a joke. It satirizes:
- Our obsession with finding deep, cosmic meaning
- The idea that a single number could explain everything
It’s become a cultural meme used to:
- Humorously answer unanswerable questions
- Signal geeky, sci-fi humor
📘 Fun Fact: In ASCII, character code 42 is `*` (the asterisk, often used as a wildcard), which some interpret as a nod to “everything.”
“The answer to life, the universe, and everything is 42.”
— Deep Thought (and every Hitchhiker’s fan)
Does the ML algorithm have memory or state across runs?
No — by default, it does not.
When you rerun your ML training script, the algorithm starts from scratch, so the model itself doesn’t remember anything from previous runs.
✅ So what’s the concern, then?
The concern is you, the developer or analyst.
🎯 The Real Problem: You (and your code) leak test data over time
Even if the ML model doesn’t remember anything across runs:
- You might see parts of the test set on one run, make decisions, change preprocessing, tweak model hyperparameters, and rerun. Now your model is indirectly influenced by test set knowledge.
- Over time, you’re accidentally training your model on patterns that leak from the test set, which defeats its purpose.
📌 Think of it this way:
You want your test set to simulate real-world, unseen data.
But if it keeps changing randomly every time you rerun your code, you and your model will eventually see the whole dataset.
So the key principle here is:
The test set must remain fixed and untouched across multiple experiments.
This is why we care about repeatable splits.
Why isn’t np.random.seed(42) enough?
Setting a random seed is enough for reproducibility — but only as long as your dataset doesn’t change.
Here’s what that means:
✅ What setting the seed does:
np.random.seed(42)
shuffled_indices = np.random.permutation(len(data))
This guarantees the same shuffled order of indices every time for the same dataset.
So:
- Your train/test split is the same across runs.
- You’re reproducible.
❌ But the problem is:
If you get new data (say, a newer version of your dataset with more rows), `np.random.permutation()` will shuffle all data again, and:
- Previously “test” instances may now end up in training, and vice versa.
- The seed doesn’t protect against dataset changes.
This defeats the main goal: to permanently isolate test data even when new data arrives.
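This failure mode can be reproduced without any real data. The sketch below applies the seeded-permutation split to row numbers only. The helper `shuffled_test_ids` and the dataset sizes (1,000 rows growing to 1,100) are assumptions made for the demonstration.

```python
import numpy as np

def shuffled_test_ids(n_rows, test_ratio=0.2, seed=42):
    """Return the set of row numbers that a seeded permutation puts in the test set."""
    np.random.seed(seed)
    shuffled = np.random.permutation(n_rows)
    test_size = int(n_rows * test_ratio)
    return set(shuffled[:test_size].tolist())

test_v1 = shuffled_test_ids(1000)        # original dataset
test_v1_again = shuffled_test_ids(1000)  # same data, rerun
test_v2 = shuffled_test_ids(1100)        # dataset after 100 new rows arrive

print(test_v1 == test_v1_again)  # True: the seed gives reproducibility
moved_to_train = test_v1 - test_v2
print(len(moved_to_train) > 0)   # True: old test rows leaked into training
```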
✅ So what is the better solution?
Use a deterministic rule based on something that won’t change, like:
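- A unique, stable ID (e.g., customer ID, row hash, image filename).
- Apply a function like `hash(id)` and assign the instance to the test set if its hash falls in the bottom X% of hash values. Use a hash that is stable across runs (such as CRC32): Python’s built-in `hash()` is randomized per process for strings.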
This way:
- The same data points always end up in the test set.
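- New data is split by the same rule: about X% of the new instances join the test set, and no instance that was previously in the training set ever moves to the test set.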
- You don’t leak info from train to test even when your dataset updates.
Sampling Bias from Coarse Location Info
🔍 What It Means:
The dataset includes location information (like latitude and longitude), but this info is coarse — meaning it’s not very precise.
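- Multiple districts share the exact same location values, so an ID built from latitude and longitude is not unique: those districts always land in the same set (train or test), which introduces some sampling bias.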
A possible implementation using a hash:
import numpy as np
from zlib import crc32

def is_id_in_test_set(identifier, test_ratio):
    # In the test set if the ID's CRC32 falls in the lowest test_ratio share of the 32-bit range
    return crc32(np.int64(identifier)) < test_ratio * 2**32

def split_data_with_id_hash(data, test_ratio, id_column):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: is_id_in_test_set(id_, test_ratio))
    return data.loc[~in_test_set], data.loc[in_test_set]
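To see why this rule stays stable when data is appended, here is a self-contained sketch that uses only the standard library. It hashes the 8-byte little-endian form of each integer ID rather than going through `np.int64`, and the ID ranges are invented for the demonstration.

```python
from zlib import crc32

def in_test_set(identifier, test_ratio=0.2):
    """Decide test-set membership from the ID alone (no dependence on other rows)."""
    id_bytes = int(identifier).to_bytes(8, "little")
    return crc32(id_bytes) < test_ratio * 2**32

old_ids = list(range(1000))  # hypothetical original dataset
new_ids = list(range(1500))  # same rows plus 500 appended ones

old_test = {i for i in old_ids if in_test_set(i)}
new_test = {i for i in new_ids if in_test_set(i)}

# Every old row keeps its assignment after the dataset grows
old_rows_in_new_test = {i for i in new_test if i < 1000}
print(old_rows_in_new_test == old_test)  # True
```

Membership depends only on the ID, so adding rows cannot move an existing instance between the training and test sets.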
Creating Unique IDs for Dataset Splitting
- The housing dataset lacks a built-in unique identifier column.
- A simple solution is to use the row index as an ID by resetting the index:
housing_with_id = housing.reset_index()  # adds an `index` column
train_set, test_set = split_data_with_id_hash(housing_with_id, 0.2, "index")
- If using the row index as ID:
- Ensure new data is only appended (no deletion or reordering).
- If this isn’t feasible, create an ID from stable features, e.g., combine latitude and longitude:
housing_with_id["id"] = housing["longitude"] * 1000 + housing["latitude"]
train_set, test_set = split_data_with_id_hash(housing_with_id, 0.2, "id")
- Using stable features helps maintain consistent IDs over time.
For a plain random split with a fixed seed, Scikit-Learn provides `train_test_split`:
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)
Stratified Sampling and Income Categories
- In surveys, maintaining important population ratios (e.g., 48.9% males) in samples is crucial to avoid bias.
- Stratified sampling divides the population into homogeneous subgroups (strata) and samples the right number from each, ensuring the test set represents the overall population.
- Purely random sampling can produce skewed samples. For example, in a random survey of 1,000 people there is a ~10.7% chance that the share of female participants falls below 48.5% or above 53.5%, which would bias the results.
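The ~10.7% figure can be checked with the binomial distribution. This sketch assumes a sample of 1,000 people and a population that is 51.1% female (the complement of the 48.9% males mentioned above). The helper `binom_pmf` is written out by hand to avoid extra dependencies.

```python
import math

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n trials, computed in log space."""
    log_pmf = (
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + k * math.log(p) + (n - k) * math.log(1 - p)
    )
    return math.exp(log_pmf)

sample_size = 1000    # assumed survey size
ratio_female = 0.511  # assumed population share

# Fewer than 485 women (< 48.5%) or more than 535 women (> 53.5%)
p_too_few = sum(binom_pmf(k, sample_size, ratio_female) for k in range(0, 485))
p_too_many = sum(binom_pmf(k, sample_size, ratio_female) for k in range(536, sample_size + 1))
p_skewed = p_too_few + p_too_many

print(round(p_skewed, 3))
```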
🏦 Handling Continuous Variables: Income Categories
- Median income is a key predictor for housing prices, but since it’s continuous, it needs to be binned into categories for stratified sampling.
- Most median incomes cluster between $15,000 and $60,000 (scaled values 1.5 to 6), but some go higher.
- To avoid bias, strata should be:
  - Few enough that each one contains sufficient data
  - Large enough for meaningful representation
- Example code uses `pd.cut()` to create 5 income categories:
  - Category 1: 0 to 1.5 (less than $15,000)
  - Category 2: 1.5 to 3, etc.
This allows creating a stratified test set that accurately reflects income distribution in the population.
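A sketch of the binning step with `pd.cut()`. Only the first two bin edges (1.5 and 3) come from the text above. The remaining edges (4.5, 6, and infinity) and the six sample incomes are assumptions chosen to give five categories.

```python
import numpy as np
import pandas as pd

# Hypothetical scaled median incomes (tens of thousands of dollars)
housing = pd.DataFrame({"median_income": [0.5, 1.5, 2.0, 4.4, 7.0, 15.0]})

housing["income_cat"] = pd.cut(
    housing["median_income"],
    bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],  # right edges are inclusive by default
    labels=[1, 2, 3, 4, 5],
)

income_cats = housing["income_cat"].tolist()
print(income_cats)  # [1, 1, 2, 3, 5, 5]
```

The resulting `income_cat` column can then be passed as the stratification variable when splitting.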
❓ Should We Stratify by Feature?
Stratifying means splitting your dataset so that your train/test (or validation) sets reflect the distribution of important features in the overall data.
When to Stratify:
- If a feature strongly influences the target variable, stratifying on it helps ensure:
- The model sees all important subgroups during training.
- The evaluation on the test set is more reliable and representative.
- Common features to stratify on include:
- Categorical features (e.g., gender, region, class)
- Discretized continuous features (e.g., income categories, age bins)
Discretization means converting continuous values into distinct categories or bins.
When Not to Stratify:
- If the feature is not relevant or weakly correlated with the target, stratifying may add unnecessary complexity.
- If stratification leads to very small strata, it might cause unstable splits.
What Is Stratified Sampling?
Stratified sampling means splitting your dataset in such a way that the distribution of a specific variable (usually categorical) is preserved across the train and test sets.
In other words:
- You group data into strata based on some important feature, like income category or class label.
- Then you sample from each stratum proportionally.
📊 Why do this?
Because random sampling can break the natural distribution of important features — especially in small datasets.
With stratified sampling, your test set keeps the same proportions. For example, take 100 samples (60 low, 30 medium, 10 high) and an 80% train / 20% test split:
Train: 48 low, 24 medium, 8 high
Test: 12 low, 6 medium, 2 high
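These counts can be reproduced with Scikit-Learn’s `train_test_split` and its `stratify` argument. The 60/30/10 label list is a toy stand-in for a real dataset.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy dataset: 100 rows with a 60/30/10 split across three groups
labels = ["low"] * 60 + ["medium"] * 30 + ["high"] * 10
rows = list(range(len(labels)))

train_rows, test_rows, train_labels, test_labels = train_test_split(
    rows, labels, test_size=0.2, stratify=labels, random_state=42
)

train_counts = dict(Counter(train_labels))
test_counts = dict(Counter(test_labels))
print(train_counts)  # 48 low, 24 medium, 8 high (key order may vary)
print(test_counts)   # 12 low, 6 medium, 2 high (key order may vary)
```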
🔧 When to Use Stratified Sampling?
✅ Use it when:
- You have imbalanced classes (e.g., fraud vs. non-fraud, disease vs. healthy).
- You’re splitting based on categorical features (e.g., education level, income group).
- The target variable’s distribution matters.
🚫 Not useful when:
- The feature has no meaningful groupings,
- Or is continuous and evenly distributed.
Stratified in scikit-learn
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(data, data["income_cat"]):
    # split() yields positional indices, so select rows with .iloc rather than .loc
    strat_train_set = data.iloc[train_index]
    strat_test_set = data.iloc[test_index]

Here:
- `income_cat` is a categorical version of income (e.g., bucketed into 5 groups).
- The split ensures the same proportions in train/test.

To generate several stratified splits of the same dataset (here 10):
from sklearn.model_selection import StratifiedShuffleSplit

splitter = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
strat_splits = []
for train_index, test_index in splitter.split(housing, housing["income_cat"]):
    strat_train_set_n = housing.iloc[train_index]
    strat_test_set_n = housing.iloc[test_index]
    strat_splits.append([strat_train_set_n, strat_test_set_n])

For a single split, `train_test_split` with the `stratify` argument is shorter:
from sklearn.model_selection import train_test_split

strat_train_set, strat_test_set = train_test_split(
    housing, test_size=0.2, stratify=housing["income_cat"], random_state=42)
⚠️ Neglected but Critical: Test Set Generation
- Generating a proper test set is a crucial step in any machine learning project.
- It is often neglected, but getting this right helps avoid biases and ensures reliable evaluation.
- The concepts learned here will also be important later for cross-validation.
- After preparing the test set correctly, the next stage is exploring the data.
Plot options (pandas `DataFrame.plot`, built on Matplotlib)
Argument | Meaning |
---|---|
kind="scatter" | Tells pandas to make a scatter plot, not a line/bar plot. |
x="longitude" | Horizontal axis shows longitude (i.e., east-west location). |
y="latitude" | Vertical axis shows latitude (i.e., north-south location). |
grid=True | Turns on the grid behind the plot (for better readability). |
alpha=0.2 | Transparency (0 = invisible, 1 = opaque). Makes overlapping points visible. |
s=housing["population"] / 100 | Sets bubble size: more population = larger circle. Divided by 100 to scale it down. |
label="population" | Label for the plot (shows in legend — not always effective with scatter). |
c="median_house_value" | Sets color of each point based on house value. |
cmap="jet" | The colormap: “jet” means low values are blue, high values are red. |
colorbar=True | Adds a side bar to explain the color scale (house prices). |
legend=True | Tries to show a legend (usually not very useful here unless you have a categorical label). |
sharex=False | Don’t force axis sharing — only matters in subplot grids. |
figsize=(10, 7) | Width × height of the figure, in inches. |
numeric only
corr_matrix = housing.corr(numeric_only=True)
📈 Insights from Data Visualization and Correlation
1. Observing Scatter Plots and Data Patterns
- The scatter plot shows a strong positive correlation: as one variable increases, so does the other, with points fairly close to the trend line.
- A price cap at $500,000 is clearly visible as a horizontal line, reflecting a limit in the data.
- Several other horizontal lines appear at price points like $450,000, $350,000, and $280,000, indicating data quirks or artificial limits.
- To prevent the model from simply learning these quirks, consider removing or treating these districts separately.
2. Understanding the Correlation Coefficient
- The correlation coefficient measures linear relationships — how much one variable increases or decreases as the other does.
- It cannot detect nonlinear relationships, which might be important but invisible to this metric.
- Examples show datasets with zero correlation coefficient but clear nonlinear patterns.
- A correlation of ±1 means a perfect linear relationship, but it says nothing about the slope or units.
- Example: Height in inches vs. height in feet has a correlation coefficient of 1.
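Both points can be verified numerically with `np.corrcoef`. The five-point arrays below are made up: a parabola symmetric around zero for the nonlinear case, and a few heights for the unit-change case.

```python
import numpy as np

# A perfect but nonlinear relationship: y = x^2 on symmetric x values
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y_nonlinear = x ** 2
r_nonlinear = np.corrcoef(x, y_nonlinear)[0, 1]

# The same quantity in two units: inches vs. feet
heights_in = np.array([60.0, 64.0, 68.0, 72.0, 76.0])
heights_ft = heights_in / 12.0
r_units = np.corrcoef(heights_in, heights_ft)[0, 1]

print(abs(r_nonlinear) < 1e-12)  # True: no linear correlation despite y depending on x
print(abs(r_units - 1) < 1e-12)  # True: perfect correlation, regardless of the 1/12 slope
```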
3. Exploring Attribute Combinations
- Before feeding data into machine learning algorithms, try combining different attributes to discover useful features.
- Feature engineering can improve model performance by capturing more complex patterns.
📌 Tip: Always visualize your data and understand the limitations of simple statistics like correlation before modeling.
❓ Why Are There Data Quirks (Horizontal Lines) in the Plot?
- The horizontal lines at certain price values (like $500,000, $450,000, etc.) often happen because of data capping or rounding:
  - The dataset might have upper limits set on housing prices (e.g., any house over $500,000 is recorded as $500,000).
- Some prices might be rounded or grouped into buckets for privacy or simplification.
- These quirks are artifacts of data collection or preprocessing, not natural variations.
- The problem: models may learn these artificial patterns instead of true underlying relationships, hurting generalization.
Prepare the Data for Machine Learning Algorithms
Preparing Data for Machine Learning: Why Use Functions Instead of Manual Steps
When preparing datasets for machine learning, it’s best to encapsulate your preprocessing steps into functions rather than performing them manually. This approach has several advantages:
1. Reproducibility
- Functions allow you to reapply the same transformations to any dataset, such as when you receive updated or fresh data.
- Ensures consistent preprocessing across experiments and production runs.
2. Reusability
- Over time, you will build a library of transformation utilities (e.g., handling missing values, encoding categories, scaling features).
- These can be reused in future projects, reducing development time.
3. Production Integration
- The same preprocessing functions can be applied in live systems before passing new incoming data to your ML models.
- This ensures the model receives data in the exact format it was trained on.
4. Experimentation
- By modularizing transformations, you can easily swap or combine different preprocessing steps.
- Enables systematic experimentation to find the optimal data preparation pipeline.
Cleaning
Most machine learning algorithms cannot work with missing features, so these must be handled before training. Common approaches include:
1. Remove Rows with Missing Values
housing.dropna(subset=["total_bedrooms"], inplace=True)
2. Remove the Entire Attribute
housing = housing.drop("total_bedrooms", axis=1)  # drop() returns a new DataFrame, so assign it
3. Impute Missing Values
Replace missing entries with a constant, mean, median, or another computed value.
median = housing["total_bedrooms"].median()
housing["total_bedrooms"] = housing["total_bedrooms"].fillna(median)  # assign back instead of chained inplace
Tip: For production systems, prefer imputation over deletion to avoid losing data. Tools like `sklearn.impute.SimpleImputer` make this process reproducible.
Imputation with SimpleImputer
You decide to use option 3 (imputation) since it is the least destructive. Instead of manually filling missing values with Pandas, you can use Scikit-Learn’s `SimpleImputer`.
Advantages over Pandas
- Stores the computed statistic (e.g., median) for each feature, making it easy to reuse the same imputation values on:
- Training set
- Validation set
- Test set
- New, unseen data in production
- Integrates with ML pipelines, ensuring preprocessing is applied automatically during training and inference.
- Works with NumPy arrays and Pandas DataFrames, and handles multiple features at once.
Example
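import numpy as np
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="median")
# The median only exists for numeric columns, so exclude text columns such as ocean_proximity
housing_num = housing.select_dtypes(include=[np.number])
imputer.fit(housing_num)  # Learn medians from training data
housing_transformed = imputer.transform(housing_num)  # Apply to any dataset with the same columns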
Example: Reusing SimpleImputer Across Datasets
Below is a practical example showing how `SimpleImputer` stores the computed medians from the training set and applies them consistently to other datasets.
import pandas as pd
from sklearn.impute import SimpleImputer

# Example datasets
train_data = pd.DataFrame({
    "age": [25, 30, None, 40],
    "salary": [50000, None, 60000, 55000]
})
validation_data = pd.DataFrame({
    "age": [None, 28],
    "salary": [52000, None]
})

# Step 1: Fit on training data
imputer = SimpleImputer(strategy="median")
imputer.fit(train_data)

# The medians are stored internally
print("Stored medians:", imputer.statistics_)
# Output: [30. 55000.]

# Step 2: Transform datasets using the same medians
train_filled = imputer.transform(train_data)
validation_filled = imputer.transform(validation_data)
print("Transformed Training Data:\n", train_filled)
print("Transformed Validation Data:\n", validation_filled)
Key point:
- Even if the validation set has missing values in different places or different distributions, the imputer will still fill them using training-set medians, ensuring consistency between training and inference.
What median is used for filling missing values in test and validation sets?
It should always be the training set’s median (or mean, mode, etc., depending on the strategy).
Here’s why:
- Consistency: The model was trained assuming missing values are filled using the training-set statistic. If you use the validation or test set’s own median, you’re introducing different preprocessing logic, which changes the feature distribution.
- Avoiding Data Leakage: Computing medians from the validation or test set means you are indirectly using information from those sets during preprocessing, which contaminates your evaluation.
- Realistic Deployment: In production, you won’t have access to the “true” distribution of new data — you can only use statistics learned during training.
Summary:
- Training set: Fit the imputer → store the median(s).
- Validation/Test/New data: Transform using the stored training medians.
This is exactly why `SimpleImputer`’s `fit()` (on training) and `transform()` (on all datasets) pattern exists.
Selecting Numerical Attributes for Imputation
Since the median can only be computed on numerical attributes, create a copy of the dataset containing only numeric columns. This automatically excludes non-numeric attributes such as `ocean_proximity`:
housing_num = housing.select_dtypes(include=[np.number])
The `SimpleImputer` will compute the median of each numeric attribute and store the results in its `statistics_` attribute.
Even though only `total_bedrooms` had missing values in the training set, applying the imputer to all numeric attributes ensures that any missing values in future data (e.g., after deployment) will also be handled consistently.
Transforming Data with the Trained Imputer
Use the “trained” imputer to transform the training set by replacing missing values with the learned medians:
X = imputer.transform(housing_num)
Other available strategies:
- Mean (strategy="mean"): replaces missing values with the mean of each feature.
- Most Frequent (strategy="most_frequent"): replaces with the most common value in each feature (supports non-numerical data).
- Constant (strategy="constant", fill_value=...): replaces with a fixed specified value (supports non-numerical data).
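As a rough illustration of the two strategies that accept non-numerical data, the sketch below imputes a made-up column of color names. The data and the "unknown" fill value are assumptions chosen for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical categorical column with one missing entry
colors = np.array([["red"], ["blue"], [np.nan], ["blue"]], dtype=object)

# Fill the gap with the most common category
most_frequent = SimpleImputer(strategy="most_frequent").fit_transform(colors)

# Fill the gap with a fixed placeholder value
constant = SimpleImputer(strategy="constant", fill_value="unknown").fit_transform(colors)

print(most_frequent.ravel())
print(constant.ravel())
```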
Advanced Imputers in sklearn.impute
Beyond SimpleImputer, Scikit-Learn provides more powerful imputers designed to handle missing data more intelligently, especially for numerical features.
1. KNNImputer
- Replaces each missing value with the mean value of the k-nearest neighbors for that feature.
- The “distance” between samples is computed using all available features.
- Useful when missing values correlate with the values of similar samples.
Example Use Case:
Imputing missing sensor readings where similar sensors tend to behave alike.
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=5)
imputed_data = imputer.fit_transform(data)
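To make the neighbor averaging concrete, here is a small sketch on a made-up 4x3 matrix with n_neighbors=2. Each gap is filled with the mean of that feature over the two nearest rows that have a value for it.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical toy matrix with two missing entries
data = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

imputer = KNNImputer(n_neighbors=2)
imputed_data = imputer.fit_transform(data)
print(imputed_data)
```

Row 0's missing third feature becomes 4.0 (the mean of 3 and 5 from its two nearest rows), and row 2's missing first feature becomes 5.5 (the mean of 3 and 8).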
2. IterativeImputer
- Models each feature with missing values as a function of other features using a regression model.
- Iteratively updates the imputations by training on the newly imputed data multiple times.
- Often yields better imputations by capturing complex relationships between features.
Example Use Case:
Imputing missing clinical measurements where features have nonlinear dependencies.
from sklearn.experimental import enable_iterative_imputer # noqa
from sklearn.impute import IterativeImputer
imputer = IterativeImputer(max_iter=10, random_state=0)
imputed_data = imputer.fit_transform(data)
Scikit-Learn Design Principles
Scikit-Learn’s API follows clear, consistent design principles that make it intuitive and easy to use.
1. Consistency
All objects share a simple and uniform interface, categorized mainly as:
Estimators
- Objects that estimate parameters from data (e.g., SimpleImputer).
- Use the fit() method to learn from data.
- Accept hyperparameters as constructor arguments (e.g., strategy for SimpleImputer).
- In supervised learning, fit() accepts both features and labels.
Transformers
- Estimators that can transform data (e.g., SimpleImputer).
- Provide a transform() method that applies the learned transformation.
- Have a fit_transform() method that combines fit() and transform() for efficiency.
Predictors
- Estimators that can make predictions on new data (e.g., LinearRegression).
- Implement a predict() method that returns predictions.
- Include a score() method to evaluate prediction quality using test data and labels.
2. Inspection
- Hyperparameters are exposed as public instance variables (e.g., imputer.strategy).
- Learned parameters are exposed with a trailing underscore (e.g., imputer.statistics_).
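The three roles and the inspection convention can be seen side by side in a short sketch. The toy arrays are invented for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

# Hypothetical toy data
X = np.array([[1.0], [2.0], [np.nan], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

imputer = SimpleImputer(strategy="median")   # hyperparameter passed to the constructor
X_filled = imputer.fit_transform(X)          # estimator and transformer in one call

print(imputer.strategy)      # hyperparameter: plain public attribute
print(imputer.statistics_)   # learned parameter: trailing underscore

model = LinearRegression()
model.fit(X_filled, y)                  # supervised fit takes features and labels
predictions = model.predict(X_filled)   # predictor interface
r2 = model.score(X_filled, y)           # R^2 for regressors
```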
Scikit-Learn Design: Nonproliferation of Classes and Composition
1. Nonproliferation of Classes
-
Instead of creating many custom data structures or classes, Scikit-Learn represents datasets as standard NumPy arrays or SciPy sparse matrices.
-
This keeps the library lightweight, compatible with other Python tools, and easy to integrate.
-
Hyperparameters are simple Python primitives like strings, integers, or floats—not complex objects—making configuration straightforward and transparent.
2. Composition
- Scikit-Learn encourages building complex workflows by combining simple, reusable components.
- For example, the Pipeline class allows chaining multiple transformers (data preprocessing steps) followed by a final estimator (like a classifier or regressor).
- This modular approach makes it easy to construct, reuse, and maintain end-to-end machine learning pipelines.
Example:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
pipeline = Pipeline([
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler()),
('classifier', LogisticRegression())
])
Here, the pipeline applies imputation, scaling, and classification sequentially, combining reusable building blocks into one estimator.
This design philosophy promotes simplicity, interoperability, and ease of experimentation.
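A composed pipeline behaves like any other estimator. The sketch below rebuilds the same three-step pipeline, fits it on a tiny invented dataset, and predicts on a new row that contains a missing value.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two features, binary labels
X_train = np.array([[1.0, 10.0], [2.0, np.nan], [8.0, 90.0], [9.0, 100.0]])
y_train = np.array([0, 0, 1, 1])

pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression()),
])

pipeline.fit(X_train, y_train)         # fits every step in order on the training data

X_new = np.array([[np.nan, 95.0]])     # new row with a gap
prediction = pipeline.predict(X_new)   # imputes, scales, then classifies

# Each step stays inspectable by name
learned_medians = pipeline.named_steps["imputer"].statistics_
print(prediction, learned_medians)
```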
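By default, Scikit-Learn transformers return NumPy arrays, so the imputed array X has neither column names nor an index. To recover them, wrap X in a DataFrame:
housing_tr = pd.DataFrame(X, columns=housing_num.columns,
index=housing_num.index)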
Encoding Categorical Features
To convert categorical text data into numbers, you can use Scikit-Learn’s OrdinalEncoder:
from sklearn.preprocessing import OrdinalEncoder
ordinal_encoder = OrdinalEncoder()
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)
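For a self-contained view of what the encoder produces, the sketch below uses a few made-up category values. The integer codes follow the sorted order stored in categories_.

```python
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical categorical column (one feature, four samples)
housing_cat = [["NEAR BAY"], ["INLAND"], ["ISLAND"], ["INLAND"]]

ordinal_encoder = OrdinalEncoder()
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)

print(ordinal_encoder.categories_)   # sorted categories learned during fit
print(housing_cat_encoded.ravel())   # each value replaced by its category index
```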
One-Hot Encoding: What and Why
One-Hot Encoding converts categorical data (discrete labels like colors or cities) into a numerical format that machine learning algorithms can use.
Why is it needed?
Most ML models cannot work directly with text labels such as "Red", "Blue", or "Green". They require numeric input.
Simply assigning integers (e.g., Red = 1, Blue = 2, Green = 3) can mislead models into interpreting an ordinal relationship or distance between categories, which usually does not exist.
One-hot encoding avoids this by creating separate binary features for each category, preventing unintended order assumptions.
How One-Hot Encoding Works
One-Hot Encoding converts each category into a separate binary column indicating presence (1) or absence (0).
Example
Suppose we have a categorical feature Color with 3 possible values:
Color |
---|
Red |
Blue |
Green |
Blue |
After applying One-Hot Encoding, this becomes:
Color_Red | Color_Blue | Color_Green |
---|---|---|
1 | 0 | 0 |
0 | 1 | 0 |
0 | 0 | 1 |
0 | 1 | 0 |
-
1 indicates that the row belongs to that category.
-
0 indicates it does not.
Code Example with Scikit-Learn
from sklearn.preprocessing import OneHotEncoder
import numpy as np
data = np.array([["Red"], ["Blue"], ["Green"], ["Blue"]])
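encoder = OneHotEncoder(sparse_output=False)  # use sparse=False on Scikit-Learn < 1.2
encoded_data = encoder.fit_transform(data)
print(encoded_data)
Output:
[[0. 0. 1.]
[1. 0. 0.]
[0. 1. 0.]
[1. 0. 0.]]
The columns follow the alphabetically sorted categories (Blue, Green, Red), so their order differs from the illustrative table above.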
One-Hot Encoding is essential to avoid false ordinal relationships in categorical features.
Pros
- No unintended order between categories.
- Works well for small-to-medium numbers of categories.
Cons
- Creates many columns if there are lots of categories (“curse of dimensionality”).
- Doesn’t capture relationships between categories.
Tip: In libraries like Pandas, you can do it easily:
import pandas as pd
df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Blue']})
encoded = pd.get_dummies(df, columns=['Color'])
print(encoded)
Sparse Matrix
A sparse matrix is a matrix in which most elements are zero. Storing all elements explicitly (including zeros) in large matrices can waste memory and computation time.
Sparse matrices store only the nonzero elements and their positions, saving space and speeding up operations that ignore zeros.
SciPy Sparse Matrix
The SciPy library provides efficient implementations of sparse matrices in various formats, such as:
-
CSR (Compressed Sparse Row): Efficient for row slicing and arithmetic operations.
-
CSC (Compressed Sparse Column): Efficient for column slicing.
-
COO (Coordinate list): Easy to construct, stores row and column coordinates of nonzero entries.
These sparse matrix types store only nonzero values and their indices, enabling scalable computations on large, sparse datasets (common in text data, recommendation systems, graphs, etc.).
Example
from scipy.sparse import csr_matrix
import numpy as np
dense = np.array([
[0, 0, 1],
[1, 0, 0],
[0, 0, 0]
])
sparse = csr_matrix(dense)
print(sparse)
Output:
(0, 2) 1
(1, 0) 1
Here, only nonzero elements and their coordinates are stored.
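To see what CSR actually keeps in memory, the sketch below inspects the three internal arrays of the same matrix: the nonzero values, their column indices, and the row pointers.

```python
import numpy as np
from scipy.sparse import csr_matrix

dense = np.array([
    [0, 0, 1],
    [1, 0, 0],
    [0, 0, 0],
])
sparse = csr_matrix(dense)

print(sparse.data)      # nonzero values, row by row
print(sparse.indices)   # column index of each stored value
print(sparse.indptr)    # row i's values live in data[indptr[i]:indptr[i + 1]]
print(sparse.nnz)       # number of stored elements
```

The empty third row costs nothing beyond a repeated entry in indptr.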
Sparse Matrix and One-Hot Encoding in Scikit-Learn
What is a Sparse Matrix?
A sparse matrix efficiently stores matrices with mostly zeros by keeping only the nonzero values and their positions in memory. This reduces memory usage and speeds up computations when dealing with large, mostly empty matrices.
One-Hot Encoding and Sparse Matrices
When applying one-hot encoding to a categorical feature with many categories, the resulting matrix is very large and mostly zeros (only one ‘1’ per row). Scikit-Learn’s OneHotEncoder returns a sparse matrix by default (specifically a CSR, or Compressed Sparse Row, matrix), which is ideal in this situation.
Example:
from sklearn.preprocessing import OneHotEncoder
cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing_cat) # Returns sparse CSR matrix
If you want a dense NumPy array instead, use:
cat_encoder = OneHotEncoder(sparse_output=False) # For Scikit-Learn 1.2+
housing_cat_1hot_dense = cat_encoder.fit_transform(housing_cat)
Why Using Sparse Matrix Does Not Harm Training
- Many Scikit-Learn estimators (linear models and SVMs, for example) accept sparse input directly, so they work efficiently with sparse matrices without converting them to dense form.
- This saves memory and computation time, especially for high-cardinality categorical features.
- Estimators that do not support sparse input generally raise an error rather than converting silently; in that case, convert the matrix to a dense array yourself.
Converting Sparse to Dense
If needed, convert a sparse matrix to a dense NumPy array:
dense_array = housing_cat_1hot.toarray()
Using sparse matrices with one-hot encoding is a best practice for memory efficiency and performance.
One-Hot Encoding: pd.get_dummies() vs OneHotEncoder
Example with Pandas get_dummies()
import pandas as pd
df_test = pd.DataFrame({"ocean_proximity": ["INLAND", "NEAR BAY"]})
pd_dummies = pd.get_dummies(df_test)
print(pd_dummies)
Output:
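ocean_proximity_INLAND | ocean_proximity_NEAR BAY |
---|---|
1 | 0 |
0 | 1 |
(pandas 2.0 and later print True/False instead of 1/0 unless you pass dtype=int.)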
Example with Scikit-Learn OneHotEncoder
from sklearn.preprocessing import OneHotEncoder
cat_encoder = OneHotEncoder()
cat_encoder.fit([["INLAND"], ["NEAR BAY"], ["<2H OCEAN"]]) # Training on 3 categories
encoded = cat_encoder.transform([["INLAND"], ["NEAR BAY"]]).toarray()
print(encoded)
print(cat_encoder.categories_)
Output:
[[0. 1. 0.]
[0. 0. 1.]]
[array(['<2H OCEAN', 'INLAND', 'NEAR BAY'], dtype=object)]
What Does “Remembering” Mean?
- OneHotEncoder remembers the exact categories it was trained on (stored in its categories_ attribute).
- When you call transform() on new data, it produces a matrix with one column per learned category, in the same order, ensuring the feature space is consistent.
- If the new data contains a category it was not trained on, it raises an error by default. The handle_unknown parameter changes this behavior; for example, handle_unknown="ignore" encodes an unknown category as all zeros.
Why Is This Important?
- In production, your model expects features with the exact same structure as during training (same columns, same order).
- pd.get_dummies() generates columns dynamically based on the input data, which means:
  - It can create different columns for different datasets (e.g., training vs. test).
  - Unknown categories create new columns, breaking the feature alignment expected by your model.
Summary Table
Feature | pd.get_dummies() | OneHotEncoder |
---|---|---|
Learns categories once | No — creates columns dynamically per input | Yes — stores categories after fit() |
Handles unknown categories | Creates new columns | Can raise error or ignore (handle_unknown param) |
Consistent output shape | No | Yes |
Integration in pipeline | Limited | Fully supported |
How OneHotEncoder Remembers
When you call:
cat_encoder.fit(data)
It scans the data, records the unique categories for each feature, and saves them as the attribute categories_. Later, when you transform new data, it uses this stored info to encode each category in the right column, preserving feature order and consistency.
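The contrast between the two approaches can be sketched with a new batch that contains an unseen category. The "ISLAND" value and the tiny DataFrames are assumptions for illustration; handle_unknown="ignore" is one of the documented options of OneHotEncoder.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"ocean_proximity": ["INLAND", "NEAR BAY", "<2H OCEAN"]})
new = pd.DataFrame({"ocean_proximity": ["INLAND", "ISLAND"]})   # "ISLAND" is unseen

# get_dummies derives its columns from whatever data it is given
train_columns = pd.get_dummies(train).columns.tolist()
new_columns = pd.get_dummies(new).columns.tolist()
print(train_columns)
print(new_columns)     # different columns: misaligned with the training features

# OneHotEncoder reuses the categories learned during fit
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)
encoded_new = encoder.transform(new).toarray()
print(encoded_new)     # unknown category becomes an all-zero row
```

The encoder always emits three columns here, one per category seen during fit, regardless of what the new batch contains.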
Challenges with High-Cardinality Categorical Features
Large Number of Categories
-
When a categorical feature has many categories (e.g., country codes, professions, species), one-hot encoding creates a very large number of input features.
-
This can slow down training, increase memory usage, and potentially degrade model performance due to sparsity and overfitting risks.
Alternatives to One-Hot Encoding for High-Cardinality Features
- Feature Engineering: Replace categories with meaningful numerical features related to the category, such as:
  - For ocean_proximity, use distance to the ocean.
  - For country codes, use population, GDP per capita, or other numeric statistics.
- Category Encoders Package: Use specialized encoding methods (available in the category_encoders package) such as:
  - Target encoding
  - Hash encoding
  - Frequency encoding
- Embeddings in Neural Networks: Replace categories with learnable, low-dimensional vectors (embeddings) that capture semantic relationships. This is a form of representation learning, commonly used in deep learning (covered in Chapters 13 and 17).
Handling Column Names in Scikit-Learn with DataFrames
- When fitting a Scikit-Learn estimator with a Pandas DataFrame, the estimator stores the input column names in the feature_names_in_ attribute.
- Later, when you call methods like transform() or predict() on a DataFrame, Scikit-Learn checks that the columns match the original columns it was trained on, helping prevent errors.
- Transformers provide a method get_feature_names_out() that returns the names of output features, useful for creating DataFrames from transformed data.
Example:
# After fitting
print(estimator.feature_names_in_)
# Get output feature names
output_names = transformer.get_feature_names_out()
This design helps maintain data consistency and traceability through complex preprocessing pipelines.
Why Feature Scaling Matters
Many machine learning algorithms are sensitive to the scale of input features. When numerical features have vastly different ranges, like total_rooms (6 to ~39,000) versus median_income (0 to 15), models can behave poorly because:
- Models tend to give more weight to features with larger numeric ranges, overshadowing smaller-scaled features regardless of their actual importance.
- Algorithms that rely on distance calculations (e.g., k-NN, SVM, clustering) or gradient-based optimization (e.g., logistic regression, neural networks) assume features are on comparable scales for effective learning.
Common Feature Scaling Methods
- Min-Max Scaling: rescales features to a fixed range, usually [0, 1]:
\[ x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \]
- Standardization (Z-score Normalization): centers features at zero mean and scales them to unit variance:
\[ z = \frac{x - \mu}{\sigma} \]
Both approaches help models learn effectively by ensuring features contribute proportionally.
Important Note on Feature Scaling
When applying feature scaling, always fit the scaler only on the training data: do not call fit() or fit_transform() on validation, test, or new data.
- The scaler learns parameters (e.g., min, max, mean, standard deviation) from the training set.
- Then, use transform() to apply the same scaling parameters to all other datasets (validation, test, production).
This ensures the model sees consistent, comparable feature values.
Outliers and Scaling Range
- Since the scaler uses statistics from the training data, new data containing outliers may be scaled outside the expected range (e.g., less than 0 or greater than 1 in min-max scaling).
- To prevent this, set MinMaxScaler’s clip=True parameter, which clips transformed values to the specified feature range.
This practice maintains data integrity and prevents data leakage while ensuring robust, reliable model performance.
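The effect of clipping is easy to see on a toy example where new values fall outside the training range. The numbers are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[0.0], [5.0], [10.0]])   # training range is [0, 10]
X_new = np.array([[-5.0], [20.0]])           # new data outside that range

plain = MinMaxScaler().fit(X_train)
clipped = MinMaxScaler(clip=True).fit(X_train)

plain_out = plain.transform(X_new).ravel()       # escapes the [0, 1] range
clipped_out = clipped.transform(X_new).ravel()   # held inside [0, 1]
print(plain_out)
print(clipped_out)
```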
Why Fit Scalers Only on the Training Set?
You should never call fit() or fit_transform() on validation, test, or new data because:
- fit() calculates scaling parameters (e.g., mean, standard deviation, min, max) based on the data it sees.
- If you fit on validation or test data, you are using information from these sets to influence your scaling, which leads to data leakage: an unrealistic advantage that causes overly optimistic evaluation results.
- The model should only learn from the training data. Applying parameters learned on training data to other sets simulates real-world scenarios where future data is unknown.
Difference Between fit() and fit_transform()
- fit(): Computes and stores the necessary parameters for transformation (e.g., mean and std for standardization) from the input data. It does not change the data itself.
- transform(): Applies the learned parameters from fit() to scale or transform the data accordingly.
- fit_transform(): A convenience method that combines fit() and transform() in one step. Use it only on the training set, to learn the parameters and apply the transformation immediately.
Summary of Proper Scaling Workflow
- Call fit() or fit_transform() on the training set only.
- Call transform() on validation, test, and any new data using the already learned parameters.
This ensures your model evaluation is fair, unbiased, and reliable.
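The workflow can be sketched with StandardScaler on toy arrays (values invented for illustration). The test value is scaled with the mean and standard deviation learned from the training set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # learn parameters and transform training data
X_test_scaled = scaler.transform(X_test)         # transform only: reuse training parameters

print(scaler.mean_, scaler.scale_)   # parameters learned from the training set
print(X_test_scaled)                 # (4 - training mean) / training std
```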
Min-Max Scaling (Normalization)
Min-max scaling is a simple technique to rescale numerical features so that their values lie within a specified range, typically 0 to 1.
How It Works
For each attribute \( x \), the values are transformed as follows:
\[ x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \]
where:
- \( x_{\min} \) = minimum value of the attribute in the training data
- \( x_{\max} \) = maximum value of the attribute in the training data
This shifts and rescales the data so the smallest value becomes 0 and the largest becomes 1.
Customizing the Range
Scikit-Learn’s MinMaxScaler allows changing the target range with the feature_range parameter. For example, using (-1, 1) can be better suited for some neural networks, which perform well with inputs centered around zero.
Example Usage
from sklearn.preprocessing import MinMaxScaler
min_max_scaler = MinMaxScaler(feature_range=(-1, 1))
housing_num_min_max_scaled = min_max_scaler.fit_transform(housing_num)
This code fits the scaler on the training data housing_num and transforms it to the range [-1, 1].
Min-max scaling is easy to apply and useful when you want all features to have the same scale, preserving the shape of the original distribution.
Standardization (Z-score Normalization)
Standardization transforms features by centering and scaling them so they have:
- Zero mean (average value = 0)
- Unit standard deviation (spread = 1)
How It Works
For each attribute \( x \), the standardized value \( z \) is calculated as:
\[ z = \frac{x - \mu}{\sigma} \]
where:
- \( \mu \) = mean of the attribute in the training data
- \( \sigma \) = standard deviation of the attribute in the training data
Unlike min-max scaling, standardization does not constrain values to a fixed range; values can be less than zero or greater than one.
Why Standardization Is More Robust to Outliers
- Outliers (extremely large or small values) strongly affect min-max scaling, compressing the majority of data into a small range.
- Standardization centers data and scales according to variability, so an outlier (e.g., median income mistakenly recorded as 100 instead of 0–15) has less impact on the overall scaling.
Example Usage
from sklearn.preprocessing import StandardScaler
std_scaler = StandardScaler()
housing_num_std_scaled = std_scaler.fit_transform(housing_num)
This fits the scaler on the training data and transforms it to have zero mean and unit variance.
Standardization is often preferred when features have different units or when outliers are present.
Scaling Sparse Matrices without Losing Sparsity
When scaling sparse matrices, subtracting the mean (centering) is problematic because:
- Centering requires subtracting the mean from every element, which can turn many zeros into nonzero values.
- This breaks the sparse structure by introducing many new nonzero entries, causing a large increase in memory usage and computation time.
Solution: Use StandardScaler(with_mean=False)
- Setting with_mean=False disables mean subtraction and only scales data by dividing by the standard deviation.
- This preserves the sparse structure since zero entries remain zero (no new nonzeros are created).
- This allows efficient scaling of sparse matrices without converting them to dense arrays.
Example
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler(with_mean=False)
scaled_sparse = scaler.fit_transform(sparse_matrix)
In summary, skipping mean centering is a practical compromise to maintain sparsity and efficiency when scaling sparse data.
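A self-contained check of this behavior, on a small made-up sparse matrix, might look like the sketch below. It also shows that the default, centering scaler refuses sparse input.

```python
import numpy as np
from scipy.sparse import csr_matrix, issparse
from sklearn.preprocessing import StandardScaler

# Hypothetical sparse matrix with three nonzero entries
sparse_matrix = csr_matrix(np.array([
    [0.0, 2.0],
    [0.0, 0.0],
    [4.0, 0.0],
    [0.0, 6.0],
]))

scaler = StandardScaler(with_mean=False)
scaled_sparse = scaler.fit_transform(sparse_matrix)
print(issparse(scaled_sparse), scaled_sparse.nnz)   # still sparse, same nonzero count

# With centering enabled (the default), sparse input is rejected
try:
    StandardScaler().fit(sparse_matrix)
    centering_rejected = False
except ValueError:
    centering_rejected = True
print(centering_rejected)
```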
Handling Heavy-Tailed Feature Distributions
When a feature’s distribution has a heavy tail—meaning values far from the mean are relatively common—both min-max scaling and standardization tend to compress most values into a small range.
This compression can negatively impact machine learning model performance, as such models generally prefer features with more balanced distributions.
How to Address Heavy Tails
Before scaling, transform the feature to:
- Shrink the heavy tail
- Make the distribution more symmetrical
Common transformations include:
- Square root or fractional power (between 0 and 1): Useful for positive features with a moderate heavy tail to the right.
- Logarithm: Helps with features exhibiting very long, heavy tails, such as those following a power law distribution.
Example: Population Feature
- Population often follows a power law—districts with 10,000 inhabitants are about 10 times less frequent than those with 1,000, not exponentially less.
- Applying the logarithm transforms the distribution closer to a Gaussian (bell-shaped), improving feature behavior for modeling.
Transforming heavy-tailed features before scaling makes machine learning models more effective.
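One way to apply a log transform inside the Scikit-Learn workflow is FunctionTransformer. The sketch below uses invented population values that grow by a factor of 10; after the logarithm they are evenly spaced.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Hypothetical heavy-tailed feature: each value is 10x the previous one
population = np.array([[100.0], [1_000.0], [10_000.0], [100_000.0]])

log_transformer = FunctionTransformer(np.log, inverse_func=np.exp)
log_population = log_transformer.fit_transform(population)

print(log_population.ravel())            # evenly spaced after the log
print(np.diff(log_population.ravel()))   # constant step of ln(10)

# inverse_func lets you map transformed values back to the original scale
recovered = log_transformer.inverse_transform(log_population)
```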
What Is a Gaussian (Bell-Shaped) Distribution?
A Gaussian distribution (also called a normal distribution) is a common continuous probability distribution characterized by:
-
A symmetric, bell-shaped curve centered around the mean.
-
Most values cluster around the mean, with fewer values appearing as you move farther away (tails).
-
The shape is fully defined by its mean (center) and standard deviation (spread).
This distribution often appears in natural phenomena and is preferred in many machine learning models because of its well-behaved statistical properties.
Bucketizing (Discretization) Explained
Bucketizing is a way to handle heavy-tailed features by dividing the feature’s range into buckets (or bins) and replacing each original value with the index of the bucket it falls into.
This turns a continuous numerical feature into a discrete categorical-like feature.
How It Works
-
Divide the data into roughly equal-sized buckets, typically by using percentiles as the bucket boundaries.
-
Replace each value with the bucket number (e.g., 0, 1, 2, …).
-
This creates a feature with an almost uniform distribution since each bucket contains roughly the same number of data points.
Example
Suppose the feature is income, with values ranging widely:
Original Income | Bucket Index (Percentiles) |
---|---|
15,000 | 0 (0–20th percentile) |
35,000 | 2 (40–60th percentile) |
75,000 | 4 (80–100th percentile) |
Here, the continuous income values are replaced by discrete bucket indices.
Benefits
-
Removes heavy-tailed behavior by smoothing the distribution.
-
Often no further scaling is needed because the bucket indices are roughly uniformly distributed.
-
Optionally, divide bucket indices by the number of buckets to normalize values between 0 and 1.
Use Case
You might create an income_cat feature for stratified sampling or directly use bucketized features in models that handle categorical or ordinal data well.
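Percentile-based bucketizing can be sketched with pandas qcut. The income values below are invented; with five quantile buckets over ten values, each bucket receives two samples.

```python
import pandas as pd

# Hypothetical heavy-tailed income values
income = pd.Series([15_000, 22_000, 35_000, 41_000, 58_000,
                    75_000, 90_000, 120_000, 250_000, 1_000_000])

n_buckets = 5
bucket = pd.qcut(income, q=n_buckets, labels=False)   # bucket index per value
normalized = bucket / n_buckets                        # optional: rescale indices into [0, 1)

print(bucket.tolist())
print(bucket.value_counts().sort_index().tolist())     # roughly uniform bucket sizes
```

Even the extreme value of 1,000,000 simply lands in the top bucket, so the heavy tail no longer dominates the feature.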
Handling Multimodal Distributions with Bucketizing
When a feature has a multimodal distribution (two or more distinct peaks, or modes), such as the housing_median_age feature, bucketizing can be useful, but with a different approach:
-
Instead of treating bucket IDs as numerical values, treat them as categorical values.
-
Encode these bucket indices using techniques like OneHotEncoding (avoid too many buckets to keep it manageable).
Why This Helps
-
It allows the model to learn separate patterns or rules for different value ranges.
-
For example, houses around 35 years old might have a specific architectural style that affects price differently, which the model can capture more easily using categorical buckets rather than a continuous numeric age.
Summary
-
Bucketize multimodal features into categories.
-
OneHotEncode the buckets for better model flexibility.
-
Use fewer buckets to avoid overcomplicating the feature space.
Example: Bucketizing and One-Hot Encoding a Multimodal Feature
Suppose you have the housing_median_age feature with a multimodal distribution. You can:
- Bucketize it into age groups (categories):
Age (years) | Bucket ID |
---|---|
5 | 0 |
12 | 1 |
35 | 2 |
50 | 3 |
- Encode the buckets as categories using OneHotEncoder:
Age Bucket_0 | Age Bucket_1 | Age Bucket_2 | Age Bucket_3 |
---|---|---|---|
1 | 0 | 0 | 0 |
0 | 1 | 0 | 0 |
0 | 0 | 1 | 0 |
0 | 0 | 0 | 1 |
Code Example
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer, OneHotEncoder
# Sample data
housing_age = np.array([[5], [12], [35], [50]])
# Bucketize into 4 quantile-based bins so that each sample lands in its own bucket,
# matching the table above (strategy='uniform' would put 5 and 12 in the same bin)
kbins = KBinsDiscretizer(n_bins=4, encode='ordinal', strategy='quantile')
age_buckets = kbins.fit_transform(housing_age)
# One-Hot Encode the bucket indices (treat them as categories)
ohe = OneHotEncoder(sparse_output=False)  # use sparse=False on Scikit-Learn < 1.2
age_buckets_ohe = ohe.fit_transform(age_buckets)
print(age_buckets_ohe)
This approach lets your model learn different effects for houses in each age group, handling multimodal distributions effectively.
Support Vector Machine (SVM)
Support Vector Machines are powerful supervised learning models commonly used for classification and regression tasks. They aim to find the optimal separating hyperplane between classes by maximizing the margin between support vectors (critical data points).
Bias-Variance Tradeoff
-
Bias: The error from erroneous assumptions in the learning algorithm. A model with high bias oversimplifies the problem and cannot capture the underlying patterns (underfitting).
-
Variance: The error from sensitivity to small fluctuations in the training set. A model with high variance fits noise in training data too closely (overfitting).
-
The goal is to balance bias and variance to achieve good generalization on unseen data.
Methods to control the tradeoff:
-
Regularization: Adding a penalty term to the loss function to discourage overly complex models. In SVMs, the C parameter controls the tradeoff between margin width and classification error.
-
Boosting: Combining multiple weak learners to form a strong learner, reducing bias and variance.
-
Bagging: Training multiple models on different bootstrap samples and aggregating results to reduce variance.
For more explanation, see:
StatQuest - Bias Variance Tradeoff
Cross Validation
Cross-validation is a technique to evaluate machine learning models by training and testing them on different subsets of the data. It provides an estimate of how well a model will perform on unseen data.
Common types of cross-validation:
-
Four-Fold Cross Validation: The dataset is split into 4 equal parts; each fold is used once as a test set while the remaining 3 folds form the training set.
-
Ten-Fold Cross Validation: Similar to four-fold, but with 10 parts, providing a more stable estimate.
-
Leave-One-Out Cross Validation (LOOCV): Each instance in the dataset is used once as a test set while the rest are training data. It is computationally expensive but can be useful for small datasets.
Cross-validation helps in:
-
Comparing different models or hyperparameter settings.
-
Detecting overfitting or underfitting issues.
-
Selecting the best model for deployment.
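The three variants above map directly onto the cv argument of cross_val_score. The sketch below uses a synthetic regression dataset and a plain linear model as stand-ins; both are assumptions chosen only to keep the example runnable.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Synthetic stand-in dataset
X, y = make_regression(n_samples=40, n_features=3, noise=10.0, random_state=0)
model = LinearRegression()

four_fold = cross_val_score(model, X, y, cv=4)    # 4 scores, one per fold
ten_fold = cross_val_score(model, X, y, cv=10)    # 10 scores

# LOOCV tests on a single sample per split, so R^2 is undefined there;
# use an error-based metric instead
loo = cross_val_score(model, X, y, cv=LeaveOneOut(),
                      scoring="neg_mean_squared_error")

print(four_fold.mean(), ten_fold.mean(), -loo.mean())
```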
Support Vector Machine (SVM)
Margin and Classification
-
The margin is the shortest distance between an observation (data point) and the decision boundary (threshold/hyperplane).
-
Maximal Margin Classifier: Finds the hyperplane that maximizes the margin, perfectly separating the classes.
- Problem: Very sensitive to outliers; a single outlier can drastically change the margin and hyperplane.
Soft Margin and Bias-Variance Tradeoff
-
To handle noisy data and outliers, soft margin classifiers allow some misclassifications by introducing slack variables.
-
The margin becomes a soft margin, balancing the width of the margin and the number of misclassifications.
-
This introduces a tradeoff between bias (allowing errors) and variance (overfitting to outliers).
-
The points that lie on or inside the margin (including misclassified points) are called support vectors — they define the position of the decision boundary.
Support Vector Classifier and SVM
-
The Support Vector Classifier (SVC) is the soft margin classifier.
-
The Support Vector Machine extends this concept to allow nonlinear decision boundaries using kernel functions.
Kernel Functions
Kernels implicitly transform the input data into a higher-dimensional space to make it linearly separable without computing coordinates in that space explicitly. This is known as the Kernel Trick.
-
Polynomial Kernel: Maps data into polynomial feature space.
-
Radial Basis Function (RBF) Kernel: Measures similarity using distances; often seen as a weighted nearest neighbor approach.
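To illustrate why kernels matter, the sketch below compares a linear and an RBF kernel on concentric circles, a synthetic dataset that no straight line can separate. The dataset and parameter values are assumptions for illustration.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: not linearly separable in the original 2D space
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_acc = SVC(kernel="linear", C=1.0).fit(X, y).score(X, y)
rbf_acc = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y).score(X, y)

print(f"linear kernel training accuracy: {linear_acc:.2f}")
print(f"RBF kernel training accuracy:    {rbf_acc:.2f}")
```

The RBF kernel separates the rings because it compares points by distance, while the linear kernel is limited to a straight boundary.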
Summary
Concept | Description |
---|---|
Margin | Distance from data point to decision boundary |
Maximal Margin Classifier | Finds widest margin, sensitive to outliers |
Soft Margin Classifier | Allows misclassifications to balance bias-variance |
Support Vectors | Points that define the margin |
Kernel Trick | Enables nonlinear separation via implicit mapping |
Common Kernels | Polynomial, Radial Basis Function (RBF) |
For an in-depth explanation, watch:
StatQuest: Support Vector Machine
Transforming Multimodal Distributions Using Radial Basis Functions (RBF)
Concept
Instead of simply bucketizing a multimodal feature, you can add features representing similarity to each mode (peak) in the distribution. This captures how close a given value is to each important mode, allowing the model to learn smooth, flexible relationships.
What Is a Radial Basis Function (RBF)?
- An RBF is a function whose output depends only on the distance between the input and a fixed center point (mode).
- The most common RBF is the Gaussian RBF, defined as:
\[ \phi(x, c) = \exp\left(-\gamma \, (x - c)^2\right) \]
where:
- \( x \) = input feature value (e.g., housing age)
- \( c \) = center point (mode in distribution, e.g., 35 years)
- \( \gamma \) = a hyperparameter controlling how fast similarity decays with distance
How Gaussian RBF Works
- When \( x = c \), similarity is 1 (maximum).
- As \( x \) moves away from \( c \), similarity approaches 0 exponentially fast.
- \( \gamma \) controls the width of the “bump” around \( c \):
  - Large \( \gamma \): similarity drops quickly (narrow peak).
  - Small \( \gamma \): similarity drops slowly (wide peak).
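The role of gamma can be checked numerically. The sketch assumes the standard Gaussian form exp(-gamma * (x - c)^2); the helper name gaussian_rbf is hypothetical.

```python
import numpy as np

def gaussian_rbf(x, c, gamma):
    """Gaussian RBF similarity between value x and center c."""
    return np.exp(-gamma * (x - c) ** 2)

at_center = gaussian_rbf(35.0, 35.0, gamma=0.1)   # distance 0: similarity is 1
narrow = gaussian_rbf(40.0, 35.0, gamma=1.0)      # large gamma: similarity collapses quickly
wide = gaussian_rbf(40.0, 35.0, gamma=0.01)       # small gamma: similarity decays slowly

print(at_center, narrow, wide)
```

Five years away from the center, the similarity is still about 0.78 with gamma = 0.01 but practically zero with gamma = 1.0.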
Why Use RBF Features?
- Capture soft membership in each mode instead of hard bucket assignment.
- Allow the model to smoothly interpolate effects near each mode.
- Handle multimodal features flexibly, improving learning of complex patterns.
Example: Housing Median Age and Modes
Assume housing ages cluster around modes at 10, 35, and 60 years.
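For each sample with age \( x \), create one feature per mode:
\[ \text{rbf}_{10}(x) = \exp\left(-\gamma (x - 10)^2\right), \quad \text{rbf}_{35}(x) = \exp\left(-\gamma (x - 35)^2\right), \quad \text{rbf}_{60}(x) = \exp\left(-\gamma (x - 60)^2\right) \]
Each feature measures similarity to one mode.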
Code Example Using Scikit-Learn
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
# Sample housing ages
housing_age = np.array([[5], [12], [35], [50]])
# Modes to represent
modes = np.array([[10], [35], [60]])
# Compute RBF similarity features between housing_age and modes
# gamma controls the decay rate
gamma = 0.1
rbf_features = rbf_kernel(housing_age, modes, gamma=gamma)
print(rbf_features)
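Output (rounded to two decimals):
x\mode | 10 | 35 | 60 |
---|---|---|---|
5 | 0.08 | 0.00 | 0.00 |
12 | 0.67 | 0.00 | 0.00 |
35 | 0.00 | 1.00 | 0.00 |
50 | 0.00 | 0.00 | 0.00 |
Interpretation
- Age 12 is close to mode 10 (similarity ≈ 0.67) and far from the others.
- Age 35 matches mode 35 exactly (similarity = 1).
- Age 50 lies 15 and 10 years from modes 35 and 60, so with gamma = 0.1 its similarity to both is effectively 0. A smaller gamma such as 0.01 widens the bumps and yields moderate similarities (about 0.11 and 0.37).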
The model can now learn separate, smooth effects associated with each mode.
Summary
- RBF features turn a single multimodal feature into multiple soft membership features.
- The Gaussian RBF similarity decreases smoothly with distance from mode centers.
- Hyperparameter γ controls how localized these features are.
- This approach enables flexible, smooth modeling of complex feature distributions.
What Happens After Creating RBF Features?
Using the RBF Similarity Values
- The output of the RBF transformation is one new feature per mode (e.g., three features if you use three modes).
- Each of these features measures how similar the original value is to that mode.
How to Use These Features
- Add the RBF features as new columns to your original dataset.
- You can keep the original feature or replace it with these new similarity features, depending on your modeling choice.
Changes to the DataFrame
- Suppose your original DataFrame had a column housing_median_age.
- After applying the RBF transformation with 3 modes (e.g., 10, 35, 60) and gamma = 0.1, you will add three new columns, for example:
housing_median_age | rbf_age_10 | rbf_age_35 | rbf_age_60 |
---|---|---|---|
5 | 0.08 | 0.00 | 0.00 |
12 | 0.67 | 0.00 | 0.00 |
35 | 0.00 | 1.00 | 0.00 |
50 | 0.00 | 0.00 | 0.00 |
- These new columns are continuous numerical features that encode how close each sample’s age is to each mode.
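One way the columns could be added is sketched below. It assumes a toy four-row DataFrame standing in for the real housing data, and the column names rbf_age_10, rbf_age_35, and rbf_age_60 are illustrative rather than required by any library:

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import rbf_kernel

# Toy DataFrame standing in for the real housing data (assumption)
housing = pd.DataFrame({"housing_median_age": [5.0, 12.0, 35.0, 50.0]})

modes = np.array([[10.0], [35.0], [60.0]])
gamma = 0.1  # illustrative value

# One similarity column per mode, shape (n_samples, n_modes)
ages_2d = housing[["housing_median_age"]].to_numpy()
similarities = rbf_kernel(ages_2d, modes, gamma=gamma)

# Hypothetical column names, one per mode
rbf_columns = [f"rbf_age_{int(m)}" for m in modes.ravel()]
rbf_df = pd.DataFrame(similarities, columns=rbf_columns, index=housing.index)

# Keep the original column and append the new similarity columns
housing_with_rbf = pd.concat([housing, rbf_df], axis=1)
print(housing_with_rbf.round(2))
```

Reusing the original index when building rbf_df keeps the rows aligned during the concatenation, which matters if the DataFrame has been shuffled or filtered.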
Why This Is Useful for Modeling
- Models can learn different effects associated with different age ranges via these new features.
- The similarity values smoothly vary, allowing the model to capture subtle patterns without hard splits.
- This can improve performance on multimodal data compared to using raw or bucketized values alone.
Summary
Step | Effect |
---|---|
Original feature (housing_median_age) | May remain or be replaced |
RBF similarity features | Added as new numerical columns |
DataFrame | Expanded with these new features |
Model input | Uses new features to learn smooth, mode-based patterns |
You typically add the RBF features alongside your existing data, giving your model richer information to work with.
Transforming Target Values in Machine Learning
So far, we’ve focused on transforming input features, but sometimes the target variable (what you want to predict) also needs transformation.
Why Transform the Target?
- If the target distribution has a heavy tail (e.g., skewed house prices), transforming it can improve model performance and stability.
- A common transformation is to replace the target y with its logarithm, shrinking large values and making the distribution more balanced.
Important Consideration
- If you train a model on a transformed target (e.g., ( \log(y) )), the model predicts the transformed values, not the original.
- To get predictions on the original scale, you need to apply the inverse transformation (e.g., exponentiate predictions if log was used).
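The two points above can be sketched with a manual log transform. The data here is synthetic (a positive, right-skewed target generated purely for illustration), and LinearRegression is just a convenient stand-in for whatever regressor is actually used:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)

# Synthetic data (assumption): a positive, right-skewed target
X = rng.uniform(1.0, 10.0, size=(200, 1))
y = np.exp(0.3 * X[:, 0] + rng.normal(0.0, 0.1, size=200))

# Train on log(y) instead of y
model = LinearRegression()
model.fit(X, np.log(y))

# Predictions come out on the log scale
log_predictions = model.predict(X[:5])

# Exponentiate to get back to the original scale
predictions = np.exp(log_predictions)
```

Note that the log transform only applies to strictly positive targets.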
Using Scikit-Learn Transformers for Targets
- Most Scikit-Learn transformers provide an inverse_transform() method for this purpose.
- For example, if you scale your target using StandardScaler, you can train your model on the scaled labels, then convert predictions back to the original scale.
Code Example
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
# Assume housing_labels is a Pandas Series with the target values
# Scale target labels (must convert to 2D DataFrame)
target_scaler = StandardScaler()
scaled_labels = target_scaler.fit_transform(housing_labels.to_frame())
# Train model on one input feature for simplicity
model = LinearRegression()
model.fit(housing[["median_income"]], scaled_labels)
# New data for prediction
some_new_data = housing[["median_income"]].iloc[:5]
# Predict scaled targets
scaled_predictions = model.predict(some_new_data)
# Convert predictions back to original scale
predictions = target_scaler.inverse_transform(scaled_predictions)
Summary
Step | Description |
---|---|
Transform target variable | Apply transformation (e.g., log, scaling) |
Train model | Use transformed targets for training |
Predict | Model outputs predictions on transformed scale |
Inverse transform | Convert predictions back to original scale (e.g., exp or inverse scaling) |
This process ensures consistent handling of target transformations while enabling meaningful final predictions.
Using TransformedTargetRegressor for Target Transformation
The manual approach works fine, but a simpler option is to use a TransformedTargetRegressor.
- You just need to construct it by providing the regression model and the label transformer.
- Then fit it on the training set with the original, unscaled labels.
- It automatically transforms the labels during training and applies the inverse transform when making predictions.
Code Example
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    transformer=StandardScaler()
)
model.fit(housing[["median_income"]], housing_labels)
predictions = model.predict(some_new_data)
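TransformedTargetRegressor also accepts a pair of plain functions through its func and inverse_func parameters, which is a convenient way to apply a log transform to the target. The sketch below assumes synthetic, strictly positive targets generated for illustration only:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic data (assumption): positive, right-skewed target
X = rng.uniform(1.0, 10.0, size=(200, 1))
y = np.exp(0.3 * X[:, 0] + rng.normal(0.0, 0.1, size=200))

# func is applied to y before fitting; inverse_func is applied to predictions
log_model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log,
    inverse_func=np.exp,
)
log_model.fit(X, y)                      # original, untransformed labels
predictions = log_model.predict(X[:5])   # already back on the original scale
```

The fitted inner model is available as log_model.regressor_, which is useful for inspecting coefficients learned on the log scale.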
Custom Transformer: ClusterSimilarity
This section unpacks the code step by step, covering both what it does and why it is useful.
1. What’s going on here?
You’re looking at a custom Scikit-Learn transformer called ClusterSimilarity.
This transformer:
- Fits a KMeans clustering model on the training data.
- Computes similarities between each sample and each cluster center using the Gaussian RBF kernel.
- Outputs a feature matrix where each column corresponds to “how similar this sample is to a particular cluster.”
2. Code Breakdown
Class Definition
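from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import rbf_kernel
class ClusterSimilarity(BaseEstimator, TransformerMixin):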
- BaseEstimator and TransformerMixin make your custom class compatible with Scikit-Learn pipelines and API conventions.
- By subclassing them, you get:
  - Parameter handling (get_params / set_params)
  - The fit_transform() convenience method (from TransformerMixin)
__init__ method
    def __init__(self, n_clusters=10, gamma=1.0, random_state=None):
        self.n_clusters = n_clusters
        self.gamma = gamma
        self.random_state = random_state
- Purpose: Store parameters so they’re accessible later and compatible with Scikit-Learn’s parameter tuning (GridSearchCV, etc.).
- n_clusters → how many groups KMeans will find.
- gamma → controls the spread of the Gaussian RBF similarity measure.
- random_state → ensures reproducible clustering.
fit method
    def fit(self, X, y=None, sample_weight=None):
        self.kmeans_ = KMeans(self.n_clusters, random_state=self.random_state)
        self.kmeans_.fit(X, sample_weight=sample_weight)
        return self
- Creates a KMeans object with the specified number of clusters.
- Fits it on X (your features).
- Accepts sample_weight so some samples can influence clustering more than others.
- Returns self so it can be chained in pipelines (fit().transform()).
transform method
    def transform(self, X):
        return rbf_kernel(X, self.kmeans_.cluster_centers_, gamma=self.gamma)
- Computes the Gaussian Radial Basis Function (RBF) similarity between:
  - Each sample in X
  - Each cluster center found during fit
- RBF similarity formula (for two vectors x and c):
\[ \text{similarity} = \exp\left(-\gamma \cdot \|x - c\|^2\right) \]
- Output: A matrix of shape (n_samples, n_clusters).
get_feature_names_out method
    def get_feature_names_out(self, names=None):
        return [f"Cluster {i} similarity" for i in range(self.n_clusters)]
- Generates human-readable column names for the transformed features.
- Useful if you’re converting the result into a DataFrame.
3. Why this works
- Clustering + Similarity gives you spatially aware features.
- Instead of just assigning each sample to one cluster, you get continuous similarity scores to all clusters.
- This captures nuanced relationships: a house can be kind of like cluster 3 and very much like cluster 7.
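The difference between a hard cluster assignment and soft similarity scores can be sketched as follows. The 2D points and the three centers are synthetic assumptions, and n_init=10 is set explicitly only to avoid version-dependent defaults:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)

# Synthetic 2D points scattered around three made-up centers (assumption)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(0.0, 0.5, size=(50, 2)) for c in centers])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# Hard assignment: one integer label per sample
hard_labels = kmeans.predict(X)

# Soft membership: one similarity score per (sample, cluster) pair
soft_scores = rbf_kernel(X, kmeans.cluster_centers_, gamma=1.0)

print(hard_labels[:3])
print(np.round(soft_scores[:3], 2))
```

Because the Gaussian RBF decreases with distance, the cluster with the highest similarity is also the nearest one, so the soft scores contain the hard label plus information about how close the other clusters are.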
4. Example Use
cluster_simil = ClusterSimilarity(n_clusters=10, gamma=1., random_state=42)
similarities = cluster_simil.fit_transform(
housing[["latitude", "longitude"]],
sample_weight=housing_labels
)
- Fits the transformer using latitude and longitude.
- Uses housing_labels (median house values) as sample weights so expensive areas influence clusters more.
- Produces a similarity matrix like:
[[0.00, 0.14, ..., 0.99, 0.60],
[0.63, 0.00, ..., 0.04, 0.00],
[0.00, 0.29, ..., 0.70, 0.30]]
- Each row = a district
- Each column = similarity to a cluster center
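For reference, here is a sketch that assembles the fragments above into one runnable class and exercises it on synthetic data. The toy coordinates and values are random stand-ins for the housing dataset, n_clusters=5 is an arbitrary choice, and n_init=10 is an addition to the original fit method made only to avoid version-dependent KMeans defaults:

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import rbf_kernel


class ClusterSimilarity(BaseEstimator, TransformerMixin):
    def __init__(self, n_clusters=10, gamma=1.0, random_state=None):
        self.n_clusters = n_clusters
        self.gamma = gamma
        self.random_state = random_state

    def fit(self, X, y=None, sample_weight=None):
        # n_init=10 is an addition for this sketch (not in the original fragment)
        self.kmeans_ = KMeans(self.n_clusters, n_init=10,
                              random_state=self.random_state)
        self.kmeans_.fit(X, sample_weight=sample_weight)
        return self

    def transform(self, X):
        return rbf_kernel(X, self.kmeans_.cluster_centers_, gamma=self.gamma)

    def get_feature_names_out(self, names=None):
        return [f"Cluster {i} similarity" for i in range(self.n_clusters)]


# Synthetic stand-in for the housing data (assumption)
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "latitude": rng.uniform(32.0, 42.0, size=300),
    "longitude": rng.uniform(-124.0, -114.0, size=300),
})
toy_values = rng.uniform(50_000.0, 500_000.0, size=300)

cluster_simil = ClusterSimilarity(n_clusters=5, gamma=1.0, random_state=42)
similarities = cluster_simil.fit_transform(
    toy[["latitude", "longitude"]],
    sample_weight=toy_values,
)

# Wrap the result in a DataFrame using the generated column names
sim_df = pd.DataFrame(
    similarities,
    columns=cluster_simil.get_feature_names_out(),
    index=toy.index,
)
print(sim_df.round(2).head(3))
```

Passing sample_weight through fit_transform works here because TransformerMixin forwards extra keyword arguments to fit.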
5. Visual Insight
- If plotted, districts colored by their closest cluster’s similarity will show:
  - Densely populated, expensive regions with high similarity to certain clusters.
  - Geographic groupings that are data-driven rather than just map-based.
6. Key Takeaways
- This is feature engineering: turning raw coordinates into meaningful numerical features.
- Custom transformers like this integrate unsupervised learning (KMeans) into preprocessing pipelines.
- Similarity scores are often more useful than raw cluster labels in predictive models.
Duck Typing
Duck typing is a programming concept used primarily in dynamically typed languages like Python.
Definition
- The idea is: “If it looks like a duck, swims like a duck, and quacks like a duck, then it probably is a duck.”
- In programming terms, an object’s suitability is determined by the presence of certain methods and properties, rather than the object’s actual type.
How It Works
- Instead of checking an object’s type explicitly (e.g., with isinstance()), the code relies on the object having the necessary behavior (methods or attributes).
- If the object supports the required operations, it can be used regardless of its class.
Example
class Duck:
    def quack(self):
        print("Quack!")
class Person:
    def quack(self):
        print("I'm quacking like a duck!")
def make_it_quack(thing):
    thing.quack()  # Does not check type, just calls quack()
d = Duck()
p = Person()
make_it_quack(d)  # Output: Quack!
make_it_quack(p)  # Output: I'm quacking like a duck!
Here, both Duck and Person objects are accepted by make_it_quack because both implement a quack() method, regardless of their types.
Advantages
- Flexibility: Code works with any object that supports the expected behavior.
- Easier to write generic and reusable functions.
Disadvantages
- Errors due to missing methods may only show up at runtime.
- Harder to enforce strict interfaces or contracts.
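The runtime-error disadvantage can be illustrated with a short sketch. The Rock class and the try_to_quack helper are hypothetical names introduced only for this example. An object without quack() is accepted until the call is actually made, at which point Python raises AttributeError:

```python
class Duck:
    def quack(self):
        return "Quack!"


class Rock:
    pass  # deliberately has no quack() method


def make_it_quack(thing):
    return thing.quack()  # fails only when this line runs


def try_to_quack(thing):
    """Attempt the call and handle the failure instead of checking the type."""
    try:
        return make_it_quack(thing)
    except AttributeError:
        return f"{type(thing).__name__} cannot quack"


results = [try_to_quack(Duck()), try_to_quack(Rock())]
print(results)  # ['Quack!', 'Rock cannot quack']
```

Catching the exception at the call site is one common way to cope with the lack of an enforced interface.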
Summary
Duck typing emphasizes what an object can do, rather than what it is.
Stochastic
Stochastic refers to systems, processes, or variables that are random or probabilistic in nature, meaning they involve some element of randomness or unpredictability.
Explanation
- A stochastic process is one that evolves over time with inherent randomness, such that its future states are not fully determined by past states.
- Contrasts with deterministic processes, which have no randomness and produce the same output from the same initial conditions every time.
Examples in Machine Learning
- Stochastic Gradient Descent (SGD): Instead of computing the gradient using the whole dataset (deterministic), it uses random samples (single examples or mini-batches). The resulting noise can help the optimizer escape local minima, and each update is much cheaper to compute.
- Random Forest: Combines many decision trees trained on random subsets of data and features, so training is inherently stochastic.
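To make the SGD example concrete, here is a minimal sketch of mini-batch SGD fitting a one-dimensional linear model with NumPy. The function name sgd_linear_fit, the synthetic data, and the hyperparameter values are all assumptions chosen for illustration. The randomness comes from shuffling the data each epoch, so different seeds give slightly different results while the same seed reproduces the run exactly:

```python
import numpy as np

def sgd_linear_fit(X, y, lr=0.05, n_epochs=50, batch_size=16, seed=0):
    """Mini-batch SGD for y ~ w * x + b with mean squared error loss."""
    rng = np.random.default_rng(seed)
    w, b = 0.0, 0.0
    n = len(X)
    for _ in range(n_epochs):
        order = rng.permutation(n)  # random order: the stochastic part
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            error = w * X[idx] + b - y[idx]
            # Gradient of the mean squared error on this mini-batch only
            w -= lr * 2.0 * np.mean(error * X[idx])
            b -= lr * 2.0 * np.mean(error)
    return w, b

# Synthetic data (assumption): true slope 3.0, true intercept 0.5
data_rng = np.random.default_rng(1)
X = data_rng.uniform(-1.0, 1.0, size=500)
y = 3.0 * X + 0.5 + data_rng.normal(0.0, 0.1, size=500)

w_a, b_a = sgd_linear_fit(X, y, seed=0)
w_b, b_b = sgd_linear_fit(X, y, seed=1)    # different shuffles
w_a2, b_a2 = sgd_linear_fit(X, y, seed=0)  # same shuffles as the first run

print(w_a, b_a)
print(w_b, b_b)
```

Both runs should land close to the true parameters, but not on exactly the same numbers. Fixing the seed is what makes a stochastic procedure repeatable.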
Simple analogy
- Tossing a fair coin is a stochastic process: the outcome is random.
- Rolling a loaded die where the outcome is always 6 is deterministic (no randomness).
Summary
Stochastic means involving randomness or probability, making outcomes partly unpredictable.
Grid Search
https://www.youtube.com/watch?v=lfiw2Rh2v8k&ab_channel=AIForBeginners
https://www.youtube.com/watch?v=G-fXV-o9QV8&ab_channel=DataMListic