(Content mostly generated by ChatGPT)
Central Limit Theorem (CLT)
What it is:
The CLT states that if you take many random samples from any population (with finite variance) and compute their means, the distribution of those sample means will approach a normal (Gaussian) distribution as the sample size increases, regardless of the original population’s distribution.
Formally:

$$\sqrt{n}\,(\bar{X}_n - \mu) \xrightarrow{d} \mathcal{N}(0, \sigma^2), \quad \text{equivalently} \quad \bar{X}_n \approx \mathcal{N}\!\left(\mu, \frac{\sigma^2}{n}\right) \text{ for large } n,$$

where ( \mu ) is the population mean, ( \sigma^2 ) is the population variance, and ( n ) is the sample size.
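A minimal simulation sketch (NumPy; the exponential population is an arbitrary skewed choice) illustrates this: the means of repeated samples concentrate around ( \mu ) with variance close to ( \sigma^2 / n ), and their histogram looks increasingly normal.

```python
import numpy as np

rng = np.random.default_rng(0)

# Skewed, clearly non-normal population: exponential with mean 1 and variance 1.
population_var = 1.0

for n in (2, 10, 100):
    # 10,000 independent samples of size n, each reduced to its sample mean.
    sample_means = rng.exponential(scale=1.0, size=(10_000, n)).mean(axis=1)
    # CLT: the sample means have mean ~ mu and variance ~ sigma^2 / n.
    print(n, sample_means.mean().round(3), sample_means.var().round(4), population_var / n)
```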
Why it matters for ML
✅ Parameter estimation & confidence intervals – When you estimate model parameters (like weights, means, etc.), CLT justifies using normal approximations for those estimates.
✅ Hypothesis testing & A/B testing – Many statistical tests assume normality, and CLT explains why we can use those tests even if the original data isn’t normal.
✅ Feature engineering & averaging – Many ensemble methods (like bagging, random forests) rely on averaging predictions; CLT explains why averaging reduces variance and tends to produce a stable, normal-like output.
✅ Understanding loss surfaces & convergence – The mini-batch gradient in Stochastic Gradient Descent (SGD) is an average over samples, so its noise is approximately normal by the CLT.
Importance for ML
⭐⭐⭐⭐⭐ (Critical) – It’s a foundational concept that underpins statistical inference, which is everywhere in ML, especially in probabilistic models, evaluation, and uncertainty estimation.
Law of Large Numbers (LLN)
What it is:
The LLN states that as the sample size ( n ) increases, the sample mean ( \bar{X} ) converges to the population mean ( \mu ).
Formally (weak LLN):

$$\lim_{n \to \infty} P\left(\left|\bar{X}_n - \mu\right| > \epsilon\right) = 0$$

for any small ( \epsilon > 0 ).
In simpler words: the more data you collect, the closer your sample average gets to the true average.
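A quick sketch (NumPy, with a made-up Bernoulli population whose true mean is 0.3) shows the running sample mean settling toward the population mean as ( n ) grows:

```python
import numpy as np

rng = np.random.default_rng(0)

# Bernoulli(p=0.3) population; the true mean is 0.3.
draws = rng.binomial(n=1, p=0.3, size=100_000)

# Running sample mean after each new observation.
running_mean = np.cumsum(draws) / np.arange(1, draws.size + 1)

for n in (10, 100, 10_000, 100_000):
    print(f"n={n:>7}  sample mean={running_mean[n - 1]:.4f}")
# The sample mean drifts toward 0.3 as n grows (LLN).
```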
Why it matters for ML
✅ Reliability of training data – Justifies that with enough data, the empirical statistics (mean, variance) of features approximate the true underlying distribution.
✅ Model performance estimation – Cross-validation and test accuracy stabilize with more samples because of LLN.
✅ Monte Carlo methods & probabilistic ML – When estimating expectations by random sampling (e.g., in Bayesian inference, reinforcement learning), LLN ensures that the average of many samples converges to the expected value.
✅ Why “more data beats clever algorithms” – LLN supports the idea that larger datasets reduce variance and improve generalization.
Importance for ML
⭐⭐⭐⭐⭐ (Critical) – Fundamental to why we trust that learning from large datasets works.
Estimator Quality in Statistics & ML
✅ What is Estimation?
In statistics and ML, you’re often trying to estimate some true parameter of the data-generating process (like the mean ( \mu ), variance ( \sigma^2 ), regression coefficients ( \beta ), etc.) based on a sample of data.
You use an estimator (a function of the data) to give you this estimate.
Examples:
- ( \bar{X} ) estimates ( \mu )
- ( \hat{\theta} ) estimates ( \theta )
But… how do you know if your estimator is good?
✅ Key Properties of Estimators
1. Bias
Measures how far the average estimate is from the true value.
- Unbiased if ( \mathbb{E}[\hat{\theta}] = \theta )
- Bias represents systematic error.
📌 Example: Sample mean ( \bar{X} ) is an unbiased estimator of population mean ( \mu ).
2. Variance
Measures how much the estimator varies from sample to sample.
- High variance → estimator jumps around a lot.
- Low variance → more stable estimates.
📌 Example: Using a small sample leads to higher variance in estimates.
3. Mean Squared Error (MSE)
Combines bias and variance into a single metric: ( \mathrm{MSE}(\hat{\theta}) = \mathbb{E}[(\hat{\theta} - \theta)^2] = \mathrm{Bias}(\hat{\theta})^2 + \mathrm{Var}(\hat{\theta}) ) (a small Monte Carlo sketch after this list illustrates the decomposition).
- Helps you trade off between bias and variance.
- Commonly used in ML as a loss function.
📌 Example: In regression, MSE is used to evaluate predictions.
4. Consistency
Does the estimator converge to the true value as sample size increases?
- All good estimators should be consistent in large samples.
- Related to the Law of Large Numbers.
5. Efficiency (Advanced)
Among all unbiased estimators, is yours the one with smallest variance?
- The Cramér-Rao Lower Bound defines the theoretical minimum variance an unbiased estimator can have.
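As a rough Monte Carlo sketch of these properties (NumPy; the shrunk estimator ( 0.9\bar{X} ) is just a made-up biased alternative, not a standard method), we can estimate bias, variance, and MSE of two estimators of a known mean and check that MSE ≈ bias² + variance:

```python
import numpy as np

rng = np.random.default_rng(0)
true_mu, sigma, n, trials = 5.0, 3.0, 10, 100_000

# Many repeated samples of size n from N(true_mu, sigma^2).
samples = rng.normal(true_mu, sigma, size=(trials, n))

estimators = {
    "sample mean (unbiased)": samples.mean(axis=1),
    "shrunk mean 0.9*xbar (biased)": 0.9 * samples.mean(axis=1),
}

for name, est in estimators.items():
    bias = est.mean() - true_mu          # E[theta_hat] - theta
    var = est.var()                      # Var(theta_hat)
    mse = np.mean((est - true_mu) ** 2)  # E[(theta_hat - theta)^2]
    print(f"{name:32s} bias={bias:+.3f} var={var:.3f} "
          f"mse={mse:.3f} bias^2+var={bias**2 + var:.3f}")
```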
✅ How this applies to ML
| Concept | In ML Terms |
|---|---|
| Bias | Underfitting (too simple model) |
| Variance | Overfitting (too complex model) |
| MSE | Typical loss function (e.g., in regression) |
| Consistency | More data improves model reliability |
| Efficiency | Important in probabilistic modeling |
✅ TL;DR Table
| Metric | Meaning | Formula |
|---|---|---|
| Bias | Average error from truth | ( \mathbb{E}[\hat{\theta}] - \theta ) |
| Variance | How much estimate varies | ( \mathbb{E}[(\hat{\theta} - \mathbb{E}[\hat{\theta}])^2] ) |
| MSE | Total expected squared error | ( \mathbb{E}[(\hat{\theta} - \theta)^2] = \mathrm{Bias}^2 + \mathrm{Var} ) |
| Consistency | Estimate gets closer to truth as ( n \to \infty ) | ( \hat{\theta}_n \xrightarrow{p} \theta ) |
Estimating Variance
✅ 1. Naive (Maximum Likelihood) Estimator
If you have a sample ( X_1, X_2, …, X_n ) from a population with variance ( \sigma^2 ), the natural estimator is:

$$\hat{\sigma}^2_{\text{MLE}} = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2$$
This is the maximum likelihood estimator (MLE) of ( \sigma^2 ).
❌ Problem: It is biased for finite samples — on average, it underestimates the true variance because we use the sample mean ( \bar{X} ) instead of the true mean ( \mu ).
✅ 2. Unbiased (Corrected) Estimator
To remove the bias, we adjust the denominator to ( n-1 ):

$$S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$$

This is the usual sample variance.
✔️ Property: ( \mathbb{E}[S^2] = \sigma^2 ) (unbiased).
✅ 3. Why the Correction? (Intuition)
- When you calculate ( \bar{X} ), you “use up” one degree of freedom to estimate the mean.
- Therefore, the deviations ( X_i - \bar{X} ) are slightly smaller (on average) than deviations from the true mean ( X_i - \mu ).
- Dividing by ( n ) underestimates the variance → Bessel’s correction (divide by ( n-1 )) fixes this.
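A small simulation sketch (NumPy; an arbitrary normal population with variance 4 and tiny samples of size 5) makes the bias visible: the ( 1/n ) estimator averages to about ( \frac{n-1}{n}\sigma^2 ), while the ( 1/(n-1) ) version averages to ( \sigma^2 ).

```python
import numpy as np

rng = np.random.default_rng(0)
true_var, n, trials = 4.0, 5, 200_000

# Repeated small samples from N(0, 4).
samples = rng.normal(0.0, np.sqrt(true_var), size=(trials, n))

mle_var = samples.var(axis=1, ddof=0)        # divide by n      (MLE, biased)
unbiased_var = samples.var(axis=1, ddof=1)   # divide by n - 1  (Bessel-corrected)

print("true variance:            ", true_var)
print("mean of 1/n estimator:    ", mle_var.mean().round(3))       # ~ (n-1)/n * sigma^2 = 3.2
print("mean of 1/(n-1) estimator:", unbiased_var.mean().round(3))  # ~ 4.0
```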
✅ 4. Which Estimator is “Good”?
| Estimator | Bias | Variance | When to Use |
|---|---|---|---|
| MLE ( 1/n ) | Biased (underestimates) | Lower variance | Preferred in ML & large ( n ) (bias negligible) |
| Unbiased ( 1/(n-1) ) | Unbiased | Slightly higher variance | Preferred in classical statistics & small ( n ) |
✅ 5. Importance in ML
- In Machine Learning, we often use the MLE ( 1/n ) version because:
  - Datasets are large → bias negligible.
  - MLE has nice mathematical properties (consistency, asymptotic normality).
- In Statistics, especially with small samples, we prefer the unbiased ( 1/(n-1) ) version.
Maximum Likelihood Estimation (MLE)
For a video explanation: StatQuest
✅ 1. What is MLE?
Maximum Likelihood Estimation is a method to find the parameter values (( \theta )) of a statistical model that make the observed data most probable.
Pick the parameter values that maximize the probability (likelihood) of seeing the data you observed.
✅ 2. Formal Definition
Given data ( X_1, X_2, …, X_n ) assumed to come from a distribution with parameter ( \theta ):

Likelihood function:

$$L(\theta) = P(X_1, X_2, …, X_n \mid \theta)$$

For independent samples:

$$L(\theta) = \prod_{i=1}^{n} P(X_i \mid \theta)$$

The maximum likelihood estimator is:

$$\hat{\theta}_{\text{MLE}} = \arg\max_{\theta} L(\theta)$$

Often, we maximize the log-likelihood:

$$\ell(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log P(X_i \mid \theta)$$

Then solve:

$$\frac{\partial \ell(\theta)}{\partial \theta} = 0$$
✅ 3. Intuition
- The “best” parameters should make your observed data as likely as possible under the assumed distribution.
📌 Example: Estimating a coin’s probability of heads ( p ) → pick ( p ) that maximizes the probability of seeing your flips.
✅ 4. Properties of MLE
- Consistency: ( \hat{\theta}_n \xrightarrow{p} \theta ) as ( n \to \infty ).
- Asymptotic normality: ( \sqrt{n}\,(\hat{\theta}_n - \theta) \xrightarrow{d} \mathcal{N}(0, I(\theta)^{-1}) ), where ( I(\theta) ) = Fisher Information.
- Efficiency: achieves the lowest possible variance (Cramér-Rao bound) for large ( n ).
- Not always unbiased: can be biased for small samples.
✅ 5. What is MLE Used For?
- Estimating parameters (mean, variance, probabilities).
- Machine Learning models:
- Linear/Logistic Regression (likelihood maximization ≈ minimizing MSE / cross-entropy)
- Naive Bayes, Gaussian Mixture Models, HMMs
- Basis for probabilistic and Bayesian methods (MLE vs MAP).
✅ 6. Simple Example (Coin Flip)
You flip a coin ( n ) times, observe ( k ) heads.
Likelihood (Bernoulli):

$$L(p) = p^{k} (1-p)^{n-k}$$

Log-likelihood:

$$\ell(p) = k \log p + (n-k) \log(1-p)$$

Differentiate & set to zero:

$$\frac{d\ell}{dp} = \frac{k}{p} - \frac{n-k}{1-p} = 0$$

Solve:

$$\hat{p} = \frac{k}{n}$$

(intuitive: proportion of heads)
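A tiny numeric check (plain NumPy grid search; the counts ( n = 100, k = 63 ) are hypothetical) confirms that maximizing the Bernoulli log-likelihood lands on ( k/n ):

```python
import numpy as np

n, k = 100, 63  # hypothetical flips: 63 heads out of 100

# Bernoulli log-likelihood: ell(p) = k*log(p) + (n-k)*log(1-p)
p_grid = np.linspace(0.001, 0.999, 999)
log_lik = k * np.log(p_grid) + (n - k) * np.log(1 - p_grid)

p_hat = p_grid[np.argmax(log_lik)]  # numerical maximizer over the grid
print(p_hat, k / n)                 # both ~ 0.63: the MLE is the proportion of heads
```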
New Ones
🔍 Maximum Likelihood Estimation (MLE) in Machine Learning
What is MLE?
- MLE estimates model parameters that maximize the probability of observed data.
- Given data ( X_1, …, X_n ), the likelihood function is ( L(\theta) = \prod_{i=1}^{n} P(X_i \mid \theta) ).
- Instead of maximizing ( L(\theta) ) directly, we usually maximize the log-likelihood ( \ell(\theta) = \sum_{i=1}^{n} \log P(X_i \mid \theta) ) for convenience.
Key Properties of MLE
- Asymptotic unbiasedness: as the number of samples ( n \to \infty ), the MLE is unbiased on average: ( \mathbb{E}[\hat{\theta}_n] \to \theta ).
- Consistency: ( \hat{\theta}_n ) converges in probability to the true ( \theta ) as ( n \to \infty ).
- Asymptotic normality: the normalized error is approximately normal: ( \sqrt{n}\,(\hat{\theta}_n - \theta) \xrightarrow{d} \mathcal{N}(0, I(\theta)^{-1}) ).
Why MLE Matters in ML
- Many models (e.g., logistic regression) use MLE to fit parameters.
- Understanding MLE helps explain how models learn from data.
📌 Tip: Maximizing the log-likelihood is equivalent to minimizing the negative log-likelihood, which is the loss function many models actually optimize (e.g., cross-entropy).
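As an illustrative sketch of that tip (SciPy's general-purpose optimizer on synthetic Gaussian data, not any specific model's training loop), minimizing the negative log-likelihood numerically recovers the closed-form MLEs, i.e., the sample mean and the ( 1/n ) standard deviation:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=500)  # synthetic sample

def neg_log_likelihood(params):
    mu, log_sigma = params                       # optimize log(sigma) to keep sigma > 0
    return -norm.logpdf(data, loc=mu, scale=np.exp(log_sigma)).sum()

result = minimize(neg_log_likelihood, x0=[0.0, 0.0])
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])

# The numerical MLE matches the closed-form answers: sample mean and 1/n std.
print(mu_hat, data.mean())
print(sigma_hat, data.std(ddof=0))
```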
Hypothesis Testing
- ( H_0 ) (Null Hypothesis): The default assumption; typically states there is no effect or no difference.
- ( H_1 ) (Alternative Hypothesis): What you want to test for; usually states there is an effect or difference.
Z-Test
- Hypothesis Testing Problems - Z Test & T Statistics - One & Two Tailed Tests 2
- Used to determine if there is a significant difference between sample and population means (or between two samples) when the population variance is known.
- Test statistic formula:
  $$z = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}}$$
  where ( \bar{X} ) = sample mean, ( \mu_0 ) = population mean under ( H_0 ), ( \sigma ) = population standard deviation, ( n ) = sample size.
- The ( z ) value is compared to critical values from the standard normal distribution to reject or fail to reject ( H_0 ) (see the sketch below).
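A minimal sketch of the computation (SciPy for the normal tail probability; all the numbers are hypothetical):

```python
import numpy as np
from scipy.stats import norm

# Hypothetical numbers: known population sd, H0: mu = 100.
x_bar, mu_0, sigma, n = 103.2, 100.0, 15.0, 50

z = (x_bar - mu_0) / (sigma / np.sqrt(n))
p_two_sided = 2 * norm.sf(abs(z))   # sf = 1 - cdf (upper tail probability)

print(f"z = {z:.3f}, two-sided p-value = {p_two_sided:.4f}")
# Reject H0 at alpha = 0.05 if the p-value is below 0.05.
```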
Sensitivity (True Positive Rate)
- Measures the ability of a test or classifier to correctly identify positive cases: ( \text{Sensitivity} = \frac{TP}{TP + FN} ).
- High sensitivity means few false negatives (good at detecting positives).
Fisher’s Exact Test
- A statistical significance test used for small sample sizes and categorical data.
- Tests for nonrandom association between two categorical variables in a contingency table (often 2x2).
- Calculates the exact probability of observing the data assuming ( H_0 ) is true.
- Useful when sample sizes are too small for chi-square tests to be valid.
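A small sketch with scipy.stats.fisher_exact on a made-up 2x2 table:

```python
from scipy.stats import fisher_exact

# Hypothetical 2x2 contingency table:
#              outcome+  outcome-
# treatment       8         2
# control         1         9
table = [[8, 2], [1, 9]]

odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"odds ratio = {odds_ratio:.2f}, p-value = {p_value:.4f}")
# A small p-value suggests an association despite the tiny sample.
```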
📌 Summary:
Hypothesis testing allows making decisions based on data. Z-tests handle mean comparisons with known variance. Sensitivity assesses detection ability, and Fisher’s Exact Test analyzes categorical associations, especially with small data.
Statistical Testing Concepts
p-value
- The p-value is the probability of observing data as extreme (or more) as the current sample, assuming the null hypothesis ( H_0 ) is true.
- It quantifies the evidence against ( H_0 ).
- Small p-values mean strong evidence to reject ( H_0 ).
p-value Cut-offs (Significance Levels)
- Common thresholds for rejecting ( H_0 ):
  - 0.05 (5%): Typical cutoff; reject ( H_0 ) if ( p < 0.05 )
  - 0.01 (1%): Stricter cutoff; reject ( H_0 ) if ( p < 0.01 )
- These cutoffs are called significance levels ( \alpha ).
( \chi^2 ) (Chi-Squared) Test
- Tests whether there is a significant association between categorical variables.
- Compares observed counts with expected counts under ( H_0 ).
- Test statistic:
  $$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$$
  where ( O_i ) = observed frequency, ( E_i ) = expected frequency.
- The statistic follows a chi-squared distribution with ( (\text{rows} - 1)(\text{columns} - 1) ) degrees of freedom (for a contingency table).
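A short sketch using scipy.stats.chi2_contingency on a made-up contingency table (the function also returns the degrees of freedom and the expected counts under ( H_0 )):

```python
from scipy.stats import chi2_contingency

# Hypothetical 2x3 contingency table of counts (rows: group, columns: category).
observed = [[30, 14, 6],
            [22, 18, 10]]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.3f}, dof = {dof}, p-value = {p_value:.4f}")
# dof = (rows - 1) * (columns - 1) = 2; `expected` holds the counts expected under H0.
```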
t-test
- Student’s T Distribution - Confidence Intervals & Margin of Error
- Hypothesis Testing Problems - Z Test & T Statistics - One & Two Tailed Tests 2; Student’s t-test by Bozeman Science
- Tests whether the means of two groups are significantly different when the population variance is unknown.
- Test statistic (one-sample form):
  $$t = \frac{\bar{X} - \mu_0}{s / \sqrt{n}}$$
  where ( \bar{X} ) = sample mean, ( \mu_0 ) = population mean under ( H_0 ), ( s ) = sample standard deviation, ( n ) = sample size.
- The statistic follows a Student’s t-distribution with ( n - 1 ) degrees of freedom.
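A minimal two-sample sketch with scipy.stats.ttest_ind on synthetic data (this is the two-sample variant; the one-sample formula above corresponds to scipy.stats.ttest_1samp):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
group_a = rng.normal(loc=5.0, scale=1.0, size=30)   # synthetic group A
group_b = rng.normal(loc=5.6, scale=1.0, size=30)   # synthetic group B

t_stat, p_value = ttest_ind(group_a, group_b)       # two-sample t-test (equal variances)
print(f"t = {t_stat:.3f}, p-value = {p_value:.4f}")
# Compare p_value to alpha (e.g., 0.05) to decide whether to reject H0 of equal means.
```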
📌 Summary:
- Use p-values and significance levels to decide whether to reject ( H_0 ).
- The ( \chi^2 ) test is for categorical data associations.
- t-test compares means when variance is unknown.
🚫 Non-Parametric Tests
- Parametric and Nonparametric Tests by DATAtab
- Non-parametric tests do not assume a specific distribution for the data.
- Useful when data violates assumptions of parametric tests (e.g., normality).
- Examples: Mann-Whitney U test, Wilcoxon signed-rank test, permutation tests.
🔄 Permutation Test
- A non-parametric method to test hypotheses by randomly shuffling labels on data points.
- Measures how likely an observed effect is under the null hypothesis by comparing it to a distribution of effects from shuffled data.
- Useful for small samples or unknown distributions.
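A bare-bones sketch of a two-sample permutation test for a difference in means (NumPy only; the two groups are made-up numbers):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two small hypothetical groups; test whether their means differ.
group_a = np.array([12.1, 10.4, 13.0, 11.7, 12.5])
group_b = np.array([10.2,  9.8, 11.1, 10.5,  9.9])
observed_diff = group_a.mean() - group_b.mean()

pooled = np.concatenate([group_a, group_b])
n_a, n_perm = len(group_a), 10_000
diffs = np.empty(n_perm)

for i in range(n_perm):
    shuffled = rng.permutation(pooled)               # shuffle the group labels
    diffs[i] = shuffled[:n_a].mean() - shuffled[n_a:].mean()

# Two-sided p-value: how often a shuffled difference is as extreme as the observed one.
p_value = np.mean(np.abs(diffs) >= abs(observed_diff))
print(f"observed diff = {observed_diff:.2f}, permutation p-value = {p_value:.4f}")
```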
🔢 Multiple Hypothesis Testing
- When testing many hypotheses simultaneously, the chance of false positives (Type I errors) increases.
- Without correction, if you test 100 hypotheses at ( \alpha = 0.05 ), about 5 may appear significant by chance.
🎯 Adjusting p-values: Bonferroni Correction
- A simple and conservative method to control the family-wise error rate.
- Adjusted significance level: ( \alpha_{\text{adj}} = \alpha / m ), where ( m ) = number of tests.
- Reject ( H_0 ) for an individual test only if ( p < \alpha / m ).
- Controls false positives but can be overly strict, increasing false negatives.
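A tiny sketch of the correction applied to a set of hypothetical p-values:

```python
import numpy as np

# Hypothetical p-values from m = 5 independent tests.
p_values = np.array([0.001, 0.012, 0.030, 0.200, 0.650])
alpha, m = 0.05, len(p_values)

reject_uncorrected = p_values < alpha
reject_bonferroni = p_values < alpha / m   # compare against alpha / m = 0.01

print("uncorrected rejections:", reject_uncorrected)  # three tests pass at 0.05
print("Bonferroni rejections: ", reject_bonferroni)   # only the smallest p-value survives 0.01
```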
📌 Summary:
- Use permutation tests when data distributions are unknown.
- Be cautious with multiple testing; adjust p-values to avoid false positives.
- Bonferroni is simple but conservative.
Linear Regression: Main Concepts
0. Watch These
- The Main Ideas of Fitting a Line to Data (The Main Ideas of Least Squares and Linear Regression.)
- Linear Regression, Clearly Explained!!!
- Read the first comment by statquest for correction
- Gradient Descent, Step-by-Step
1. Least Squares Fit
- Find the line (model) that minimizes the sum of squared residuals.
- Residual: the vertical distance between a data point and the regression line.
2. Residuals and Sum of Squares
- Residuals: ( e_i = y_i - \hat{y}_i ), where ( y_i ) is the actual value and ( \hat{y}_i ) is the predicted value.
- Sum of Squares Total (SST): variation around the mean (total variability in data), ( \sum_i (y_i - \bar{y})^2 ).
- Sum of Squares Regression (SSR) / Fit: variation explained by the model, ( \sum_i (\hat{y}_i - \bar{y})^2 ).
- Sum of Squares Residuals (SSE): variation unexplained by the model, ( \sum_i (y_i - \hat{y}_i)^2 ).
3. Coefficient of Determination ( R^2 )
- Measures how much of the total variation in the dependent variable is explained by the model:
  $$R^2 = 1 - \frac{SSE}{SST}$$
- Interpretation:
  - ( R^2 = 0 ) means the model explains none of the variation.
  - ( R^2 = 1 ) means the model explains all the variation.
4. p-value for ( R^2 )
- Tests whether the relationship captured by ( R^2 ) is statistically significant.
- Small p-value indicates that the model explains a significant portion of the variance, not due to random chance.
5. Direction of Regression
- Regression can be run both ways: predicting ( y ) from ( x ) or ( x ) from ( y ).
- The fitted line and coefficient values may differ depending on the direction.
📌 Summary:
Linear regression fits a line minimizing the sum of squared residuals; ( R^2 ) quantifies explained variance; p-values test significance.
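A compact sketch of a least squares fit and ( R^2 ) on synthetic data (NumPy only; np.linalg.lstsq does the fitting):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 2 + 3x + noise
x = rng.uniform(0, 10, size=100)
y = 2.0 + 3.0 * x + rng.normal(0, 2.0, size=100)

# Least squares fit of y = b0 + b1 * x
X = np.column_stack([np.ones_like(x), x])
(b0, b1), *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = b0 + b1 * x

sst = np.sum((y - y.mean()) ** 2)     # total variation (SST)
sse = np.sum((y - y_hat) ** 2)        # unexplained variation (SSE)
r_squared = 1 - sse / sst

print(f"intercept={b0:.2f}, slope={b1:.2f}, R^2={r_squared:.3f}")
```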
Questions About Linear Regression
🧮 1. Does Adding Parameters Always Help?
No, not necessarily.
📌 Key Idea:
- Adding more predictors (features) to a linear regression never decreases the training R² (coefficient of determination).
- But it can worsen test performance (generalization) due to overfitting.
✍️ Example:
If you have a model

$$\hat{y} = \beta_0 + \beta_1 x_1$$

and you add another feature ( x_2 ), the model becomes:

$$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2$$

This will always fit the training data at least as well, but the new parameter ( \beta_2 ) might just be fitting noise, not signal, hurting performance on new data.
So, more parameters = more flexible, but not always better for prediction.
📈 2. What is the F-statistic in Linear Regression?
🎯 Purpose:
Used to test whether the regression model explains a significant amount of variance in the dependent variable.
⚙️ Intuition:
- It compares two things:
- Explained variance (how much of the model explains)
- Unexplained variance (residuals/noise)
- If the explained variance is much higher than unexplained, the model is statistically significant.
📐 Formula:

$$F = \frac{SSR / p}{SSE / (n - p - 1)}$$

Where:
- SSR: Regression Sum of Squares (explained)
- SSE: Error Sum of Squares (unexplained)
- p: number of predictors
- n: number of data points
A high F value → model is better than a null model (just the mean).
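A rough sketch computing the F-statistic by hand on synthetic data with ( p = 2 ) predictors (NumPy least squares for the fit):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression with p = 2 predictors.
n, p = 50, 2
X = rng.normal(size=(n, p))
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(0, 1.0, size=n)

# Fit with an intercept via least squares.
X_design = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
y_hat = X_design @ beta

ssr = np.sum((y_hat - y.mean()) ** 2)     # explained (regression) sum of squares
sse = np.sum((y - y_hat) ** 2)            # unexplained (error) sum of squares
f_stat = (ssr / p) / (sse / (n - p - 1))

print(f"F = {f_stat:.2f} with ({p}, {n - p - 1}) degrees of freedom")
# A large F means the model explains far more variance than the residual noise.
```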
🧮 3. Degrees of Freedom (DoF)
https://www.youtube.com/watch?v=Cm0vFoGVMB8&ab_channel=CrashCourse
In Regression:
- Total DoF = ( n - 1 )
- Regression DoF = ( p ) (number of predictors)
- Residual DoF = ( n - p - 1 )
🔍 Why It Matters:
- Degrees of freedom are used to standardize variance (mean squares) so you can compare them, as in the F-statistic.
- More parameters = fewer residual degrees of freedom, which means you are using up data to fit the model.
🔁 Summary
| Concept | Meaning |
|---|---|
| Adding features | Can increase training R², but risks overfitting. |
| F-statistic | Tests whether the model explains significant variation. |
| Explained vs. Unexplained | SSR vs. SSE (signal vs. noise) |
| Degrees of freedom | Track how much data is used for estimating vs. testing. |