(Content mostly generated by ChatGPT)

Central Limit Theorem (CLT)

What it is:
The CLT states that if you take many random samples from any population (with finite variance) and compute their means, the distribution of those sample means will approach a normal (Gaussian) distribution as the sample size increases, regardless of the original population’s distribution.

Formally:

( \sqrt{n}\,\left( \bar{X}_n - \mu \right) \xrightarrow{d} \mathcal{N}(0, \sigma^2) ), i.e., for large ( n ), ( \bar{X}_n \approx \mathcal{N}\left(\mu, \frac{\sigma^2}{n}\right) ),

where ( \mu ) is the population mean, ( \sigma^2 ) is the population variance, and ( n ) is the sample size.
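
A minimal NumPy sketch (synthetic data, illustrative only): repeatedly sample from a skewed exponential population and check that the spread of the sample means matches the normal approximation ( \mathcal{N}(\mu, \sigma^2 / n) ).

```python
import numpy as np

rng = np.random.default_rng(0)

# Population: Exponential(scale=2) -- skewed, clearly non-normal.
mu, sigma = 2.0, 2.0            # true mean and std of Exp(scale=2)
n, n_experiments = 50, 10_000   # sample size and number of repeated samples

# Each row is one experiment; take the mean of each sample of size n.
sample_means = rng.exponential(scale=2.0, size=(n_experiments, n)).mean(axis=1)

# CLT prediction: sample means ~ Normal(mu, sigma^2 / n).
print("empirical mean of sample means:", sample_means.mean())  # ~ 2.0
print("empirical std of sample means: ", sample_means.std())   # ~ sigma / sqrt(n) ~ 0.28
print("CLT-predicted std:             ", sigma / np.sqrt(n))
```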


Why it matters for ML

Parameter estimation & confidence intervals – When you estimate model parameters (like weights, means, etc.), CLT justifies using normal approximations for those estimates.

Hypothesis testing & A/B testing – Many statistical tests assume normality, and CLT explains why we can use those tests even if the original data isn’t normal.

Feature engineering & averaging – Many ensemble methods (like bagging, random forests) rely on averaging predictions; CLT explains why averaging reduces variance and tends to produce a stable, normal-like output.

Understanding loss surfaces & convergence – The noise in Stochastic Gradient Descent (SGD) updates is approximately normal by the CLT, because each update averages per-example gradients over a mini-batch.


Importance for ML

⭐⭐⭐⭐⭐ (Critical) – It’s a foundational concept that underpins statistical inference, which is everywhere in ML, especially in probabilistic models, evaluation, and uncertainty estimation.


Law of Large Numbers (LLN)

What it is:
The LLN states that as the sample size ( n ) increases, the sample mean ( \bar{X} ) will converge to the population mean ( \mu ).

Formally:

( P\left( \left| \bar{X}_n - \mu \right| > \epsilon \right) \to 0 ) as ( n \to \infty ),

for any small ( \epsilon > 0 ).

In simpler words: the more data you collect, the closer your sample average gets to the true average.
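
A quick sketch of the LLN in NumPy (the coin bias ( p = 0.3 ) and the seed are arbitrary): the running mean of simulated coin flips approaches the true mean as ( n ) grows.

```python
import numpy as np

rng = np.random.default_rng(1)

# Flip a biased coin (p = 0.3) and watch the running mean approach p.
p = 0.3
flips = rng.binomial(1, p, size=100_000)
running_mean = np.cumsum(flips) / np.arange(1, flips.size + 1)

for n in (10, 100, 10_000, 100_000):
    print(f"n={n:>7}: sample mean = {running_mean[n - 1]:.4f} (true mean = {p})")
```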


Why it matters for ML

Reliability of training data – Justifies that with enough data, the empirical statistics (mean, variance) of features approximate the true underlying distribution.

Model performance estimation – Cross-validation and test accuracy stabilize with more samples because of LLN.

Monte Carlo methods & probabilistic ML – When estimating expectations by random sampling (e.g., in Bayesian inference, reinforcement learning), LLN ensures that the average of many samples converges to the expected value.

Why “more data beats clever algorithms” – LLN supports the idea that larger datasets reduce variance and improve generalization.


Importance for ML

⭐⭐⭐⭐⭐ (Critical) – Fundamental to why we trust that learning from large datasets works.


Estimator Quality in Statistics & ML

What is Estimation?

In statistics and ML, you’re often trying to estimate some true parameter of the data-generating process (like the mean ( \mu ), variance ( \sigma^2 ), regression coefficients ( \beta ), etc.) based on a sample of data.

You use an estimator (a function of the data) to give you this estimate.
Examples:

  • ( \bar{X} ) estimates ( \mu )
  • ( \hat{\theta} ) estimates ( \theta )

But… how do you know if your estimator is good?


Key Properties of Estimators

1. Bias

Measures how far the average estimate is from the true value.

  • Unbiased if ( \mathbb{E}[\hat{\theta}] = \theta )
  • Bias represents systematic error.

📌 Example: Sample mean ( \bar{X} ) is an unbiased estimator of population mean ( \mu ).


2. Variance

Measures how much the estimator varies from sample to sample.

  • High variance → estimator jumps around a lot.
  • Low variance → more stable estimates.

📌 Example: Using a small sample leads to higher variance in estimates.


3. Mean Squared Error (MSE)

Combines bias and variance into a single metric: ( \mathrm{MSE}(\hat{\theta}) = \mathbb{E}[(\hat{\theta} - \theta)^2] = \mathrm{Bias}(\hat{\theta})^2 + \mathrm{Var}(\hat{\theta}) ).

  • Helps you trade off between bias and variance.
  • Commonly used in ML as a loss function.

📌 Example: In regression, MSE is used to evaluate predictions.
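
A small simulation sketch (synthetic exponential data, arbitrary seed): estimate the bias, variance, and MSE of the sample mean as an estimator, and check the decomposition ( \mathrm{MSE} = \mathrm{Bias}^2 + \mathrm{Variance} ).

```python
import numpy as np

rng = np.random.default_rng(2)

# True parameter: the mean of an Exponential(scale=3) population.
theta = 3.0
n, n_trials = 20, 50_000

# Estimator: the sample mean, computed on many independent samples.
estimates = rng.exponential(scale=theta, size=(n_trials, n)).mean(axis=1)

bias = estimates.mean() - theta
variance = estimates.var()
mse = np.mean((estimates - theta) ** 2)

print(f"bias     ~ {bias:.4f}")      # ~ 0 (unbiased)
print(f"variance ~ {variance:.4f}")  # ~ sigma^2 / n = 9/20 = 0.45
print(f"MSE      ~ {mse:.4f} vs bias^2 + variance = {bias**2 + variance:.4f}")
```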


4. Consistency

Does the estimator converge to the true value as sample size increases?

  • All good estimators should be consistent in large samples.
  • Related to the Law of Large Numbers.

5. Efficiency (Advanced)

Among all unbiased estimators, is yours the one with smallest variance?

  • The Cramér-Rao Lower Bound defines the theoretical minimum variance an unbiased estimator can have.

How this applies to ML

| Concept | In ML Terms |
|---|---|
| Bias | Underfitting (too simple model) |
| Variance | Overfitting (too complex model) |
| MSE | Typical loss function (e.g., in regression) |
| Consistency | More data improves model reliability |
| Efficiency | Important in probabilistic modeling |

TL;DR Table

| Metric | Meaning | Formula |
|---|---|---|
| Bias | Average error from truth | ( \mathbb{E}[\hat{\theta}] - \theta ) |
| Variance | How much the estimate varies | ( \mathbb{E}[(\hat{\theta} - \mathbb{E}[\hat{\theta}])^2] ) |
| MSE | Total expected squared error | ( \mathbb{E}[(\hat{\theta} - \theta)^2] = \mathrm{Bias}^2 + \mathrm{Variance} ) |
| Consistency | Estimate gets closer to truth as ( n \to \infty ) | ( \hat{\theta}_n \xrightarrow{p} \theta ) |

Estimating Variance

1. Naive (Maximum Likelihood) Estimator

If you have a sample ( X_1, X_2, \dots, X_n ) from a population with variance ( \sigma^2 ), the natural estimator is:

( \hat{\sigma}^2_{\text{MLE}} = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2 )

This is the maximum likelihood estimator (MLE) of ( \sigma^2 ) (under a Gaussian model).

Problem: It is biased for finite samples — on average, it underestimates the true variance because we use the sample mean ( \bar{X} ) instead of the true mean ( \mu ).


2. Unbiased (Corrected) Estimator

To remove the bias, we adjust the denominator to ( n-1 ):

( S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2 )

This is the usual sample variance.

✔️ Property: ( \mathbb{E}[S^2] = \sigma^2 ), i.e., it is unbiased.

3. Why the Correction? (Intuition)

  • When you calculate ( \bar{X} ), you “use up” one degree of freedom to estimate the mean.
  • Therefore, the deviations ( X_i - \bar{X} ) are slightly smaller (on average) than deviations from the true mean ( X_i - \mu ).
  • Dividing by ( n ) underestimates the variance → Bessel’s correction (divide by ( n-1 )) fixes this.
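
A quick simulation sketch (synthetic normal data, arbitrary parameters) comparing the ( 1/n ) and ( 1/(n-1) ) estimators; NumPy's ddof argument switches between the two.

```python
import numpy as np

rng = np.random.default_rng(3)

# Small samples from a population with known variance sigma^2 = 4.
sigma2 = 4.0
n, n_trials = 5, 200_000
samples = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(n_trials, n))

var_mle = samples.var(axis=1, ddof=0)       # divide by n   (MLE, biased)
var_unbiased = samples.var(axis=1, ddof=1)  # divide by n-1 (Bessel's correction)

print("true variance:            ", sigma2)
print("mean of 1/n estimator:    ", var_mle.mean())       # ~ sigma^2 * (n-1)/n = 3.2
print("mean of 1/(n-1) estimator:", var_unbiased.mean())  # ~ 4.0
```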

4. Which Estimator is “Good”?

| Estimator | Bias | Variance | When to Use |
|---|---|---|---|
| MLE ( (1/n) ) | Biased (underestimates) | Lower variance | Preferred in ML & large ( n ) (bias negligible) |
| Unbiased ( (1/(n-1)) ) | Unbiased | Slightly higher variance | Preferred in classical statistics & small ( n ) |

5. Importance in ML

  • In Machine Learning, we often use the MLE ((1/n)) because:
    • Datasets are large → bias negligible.
    • MLE has nice mathematical properties (consistency, asymptotic normality).
  • In Statistics, especially with small samples, we prefer the unbiased ((1/(n-1))) version.

Maximum Likelihood Estimation (MLE)

For a video explanation: StatQuest

1. What is MLE?

Maximum Likelihood Estimation is a method to find the parameter values (( \theta )) of a statistical model that make the observed data most probable.

Pick the parameter values that maximize the probability (likelihood) of seeing the data you observed.


2. Formal Definition

Given data ( X_1, X_2, \dots, X_n ) assumed to come from a distribution with parameter ( \theta ):

Likelihood function:

( L(\theta) = P(X_1, X_2, \dots, X_n \mid \theta) )

For independent samples:

( L(\theta) = \prod_{i=1}^{n} p(X_i \mid \theta) )

The maximum likelihood estimator is:

( \hat{\theta}_{\text{MLE}} = \arg\max_{\theta} L(\theta) )

Often, we maximize the log-likelihood:

( \ell(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log p(X_i \mid \theta) )

Then solve:

( \frac{\partial \ell(\theta)}{\partial \theta} = 0 )

3. Intuition

  • The “best” parameters should make your observed data as likely as possible under the assumed distribution.

📌 Example: Estimating a coin’s probability of heads ( p ) → pick ( p ) that maximizes the probability of seeing your flips.
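
A sketch of how this looks numerically (illustrative; assumes a Gaussian model and synthetic data): minimize the negative log-likelihood with scipy.optimize.minimize and compare against the closed-form Gaussian MLEs.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(4)
data = rng.normal(loc=5.0, scale=2.0, size=1_000)  # "observed" data

def neg_log_likelihood(params):
    mu, log_sigma = params  # optimize log(sigma) so sigma stays positive
    return -np.sum(norm.logpdf(data, loc=mu, scale=np.exp(log_sigma)))

result = minimize(neg_log_likelihood, x0=[0.0, 0.0])
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])

# Closed-form Gaussian MLE: sample mean and the 1/n standard deviation.
print("numeric MLE:    ", mu_hat, sigma_hat)
print("closed-form MLE:", data.mean(), data.std(ddof=0))
```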


4. Properties of MLE

  • Consistency:
    ( \hat{\theta}_n \xrightarrow{p} \theta ) as ( n \to \infty ).

  • Asymptotic normality:

    ( \sqrt{n}\,(\hat{\theta}_n - \theta) \xrightarrow{d} \mathcal{N}\left(0, I(\theta)^{-1}\right) )

    where ( I(\theta) ) = Fisher Information.

  • Efficiency:
    Asymptotically achieves the lowest possible variance (the Cramér-Rao bound) for large ( n ).

  • Not always unbiased:
    Can be biased for small samples.


5. What is MLE Used For?

  • Estimating parameters (mean, variance, probabilities).
  • Machine Learning models:
    • Linear/Logistic Regression (likelihood maximization ≈ minimizing MSE / cross-entropy)
    • Naive Bayes, Gaussian Mixture Models, HMMs
  • Basis for probabilistic and Bayesian methods (MLE vs MAP).

6. Simple Example (Coin Flip)

You flip a coin ( n ) times and observe ( k ) heads.

Likelihood (Bernoulli):

( L(p) = p^{k} (1-p)^{n-k} )

Log-likelihood:

( \ell(p) = k \log p + (n-k) \log(1-p) )

Differentiate & set to zero:

( \frac{\partial \ell}{\partial p} = \frac{k}{p} - \frac{n-k}{1-p} = 0 )

Solve:

( \hat{p} = \frac{k}{n} )

(intuitive: the proportion of heads)
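
A tiny sketch (the counts ( n = 100 ), ( k = 37 ) are made up) that maximizes the Bernoulli log-likelihood over a grid of ( p ) values and confirms the maximum sits at ( k/n ).

```python
import numpy as np

# Observed flips: k heads out of n.
n, k = 100, 37

# Log-likelihood of the Bernoulli data as a function of p.
p_grid = np.linspace(0.001, 0.999, 999)
log_lik = k * np.log(p_grid) + (n - k) * np.log(1 - p_grid)

p_hat_grid = p_grid[np.argmax(log_lik)]
print("grid-search MLE:", p_hat_grid)  # ~ 0.37
print("closed-form k/n:", k / n)       # 0.37
```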

New Ones

🔍 Maximum Likelihood Estimation (MLE) in Machine Learning

What is MLE?

  • MLE estimates model parameters that maximize the probability of observed data.

  • Given data ( X_1, \dots, X_n ), the likelihood function is:

    ( L(\theta) = \prod_{i=1}^{n} p(X_i \mid \theta) )

  • Instead of maximizing ( L(\theta) ) directly, we usually maximize the log-likelihood for convenience:

    ( \ell(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log p(X_i \mid \theta) )

Key Properties of MLE

  1. Asymptotic unbiasedness:
    As ( n \to \infty ), ( \mathbb{E}[\hat{\theta}_n] \to \theta ), i.e., the MLE becomes unbiased on average.

  2. Consistency:
    ( \hat{\theta}_n ) converges in probability to the true ( \theta ) as ( n \to \infty ).

  3. Asymptotic Normality:
    The normalized error is approximately normal:

    ( \sqrt{n}\,(\hat{\theta}_n - \theta) \xrightarrow{d} \mathcal{N}\left(0, I(\theta)^{-1}\right) )

Why MLE Matters in ML

  • Many models (e.g., logistic regression) use MLE to fit parameters.
  • Understanding MLE helps explain how models learn from data.

📌 Tip: Maximizing the log-likelihood is equivalent to minimizing the negative log-likelihood, which is the loss function many models actually optimize (e.g., cross-entropy).

Hypothesis Testing

  • ( H_0 ) (Null Hypothesis): The default assumption; typically states there is no effect or no difference.
  • ( H_1 ) (Alternative Hypothesis): What you want to test for; usually states there is an effect or difference.

Z-Test

  • Hypothesis Testing Problems - Z Test & T Statistics - One & Two Tailed Tests 2

  • t-test vs z-test

  • Used to determine if there is a significant difference between sample and population means (or between two samples) when the population variance is known.

  • Test statistic formula:

    ( z = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}} )

    where:
    ( \bar{X} ) = sample mean,
    ( \mu_0 ) = population mean under ( H_0 ),
    ( \sigma ) = population standard deviation,
    ( n ) = sample size.

  • The ( z ) value is compared to critical values from the standard normal distribution to reject or fail to reject ( H_0 ).
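
A sketch of the z-test (the sample values, ( \mu_0 = 100 ), and ( \sigma = 15 ) are made-up numbers): compute the z statistic and a two-tailed p-value with scipy.stats.norm.

```python
import numpy as np
from scipy.stats import norm

# Known population parameters under H0.
mu0, sigma = 100.0, 15.0
sample = np.array([108, 112, 99, 105, 117, 101, 110, 96, 104, 109], dtype=float)

n = sample.size
z = (sample.mean() - mu0) / (sigma / np.sqrt(n))
p_value = 2 * (1 - norm.cdf(abs(z)))  # two-tailed test

print(f"z = {z:.3f}, two-tailed p-value = {p_value:.4f}")
print("reject H0 at alpha = 0.05" if p_value < 0.05 else "fail to reject H0")
```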


Sensitivity (True Positive Rate)

  • Measures the ability of a test or classifier to correctly identify positive cases.

  • High sensitivity means few false negatives (good at detecting positives).


Fisher’s Exact Test

  • A statistical significance test used for small sample sizes and categorical data.
  • Tests for nonrandom association between two categorical variables in a contingency table (often 2x2).
  • Calculates the exact probability of observing the data assuming ( H_0 ) is true.
  • Useful when sample sizes are too small for chi-square tests to be valid.
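
A minimal example using scipy.stats.fisher_exact on a made-up 2x2 table (the counts are illustrative only).

```python
from scipy.stats import fisher_exact

# 2x2 contingency table: rows = treatment/control, columns = improved/not improved.
table = [[8, 2],
         [1, 9]]

odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"odds ratio = {odds_ratio:.2f}, p-value = {p_value:.4f}")
```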

📌 Summary:
Hypothesis testing allows making decisions based on data. Z-tests handle mean comparisons with known variance. Sensitivity assesses detection ability, and Fisher’s Exact Test analyzes categorical associations, especially with small data.

Statistical Testing Concepts


p-value

  • The p-value is the probability of observing data as extreme (or more) as the current sample, assuming the null hypothesis ( H_0 ) is true.
  • It quantifies the evidence against ( H_0 ).
  • Small p-values mean strong evidence to reject ( H_0 ).

p-value Cut-offs (Significance Levels)

  • Common thresholds for rejecting ( H_0 ):
    • 0.05 (5%): Typical cutoff; reject ( H_0 ) if ( p < 0.05 )
    • 0.01 (1%): Stricter cutoff; reject ( H_0 ) if ( p < 0.01 )
  • These cutoffs are called significance levels ( \alpha ).

( \chi^2 ) (Chi-Squared) Test

  • Chi-Squared by CrashCourse

  • Tests whether there is a significant association between categorical variables.

  • Compares observed counts with expected counts under ( H_0 ).

  • Test statistic:

    ( \chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i} )

    where ( O_i ) = observed frequency, ( E_i ) = expected frequency.

  • The statistic follows a chi-squared distribution with degrees of freedom ( = (\text{rows} - 1) \times (\text{columns} - 1) ) for a contingency table.
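
A short example using scipy.stats.chi2_contingency on made-up counts; it returns the statistic, p-value, degrees of freedom, and the expected counts under ( H_0 ).

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts: rows = group A/B, columns = outcome yes/no.
observed = np.array([[30, 70],
                     [45, 55]])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.3f}, dof = {dof}, p-value = {p_value:.4f}")
print("expected counts under H0:\n", expected)
```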


t-test

  • Compares means when the population variance is unknown; the sample standard deviation is used instead, and the statistic follows a t-distribution.
  • Common variants: one-sample, two-sample, and paired t-tests, especially useful with small samples.
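
A minimal two-sample t-test sketch on synthetic data (group means, spread, and seed are arbitrary) using scipy.stats.ttest_ind.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(5)
group_a = rng.normal(loc=10.0, scale=2.0, size=30)
group_b = rng.normal(loc=11.0, scale=2.0, size=30)

t_stat, p_value = ttest_ind(group_a, group_b)  # two-sample t-test (equal variances by default)
print(f"t = {t_stat:.3f}, p-value = {p_value:.4f}")
```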

📌 Summary:

  • Use p-values and significance levels to decide whether to reject ( H_0 ).
  • The ( \chi^2 ) test is for categorical data associations.
  • The t-test compares means when variance is unknown.

🚫 Non-Parametric Tests

  • Parametric and Nonparametric Tests by DATAtab
  • Non-parametric tests do not assume a specific distribution for the data.
  • Useful when data violates assumptions of parametric tests (e.g., normality).
  • Examples: Mann-Whitney U test, Wilcoxon signed-rank test, permutation tests.

🔄 Permutation Test

  • A non-parametric method to test hypotheses by randomly shuffling labels on data points.
  • Measures how likely an observed effect is under the null hypothesis by comparing it to a distribution of effects from shuffled data.
  • Useful for small samples or unknown distributions.
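
A sketch of a two-sample permutation test on synthetic data (group sizes, means, and seed are arbitrary): shuffle the pooled values, recompute the difference in means, and compare the observed difference to that shuffled distribution.

```python
import numpy as np

rng = np.random.default_rng(6)
group_a = rng.normal(10.0, 2.0, size=20)
group_b = rng.normal(11.5, 2.0, size=20)

observed_diff = group_a.mean() - group_b.mean()
pooled = np.concatenate([group_a, group_b])

n_perm = 10_000
diffs = np.empty(n_perm)
for i in range(n_perm):
    shuffled = rng.permutation(pooled)  # randomly reassign group labels
    diffs[i] = shuffled[:20].mean() - shuffled[20:].mean()

# Two-sided p-value: fraction of shuffled differences at least as extreme as observed.
p_value = np.mean(np.abs(diffs) >= abs(observed_diff))
print(f"observed diff = {observed_diff:.3f}, permutation p-value = {p_value:.4f}")
```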

🔢 Multiple Hypothesis Testing

  • When testing many hypotheses simultaneously, the chance of false positives (Type I errors) increases.
  • Without correction, if you test 100 hypotheses at ( \alpha = 0.05 ), about 5 may appear significant by chance.

🎯 Adjusting p-values: Bonferroni Correction

  • A simple and conservative method to control family-wise error rate.

  • Adjusted significance level:

    ( \alpha_{\text{adjusted}} = \frac{\alpha}{m} )

    where ( m ) = number of tests.

  • Reject ( H_0 ) for test ( i ) only if:

    ( p_i < \frac{\alpha}{m} )
  • Controls false positives but can be overly strict, increasing false negatives.
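
A tiny sketch (the p-values are made up) applying the Bonferroni-adjusted threshold ( \alpha / m ).

```python
import numpy as np

p_values = np.array([0.001, 0.01, 0.03, 0.04, 0.20])
alpha, m = 0.05, len(p_values)

# Bonferroni: compare each p-value to alpha / m (equivalently, multiply each p by m).
reject = p_values < alpha / m
print("adjusted threshold:", alpha / m)
print("reject H0 per test:", reject)  # only the smallest p-values survive
```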


📌 Summary:

  • Use permutation tests when data distributions are unknown.
  • Be cautious with multiple testing; adjust p-values to avoid false positives.
  • Bonferroni is simple but conservative.

Linear Regression: Main Concepts

0. Watch These

1. Least Squares Fit

  • Find the line (model) that minimizes the sum of squared residuals.
  • Residual: the vertical distance between a data point and the regression line.

2. Residuals and Sum of Squares

  • Residuals:

    ( e_i = y_i - \hat{y}_i )

    where ( y_i ) is the actual value and ( \hat{y}_i ) is the predicted value.

  • Sum of Squares Total (SST): variation around the mean (total variability in the data), ( \sum_i (y_i - \bar{y})^2 )

  • Sum of Squares Regression (SSR) / Fit: variation explained by the model, ( \sum_i (\hat{y}_i - \bar{y})^2 )

  • Sum of Squares Residuals (SSE): variation unexplained by the model, ( \sum_i (y_i - \hat{y}_i)^2 )
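
A NumPy sketch on synthetic data: fit a least-squares line with np.polyfit, compute SST, SSR, and SSE, and check that ( R^2 = \mathrm{SSR}/\mathrm{SST} ).

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(0, 10, size=50)
y = 3.0 * x + 5.0 + rng.normal(0, 2.0, size=50)

# Least-squares fit of a straight line.
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = slope * x + intercept

sst = np.sum((y - y.mean()) ** 2)      # total variation
ssr = np.sum((y_hat - y.mean()) ** 2)  # explained by the fit
sse = np.sum((y - y_hat) ** 2)         # residual (unexplained)

print(f"SST = {sst:.1f} ~ SSR + SSE = {ssr + sse:.1f}")
print(f"R^2 = SSR/SST = {ssr / sst:.3f}")
```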


3. Coefficient of Determination ( (R^2) )

  • R-squared, Clearly Explained!!! by StatQuest

  • Measures how much of the total variation in the dependent variable is explained by the model:

    ( R^2 = \frac{\text{SSR}}{\text{SST}} = 1 - \frac{\text{SSE}}{\text{SST}} )

  • Interpretation:
    • ( R^2 = 0 ) means the model explains none of the variation.
    • ( R^2 = 1 ) means the model explains all of the variation.

4. p-value for ( R^2 )

  • Tests whether the relationship captured by ( R^2 ) is statistically significant.
  • A small p-value indicates that the model explains a significant portion of the variance, i.e., the fit is unlikely to be due to random chance.

5. Direction of Regression

  • Regression can be run in either direction: predicting ( y ) from ( x ) or ( x ) from ( y ).
  • The fitted line and coefficients differ depending on the direction; for simple linear regression, the ( R^2 ) value is the same either way.

📌 Summary:
Linear regression fits a line that minimizes the squared residuals; ( R^2 ) quantifies the explained variance; p-values test whether the fit is significant.

Questions About Linear Regression

🧮 1. Does Adding Parameters Always Help?

No, not necessarily.

📌 Key Idea:

  • Adding more predictors (features) to a linear regression never decreases the training R² (coefficient of determination).
  • But it can worsen test performance (generalization) due to overfitting.

✍️ Example:

If you have a model

( y = \beta_0 + \beta_1 x_1 )

and you add another feature ( x_2 ), the model becomes:

( y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 )
This will always fit the training data at least as well, but the new parameter might just be fitting noise, not signal — hurting performance on new data.

So, more parameters = more flexible, but not always better for prediction.
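
A sketch (synthetic data, arbitrary seed) showing that adding a pure-noise feature cannot lower the training ( R^2 ), even though it carries no signal.

```python
import numpy as np

def r_squared(X, y, coef):
    y_hat = X @ coef
    return 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(8)
n = 40
x1 = rng.uniform(0, 10, size=n)
y = 2.0 * x1 + 1.0 + rng.normal(0, 3.0, size=n)
noise_feature = rng.normal(size=n)  # pure noise, unrelated to y

X_small = np.column_stack([np.ones(n), x1])
X_big = np.column_stack([np.ones(n), x1, noise_feature])

coef_small, *_ = np.linalg.lstsq(X_small, y, rcond=None)
coef_big, *_ = np.linalg.lstsq(X_big, y, rcond=None)

print("training R^2, 1 feature:      ", r_squared(X_small, y, coef_small))
print("training R^2, + noise feature:", r_squared(X_big, y, coef_big))  # never lower
```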


📈 2. What is the F-statistic in Linear Regression?

🎯 Purpose:

Used to test whether the regression model explains a significant amount of variance in the dependent variable.

⚙️ Intuition:

  • It compares two things:
    • Explained variance (how much the model explains)
    • Unexplained variance (residuals/noise)
  • If the explained variance is much higher than unexplained, the model is statistically significant.

📐 Formula:

( F = \frac{\text{SSR} / p}{\text{SSE} / (n - p - 1)} )

Where:

  • SSR: Regression Sum of Squares (explained)
  • SSE: Error Sum of Squares (unexplained)
  • p: number of predictors
  • n: number of data points

A high F value → model is better than a null model (just the mean).
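
A sketch (synthetic data) computing the F-statistic from SSR and SSE for a one-predictor fit and getting its p-value from scipy.stats.f.

```python
import numpy as np
from scipy.stats import f as f_dist

rng = np.random.default_rng(9)
n, p = 50, 1
x = rng.uniform(0, 10, size=n)
y = 1.5 * x + 4.0 + rng.normal(0, 2.0, size=n)

slope, intercept = np.polyfit(x, y, deg=1)
y_hat = slope * x + intercept

ssr = np.sum((y_hat - y.mean()) ** 2)  # explained
sse = np.sum((y - y_hat) ** 2)         # unexplained

F = (ssr / p) / (sse / (n - p - 1))
p_value = f_dist.sf(F, p, n - p - 1)   # right-tail probability of the F distribution
print(f"F = {F:.2f}, p-value = {p_value:.2e}")
```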


🧮 3. Degrees of Freedom (DoF)

https://www.youtube.com/watch?v=Cm0vFoGVMB8&ab_channel=CrashCourse

In Regression:

  • Total DoF = ( n - 1 )

  • Regression DoF = ( p ) (number of predictors)

  • Residual DoF = ( n - p - 1 )

🔍 Why It Matters:

  • Degrees of freedom are used to standardize variance (mean squares) so you can compare them — like in the F-statistic.

  • More parameters = fewer residual degrees of freedom → which means you’re using up data to fit the model.


🔁 Summary

| Concept | Meaning |
|---|---|
| Adding features | Can increase training R², but risks overfitting. |
| F-statistic | Tests whether the model explains significant variation. |
| Explained vs. unexplained | SSR vs. SSE (signal vs. noise). |
| Degrees of freedom | Track how much data is used for estimating vs. testing. |