(Content mostly generated by ChatGPT)
Central Limit Theorem (CLT)
What it is:
The CLT states that if you take many random samples from any population (with finite variance) and compute their means, the distribution of those sample means will approach a normal (Gaussian) distribution as the sample size increases, regardless of the original population’s distribution.
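Formally, for i.i.d. samples with sample mean ( \bar{X}_n ):
$$ \sqrt{n}\,(\bar{X}_n - \mu) \xrightarrow{d} \mathcal{N}(0, \sigma^2) $$
where ( \mu ) is the population mean, ( \sigma^2 ) is the population variance, and ( n ) is the sample size. Equivalently, for large ( n ), ( \bar{X}_n ) is approximately ( \mathcal{N}(\mu, \sigma^2 / n) ).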
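A minimal simulation sketch of the CLT, assuming an Exponential(1) population (an arbitrary skewed choice with mean 1 and standard deviation 1). The function name `sample_means` and all parameter values are illustrative, not from the original notes.

```python
import random
import statistics

def sample_means(n, num_samples, seed=0):
    """Draw `num_samples` samples of size `n` from Exponential(1); return their means."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.expovariate(1.0) for _ in range(n))
        for _ in range(num_samples)
    ]

# Exponential(1) is strongly skewed, with mu = 1 and sigma = 1 (assumed population).
n = 50
means = sample_means(n, num_samples=5000)

# The CLT predicts the sample means are roughly N(mu, sigma^2 / n).
print("mean of sample means:", statistics.fmean(means))  # close to 1
print("std of sample means: ", statistics.stdev(means))  # close to 1 / sqrt(50)

# Under normality, about 95% of the means fall within 1.96 standard errors of mu.
standard_error = 1 / n ** 0.5
within = sum(abs(m - 1.0) <= 1.96 * standard_error for m in means) / len(means)
print("fraction within 1.96 SE:", within)
```

Even though the individual draws are far from normal, the spread of the means should roughly match ( \sigma / \sqrt{n} ).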
Why it matters for ML
✅ Parameter estimation & confidence intervals – When you estimate model parameters (like weights, means, etc.), CLT justifies using normal approximations for those estimates.
✅ Hypothesis testing & A/B testing – Many statistical tests assume normality, and CLT explains why we can use those tests even if the original data isn’t normal.
✅ Feature engineering & averaging – Many ensemble methods (like bagging, random forests) rely on averaging predictions; averaging ( n ) roughly independent predictions shrinks the variance (by about ( 1/n )), and the CLT explains why the averaged output tends to be stable and approximately normal.
✅ Understanding loss surfaces & convergence – A mini-batch gradient in Stochastic Gradient Descent (SGD) is an average over sampled examples, so its noise is often approximately normal due to CLT when batches are reasonably large.
Importance for ML
⭐⭐⭐⭐⭐ (Critical) – It’s a foundational concept that underpins statistical inference, which is everywhere in ML, especially in probabilistic models, evaluation, and uncertainty estimation.
Law of Large Numbers (LLN)
What it is:
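The LLN states that as the sample size ( n ) increases, the sample mean ( \bar{X}_n ) converges to the population mean ( \mu ).
Formally (weak law):
$$ \lim_{n \to \infty} P\left( |\bar{X}_n - \mu| > \epsilon \right) = 0 $$
for any small ( \epsilon > 0 ).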
In simpler words: the more data you collect, the closer your sample average gets to the true average.
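A small sketch of this convergence, assuming a fair six-sided die (true mean 3.5). The helper `running_means` is a hypothetical name used only for illustration.

```python
import random

def running_means(num_rolls, seed=0):
    """Return the running average after each roll of a fair six-sided die."""
    rng = random.Random(seed)
    total = 0
    means = []
    for i in range(1, num_rolls + 1):
        total += rng.randint(1, 6)
        means.append(total / i)
    return means

means = running_means(100_000)

# The gap to the true mean (3.5) tends to shrink as n grows.
for n in (10, 100, 1_000, 10_000, 100_000):
    print(n, abs(means[n - 1] - 3.5))
```

The gap does not have to shrink at every step, but it becomes small with high probability as ( n ) grows.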
Why it matters for ML
✅ Reliability of training data – Justifies that with enough data, the empirical statistics (mean, variance) of features approximate the true underlying distribution.
✅ Model performance estimation – Cross-validation and test accuracy stabilize with more samples because of LLN.
✅ Monte Carlo methods & probabilistic ML – When estimating expectations by random sampling (e.g., in Bayesian inference, reinforcement learning), LLN ensures that the average of many samples converges to the expected value.
✅ Why “more data beats clever algorithms” – LLN supports the idea that larger datasets reduce variance and improve generalization.
Importance for ML
⭐⭐⭐⭐⭐ (Critical) – Fundamental to why we trust that learning from large datasets works.
Estimator Quality in Statistics & ML
✅ What is Estimation?
In statistics and ML, you’re often trying to estimate some true parameter of the data-generating process (like the mean ( \mu ), variance ( \sigma^2 ), regression coefficients ( \beta ), etc.) based on a sample of data.
You use an estimator (a function of the data) to give you this estimate.
Examples:
- ( \bar{X} ) estimates ( \mu )
- ( \hat{\theta} ) estimates ( \theta )
But… how do you know if your estimator is good?
✅ Key Properties of Estimators
1. Bias
Measures how far the average estimate is from the true value.
- Unbiased if ( \mathbb{E}[\hat{\theta}] = \theta )
- Bias represents systematic error.
📌 Example: Sample mean ( \bar{X} ) is an unbiased estimator of population mean ( \mu ).
2. Variance
Measures how much the estimator varies from sample to sample.
- High variance → estimator jumps around a lot.
- Low variance → more stable estimates.
📌 Example: Using a small sample leads to higher variance in estimates.
3. Mean Squared Error (MSE)
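Combines bias and variance into a single metric:
$$ \mathrm{MSE}(\hat{\theta}) = \mathbb{E}[(\hat{\theta} - \theta)^2] = \mathrm{Bias}(\hat{\theta})^2 + \mathrm{Var}(\hat{\theta}) $$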
- Helps you trade off between bias and variance.
- Commonly used in ML as a loss function.
📌 Example: In regression, MSE is used to evaluate predictions.
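To make bias, variance, and MSE concrete, here is a simulation sketch. It assumes a Gaussian population with ( \mu = 2 ), ( \sigma = 1 ), and a deliberately biased estimator (0.8 times the sample mean). These are illustrative choices, and `simulate_estimates` and `shrunk_mean` are hypothetical names. It checks numerically that MSE = Bias² + Variance.

```python
import random
import statistics

def simulate_estimates(estimator, mu=2.0, sigma=1.0, n=10, trials=20_000, seed=0):
    """Apply `estimator` to many independent Gaussian samples of size n."""
    rng = random.Random(seed)
    return [
        estimator([rng.gauss(mu, sigma) for _ in range(n)])
        for _ in range(trials)
    ]

def shrunk_mean(xs):
    """A deliberately biased estimator of the mean: shrinks the sample mean toward 0."""
    return 0.8 * statistics.fmean(xs)

mu = 2.0
estimates = simulate_estimates(shrunk_mean, mu=mu)

bias = statistics.fmean(estimates) - mu          # systematic error (about -0.4 here)
variance = statistics.pvariance(estimates)       # spread across repeated samples
mse = statistics.fmean((e - mu) ** 2 for e in estimates)

print("bias:", bias)
print("variance:", variance)
print("mse:", mse, "bias^2 + variance:", bias ** 2 + variance)
```

Shrinking lowers the variance (0.64 times that of the plain sample mean) at the cost of bias, which is the trade-off MSE summarizes.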
4. Consistency
Does the estimator converge to the true value as sample size increases?
- Consistency is a minimal requirement for a good estimator: with enough data, it should recover the true value.
- Related to the Law of Large Numbers.
5. Efficiency (Advanced)
Among all unbiased estimators, is yours the one with smallest variance?
- The Cramér-Rao Lower Bound defines the theoretical minimum variance an unbiased estimator can have.
✅ How this applies to ML
| Concept | In ML Terms | 
|---|---|
| Bias | Underfitting (too simple model) | 
| Variance | Overfitting (too complex model) | 
| MSE | Typical loss function (e.g., in regression) | 
| Consistency | More data improves model reliability | 
| Efficiency | Important in probabilistic modeling | 
✅ TL;DR Table
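| Metric | Meaning | Formula |
|---|---|---|
| Bias | Average error from truth | ( \mathbb{E}[\hat{\theta}] - \theta ) |
| Variance | How much estimate varies | ( \mathbb{E}[(\hat{\theta} - \mathbb{E}[\hat{\theta}])^2] ) |
| MSE | Total expected squared error | ( \mathbb{E}[(\hat{\theta} - \theta)^2] = \text{Bias}^2 + \text{Variance} ) |
| Consistency | Estimate gets closer to truth as ( n \to \infty ) | ( \hat{\theta}_n \xrightarrow{p} \theta ) |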
Estimating Variance
✅ 1. Naive (Maximum Likelihood) Estimator
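If you have a sample ( X_1, X_2, …, X_n ) from a population with variance ( \sigma^2 ), the natural estimator is:
$$ \hat{\sigma}^2_{\text{MLE}} = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2 $$
This is the maximum likelihood estimator (MLE) of ( \sigma^2 ) under a Gaussian model.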
❌ Problem: It is biased for finite samples — on average, it underestimates the true variance because we use the sample mean ( \bar{X} ) instead of the true mean ( \mu ).
✅ 2. Unbiased (Corrected) Estimator
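To remove the bias, we adjust the denominator to ( n-1 ):
$$ S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2 $$
This is the usual sample variance.
✔️ Property: ( \mathbb{E}[S^2] = \sigma^2 ), whereas the ( 1/n ) estimator has expectation ( \frac{n-1}{n} \sigma^2 ).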
✅ 3. Why the Correction? (Intuition)
- When you calculate ( \bar{X} ), you “use up” one degree of freedom to estimate the mean.
- Therefore, the deviations ( X_i - \bar{X} ) are slightly smaller (on average) than deviations from the true mean ( X_i - \mu ).
- Dividing by ( n ) underestimates the variance → Bessel’s correction (divide by ( n-1 )) fixes this.
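The bias can be checked by simulation. This sketch assumes Gaussian data with true variance 4 and a small sample size ( n = 5 ), and the function name `average_variance_estimates` is hypothetical. It averages both estimators over many repeated samples.

```python
import random

def average_variance_estimates(n=5, trials=50_000, sigma=2.0, seed=0):
    """Average the 1/n and 1/(n-1) variance estimators over many Gaussian samples."""
    rng = random.Random(seed)
    biased_total = 0.0
    unbiased_total = 0.0
    for _ in range(trials):
        xs = [rng.gauss(0.0, sigma) for _ in range(n)]
        mean = sum(xs) / n
        squared_devs = sum((x - mean) ** 2 for x in xs)
        biased_total += squared_devs / n          # MLE: divide by n
        unbiased_total += squared_devs / (n - 1)  # Bessel's correction: divide by n - 1
    return biased_total / trials, unbiased_total / trials

biased_avg, unbiased_avg = average_variance_estimates()
print("true variance:      4.0")
print("average 1/n:       ", biased_avg)    # near (n - 1) / n * 4 = 3.2
print("average 1/(n - 1): ", unbiased_avg)  # near 4.0
```

With ( n = 5 ) the ( 1/n ) version lands near 80% of the true variance; the gap shrinks as ( n ) grows.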
✅ 4. Which Estimator is “Good”?
| Estimator | Bias | Variance | When to Use | 
|---|---|---|---|
| MLE ((1/n)) | Biased (underestimates) | Lower variance | Preferred in ML & large (n) (bias negligible) | 
| Unbiased ((1/(n-1))) | Unbiased | Slightly higher variance | Preferred in classical statistics & small (n) | 
✅ 5. Importance in ML
- In Machine Learning, we often use the MLE ((1/n)) because:
  - Datasets are large → bias negligible.
  - MLE has nice mathematical properties (consistency, asymptotic normality).
- In Statistics, especially with small samples, we prefer the unbiased ((1/(n-1))) version.
Maximum Likelihood Estimation (MLE)
For a video explanation: StatQuest
✅ 1. What is MLE?
Maximum Likelihood Estimation is a method to find the parameter values (( \theta )) of a statistical model that make the observed data most probable.
Pick the parameter values that maximize the probability (likelihood) of seeing the data you observed.
✅ 2. Formal Definition
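Given data ( x_1, \dots, x_n ) assumed to come from a distribution with parameter ( \theta ):
Likelihood function:
$$ L(\theta) = P(x_1, \dots, x_n \mid \theta) $$
For independent samples:
$$ L(\theta) = \prod_{i=1}^{n} f(x_i \mid \theta) $$
The maximum likelihood estimator is:
$$ \hat{\theta}_{\text{MLE}} = \arg\max_{\theta} L(\theta) $$
Often, we maximize the log-likelihood:
$$ \ell(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log f(x_i \mid \theta) $$
Then solve:
$$ \frac{\partial \ell(\theta)}{\partial \theta} = 0 $$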
✅ 3. Intuition
- The “best” parameters should make your observed data as likely as possible under the assumed distribution.
📌 Example: Estimating a coin’s probability of heads ( p ) → pick ( p ) that maximizes the probability of seeing your flips.
✅ 4. Properties of MLE
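- Consistency: ( \hat{\theta}_n \xrightarrow{p} \theta ) as ( n \to \infty ).
- Asymptotic normality: ( \sqrt{n}\,(\hat{\theta}_n - \theta) \xrightarrow{d} \mathcal{N}(0, I(\theta)^{-1}) ), where ( I(\theta) ) = Fisher Information (per observation).
- Efficiency: Achieves the lowest possible variance (Cramér-Rao bound) for large ( n ).
- Not always unbiased: Can be biased for small samples.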
✅ 5. What is MLE Used For?
- Estimating parameters (mean, variance, probabilities).
- Machine Learning models:
  - Linear/Logistic Regression (maximizing likelihood is equivalent to minimizing MSE for linear regression with Gaussian noise, and to minimizing cross-entropy for logistic regression)
  - Naive Bayes, Gaussian Mixture Models, HMMs
- Basis for probabilistic and Bayesian methods (MLE vs MAP).
✅ 6. Simple Example (Coin Flip)
You flip a coin ( n ) times, observe ( k ) heads.
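Likelihood (Bernoulli):
$$ L(p) = p^k (1-p)^{n-k} $$
Log-likelihood:
$$ \ell(p) = k \log p + (n-k) \log(1-p) $$
Differentiate & set to zero:
$$ \frac{d\ell}{dp} = \frac{k}{p} - \frac{n-k}{1-p} = 0 $$
Solve:
$$ \hat{p} = \frac{k}{n} $$
(intuitive: proportion of heads)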
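The closed-form answer can be sanity-checked numerically. This sketch scans a grid of candidate values of ( p ) and picks the one with the highest Bernoulli log-likelihood. The counts (13 heads in 40 flips) and the helper names `log_likelihood` and `grid_mle` are made up for illustration; a grid search is a swapped-in stand-in for the calculus above, not how MLE is usually computed.

```python
import math

def log_likelihood(p, n, k):
    """Bernoulli log-likelihood of k heads in n flips at heads-probability p."""
    return k * math.log(p) + (n - k) * math.log(1 - p)

def grid_mle(n, k, steps=10_000):
    """Approximate the maximizer by scanning an evenly spaced grid in (0, 1)."""
    grid = [i / steps for i in range(1, steps)]
    return max(grid, key=lambda p: log_likelihood(p, n, k))

n, k = 40, 13  # hypothetical data: 13 heads in 40 flips
p_hat = grid_mle(n, k)
print("grid search:", p_hat)
print("closed form k / n:", k / n)
```

The two values agree up to the grid resolution, matching the proportion of heads.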
New Ones
🔍 Maximum Likelihood Estimation (MLE) in Machine Learning
What is MLE?
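- MLE estimates model parameters that maximize the probability of observed data.
- Given data ( x_1, \dots, x_n ), the likelihood function (for independent samples) is:
$$ L(\theta) = \prod_{i=1}^{n} f(x_i \mid \theta) $$
- Instead of maximizing directly, we usually maximize the log-likelihood for convenience:
$$ \ell(\theta) = \sum_{i=1}^{n} \log f(x_i \mid \theta) $$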
Key Properties of MLE
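- Asymptotic unbiasedness: As the number of samples ( n \to \infty ), the MLE is unbiased: ( \mathbb{E}[\hat{\theta}_n] \to \theta ), i.e., it equals the true parameter on average.
- Consistency: ( \hat{\theta}_n ) converges in probability to the true ( \theta ) as ( n \to \infty ).
- Asymptotic Normality: The normalized error is approximately normal:
$$ \sqrt{n}\,(\hat{\theta}_n - \theta) \xrightarrow{d} \mathcal{N}(0, I(\theta)^{-1}) $$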
Why MLE Matters in ML
- Many models (e.g., logistic regression) use MLE to fit parameters.
- Understanding MLE helps explain how models learn from data.
📌 Tip: Maximizing the log-likelihood is equivalent to minimizing the negative log-likelihood, which is the loss many models train on (e.g., cross-entropy for classification).
Hypothesis Testing
- ( H_0 ) (Null Hypothesis): The default assumption; typically states there is no effect or no difference.
- ( H_1 ) (Alternative Hypothesis): What you want to test for; usually states there is an effect or difference.
Z-Test
- Hypothesis Testing Problems - Z Test & T Statistics - One & Two Tailed Tests 2
- Used to determine if there is a significant difference between sample and population means (or between two samples) when the population variance is known.
- Test statistic formula:
$$ Z = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}} $$
  where:
  - ( \bar{X} ) = sample mean,
  - ( \mu_0 ) = population mean under ( H_0 ),
  - ( \sigma ) = population standard deviation,
  - ( n ) = sample size.
- The ( Z ) value is compared to critical values from the standard normal distribution to reject or fail to reject ( H_0 ).
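A minimal one-sample, two-sided Z-test sketch using Python's standard library. The input numbers (sample mean 103, hypothesized mean 100, known standard deviation 15, sample size 100) are invented for illustration, and `z_test` is a hypothetical helper name.

```python
import math
from statistics import NormalDist

def z_test(sample_mean, mu0, sigma, n):
    """Return the Z statistic and two-sided p-value for a one-sample Z-test."""
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    # Two-sided p-value: probability of a standard normal at least as extreme as |z|.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

z, p = z_test(sample_mean=103.0, mu0=100.0, sigma=15.0, n=100)
print("z =", z)        # 2.0
print("p-value =", p)  # about 0.0455
```

Here the statistic sits just past the two-sided 5% critical value of about 1.96, so the null hypothesis would be rejected at the 0.05 level but not at 0.01.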
Sensitivity (True Positive Rate)
- Measures the ability of a test or classifier to correctly identify positive cases:
$$ \text{Sensitivity} = \frac{TP}{TP + FN} $$
- High sensitivity means few false negatives (good at detecting positives).
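As a small sketch, sensitivity can be computed from true and predicted labels as true positives divided by all actual positives. The labels below are made up, with 1 assumed to mean "positive", and `sensitivity` is an illustrative helper rather than a library function.

```python
def sensitivity(y_true, y_pred):
    """True positive rate: TP / (TP + FN), with label 1 meaning positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp + fn == 0:
        raise ValueError("no positive cases in y_true")
    return tp / (tp + fn)

y_true = [1, 1, 1, 1, 0, 0, 0, 1]  # hypothetical ground truth
y_pred = [1, 0, 1, 1, 0, 1, 0, 0]  # hypothetical predictions
print(sensitivity(y_true, y_pred))  # 0.6: 3 of the 5 actual positives were detected
```

Note that the false positive at index 5 does not affect sensitivity; it only looks at the actual positives.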
Fisher’s Exact Test
- A statistical significance test used for small sample sizes and categorical data.
- Tests for nonrandom association between two categorical variables in a contingency table (often 2x2).
- Calculates the exact probability (from the hypergeometric distribution) of observing a table at least as extreme as the data, assuming ( H_0 ) is true.
- Useful when sample sizes are too small for chi-square tests to be valid.
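A sketch of the two-sided test for a 2x2 table, computed directly from the hypergeometric distribution with fixed margins. It uses one common convention for "two-sided" (sum the probabilities of all tables no more likely than the observed one); the function name `fisher_exact_two_sided` and the example table are illustrative.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum over every table that is at most as likely as the observed one
    # (the small tolerance guards against floating-point ties).
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )

# Hypothetical small table: 3 of 4 in group one vs. 1 of 4 in group two.
p_value = fisher_exact_two_sided(3, 1, 1, 3)
print(p_value)  # 34/70, about 0.486
```

With only eight observations even this fairly lopsided table is not significant, which shows how little evidence very small samples carry.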
📌 Summary:
Hypothesis testing allows making decisions based on data. Z-tests handle mean comparisons with known variance. Sensitivity assesses detection ability, and Fisher’s Exact Test analyzes categorical associations, especially with small data.
Statistical Testing Concepts
p-value
- The p-value is the probability of observing data as extreme as (or more extreme than) the current sample, assuming the null hypothesis (( H_0 )) is true.
- It quantifies the evidence against ( H_0 ).
- Small p-values mean strong evidence to reject ( H_0 ).
p-value Cut-offs (Significance Levels)
- Common thresholds for rejecting ( H_0 ):
  - 0.05 (5%): Typical cutoff; reject ( H_0 ) if ( p < 0.05 )
  - 0.01 (1%): Stricter cutoff; reject ( H_0 ) if ( p < 0.01 )
- These cutoffs are called significance levels (( \alpha )).
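( \chi^2 ) (Chi-Squared) Test
- Tests whether there is a significant association between categorical variables.
- Compares observed counts with expected counts under ( H_0 ).
- Test statistic:
$$ \chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i} $$
  where ( O_i ) = observed frequency, ( E_i ) = expected frequency.
- Under ( H_0 ), the statistic approximately follows a chi-squared distribution with:
$$ df = (r - 1)(c - 1) $$
  where ( r ) and ( c ) are the numbers of rows and columns in the contingency table.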
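A sketch of the statistic for a contingency table, with expected counts derived from the row and column totals under independence. The table is invented, `chi_squared_statistic` is a hypothetical helper, and no continuity correction is applied. The p-value is omitted because it needs the chi-squared CDF, which the standard library does not provide.

```python
def chi_squared_statistic(observed):
    """Return (chi-squared statistic, degrees of freedom) for a table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(observed):
        for j, observed_count in enumerate(row):
            # Expected count if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed_count - expected) ** 2 / expected

    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, dof

observed = [[30, 10], [20, 40]]  # hypothetical 2x2 table of counts
statistic, dof = chi_squared_statistic(observed)
print(statistic, dof)  # about 16.67 with 1 degree of freedom
```

With 1 degree of freedom the 5% critical value is about 3.84, so a statistic near 16.67 would indicate a significant association.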
t-test
- Student’s T Distribution - Confidence Intervals & Margin of Error
- Hypothesis Testing Problems - Z Test & T Statistics - One & Two Tailed Tests 2
- Student’s t-test by Bozeman Science
- Tests whether a sample mean differs significantly from a hypothesized mean (one-sample), or whether the means of two groups differ (two-sample), when the population variance is unknown.
- Test statistic (one-sample):
$$ t = \frac{\bar{X} - \mu_0}{s / \sqrt{n}} $$
  where ( \bar{X} ) = sample mean, ( \mu_0 ) = population mean under ( H_0 ), ( s ) = sample standard deviation, ( n ) = sample size.
- The statistic follows a Student’s t-distribution with ( n - 1 ) degrees of freedom.
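A sketch of the one-sample t statistic using only the standard library. The measurements and the hypothesized mean of 5.0 are made up, and `one_sample_t` is an illustrative name. Turning the statistic into a p-value requires the t-distribution CDF (from a table or a statistics package), which is left out here.

```python
import math
import statistics

def one_sample_t(xs, mu0):
    """Return (t statistic, degrees of freedom) for a one-sample t-test."""
    n = len(xs)
    s = statistics.stdev(xs)  # sample standard deviation (n - 1 denominator)
    t = (statistics.fmean(xs) - mu0) / (s / math.sqrt(n))
    return t, n - 1

xs = [5.1, 4.9, 5.6, 5.8, 5.2, 5.4]  # hypothetical measurements
t, dof = one_sample_t(xs, mu0=5.0)
print("t =", t, "with", dof, "degrees of freedom")
```

The only difference from the Z statistic is that the unknown population standard deviation is replaced by the sample estimate, which is why the heavier-tailed t-distribution is used.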
📌 Summary:
- Use p-values and significance levels to decide whether to reject ( H_0 ).
- ( \chi^2 ) test is for categorical data associations.
- t-test compares means when variance is unknown.
🚫 Non-Parametric Tests
- Parametric and Nonparametric Tests by DATAtab
- Non-parametric tests do not assume a specific distribution for the data.
- Useful when data violates assumptions of parametric tests (e.g., normality).
- Examples: Mann-Whitney U test, Wilcoxon signed-rank test, permutation tests.
🔄 Permutation Test
- A non-parametric method to test hypotheses by randomly shuffling labels on data points.
- Measures how likely an observed effect is under the null hypothesis by comparing it to a distribution of effects from shuffled data.
- Useful for small samples or unknown distributions.
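A minimal permutation-test sketch for a difference in means between two groups. The data, the choice of test statistic (absolute difference of means), the number of shuffles, and the name `permutation_test` are all illustrative assumptions.

```python
import random
import statistics

def permutation_test(group_a, group_b, num_permutations=10_000, seed=0):
    """Estimate a two-sided p-value for the difference in means by shuffling labels."""
    rng = random.Random(seed)
    observed = abs(statistics.fmean(group_a) - statistics.fmean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    count = 0
    for _ in range(num_permutations):
        rng.shuffle(pooled)  # randomly reassign group labels
        diff = abs(statistics.fmean(pooled[:n_a]) - statistics.fmean(pooled[n_a:]))
        if diff >= observed:
            count += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (count + 1) / (num_permutations + 1)

group_a = [12.1, 11.8, 12.5, 12.9, 12.3]  # hypothetical measurements
group_b = [10.2, 10.9, 10.5, 11.0, 10.4]
p_value = permutation_test(group_a, group_b)
print("estimated p-value:", p_value)
```

Because the two made-up groups do not overlap at all, very few shuffles reproduce a gap as large as the observed one, so the estimated p-value is small.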
🔢 Multiple Hypothesis Testing
- When testing many hypotheses simultaneously, the chance of false positives (Type I errors) increases.
- Without correction, if you test 100 hypotheses at ( \alpha = 0.05 ), about 5 may appear significant by chance even when every null hypothesis is true.
🎯 Adjusting p-values: Bonferroni Correction
- A simple and conservative method to control family-wise error rate.
- Adjusted significance level:
$$ \alpha_{\text{adj}} = \frac{\alpha}{m} $$
  where ( m ) = number of tests.
- Reject ( H_0 ) only if:
$$ p_i < \frac{\alpha}{m} $$
- Controls false positives but can be overly strict, increasing false negatives.
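The correction is a one-liner in code. A sketch with made-up p-values and a hypothetical helper name `bonferroni_reject`:

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Return, for each test, whether its null hypothesis is rejected after Bonferroni."""
    m = len(p_values)
    threshold = alpha / m  # each test must clear the stricter per-test level
    return [p < threshold for p in p_values]

p_values = [0.001, 0.012, 0.03, 0.2]  # hypothetical results from 4 tests
decisions = bonferroni_reject(p_values)
print(decisions)  # [True, True, False, False] with threshold 0.05 / 4 = 0.0125
```

Note that 0.03 would count as significant on its own at 0.05 but is no longer rejected once four tests are accounted for.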
📌 Summary:
- Use permutation tests when data distributions are unknown.
- Be cautious with multiple testing; adjust p-values to avoid false positives.
- Bonferroni is simple but conservative.
Linear Regression: Main Concepts
0. Watch These
- The Main Ideas of Fitting a Line to Data (The Main Ideas of Least Squares and Linear Regression.)
- Linear Regression, Clearly Explained!!!
  - Read the first comment by statquest for correction
- Gradient Descent, Step-by-Step
1. Least Squares Fit
- Find the line (model) that minimizes the sum of squared residuals.
- Residual: the vertical distance between a data point and the regression line.
2. Residuals and Sum of Squares
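- Residuals: ( e_i = y_i - \hat{y}_i ), where ( y_i ) is the actual value and ( \hat{y}_i ) is the predicted value.
- Sum of Squares Total (SST): variation around the mean (total variability in data): ( \sum_i (y_i - \bar{y})^2 )
- Sum of Squares Regression (SSR) / Fit: variation explained by the model: ( \sum_i (\hat{y}_i - \bar{y})^2 )
- Sum of Squares Residuals (SSE): variation unexplained by the model: ( \sum_i (y_i - \hat{y}_i)^2 )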
3. Coefficient of Determination (( R^2 ))
- Measures how much of the total variation in the dependent variable is explained by the model:
$$ R^2 = \frac{\text{SSR}}{\text{SST}} = 1 - \frac{\text{SSE}}{\text{SST}} $$
- Interpretation:
  - ( R^2 = 0 ) means the model explains none of the variation.
  - ( R^2 = 1 ) means the model explains all the variation.
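A sketch that fits a simple least-squares line and computes the three sums of squares and ( R^2 ) from them. The five data points are invented, and `fit_line` and `sums_of_squares` are hypothetical helper names; SSR here means the regression (explained) sum of squares and SSE the residual (unexplained) one, following the naming in these notes.

```python
import statistics

def fit_line(xs, ys):
    """Least-squares slope and intercept for a single predictor."""
    x_bar, y_bar = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def sums_of_squares(xs, ys):
    """Return (SST, SSR, SSE) for the least-squares line through (xs, ys)."""
    slope, intercept = fit_line(xs, ys)
    y_bar = statistics.fmean(ys)
    predictions = [intercept + slope * x for x in xs]
    sst = sum((y - y_bar) ** 2 for y in ys)                      # total variation
    ssr = sum((p - y_bar) ** 2 for p in predictions)             # explained
    sse = sum((y - p) ** 2 for y, p in zip(ys, predictions))     # unexplained
    return sst, ssr, sse

xs = [1, 2, 3, 4, 5]                # hypothetical predictor values
ys = [2.1, 3.9, 6.2, 7.8, 10.1]     # hypothetical responses
sst, ssr, sse = sums_of_squares(xs, ys)
r_squared = 1 - sse / sst
print("SST:", sst, "SSR + SSE:", ssr + sse)
print("R^2:", r_squared)
```

For a least-squares fit with an intercept, SST splits exactly into SSR + SSE, which is why both forms of the ( R^2 ) formula agree.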
 
4. p-value for ( R^2 )
- Tests whether the relationship captured by ( R^2 ) is statistically significant.
- Small p-value indicates that the model explains a significant portion of the variance, not due to random chance.
5. Direction of Regression
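- Regression can be run both ways: predicting ( y ) from ( x ) or ( x ) from ( y ).
- The fitted coefficients generally differ depending on the direction, although in simple linear regression ( R^2 ) is the same either way (it equals the squared correlation).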
📌 Summary:
Linear regression fits a line minimizing the sum of squared residuals; ( R^2 ) quantifies explained variance; p-values test significance.
Questions About Linear Regression
🧮 1. Does Adding Parameters Always Help?
No, not necessarily.
📌 Key Idea:
- Adding more predictors (features) to a linear regression never decreases the training R² (coefficient of determination).
- But it can worsen test performance (generalization) due to overfitting.
✍️ Example:
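If you have a model
$$ y = \beta_0 + \beta_1 x_1 + \epsilon $$
and you add another feature ( x_2 ), the model becomes:
$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon $$
This will always fit the training data at least as well, but the new parameter might just be fitting noise rather than signal, hurting performance on new data.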
So, more parameters = more flexible, but not always better for prediction.
📈 2. What is the F-statistic in Linear Regression?
🎯 Purpose:
Used to test whether the regression model explains a significant amount of variance in the dependent variable.
⚙️ Intuition:
- It compares two things:
  - Explained variance (how much variation the model explains)
  - Unexplained variance (residuals/noise)
- If the explained variance is much higher than unexplained, the model is statistically significant.
📐 Formula:
$$ F = \frac{\text{SSR} / p}{\text{SSE} / (n - p - 1)} $$
Where:
- SSR: Regression Sum of Squares (explained)
- SSE: Error Sum of Squares (unexplained)
- p: number of predictors
- n: number of data points
A high F value → model is better than a null model (just the mean).
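A sketch of the computation, assuming the usual form of the statistic: the explained mean square (SSR divided by ( p )) over the residual mean square (SSE divided by ( n - p - 1 )). The numbers are invented and `f_statistic` is a hypothetical helper; getting a p-value would additionally need the F-distribution CDF.

```python
def f_statistic(ssr, sse, n, p):
    """Overall regression F statistic from sums of squares.

    ssr: regression (explained) sum of squares
    sse: error (unexplained) sum of squares
    n:   number of data points
    p:   number of predictors
    """
    mean_square_regression = ssr / p
    mean_square_error = sse / (n - p - 1)
    return mean_square_regression / mean_square_error

# Hypothetical fit: 2 predictors, 23 points, 80 of 100 units of variation explained.
f_value = f_statistic(ssr=80.0, sse=20.0, n=23, p=2)
print(f_value)  # 40.0: (80 / 2) / (20 / 20)
```

Dividing each sum of squares by its degrees of freedom is what makes the explained and unexplained parts comparable.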
🧮 3. Degrees of Freedom (DoF)
https://www.youtube.com/watch?v=Cm0vFoGVMB8&ab_channel=CrashCourse
In Regression:
- Total DoF = ( n - 1 )
- Regression DoF = ( p ) (number of predictors)
- Residual DoF = ( n - p - 1 )
🔍 Why It Matters:
- Degrees of freedom are used to standardize variance (mean squares) so you can compare them, as in the F-statistic.
- More parameters = fewer residual degrees of freedom, which means you’re using up data to fit the model.
🔁 Summary
| Concept | Meaning | 
|---|---|
| Adding features | Can increase training R², but risks overfitting. | 
| F-statistic | Tests whether the model explains significant variation. | 
| Explained vs. Unexplained | SSR vs. SSE (signal vs. noise) | 
| Degrees of freedom | Track how much data is used for estimating vs. testing. |