Probability Q&A
Q: Why can the PDF of an exponential distribution be greater than 1?
Asked in: Meeting 01
Answer: The PDF is a density, not a probability. For an exponential distribution:
$$f(x) = \lambda e^{-\lambda x}, \quad x \ge 0$$
If λ > 1, then f(0) = λ > 1, so the PDF can exceed 1. But probabilities are areas under the curve, and the total area is always equal to 1.
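A quick numerical sketch (λ = 2 is an arbitrary illustrative choice):

```python
import numpy as np
from scipy.stats import expon
from scipy.integrate import quad

lam = 2.0  # rate parameter; any lambda > 1 makes the density exceed 1 somewhere

# scipy parameterizes the exponential by scale = 1/lambda
dist = expon(scale=1 / lam)

print(dist.pdf(0.0))  # 2.0 -- a density value above 1
area, _ = quad(dist.pdf, 0, np.inf)
print(area)           # 1.0 -- the total probability still integrates to 1
```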
Q: What does it mean for the exponential distribution to be memoryless?
Asked in: Meeting 01
Answer: The exponential distribution satisfies:
$$P(X > s + t \mid X > s) = P(X > t)$$
This means the future waiting time does not depend on how long you have already waited. It follows from the tail probability $P(X > t) = e^{-\lambda t}$:
$$P(X > s + t \mid X > s) = \frac{e^{-\lambda (s + t)}}{e^{-\lambda s}} = e^{-\lambda t} = P(X > t)$$
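A simulation sketch of this property (the values λ = 1, s = 2, t = 1 are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
lam, s, t = 1.0, 2.0, 1.0
x = rng.exponential(scale=1 / lam, size=1_000_000)

p_uncond = np.mean(x > t)            # P(X > t)
p_cond = np.mean(x[x > s] > s + t)   # P(X > s + t | X > s)

print(p_uncond, p_cond)  # both approximately e^{-1} ≈ 0.368
```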
Q: What does it mean for the geometric distribution to be memoryless?
Asked in: Meeting 01
Answer: The geometric distribution satisfies:
$$P(X > m + n \mid X > m) = P(X > n)$$
The proof uses the tail probability $P(X > n) = (1 - p)^n$. Substituting:
$$P(X > m + n \mid X > m) = \frac{(1 - p)^{m + n}}{(1 - p)^m} = (1 - p)^n = P(X > n)$$
Thus, past failures do not influence the number of additional trials needed.
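A matching simulation sketch (p = 0.3, m = 3, n = 2 are arbitrary; numpy's geometric counts the trial on which the first success occurs, which gives the same tail $(1 - p)^n$):

```python
import numpy as np

rng = np.random.default_rng(0)
p, m, n = 0.3, 3, 2
x = rng.geometric(p, size=1_000_000)  # trial number of the first success

p_uncond = np.mean(x > n)             # P(X > n) = (1 - p)^n
p_cond = np.mean(x[x > m] > m + n)    # P(X > m + n | X > m)

print(p_uncond, p_cond)  # both approximately 0.7^2 = 0.49
```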
Q: What is the difference between covariance and correlation?
Asked in: Meeting 01
Answer:
- Covariance: Measures how two variables change together, with units equal to the product of their units. Range: (−∞, ∞).
- Correlation: Standardized covariance:
$$\rho_{X,Y} = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \sigma_Y}$$
It is dimensionless, always in [−1, 1], and easier to interpret.
Q: Why is the correlation unitless?
Asked in: Meeting 01
Answer: Because the units of Cov(X, Y) are (units of X) × (units of Y), and the denominator σ_X σ_Y carries exactly the same units, so they cancel:
$$\rho_{X,Y} = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \sigma_Y}$$
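A small demonstration (synthetic height/weight data, chosen only to illustrate the unit behavior):

```python
import numpy as np

rng = np.random.default_rng(0)
height_m = rng.normal(1.7, 0.1, size=1000)               # heights in meters
weight_kg = 30 * height_m + rng.normal(0, 5, size=1000)  # synthetic weights in kg
height_cm = height_m * 100                               # same data, new units

print(np.cov(height_m, weight_kg)[0, 1])    # covariance in m·kg
print(np.cov(height_cm, weight_kg)[0, 1])   # 100x larger: depends on units
print(np.corrcoef(height_m, weight_kg)[0, 1])   # correlation...
print(np.corrcoef(height_cm, weight_kg)[0, 1])  # ...identical: unit-free
```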
Q: Is it valid to compare correlations from different datasets?
Asked in: Meeting 01
Answer: Yes, because correlation is standardized and unit-free, so it measures the strength of linear relationships consistently. However:
- Only valid for linear relationships.
- Sample size and context should be considered.
Statistics
Q: We often don’t have the true θ, so how is bias calculated?
Asked in: Meeting 02
Answer (ChatGPT):
✅ We almost never know the true parameter θ (like the true population mean or variance),
❌ So we can’t compute the bias directly in practice.
So how is bias calculated or estimated?
There are two main settings:
1. Theoretical Setting (Simulation or Assumed Population)
If you know the population distribution (say, Normal(μ=10, σ=2)), or you’re simulating data, then you know the true θ.
In this case, you can calculate the bias:
$$\operatorname{Bias}(\hat{\theta}) = \mathbb{E}[\hat{\theta}] - \theta$$
How?
1. Generate many samples (e.g., 10,000).
2. Compute the estimate (e.g., sample mean, sample variance) on each one.
3. Take the average of those estimates → that’s $\mathbb{E}[\hat{\theta}]$.
4. Subtract the true θ.
✅ This is empirical estimation of bias.
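A sketch of that recipe, assuming the Normal(μ=10, σ=2) population mentioned above and the biased (divide-by-n) variance estimator:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, n_sims = 10.0, 2.0, 5, 10_000   # small n makes the bias visible

# biased variance estimator: np.var uses ddof=0, i.e. divides by n
estimates = np.array([
    np.var(rng.normal(mu, sigma, size=n))
    for _ in range(n_sims)
])

true_theta = sigma**2                 # known, because we chose the population
bias_hat = estimates.mean() - true_theta
print(bias_hat)                       # approximately -sigma^2 / n = -0.8
```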
2. Real-World Setting (θ is Unknown)
In real data problems:
- We don’t know θ (the population mean, variance, etc.).
- So we can’t compute true bias directly.
But we can still:
✔️ Rely on theoretical properties:
- Use estimators that are proven to be unbiased (like the sample mean).
- Use bias-corrected formulas (like the n − 1 denominator in the sample variance).
✔️ Use Bootstrap or Simulation to Estimate Bias:
Bootstrap allows you to approximate the bias without knowing θ.
How it works:
1. Take your one sample.
2. Resample (with replacement) many bootstrap samples.
3. Calculate your estimator on each one.
4. Compute the average of those estimates.
5. Then estimate:
$$\widehat{\operatorname{Bias}} = \frac{1}{B} \sum_{b=1}^{B} \hat{\theta}^{*}_{b} - \hat{\theta}$$
where $\hat{\theta}^{*}_{b}$ is the estimate from bootstrap sample $b$, and $\hat{\theta}$ is your original sample’s estimate.
This gives you an approximate bias, even if θ is unknown.
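A minimal bootstrap-bias sketch (the divide-by-n variance estimator and B = 2000 are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(10, 2, size=30)   # stand-in for our one real-world sample
B = 2000

theta_hat = np.var(sample)            # original estimate (ddof=0, biased)

boot_estimates = np.array([
    np.var(rng.choice(sample, size=sample.size, replace=True))
    for _ in range(B)
])

bias_hat = boot_estimates.mean() - theta_hat
print(bias_hat)                       # estimates the bias without knowing theta
print(theta_hat - bias_hat)           # bias-corrected estimate
```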
Bottom Line
Case | Can you calculate Bias? | How? |
---|---|---|
You know θ (simulated or theoretical) | ✅ Yes | Average over many samples |
Real-world data | ❌ Not directly | Use bootstrap or theoretical reasoning |
Q: Comparing X and X̄ as Estimators of the Mean
Answer:
Goal
Estimate the population mean μ, based on random samples X₁, …, Xₙ from a distribution with mean μ and variance σ².
We compare two estimators:
- X: a single random observation (e.g. X₁)
- X̄: the sample mean:
$$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$$
Bias
Estimator | Bias Calculation | Result |
---|---|---|
X | E[X] − μ = μ − μ = 0 | ✅ Unbiased |
X̄ | E[X̄] − μ = μ − μ = 0 | ✅ Unbiased |

✅ Both X and X̄ are unbiased estimators of μ.
Variance
Estimator | Variance |
---|---|
X | σ² |
X̄ | σ²/n |

✅ X̄ has lower variance, and becomes more stable as n → ∞.
Efficiency
An estimator is more efficient if it has lower variance, assuming the same bias.
- X̄ is more efficient than X (for n > 1)
Final Comparison
Estimator | Bias | Variance | Efficient? | Recommended? |
---|---|---|---|---|
X | 0 | σ² | ❌ Less efficient | ❌ No |
X̄ | 0 | σ²/n | ✅ Yes | ✅ Yes |
💡 Conclusion
- Both estimators are unbiased
- But X̄ is far more reliable due to its lower variance
- ➤ Use X̄ when you have more than one sample
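A short simulation backing this up (μ = 10, σ = 2, n = 25 are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, n_sims = 10.0, 2.0, 25, 10_000

single_obs = rng.normal(mu, sigma, size=n_sims)                      # X per "study"
sample_means = rng.normal(mu, sigma, size=(n_sims, n)).mean(axis=1)  # X-bar per "study"

print(single_obs.mean(), sample_means.mean())  # both near mu = 10 (unbiased)
print(single_obs.var(), sample_means.var())    # ~sigma^2 = 4 vs ~sigma^2/n = 0.16
```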
Q: Why is E[X] = μ if X is just a single observation?
The Key Distinction: Random Variable vs. Realized Value
When we write E[X] = μ:
- X is not yet a number.
- It’s a random variable, meaning it hasn’t “rolled the dice” yet.
- It represents the process of drawing one data point at random from a distribution.
Because X is distributed exactly like the population, averaging over all its possible outcomes gives the population mean: E[X] = μ.
Q: What is the Normality Assumption in Linear Regression Analysis?
Answer:
- It states that the residuals (i.e., the errors: the differences between the observed values and the predicted values) should be normally distributed.
- Formally, if your regression model is
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$
then we assume that
$$\varepsilon_i \sim N(0, \sigma^2)$$
✅ Why Does This Matter?
- For inference, such as:
- Calculating confidence intervals
- Conducting t-tests or F-tests for coefficients
- These statistical tests rely on the assumption that residuals are normal, especially when the sample size is small.
⚠️ What If Residuals Are Not Normal?
- Large samples: Central Limit Theorem often makes this less of a problem.
- Fixes:
- Transform the dependent variable (e.g., log, square root)
- Use robust regression or bootstrapping
- Try a non-linear model if linearity is the issue
✅ **Summary:**
We assume, without proof, that the residuals (errors) are normally distributed, so that we can perform certain statistical calculations like p-values and confidence intervals.
This assumption becomes less critical when the sample size is large, because the Central Limit Theorem (CLT) makes the sampling distribution of the estimates approximately normal anyway.
But if the assumption is wrong and the sample is small, then:
Our statistical inferences (like hypothesis tests and confidence intervals) may become unreliable or misleading.
We might get wrong conclusions, like thinking a variable is significant when it’s not (or vice versa).
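One way to check the assumption in practice (a sketch; the synthetic data and the Shapiro–Wilk test are illustrative choices, not prescribed here):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 3.0 + 2.0 * x + rng.normal(0, 1.5, size=100)  # synthetic data, normal errors

# fit simple OLS by least squares, then compute residuals
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (intercept + slope * x)

# Shapiro-Wilk test: H0 = residuals are normally distributed
stat, p_value = stats.shapiro(residuals)
print(p_value)  # a large p-value gives no evidence against normality
```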
Q: Why use Gradient Descent in Linear Regression? Why not just calculate β̂ directly?
When You Don’t Need Gradient Descent (in Simple Linear Regression)
For ordinary least squares (OLS) linear regression, the loss function is convex and differentiable. We can find an exact solution analytically by setting the derivative to zero:
Closed-form solution (Normal Equation):
β̂ = (Xᵀ X)⁻¹ Xᵀ y
✅ Pros:
- Exact
- Fast for small/moderate datasets
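A minimal sketch of the closed-form solution on synthetic data (np.linalg.solve is used instead of an explicit inverse, for numerical stability):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.uniform(0, 10, n)])  # intercept + one feature
y = X @ np.array([3.0, 2.0]) + rng.normal(0, 1.0, n)

# Normal equation: beta_hat = (X^T X)^{-1} X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # approximately [3.0, 2.0]
```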
When You DO Need Gradient Descent
- High-dimensional data (large n or p):
- Computing (Xᵀ X)⁻¹ is expensive: O(p³) for the inversion, plus O(np²) to form Xᵀ X
- Gradient descent scales better
- No closed-form available:
- Logistic regression
- Regularized models with non-smooth penalties, like Lasso (Ridge, by contrast, still has a closed form)
- Non-linear models
- Deep Learning:
- No closed-form solution
- Must use iterative methods like gradient descent
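A matching gradient-descent sketch for the same least-squares problem (the learning rate and iteration count are hand-tuned for this example):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.uniform(0, 10, n)])
y = X @ np.array([3.0, 2.0]) + rng.normal(0, 1.0, n)

beta = np.zeros(2)   # start from zero
lr = 0.001           # learning rate
for _ in range(20_000):
    grad = (2 / n) * X.T @ (X @ beta - y)  # gradient of the mean squared error
    beta -= lr * grad

print(beta)  # converges to roughly the same [3.0, 2.0] as the closed form
```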
Summary Table
Case | Use Closed-Form? | Use Gradient Descent? |
---|---|---|
Small linear regression | ✅ Yes | ❌ Not needed |
Large linear regression | ❌ Too slow | ✅ Yes |
Logistic regression / NN | ❌ No closed-form | ✅ Yes |
Regularized models (Lasso) | ❌ | ✅ Yes |