Statistics Review
Review of key statistics concepts
-
Population: Entire group of interest; sample is drawn from it to estimate its properties.
-
Population properties: True but often unknown population summaries; estimated by sample properties.
- Population proportion $ p $: Fraction of population with a property, $ p = \frac{X}{N} $; estimated by sample proportion $ \hat{p} $.
- Population mode: Most frequent population value/category; related to sample mode as estimate.
- Population median: Middle population value / 50th percentile; related to quartiles and sample median.
- Population mean $ \mu $: Population average, $ \mu = \frac{\sum x_i}{N} $; center used for population deviation and z-score.
-
Unit: One member/observation from population; produces a measured value $ x_i $.
-
Unit deviation: Distance from population mean, $ x_i - \mu $; basis of population variance.
-
Squared unit deviations: Squared distance, $ (x_i - \mu)^2 $; removes signs before summing.
-
Sum of squared unit deviations: Total squared distance, $ \sum (x_i - \mu)^2 $; numerator of population variance.
-
Total population variance: Unnormalized population spread, $ \sum (x_i - \mu)^2 $; becomes variance after division by $ N $.
-
Population variance $ \sigma^2 $: Average squared population deviation, $ \sigma^2 = \frac{\sum (x_i - \mu)^2}{N} $; square of $ \sigma $.
- Population standard deviation $ \sigma $: Typical population distance from mean, $ \sigma = \sqrt{\sigma^2} $; used in z-score.
-
Z-score: Standardized position, $ z = \frac{x_i - \mu}{\sigma} $; relates raw value to mean and SD.
- Standardization: Converts values to mean 0, variance 1; uses z-scores.
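The chain from unit deviations to z-scores can be traced on a tiny example. This is a minimal sketch; the eight values are hypothetical and are treated as the entire population, so the divisor is $ N $.

```python
import math

# Hypothetical population of N = 8 measured values (illustrative only).
population = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

N = len(population)
mu = sum(population) / N                           # population mean
unit_deviations = [x - mu for x in population]     # x_i - mu
sum_sq = sum(d ** 2 for d in unit_deviations)      # sum of squared unit deviations
sigma2 = sum_sq / N                                # population variance
sigma = math.sqrt(sigma2)                          # population standard deviation
z_scores = [d / sigma for d in unit_deviations]    # standardized values

print("mu =", mu, "sigma^2 =", sigma2, "sigma =", sigma)
print("z-scores:", z_scores)
```

After standardization the z-scores have mean 0 and population variance 1, which is what makes values from different scales comparable.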
-
Sample: Observed subset $ \{x_1, \dots, x_n\} $; used to compute sample properties and infer population properties.
-
Sample properties: Summaries computed from sample; estimate population properties.
-
Central tendency: Measures typical/central value; includes mean, median, mode.
- Mode: Most frequent sample value/category; related to observed frequencies.
- Median: Middle ordered value / Q2; robust center related to quartiles and MAD.
-
Mean $ \bar{x} $: Sample average, $ \bar{x} = \frac{\sum x_i}{n} $; center for deviations, variance, regression.
- Centroid / cluster mean $ m_i $: Mean of a cluster; used in WSS/BSS.
-
Position / order statistics: Describes where values fall after sorting; includes range, quantiles, quartiles.
-
Quantiles: Cut ordered data into equal-probability parts; general family containing percentiles/quartiles.
- Percentiles: Quantiles on 0–100 scale; $ P_k $ is value below which $ k\% $ fall.
-
Quartiles: Quantiles splitting data into four parts; Q1, Q2, Q3.
- First quartile Q1: 25th percentile; lower boundary of middle 50%; used in IQR.
- Second quartile Q2: 50th percentile; same as median.
- Third quartile Q3: 75th percentile; upper boundary of middle 50%; used in IQR.
-
Interquartile range IQR: Middle-50% spread, $ \text{IQR} = Q_3 - Q_1 $; robust dispersion measure.
- Quartile deviation: Semi-IQR, $ \frac{Q_3 - Q_1}{2} $; robust spread around median.
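The quartile-based measures above can be sketched with Python's standard library. The data are hypothetical, and `method="inclusive"` is only one of several quartile conventions; other conventions can give slightly different Q1 and Q3.

```python
import statistics

# Hypothetical sample, already small enough to check by hand.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# n=4 asks for the three cut points that split the data into four parts.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

iqr = q3 - q1                  # interquartile range (middle-50% spread)
quartile_deviation = iqr / 2   # semi-IQR

print("Q1, Q2, Q3 =", q1, q2, q3)
print("IQR =", iqr, "quartile deviation =", quartile_deviation)
```

Q2 coincides with the median, so `statistics.median(data)` returns the same value.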
-
Dispersion and variance: Amount of spread in data; related to deviation, variance, SD, range, MAD.
-
Deviation: Distance from mean, $ x_i - \bar{x} $; basis for variance, covariance, regression residual ideas.
- Sum of deviations = 0: For mean-centered data, $ \sum (x_i - \bar{x}) = 0 $; explains why deviations are squared.
-
Squared deviation: Squared distance, $ (x_i - \bar{x})^2 $; basis for sample variance and sums of squares.
-
Sum of squared deviations: Total squared spread, $ \sum (x_i - \bar{x})^2 $; numerator of variance/SST.
-
Total sample variance: Unnormalized sample spread; related to $ \sum (x_i - \bar{x})^2 $.
-
Sample variance $ s^2 $: Average squared sample deviation, $ s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1} $; estimates $ \sigma^2 $.
- Sample standard deviation $ s $: Typical sample distance from mean, $ s = \sqrt{s^2} $; used in SEM and CV.
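A short sketch of the sample-side chain (deviations, squared deviations, $ s^2 $, $ s $), using hypothetical data. It also checks that the deviations sum to zero and that the hand computation agrees with `statistics.variance`, which uses the same $ n-1 $ divisor.

```python
import math
import statistics

# Hypothetical sample (illustrative only).
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

n = len(sample)
x_bar = sum(sample) / n                         # sample mean
deviations = [x - x_bar for x in sample]        # x_i - x_bar
sum_dev = sum(deviations)                       # zero, up to rounding
sum_sq_dev = sum(d ** 2 for d in deviations)    # sum of squared deviations
s2 = sum_sq_dev / (n - 1)                       # sample variance (divisor n - 1)
s = math.sqrt(s2)                               # sample standard deviation

print("sum of deviations =", sum_dev)
print("s^2 =", s2, "s =", s)
print("library check:", statistics.variance(sample), statistics.stdev(sample))
```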
-
Sums of squares: Squared-distance totals; connect variance, regression, ANOVA, clustering.
-
SS Total / SST: Total variation, $ \text{SST} = \sum (y_i - \bar{y})^2 $; decomposes into regression and error parts.
-
Sum of squared error / SSE: Unexplained variation, $ \text{SSE} = \sum (y_i - \hat{y}_i)^2 $; based on residuals.
-
Mean squared error / MSE: Average squared error, $ \text{MSE} = \frac{\text{SSE}}{\text{df}} $; used for RMSE and F-tests.
-
Root MSE / RMSE: Error in original units, $ \text{RMSE} = \sqrt{\text{MSE}} $; same family as standard error of estimate.
- Standard error of estimate: Typical prediction error, often $ \sqrt{\text{MSE}} $; used for SE of regression coefficients.
-
Sum of squared regression / SSR: Explained variation, $ \text{SSR} = \sum (\hat{y}_i - \bar{y})^2 $; paired with SSE in $ \text{SST} = \text{SSR} + \text{SSE} $.
-
Mean squared regression / MSR: Explained variance per model df, $ \text{MSR} = \frac{\text{SSR}}{\text{df}_\text{reg}} $; used in F-statistic.
- F-statistic: Ratio $ F = \frac{\text{MSR}}{\text{MSE}} $; tests whether model explains significant variation.
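The decomposition $ \text{SST} = \text{SSR} + \text{SSE} $ and the mean squares built from it can be checked numerically. This is a minimal sketch for simple linear regression on hypothetical data; the degrees of freedom (1 for the regression, $ n-2 $ for the error) assume one predictor plus an intercept.

```python
import math

# Hypothetical paired data (illustrative only).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares fit; the (n - 1) factors of cov and var cancel in the slope.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

sst = sum((yi - y_bar) ** 2 for yi in y)                # total variation
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # unexplained variation
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)            # explained variation

df_reg = 1         # one predictor (assumption of this sketch)
df_err = n - 2     # n minus the two estimated coefficients
msr = ssr / df_reg
mse = sse / df_err
rmse = math.sqrt(mse)
f_stat = msr / mse

print("SST =", sst, "SSR =", ssr, "SSE =", sse)
print("MSR =", msr, "MSE =", mse, "RMSE =", rmse, "F =", f_stat)
```

The F value would then be compared with a critical value from the F-distribution with (1, $ n-2 $) degrees of freedom.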
-
WSS / SSE: Within-cluster spread, $ \sum_i \sum_{x \in C_i} (x - m_i)^2 $ with $ C_i $ the points in cluster $ i $; smaller means tighter clusters.
- Silhouette coefficient: Cluster quality, $ s = \frac{b - a}{\max(a,b)} $; relates cohesion $ a $ and separation $ b $.
- BSS: Between-cluster spread; larger means better separation.
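The clustering measures above can be illustrated on two hypothetical one-dimensional clusters. The BSS formula used here (cluster size times squared distance from centroid to grand mean) is the usual definition and is an assumption, since the notes do not spell it out. The silhouette is computed for a single point and uses the fact that there is only one other cluster.

```python
# Two hypothetical 1-D clusters (illustrative only).
clusters = [[1.0, 2.0, 3.0], [8.0, 9.0, 10.0]]

all_points = [x for c in clusters for x in c]
grand_mean = sum(all_points) / len(all_points)
centroids = [sum(c) / len(c) for c in clusters]   # cluster means m_i

# Within-cluster sum of squares: distance of each point to its own centroid.
wss = sum((x - m) ** 2 for c, m in zip(clusters, centroids) for x in c)

# Between-cluster sum of squares: size-weighted centroid distance to the grand mean.
bss = sum(len(c) * (m - grand_mean) ** 2 for c, m in zip(clusters, centroids))

# Total sum of squares around the grand mean; equals WSS + BSS.
tss = sum((x - grand_mean) ** 2 for x in all_points)

# Silhouette of the first point of cluster 0.
point = clusters[0][0]
a = sum(abs(point - x) for x in clusters[0][1:]) / (len(clusters[0]) - 1)  # cohesion
b = sum(abs(point - x) for x in clusters[1]) / len(clusters[1])            # separation
sil = (b - a) / max(a, b)

print("WSS =", wss, "BSS =", bss, "TSS =", tss)
print("silhouette of first point =", sil)
```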
-
Median absolute deviation: Robust spread, $ \text{MAD} = \text{median}(|x_i - \text{median}(x)|) $; less sensitive to outliers.
- Range: max - min; simplest spread measure, related to order statistics.
- Coefficient of variation CV: Relative spread, $ \text{CV} = \frac{s}{\bar{x}} $; compares variability across different scales.
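A small sketch contrasting the robust and non-robust spread measures above on hypothetical data with one extreme value. The median absolute deviation takes the absolute value of each deviation from the median before taking the median again.

```python
import statistics

# Hypothetical sample with one outlier (illustrative only).
data = [1.0, 2.0, 3.0, 4.0, 100.0]

med = statistics.median(data)
mad = statistics.median([abs(x - med) for x in data])   # median absolute deviation
data_range = max(data) - min(data)                      # range
s = statistics.stdev(data)                              # sample standard deviation
cv = s / statistics.mean(data)                          # coefficient of variation

print("median =", med, "MAD =", mad)
print("range =", data_range, "s =", s, "CV =", cv)
```

The outlier inflates the range and the standard deviation, while the MAD stays close to the spread of the bulk of the data.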
-
Frequencies and proportions: Count-based summaries; basis for categorical analysis and chi-square tests.
- Observed frequencies: Actual category counts $ O_i $; compared with expected frequencies in chi-square tests.
- Expected frequencies: Counts expected under $ H_0 $, denoted $ E_i $; used in chi-square statistic.
-
Proportion: Sample fraction, $ \hat{p} = \frac{x}{n} $; estimates population proportion $ p $.
- Standard error of proportion: Sampling error of $ \hat{p} $, $ \text{SE} = \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} $; used for CI/tests on proportions.
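The proportion and its standard error can be sketched as follows. The counts are hypothetical, and the interval uses the normal-approximation critical value 1.96 for 95% confidence, which is an assumption of this sketch rather than something the notes specify.

```python
import math

# Hypothetical counts: 40 successes in a sample of 100.
x = 40
n = 100

p_hat = x / n                                # sample proportion
se = math.sqrt(p_hat * (1 - p_hat) / n)      # standard error of p_hat

z_star = 1.96                                # approximate 95% normal critical value
ci = (p_hat - z_star * se, p_hat + z_star * se)

print("p_hat =", p_hat, "SE =", se)
print("approximate 95% CI:", ci)
```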
-
Contingency table: Cross-tab of two categorical variables; cells contain joint frequencies $ n_{ij} $.
- Joint frequencies $ n_{ij} $: Count in row i, column j; basis for support and chi-square distance.
-
Support $ P(X \cap Y) $: Frequency/probability of joint event; base measure for association rules.
-
Confidence $ P(Y \mid X) $: Conditional rule strength, $ P(Y \mid X) = \frac{P(X \cap Y)}{P(X)} $; related to support.
- Lift: Rule strength over chance, $ \frac{P(Y \mid X)}{P(Y)} $; >1 means positive association.
- Chi-square distance: Categorical profile distance; used in correspondence analysis and related to contingency tables.
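The association-rule measures (support, confidence, lift) can be computed directly from a list of transactions. The baskets and the helper name `support` are hypothetical and only serve as illustration.

```python
# Hypothetical market baskets (illustrative only).
transactions = [
    {"bread", "butter"},
    {"bread"},
    {"bread", "butter", "milk"},
    {"milk"},
    {"butter"},
]


def support(items):
    """Fraction of transactions that contain every item in `items`."""
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


X = {"bread"}
Y = {"butter"}

support_xy = support(X | Y)            # P(X and Y)
confidence = support_xy / support(X)   # P(Y | X)
lift = confidence / support(Y)         # P(Y | X) / P(Y)

print("support =", support_xy)
print("confidence =", confidence)
print("lift =", lift)
```

A lift above 1 here says that butter appears with bread more often than its overall frequency alone would suggest.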
-
Association between variables: Describes dependence between variables; includes covariance, correlation, contingency, PCA.
-
Covariance: Joint variation, $ \text{cov}(X,Y) = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n-1} $; sign shows shared direction.
-
Correlation: Standardized covariance, $ r = \frac{\text{cov}(X,Y)}{s_x s_y} $; scale-free linear association.
-
Coefficient of determination $ R^2 $: Explained proportion, $ R^2 = \frac{\text{SSR}}{\text{SST}} = 1 - \frac{\text{SSE}}{\text{SST}} $; square of $ r $ in simple regression.
- Adjusted $ R^2 $: Penalized $ R^2 $ for model size; accounts for predictors/df.
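A minimal sketch of covariance, correlation, and the link $ R^2 = r^2 $ in simple regression, on hypothetical data. The adjusted $ R^2 $ line uses the common formula $ 1 - (1 - R^2)\frac{n-1}{n-k-1} $ with $ k $ predictors, which the notes do not state and is included here as an assumption.

```python
import math

# Hypothetical paired data (illustrative only).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
s_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
s_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))

r = cov_xy / (s_x * s_y)    # correlation: covariance on a unit-free scale
r_squared = r ** 2          # equals R^2 for a one-predictor regression

k = 1                       # number of predictors
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

print("cov =", cov_xy, "r =", r)
print("R^2 =", r_squared, "adjusted R^2 =", adj_r_squared)
```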
-
PCA: Finds orthogonal directions of maximum variance; uses covariance/correlation matrix.
- Absolute contribution: How much a variable/point contributes to a PCA axis; relates variables to components.
- Relative contribution / $ \cos^2 $: Quality of representation on PCA axis; high $ \cos^2 $ means well represented.
-
Shape: Distribution form beyond center/spread; includes skewness and kurtosis.
- Skewness: Asymmetry, roughly $ \frac{\sum (x_i - \bar{x})^3}{n s^3} $; related to tail imbalance.
- Kurtosis: Tail/heaviness/peakedness, roughly $ \frac{\sum (x_i - \bar{x})^4}{n s^4} $; related to outlier tendency.
-
Regression: Models $ y $ from predictors $ x $; connects coefficients, residuals, SSE, t/F tests.
-
Regression coefficients: Parameters controlling fitted model; in simple regression, $ \hat{y} = b_0 + b_1 x $, where $ b_0 $ and $ b_1 $ estimate $ \beta_0 $ and $ \beta_1 $.
- Intercept $ \beta_0 $: Predicted $ y $ when $ x=0 $; estimated with $ b_0 = \bar{y} - b_1 \bar{x} $.
- Slope $ \beta_1 $: Change in predicted $ y $ per unit $ x $; estimated by $ b_1 = \frac{\text{cov}(x,y)}{\text{var}(x)} $.
-
Residuals: Prediction errors, $ e_i = y_i - \hat{y}_i $; basis for SSE and diagnostics.
- Leverage $ h_{ii} $: Influence potential from unusual predictor values; high leverage can affect fitted line.
- Cook’s distance: Influence of an observation on model fit; combines residual size and leverage.
-
Standard error of estimate: Typical residual size, $ \sqrt{\text{MSE}} $; used to compute coefficient standard errors.
-
SE $ \beta_1 $: Standard error of slope estimate; used in $ t = \frac{b_1}{\text{SE}(b_1)} $.
- Confidence interval for $ \beta_1 $: Plausible slope range, $ b_1 \pm t^* \cdot \text{SE}(b_1) $; related to slope t-test.
-
SE $ \beta_0 $: Standard error of intercept estimate; used in $ t = \frac{b_0}{\text{SE}(b_0)} $.
- Confidence interval for $ \beta_0 $: Plausible intercept range, $ b_0 \pm t^* \cdot \text{SE}(b_0) $; related to intercept t-test.
-
t-statistics: Coefficient test ratios, $ t = \frac{\text{estimate}}{\text{SE}} $; compare against t-distribution.
- t-stat $ \beta_1 $: Tests whether slope differs from 0; related to SE $ \beta_1 $ and slope CI.
- t-stat $ \beta_0 $: Tests whether intercept differs from 0; related to SE $ \beta_0 $ and intercept CI.
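The coefficient standard errors, t-statistics, and confidence intervals can be sketched for simple regression. The data are hypothetical. The SE formulas, $ \text{SE}(b_1) = \sqrt{\text{MSE} / \sum (x_i - \bar{x})^2} $ and $ \text{SE}(b_0) = \sqrt{\text{MSE}\,(1/n + \bar{x}^2 / \sum (x_i - \bar{x})^2)} $, are the standard ones and are not spelled out in the notes; the critical value 3.182 is the two-sided 95% t-table value for 3 degrees of freedom, hardcoded here as an assumption.

```python
import math

# Hypothetical paired data (illustrative only).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b1 = s_xy / s_xx              # slope estimate
b0 = y_bar - b1 * x_bar       # intercept estimate

residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
mse = sum(e ** 2 for e in residuals) / (n - 2)   # SSE / error df

se_b1 = math.sqrt(mse / s_xx)
se_b0 = math.sqrt(mse * (1 / n + x_bar ** 2 / s_xx))

t_b1 = b1 / se_b1             # tests H0: beta_1 = 0
t_b0 = b0 / se_b0             # tests H0: beta_0 = 0

t_star = 3.182                # t-table value, df = n - 2 = 3, 95% two-sided
ci_b1 = (b1 - t_star * se_b1, b1 + t_star * se_b1)
ci_b0 = (b0 - t_star * se_b0, b0 + t_star * se_b0)

print("b1 =", b1, "SE =", se_b1, "t =", t_b1, "CI =", ci_b1)
print("b0 =", b0, "SE =", se_b0, "t =", t_b0, "CI =", ci_b0)
```

The interval and the test agree: the slope CI contains 0 exactly when $ |t| $ falls below the critical value, as it does for this small sample.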
-
Inference and hypothesis testing: Uses sample evidence to evaluate population claims; uses $ H_0 $, $ H_1 $, p-values, CIs.
-
Hypotheses: Pair of claims tested statistically; includes null and alternative hypotheses.
- Null hypothesis $ H_0 $: Default claim/no effect; rejected only if sample evidence is strong.
- Alternative hypothesis $ H_1 $: Claim/effect being tested; supported when $ H_0 $ is rejected.
-
Significance level $ \alpha $: Rejection threshold, usually 0.05; probability of Type I error.
- Confidence level $ 1 - \alpha $: Long-run CI coverage; e.g. 95% when $ \alpha = 0.05 $.
-
Multiple tests: Many tests inflate false positives; corrected by Bonferroni/FDR.
- Bonferroni correction: Conservative threshold $ \frac{\alpha}{m} $ for $ m $ tests; controls family-wise error.
- False discovery rate: Expected proportion of false positives among discoveries; less conservative than Bonferroni.
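The two corrections can be compared on a hypothetical set of p-values. Bonferroni follows the $ \frac{\alpha}{m} $ threshold above. For FDR control this sketch uses the Benjamini-Hochberg step-up procedure, which is one common choice and an assumption here, since the notes do not name a specific method.

```python
alpha = 0.05
# Hypothetical p-values from m = 5 tests (illustrative only).
p_values = [0.001, 0.012, 0.025, 0.039, 0.20]
m = len(p_values)

# Bonferroni: every test is compared with the same strict threshold alpha / m.
bonferroni_threshold = alpha / m
bonferroni_rejected = [p for p in p_values if p <= bonferroni_threshold]

# Benjamini-Hochberg: find the largest rank k with p_(k) <= (k / m) * alpha,
# then reject the k smallest p-values.
sorted_p = sorted(p_values)
k_max = 0
for k, p in enumerate(sorted_p, start=1):
    if p <= k / m * alpha:
        k_max = k
bh_rejected = sorted_p[:k_max]

print("Bonferroni rejects:", bonferroni_rejected)
print("Benjamini-Hochberg rejects:", bh_rejected)
```

On these values Bonferroni rejects fewer hypotheses than the FDR procedure, which illustrates why it is described as the more conservative of the two.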
-
Types of errors: Wrong test decisions; split into Type I and Type II.
- Type I error: Rejecting true $ H_0 $; probability controlled by $ \alpha $.
-
Type II error $ \beta $: Failing to reject false $ H_0 $; related inversely to power.
- Power $ 1 - \beta $: Chance of detecting real effect; increases with effect size and sample size.
-
Effect size: Magnitude of effect independent of sample size; complements p-value.
- Cohen’s d: Standardized mean difference, $ d = \frac{\bar{x}_1 - \bar{x}_2}{s_p} $; used for mean comparisons.
- Eta squared: Proportion of variance explained, $ \eta^2 = \frac{\text{SS}_\text{effect}}{\text{SST}} $; related to ANOVA/F-test.
- p-value: Probability of data at least this extreme under $ H_0 $; compared with $ \alpha $.
- Confidence interval: Plausible parameter range, estimate $ \pm $ critical value $ \times $ SE; related to hypothesis tests.
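Cohen's d can be sketched on two hypothetical groups. The pooled standard deviation $ s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}} $ is the usual definition and is an assumption here, since the notes only name $ s_p $.

```python
import math
import statistics

# Two hypothetical independent groups (illustrative only).
group1 = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
group2 = [0.0, 2.0, 2.0, 2.0, 3.0, 3.0, 5.0, 7.0]

n1, n2 = len(group1), len(group2)
mean1, mean2 = statistics.mean(group1), statistics.mean(group2)
var1, var2 = statistics.variance(group1), statistics.variance(group2)

# Pooled standard deviation: df-weighted average of the two sample variances.
s_p = math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))

d = (mean1 - mean2) / s_p   # standardized mean difference

print("mean difference =", mean1 - mean2, "pooled SD =", s_p, "d =", d)
```

Unlike a p-value, d does not grow just because the sample gets larger; it expresses the difference in units of within-group spread.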
-
Test statistics: Standardized evidence measures; compared to reference distributions.
-
t-statistic: Mean/coefficient test statistic, $ t = \frac{\text{estimate} - \text{null}}{\text{SE}} $; uses t-distribution.
-
t-distribution: Reference distribution for t-statistics with unknown variance; depends on df.
- Unpaired t-test: Compares two independent means; uses difference in means over SE.
- Paired t-test: Tests mean of paired differences; reduces to one-sample t-test on differences.
- Critical t-value: Boundary from t-distribution; used in rejection rules and confidence intervals.
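The reduction of the paired t-test to a one-sample t-test on differences can be shown directly. The measurements are hypothetical, and the critical value 2.776 is the two-sided 95% t-table value for 4 degrees of freedom, hardcoded as an assumption because the standard library has no t-distribution.

```python
import math
import statistics

# Hypothetical before/after measurements on the same five units.
before = [10.0, 12.0, 9.0, 11.0, 13.0]
after = [12.0, 14.0, 10.0, 12.0, 15.0]

# Paired design: work with the per-unit differences only.
diffs = [a - b for a, b in zip(after, before)]
n = len(diffs)

d_bar = statistics.mean(diffs)      # mean difference
s_d = statistics.stdev(diffs)       # SD of the differences
se = s_d / math.sqrt(n)             # standard error of the mean difference

null_value = 0.0                    # H0: mean difference is 0
t_stat = (d_bar - null_value) / se
df = n - 1

t_crit = 2.776                      # t-table value, df = 4, 95% two-sided
reject_h0 = abs(t_stat) > t_crit

print("mean diff =", d_bar, "SE =", se, "t =", t_stat, "df =", df)
print("reject H0 at alpha = 0.05:", reject_h0)
```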
-
Chi-squared statistic: Count discrepancy, $ \chi^2 = \sum \frac{(O_i - E_i)^2}{E_i} $; compared to chi-square distribution.
-
Chi-squared distribution: Reference distribution for variance/count tests; depends on df.
- Critical chi-squared value: Boundary for chi-square rejection region; used in goodness-of-fit/independence tests.
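A sketch of the chi-squared test of independence on a hypothetical 2x2 contingency table. Expected counts use the usual rule (row total times column total, divided by the grand total), and 3.841 is the table critical value for 1 degree of freedom at $ \alpha = 0.05 $; both are stated here as assumptions since the notes do not give them.

```python
# Hypothetical 2x2 contingency table of observed joint frequencies.
observed = [
    [30, 10],
    [20, 40],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected counts under independence of the row and column variables.
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

df = (len(observed) - 1) * (len(observed[0]) - 1)
critical_value = 3.841              # chi-square table, df = 1, alpha = 0.05
reject_h0 = chi2 > critical_value

print("expected counts:", expected)
print("chi2 =", chi2, "df =", df, "reject H0:", reject_h0)
```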
-
F-statistic: Variance ratio, usually $ F = \frac{\text{MSR}}{\text{MSE}} $; tests model/group effects.
-
F-distribution: Reference distribution for ratios of variances; depends on two dfs.
- Critical F-value: Boundary from F-distribution; used in ANOVA/regression significance tests.
Retail Mining Report Methods To Review
The retail mining report also uses a few methods that extend the core review above: