Statistics Review

Review of key statistics concepts

Population: Entire group of interest; sample is drawn from it to estimate its properties.
- Population properties: True but often unknown population summaries; estimated by sample properties.
  - Population proportion $ p $: Fraction of population with a property, $ p = \frac{X}{N} $; estimated by sample proportion $ \hat{p} $.
  - Population mode: Most frequent population value/category; estimated by the sample mode.
  - Population median: Middle population value / 50th percentile; estimated by the sample median.
  - Population mean $ \mu $: Population average, $ \mu = \frac{\sum x_i}{N} $; center used for population deviation and z-score.
- Unit: One member/observation from population; produces a measured value $ x_i $.
  - Unit deviation: Distance from population mean, $ x_i - \mu $; basis of population variance.
    - Squared unit deviations: Squared distance, $ (x_i - \mu)^2 $; removes signs before summing.
      - Sum of squared unit deviations: Total squared distance, $ \sum (x_i - \mu)^2 $; numerator of population variance.
        
        Total population variance: Unnormalized population spread, $ \sum (x_i - \mu)^2 $; becomes variance after division by $ N $.
        
        Population variance $ \sigma^2 $: Average squared population deviation, $ \sigma^2 = \frac{\sum (x_i - \mu)^2}{N} $; square of $ \sigma $.
        
        Population standard deviation $ \sigma $: Typical population distance from mean, $ \sigma = \sqrt{\sigma^2} $; used in z-score.
  - Z-score: Standardized position, $ z = \frac{x_i - \mu}{\sigma} $; relates raw value to mean and SD.
    - Standardization: Converts values to mean 0, variance 1; uses z-scores.
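
A minimal sketch of the population formulas above, assuming the full population is available as a list. The function name `population_summary` and the eight values are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import math


def population_summary(values):
    """Return (mu, sigma^2, sigma, z-scores), treating `values` as the whole population."""
    n = len(values)
    mu = sum(values) / n
    # Sum of squared unit deviations: the numerator of the population variance.
    sum_sq_dev = sum((x - mu) ** 2 for x in values)
    variance = sum_sq_dev / n  # divide by N (population), not N - 1
    sigma = math.sqrt(variance)
    z_scores = [(x - mu) / sigma for x in values]
    return mu, variance, sigma, z_scores


# Hypothetical population of eight values (illustrative only).
population = [2, 4, 4, 4, 5, 5, 7, 9]
mu, variance, sigma, z_scores = population_summary(population)
print(mu, variance, sigma)
print(z_scores)
```

After standardization the z-scores have mean 0 and population variance 1, which can be checked directly on `z_scores`.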

Sample: Observed subset $ \{x_1, \dots, x_n\} $; used to compute sample properties and infer population properties.

Sample properties: Summaries computed from sample; estimate population properties.
- Central tendency: Measures typical/central value; includes mean, median, mode.
  - Mode: Most frequent sample value/category; related to observed frequencies.
  - Median: Middle ordered value / Q2; robust center related to quartiles and MAD.
  - Mean $ \bar{x} $: Sample average, $ \bar{x} = \frac{\sum x_i}{n} $; center for deviations, variance, regression.
    - Centroid / cluster mean $ m_i $: Mean of a cluster; used in WSS/BSS.
- Position / order statistics: Describes where values fall after sorting; includes range, quantiles, quartiles.
  - Quantiles: Cut ordered data into equal-probability parts; general family containing percentiles/quartiles.
    - Percentiles: Quantiles on 0–100 scale; $ P_k $ is value below which $ k\% $ fall.
    - Quartiles: Quantiles splitting data into four parts; Q1, Q2, Q3.
      - First quartile Q1: 25th percentile; lower boundary of middle 50%; used in IQR.
      - Second quartile Q2: 50th percentile; same as median.
      - Third quartile Q3: 75th percentile; upper boundary of middle 50%; used in IQR.
      - Interquartile range IQR: Middle-50% spread, $ \text{IQR} = Q_3 - Q_1 $; robust dispersion measure.
        
        Quartile deviation: Semi-IQR, $ \frac{Q_3 - Q_1}{2} $; robust spread around median.
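
A short sketch of quartiles, IQR, and quartile deviation using Python's standard `statistics` module. The data are hypothetical. Quartile conventions differ between textbooks and software, so the `method="inclusive"` choice here is an assumption and other conventions can give slightly different Q1 and Q3.

```python
import statistics

# Hypothetical sample (unsorted on purpose; quantiles() sorts internally).
data = [7, 1, 9, 3, 5, 2, 8, 4, 6]

# n=4 asks for the three cut points that split the data into four parts.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

iqr = q3 - q1                 # middle-50% spread
quartile_deviation = iqr / 2  # semi-IQR

print(q1, q2, q3, iqr, quartile_deviation)
print(q2 == statistics.median(data))  # Q2 is the median
```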

Dispersion and variance: Amount of spread in data; related to deviation, variance, SD, range, MAD.

Deviation: Distance from mean, $ x_i - \bar{x} $; basis for variance, covariance, regression residual ideas.
- Sum of deviations = 0: Deviations from the sample mean always cancel, $ \sum (x_i - \bar{x}) = 0 $; this is why deviations are squared before summing.
- Squared deviation: Squared distance, $ (x_i - \bar{x})^2 $; basis for sample variance and sums of squares.
  - Sum of squared deviations: Total squared spread, $ \sum (x_i - \bar{x})^2 $; numerator of variance/SST.
    - Total sample variance: Unnormalized sample spread, $ \sum (x_i - \bar{x})^2 $; becomes variance after division by $ n-1 $.
      - Sample variance $ s^2 $: Average squared sample deviation, $ s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1} $; estimates $ \sigma^2 $.
        
        Sample standard deviation $ s $: Typical sample distance from mean, $ s = \sqrt{s^2} $; used in SEM and CV.
    - Sums of squares: Squared-distance totals; connect variance, regression, ANOVA, clustering.
      - SS Total / SST: Total variation, $ \text{SST} = \sum (y_i - \bar{y})^2 $; decomposes into regression and error parts.
        
        Sum of squared error / SSE: Unexplained variation, $ \text{SSE} = \sum (y_i - \hat{y}_i)^2 $; based on residuals.
        
        Mean squared error / MSE: Average squared error, $ \text{MSE} = \frac{\text{SSE}}{\text{df}_\text{error}} $; used for RMSE and F-tests.
        
        Root MSE / RMSE: Error in original units, $ \text{RMSE} = \sqrt{\text{MSE}} $; same family as standard error of estimate.
        
        Standard error of estimate: Typical prediction error, often $ \sqrt{\text{MSE}} $; used for SE of regression coefficients.
        
        Sum of squared regression / SSR: Explained variation, $ \text{SSR} = \sum (\hat{y}_i - \bar{y})^2 $; paired with SSE in $ \text{SST} = \text{SSR} + \text{SSE} $.
        
        Mean squared regression / MSR: Explained variance per model df, $ \text{MSR} = \frac{\text{SSR}}{\text{df}_\text{reg}} $; used in F-statistic.
        
        F-statistic: Ratio $ F = \frac{\text{MSR}}{\text{MSE}} $; tests whether model explains significant variation.
      - WSS / SSE: Within-cluster spread, $ \text{WSS} = \sum_i \sum_{x \in C_i} (x - m_i)^2 $ over clusters $ C_i $ with centroids $ m_i $; smaller means tighter clusters.
        
        Silhouette coefficient: Cluster quality, $ s = \frac{b - a}{\max(a,b)} $; relates cohesion $ a $ and separation $ b $.
      - BSS: Between-cluster spread, $ \text{BSS} = \sum_i n_i (m_i - \bar{x})^2 $ for cluster sizes $ n_i $; larger means better separation.
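
The regression sums of squares can be checked numerically. The sketch below assumes a simple least-squares line fitted to five hypothetical points, so $ \text{df}_\text{reg} = 1 $ and the error df is $ n - 2 $; all variable names are illustrative.

```python
import math

# Hypothetical data (illustrative only).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares slope and intercept for a simple regression.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

sst = sum((yi - y_bar) ** 2 for yi in y)               # total variation
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # unexplained variation
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)           # explained variation

df_reg = 1        # one predictor
df_error = n - 2  # two estimated coefficients
msr = ssr / df_reg
mse = sse / df_error
rmse = math.sqrt(mse)
f_stat = msr / mse
r_squared = ssr / sst

print(sst, ssr, sse)
print(mse, rmse, f_stat, r_squared)
```

Up to floating-point rounding, `sst` equals `ssr + sse`, which is the decomposition $ \text{SST} = \text{SSR} + \text{SSE} $.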

Median absolute deviation: Robust spread, $ \text{MAD} = \text{median}(|x_i - \text{median}(x)|) $; less sensitive to outliers.

Range: max - min; simplest spread measure, related to order statistics.
Coefficient of variation CV: Relative spread, $ \text{CV} = \frac{s}{\bar{x}} $; compares variability across different scales.
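
A small sketch comparing the robust and non-robust spread measures on a hypothetical sample with one extreme value. The helper name `median_absolute_deviation` is illustrative, not a library function.

```python
import statistics


def median_absolute_deviation(values):
    """Median of absolute deviations from the median (no scaling constant applied)."""
    center = statistics.median(values)
    return statistics.median([abs(x - center) for x in values])


# Hypothetical sample with one outlier (100).
data = [2, 4, 4, 5, 7, 9, 100]

mad = median_absolute_deviation(data)
data_range = max(data) - min(data)
s = statistics.stdev(data)     # sample SD, n - 1 denominator
cv = s / statistics.mean(data)

print(mad, data_range, s, cv)
```

The single outlier inflates the range and the standard deviation, while the MAD stays close to the spread of the remaining values.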

Frequencies and proportions: Count-based summaries; basis for categorical analysis and chi-square tests.
- Observed frequencies: Actual category counts $ O_i $; compared with expected frequencies in chi-square tests.
- Expected frequencies: Counts expected under $ H_0 $, often $ E_i $; used in chi-square statistic.
- Proportion: Sample fraction, $ \hat{p} = \frac{x}{n} $; estimates population proportion $ p $.
  - Standard error of proportion: Sampling error of $ \hat{p} $, $ \text{SE} = \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} $; used for CI/tests on proportions.
- Contingency table: Cross-tab of two categorical variables; cells contain joint frequencies $ n_{ij} $.
  - Joint frequencies $ n_{ij} $: Count in row i, column j; basis for support and chi-square distance.
  - Support $ P(X \cap Y) $: Frequency/probability of joint event; base measure for association rules.
    - Confidence $ P(Y \mid X) $: Conditional rule strength, $ P(Y \mid X) = \frac{P(X \cap Y)}{P(X)} $; related to support.
      - Lift: Rule strength over chance, $ \frac{P(Y \mid X)}{P(Y)} $; >1 means positive association.
  - Chi-square distance: Categorical profile distance; used in correspondence analysis and related to contingency tables.
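
As an illustration, the sketch below computes support, confidence, lift, expected frequencies, and the chi-square statistic from one hypothetical 2x2 contingency table. The counts and the row/column meaning are assumptions made for the example.

```python
# Hypothetical joint frequencies n_ij.
# Rows: X present / X absent; columns: Y present / Y absent.
table = [
    [30, 10],
    [20, 40],
]

n = sum(sum(row) for row in table)
row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]

# Association-rule measures for the rule X -> Y.
support_xy = table[0][0] / n              # P(X and Y)
confidence = table[0][0] / row_totals[0]  # P(Y | X)
lift = confidence / (col_totals[0] / n)   # P(Y | X) / P(Y)

# Expected counts under independence: row total * column total / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Chi-square statistic: sum of (O - E)^2 / E over all cells.
chi2 = sum(
    (o - e) ** 2 / e
    for o_row, e_row in zip(table, expected)
    for o, e in zip(o_row, e_row)
)
df = (len(table) - 1) * (len(table[0]) - 1)

print(support_xy, confidence, lift)
print(expected, chi2, df)
```

A lift above 1 and a chi-square statistic well above the df = 1 critical value at $ \alpha = 0.05 $ (about 3.84) both point to a positive association in this made-up table.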

Association between variables: Describes dependence between variables; includes covariance, correlation, contingency, PCA.
- Covariance: Joint variation, $ \text{cov}(X,Y) = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n-1} $; sign shows shared direction.
  - Correlation: Standardized covariance, $ r = \frac{\text{cov}(X,Y)}{s_x s_y} $; scale-free linear association.
    - Coefficient of determination $ R^2 $: Explained proportion, $ R^2 = \frac{\text{SSR}}{\text{SST}} = 1 - \frac{\text{SSE}}{\text{SST}} $; square of $ r $ in simple regression.
      - Adjusted $ R^2 $: Penalized $ R^2 $ for model size; accounts for predictors/df.
  - PCA: Finds orthogonal directions of maximum variance; uses covariance/correlation matrix.
    - Absolute contribution: How much a variable/point contributes to a PCA axis; relates variables to components.
    - Relative contribution / $ \cos^2 $: Quality of representation on PCA axis; high $ \cos^2 $ means well represented.
- Shape: Distribution form beyond center/spread; includes skewness and kurtosis.
  - Skewness: Asymmetry, roughly $ \frac{\sum (x_i - \bar{x})^3}{n s^3} $; related to tail imbalance.
  - Kurtosis: Tail heaviness, roughly $ \frac{\sum (x_i - \bar{x})^4}{n s^4} $; related to outlier tendency.
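
A minimal sketch of covariance, correlation, and the rough skewness/kurtosis formulas above, written with the standard library only. The function names and the data are hypothetical; `standardized_moment` follows the "roughly" formulas in these notes (sample SD with an $ n-1 $ denominator), which is one of several conventions.

```python
import math


def covariance(xs, ys):
    """Sample covariance with an n - 1 denominator."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


def correlation(xs, ys):
    """Pearson r: covariance divided by the product of sample SDs."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))


def standardized_moment(xs, k):
    """sum((x_i - x_bar)^k) / (n * s^k); k=3 gives skewness, k=4 gives kurtosis."""
    n = len(xs)
    x_bar = sum(xs) / n
    s = math.sqrt(covariance(xs, xs))
    return sum((x - x_bar) ** k for x in xs) / (n * s ** k)


# Hypothetical paired data (illustrative only).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

cov_xy = covariance(x, y)
r = correlation(x, y)
skew_x = standardized_moment(x, 3)
kurt_x = standardized_moment(x, 4)

print(cov_xy, r, r ** 2)
print(skew_x, kurt_x)
```

Because `x` is symmetric around its mean, its skewness comes out as 0. In a simple regression of `y` on `x`, `r ** 2` is the coefficient of determination.
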

Regression: Models $ y $ from predictors $ x $; connects coefficients, residuals, SSE, t/F tests.
- Regression coefficients: Parameters controlling fitted model; in simple regression the model is $ y = \beta_0 + \beta_1 x + \varepsilon $ and the fitted line is $ \hat{y} = b_0 + b_1 x $.
  - Intercept $ \beta_0 $: Predicted $ y $ when $ x=0 $; estimated with $ b_0 = \bar{y} - b_1 \bar{x} $.
  - Slope $ \beta_1 $: Change in predicted $ y $ per unit $ x $; estimated by $ b_1 = \frac{\text{cov}(x,y)}{\text{var}(x)} $.
- Residuals: Prediction errors, $ e_i = y_i - \hat{y}_i $; basis for SSE and diagnostics.
  - Leverage $ h_{ii} $: Influence potential from unusual predictor values; high leverage can affect fitted line.
  - Cook’s distance: Influence of an observation on model fit; combines residual size and leverage.
- Standard error of estimate: Typical residual size, $ \sqrt{\text{MSE}} $; used to compute coefficient standard errors.
  - SE $ \beta_1 $: Standard error of slope estimate; used in $ t = \frac{b_1}{\text{SE}(b_1)} $.
    - Confidence interval for $ \beta_1 $: Plausible slope range, $ b_1 \pm t^* \cdot \text{SE}(b_1) $; related to slope t-test.
  - SE $ \beta_0 $: Standard error of intercept estimate; used in $ t = \frac{b_0}{\text{SE}(b_0)} $.
    - Confidence interval for $ \beta_0 $: Plausible intercept range, $ b_0 \pm t^* \cdot \text{SE}(b_0) $; related to intercept t-test.
- t-statistics: Coefficient test ratios, $ t = \frac{\text{estimate}}{\text{SE}} $; compare against t-distribution.
  - t-stat $ \beta_1 $: Tests whether slope differs from 0; related to SE $ \beta_1 $ and slope CI.
  - t-stat $ \beta_0 $: Tests whether intercept differs from 0; related to SE $ \beta_0 $ and intercept CI.
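
The coefficient standard errors, t-statistics, slope interval, and leverages can be sketched for a simple regression as follows. The data are hypothetical, and the critical value 3.182 is an assumed table lookup (two-sided 95%, df = 3) rather than something computed here.

```python
import math

# Hypothetical data (illustrative only).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b1 = s_xy / s_xx         # slope: cov(x, y) / var(x)
b0 = y_bar - b1 * x_bar  # intercept

residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
sse = sum(e ** 2 for e in residuals)
mse = sse / (n - 2)
se_estimate = math.sqrt(mse)  # standard error of estimate

# Standard errors of the coefficients in simple regression.
se_b1 = se_estimate / math.sqrt(s_xx)
se_b0 = se_estimate * math.sqrt(1 / n + x_bar ** 2 / s_xx)

# t-statistics against a null value of 0.
t_b1 = b1 / se_b1
t_b0 = b0 / se_b0

# Assumed lookup: two-sided 95% critical t-value for df = n - 2 = 3.
t_crit = 3.182
ci_b1 = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)

# Leverage of each observation in simple regression.
leverage = [1 / n + (xi - x_bar) ** 2 / s_xx for xi in x]

print(b0, b1, se_b0, se_b1)
print(t_b0, t_b1, ci_b1)
print(leverage)
```

In this made-up example the slope interval contains 0, which agrees with `t_b1` being smaller than `t_crit`. The leverages sum to 2, the number of estimated coefficients.
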

Inference and hypothesis testing: Uses sample evidence to evaluate population claims; uses $ H_0 $, $ H_1 $, p-values, CIs.
- Hypotheses: Pair of claims tested statistically; includes null and alternative hypotheses.
  - Null hypothesis $ H_0 $: Default claim/no effect; rejected only if sample evidence is strong.
  - Alternative hypothesis $ H_1 $: Claim/effect being tested; supported when $ H_0 $ is rejected.
- Significance level $ \alpha $: Rejection threshold, usually 0.05; maximum accepted probability of a Type I error.
  - Confidence level $ 1 - \alpha $: Long-run CI coverage; e.g. 95% when $ \alpha = 0.05 $.
  - Multiple tests: Many tests inflate false positives; corrected by Bonferroni/FDR.
    - Bonferroni correction: Conservative threshold $ \frac{\alpha}{m} $ for $ m $ tests; controls family-wise error.
    - False discovery rate: Expected proportion of false positives among discoveries; less conservative than Bonferroni.
  - Types of errors: Wrong test decisions; split into Type I and Type II.
    - Type I error: Rejecting true $ H_0 $; probability controlled by $ \alpha $.
    - Type II error $ \beta $: Failing to reject false $ H_0 $; related inversely to power.
      - Power $ 1 - \beta $: Chance of detecting real effect; increases with effect size and sample size.
- Effect size: Magnitude of effect independent of sample size; complements p-value.
  - Cohen’s d: Standardized mean difference, $ d = \frac{\bar{x}_1 - \bar{x}_2}{s_p} $; used for mean comparisons.
  - Eta squared: Proportion of variance explained, $ \eta^2 = \frac{\text{SS}_\text{effect}}{\text{SST}} $; related to ANOVA/F-test.
- p-value: Probability, assuming $ H_0 $ is true, of a result at least as extreme as the one observed; compared with $ \alpha $.
- Confidence interval: Plausible parameter range, estimate $ \pm $ critical value $ \times $ SE; related to hypothesis tests.
- Test statistics: Standardized evidence measures; compared to reference distributions.
  - t-statistic: Mean/coefficient test statistic, $ t = \frac{\text{estimate} - \text{null}}{\text{SE}} $; uses t-distribution.
    - t-distribution: Reference distribution for t-statistics with unknown variance; depends on df.
      - Unpaired t-test: Compares two independent means; uses difference in means over SE.
      - Paired t-test: Tests mean of paired differences; reduces to one-sample t-test on differences.
      - Critical t-value: Boundary from t-distribution; used in rejection rules and confidence intervals.
  - Chi-squared statistic: Count discrepancy, $ \chi^2 = \sum \frac{(O - E)^2}{E} $; compared to chi-square distribution.
    - Chi-squared distribution: Reference distribution for variance/count tests; depends on df.
      - Critical chi-squared value: Boundary for chi-square rejection region; used in goodness-of-fit/independence tests.
  - F-statistic: Variance ratio, usually $ F = \frac{\text{MSR}}{\text{MSE}} $; tests model/group effects.
    - F-distribution: Reference distribution for ratios of variances; depends on two dfs.
      - Critical F-value: Boundary from F-distribution; used in ANOVA/regression significance tests.
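
A hedged sketch tying together an unpaired t-statistic, Cohen's d, and the Bonferroni threshold. It assumes equal variances (pooled SD), uses hypothetical groups and hypothetical p-values, and the function names are illustrative.

```python
import math
import statistics


def pooled_t_and_cohens_d(a, b):
    """Equal-variance two-sample t-statistic, Cohen's d, and df for independent groups."""
    n1, n2 = len(a), len(b)
    mean_diff = statistics.mean(a) - statistics.mean(b)
    pooled_var = (
        (n1 - 1) * statistics.variance(a) + (n2 - 1) * statistics.variance(b)
    ) / (n1 + n2 - 2)
    s_p = math.sqrt(pooled_var)
    t_stat = mean_diff / (s_p * math.sqrt(1 / n1 + 1 / n2))  # difference over its SE
    d = mean_diff / s_p                                      # standardized mean difference
    return t_stat, d, n1 + n2 - 2


def bonferroni_reject(p_values, alpha=0.05):
    """Reject each H0 only if its p-value is at most alpha / m."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


# Hypothetical independent groups (illustrative only).
group_a = [6, 7, 8, 9, 10]
group_b = [4, 5, 6, 7, 8]
t_stat, cohens_d, df = pooled_t_and_cohens_d(group_a, group_b)

# Hypothetical p-values from three separate tests.
decisions = bonferroni_reject([0.001, 0.02, 0.04])

print(t_stat, cohens_d, df)
print(decisions)
```

The t-statistic depends on sample size through the standard error, while Cohen's d does not, which is why effect size complements the p-value. With three tests the Bonferroni threshold is $ 0.05 / 3 $, so only the smallest of the made-up p-values leads to rejection.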

Retail Mining Report Methods To Review

The retail mining report also uses a few methods that extend the core review above:

Temporal Holdout Evaluation

Customer Behavior Feature Engineering

Hierarchical Clustering and Ward Linkage

Correspondence Analysis

Sparse Matrices and Truncated SVD

Apriori and Eclat Frequent Itemset Mining

