Statistics Review

Review of key statistics concepts

  • Population: Entire group of interest; sample is drawn from it to estimate its properties.
    • Population properties: True but often unknown population summaries; estimated by sample properties.
      • Population proportion $ p $: Fraction of population with a property, $ p = \frac{X}{N} $; estimated by sample proportion $ \hat{p} $.
      • Population mode: Most frequent population value/category; estimated by the sample mode.
      • Population median: Middle population value / 50th percentile; related to quartiles and sample median.
      • Population mean $ \mu $: Population average, $ \mu = \frac{\sum x_i}{N} $; center used for population deviation and z-score.
    • Unit: One member/observation from population; produces a measured value $ x_i $.
      • Unit deviation: Distance from population mean, $ x_i - \mu $; basis of population variance.
        • Squared unit deviations: Squared distance, $ (x_i - \mu)^2 $; removes signs before summing.
          • Sum of squared unit deviations: Total squared distance, $ \sum (x_i - \mu)^2 $; numerator of population variance.
            • Total population variance: Unnormalized population spread, $ \sum (x_i - \mu)^2 $; becomes variance after division by $ N $.
              • Population variance $ \sigma^2 $: Average squared population deviation, $ \sigma^2 = \frac{\sum (x_i - \mu)^2}{N} $; square of $ \sigma $.
                • Population standard deviation $ \sigma $: Typical population distance from mean, $ \sigma = \sqrt{\sigma^2} $; used in z-score.
      • Z-score: Standardized position, $ z = \frac{x_i - \mu}{\sigma} $; relates raw value to mean and SD.
        • Standardization: Converts values to mean 0, variance 1; uses z-scores.
  • Sample: Observed subset $ \{x_1, \dots, x_n\} $; used to compute sample properties and infer population properties.
    • Sample properties: Summaries computed from sample; estimate population properties.
      • Central tendency: Measures typical/central value; includes mean, median, mode.
        • Mode: Most frequent sample value/category; related to observed frequencies.
        • Median: Middle ordered value / Q2; robust center related to quartiles and MAD.
        • Mean $ \bar{x} $: Sample average, $ \bar{x} = \frac{\sum x_i}{n} $; center for deviations, variance, regression.
          • Centroid / cluster mean $ m_i $: Mean of a cluster; used in WSS/BSS.
      • Position / order statistics: Describes where values fall after sorting; includes range, quantiles, quartiles.
        • Quantiles: Cut ordered data into equal-probability parts; general family containing percentiles/quartiles.
          • Percentiles: Quantiles on 0–100 scale; $ P_k $ is value below which $ k\% $ fall.
          • Quartiles: Quantiles splitting data into four parts; Q1, Q2, Q3.
            • First quartile Q1: 25th percentile; lower boundary of middle 50%; used in IQR.
            • Second quartile Q2: 50th percentile; same as median.
            • Third quartile Q3: 75th percentile; upper boundary of middle 50%; used in IQR.
            • Interquartile range IQR: Middle-50% spread, $ \text{IQR} = Q_3 - Q_1 $; robust dispersion measure.
              • Quartile deviation: Semi-IQR, $ \frac{Q_3 - Q_1}{2} $; robust spread around median.
    • Dispersion and variance: Amount of spread in data; related to deviation, variance, SD, range, MAD.
      • Deviation: Distance from mean, $ x_i - \bar{x} $; basis for variance, covariance, regression residual ideas.
        • Sum of deviations = 0: For mean-centered data, $ \sum (x_i - \bar{x}) = 0 $; explains why deviations are squared.
        • Squared deviation: Squared distance, $ (x_i - \bar{x})^2 $; basis for sample variance and sums of squares.
          • Sum of squared deviations: Total squared spread, $ \sum (x_i - \bar{x})^2 $; numerator of variance/SST.
            • Total sample variance: Unnormalized sample spread; related to $ \sum (x_i - \bar{x})^2 $.
              • Sample variance $ s^2 $: Squared sample deviation averaged over $ n-1 $ (Bessel's correction), $ s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1} $; unbiased estimate of $ \sigma^2 $.
                • Sample standard deviation $ s $: Typical sample distance from mean, $ s = \sqrt{s^2} $; used in SEM and CV.
            • Sums of squares: Squared-distance totals; connect variance, regression, ANOVA, clustering.
              • SS Total / SST: Total variation, $ \text{SST} = \sum (y_i - \bar{y})^2 $; decomposes into regression and error parts.
                • Sum of squared error / SSE: Unexplained variation, $ \text{SSE} = \sum (y_i - \hat{y}_i)^2 $; based on residuals.
                  • Mean squared error / MSE: Average squared error, $ \text{MSE} = \frac{\text{SSE}}{\text{df}_\text{error}} $ (with $ \text{df}_\text{error} = n - 2 $ in simple regression); used for RMSE and F-tests.
                    • Root MSE / RMSE: Error in original units, $ \text{RMSE} = \sqrt{\text{MSE}} $; same family as standard error of estimate.
                      • Standard error of estimate: Typical prediction error, often $ \sqrt{\text{MSE}} $; used for SE of regression coefficients.
                • Sum of squared regression / SSR: Explained variation, $ \text{SSR} = \sum (\hat{y}_i - \bar{y})^2 $; paired with SSE in $ \text{SST} = \text{SSR} + \text{SSE} $.
                  • Mean squared regression / MSR: Explained variance per model df, $ \text{MSR} = \frac{\text{SSR}}{\text{df}_\text{reg}} $; used in F-statistic.
                    • F-statistic: Ratio $ F = \frac{\text{MSR}}{\text{MSE}} $; tests whether model explains significant variation.
              • WSS / SSE: Within-cluster spread, $ \sum_i \sum_{x \in C_i} (x - m_i)^2 $, summed over each cluster $ C_i $ with centroid $ m_i $; smaller means tighter clusters.
                • Silhouette coefficient: Cluster quality, $ s = \frac{b - a}{\max(a,b)} $; relates cohesion $ a $ and separation $ b $.
              • BSS: Between-cluster spread; larger means better separation.
      • Median absolute deviation: Robust spread, $ \text{MAD} = \text{median}(|x_i - \text{median}(x)|) $; less sensitive to outliers.
      • Range: max - min; simplest spread measure, related to order statistics.
      • Coefficient of variation CV: Relative spread, $ \text{CV} = \frac{s}{\bar{x}} $; compares variability across different scales.
    • Frequencies and proportions: Count-based summaries; basis for categorical analysis and chi-square tests.
      • Observed frequencies: Actual category counts $ O_i $; compared with expected frequencies in chi-square tests.
      • Expected frequencies: Counts expected under $ H_0 $, often $ E_i $; used in chi-square statistic.
      • Proportion: Sample fraction, $ \hat{p} = \frac{x}{n} $; estimates population proportion $ p $.
        • Standard error of proportion: Sampling error of $ \hat{p} $, $ \text{SE} = \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} $; used for CI/tests on proportions.
      • Contingency table: Cross-tab of two categorical variables; cells contain joint frequencies $ n_{ij} $.
        • Joint frequencies $ n_{ij} $: Count in row i, column j; basis for support and chi-square distance.
        • Support $ P(X \cap Y) $: Frequency/probability of joint event; base measure for association rules.
          • Confidence $ P(Y \mid X) $: Conditional rule strength, $ P(Y \mid X) = \frac{P(X \cap Y)}{P(X)} $; related to support.
            • Lift: Rule strength over chance, $ \frac{P(Y \mid X)}{P(Y)} $; >1 means positive association.
        • Chi-square distance: Categorical profile distance; used in correspondence analysis and related to contingency tables.
  • Association between variables: Describes dependence between variables; includes covariance, correlation, contingency, PCA.
    • Covariance: Joint variation, $ \text{cov}(X,Y) = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n-1} $; sign shows shared direction.
      • Correlation: Standardized covariance, $ r = \frac{\text{cov}(X,Y)}{s_x s_y} $; scale-free linear association.
        • Coefficient of determination $ R^2 $: Explained proportion, $ R^2 = \frac{\text{SSR}}{\text{SST}} = 1 - \frac{\text{SSE}}{\text{SST}} $; square of $ r $ in simple regression.
          • Adjusted $ R^2 $: Penalized $ R^2 $ for model size; accounts for predictors/df.
      • PCA: Finds orthogonal directions of maximum variance; uses covariance/correlation matrix.
        • Absolute contribution: How much a variable/point contributes to a PCA axis; relates variables to components.
        • Relative contribution / $ \cos^2 $: Quality of representation on PCA axis; high $ \cos^2 $ means well represented.
    • Shape: Distribution form beyond center/spread; includes skewness and kurtosis.
      • Skewness: Asymmetry, roughly $ \frac{\sum (x_i - \bar{x})^3}{n s^3} $; related to tail imbalance.
      • Kurtosis: Tail heaviness (sometimes described as peakedness), roughly $ \frac{\sum (x_i - \bar{x})^4}{n s^4} $; related to outlier tendency.
  • Regression: Models $ y $ from predictors $ x $; connects coefficients, residuals, SSE, t/F tests.
    • Regression coefficients: Parameters controlling fitted model; in simple regression the model is $ y = \beta_0 + \beta_1 x + \varepsilon $ and the fitted line is $ \hat{y} = b_0 + b_1 x $.
      • Intercept $ \beta_0 $: Predicted $ y $ when $ x=0 $; estimated with $ b_0 = \bar{y} - b_1 \bar{x} $.
      • Slope $ \beta_1 $: Change in predicted $ y $ per unit $ x $; estimated by $ b_1 = \frac{\text{cov}(x,y)}{\text{var}(x)} $.
    • Residuals: Prediction errors, $ e_i = y_i - \hat{y}_i $; basis for SSE and diagnostics.
      • Leverage $ h_{ii} $: Influence potential from unusual predictor values; high leverage can affect fitted line.
      • Cook’s distance: Influence of an observation on model fit; combines residual size and leverage.
    • Standard error of estimate: Typical residual size, $ \sqrt{\text{MSE}} $; used to compute coefficient standard errors.
      • SE $ \beta_1 $: Standard error of slope estimate; used in $ t = \frac{b_1}{\text{SE}(b_1)} $.
        • Confidence interval for $ \beta_1 $: Plausible slope range, $ b_1 \pm t^* \cdot \text{SE}(b_1) $; related to slope t-test.
      • SE $ \beta_0 $: Standard error of intercept estimate; used in $ t = \frac{b_0}{\text{SE}(b_0)} $.
        • Confidence interval for $ \beta_0 $: Plausible intercept range, $ b_0 \pm t^* \cdot \text{SE}(b_0) $; related to intercept t-test.
    • t-statistics: Coefficient test ratios, $ t = \frac{\text{estimate}}{\text{SE}} $; compare against t-distribution.
      • t-stat $ \beta_1 $: Tests whether slope differs from 0; related to SE $ \beta_1 $ and slope CI.
      • t-stat $ \beta_0 $: Tests whether intercept differs from 0; related to SE $ \beta_0 $ and intercept CI.
  • Inference and hypothesis testing: Uses sample evidence to evaluate population claims; uses $ H_0 $, $ H_1 $, p-values, CIs.
    • Hypotheses: Pair of claims tested statistically; includes null and alternative hypotheses.
      • Null hypothesis $ H_0 $: Default claim/no effect; rejected only if sample evidence is strong.
      • Alternative hypothesis $ H_1 $: Claim/effect being tested; supported when $ H_0 $ is rejected.
    • Significance level $ \alpha $: Rejection threshold, usually 0.05; probability of Type I error.
      • Confidence level $ 1 - \alpha $: Long-run CI coverage; e.g. 95% when $ \alpha = 0.05 $.
      • Multiple tests: Many tests inflate false positives; corrected by Bonferroni/FDR.
        • Bonferroni correction: Conservative threshold $ \frac{\alpha}{m} $ for $ m $ tests; controls family-wise error.
        • False discovery rate: Expected proportion of false positives among discoveries; less conservative than Bonferroni.
      • Types of errors: Wrong test decisions; split into Type I and Type II.
        • Type I error: Rejecting true $ H_0 $; probability controlled by $ \alpha $.
        • Type II error $ \beta $: Failing to reject false $ H_0 $; related inversely to power.
          • Power $ 1 - \beta $: Chance of detecting real effect; increases with effect size and sample size.
    • Effect size: Magnitude of effect independent of sample size; complements p-value.
      • Cohen’s d: Standardized mean difference, $ d = \frac{\bar{x}_1 - \bar{x}_2}{s_p} $, where $ s_p $ is the pooled standard deviation; used for mean comparisons.
      • Eta squared: Proportion of variance explained, $ \eta^2 = \frac{\text{SS}_\text{effect}}{\text{SST}} $; related to ANOVA/F-test.
    • p-value: Probability, assuming $ H_0 $ is true, of a test statistic at least as extreme as the one observed; compared with $ \alpha $.
    • Confidence interval: Plausible parameter range, estimate $ \pm $ critical value $ \times $ SE; related to hypothesis tests.
    • Test statistics: Standardized evidence measures; compared to reference distributions.
      • t-statistic: Mean/coefficient test statistic, $ t = \frac{\text{estimate} - \text{null}}{\text{SE}} $; uses t-distribution.
        • t-distribution: Reference distribution for t-statistics with unknown variance; depends on df.
          • Unpaired t-test: Compares two independent means; uses difference in means over SE.
          • Paired t-test: Tests mean of paired differences; reduces to one-sample t-test on differences.
          • Critical t-value: Boundary from t-distribution; used in rejection rules and confidence intervals.
      • Chi-squared statistic: Count discrepancy, $ \chi^2 = \sum \frac{(O - E)^2}{E} $; compared to chi-square distribution.
        • Chi-squared distribution: Reference distribution for variance/count tests; depends on df.
          • Critical chi-squared value: Boundary for chi-square rejection region; used in goodness-of-fit/independence tests.
      • F-statistic: Variance ratio, usually $ F = \frac{\text{MSR}}{\text{MSE}} $; tests model/group effects.
        • F-distribution: Reference distribution for ratios of variances; depends on two dfs.
          • Critical F-value: Boundary from F-distribution; used in ANOVA/regression significance tests.
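
The descriptive formulas in the outline above can be checked numerically. The following is a minimal Python sketch using only the standard library. The data values are made up for illustration, and the sample mean and SD stand in for $ \mu $ and $ \sigma $ in the z-score. Quartile conventions differ between tools; the default method of `statistics.quantiles` is only one of them.

```python
import math
import statistics

# Hypothetical sample values, chosen only for illustration.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(sample)

mean = sum(sample) / n  # sample mean (x-bar)

# Deviations from the mean sum to zero, which is why they are squared.
deviations = [x - mean for x in sample]
sum_sq = sum(d ** 2 for d in deviations)  # sum of squared deviations

pop_variance = sum_sq / n            # population formula: divide by N
sample_variance = sum_sq / (n - 1)   # sample formula: divide by n - 1
sample_sd = math.sqrt(sample_variance)

# z-scores, using the sample mean and SD in place of mu and sigma.
z_scores = [(x - mean) / sample_sd for x in sample]

# Median absolute deviation: median of |x_i - median(x)|.
med = statistics.median(sample)
mad = statistics.median([abs(x - med) for x in sample])

# Quartiles and IQR (default 'exclusive' method; other conventions exist).
q1, q2, q3 = statistics.quantiles(sample, n=4)
iqr = q3 - q1

# Coefficient of variation: SD relative to the mean.
cv = sample_sd / mean

print("variance (N):", pop_variance)
print("variance (n-1):", sample_variance)
print("MAD:", mad, "IQR:", iqr, "CV:", cv)
```

With these numbers the sum of squared deviations is 32, so dividing by $ N = 8 $ gives 4, while dividing by $ n - 1 = 7 $ gives about 4.57. The z-scores have mean 0 and, computed with the $ n-1 $ formula, variance 1.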

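The regression entries in the outline (slope, intercept, SST, SSR, SSE, $ R^2 $, MSE, F and t) can be traced on a small example. This is a hedged sketch in plain Python on made-up data, not a replacement for a statistics library. The slope standard error $ \text{SE}(b_1) = \sqrt{\text{MSE} / \sum (x_i - \bar{x})^2} $ is the usual simple-regression formula and is an addition here, since the outline names $ \text{SE}(b_1) $ without giving a formula.

```python
import math

# Hypothetical paired data, chosen only for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Cross-product and squared-deviation sums; the (n - 1) factors in
# cov(x, y) / var(x) cancel, so the raw sums are enough for the slope.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)

b1 = s_xy / s_xx          # slope estimate
b0 = y_bar - b1 * x_bar   # intercept estimate

y_hat = [b0 + b1 * xi for xi in x]

sst = sum((yi - y_bar) ** 2 for yi in y)                # total variation
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)            # explained variation
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # unexplained variation

r_squared = ssr / sst
mse = sse / (n - 2)       # error df = n - 2 in simple regression
msr = ssr / 1             # regression df = 1 (one predictor)
f_stat = msr / mse

se_b1 = math.sqrt(mse / s_xx)   # assumed standard formula for SE(b1)
t_b1 = b1 / se_b1               # tests H0: slope = 0

print("b0, b1:", b0, b1)
print("SST, SSR + SSE:", sst, ssr + sse)
print("R^2:", r_squared, "F:", f_stat, "t:", t_b1)
```

With one predictor, the squared slope t-statistic equals the F-statistic, so the two tests in the outline lead to the same conclusion in simple regression.
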
Retail Mining Report Methods To Review

The retail mining report also uses a few methods that extend the core review above:
