Inferential statistics

Table of Contents

  1. Introduction to Statistical Inference
  2. Probability Review
  3. Discrete Probability Distributions
  4. Continuous Probability Distributions

1. Introduction to Statistical Inference

1.1 What is Statistical Inference?

Statistical inference is the process of drawing conclusions about a population based on information obtained from a sample. It allows us to make probabilistic statements about population parameters using sample statistics.

The two main branches of statistical inference are:

  • Estimation: Using sample data to estimate population parameters
  • Hypothesis Testing: Using sample data to test claims about population parameters

1.2 Population vs Sample

Population

A population is the complete set of all items of interest in a statistical study. Population parameters are typically denoted by Greek letters:

  • \(\mu\) (mu): Population mean
  • \(\sigma\) (sigma): Population standard deviation
  • \(p\): Population proportion

Sample

A sample is a subset of the population that is actually observed. Sample statistics are typically denoted by Roman letters:

  • \(\bar{x}\) (x-bar): Sample mean
  • \(s\): Sample standard deviation
  • \(\hat{p}\) (p-hat): Sample proportion

Why Sample?

  • Cost efficiency
  • Time constraints
  • Practical impossibility of measuring entire population
  • Destructive testing scenarios

1.3 Data Science Workflow

The modern data science workflow for inferential statistics follows these steps:

  1. Import: Bring data into R
  2. Tidy: Organize data into tidy format
  3. Transform: Manipulate data for analysis
  4. Visualize: Create plots to understand data
  5. Model: Apply statistical models
  6. Communicate: Share results and insights

1.4 Traditional R vs Tidyverse Approach

Traditional R

# Traditional approach
library(gapminder)
asia <- gapminder[gapminder$continent == "Asia", ]
mean(asia$lifeExp)

Tidyverse Approach

# Tidyverse approach
library(gapminder)
library(dplyr)
gapminder %>% 
  filter(continent == "Asia") %>% 
  summarize(mean_exp = mean(lifeExp))

The tidyverse offers:

  • More readable code with the pipe operator (%>%)
  • Consistent grammar of data manipulation
  • Integration with modern visualization tools

2. Probability Review

2.1 Random Variables

A random variable is a real-valued function defined on a sample space. For each outcome \(s\) in the sample space \(S\), a random variable \(X\) assigns a real number \(X(s)\).

Types of Random Variables

  1. Discrete Random Variable: Can take a countable number of values
    • Example: Number of heads in 10 coin flips
    • Example: Number of defective items in a batch
  2. Continuous Random Variable: Can take any value within an interval
    • Example: Height of students
    • Example: Time to complete an exam

2.2 Probability Distributions

Discrete Probability Distribution

For a discrete random variable \(X\), the probability distribution is defined by the probability mass function (PMF):

\[P(X = x) = p(x)\]

Properties:

  1. \(p(x) \geq 0\) for all values of \(x\)
  2. \[\sum p(x) = 1\]

Continuous Probability Distribution

For a continuous random variable \(X\), the probability distribution is defined by the probability density function (PDF) \(f(x)\).

For an interval \((a, b]\): \(P(a < X \leq b) = \int_a^b f(x)dx\)

Properties:

  1. \(f(x) \geq 0\) for all \(x\)
  2. \[\int_{-\infty}^{\infty} f(x)dx = 1\]

2.3 Expected Value and Variance

Expected Value (Mean)

The expected value represents the long-run average value of a random variable.

For discrete random variables: \(E(X) = \mu = \sum x \cdot p(x)\)

For continuous random variables: \(E(X) = \mu = \int_{-\infty}^{\infty} x \cdot f(x)dx\)

Variance

Variance measures the spread of a distribution around its mean.

For discrete random variables: \(Var(X) = \sigma^2 = E[(X - \mu)^2] = \sum (x - \mu)^2 \cdot p(x)\)

For continuous random variables: \(Var(X) = \sigma^2 = E[(X - \mu)^2] = \int_{-\infty}^{\infty} (x - \mu)^2 \cdot f(x)dx\)

Standard Deviation

The standard deviation is the square root of the variance: \(\sigma = \sqrt{Var(X)}\)
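
The definitions above can be computed directly in R. A minimal sketch using a fair six-sided die as the discrete random variable (the die is an illustrative choice, not from the notes):

```r
# E(X) and Var(X) for a fair six-sided die: values 1..6, each with p = 1/6
x <- 1:6
p <- rep(1/6, 6)

mu <- sum(x * p)               # E(X) = 3.5
sigma2 <- sum((x - mu)^2 * p)  # Var(X) = 35/12, about 2.917
sigma <- sqrt(sigma2)          # standard deviation
```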

3. Discrete Probability Distributions

3.1 Binomial Distribution

The binomial distribution models the number of successes in a fixed number of independent trials.

Conditions for Binomial Distribution

  1. Fixed number of trials (\(n\))
  2. Two possible outcomes: Success or Failure
  3. Constant probability of success (\(p\)) for each trial
  4. Independent trials

Probability Mass Function

\[P(X = x) = \binom{n}{x}p^x(1-p)^{n-x}\]

where:

  • \(x\) = number of successes
  • \(n\) = number of trials
  • \(p\) = probability of success
  • \(\binom{n}{x} = \frac{n!}{x!(n-x)!}\) (binomial coefficient)

Mean and Variance

  • Mean: \(\mu = np\)
  • Variance: \(\sigma^2 = np(1-p)\)

Example

If 40% of a class is female, what is the probability that exactly 6 of the first 10 students walking in will be female?

\[P(X = 6) = \binom{10}{6}(0.4)^6(0.6)^4 = 210 \times 0.004096 \times 0.1296 = 0.1115\]
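
The worked example can be confirmed with R's built-in binomial PMF:

```r
# P(X = 6) for n = 10 trials with p = 0.4
dbinom(6, size = 10, prob = 0.4)   # about 0.1115

# Equivalent manual computation from the PMF
choose(10, 6) * 0.4^6 * 0.6^4
```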

3.2 Poisson Distribution

The Poisson distribution models the number of events occurring in a fixed interval of time or space.

Conditions for Poisson Distribution

  1. Events occur independently
  2. Average rate of occurrence (\(\lambda\)) is constant
  3. Two events cannot occur at exactly the same instant

Probability Mass Function

\[P(X = x) = \frac{e^{-\lambda}\lambda^x}{x!}\]

where:

  • \(x\) = number of events
  • \(\lambda\) = average rate of occurrence
  • \(e\) = Euler’s number (≈ 2.71828)

Mean and Variance

  • Mean: \(\mu = \lambda\)
  • Variance: \(\sigma^2 = \lambda\)

Example

If there are an average of 3 striped trout per 100 yards in a stream, what is the probability of seeing exactly 5 striped trout in the next 100 yards?

\[P(X = 5) = \frac{e^{-3} \times 3^5}{5!} = \frac{0.0498 \times 243}{120} = 0.1008\]
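
Again, R's built-in Poisson PMF confirms the hand calculation:

```r
# P(X = 5) for lambda = 3
dpois(5, lambda = 3)               # about 0.1008

# Equivalent manual computation from the PMF
exp(-3) * 3^5 / factorial(5)
```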

4. Continuous Probability Distributions

4.1 Normal Distribution

The normal distribution is the most important continuous distribution in statistics, often called the “bell curve.”

Probability Density Function

\[f(x) = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}\]

where:

  • \(\mu\) = mean
  • \(\sigma\) = standard deviation
  • \(\pi\) ≈ 3.14159
  • \(e\) ≈ 2.71828

Properties

  1. Symmetric about the mean
  2. Mean = Median = Mode
  3. Total area under the curve = 1
  4. Approximately 68% of data falls within ±1σ of μ
  5. Approximately 95% of data falls within ±2σ of μ
  6. Approximately 99.7% of data falls within ±3σ of μ
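
The 68-95-99.7 rule can be verified with the normal CDF in R:

```r
# Area under the standard normal curve within 1, 2, and 3 SDs of the mean
pnorm(1) - pnorm(-1)   # about 0.6827
pnorm(2) - pnorm(-2)   # about 0.9545
pnorm(3) - pnorm(-3)   # about 0.9973
```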

Notation

\(X \sim N(\mu, \sigma^2)\) means “\(X\) follows a normal distribution with mean \(\mu\) and variance \(\sigma^2\)”

4.2 Standard Normal Distribution

The standard normal distribution is a special case of the normal distribution with \(\mu = 0\) and \(\sigma = 1\).

Z-Score Transformation

Any normal random variable \(X \sim N(\mu, \sigma^2)\) can be standardized: \(Z = \frac{X - \mu}{\sigma}\)

where \(Z \sim N(0, 1)\)

Using Z-Tables

To find probabilities for normal distributions:

  1. Convert to z-score
  2. Use standard normal table or R functions

Example

If gestational length \(X \sim N(39, 2^2)\) weeks, what percentage of gestations are less than 40 weeks?

\[Z = \frac{40 - 39}{2} = 0.5\]

Using the standard normal table: \(P(Z \leq 0.5) = 0.6915\)

Therefore, approximately 69.15% of gestations are less than 40 weeks.

R Implementation

# Using pnorm for normal probabilities
pnorm(40, mean = 39, sd = 2)  # Direct calculation
pnorm(0.5)                    # Using z-score

Part 2: Sampling and Point Estimation

5. Sampling and Sampling Distributions

5.1 Introduction to Sampling

Sampling is the process of selecting a subset of individuals from a population to estimate characteristics of the whole population.

Why Sample?

  • Census limitations: Measuring entire populations is often impractical
  • Cost efficiency: Samples are less expensive to collect
  • Time constraints: Samples provide quicker results
  • Destructive testing: Some measurements destroy the item being tested

Types of Sampling

  1. Random Sampling: Every member of the population has an equal chance of selection
  2. Stratified Sampling: Population divided into strata, then random samples from each
  3. Cluster Sampling: Population divided into clusters, some clusters randomly selected
  4. Systematic Sampling: Select every kth element from a list

5.2 Sampling With and Without Replacement

Sampling With Replacement

  • Each selected item is returned to the population before the next draw
  • Same item can be selected multiple times
  • Observations are independent

Sampling Without Replacement

  • Selected items are not returned to the population
  • Same item cannot be selected twice
  • Observations are not independent

R Implementation

# Sampling with replacement
sample(1:10, size = 5, replace = TRUE)

# Sampling without replacement
sample(1:10, size = 5, replace = FALSE)

5.3 Sampling Distributions

A sampling distribution is the probability distribution of a statistic obtained from all possible samples of a specific size from a population.

Key Concepts

  1. Sample Statistic: A value calculated from sample data (e.g., \(\bar{x}\), \(\hat{p}\))
  2. Population Parameter: The true value in the population (e.g., \(\mu\), \(p\))
  3. Sampling Variability: Different samples yield different statistics

The Bowl Example

The classic bowl activity demonstrates sampling concepts:

  • Bowl contains red and white balls (unknown proportion)
  • Take samples of 50 balls using a “shovel”
  • Calculate proportion of red balls in each sample
  • Observe variation in sample proportions

library(moderndive)
library(tidyverse)

# Virtual bowl sampling
bowl %>% 
  rep_sample_n(size = 50, reps = 1000) %>%
  group_by(replicate) %>%
  summarize(prop_red = mean(color == "red"))

5.4 Central Limit Theorem

The Central Limit Theorem (CLT) is fundamental to statistical inference:

Statement of CLT

For a random sample of size \(n\) from a population with mean \(\mu\) and standard deviation \(\sigma\):

  1. The sampling distribution of \(\bar{X}\) has mean \(\mu_{\bar{X}} = \mu\)
  2. The sampling distribution of \(\bar{X}\) has standard deviation \(\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}\)
  3. As \(n\) increases, the sampling distribution of \(\bar{X}\) approaches a normal distribution

Mathematical Expression

\(\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)\) for large \(n\)

Implications

  1. Sample means are normally distributed for large samples
  2. Larger samples produce more precise estimates
  3. We can make probability statements about sample means
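
A short simulation makes the CLT concrete (a sketch; the exponential population and sample sizes are arbitrary choices for illustration). Even though the population is strongly skewed, the distribution of sample means is approximately normal:

```r
# Sampling distribution of the mean for a skewed population, Exponential(1),
# which has mu = 1 and sigma = 1
set.seed(42)
n <- 50
reps <- 10000
sample_means <- replicate(reps, mean(rexp(n, rate = 1)))

mean(sample_means)   # close to mu = 1
sd(sample_means)     # close to sigma / sqrt(n) = 1 / sqrt(50), about 0.141
hist(sample_means, breaks = 50, main = "Sampling distribution of the mean")
```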

5.5 Standard Error

The standard error (SE) measures the variability of a sample statistic.

Standard Error of the Mean

\[SE(\bar{X}) = \frac{\sigma}{\sqrt{n}}\]

If \(\sigma\) is unknown, we estimate it with \(s\): \(SE(\bar{X}) = \frac{s}{\sqrt{n}}\)

Standard Error of a Proportion

\[SE(\hat{p}) = \sqrt{\frac{p(1-p)}{n}}\]

If \(p\) is unknown, we estimate it with \(\hat{p}\): \(SE(\hat{p}) = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\)

Effect of Sample Size

As \(n\) increases:

  • Standard error decreases
  • Estimates become more precise
  • Sampling distribution becomes narrower
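
The \(1/\sqrt{n}\) relationship is easy to see numerically (sigma = 10 here is an arbitrary illustration):

```r
# SE of the mean shrinks like 1 / sqrt(n): quadrupling n halves the SE
sigma <- 10
n <- c(25, 100, 400)
sigma / sqrt(n)   # 2.0, 1.0, 0.5
```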

6. Point Estimation

6.1 Introduction to Point Estimation

A point estimate is a single value used to estimate a population parameter.

Common Point Estimates

  • Sample mean (\(\bar{x}\)) estimates population mean (\(\mu\))
  • Sample proportion (\(\hat{p}\)) estimates population proportion (\(p\))
  • Sample variance (\(s^2\)) estimates population variance (\(\sigma^2\))

Properties of Good Estimators

  1. Unbiased: \(E(\hat{\theta}) = \theta\)
  2. Consistent: \(\hat{\theta}\) converges to \(\theta\) as \(n \to \infty\)
  3. Efficient: Minimum variance among unbiased estimators

6.2 Method of Moments

The method of moments equates sample moments to population moments to derive estimators.

First Moment (Mean)

Population: \(E(X) = \mu\) Sample: \(\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i\)

Second Moment (Variance)

Population: \(E(X^2) = \sigma^2 + \mu^2\) Sample: \(\frac{1}{n}\sum_{i=1}^n X_i^2\)

Example: Poisson Distribution

For \(X \sim \text{Poisson}(\lambda)\):

  • \[E(X) = \lambda\]
  • Method of moments estimator: \(\hat{\lambda} = \bar{X}\)

6.3 Maximum Likelihood Estimation (MLE)

Maximum likelihood estimation finds parameter values that maximize the probability of observing the sample data.

Likelihood Function

For independent observations \(x_1, x_2, ..., x_n\): \(L(\theta) = \prod_{i=1}^n f(x_i \mid \theta)\)

Log-Likelihood Function

\[\ell(\theta) = \log L(\theta) = \sum_{i=1}^n \log f(x_i \mid \theta)\]

Finding MLE

  1. Write the likelihood function
  2. Take the logarithm
  3. Differentiate with respect to \(\theta\)
  4. Set derivative equal to zero
  5. Solve for \(\theta\)

Example: Normal Distribution

For \(X_1, ..., X_n \sim N(\mu, \sigma^2)\):

Log-likelihood: \(\ell(\mu, \sigma^2) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n(x_i - \mu)^2\)

MLEs:

  • \[\hat{\mu}_{MLE} = \bar{x}\]
  • \[\hat{\sigma}^2_{MLE} = \frac{1}{n}\sum_{i=1}^n(x_i - \bar{x})^2\]
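
These closed forms can be cross-checked against a numerical optimizer. A sketch with simulated data (the true parameters here are arbitrary):

```r
# Cross-check the closed-form normal MLEs against a numerical optimizer
set.seed(1)
x <- rnorm(200, mean = 5, sd = 2)

mu_hat <- mean(x)
sigma2_hat <- mean((x - mean(x))^2)   # note: divides by n, not n - 1

# Minimize the negative log-likelihood; par[2] = log(sigma), so the
# optimizer never tries a negative standard deviation
negloglik <- function(par) {
  -sum(dnorm(x, mean = par[1], sd = exp(par[2]), log = TRUE))
}
fit <- optim(c(0, 0), negloglik)
c(fit$par[1], exp(fit$par[2])^2)   # should be close to c(mu_hat, sigma2_hat)
```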

6.4 Examples of Point Estimation

Example 1: Binomial Distribution

For \(Y \sim \text{Binomial}(n, p)\):

  • MLE: \(\hat{p} = \frac{Y}{n}\)

Example 2: Exponential Distribution

For \(X_1, ..., X_n \sim \text{Exponential}(\lambda)\):

  • MLE: \(\hat{\lambda} = \frac{1}{\bar{x}}\)

Example 3: Poisson Distribution

For \(X_1, ..., X_n \sim \text{Poisson}(\mu)\):

  • MLE: \(\hat{\mu} = \bar{x}\)

R Implementation

# Sample data
set.seed(123)
data <- rpois(100, lambda = 5)

# Method of moments estimate
mom_estimate <- mean(data)

# Maximum likelihood estimate (same for Poisson)
mle_estimate <- mean(data)

print(paste("Estimate:", mle_estimate))

6.5 Sampling Distribution of Estimators

The sampling distribution of an estimator shows how the estimator varies across all possible samples.

For Sample Mean

If \(X_1, ..., X_n \sim N(\mu, \sigma^2)\): \(\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)\)

For Sample Proportion

If \(n\) is large: \(\hat{p} \sim N\left(p, \frac{p(1-p)}{n}\right)\)

Bootstrap Distribution

When theoretical distributions are unknown, we can use bootstrap methods:

library(infer)

# Generate bootstrap distribution
boot_dist <- data %>%
  specify(response = variable) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "mean")

# Visualize
boot_dist %>% visualize()

6.6 Properties of Estimators

Bias

The bias of an estimator \(\hat{\theta}\) is: \(\text{Bias}(\hat{\theta}) = E(\hat{\theta}) - \theta\)

An estimator is unbiased if \(\text{Bias}(\hat{\theta}) = 0\).

Mean Squared Error (MSE)

\[\text{MSE}(\hat{\theta}) = E[(\hat{\theta} - \theta)^2] = \text{Var}(\hat{\theta}) + [\text{Bias}(\hat{\theta})]^2\]

Consistency

An estimator \(\hat{\theta}_n\) is consistent if: \(\lim_{n \to \infty} P(|\hat{\theta}_n - \theta| > \epsilon) = 0\) for any \(\epsilon > 0\)

Efficiency

Among unbiased estimators, the one with minimum variance is called efficient.

The relative efficiency of two estimators is: \(\text{RE}(\hat{\theta}_1, \hat{\theta}_2) = \frac{\text{Var}(\hat{\theta}_2)}{\text{Var}(\hat{\theta}_1)}\)

Part 3: Confidence Intervals - Theory

7. Introduction to Confidence Intervals

7.1 What is a Confidence Interval?

A confidence interval (CI) is a range of values that likely contains the true population parameter with a specified level of confidence.

Key Components

  1. Point estimate: Best single estimate of the parameter
  2. Margin of error: How far the interval extends from the point estimate
  3. Confidence level: Probability that the method produces an interval containing the true parameter

General Form

\[\text{CI} = \text{Point Estimate} \pm \text{Margin of Error}\]

Interpretation

A 95% confidence interval means that if we repeated the sampling process many times and calculated a CI each time, approximately 95% of these intervals would contain the true population parameter.
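
This "95% of intervals" interpretation can be demonstrated by simulation (a sketch using an arbitrary known population, so coverage can be counted):

```r
# Coverage check: how often does a 95% t-interval capture the true mean?
set.seed(2024)
mu <- 50; sigma <- 10; n <- 30; reps <- 5000

covered <- replicate(reps, {
  x <- rnorm(n, mu, sigma)
  ci <- t.test(x, conf.level = 0.95)$conf.int
  ci[1] <= mu && mu <= ci[2]
})
mean(covered)   # close to 0.95
```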

7.2 The Traditional (Standard Error) Method

The traditional method for constructing confidence intervals relies on:

  1. The sampling distribution of the statistic
  2. The standard error
  3. A critical value from the appropriate distribution

General Formula

\[\text{CI} = \hat{\theta} \pm z_{\alpha/2} \times SE(\hat{\theta})\]

where:

  • \(\hat{\theta}\) = point estimate
  • \(z_{\alpha/2}\) = critical value
  • \(SE(\hat{\theta})\) = standard error
  • \(\alpha\) = significance level (1 - confidence level)

Common Critical Values

| Confidence Level | \(\alpha\) | \(z_{\alpha/2}\) |
| ---------------- | ---------- | ---------------- |
| 90%              | 0.10       | 1.645            |
| 95%              | 0.05       | 1.960            |
| 99%              | 0.01       | 2.576            |

7.3 Confidence Interval for Population Mean

Case 1: Known Population Standard Deviation (\(\sigma\))

When \(\sigma\) is known and either:

  • The population is normally distributed, or
  • The sample size is large (\(n \geq 30\))

\[\text{CI} = \bar{x} \pm z_{\alpha/2} \times \frac{\sigma}{\sqrt{n}}\]

Case 2: Unknown Population Standard Deviation

When \(\sigma\) is unknown, we use the \(t\)-distribution:

\[\text{CI} = \bar{x} \pm t_{\alpha/2, n-1} \times \frac{s}{\sqrt{n}}\]

where:

  • \(t_{\alpha/2, n-1}\) = critical value from \(t\)-distribution with \(n-1\) degrees of freedom
  • \(s\) = sample standard deviation

Conditions for Using the t-Distribution

  1. Random sample
  2. Population is approximately normal, or \(n \geq 30\)
  3. Independence of observations

Example

Customer support response times: \(\bar{x} = 57\) minutes, \(s = 15\) minutes, \(n = 200\)

For a 95% CI: \(SE = \frac{15}{\sqrt{200}} = 1.06\) \(\text{Margin of Error} = 1.96 \times 1.06 = 2.08\) \(\text{CI} = 57 \pm 2.08 = [54.92, 59.08]\)

We are 95% confident that the true mean response time is between 54.92 and 59.08 minutes.

R Implementation

# Using t.test function
data <- c(12, 16, 14, 18, 20, 10, 15, 14, 16, 13)
t.test(data, conf.level = 0.95)$conf.int

# Manual calculation
x_bar <- mean(data)
s <- sd(data)
n <- length(data)
se <- s / sqrt(n)
t_critical <- qt(0.975, df = n - 1)
margin_error <- t_critical * se
ci_lower <- x_bar - margin_error
ci_upper <- x_bar + margin_error

7.4 Confidence Interval for Population Proportion

For a population proportion \(p\), we use the sample proportion \(\hat{p}\):

\[\text{CI} = \hat{p} \pm z_{\alpha/2} \times \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\]

Conditions

  1. Random sample
  2. Independence
  3. Success-failure condition: \(n\hat{p} \geq 10\) and \(n(1-\hat{p}) \geq 10\)

Example

Customer satisfaction survey: 200 out of 250 customers are satisfied

\(\hat{p} = \frac{200}{250} = 0.80\) \(SE = \sqrt{\frac{0.80 \times 0.20}{250}} = 0.0253\) \(\text{Margin of Error} = 1.96 \times 0.0253 = 0.0496\) \(\text{CI} = 0.80 \pm 0.0496 = [0.750, 0.850]\)

We are 95% confident that between 75% and 85% of all customers are satisfied.

R Implementation

# Using prop.test
prop.test(x = 200, n = 250, conf.level = 0.95)$conf.int

# Manual calculation
p_hat <- 200/250
se <- sqrt(p_hat * (1 - p_hat) / 250)
margin_error <- 1.96 * se
ci_lower <- p_hat - margin_error
ci_upper <- p_hat + margin_error

7.5 Margin of Error and Width of Confidence Intervals

Margin of Error Components

\[\text{Margin of Error} = \text{Critical Value} \times \text{Standard Error}\]

Factors Affecting CI Width

  1. Sample size: Larger \(n\) → narrower CI
  2. Confidence level: Higher confidence → wider CI
  3. Population variability: Larger \(\sigma\) → wider CI

Relationship with Sample Size

For means: \(\text{Width} \propto \frac{1}{\sqrt{n}}\)

Quadrupling the sample size cuts the CI width in half.

7.6 Sample Size Determination

For Population Mean

To achieve a margin of error \(E\) with confidence level \((1-\alpha)\):

\[n = \left(\frac{z_{\alpha/2} \times \sigma}{E}\right)^2\]

For Population Proportion

\[n = \left(\frac{z_{\alpha/2}}{E}\right)^2 \times p(1-p)\]

If \(p\) is unknown, use \(p = 0.5\) for the most conservative estimate.

Example

To estimate proportion of students who prefer games within 4% margin of error with 95% confidence:

\[n = \left(\frac{1.96}{0.04}\right)^2 \times 0.5 \times 0.5 = 600.25 \approx 601\]
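
The same calculation in R (using the exact critical value from `qnorm`):

```r
# Sample size for a proportion: 4% margin of error, 95% confidence,
# conservative p = 0.5
z <- qnorm(0.975)            # about 1.96
E <- 0.04
n <- (z / E)^2 * 0.5 * 0.5
ceiling(n)                   # 601
```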

7.7 Interpreting Confidence Intervals

Correct Interpretations

  1. “We are 95% confident that the true population mean lies between [lower, upper]”
  2. “If we repeated this procedure many times, 95% of the resulting intervals would contain the true parameter”
  3. “The method used produces intervals that contain the true parameter 95% of the time”

Incorrect Interpretations

  1. ❌ “There is a 95% probability that the true mean is in this interval”
  2. ❌ “95% of the data falls within this interval”
  3. ❌ “95% of sample means fall within this interval”

Key Points

  • The parameter is fixed; the interval is random
  • Confidence is in the method, not in a specific interval
  • Once calculated, the interval either contains the parameter or it doesn’t

7.8 Finite Population Correction

When sampling from a finite population without replacement, and the sample size is more than 5% of the population:

\[\text{FPC} = \sqrt{\frac{N-n}{N-1}}\]

The corrected standard error becomes: \(SE_{corrected} = SE \times \text{FPC}\)

Example

Population of 1,200 students, sample of 400: \(\text{FPC} = \sqrt{\frac{1200-400}{1200-1}} = 0.816\)

The standard error is reduced by approximately 18.3%.
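
The correction factor in R:

```r
# Finite population correction for N = 1200, n = 400
N <- 1200; n <- 400
fpc <- sqrt((N - n) / (N - 1))
fpc        # about 0.817
1 - fpc    # SE reduced by about 18.3%
```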

7.9 Confidence Intervals in R

Using Base R

# For means
t.test(data, conf.level = 0.95)$conf.int

# For proportions
prop.test(x = successes, n = total, conf.level = 0.95)$conf.int

Using tidyverse and infer

library(infer)

# For means
data %>%
  specify(response = variable) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "mean") %>%
  get_confidence_interval(level = 0.95, type = "percentile")

# For proportions
data %>%
  specify(response = variable, success = "yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_confidence_interval(level = 0.95, type = "percentile")

7.10 Visual Representation

Confidence intervals can be visualized using:

library(ggplot2)

# Create data frame for plotting
ci_data <- data.frame(
  estimate = c(x_bar),
  lower = c(ci_lower),
  upper = c(ci_upper)
)

# Plot with error bars
ggplot(ci_data, aes(x = "Sample", y = estimate)) +
  geom_point(size = 3) +
  geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.2) +
  labs(y = "Estimate", title = "95% Confidence Interval")

Part 4: Confidence Intervals - Advanced & Bootstrap Methods

8. Two-Sample Confidence Intervals

8.1 Confidence Interval for Difference in Means

Independent Samples

When comparing means from two independent populations:

\[\text{CI} = (\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2, df} \times SE(\bar{x}_1 - \bar{x}_2)\]

where: \(SE(\bar{x}_1 - \bar{x}_2) = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}\)

Degrees of Freedom (Welch’s approximation)

\[df = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{(s_1^2/n_1)^2}{n_1-1} + \frac{(s_2^2/n_2)^2}{n_2-1}}\]

Example

Comparing incomes between Cleveland and Sacramento:

  • Cleveland: \(\bar{x}_1 = 45,000\), \(s_1 = 15,000\), \(n_1 = 212\)
  • Sacramento: \(\bar{x}_2 = 48,000\), \(s_2 = 18,000\), \(n_2 = 175\)

# Using t.test
t.test(cleveland_income, sacramento_income, conf.level = 0.95)$conf.int

# Manual calculation
x1_bar <- 45000; s1 <- 15000; n1 <- 212
x2_bar <- 48000; s2 <- 18000; n2 <- 175
se_diff <- sqrt(s1^2/n1 + s2^2/n2)
df <- ((s1^2/n1 + s2^2/n2)^2) / ((s1^2/n1)^2/(n1-1) + (s2^2/n2)^2/(n2-1))
t_critical <- qt(0.975, df)
margin_error <- t_critical * se_diff
ci_lower <- (x1_bar - x2_bar) - margin_error
ci_upper <- (x1_bar - x2_bar) + margin_error

Paired Samples

For paired data (e.g., before-after measurements):

  1. Calculate differences: \(d_i = x_{1i} - x_{2i}\)
  2. Find mean difference: \(\bar{d}\)
  3. Find standard deviation of differences: \(s_d\)

\[\text{CI} = \bar{d} \pm t_{\alpha/2, n-1} \times \frac{s_d}{\sqrt{n}}\]

Example: Weight Loss Program

Before-after weights for 10 participants:

before <- c(65, 60, 63, 58, 70, 72, 80, 81, 83, 69)
after <- c(55, 52, 60, 55, 60, 66, 70, 66, 63, 60)
differences <- after - before

# Using t.test for paired data
t.test(after, before, paired = TRUE, conf.level = 0.95)$conf.int

# Manual calculation
d_bar <- mean(differences)
s_d <- sd(differences)
n <- length(differences)
se_d <- s_d / sqrt(n)
t_critical <- qt(0.975, df = n - 1)
margin_error <- t_critical * se_d
ci_lower <- d_bar - margin_error
ci_upper <- d_bar + margin_error

8.2 Confidence Interval for Difference in Proportions

For two independent proportions:

\[\text{CI} = (\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2} \times SE(\hat{p}_1 - \hat{p}_2)\]

where: \(SE(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}\)

Example

Comparing satisfaction rates between two products:

  • Product A: 180 satisfied out of 200 (\(\hat{p}_1 = 0.90\))
  • Product B: 150 satisfied out of 200 (\(\hat{p}_2 = 0.75\))

# Using prop.test
prop.test(x = c(180, 150), n = c(200, 200), conf.level = 0.95)$conf.int

# Manual calculation
p1_hat <- 180/200; n1 <- 200
p2_hat <- 150/200; n2 <- 200
se_diff <- sqrt(p1_hat*(1-p1_hat)/n1 + p2_hat*(1-p2_hat)/n2)
z_critical <- qnorm(0.975)
margin_error <- z_critical * se_diff
ci_lower <- (p1_hat - p2_hat) - margin_error
ci_upper <- (p1_hat - p2_hat) + margin_error

9. Bootstrap Methods for Confidence Intervals

9.1 Introduction to Bootstrap

The bootstrap is a resampling method that estimates the sampling distribution by repeatedly sampling with replacement from the original data.

Key Concepts

  1. Resampling: Drawing samples from the original data
  2. With replacement: Same observation can be selected multiple times
  3. Bootstrap distribution: Distribution of statistics from resampled data

Advantages

  • No distributional assumptions required
  • Works for complex statistics
  • Provides empirical sampling distribution

9.2 Bootstrap Procedure

  1. Start with original sample of size \(n\)
  2. Draw a bootstrap sample of size \(n\) with replacement
  3. Calculate statistic on bootstrap sample
  4. Repeat steps 2-3 many times (typically 1000+)
  5. Use bootstrap distribution to construct CI

R Implementation

library(infer)

# Bootstrap distribution for mean
boot_dist <- data %>%
  specify(response = variable) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "mean")

# Visualize bootstrap distribution
boot_dist %>% visualize()

9.3 Bootstrap Confidence Intervals

Percentile Method

Uses percentiles of the bootstrap distribution:

For a 95% CI, use the 2.5th and 97.5th percentiles:

# Percentile CI
percentile_ci <- boot_dist %>%
  get_confidence_interval(level = 0.95, type = "percentile")

Standard Error Method

Uses bootstrap standard error with normal approximation:

\[\text{CI} = \hat{\theta} \pm z_{\alpha/2} \times SE_{boot}\]

where \(SE_{boot}\) is the standard deviation of the bootstrap distribution.

# Standard error CI
se_ci <- boot_dist %>%
  get_confidence_interval(level = 0.95, type = "se", 
                         point_estimate = mean(original_data))

9.4 Bootstrap for Different Statistics

For Proportions

boot_prop <- data %>%
  specify(response = variable, success = "yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop")

For Difference in Means

boot_diff_means <- data %>%
  specify(response ~ group) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "diff in means", order = c("A", "B"))

For Correlation

boot_cor <- data %>%
  specify(x ~ y) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "correlation")

9.5 Visualizing Bootstrap Results

# Bootstrap distribution with CI
boot_dist %>%
  visualize() +
  shade_confidence_interval(endpoints = percentile_ci)

# Comparison of methods
library(gridExtra)
p1 <- boot_dist %>%
  visualize() +
  shade_ci(endpoints = percentile_ci, color = "blue", fill = "lightblue") +
  ggtitle("Percentile Method")

p2 <- boot_dist %>%
  visualize() +
  shade_ci(endpoints = se_ci, color = "red", fill = "pink") +
  ggtitle("Standard Error Method")

grid.arrange(p1, p2, ncol = 2)

10. Sample Size Determination

10.1 Sample Size for Estimating a Mean

To achieve margin of error \(E\) with confidence level \((1-\alpha)\):

\[n = \left(\frac{z_{\alpha/2} \times \sigma}{E}\right)^2\]

When \(\sigma\) is Unknown

  1. Use estimate from pilot study
  2. Use range/4 as rough estimate
  3. Use conservative estimate

Example

To estimate mean income within $500 with 95% confidence, assuming \(\sigma = 4,500\):

\[n = \left(\frac{1.96 \times 4500}{500}\right)^2 = 311.17 \approx 312\]

10.2 Sample Size for Estimating a Proportion

\[n = \left(\frac{z_{\alpha/2}}{E}\right)^2 \times p(1-p)\]

Conservative Approach

When \(p\) is unknown, use \(p = 0.5\):

\[n = \left(\frac{z_{\alpha/2}}{E}\right)^2 \times 0.25\]

Example

To estimate proportion within 3% with 95% confidence:

\[n = \left(\frac{1.96}{0.03}\right)^2 \times 0.25 = 1067.11 \approx 1068\]

10.3 Sample Size for Comparing Two Means

\[n = \frac{2\sigma^2(z_{\alpha/2} + z_\beta)^2}{(\mu_1 - \mu_2)^2}\]

where:

  • \(\beta\) = probability of Type II error
  • \(\mu_1 - \mu_2\) = minimum detectable difference
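
The formula can be evaluated directly, and base R's `power.t.test` gives a comparable (slightly larger, since it uses the t rather than z distribution) answer. The numbers below are illustrative assumptions, not from the notes:

```r
# Per-group sample size to detect a difference of 5 when sigma = 10,
# with two-sided alpha = 0.05 and 80% power
formula_n <- 2 * 10^2 * (qnorm(0.975) + qnorm(0.80))^2 / 5^2
formula_n   # about 62.8

# Exact t-test calculation from base R's stats package
power.t.test(delta = 5, sd = 10, sig.level = 0.05, power = 0.80)$n
```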

10.4 Effect of Confidence Level and Margin of Error

Relationship with sample size:

  • Halving margin of error → quadruples sample size
  • Increasing confidence level → increases sample size

Trade-offs:

  • Precision vs. cost
  • Confidence vs. feasibility

10.5 Finite Population Correction

For finite populations:

\[n_{adjusted} = \frac{n}{1 + \frac{n-1}{N}}\]

where \(N\) is population size.

10.6 R Implementation for Sample Size

# For proportions
library(pwr)
pwr.p.test(h = ES.h(p1 = 0.5, p2 = 0.47), 
           sig.level = 0.05, 
           power = 0.80)

# Custom function for mean
sample_size_mean <- function(sigma, margin_error, conf_level = 0.95) {
  z <- qnorm((1 + conf_level) / 2)
  n <- ceiling((z * sigma / margin_error)^2)
  return(n)
}

# Custom function for proportion
sample_size_prop <- function(p, margin_error, conf_level = 0.95) {
  z <- qnorm((1 + conf_level) / 2)
  if (is.null(p)) p <- 0.5  # Conservative estimate
  n <- ceiling((z^2 * p * (1-p)) / margin_error^2)
  return(n)
}

10.7 Practical Considerations

  1. Pilot studies: Small preliminary studies to estimate parameters
  2. Resource constraints: Budget, time, personnel
  3. Non-response: Plan for 10-30% non-response rate
  4. Stratification: May require different sample sizes per stratum
  5. Cluster sampling: Requires larger samples due to design effect

Part 5: Hypothesis Testing Framework

11. Introduction to Hypothesis Testing

11.1 What is Hypothesis Testing?

Hypothesis testing is a formal procedure for using sample data to evaluate claims about population parameters. It provides a framework for making decisions under uncertainty.

Key Components

  1. Null hypothesis (H₀): The claim being tested (usually status quo)
  2. Alternative hypothesis (H₁ or Hₐ): The competing claim
  3. Test statistic: A value calculated from sample data
  4. p-value: Probability of observing data as extreme as ours, assuming H₀ is true
  5. Decision rule: Criteria for rejecting or failing to reject H₀

The Criminal Trial Analogy

| Criminal Trial               | Hypothesis Test                |
| ---------------------------- | ------------------------------ |
| Defendant presumed innocent  | H₀ presumed true               |
| Prosecution must prove guilt | Must show evidence against H₀  |
| “Beyond reasonable doubt”    | Significance level α           |
| Guilty verdict               | Reject H₀                      |
| Not guilty verdict           | Fail to reject H₀              |

11.2 Null and Alternative Hypotheses

Null Hypothesis (H₀)

  • Statement of “no effect” or “no difference”
  • Represents the status quo
  • Must contain equality (=, ≤, or ≥)
  • What we assume true until proven otherwise

Alternative Hypothesis (H₁ or Hₐ)

  • Statement we’re trying to find evidence for
  • Represents the research claim
  • Determines if test is one-tailed or two-tailed

Types of Tests

  1. Two-tailed test: H₁: μ ≠ μ₀
    • Tests for difference in either direction
    • Example: “The mean is different from 50”
  2. Right-tailed test: H₁: μ > μ₀
    • Tests if parameter is greater
    • Example: “The mean is greater than 50”
  3. Left-tailed test: H₁: μ < μ₀
    • Tests if parameter is less
    • Example: “The mean is less than 50”

Examples

Claim: Average response time is 5 minutes
H₀: μ = 5
H₁: μ ≠ 5 (two-tailed)

Claim: New drug lowers blood pressure
H₀: μ₁ ≥ μ₂ (new drug mean ≥ placebo mean)
H₁: μ₁ < μ₂ (new drug mean < placebo mean)

Claim: More than 60% support the proposal
H₀: p ≤ 0.60
H₁: p > 0.60

11.3 Type I and Type II Errors

Error Types

| Reality | Decision: Reject H₀ | Decision: Fail to Reject H₀ |
| -------- | ------------------- | --------------------------- |
| H₀ True | Type I Error (α) | Correct Decision |
| H₀ False | Correct Decision | Type II Error (β) |

Type I Error (False Positive)

  • Rejecting H₀ when it’s actually true
  • Probability = α (significance level)
  • Example: Convicting an innocent person

Type II Error (False Negative)

  • Failing to reject H₀ when it’s actually false
  • Probability = β
  • Example: Acquitting a guilty person

Power of a Test

  • Power = 1 - β
  • Probability of correctly rejecting false H₀
  • Affected by:
    • Sample size (larger n → higher power)
    • Effect size (larger effect → higher power)
    • Significance level α (larger α → higher power)
    • Population variability (smaller σ → higher power)
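
These relationships can be checked numerically with base R’s power.t.test() (no extra packages needed); the effect size and standard deviation below are illustrative:

```r
# Power of a one-sample t-test as sample size grows
# (true effect delta = 2, sd = 10, alpha = 0.05)
powers <- sapply(c(25, 50, 100, 200), function(n) {
  power.t.test(n = n, delta = 2, sd = 10, sig.level = 0.05,
               type = "one.sample")$power
})
powers  # power increases monotonically with n
```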

11.4 Significance Level and p-values

Significance Level (α)

  • Threshold for rejecting H₀
  • Pre-specified before testing
  • Common values: 0.05, 0.01, 0.10
  • Represents acceptable Type I error rate

p-value

  • Probability of observing data as extreme as ours (or more extreme), assuming H₀ is true
  • Smaller p-value = stronger evidence against H₀
  • Compare to α to make decision

Decision Rule

  • If p-value ≤ α: Reject H₀
  • If p-value > α: Fail to reject H₀

Interpreting p-values

  • p < 0.01: Very strong evidence against H₀
  • 0.01 ≤ p < 0.05: Strong evidence against H₀
  • 0.05 ≤ p < 0.10: Weak evidence against H₀
  • p ≥ 0.10: Little or no evidence against H₀
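
In R, the p-value for each alternative follows directly from tail areas of the reference distribution. A sketch using the standard normal with an illustrative z-statistic:

```r
# p-values for the three alternatives, given an observed z-statistic
z_obs <- 1.8

p_two_tailed <- 2 * pnorm(-abs(z_obs))           # H1: mu != mu0
p_right      <- pnorm(z_obs, lower.tail = FALSE) # H1: mu > mu0
p_left       <- pnorm(z_obs)                     # H1: mu < mu0
```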

11.5 Test Statistics

General Form

\[\text{Test Statistic} = \frac{\text{Point Estimate - Hypothesized Value}}{\text{Standard Error}}\]

Common Test Statistics

  1. Z-statistic (known σ or large samples): \(Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}\)
  2. t-statistic (unknown σ): \(t = \frac{\bar{X} - \mu_0}{s/\sqrt{n}}\)
  3. Z-statistic for proportions: \(Z = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}}\)

Critical Values

Values that separate rejection and non-rejection regions:

  • Two-tailed: ±z_{α/2} or ±t_{α/2,df}
  • Right-tailed: z_α or t_{α,df}
  • Left-tailed: -z_α or -t_{α,df}
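
In R, critical values come from the quantile functions qnorm() and qt(); the α and df below are illustrative:

```r
alpha <- 0.05
df <- 24

qnorm(1 - alpha/2)   # two-tailed z critical value (about 1.96)
qnorm(1 - alpha)     # one-tailed z critical value (about 1.645)
qt(1 - alpha/2, df)  # two-tailed t critical value for df = 24
qt(1 - alpha, df)    # one-tailed t critical value for df = 24
```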

11.6 Steps in Hypothesis Testing

  1. State the hypotheses
    • Identify H₀ and H₁
    • Determine if one-tailed or two-tailed
  2. Set significance level
    • Choose α (typically 0.05)
  3. Check conditions
    • Random sampling
    • Independence
    • Normality (if required)
  4. Calculate test statistic
    • Use appropriate formula
  5. Find p-value
    • Use distribution tables or software
  6. Make decision
    • Compare p-value to α
  7. State conclusion
    • In context of the problem

Example: Testing Mean Response Time

A company claims average response time is 5 minutes. Sample of 100 calls shows \(\bar{x} = 5.5\) minutes, \(s = 2\) minutes.

# 1. Hypotheses
# H₀: μ = 5
# H₁: μ ≠ 5

# 2. Significance level
alpha <- 0.05

# 3. Check conditions (assumed met)

# 4. Calculate test statistic
x_bar <- 5.5
mu_0 <- 5
s <- 2
n <- 100
t_stat <- (x_bar - mu_0) / (s / sqrt(n))

# 5. Find p-value
p_value <- 2 * pt(abs(t_stat), df = n - 1, lower.tail = FALSE)

# 6. Decision
if (p_value <= alpha) {
  decision <- "Reject H₀"
} else {
  decision <- "Fail to reject H₀"
}

# 7. Conclusion
cat("t-statistic:", t_stat, "\n")
cat("p-value:", p_value, "\n")
cat("Decision:", decision, "\n")

11.7 Relationship Between Confidence Intervals and Hypothesis Tests

For two-tailed tests at significance level α:

  • H₀ is rejected if and only if the (1-α)100% CI doesn’t contain the hypothesized value
  • The CI contains all values that would not be rejected by the hypothesis test

Example

95% CI for μ: [52, 58]

  • H₀: μ = 50 → Reject (50 not in CI)
  • H₀: μ = 55 → Fail to reject (55 in CI)
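
This duality can be checked directly with t.test(), which reports both the CI and the p-value; the data vector below is illustrative:

```r
x <- c(52, 54, 55, 56, 57, 53, 58, 55, 54, 56)  # illustrative sample

t.test(x)$conf.int          # 95% CI for the mean
t.test(x, mu = 55)$p.value  # value inside the CI  -> p > 0.05
t.test(x, mu = 50)$p.value  # value outside the CI -> p < 0.05
```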

11.8 Statistical vs. Practical Significance

Statistical Significance

  • p-value < α
  • Evidence against H₀
  • Affected by sample size

Practical Significance

  • Size of the effect
  • Real-world importance
  • Context-dependent

Key Point

Large samples can make tiny differences statistically significant, but they may not be practically important.
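
A quick numerical illustration (hypothetical numbers): with a million observations, a 0.1-unit shift against a standard deviation of 15 is highly “significant” yet negligible as an effect:

```r
x_bar <- 100.1; mu_0 <- 100; s <- 15; n <- 1e6

z <- (x_bar - mu_0) / (s / sqrt(n))
p_value <- 2 * pnorm(-abs(z))
d <- abs(x_bar - mu_0) / s       # Cohen's d

c(z = z, p = p_value, d = d)     # p is tiny, but d is only about 0.007
```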

11.9 Common Misconceptions

What p-values are NOT

  1. ❌ Probability that H₀ is true
  2. ❌ Probability that H₁ is true
  3. ❌ Probability of making a Type I error (that’s α)
  4. ❌ Measure of effect size

What “fail to reject H₀” does NOT mean

  1. ❌ H₀ is proven true
  2. ❌ H₀ is accepted
  3. ❌ There is no effect

11.10 Using R for Hypothesis Testing

Traditional Methods

# One-sample t-test
t.test(data, mu = hypothesized_value, alternative = "two.sided")

# Two-sample t-test
t.test(group1, group2, alternative = "two.sided")

# Paired t-test
t.test(x, y, paired = TRUE, alternative = "two.sided")

# Proportion test
prop.test(x = successes, n = total, p = hypothesized_prop,
          alternative = "two.sided")

Modern Approach with infer

library(infer)

# Generate null distribution
null_dist <- data %>%
  specify(response = variable) %>%
  hypothesize(null = "point", mu = hypothesized_value) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "mean")

# Calculate p-value
p_value <- null_dist %>%
  get_p_value(obs_stat = observed_stat, direction = "two-sided")

# Visualize
null_dist %>%
  visualize() +
  shade_p_value(obs_stat = observed_stat, direction = "two-sided")

11.11 Reporting Results

A complete hypothesis test report should include:

  1. Statement of hypotheses
  2. Significance level used
  3. Test statistic value
  4. p-value
  5. Decision (reject/fail to reject)
  6. Conclusion in context

Example Report

“We tested whether the mean response time differs from 5 minutes (H₀: μ = 5 vs. H₁: μ ≠ 5) at the α = 0.05 significance level. Based on a sample of 100 calls with mean 5.5 minutes and standard deviation 2 minutes, the test statistic was t = 2.5 with p-value = 0.014. Since p < 0.05, we reject H₀ and conclude there is sufficient evidence that the mean response time differs from 5 minutes.”

Part 6: One-Sample Tests

12. One-Sample Tests

12.1 One-Sample Z-Test

The z-test is used when:

  • Population standard deviation (σ) is known
  • Sample size is large (n ≥ 30)
  • Population is normally distributed (or large sample)

Test Statistic

\[Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}\]

Decision Rule

  • Two-tailed: Reject H₀ if |Z| > z_{α/2}
  • Right-tailed: Reject H₀ if Z > z_α
  • Left-tailed: Reject H₀ if Z < -z_α

Example: Rice Consumption

Testing if average rice consumption differs from 3.9 kg/year, with known σ = 4.5:

# Given data
x_bar <- 4.2     # Sample mean
mu_0 <- 3.9      # Hypothesized mean
sigma <- 4.5     # Known population SD
n <- 150         # Sample size
alpha <- 0.05    # Significance level

# Calculate z-statistic
z_stat <- (x_bar - mu_0) / (sigma / sqrt(n))

# Calculate p-value (two-tailed)
p_value <- 2 * (1 - pnorm(abs(z_stat)))

# Critical values
z_critical <- qnorm(1 - alpha/2)

# Decision
if (abs(z_stat) > z_critical) {
  print("Reject H₀")
} else {
  print("Fail to reject H₀")
}

# Using R's built-in function (requires BSDA package)
library(BSDA)
# sample_data here stands for the vector of raw observations
z.test(x = sample_data, mu = 3.9, sigma.x = 4.5, 
       alternative = "two.sided")

12.2 One-Sample t-Test

The t-test is used when:

  • Population standard deviation is unknown
  • Sample standard deviation (s) is used instead
  • Population is approximately normal (or large sample)

Test Statistic

\[t = \frac{\bar{X} - \mu_0}{s/\sqrt{n}}\]

Degrees of Freedom

df = n - 1

Example: Call Center Response Time

Testing if mean response time equals 120 minutes:

# Sample data
response_times <- c(115, 125, 118, 122, 130, 119, 121, 117, 124, 116)
mu_0 <- 120
alpha <- 0.05

# Using t.test function
result <- t.test(response_times, mu = mu_0, alternative = "two.sided")
print(result)

# Manual calculation
x_bar <- mean(response_times)
s <- sd(response_times)
n <- length(response_times)
se <- s / sqrt(n)
t_stat <- (x_bar - mu_0) / se
df <- n - 1
p_value <- 2 * pt(-abs(t_stat), df)

# Confidence interval
t_critical <- qt(1 - alpha/2, df)
margin_error <- t_critical * se
ci_lower <- x_bar - margin_error
ci_upper <- x_bar + margin_error

cat("t-statistic:", t_stat, "\n")
cat("p-value:", p_value, "\n")
cat("95% CI: [", ci_lower, ",", ci_upper, "]\n")

Using infer Package

library(infer)

# infer expects a data frame, so wrap the vector first
response_df <- data.frame(response_times = response_times)

# Calculate observed statistic
obs_stat <- response_df %>%
  specify(response = response_times) %>%
  calculate(stat = "t", mu = mu_0)

# Generate null distribution
null_dist <- response_df %>%
  specify(response = response_times) %>%
  hypothesize(null = "point", mu = mu_0) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "t", mu = mu_0)

# Get p-value
p_value <- null_dist %>%
  get_p_value(obs_stat = obs_stat, direction = "two-sided")

# Visualize
null_dist %>%
  visualize() +
  shade_p_value(obs_stat = obs_stat, direction = "two-sided")

12.3 One-Sample Test for Proportion

Used to test claims about population proportions.

Test Statistic

\[Z = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}}\]

Conditions

  1. Random sample
  2. Independence
  3. Success-failure condition: np₀ ≥ 10 and n(1-p₀) ≥ 10
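
The success-failure condition is easy to verify before testing; a small helper sketch (the function name is ours, not a standard one):

```r
# Returns TRUE when both expected counts reach 10 under H0
check_success_failure <- function(n, p_0) {
  n * p_0 >= 10 && n * (1 - p_0) >= 10
}

check_success_failure(100, 0.80)  # 80 and 20 expected -> TRUE
check_success_failure(30, 0.10)   # only 3 successes expected -> FALSE
```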

Example: Customer Satisfaction

Testing if satisfaction rate equals 80%:

# Given data
n <- 100         # Sample size
x <- 73          # Number of successes
p_0 <- 0.80      # Hypothesized proportion
alpha <- 0.05

# Calculate sample proportion
p_hat <- x / n

# Calculate test statistic
z_stat <- (p_hat - p_0) / sqrt(p_0 * (1 - p_0) / n)

# Calculate p-value (two-tailed)
p_value <- 2 * pnorm(-abs(z_stat))

# Using prop.test
result <- prop.test(x = x, n = n, p = p_0, 
                    alternative = "two.sided", 
                    correct = FALSE)

# Using infer
satisfaction_data <- data.frame(
  satisfied = c(rep("yes", x), rep("no", n - x))
)

# Observed statistic
obs_stat <- satisfaction_data %>%
  specify(response = satisfied, success = "yes") %>%
  calculate(stat = "prop")

# Null distribution
null_dist <- satisfaction_data %>%
  specify(response = satisfied, success = "yes") %>%
  hypothesize(null = "point", p = p_0) %>%
  generate(reps = 1000, type = "draw") %>%
  calculate(stat = "prop")

# P-value
p_value <- null_dist %>%
  get_p_value(obs_stat = obs_stat, direction = "two-sided")

12.4 One-Tailed Tests

Right-Tailed Test (Greater Than)

H₀: μ ≤ μ₀
H₁: μ > μ₀

# Right-tailed t-test
t.test(data, mu = mu_0, alternative = "greater")

# Right-tailed proportion test
prop.test(x, n, p = p_0, alternative = "greater")

Left-Tailed Test (Less Than)

H₀: μ ≥ μ₀
H₁: μ < μ₀

# Left-tailed t-test
t.test(data, mu = mu_0, alternative = "less")

# Left-tailed proportion test
prop.test(x, n, p = p_0, alternative = "less")

12.5 Chi-Square Test for Variance

Testing claims about population variance σ²:

Test Statistic

\[\chi^2 = \frac{(n-1)s^2}{\sigma_0^2}\]

Degrees of Freedom

df = n - 1

Example

# Sample data
sample_data <- c(10, 12, 9, 11, 13, 10, 12, 11, 9, 10)
sigma_0_squared <- 4  # Hypothesized variance
alpha <- 0.05

# Calculate test statistic
n <- length(sample_data)
s_squared <- var(sample_data)
chi_squared <- (n - 1) * s_squared / sigma_0_squared

# Calculate p-value (two-tailed)
p_value_lower <- pchisq(chi_squared, df = n - 1)
p_value_upper <- 1 - pchisq(chi_squared, df = n - 1)
p_value <- 2 * min(p_value_lower, p_value_upper)

# Critical values
chi_lower <- qchisq(alpha/2, df = n - 1)
chi_upper <- qchisq(1 - alpha/2, df = n - 1)

12.6 Sign Test (Non-parametric)

Used when normality assumption is violated.

Procedure

  1. Count values above and below hypothesized median
  2. Use binomial distribution with p = 0.5
# Sign test for median
library(BSDA)
SIGN.test(data, md = hypothesized_median, 
          alternative = "two.sided")

# Manual implementation
above <- sum(data > hypothesized_median)
below <- sum(data < hypothesized_median)
n_eff <- above + below
p_value <- 2 * pbinom(min(above, below), n_eff, 0.5)

12.7 Practical Examples

Example 1: Age at First Marriage

Testing if mean age at first marriage > 23 years:

# Load data
age_data <- read.csv("age_at_marriage.csv")
mu_0 <- 23

# Check normality
hist(age_data$age, main = "Distribution of Age at First Marriage")
qqnorm(age_data$age)
qqline(age_data$age)

# Perform t-test
result <- t.test(age_data$age, mu = mu_0, alternative = "greater")
print(result)

# Using infer
obs_stat <- age_data %>%
  specify(response = age) %>%
  calculate(stat = "mean")

null_dist <- age_data %>%
  specify(response = age) %>%
  hypothesize(null = "point", mu = mu_0) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "mean")

p_value <- null_dist %>%
  get_p_value(obs_stat = obs_stat, direction = "greater")

# Visualize
null_dist %>%
  visualize() +
  shade_p_value(obs_stat = obs_stat, direction = "greater")

Example 2: Proportion of Female Students

Testing if proportion of females = 0.50:

# Survey data
total_students <- 500
female_students <- 237

# Traditional approach
prop.test(x = female_students, n = total_students, 
          p = 0.50, alternative = "two.sided")

# Using infer
gender_data <- data.frame(
  gender = c(rep("female", female_students), 
             rep("male", total_students - female_students))
)

obs_stat <- gender_data %>%
  specify(response = gender, success = "female") %>%
  calculate(stat = "prop")

null_dist <- gender_data %>%
  specify(response = gender, success = "female") %>%
  hypothesize(null = "point", p = 0.50) %>%
  generate(reps = 1000, type = "draw") %>%
  calculate(stat = "prop")

p_value <- null_dist %>%
  get_p_value(obs_stat = obs_stat, direction = "two-sided")

12.8 Effect Size for One-Sample Tests

Cohen’s d for Means

\[d = \frac{|\bar{X} - \mu_0|}{s}\]

Interpretation:

  • Small effect: d ≈ 0.2
  • Medium effect: d ≈ 0.5
  • Large effect: d ≈ 0.8
# Calculate Cohen's d
cohens_d <- abs(mean(data) - mu_0) / sd(data)

Effect Size for Proportions

\[h = 2 \arcsin(\sqrt{\hat{p}}) - 2 \arcsin(\sqrt{p_0})\]
# Calculate h
h <- 2 * asin(sqrt(p_hat)) - 2 * asin(sqrt(p_0))

12.9 Power Analysis

For One-Sample t-Test

library(pwr)

# Power calculation (n given, power returned)
pwr.t.test(n = 50, d = 0.5, sig.level = 0.05, 
           type = "one.sample")

# Sample size calculation (power given, n returned)
pwr.t.test(d = 0.5, sig.level = 0.05, power = 0.80, 
           type = "one.sample")

For One-Sample Proportion Test

# Sample size calculation (solves for n)
pwr.p.test(h = ES.h(p1 = 0.75, p2 = 0.80), 
           sig.level = 0.05, power = 0.80)

12.10 Reporting Results

Template for Reporting

“A one-sample [test type] was conducted to determine whether [parameter] differed from [hypothesized value]. The sample [statistic] was [value] ([measure of spread]). The test was [significant/not significant], t(df) = [test statistic], p = [p-value]. [Effect size measure] indicated a [small/medium/large] effect size. We [reject/fail to reject] the null hypothesis and conclude that [contextual conclusion].”

Example Report

“A one-sample t-test was conducted to determine whether the mean response time differed from 120 minutes. The sample mean was 118.7 minutes (SD = 4.2). The test was not significant, t(9) = -0.98, p = 0.353. Cohen’s d = 0.31 indicated a small effect size. We fail to reject the null hypothesis and conclude that there is insufficient evidence to suggest the mean response time differs from 120 minutes.”

Part 7: Two-Sample Tests

13. Two-Sample Tests

13.1 Independent Samples t-Test

The independent samples t-test compares means from two unrelated groups.

Assumptions

  1. Independent random samples
  2. Approximately normal distributions (or large samples)
  3. Equal variances (classical t-test) or unequal variances (Welch’s t-test)

Test Statistic (Equal Variances)

\[t = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}\]

where pooled standard deviation: \(s_p = \sqrt{\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1 + n_2 - 2}}\)
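
For reference, the pooled (equal-variance) statistic can be computed by hand; the summary statistics below are hypothetical:

```r
# Equal-variance two-sample t-test from summary statistics
x1_bar <- 52; s1 <- 8;  n1 <- 30
x2_bar <- 48; s2 <- 10; n2 <- 35

s_p <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2))
t_stat <- (x1_bar - x2_bar) / (s_p * sqrt(1/n1 + 1/n2))
df <- n1 + n2 - 2
p_value <- 2 * pt(-abs(t_stat), df)
```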

Test Statistic (Unequal Variances - Welch’s)

\[t = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}\]

Degrees of Freedom (Welch’s)

\[df = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{(s_1^2/n_1)^2}{n_1-1} + \frac{(s_2^2/n_2)^2}{n_2-1}}\]

Example: Income Comparison

Comparing average income between Cleveland and Sacramento:

# Load data
income_data <- read.delim("cleSac.txt")
cleveland <- income_data$income[income_data$metro_area == "Cleveland_OH"]
sacramento <- income_data$income[income_data$metro_area == "Sacramento_CA"]

# Check assumptions
# Normality
par(mfrow = c(1, 2))
hist(cleveland, main = "Cleveland Income")
hist(sacramento, main = "Sacramento Income")

# Equal variances
var.test(cleveland, sacramento)

# Perform t-test (Welch's by default)
t_test_result <- t.test(cleveland, sacramento, 
                        alternative = "two.sided",
                        var.equal = FALSE)
print(t_test_result)

# Manual calculation
x1_bar <- mean(cleveland)
x2_bar <- mean(sacramento)
s1 <- sd(cleveland)
s2 <- sd(sacramento)
n1 <- length(cleveland)
n2 <- length(sacramento)

t_stat <- (x1_bar - x2_bar) / sqrt(s1^2/n1 + s2^2/n2)
df <- ((s1^2/n1 + s2^2/n2)^2) / 
      ((s1^2/n1)^2/(n1-1) + (s2^2/n2)^2/(n2-1))
p_value <- 2 * pt(-abs(t_stat), df)

# Using infer
library(infer)
obs_diff <- income_data %>%
  specify(income ~ metro_area) %>%
  calculate(stat = "diff in means", 
            order = c("Cleveland_OH", "Sacramento_CA"))

null_dist <- income_data %>%
  specify(income ~ metro_area) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in means", 
            order = c("Cleveland_OH", "Sacramento_CA"))

p_value <- null_dist %>%
  get_p_value(obs_stat = obs_diff, direction = "two-sided")

# Visualize
null_dist %>%
  visualize() +
  shade_p_value(obs_stat = obs_diff, direction = "two-sided")

13.2 Paired Samples t-Test

Used when comparing two related measurements (e.g., before-after, matched pairs).

Test Statistic

\[t = \frac{\bar{d} - \mu_d}{s_d/\sqrt{n}}\]

where:

  • \(\bar{d}\) = mean of differences
  • \(s_d\) = standard deviation of differences
  • \(\mu_d\) = hypothesized mean difference (usually 0)

Example: Weight Loss Program

# Before and after weights
before <- c(65, 60, 63, 58, 70, 72, 80, 81, 83, 69)
after <- c(55, 52, 60, 55, 60, 66, 70, 66, 63, 60)

# Calculate differences
differences <- after - before

# Check normality of differences
hist(differences, main = "Distribution of Weight Differences")
qqnorm(differences)
qqline(differences)

# Paired t-test
paired_result <- t.test(after, before, 
                        paired = TRUE, 
                        alternative = "less")
print(paired_result)

# Manual calculation
d_bar <- mean(differences)
s_d <- sd(differences)
n <- length(differences)
t_stat <- d_bar / (s_d / sqrt(n))
p_value <- pt(t_stat, df = n - 1)

# Using infer
weight_data <- data.frame(
  id = rep(1:10, 2),
  weight = c(before, after),
  time = rep(c("before", "after"), each = 10)
)

obs_diff <- weight_data %>%
  specify(weight ~ time) %>%
  calculate(stat = "diff in means", 
            order = c("after", "before"))

null_dist <- weight_data %>%
  specify(weight ~ time) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in means", 
            order = c("after", "before"))

p_value <- null_dist %>%
  get_p_value(obs_stat = obs_diff, direction = "less")

Example: Water Quality Testing

Comparing zinc concentration in surface vs. bottom water:

# Load and prepare data
zinc_data <- read.csv("zinc_tidy.csv")
zinc_diff <- zinc_data %>%
  group_by(loc_id) %>%
  summarize(pair_diff = diff(concentration))

# Test if surface water has lower concentration
t.test(zinc_diff$pair_diff, 
       mu = 0, 
       alternative = "less")

# Visual comparison
library(ggplot2)
ggplot(zinc_data, aes(x = depth, y = concentration)) +
  geom_boxplot() +
  geom_line(aes(group = loc_id), alpha = 0.3) +
  geom_point(aes(color = depth), size = 3) +
  labs(title = "Zinc Concentration by Depth",
       x = "Water Depth",
       y = "Zinc Concentration")

13.3 Two-Sample Test for Proportions

Compares proportions between two independent groups.

Test Statistic

\[Z = \frac{(\hat{p}_1 - \hat{p}_2) - 0}{\sqrt{\hat{p}(1-\hat{p})(\frac{1}{n_1} + \frac{1}{n_2})}}\]

where pooled proportion: \(\hat{p} = \frac{x_1 + x_2}{n_1 + n_2}\)

Example: Gender and Weight Loss

Testing if weight loss rates differ by gender:

# Data
male_success <- 45
male_total <- 100
female_success <- 62
female_total <- 100

# Two-proportion test
prop.test(x = c(male_success, female_success),
          n = c(male_total, female_total),
          alternative = "two.sided",
          correct = FALSE)

# Manual calculation
p1_hat <- male_success / male_total
p2_hat <- female_success / female_total
p_pool <- (male_success + female_success) / (male_total + female_total)
se_diff <- sqrt(p_pool * (1 - p_pool) * (1/male_total + 1/female_total))
z_stat <- (p1_hat - p2_hat) / se_diff
p_value <- 2 * pnorm(-abs(z_stat))

# Using infer
gender_data <- data.frame(
  gender = c(rep("male", male_total), rep("female", female_total)),
  success = c(rep(c("yes", "no"), c(male_success, male_total - male_success)),
              rep(c("yes", "no"), c(female_success, female_total - female_success)))
)

obs_diff <- gender_data %>%
  specify(success ~ gender, success = "yes") %>%
  calculate(stat = "diff in props", 
            order = c("male", "female"))

null_dist <- gender_data %>%
  specify(success ~ gender, success = "yes") %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in props", 
            order = c("male", "female"))

p_value <- null_dist %>%
  get_p_value(obs_stat = obs_diff, direction = "two-sided")

13.4 Mann-Whitney U Test (Wilcoxon Rank-Sum)

Non-parametric alternative to independent samples t-test.

When to Use

  • Data not normally distributed
  • Ordinal data
  • Small sample sizes
  • Presence of outliers
# Mann-Whitney U test
wilcox.test(cleveland, sacramento, 
            alternative = "two.sided")

# With continuity correction
wilcox.test(cleveland, sacramento, 
            alternative = "two.sided", 
            correct = TRUE)

# Visual comparison using violin plots
library(ggplot2)
ggplot(income_data, aes(x = metro_area, y = income)) +
  geom_violin(fill = "lightblue") +
  geom_boxplot(width = 0.1) +
  labs(title = "Income Distribution by City",
       x = "Metropolitan Area",
       y = "Income")

13.5 Wilcoxon Signed-Rank Test

Non-parametric alternative to paired t-test.

# Wilcoxon signed-rank test
wilcox.test(after, before, 
            paired = TRUE, 
            alternative = "less")

# Manual implementation
differences <- after - before
ranks <- rank(abs(differences))
signed_ranks <- ranks * sign(differences)
W <- sum(signed_ranks[signed_ranks > 0])

13.6 Chi-Square Test for Independence

Tests whether two categorical variables are independent.

Test Statistic

\[\chi^2 = \sum \frac{(O - E)^2}{E}\]

where:

  • O = observed frequency
  • E = expected frequency

Example: Species and Petal Width

# Prepare data
iris_subset <- iris[iris$Species != "virginica", ]
iris_subset$Species <- factor(iris_subset$Species)
iris_subset$Petal.Width.Cat <- cut(iris_subset$Petal.Width, 
                                   breaks = quantile(iris_subset$Petal.Width, 
                                                    probs = c(0, 0.5, 1)), 
                                   include.lowest = TRUE,
                                   labels = c("Below", "Above"))

# Create contingency table
cont_table <- table(iris_subset$Petal.Width.Cat, iris_subset$Species)
print(cont_table)

# Chi-square test
chi_test <- chisq.test(cont_table)
print(chi_test)

# Expected values
print(chi_test$expected)

# Residuals
print(chi_test$residuals)

# Visualization
library(ggplot2)
ggplot(iris_subset, aes(x = Species, fill = Petal.Width.Cat)) +
  geom_bar(position = "dodge") +
  labs(title = "Distribution of Petal Width by Species",
       x = "Species",
       y = "Count",
       fill = "Petal Width")

13.7 Effect Size for Two-Sample Tests

Cohen’s d for Independent Samples

\[d = \frac{|\bar{X}_1 - \bar{X}_2|}{s_p}\]

Cohen’s d for Paired Samples

\[d = \frac{|\bar{d}|}{s_d}\]
# Effect size for independent samples
library(effsize)
cohen.d(cleveland, sacramento)

# Effect size for paired samples
cohen.d(after, before, paired = TRUE)

# Manual calculation
d_independent <- abs(mean(cleveland) - mean(sacramento)) / 
                 sqrt(((length(cleveland) - 1) * var(cleveland) + 
                       (length(sacramento) - 1) * var(sacramento)) / 
                      (length(cleveland) + length(sacramento) - 2))

d_paired <- abs(mean(differences)) / sd(differences)

13.8 Power Analysis for Two-Sample Tests

library(pwr)

# Power for independent t-test
pwr.t2n.test(n1 = 100, n2 = 120, d = 0.5, 
             sig.level = 0.05, 
             alternative = "two.sided")

# Sample size for paired t-test
pwr.t.test(d = 0.3, sig.level = 0.05, 
           power = 0.80, type = "paired")

# Power for two proportions
pwr.2p.test(h = ES.h(p1 = 0.45, p2 = 0.60), 
            sig.level = 0.05, power = 0.80)

13.9 Comprehensive Example: Medicine Effectiveness

# Load data
medicine_data <- read.csv("medicine_survey.txt")

# Research question: Do females weigh less than males before treatment?

# Subset data
female_weight <- medicine_data$KgBefore[medicine_data$Gender == "Female"]
male_weight <- medicine_data$KgBefore[medicine_data$Gender == "Male"]

# Exploratory analysis
library(ggplot2)
ggplot(medicine_data, aes(x = Gender, y = KgBefore, fill = Gender)) +
  geom_boxplot() +
  geom_jitter(width = 0.2, alpha = 0.5) +
  labs(title = "Weight Distribution by Gender",
       x = "Gender",
       y = "Weight Before Treatment (kg)")

# Check assumptions
shapiro.test(female_weight)
shapiro.test(male_weight)
var.test(female_weight, male_weight)

# Perform t-test
result <- t.test(female_weight, male_weight, 
                 alternative = "less",
                 var.equal = FALSE)
print(result)

# Effect size
cohen.d(female_weight, male_weight)

# Bootstrap confidence interval
library(infer)
boot_dist <- medicine_data %>%
  specify(KgBefore ~ Gender) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "diff in means", 
            order = c("Female", "Male"))

ci_boot <- boot_dist %>%
  get_confidence_interval(level = 0.95, type = "percentile")

# Report results
cat("Mean weight (Female):", mean(female_weight), "kg\n")
cat("Mean weight (Male):", mean(male_weight), "kg\n")
cat("Difference:", mean(female_weight) - mean(male_weight), "kg\n")
cat("95% CI (traditional):", result$conf.int, "\n")
cat("95% CI (bootstrap):", ci_boot$lower_ci, "to", ci_boot$upper_ci, "\n")
cat("p-value:", result$p.value, "\n")

13.10 Reporting Two-Sample Test Results

Template

“A [test type] was conducted to compare [variable] between [group 1] and [group 2]. There was a [significant/non-significant] difference in [variable] between [group 1] (M = [mean1], SD = [sd1]) and [group 2] (M = [mean2], SD = [sd2]); t(df) = [test statistic], p = [p-value]. The effect size was [small/medium/large] (d = [effect size]). These results suggest that [contextual conclusion].”

Example Report

“An independent samples t-test was conducted to compare weight before treatment between females and males. There was a significant difference in weight between females (M = 65.4, SD = 8.2) and males (M = 72.1, SD = 9.5); t(198) = -5.33, p < .001. The effect size was medium (d = 0.75). These results suggest that females had significantly lower weight than males before treatment.”

Part 8: R Implementation & Case Studies

14. R Implementation: Traditional vs Simulation Methods

14.1 Traditional Methods Overview

Traditional statistical methods rely on mathematical theory and distributional assumptions.

Key Functions in Base R

# t-tests
t.test()           # One-sample, two-sample, paired t-tests
var.test()         # F-test for equality of variances

# Proportion tests
prop.test()        # One-sample, two-sample proportion tests
binom.test()       # Exact binomial test

# Non-parametric tests
wilcox.test()      # Wilcoxon rank-sum/signed-rank tests
chisq.test()       # Chi-square test

# ANOVA
aov()              # Analysis of variance
summary()          # ANOVA table

Example: Traditional t-test workflow

# Load data
data <- read.csv("sample_data.csv")

# Check assumptions
shapiro.test(data$response)  # Normality
var.test(group1, group2)     # Equal variances

# Perform test
result <- t.test(response ~ group, data = data)

# Extract results
result$statistic  # Test statistic
result$p.value    # p-value
result$conf.int   # Confidence interval

14.2 Simulation-Based Methods with infer

The infer package provides a unified grammar for statistical inference using simulation methods.

The infer Workflow

  1. specify(): Specify variable(s) of interest
  2. hypothesize(): Declare null hypothesis
  3. generate(): Generate data reflecting null hypothesis
  4. calculate(): Calculate summary statistic
  5. visualize(): Visualize the null distribution

Basic infer Structure

library(infer)

# General workflow
data %>%
  specify(formula) %>%           # Specify variables
  hypothesize(null = "...") %>%  # State null hypothesis
  generate(reps = n) %>%         # Generate null distribution
  calculate(stat = "...") %>%    # Calculate statistics
  visualize()                    # Visualize results

14.3 Comparing Traditional and Simulation Approaches

One-Sample Mean Test

Traditional Approach:

# Traditional t-test
t_result <- t.test(data$variable, mu = null_value)

Simulation Approach:

# Calculate observed statistic
obs_stat <- data %>%
  specify(response = variable) %>%
  calculate(stat = "mean")

# Generate null distribution
null_dist <- data %>%
  specify(response = variable) %>%
  hypothesize(null = "point", mu = null_value) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "mean")

# Get p-value
p_value <- null_dist %>%
  get_p_value(obs_stat = obs_stat, direction = "two-sided")

# Visualize
null_dist %>%
  visualize() +
  shade_p_value(obs_stat = obs_stat, direction = "two-sided")

Two-Sample Mean Test

Traditional Approach:

# Traditional independent t-test
t_result <- t.test(value ~ group, data = data)

Simulation Approach:

# Calculate observed difference
obs_diff <- data %>%
  specify(value ~ group) %>%
  calculate(stat = "diff in means", order = c("A", "B"))

# Generate null distribution
null_dist <- data %>%
  specify(value ~ group) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in means", order = c("A", "B"))

# Get p-value and visualize
p_value <- null_dist %>%
  get_p_value(obs_stat = obs_diff, direction = "two-sided")

14.4 Case Study 1: Bowl Sampling Activity

Estimating the proportion of red balls in a bowl using physical and virtual sampling.

library(moderndive)
library(tidyverse)
library(infer)

# Physical sampling results
tactile_prop_red <- read.csv("tactile_sampling.csv")

# Virtual sampling
virtual_samples <- bowl %>% 
  rep_sample_n(size = 50, reps = 33)

# Calculate proportions
virtual_prop_red <- virtual_samples %>% 
  group_by(replicate) %>% 
  summarize(red = sum(color == "red")) %>% 
  mutate(prop_red = red / 50)

# Compare distributions
combined_data <- bind_rows(
  virtual_prop_red %>% mutate(type = "Virtual"),
  tactile_prop_red %>% mutate(type = "Tactile")
)

ggplot(combined_data, aes(x = prop_red, fill = type)) +
  geom_histogram(binwidth = 0.05, position = "dodge", alpha = 0.7) +
  labs(title = "Comparison of Sampling Methods",
       x = "Proportion of Red Balls",
       y = "Count",
       fill = "Sampling Type") +
  theme_minimal()

# Effect of sample size
sample_sizes <- c(25, 50, 100, 200)
results <- list()

for (n in sample_sizes) {
  samples <- bowl %>% 
    rep_sample_n(size = n, reps = 1000)
  
  props <- samples %>% 
    group_by(replicate) %>% 
    summarize(prop_red = mean(color == "red"))
  
  results[[paste0("n_", n)]] <- data.frame(
    n = n,
    sd = sd(props$prop_red),
    mean = mean(props$prop_red)
  )
}

results_df <- do.call(rbind, results)

ggplot(results_df, aes(x = n, y = sd)) +
  geom_line() +
  geom_point(size = 3) +
  labs(title = "Standard Error vs Sample Size",
       x = "Sample Size",
       y = "Standard Error") +
  theme_minimal()

14.5 Case Study 2: Cell Phone Usage Analysis

Analyzing student cell phone usage patterns from the “usecell” dataset.

# Load data
usecell <- read.csv("usecell.txt")

# Research question: Is mean calling time > 120 minutes?

# Traditional approach
t_result <- t.test(usecell$CallTime, mu = 120, alternative = "greater")
print(t_result)

# Simulation approach
obs_mean <- usecell %>%
  specify(response = CallTime) %>%
  calculate(stat = "mean")

null_dist <- usecell %>%
  specify(response = CallTime) %>%
  hypothesize(null = "point", mu = 120) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "mean")

p_value <- null_dist %>%
  get_p_value(obs_stat = obs_mean, direction = "greater")

# Visualization
null_dist %>%
  visualize() +
  shade_p_value(obs_stat = obs_mean, direction = "greater") +
  labs(title = "Null Distribution for Mean Call Time",
       subtitle = "H0: μ = 120 minutes")

# Proportion analysis: Games as favorite app
games_prop <- usecell %>%
  mutate(is_games = FavoriteApp == "games") %>%
  specify(response = is_games, success = "TRUE") %>%
  calculate(stat = "prop")

# Bootstrap CI for proportion
boot_dist <- usecell %>%
  mutate(is_games = FavoriteApp == "games") %>%
  specify(response = is_games, success = "TRUE") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop")

ci_90 <- boot_dist %>%
  get_confidence_interval(level = 0.90, type = "percentile")

# Visualize with CI
boot_dist %>%
  visualize() +
  shade_confidence_interval(endpoints = ci_90) +
  labs(title = "Bootstrap Distribution of Games Proportion",
       subtitle = "90% Confidence Interval")

14.6 Case Study 3: Income Comparison Study

Comparing income between different cities using multiple approaches.

# Load data
cle_sac <- read.delim("cleSac.txt")

# Exploratory analysis
ggplot(cle_sac, aes(x = metro_area, y = income)) +
  geom_violin(fill = "lightblue") +
  geom_boxplot(width = 0.1) +
  stat_summary(fun = mean, geom = "point", color = "red", size = 3) +
  labs(title = "Income Distribution by Metropolitan Area",
       x = "City",
       y = "Income")

# Traditional t-test
t_result <- t.test(income ~ metro_area, data = cle_sac)

# Simulation-based approach
obs_diff <- cle_sac %>%
  specify(income ~ metro_area) %>%
  calculate(stat = "diff in means", 
            order = c("Cleveland_OH", "Sacramento_CA"))

null_dist <- cle_sac %>%
  specify(income ~ metro_area) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in means", 
            order = c("Cleveland_OH", "Sacramento_CA"))

# Compare approaches
cat("Traditional p-value:", t_result$p.value, "\n")
cat("Simulation p-value:", 
    get_p_value(null_dist, obs_stat = obs_diff, direction = "two-sided")$p_value, "\n")

# Bootstrap confidence interval
boot_dist <- cle_sac %>%
  specify(income ~ metro_area) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "diff in means", 
            order = c("Cleveland_OH", "Sacramento_CA"))

ci_boot <- boot_dist %>%
  get_confidence_interval(level = 0.95, type = "percentile")

# Visualization of results
p1 <- null_dist %>%
  visualize() +
  shade_p_value(obs_stat = obs_diff, direction = "two-sided") +
  labs(title = "Null Distribution")

p2 <- boot_dist %>%
  visualize() +
  shade_confidence_interval(endpoints = ci_boot) +
  labs(title = "Bootstrap Distribution")

gridExtra::grid.arrange(p1, p2, ncol = 2)

14.7 Case Study 4: Pizza Delivery Analysis

Analyzing pizza delivery times with various statistical approaches.

library(DescTools)
data(d.pizza)

# Filter for driver Carter
carter_data <- d.pizza %>%
  filter(driver == "Carter")

# Traditional confidence interval
t_result <- t.test(carter_data$delivery_time)
traditional_ci <- t_result$conf.int

# Bootstrap confidence interval (percentile method)
boot_dist_carter <- carter_data %>%
  specify(response = delivery_time) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "mean")

bootstrap_ci <- boot_dist_carter %>%
  get_confidence_interval(level = 0.99, type = "percentile")

# Standard error method
se_ci <- boot_dist_carter %>%
  get_confidence_interval(level = 0.99, type = "se",
                         point_estimate = mean(carter_data$delivery_time))

# Compare confidence intervals
ci_comparison <- data.frame(
  Method = c("Traditional", "Bootstrap Percentile", "Bootstrap SE"),
  Lower = c(traditional_ci[1], bootstrap_ci$lower_ci, se_ci$lower_ci),
  Upper = c(traditional_ci[2], bootstrap_ci$upper_ci, se_ci$upper_ci)
)

print(ci_comparison)

# Visualize distributions
p1 <- ggplot(carter_data, aes(x = delivery_time)) +
  geom_histogram(binwidth = 2, fill = "skyblue", color = "black") +
  labs(title = "Distribution of Delivery Times",
       x = "Delivery Time (minutes)",
       y = "Count")

p2 <- boot_dist_carter %>%
  visualize() +
  shade_confidence_interval(endpoints = bootstrap_ci) +
  labs(title = "Bootstrap Distribution",
       x = "Mean Delivery Time")

gridExtra::grid.arrange(p1, p2, ncol = 2)

14.8 Case Study 5: Weight Loss Program Evaluation

Evaluating effectiveness of a weight loss program using paired data.

# Load data
weight_data <- read.csv("Weight.txt")

# Traditional paired t-test
t_result <- t.test(weight_data$Weight.after, 
                   weight_data$Weight.before, 
                   paired = TRUE, 
                   alternative = "less")

# Calculate differences
weight_data <- weight_data %>%
  mutate(difference = Weight.after - Weight.before)

# Simulation approach
obs_mean_diff <- weight_data %>%
  specify(response = difference) %>%
  calculate(stat = "mean")

null_dist <- weight_data %>%
  specify(response = difference) %>%
  hypothesize(null = "point", mu = 0) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "mean")

# Visualization
p1 <- ggplot(weight_data, aes(x = difference)) +
  geom_histogram(binwidth = 2, fill = "coral", color = "black") +
  geom_vline(xintercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Distribution of Weight Differences",
       x = "Weight Change (kg)",
       y = "Count")

p2 <- null_dist %>%
  visualize() +
  shade_p_value(obs_stat = obs_mean_diff, direction = "less") +
  labs(title = "Null Distribution",
       subtitle = "H0: μd = 0")

# Before-after plot
p3 <- ggplot(weight_data, aes(x = Weight.before, y = Weight.after)) +
  geom_point(size = 3) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Before vs After Weight",
       x = "Weight Before (kg)",
       y = "Weight After (kg)")

gridExtra::grid.arrange(p1, p2, p3, ncol = 2)

14.9 Creating Comprehensive Reports

# Function to create comprehensive analysis report
create_analysis_report <- function(data, response_var, group_var = NULL, 
                                 test_type = "mean", null_value = NULL,
                                 confidence_level = 0.95) {
  
  # Descriptive statistics
  if (is.null(group_var)) {
    desc_stats <- data %>%
      summarize(
        n = n(),
        mean = mean(!!sym(response_var), na.rm = TRUE),
        sd = sd(!!sym(response_var), na.rm = TRUE),
        median = median(!!sym(response_var), na.rm = TRUE),
        IQR = IQR(!!sym(response_var), na.rm = TRUE)
      )
  } else {
    desc_stats <- data %>%
      group_by(!!sym(group_var)) %>%
      summarize(
        n = n(),
        mean = mean(!!sym(response_var), na.rm = TRUE),
        sd = sd(!!sym(response_var), na.rm = TRUE),
        median = median(!!sym(response_var), na.rm = TRUE),
        IQR = IQR(!!sym(response_var), na.rm = TRUE)
      )
  }
  
  # Traditional test
  if (is.null(group_var)) {
    traditional_result <- t.test(data[[response_var]], mu = null_value)
  } else {
    formula <- as.formula(paste(response_var, "~", group_var))
    traditional_result <- t.test(formula, data = data)
  }
  
  # Simulation-based approach
  if (is.null(group_var)) {
    obs_stat <- data %>%
      specify(response = !!sym(response_var)) %>%
      calculate(stat = test_type)
    
    null_dist <- data %>%
      specify(response = !!sym(response_var)) %>%
      hypothesize(null = "point", mu = null_value) %>%
      generate(reps = 1000, type = "bootstrap") %>%
      calculate(stat = test_type)
  } else {
    obs_stat <- data %>%
      specify(as.formula(paste(response_var, "~", group_var))) %>%
      calculate(stat = paste0("diff in ", test_type, "s"))
    
    null_dist <- data %>%
      specify(as.formula(paste(response_var, "~", group_var))) %>%
      hypothesize(null = "independence") %>%
      generate(reps = 1000, type = "permute") %>%
      calculate(stat = paste0("diff in ", test_type, "s"))
  }
  
  # Generate report
  report <- list(
    descriptive_stats = desc_stats,
    traditional_test = traditional_result,
    simulation_pvalue = get_p_value(null_dist, obs_stat = obs_stat, 
                                   direction = "two-sided"),
    observed_statistic = obs_stat,
    null_distribution = null_dist
  )
  
  return(report)
}

# Example usage
report <- create_analysis_report(
  data = cle_sac,
  response_var = "income",
  group_var = "metro_area",
  test_type = "mean"
)

# Print report
print(report$descriptive_stats)
print(summary(report$traditional_test))
print(report$simulation_pvalue)

14.10 Best Practices and Guidelines

When to Use Each Approach

Traditional Methods:

  • When assumptions are clearly met
  • For standard, well-established tests
  • When computational resources are limited
  • For formal reporting requirements

Simulation Methods:

  • When assumptions are violated
  • For complex statistics
  • For educational purposes
  • When visualization aids understanding

Workflow Recommendations

  1. Always start with exploratory data analysis
  2. Check assumptions before choosing method
  3. Use both approaches when possible for validation
  4. Focus on practical significance, not just statistical
  5. Create reproducible analyses with clear documentation

Code Organization

# Recommended project structure
project/
├── data/
│   ├── raw/
│   └── processed/
├── scripts/
│   ├── 01_data_cleaning.R
│   ├── 02_exploratory_analysis.R
│   ├── 03_statistical_tests.R
│   └── 04_visualization.R
├── output/
│   ├── figures/
│   └── tables/
└── reports/
    └── final_report.Rmd

Part 9: Sample Questions & Solutions

15. Sample Questions and Solutions

15.1 Probability and Distributions

Question 1

Four fair coins are flipped. If the outcomes are assumed independent, what is the probability that two heads and two tails are obtained?

Solution:

# This is a binomial probability problem
# n = 4 trials, k = 2 successes, p = 0.5

# Using binomial formula
prob <- choose(4, 2) * (0.5)^2 * (0.5)^2
print(prob)  # 0.375

# Using R function
dbinom(2, size = 4, prob = 0.5)  # 0.375

The probability is 0.375 or 37.5%.

Question 2

Between 2 p.m. and 4 p.m., telephone calls arrive at a switchboard at an average rate of 2.5 per minute. Find the probability that in a minute, there will be: a) Zero calls b) Exactly two calls c) More than six calls

Solution:

# This is a Poisson distribution problem
# λ = 2.5 calls per minute

# a) Zero calls
prob_zero <- dpois(0, lambda = 2.5)
print(prob_zero)  # 0.082

# b) Two calls
prob_two <- dpois(2, lambda = 2.5)
print(prob_two)  # 0.257

# c) More than six calls
prob_more_six <- 1 - ppois(6, lambda = 2.5)
print(prob_more_six)  # 0.014

Question 3

The level of blood pressure of a population of women has mean 125 mm Hg and standard deviation 12. A sample of 36 women is selected. What is the probability that the sample mean will be between 120 and 130?

Solution:

# Parameters
mu <- 125
sigma <- 12
n <- 36

# Standard error
se <- sigma / sqrt(n)

# Calculate z-scores
z1 <- (120 - mu) / se
z2 <- (130 - mu) / se

# Probability
prob <- pnorm(z2) - pnorm(z1)
print(prob)  # 0.987

# Alternative approach
pnorm(130, mean = mu, sd = se) - pnorm(120, mean = mu, sd = se)

15.2 Point Estimation

Question 4

Find the maximum likelihood estimator for the parameter λ of a Poisson distribution based on a random sample x₁, x₂, …, xₙ.

Solution: The likelihood function for Poisson is: \(L(\lambda) = \prod_{i=1}^n \frac{e^{-\lambda}\lambda^{x_i}}{x_i!}\)

Taking the log: \(\ell(\lambda) = -n\lambda + \sum_{i=1}^n x_i \log(\lambda) - \sum_{i=1}^n \log(x_i!)\)

Taking derivative and setting to zero: \(\frac{d\ell}{d\lambda} = -n + \frac{\sum x_i}{\lambda} = 0\)

Therefore: \(\hat{\lambda}_{MLE} = \bar{x}\)

# Example with data
taxi_arrivals <- c(23, 15, 23, 21, 13)
lambda_mle <- mean(taxi_arrivals)
print(lambda_mle)  # 19
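The closed-form answer can be cross-checked by maximizing the log-likelihood numerically; a sketch using base R's optimize() (the search interval is an arbitrary choice):

```r
taxi_arrivals <- c(23, 15, 23, 21, 13)

# Poisson log-likelihood as a function of lambda
loglik <- function(lambda) sum(dpois(taxi_arrivals, lambda, log = TRUE))

# Numerical maximization over a plausible range
mle_numeric <- optimize(loglik, interval = c(1, 50), maximum = TRUE)$maximum
mle_numeric  # agrees with mean(taxi_arrivals) = 19
```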

Question 5

The following sample represents M&M chocolate weights. Calculate point estimates for the population mean and standard deviation.

Solution:

# Sample data
weights <- c(111.2, 110.8, 112.1, 111.5, 110.9, 111.7, 110.6, 111.3, 112.0, 110.7)

# Point estimate for mean
mean_estimate <- mean(weights)
print(mean_estimate)  # 111.28

# Point estimate for standard deviation
sd_estimate <- sd(weights)
print(sd_estimate)  # 0.537

# Standard error
se <- sd_estimate / sqrt(length(weights))
print(se)  # 0.170

15.3 Confidence Intervals

Question 6

An accounting firm surveys 100 people about time to complete tax forms. The sample mean is 22.1 hours with known standard deviation of 7.9 hours. Construct a 90% confidence interval for the population mean.

Solution:

# Given information
n <- 100
x_bar <- 22.1
sigma <- 7.9
conf_level <- 0.90

# Calculate margin of error
z_star <- qnorm((1 + conf_level) / 2)
margin_error <- z_star * sigma / sqrt(n)

# Confidence interval
ci_lower <- x_bar - margin_error
ci_upper <- x_bar + margin_error

cat("90% CI: [", ci_lower, ",", ci_upper, "]\n")
# 90% CI: [ 20.8, 23.4 ]

# Error bound
error_bound <- margin_error
print(error_bound)  # 1.3

Question 7

In a survey of 1000 students, 7.3% prefer “games” as their favorite app. Find a 95% confidence interval for the population proportion.

Solution:

# Given information
n <- 1000
p_hat <- 0.073

# Traditional method
se <- sqrt(p_hat * (1 - p_hat) / n)
z_star <- qnorm(0.975)
margin_error <- z_star * se

ci_lower <- p_hat - margin_error
ci_upper <- p_hat + margin_error

cat("95% CI: [", ci_lower, ",", ci_upper, "]\n")
# 95% CI: [ 0.057, 0.089 ]

# Using prop.test
prop.test(x = 73, n = 1000, conf.level = 0.95, correct = FALSE)$conf.int

# Bootstrap method
library(infer)
set.seed(123)
games_data <- data.frame(
  favorite = c(rep("games", 73), rep("other", 927))
)

boot_dist <- games_data %>%
  specify(response = favorite, success = "games") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop")

boot_ci <- boot_dist %>%
  get_ci(level = 0.95, type = "percentile")
print(boot_ci)

15.4 Hypothesis Testing - One Sample

Question 8

A CEO claims that 80% of customers are satisfied. A survey of 100 customers finds 73 satisfied. Test this claim at α = 0.05.

Solution:

# Hypotheses:
# H₀: p = 0.80
# H₁: p ≠ 0.80

# Traditional approach
n <- 100
x <- 73
p_0 <- 0.80
alpha <- 0.05

# Test statistic
p_hat <- x/n
z_stat <- (p_hat - p_0) / sqrt(p_0 * (1 - p_0) / n)
p_value <- 2 * pnorm(-abs(z_stat))

cat("z-statistic:", z_stat, "\n")
cat("p-value:", p_value, "\n")
# z-statistic: -1.75
# p-value: 0.08

# Decision
if (p_value < alpha) {
  print("Reject H₀")
} else {
  print("Fail to reject H₀")
}
# Fail to reject H₀

# Simulation approach
satisfaction_data <- data.frame(
  status = c(rep("satisfied", 73), rep("unsatisfied", 27))
)

obs_prop <- satisfaction_data %>%
  specify(response = status, success = "satisfied") %>%
  calculate(stat = "prop")

null_dist <- satisfaction_data %>%
  specify(response = status, success = "satisfied") %>%
  hypothesize(null = "point", p = 0.80) %>%
  generate(reps = 1000, type = "draw") %>%
  calculate(stat = "prop")

p_value_sim <- null_dist %>%
  get_p_value(obs_stat = obs_prop, direction = "two-sided")

null_dist %>%
  visualize() +
  shade_p_value(obs_stat = obs_prop, direction = "two-sided")

Question 9

The average Rice consumption is claimed to be 3.9 kg with σ = 4.5. A sample of 16 countries shows mean 4.2 kg. Test if consumption differs from 3.9 at α = 0.05.

Solution:

# Given information
n <- 16
x_bar <- 4.2
mu_0 <- 3.9
sigma <- 4.5
alpha <- 0.05

# Hypotheses:
# H₀: μ = 3.9
# H₁: μ ≠ 3.9

# Z-test (known σ)
z_stat <- (x_bar - mu_0) / (sigma / sqrt(n))
p_value <- 2 * (1 - pnorm(abs(z_stat)))

cat("z-statistic:", z_stat, "\n")
cat("p-value:", p_value, "\n")
# z-statistic: 0.267
# p-value: 0.79

# Decision
if (p_value < alpha) {
  print("Reject H₀")
} else {
  print("Fail to reject H₀")
}
# Fail to reject H₀

15.5 Hypothesis Testing - Two Samples

Question 10

Test whether income differs between Cleveland and Sacramento. Cleveland: n₁=212, x̄₁=45,000, s₁=15,000. Sacramento: n₂=175, x̄₂=48,000, s₂=18,000. Use α=0.05.

Solution:

# Given information
n1 <- 212; x1_bar <- 45000; s1 <- 15000
n2 <- 175; x2_bar <- 48000; s2 <- 18000
alpha <- 0.05

# Hypotheses:
# H₀: μ₁ = μ₂
# H₁: μ₁ ≠ μ₂

# Welch's t-test (unequal variances)
se_diff <- sqrt(s1^2/n1 + s2^2/n2)
t_stat <- (x1_bar - x2_bar) / se_diff

# Degrees of freedom (Welch's formula)
df <- ((s1^2/n1 + s2^2/n2)^2) / 
      ((s1^2/n1)^2/(n1-1) + (s2^2/n2)^2/(n2-1))

p_value <- 2 * pt(-abs(t_stat), df)

cat("t-statistic:", t_stat, "\n")
cat("degrees of freedom:", df, "\n")
cat("p-value:", p_value, "\n")
# t-statistic: -1.76
# degrees of freedom: 338.8
# p-value: 0.080

# Confidence interval
margin_error <- qt(0.975, df) * se_diff
ci_lower <- (x1_bar - x2_bar) - margin_error
ci_upper <- (x1_bar - x2_bar) + margin_error

cat("95% CI for difference: [", ci_lower, ",", ci_upper, "]\n")
# 95% CI for difference: [ -6357, 357 ]

Question 11

A weight loss program has before/after data for 10 participants. Test if the program is effective at α = 0.05.

Solution:

# Data
before <- c(65, 60, 63, 58, 70, 72, 80, 81, 83, 69)
after <- c(55, 52, 60, 55, 60, 66, 70, 66, 63, 60)

# Hypotheses:
# H₀: μd = 0
# H₁: μd < 0 (weight decreases)

# Paired t-test
differences <- after - before
d_bar <- mean(differences)
s_d <- sd(differences)
n <- length(differences)

t_stat <- d_bar / (s_d / sqrt(n))
p_value <- pt(t_stat, df = n - 1)

cat("Mean difference:", d_bar, "\n")
cat("t-statistic:", t_stat, "\n")
cat("p-value:", p_value, "\n")
# Mean difference: -9.4
# t-statistic: -5.75
# p-value: 0.000145

# Decision
if (p_value < 0.05) {
  print("Reject H₀ - Program is effective")
} else {
  print("Fail to reject H₀")
}

# Confidence interval
t.test(after, before, paired = TRUE, 
       alternative = "less", conf.level = 0.95)$conf.int

15.6 Chi-Square Tests

Question 12

Test if species and petal width category are independent in the iris dataset at α = 0.05.

Solution:

# Prepare data
iris_subset <- iris[iris$Species != "virginica", ]
iris_subset$Species <- droplevels(iris_subset$Species)  # drop the unused level
iris_subset$Petal.Width.Cat <- cut(iris_subset$Petal.Width, 
                                   breaks = quantile(iris_subset$Petal.Width, 
                                                    probs = c(0, 0.5, 1)), 
                                   labels = c("Below", "Above"),
                                   include.lowest = TRUE)

# Create contingency table
cont_table <- table(iris_subset$Species, iris_subset$Petal.Width.Cat)
print(cont_table)

# Chi-square test
chi_result <- chisq.test(cont_table)
print(chi_result)

# Expected frequencies
print(chi_result$expected)

# Decision
if (chi_result$p.value < 0.05) {
  print("Reject H₀ - Variables are dependent")
} else {
  print("Fail to reject H₀")
}

15.7 Sample Size Determination

Question 13

How many customers should be surveyed to estimate satisfaction proportion within 3% margin of error with 95% confidence? Assume no prior estimate of p.

Solution:

# Given information
margin_error <- 0.03
conf_level <- 0.95
p_estimate <- 0.5  # Conservative estimate

# Calculate sample size
z_star <- qnorm((1 + conf_level) / 2)
n <- ceiling((z_star^2 * p_estimate * (1 - p_estimate)) / margin_error^2)

print(n)  # 1068

# Function for sample size calculation
sample_size_prop <- function(margin_error, conf_level = 0.95, p = 0.5) {
  z_star <- qnorm((1 + conf_level) / 2)
  n <- ceiling((z_star^2 * p * (1 - p)) / margin_error^2)
  return(n)
}

# With different confidence levels
cat("90% confidence:", sample_size_prop(0.03, 0.90), "\n")
cat("95% confidence:", sample_size_prop(0.03, 0.95), "\n")
cat("99% confidence:", sample_size_prop(0.03, 0.99), "\n")

15.8 Comprehensive Practice Problem

Question 14

A study examines the effect of a new medicine on cholesterol levels. Data includes 50 patients with measurements before and after treatment. Perform a complete analysis including descriptive statistics, visualization, hypothesis testing, and confidence intervals.

Solution:

# Simulate data
set.seed(123)
n <- 50
before <- rnorm(n, mean = 220, sd = 30)
after <- before - rnorm(n, mean = 25, sd = 15)  # Treatment effect

cholesterol_data <- data.frame(
  patient_id = 1:n,
  before = before,
  after = after,
  difference = after - before
)

# 1. Descriptive Statistics
desc_stats <- cholesterol_data %>%
  summarize(
    mean_before = mean(before),
    sd_before = sd(before),
    mean_after = mean(after),
    sd_after = sd(after),
    mean_diff = mean(difference),
    sd_diff = sd(difference)
  )

print(desc_stats)

# 2. Visualization
library(ggplot2)
library(gridExtra)

p1 <- ggplot(cholesterol_data) +
  geom_histogram(aes(x = difference), bins = 15, 
                 fill = "lightblue", color = "black") +
  geom_vline(xintercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Distribution of Cholesterol Changes",
       x = "Change in Cholesterol (mg/dL)",
       y = "Count")

p2 <- ggplot(cholesterol_data, aes(x = before, y = after)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, 
              linetype = "dashed", color = "red") +
  labs(title = "Before vs After Treatment",
       x = "Before Treatment (mg/dL)",
       y = "After Treatment (mg/dL)")

grid.arrange(p1, p2, ncol = 2)

# 3. Hypothesis Testing
# H₀: μd = 0, H₁: μd < 0 (treatment decreases cholesterol)

# Traditional approach
t_result <- t.test(cholesterol_data$after, 
                   cholesterol_data$before,
                   paired = TRUE,
                   alternative = "less")
print(t_result)

# Simulation approach
obs_mean_diff <- mean(cholesterol_data$difference)

null_dist <- cholesterol_data %>%
  specify(response = difference) %>%
  hypothesize(null = "point", mu = 0) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "mean")

p_value_sim <- null_dist %>%
  get_p_value(obs_stat = obs_mean_diff, direction = "less")

# 4. Confidence Intervals
# Traditional CI
ci_traditional <- t.test(cholesterol_data$difference)$conf.int

# Bootstrap CI
boot_dist <- cholesterol_data %>%
  specify(response = difference) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "mean")

ci_bootstrap <- boot_dist %>%
  get_ci(level = 0.95, type = "percentile")

# 5. Effect Size
library(effsize)
d <- cohen.d(cholesterol_data$after, 
             cholesterol_data$before, 
             paired = TRUE)

# 6. Complete Report
cat("CHOLESTEROL REDUCTION ANALYSIS\n")
cat("==============================\n\n")
cat("Sample size:", n, "\n")
cat("Mean reduction:", round(mean(cholesterol_data$difference), 2), "mg/dL\n")
cat("Standard deviation:", round(sd(cholesterol_data$difference), 2), "\n")
cat("\nHypothesis Test Results:\n")
cat("t-statistic:", round(t_result$statistic, 3), "\n")
cat("p-value:", format.pval(t_result$p.value), "\n")
cat("Decision: Reject H₀ at α = 0.05\n")
cat("\n95% Confidence Intervals:\n")
cat("Traditional:", round(ci_traditional[1], 2), "to", 
    round(ci_traditional[2], 2), "\n")
cat("Bootstrap:", round(ci_bootstrap$lower_ci, 2), "to", 
    round(ci_bootstrap$upper_ci, 2), "\n")
cat("\nEffect Size (Cohen's d):", round(d$estimate, 2), "\n")
cat("\nConclusion: The medicine significantly reduces cholesterol levels.")

15.9 Exam-Style Questions

Question 15 (Multiple Choice)

Which of the following increases the width of a confidence interval? a) Increasing sample size b) Decreasing confidence level c) Increasing population variance d) Decreasing margin of error

Answer: c) Increasing population variance

Explanation: Width = 2 × margin of error = 2 × z* × σ/√n. Increasing σ increases width.
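Each option can be checked numerically; a quick sketch with arbitrary illustrative values:

```r
# Width of a z-based CI for a mean
ci_width <- function(sigma, n, conf = 0.95) {
  2 * qnorm((1 + conf) / 2) * sigma / sqrt(n)
}

ci_width(sigma = 10, n = 25)               # baseline
ci_width(sigma = 20, n = 25)               # larger sigma -> wider (option c)
ci_width(sigma = 10, n = 100)              # larger n -> narrower (rules out a)
ci_width(sigma = 10, n = 25, conf = 0.90)  # lower confidence -> narrower (rules out b)
```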

Question 16 (Short Answer)

Explain the difference between Type I and Type II errors in hypothesis testing.

Answer:

  • Type I Error: Rejecting H₀ when it’s actually true (false positive)
  • Type II Error: Failing to reject H₀ when it’s actually false (false negative)
  • Type I error rate = α (significance level)
  • Type II error rate = β (related to power: Power = 1 - β)
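Both error rates can be estimated by simulation; a sketch where the true means (0 and 0.5), n = 30, and α = 0.05 are arbitrary choices:

```r
set.seed(123)

# Type I error: H0 is true (mu = 0); how often do we falsely reject?
type1 <- mean(replicate(5000, t.test(rnorm(30, mean = 0))$p.value < 0.05))

# Type II error: H0 is false (mu = 0.5); how often do we fail to reject?
type2 <- mean(replicate(5000, t.test(rnorm(30, mean = 0.5))$p.value >= 0.05))

type1      # should land close to alpha = 0.05
1 - type2  # empirical power
```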

Question 17 (Calculation)

Given: X ~ N(μ, σ²), sample of n=16, x̄=50, s=8. Test H₀: μ=45 vs H₁: μ≠45 at α=0.05.

Solution:

# Given information
n <- 16
x_bar <- 50
s <- 8
mu_0 <- 45
alpha <- 0.05

# Test statistic
t_stat <- (x_bar - mu_0) / (s / sqrt(n))
df <- n - 1
p_value <- 2 * pt(-abs(t_stat), df)

cat("t =", t_stat, "\n")        # t = 2.5
cat("df =", df, "\n")           # df = 15
cat("p-value =", p_value, "\n") # p-value = 0.0247

# Critical value
t_critical <- qt(1 - alpha/2, df)
cat("Critical value:", t_critical, "\n")  # 2.131

# Decision
if (abs(t_stat) > t_critical) {
  print("Reject H₀")
} else {
  print("Fail to reject H₀")
}
# Reject H₀

Part 10: Appendix & Reference

16. Appendix and Reference Materials

16.1 Statistical Tables

Standard Normal (Z) Table

Common critical values:

Confidence Level    α      Two-tailed z*   One-tailed z*
90%                 0.10   1.645           1.282
95%                 0.05   1.960           1.645
99%                 0.01   2.576           2.326

t-Distribution Critical Values

Selected two-tailed critical values for common degrees of freedom:

df    α = 0.10   α = 0.05   α = 0.01
5     2.015      2.571      4.032
10    1.812      2.228      3.169
15    1.753      2.131      2.947
20    1.725      2.086      2.845
30    1.697      2.042      2.750
∞     1.645      1.960      2.576
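Rather than consulting printed tables, both sets of critical values can be reproduced directly with qnorm() and qt():

```r
# z* critical values
conf <- c(0.90, 0.95, 0.99)
data.frame(
  conf_level = conf,
  two_tailed = qnorm(1 - (1 - conf) / 2),
  one_tailed = qnorm(conf)
)

# Two-tailed t critical values; df = Inf recovers the z values
df <- c(5, 10, 15, 20, 30, Inf)
round(outer(df, c(0.10, 0.05, 0.01), function(d, a) qt(1 - a / 2, d)), 3)
```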

16.2 R Function Reference

Base R Statistical Functions

# Probability distributions
dnorm(x, mean, sd)     # Normal density
pnorm(q, mean, sd)     # Normal cumulative probability
qnorm(p, mean, sd)     # Normal quantile
rnorm(n, mean, sd)     # Generate normal random values

# Similar functions exist for:
# t-distribution: dt(), pt(), qt(), rt()
# Chi-square: dchisq(), pchisq(), qchisq(), rchisq()
# Binomial: dbinom(), pbinom(), qbinom(), rbinom()
# Poisson: dpois(), ppois(), qpois(), rpois()

# Hypothesis tests
t.test()               # t-tests (one-sample, two-sample, paired)
prop.test()            # Proportion tests
chisq.test()           # Chi-square test
var.test()             # F-test for variance
wilcox.test()          # Wilcoxon tests

# Confidence intervals
confint()              # Generic confidence interval function

dplyr Package Functions

# Data manipulation
filter()               # Filter rows
select()               # Select columns
mutate()               # Create new variables
summarize()            # Compute summaries
group_by()             # Group data
arrange()              # Sort data
%>%                    # Pipe operator

infer Package Workflow

# Statistical inference pipeline
specify()              # Specify variables
hypothesize()          # State null hypothesis
generate()             # Generate null distribution
calculate()            # Calculate statistics
visualize()            # Visualize results
get_p_value()         # Extract p-value
get_confidence_interval() # Extract CI
shade_p_value()       # Shade p-value on plot
shade_confidence_interval() # Shade CI on plot

16.3 Formula Sheet

Descriptive Statistics

  • Mean: \(\bar{x} = \frac{\sum x_i}{n}\)
  • Variance: \(s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1}\)
  • Standard Deviation: \(s = \sqrt{s^2}\)

Standard Error Formulas

  • Mean: \(SE(\bar{X}) = \frac{\sigma}{\sqrt{n}}\) or \(\frac{s}{\sqrt{n}}\)
  • Proportion: \(SE(\hat{p}) = \sqrt{\frac{p(1-p)}{n}}\) or \(\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\)
  • Difference in Means: \(SE(\bar{X}_1 - \bar{X}_2) = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}\)
  • Difference in Proportions: \(SE(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}\)
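Each formula translates one-to-one into R; the summary statistics below are hypothetical placeholders:

```r
# Hypothetical inputs
s <- 8; n <- 25
p_hat <- 0.4
s1 <- 8; n1 <- 25; s2 <- 10; n2 <- 30
p1 <- 0.4; p2 <- 0.5

se_mean       <- s / sqrt(n)                                      # SE of a mean
se_prop       <- sqrt(p_hat * (1 - p_hat) / n)                    # SE of a proportion
se_diff_means <- sqrt(s1^2 / n1 + s2^2 / n2)                      # SE of a difference in means
se_diff_props <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)    # SE of a difference in proportions
```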

Confidence Intervals

  • General form: \(\text{Point Estimate} \pm \text{Critical Value} \times \text{Standard Error}\)
  • Mean (σ known): \(\bar{x} \pm z_{\alpha/2} \times \frac{\sigma}{\sqrt{n}}\)
  • Mean (σ unknown): \(\bar{x} \pm t_{\alpha/2,n-1} \times \frac{s}{\sqrt{n}}\)
  • Proportion: \(\hat{p} \pm z_{\alpha/2} \times \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\)

Test Statistics

  • Z-test: \(Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}\)
  • t-test: \(t = \frac{\bar{X} - \mu_0}{s/\sqrt{n}}\)
  • Proportion Z-test: \(Z = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}}\)
  • Chi-square: \(\chi^2 = \sum \frac{(O - E)^2}{E}\)

Sample Size Formulas

  • For mean: \(n = \left(\frac{z_{\alpha/2} \times \sigma}{E}\right)^2\)
  • For proportion: \(n = \left(\frac{z_{\alpha/2}}{E}\right)^2 \times p(1-p)\)
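A companion function for the mean case, mirroring sample_size_prop() from Question 13 (the example inputs are illustrative):

```r
sample_size_mean <- function(margin_error, sigma, conf_level = 0.95) {
  z_star <- qnorm((1 + conf_level) / 2)
  ceiling((z_star * sigma / margin_error)^2)  # round up to a whole subject
}

# e.g. estimate a mean within 2 units, assuming sigma = 15
sample_size_mean(margin_error = 2, sigma = 15)  # 217
```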

16.4 Common R Programming Patterns

Data Import

# CSV files
data <- read.csv("filename.csv")
data <- read_csv("filename.csv")  # tidyverse

# Excel files
library(readxl)
data <- read_excel("filename.xlsx")

# Tab-delimited files
data <- read.delim("filename.txt")

Data Cleaning

# Remove missing values
clean_data <- na.omit(data)

# Filter out specific values
filtered_data <- data %>%
  filter(!is.na(variable)) %>%
  filter(variable > 0)

# Create new variables
data <- data %>%
  mutate(new_var = old_var * 2,
         category = ifelse(value > threshold, "High", "Low"))

Visualization Templates

# Histogram with normal curve
ggplot(data, aes(x = variable)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30) +
  stat_function(fun = dnorm, 
                args = list(mean = mean(data$variable), 
                            sd = sd(data$variable)),
                color = "red", linewidth = 1)

# Box plot with points
ggplot(data, aes(x = group, y = value)) +
  geom_boxplot() +
  geom_jitter(width = 0.2, alpha = 0.5)

# Before-after plot
ggplot(data, aes(x = before, y = after)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, 
              linetype = "dashed", color = "red")

16.5 Quick Reference Guide

Choosing the Right Test

| Scenario              | Parametric Test            | Non-parametric Alternative |
|-----------------------|----------------------------|----------------------------|
| One sample mean       | One-sample t-test          | Wilcoxon signed-rank test  |
| Two independent means | Independent t-test         | Mann-Whitney U test        |
| Two paired means      | Paired t-test              | Wilcoxon signed-rank test  |
| One proportion        | Z-test for proportion      | Binomial test              |
| Two proportions       | Z-test for two proportions | Chi-square test            |
| Multiple categories   | Chi-square test            | Fisher's exact test        |
| Variance              | F-test                     | Levene's test              |
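For instance, the parametric and non-parametric options for two independent means run side by side on the same (simulated) data:

```r
# Two independent groups (hypothetical data)
set.seed(1)
a <- rnorm(20, mean = 10, sd = 2)
b <- rnorm(20, mean = 12, sd = 2)

p_t <- t.test(a, b)$p.value        # parametric: independent t-test
p_w <- wilcox.test(a, b)$p.value   # non-parametric: Mann-Whitney U test
c(t = p_t, wilcoxon = p_w)
```

When the two p-values agree, the normality assumption is unlikely to be driving the conclusion; when they disagree, inspect the distributions before choosing.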

Assumptions Checklist

  1. Independence: Observations are independent of one another
  2. Normality: Data follow a normal distribution
    • Check with: histogram, Q-Q plot, Shapiro-Wilk test
  3. Equal variances: Groups have similar spread
    • Check with: F-test, Levene’s test
  4. Random sampling: Data were collected via random sampling
  5. Large enough sample: n ≥ 30 for the CLT to apply
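The normality and equal-variance checks from the list above can be run in base R on simulated data (car::leveneTest() is an alternative to the F-test if the `car` package is installed):

```r
set.seed(7)
g1 <- rnorm(30)
g2 <- rnorm(30, sd = 1.5)

# Normality: Shapiro-Wilk (a large p-value gives no evidence against normality)
p_shapiro <- shapiro.test(g1)$p.value

# Normality: Q-Q plot (points should fall near the reference line)
qqnorm(g1); qqline(g1)

# Equal variances: F-test comparing the two group variances
p_var <- var.test(g1, g2)$p.value

c(shapiro = p_shapiro, f_test = p_var)
```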

16.6 Troubleshooting Guide

Common Errors and Solutions

Error: “object not found”

# Solution: Check spelling, ensure object exists
ls()  # List all objects in environment

Error: “non-numeric argument to binary operator”

# Solution: Convert to numeric
data$variable <- as.numeric(data$variable)

Error: “missing values in object”

# Solution: Handle NA values
mean(data$variable, na.rm = TRUE)
complete_cases <- na.omit(data)

Error: Package not found

# Solution: Install package
install.packages("package_name")
library(package_name)

16.7 Best Practices Checklist

Before Analysis

  • Check data quality and completeness
  • Explore data visually
  • Verify assumptions
  • Define hypotheses clearly
  • Choose appropriate significance level

During Analysis

  • Document all steps
  • Use both traditional and simulation methods when possible
  • Check for outliers and influential points
  • Validate results with different approaches
  • Calculate effect sizes

After Analysis

  • Interpret results in context
  • Report both statistical and practical significance
  • Include confidence intervals
  • Create clear visualizations
  • Write comprehensive conclusions

16.8 Additional Resources

Online Resources

  1. “Modern Dive: Statistical Inference via Data Science” - Ismay & Kim
  2. “Introduction to Statistical Learning with R” - James et al.
  3. “R for Data Science” - Wickham & Grolemund
  4. “OpenIntro Statistics” - Diez, Barr, & Çetinkaya-Rundel

R Packages for Statistical Analysis

# Core packages
install.packages(c("tidyverse", "infer", "moderndive"))

# Additional useful packages
install.packages(c(
  "broom",      # Tidy model outputs
  "car",        # Regression diagnostics
  "effsize",    # Effect size calculations
  "pwr",        # Power analysis
  "ggpubr",     # Publication-ready plots
  "rstatix"     # Pipe-friendly statistics
))

16.9 Summary

We covered:

  1. Foundations: Probability, distributions, and R programming
  2. Estimation: Point estimates and confidence intervals
  3. Hypothesis Testing: One-sample and two-sample tests
  4. Modern Methods: Bootstrap and simulation approaches
  5. Practical Applications: Real-world case studies
  6. Best Practices: Reproducible research and clear reporting

Key takeaways:

  • Always start with exploratory data analysis
  • Check assumptions before applying tests
  • Use both traditional and modern methods
  • Focus on practical significance, not just statistical
  • Report results clearly and in context

See also

  • Regression

  • [[ statistics/Statistical Learning ]]
