Inferential statistics
Table of Contents
- Introduction to Statistical Inference
- Probability Review
- Discrete Probability Distributions
- Continuous Probability Distributions
1. Introduction to Statistical Inference
1.1 What is Statistical Inference?
Statistical inference is the process of drawing conclusions about a population based on information obtained from a sample. It allows us to make probabilistic statements about population parameters using sample statistics.
The two main branches of statistical inference are:
- Estimation: Using sample data to estimate population parameters
- Hypothesis Testing: Using sample data to test claims about population parameters
1.2 Population vs Sample
Population
A population is the complete set of all items of interest in a statistical study. Population parameters are typically denoted by Greek letters:
- \(\mu\) (mu): Population mean
- \(\sigma\) (sigma): Population standard deviation
- \(p\): Population proportion
Sample
A sample is a subset of the population that is actually observed. Sample statistics are typically denoted by Roman letters:
- \(\bar{x}\) (x-bar): Sample mean
- \(s\): Sample standard deviation
- \(\hat{p}\) (p-hat): Sample proportion
Why Sample?
- Cost efficiency
- Time constraints
- Practical impossibility of measuring entire population
- Destructive testing scenarios
1.3 Data Science Workflow
The modern data science workflow for inferential statistics follows these steps:
- Import: Bring data into R
- Tidy: Organize data into tidy format
- Transform: Manipulate data for analysis
- Visualize: Create plots to understand data
- Model: Apply statistical models
- Communicate: Share results and insights
1.4 Traditional R vs Tidyverse Approach
Traditional R
# Traditional approach
asia <- gapminder[gapminder$continent == "Asia", ]
mean(asia$lifeExp)
Tidyverse Approach
# Tidyverse approach
library(dplyr)
gapminder %>%
  filter(continent == "Asia") %>%
  summarize(mean_exp = mean(lifeExp))
The tidyverse offers:
- More readable code with the pipe operator (%>%)
- Consistent grammar of data manipulation
- Integration with modern visualization tools
2. Probability Review
2.1 Random Variables
A random variable is a real-valued function defined on a sample space. For each outcome \(s\) in the sample space \(S\), a random variable \(X\) assigns a real number \(X(s)\).
Types of Random Variables
- Discrete Random Variable: Can take a countable number of values
  - Example: Number of heads in 10 coin flips
  - Example: Number of defective items in a batch
- Continuous Random Variable: Can take any value within an interval
  - Example: Height of students
  - Example: Time to complete an exam
2.2 Probability Distributions
Discrete Probability Distribution
For a discrete random variable \(X\), the probability distribution is defined by the probability mass function (PMF):
\[P(X = x) = p(x)\]
Properties:
- \(p(x) \geq 0\) for all values of \(x\)
- \[\sum p(x) = 1\]
Continuous Probability Distribution
For a continuous random variable \(X\), the probability distribution is defined by the probability density function (PDF) \(f(x)\).
For an interval \((a, b]\): \(P(a < X \leq b) = \int_a^b f(x)dx\)
Properties:
- \(f(x) \geq 0\) for all \(x\)
- \[\int_{-\infty}^{\infty} f(x)dx = 1\]
2.3 Expected Value and Variance
Expected Value (Mean)
The expected value represents the long-run average value of a random variable.
For discrete random variables: \(E(X) = \mu = \sum x \cdot p(x)\)
For continuous random variables: \(E(X) = \mu = \int_{-\infty}^{\infty} x \cdot f(x)dx\)
Variance
Variance measures the spread of a distribution around its mean.
For discrete random variables: \(Var(X) = \sigma^2 = E[(X - \mu)^2] = \sum (x - \mu)^2 \cdot p(x)\)
For continuous random variables: \(Var(X) = \sigma^2 = E[(X - \mu)^2] = \int_{-\infty}^{\infty} (x - \mu)^2 \cdot f(x)dx\)
Standard Deviation
The standard deviation is the square root of the variance: \(\sigma = \sqrt{Var(X)}\)
3. Discrete Probability Distributions
3.1 Binomial Distribution
The binomial distribution models the number of successes in a fixed number of independent trials.
Conditions for Binomial Distribution
- Fixed number of trials (\(n\))
- Two possible outcomes: Success or Failure
- Constant probability of success (\(p\)) for each trial
- Independent trials
Probability Mass Function
\[P(X = x) = \binom{n}{x}p^x(1-p)^{n-x}\]where:
- \(x\) = number of successes
- \(n\) = number of trials
- \(p\) = probability of success
- \(\binom{n}{x} = \frac{n!}{x!(n-x)!}\) (binomial coefficient)
Mean and Variance
- Mean: \(\mu = np\)
- Variance: \(\sigma^2 = np(1-p)\)
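As a quick sketch in base R (using illustrative values n = 10 and p = 0.4), dbinom evaluates the PMF and the moment formulas can be checked directly:

```r
# Binomial PMF: P(X = 6) with n = 10 trials, success probability p = 0.4
dbinom(6, size = 10, prob = 0.4)   # choose(10, 6) * 0.4^6 * 0.6^4 ≈ 0.1115

# Moment formulas
n <- 10; p <- 0.4
n * p             # mean: np = 4
n * p * (1 - p)   # variance: np(1 - p) = 2.4
```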
Example
If 40% of a class is female, what is the probability that exactly 6 of the first 10 students walking in will be female?
\[P(X = 6) = \binom{10}{6}(0.4)^6(0.6)^4 = 210 \times 0.004096 \times 0.1296 = 0.1115\]
3.2 Poisson Distribution
The Poisson distribution models the number of events occurring in a fixed interval of time or space.
Conditions for Poisson Distribution
- Events occur independently
- Average rate of occurrence (\(\lambda\)) is constant
- Two events cannot occur at exactly the same instant
Probability Mass Function
\[P(X = x) = \frac{e^{-\lambda}\lambda^x}{x!}\]
where:
- \(x\) = number of events
- \(\lambda\) = average rate of occurrence
- \(e\) = Euler’s number (≈ 2.71828)
Mean and Variance
- Mean: \(\mu = \lambda\)
- Variance: \(\sigma^2 = \lambda\)
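A minimal check in base R, assuming an illustrative rate of \(\lambda = 3\): dpois gives the PMF, and the mean and variance of a large simulated sample should both land near \(\lambda\).

```r
# Poisson PMF: P(X = 5) with average rate lambda = 3
dpois(5, lambda = 3)   # e^{-3} * 3^5 / 5! ≈ 0.1008

# Mean and variance of a large Poisson sample are both close to lambda
set.seed(1)
x <- rpois(1e5, lambda = 3)
mean(x)   # near 3
var(x)    # also near 3
```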
Example
If there are an average of 3 striped trout per 100 yards in a stream, what is the probability of seeing exactly 5 striped trout in the next 100 yards?
\[P(X = 5) = \frac{e^{-3} \times 3^5}{5!} = \frac{0.0498 \times 243}{120} = 0.1008\]
4. Continuous Probability Distributions
4.1 Normal Distribution
The normal distribution is the most important continuous distribution in statistics, often called the “bell curve.”
Probability Density Function
\[f(x) = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}\]
where:
- \(\mu\) = mean
- \(\sigma\) = standard deviation
- \(\pi\) ≈ 3.14159
- \(e\) ≈ 2.71828
Properties
- Symmetric about the mean
- Mean = Median = Mode
- Total area under the curve = 1
- Approximately 68% of data falls within ±1σ of μ
- Approximately 95% of data falls within ±2σ of μ
- Approximately 99.7% of data falls within ±3σ of μ
Notation
\(X \sim N(\mu, \sigma^2)\) means “\(X\) follows a normal distribution with mean \(\mu\) and variance \(\sigma^2\)”
4.2 Standard Normal Distribution
The standard normal distribution is a special case of the normal distribution with \(\mu = 0\) and \(\sigma = 1\).
Z-Score Transformation
Any normal random variable \(X \sim N(\mu, \sigma^2)\) can be standardized: \(Z = \frac{X - \mu}{\sigma}\)
where \(Z \sim N(0, 1)\)
Using Z-Tables
To find probabilities for normal distributions:
- Convert to z-score
- Use standard normal table or R functions
Example
If gestational length \(X \sim N(39, 2^2)\) weeks, what percentage of gestations are less than 40 weeks?
\[Z = \frac{40 - 39}{2} = 0.5\]
Using the standard normal table: \(P(Z \leq 0.5) = 0.6915\)
Therefore, approximately 69.15% of gestations are less than 40 weeks.
R Implementation
# Using pnorm for normal probabilities
pnorm(40, mean = 39, sd = 2) # Direct calculation
pnorm(0.5) # Using z-score
Part 2: Sampling and Point Estimation
5. Sampling and Sampling Distributions
5.1 Introduction to Sampling
Sampling is the process of selecting a subset of individuals from a population to estimate characteristics of the whole population.
Why Sample?
- Census limitations: Measuring entire populations is often impractical
- Cost efficiency: Samples are less expensive to collect
- Time constraints: Samples provide quicker results
- Destructive testing: Some measurements destroy the item being tested
Types of Sampling
- Random Sampling: Every member of the population has an equal chance of selection
- Stratified Sampling: Population divided into strata, then random samples from each
- Cluster Sampling: Population divided into clusters, some clusters randomly selected
- Systematic Sampling: Select every kth element from a list
5.2 Sampling With and Without Replacement
Sampling With Replacement
- Each selected item is returned to the population before the next draw
- Same item can be selected multiple times
- Observations are independent
Sampling Without Replacement
- Selected items are not returned to the population
- Same item cannot be selected twice
- Observations are not independent
R Implementation
# Sampling with replacement
sample(1:10, size = 5, replace = TRUE)
# Sampling without replacement
sample(1:10, size = 5, replace = FALSE)
5.3 Sampling Distributions
A sampling distribution is the probability distribution of a statistic obtained from all possible samples of a specific size from a population.
Key Concepts
- Sample Statistic: A value calculated from sample data (e.g., \(\bar{x}\), \(\hat{p}\))
- Population Parameter: The true value in the population (e.g., \(\mu\), \(p\))
- Sampling Variability: Different samples yield different statistics
The Bowl Example
The classic bowl activity demonstrates sampling concepts:
- Bowl contains red and white balls (unknown proportion)
- Take samples of 50 balls using a “shovel”
- Calculate proportion of red balls in each sample
- Observe variation in sample proportions
library(moderndive)
library(tidyverse)
# Virtual bowl sampling
bowl %>%
  rep_sample_n(size = 50, reps = 1000) %>%
  group_by(replicate) %>%
  summarize(prop_red = mean(color == "red"))
5.4 Central Limit Theorem
The Central Limit Theorem (CLT) is fundamental to statistical inference:
Statement of CLT
For a random sample of size \(n\) from a population with mean \(\mu\) and standard deviation \(\sigma\):
- The sampling distribution of \(\bar{X}\) has mean \(\mu_{\bar{X}} = \mu\)
- The sampling distribution of \(\bar{X}\) has standard deviation \(\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}\)
- As \(n\) increases, the sampling distribution of \(\bar{X}\) approaches a normal distribution
Mathematical Expression
\(\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)\) for large \(n\)
Implications
- Sample means are normally distributed for large samples
- Larger samples produce more precise estimates
- We can make probability statements about sample means
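A small simulation in base R illustrates all three claims, assuming an exponential population (strongly skewed, with \(\mu = 1\) and \(\sigma = 1\)) chosen for illustration:

```r
# CLT demo: distribution of sample means from a skewed population
set.seed(42)
n <- 50
sample_means <- replicate(10000, mean(rexp(n, rate = 1)))  # population mean = sd = 1

mean(sample_means)   # close to mu = 1
sd(sample_means)     # close to sigma / sqrt(n) = 1 / sqrt(50) ≈ 0.141
# A histogram of sample_means is approximately bell-shaped,
# even though the exponential population itself is strongly skewed.
```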
5.5 Standard Error
The standard error (SE) measures the variability of a sample statistic.
Standard Error of the Mean
\[SE(\bar{X}) = \frac{\sigma}{\sqrt{n}}\]
If \(\sigma\) is unknown, we estimate it with \(s\): \(SE(\bar{X}) = \frac{s}{\sqrt{n}}\)
Standard Error of a Proportion
\[SE(\hat{p}) = \sqrt{\frac{p(1-p)}{n}}\]
If \(p\) is unknown, we estimate it with \(\hat{p}\): \(SE(\hat{p}) = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\)
Effect of Sample Size
As \(n\) increases:
- Standard error decreases
- Estimates become more precise
- Sampling distribution becomes narrower
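The \(1/\sqrt{n}\) shrinkage is easy to see numerically; here is a one-line check using an illustrative \(s = 15\):

```r
# Standard error of the mean shrinks like 1 / sqrt(n)
s <- 15
for (n in c(25, 100, 400)) {
  cat("n =", n, " SE =", s / sqrt(n), "\n")
}
# SE: 3, then 1.5, then 0.75 -- quadrupling n halves the standard error
```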
6. Point Estimation
6.1 Introduction to Point Estimation
A point estimate is a single value used to estimate a population parameter.
Common Point Estimates
- Sample mean (\(\bar{x}\)) estimates population mean (\(\mu\))
- Sample proportion (\(\hat{p}\)) estimates population proportion (\(p\))
- Sample variance (\(s^2\)) estimates population variance (\(\sigma^2\))
Properties of Good Estimators
- Unbiased: \(E(\hat{\theta}) = \theta\)
- Consistent: \(\hat{\theta}\) converges to \(\theta\) as \(n \to \infty\)
- Efficient: Minimum variance among unbiased estimators
6.2 Method of Moments
The method of moments equates sample moments to population moments to derive estimators.
First Moment (Mean)
Population: \(E(X) = \mu\) Sample: \(\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i\)
Second Moment (Variance)
Population: \(E(X^2) = \sigma^2 + \mu^2\) Sample: \(\frac{1}{n}\sum_{i=1}^n X_i^2\)
Example: Poisson Distribution
For \(X \sim \text{Poisson}(\lambda)\):
- \[E(X) = \lambda\]
- Method of moments estimator: \(\hat{\lambda} = \bar{X}\)
6.3 Maximum Likelihood Estimation (MLE)
Maximum likelihood estimation finds parameter values that maximize the probability of observing the sample data.
Likelihood Function
For independent observations \(x_1, x_2, ..., x_n\): \(L(\theta) = \prod_{i=1}^n f(x_i \mid \theta)\)
Log-Likelihood Function
\[\ell(\theta) = \log L(\theta) = \sum_{i=1}^n \log f(x_i \mid \theta)\]
Finding MLE
- Write the likelihood function
- Take the logarithm
- Differentiate with respect to \(\theta\)
- Set derivative equal to zero
- Solve for \(\theta\)
Example: Normal Distribution
For \(X_1, ..., X_n \sim N(\mu, \sigma^2)\):
Log-likelihood: \(\ell(\mu, \sigma^2) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n(x_i - \mu)^2\)
MLEs:
- \[\hat{\mu}_{MLE} = \bar{x}\]
- \[\hat{\sigma}^2_{MLE} = \frac{1}{n}\sum_{i=1}^n(x_i - \bar{x})^2\]
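A sketch in R comparing the closed-form MLEs against a direct numerical maximization of the log-likelihood (the simulated data and optim starting values are illustrative assumptions):

```r
# Normal MLEs: closed form vs. numerical maximization of the log-likelihood
set.seed(7)
x <- rnorm(200, mean = 10, sd = 2)
n <- length(x)

mu_hat     <- mean(x)
sigma2_hat <- sum((x - mu_hat)^2) / n   # divisor n, not n - 1 as in var()

# Numerical check: minimize the negative log-likelihood with optim()
# (variance is optimized on the log scale so it stays positive)
neg_loglik <- function(par) {
  -sum(dnorm(x, mean = par[1], sd = sqrt(exp(par[2])), log = TRUE))
}
fit <- optim(c(mean(x), log(var(x))), neg_loglik)
c(fit$par[1], exp(fit$par[2]))   # agrees with c(mu_hat, sigma2_hat)
```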
6.4 Examples of Point Estimation
Example 1: Binomial Distribution
For \(Y \sim \text{Binomial}(n, p)\):
- MLE: \(\hat{p} = \frac{Y}{n}\)
Example 2: Exponential Distribution
For \(X_1, ..., X_n \sim \text{Exponential}(\lambda)\):
- MLE: \(\hat{\lambda} = \frac{1}{\bar{x}}\)
Example 3: Poisson Distribution
For \(X_1, ..., X_n \sim \text{Poisson}(\mu)\):
- MLE: \(\hat{\mu} = \bar{x}\)
R Implementation
# Sample data
set.seed(123)
data <- rpois(100, lambda = 5)
# Method of moments estimate
mom_estimate <- mean(data)
# Maximum likelihood estimate (same for Poisson)
mle_estimate <- mean(data)
print(paste("Estimate:", mle_estimate))
6.5 Sampling Distribution of Estimators
The sampling distribution of an estimator shows how the estimator varies across all possible samples.
For Sample Mean
If \(X_1, ..., X_n \sim N(\mu, \sigma^2)\): \(\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)\)
For Sample Proportion
If \(n\) is large: \(\hat{p} \sim N\left(p, \frac{p(1-p)}{n}\right)\)
Bootstrap Distribution
When theoretical distributions are unknown, we can use bootstrap methods:
library(infer)
# Generate bootstrap distribution
boot_dist <- data %>%
  specify(response = variable) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "mean")
# Visualize
boot_dist %>% visualize()
6.6 Properties of Estimators
Bias
The bias of an estimator \(\hat{\theta}\) is: \(\text{Bias}(\hat{\theta}) = E(\hat{\theta}) - \theta\)
An estimator is unbiased if \(\text{Bias}(\hat{\theta}) = 0\).
Mean Squared Error (MSE)
\[\text{MSE}(\hat{\theta}) = E[(\hat{\theta} - \theta)^2] = \text{Var}(\hat{\theta}) + [\text{Bias}(\hat{\theta})]^2\]
Consistency
An estimator \(\hat{\theta}_n\) is consistent if: \(\lim_{n \to \infty} P(|\hat{\theta}_n - \theta| > \epsilon) = 0\) for any \(\epsilon > 0\)
Efficiency
Among unbiased estimators, the one with minimum variance is called efficient.
The relative efficiency of two estimators is: \(\text{RE}(\hat{\theta}_1, \hat{\theta}_2) = \frac{\text{Var}(\hat{\theta}_2)}{\text{Var}(\hat{\theta}_1)}\)
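Bias can be demonstrated by simulation; a sketch (assuming a standard normal population and a deliberately small n) comparing the divisor-n variance estimator with the usual divisor-(n-1) version:

```r
# Bias in action: the divisor-n variance estimator is biased low
set.seed(123)
n <- 5
var_n  <- replicate(20000, { x <- rnorm(n); sum((x - mean(x))^2) / n })
var_n1 <- replicate(20000, { x <- rnorm(n); var(x) })   # divisor n - 1

mean(var_n)    # about (n-1)/n * sigma^2 = 0.8  (biased)
mean(var_n1)   # about sigma^2 = 1              (unbiased)
```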
Part 3: Confidence Intervals - Theory
7. Introduction to Confidence Intervals
7.1 What is a Confidence Interval?
A confidence interval (CI) is a range of values that likely contains the true population parameter with a specified level of confidence.
Key Components
- Point estimate: Best single estimate of the parameter
- Margin of error: How far the interval extends from the point estimate
- Confidence level: Probability that the method produces an interval containing the true parameter
General Form
\[\text{CI} = \text{Point Estimate} \pm \text{Margin of Error}\]
Interpretation
A 95% confidence interval means that if we repeated the sampling process many times and calculated a CI each time, approximately 95% of these intervals would contain the true population parameter.
7.2 The Traditional (Standard Error) Method
The traditional method for constructing confidence intervals relies on:
- The sampling distribution of the statistic
- The standard error
- A critical value from the appropriate distribution
General Formula
\[\text{CI} = \hat{\theta} \pm z_{\alpha/2} \times SE(\hat{\theta})\]
where:
- \(\hat{\theta}\) = point estimate
- \(z_{\alpha/2}\) = critical value
- \(SE(\hat{\theta})\) = standard error
- \(\alpha\) = significance level (1 - confidence level)
Common Critical Values
| Confidence Level | \(\alpha\) | \(z_{\alpha/2}\) |
| ---------------- | ---------- | ---------------- |
| 90% | 0.10 | 1.645 |
| 95% | 0.05 | 1.960 |
| 99% | 0.01 | 2.576 |
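These critical values come straight from the standard normal quantile function in base R:

```r
# z critical values z_{alpha/2} via the standard normal quantile function
qnorm(1 - 0.10 / 2)   # 90% -> 1.645
qnorm(1 - 0.05 / 2)   # 95% -> 1.960
qnorm(1 - 0.01 / 2)   # 99% -> 2.576
```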
7.3 Confidence Interval for Population Mean
Case 1: Known Population Standard Deviation (\(\sigma\))
When \(\sigma\) is known and either:
- The population is normally distributed, or
- The sample size is large (\(n \geq 30\)),
the interval uses the standard normal critical value:
\[\text{CI} = \bar{x} \pm z_{\alpha/2} \times \frac{\sigma}{\sqrt{n}}\]
Case 2: Unknown Population Standard Deviation
When \(\sigma\) is unknown, we use the \(t\)-distribution:
\[\text{CI} = \bar{x} \pm t_{\alpha/2, n-1} \times \frac{s}{\sqrt{n}}\]
where:
- \(t_{\alpha/2, n-1}\) = critical value from \(t\)-distribution with \(n-1\) degrees of freedom
- \(s\) = sample standard deviation
Conditions for Using the t-Distribution
- Random sample
- Population is approximately normal, or \(n \geq 30\)
- Independence of observations
Example
Customer support response times: \(\bar{x} = 57\) minutes, \(s = 15\) minutes, \(n = 200\)
For a 95% CI:
\[SE = \frac{15}{\sqrt{200}} = 1.06\]
\[\text{Margin of Error} = 1.96 \times 1.06 = 2.08\]
\[\text{CI} = 57 \pm 2.08 = [54.92, 59.08]\]
We are 95% confident that the true mean response time is between 54.92 and 59.08 minutes.
R Implementation
# Using t.test function
data <- c(12, 16, 14, 18, 20, 10, 15, 14, 16, 13)
t.test(data, conf.level = 0.95)$conf.int
# Manual calculation
x_bar <- mean(data)
s <- sd(data)
n <- length(data)
se <- s / sqrt(n)
t_critical <- qt(0.975, df = n - 1)
margin_error <- t_critical * se
ci_lower <- x_bar - margin_error
ci_upper <- x_bar + margin_error
7.4 Confidence Interval for Population Proportion
For a population proportion \(p\), we use the sample proportion \(\hat{p}\):
\[\text{CI} = \hat{p} \pm z_{\alpha/2} \times \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\]
Conditions
- Random sample
- Independence
- Success-failure condition: \(n\hat{p} \geq 10\) and \(n(1-\hat{p}) \geq 10\)
Example
Customer satisfaction survey: 200 out of 250 customers are satisfied
\[\hat{p} = \frac{200}{250} = 0.80\]
\[SE = \sqrt{\frac{0.80 \times 0.20}{250}} = 0.0253\]
\[\text{Margin of Error} = 1.96 \times 0.0253 = 0.0496\]
\[\text{CI} = 0.80 \pm 0.0496 = [0.750, 0.850]\]
We are 95% confident that between 75% and 85% of all customers are satisfied.
R Implementation
# Using prop.test
prop.test(x = 200, n = 250, conf.level = 0.95)$conf.int
# Manual calculation
p_hat <- 200/250
se <- sqrt(p_hat * (1 - p_hat) / 250)
margin_error <- 1.96 * se
ci_lower <- p_hat - margin_error
ci_upper <- p_hat + margin_error
7.5 Margin of Error and Width of Confidence Intervals
Margin of Error Components
\[\text{Margin of Error} = \text{Critical Value} \times \text{Standard Error}\]
Factors Affecting CI Width
- Sample size: Larger \(n\) → narrower CI
- Confidence level: Higher confidence → wider CI
- Population variability: Larger \(\sigma\) → wider CI
Relationship with Sample Size
For means: \(\text{Width} \propto \frac{1}{\sqrt{n}}\)
Quadrupling the sample size cuts the CI width in half.
7.6 Sample Size Determination
For Population Mean
To achieve a margin of error \(E\) with confidence level \((1-\alpha)\):
\[n = \left(\frac{z_{\alpha/2} \times \sigma}{E}\right)^2\]
For Population Proportion
\[n = \left(\frac{z_{\alpha/2}}{E}\right)^2 \times p(1-p)\]
If \(p\) is unknown, use \(p = 0.5\) for the most conservative estimate.
Example
To estimate proportion of students who prefer games within 4% margin of error with 95% confidence:
\[n = \left(\frac{1.96}{0.04}\right)^2 \times 0.5 \times 0.5 = 600.25 \approx 601\]
7.7 Interpreting Confidence Intervals
Correct Interpretations
- “We are 95% confident that the true population mean lies between [lower, upper]”
- “If we repeated this procedure many times, 95% of the resulting intervals would contain the true parameter”
- “The method used produces intervals that contain the true parameter 95% of the time”
Incorrect Interpretations
- ❌ “There is a 95% probability that the true mean is in this interval”
- ❌ “95% of the data falls within this interval”
- ❌ “95% of sample means fall within this interval”
Key Points
- The parameter is fixed; the interval is random
- Confidence is in the method, not in a specific interval
- Once calculated, the interval either contains the parameter or it doesn’t
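The "confidence is in the method" idea can be checked by simulation: under assumed illustrative values of \(\mu\), \(\sigma\), and \(n\), build a t-interval from each of many samples and count how often the (fixed) parameter is captured.

```r
# Coverage check: repeat sampling and count intervals that capture mu
set.seed(2024)
mu <- 50; sigma <- 10; n <- 30
covered <- replicate(5000, {
  x  <- rnorm(n, mu, sigma)
  ci <- mean(x) + c(-1, 1) * qt(0.975, n - 1) * sd(x) / sqrt(n)
  ci[1] <= mu && mu <= ci[2]
})
mean(covered)   # close to the nominal 0.95
```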
7.8 Finite Population Correction
When sampling from a finite population without replacement, and the sample size is more than 5% of the population:
\[\text{FPC} = \sqrt{\frac{N-n}{N-1}}\]
The corrected standard error becomes: \(SE_{corrected} = SE \times \text{FPC}\)
Example
Population of 1,200 students, sample of 400: \(\text{FPC} = \sqrt{\frac{1200-400}{1200-1}} = 0.816\)
The standard error is reduced by approximately 18.4%.
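The correction from the worked example is a one-liner in R (the \(s = 15\) used to form the uncorrected SE is an illustrative assumption):

```r
# Finite population correction for N = 1200, n = 400
N <- 1200; n <- 400
fpc <- sqrt((N - n) / (N - 1))
fpc   # ≈ 0.816

s <- 15
se_plain     <- s / sqrt(n)
se_corrected <- se_plain * fpc   # roughly 18% smaller than se_plain
```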
7.9 Confidence Intervals in R
Using Base R
# For means
t.test(data, conf.level = 0.95)$conf.int
# For proportions
prop.test(x = successes, n = total, conf.level = 0.95)$conf.int
Using tidyverse and infer
library(infer)
# For means
data %>%
  specify(response = variable) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "mean") %>%
  get_confidence_interval(level = 0.95, type = "percentile")
# For proportions
data %>%
  specify(response = variable, success = "yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_confidence_interval(level = 0.95, type = "percentile")
7.10 Visual Representation
Confidence intervals can be visualized using:
library(ggplot2)
# Create data frame for plotting
ci_data <- data.frame(
  estimate = c(x_bar),
  lower = c(ci_lower),
  upper = c(ci_upper)
)
# Plot with error bars
ggplot(ci_data, aes(x = "Sample", y = estimate)) +
  geom_point(size = 3) +
  geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.2) +
  labs(y = "Estimate", title = "95% Confidence Interval")
Part 4: Confidence Intervals - Advanced & Bootstrap Methods
8. Two-Sample Confidence Intervals
8.1 Confidence Interval for Difference in Means
Independent Samples
When comparing means from two independent populations:
\[\text{CI} = (\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2, df} \times SE(\bar{x}_1 - \bar{x}_2)\]
where: \(SE(\bar{x}_1 - \bar{x}_2) = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}\)
Degrees of Freedom (Welch’s approximation)
\[df = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{(s_1^2/n_1)^2}{n_1-1} + \frac{(s_2^2/n_2)^2}{n_2-1}}\]
Example
Comparing incomes between Cleveland and Sacramento:
- Cleveland: \(\bar{x}_1 = 45,000\), \(s_1 = 15,000\), \(n_1 = 212\)
- Sacramento: \(\bar{x}_2 = 48,000\), \(s_2 = 18,000\), \(n_2 = 175\)
# Using t.test
t.test(cleveland_income, sacramento_income, conf.level = 0.95)$conf.int
# Manual calculation
x1_bar <- 45000; s1 <- 15000; n1 <- 212
x2_bar <- 48000; s2 <- 18000; n2 <- 175
se_diff <- sqrt(s1^2/n1 + s2^2/n2)
df <- ((s1^2/n1 + s2^2/n2)^2) / ((s1^2/n1)^2/(n1-1) + (s2^2/n2)^2/(n2-1))
t_critical <- qt(0.975, df)
margin_error <- t_critical * se_diff
ci_lower <- (x1_bar - x2_bar) - margin_error
ci_upper <- (x1_bar - x2_bar) + margin_error
Paired Samples
For paired data (e.g., before-after measurements):
- Calculate differences: \(d_i = x_{1i} - x_{2i}\)
- Find mean difference: \(\bar{d}\)
- Find standard deviation of differences: \(s_d\)
Example: Weight Loss Program
Before-after weights for 10 participants:
before <- c(65, 60, 63, 58, 70, 72, 80, 81, 83, 69)
after <- c(55, 52, 60, 55, 60, 66, 70, 66, 63, 60)
differences <- after - before
# Using t.test for paired data
t.test(after, before, paired = TRUE, conf.level = 0.95)$conf.int
# Manual calculation
d_bar <- mean(differences)
s_d <- sd(differences)
n <- length(differences)
se_d <- s_d / sqrt(n)
t_critical <- qt(0.975, df = n - 1)
margin_error <- t_critical * se_d
ci_lower <- d_bar - margin_error
ci_upper <- d_bar + margin_error
8.2 Confidence Interval for Difference in Proportions
For two independent proportions:
\[\text{CI} = (\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2} \times SE(\hat{p}_1 - \hat{p}_2)\]
where: \(SE(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}\)
Example
Comparing satisfaction rates between two products:
- Product A: 180 satisfied out of 200 (\(\hat{p}_1 = 0.90\))
- Product B: 150 satisfied out of 200 (\(\hat{p}_2 = 0.75\))
# Using prop.test
prop.test(x = c(180, 150), n = c(200, 200), conf.level = 0.95)$conf.int
# Manual calculation
p1_hat <- 180/200; n1 <- 200
p2_hat <- 150/200; n2 <- 200
se_diff <- sqrt(p1_hat*(1-p1_hat)/n1 + p2_hat*(1-p2_hat)/n2)
z_critical <- qnorm(0.975)
margin_error <- z_critical * se_diff
ci_lower <- (p1_hat - p2_hat) - margin_error
ci_upper <- (p1_hat - p2_hat) + margin_error
9. Bootstrap Methods for Confidence Intervals
9.1 Introduction to Bootstrap
The bootstrap is a resampling method that estimates the sampling distribution by repeatedly sampling with replacement from the original data.
Key Concepts
- Resampling: Drawing samples from the original data
- With replacement: Same observation can be selected multiple times
- Bootstrap distribution: Distribution of statistics from resampled data
Advantages
- No distributional assumptions required
- Works for complex statistics
- Provides empirical sampling distribution
9.2 Bootstrap Procedure
- Start with original sample of size \(n\)
- Draw a bootstrap sample of size \(n\) with replacement
- Calculate statistic on bootstrap sample
- Repeat steps 2-3 many times (typically 1000+)
- Use bootstrap distribution to construct CI
R Implementation
library(infer)
# Bootstrap distribution for mean
boot_dist <- data %>%
  specify(response = variable) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "mean")
# Visualize bootstrap distribution
boot_dist %>% visualize()
9.3 Bootstrap Confidence Intervals
Percentile Method
Uses percentiles of the bootstrap distribution:
For a 95% CI, use the 2.5th and 97.5th percentiles:
# Percentile CI
percentile_ci <- boot_dist %>%
  get_confidence_interval(level = 0.95, type = "percentile")
Standard Error Method
Uses bootstrap standard error with normal approximation:
\[\text{CI} = \hat{\theta} \pm z_{\alpha/2} \times SE_{boot}\]
where \(SE_{boot}\) is the standard deviation of the bootstrap distribution.
# Standard error CI
se_ci <- boot_dist %>%
  get_confidence_interval(level = 0.95, type = "se",
                          point_estimate = mean(original_data))
9.4 Bootstrap for Different Statistics
For Proportions
boot_prop <- data %>%
  specify(response = variable, success = "yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop")
For Difference in Means
boot_diff_means <- data %>%
  specify(response ~ group) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "diff in means", order = c("A", "B"))
For Correlation
boot_cor <- data %>%
  specify(x ~ y) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "correlation")
9.5 Visualizing Bootstrap Results
# Bootstrap distribution with CI
boot_dist %>%
  visualize() +
  shade_confidence_interval(endpoints = percentile_ci)
# Comparison of methods
library(gridExtra)
p1 <- boot_dist %>%
  visualize() +
  shade_ci(endpoints = percentile_ci, color = "blue", fill = "lightblue") +
  ggtitle("Percentile Method")
p2 <- boot_dist %>%
  visualize() +
  shade_ci(endpoints = se_ci, color = "red", fill = "pink") +
  ggtitle("Standard Error Method")
grid.arrange(p1, p2, ncol = 2)
10. Sample Size Determination
10.1 Sample Size for Estimating a Mean
To achieve margin of error \(E\) with confidence level \((1-\alpha)\):
\[n = \left(\frac{z_{\alpha/2} \times \sigma}{E}\right)^2\]
When \(\sigma\) is Unknown
- Use estimate from pilot study
- Use range/4 as rough estimate
- Use conservative estimate
Example
To estimate mean income to within $500 with 95% confidence, assuming \(\sigma = \$4,500\):
\[n = \left(\frac{1.96 \times 4500}{500}\right)^2 = 311.17 \approx 312\]
10.2 Sample Size for Estimating a Proportion
\[n = \left(\frac{z_{\alpha/2}}{E}\right)^2 \times p(1-p)\]
Conservative Approach
When \(p\) is unknown, use \(p = 0.5\):
\[n = \left(\frac{z_{\alpha/2}}{E}\right)^2 \times 0.25\]
Example
To estimate proportion within 3% with 95% confidence:
\[n = \left(\frac{1.96}{0.03}\right)^2 \times 0.25 = 1067.11 \approx 1068\]
10.3 Sample Size for Comparing Two Means
\[n = \frac{2\sigma^2(z_{\alpha/2} + z_\beta)^2}{(\mu_1 - \mu_2)^2}\]
where:
- \(\beta\) = probability of Type II error
- \(\mu_1 - \mu_2\) = minimum detectable difference
10.4 Effect of Confidence Level and Margin of Error
Relationship with sample size:
- Halving margin of error → quadruples sample size
- Increasing confidence level → increases sample size
Trade-offs:
- Precision vs. cost
- Confidence vs. feasibility
10.5 Finite Population Correction
For finite populations:
\[n_{adjusted} = \frac{n}{1 + \frac{n-1}{N}}\]
where \(N\) is population size.
10.6 R Implementation for Sample Size
# For proportions
library(pwr)
pwr.p.test(h = ES.h(p1 = 0.5, p2 = 0.47),
           sig.level = 0.05,
           power = 0.80)
# Custom function for mean
sample_size_mean <- function(sigma, margin_error, conf_level = 0.95) {
  z <- qnorm((1 + conf_level) / 2)
  n <- ceiling((z * sigma / margin_error)^2)
  return(n)
}
# Custom function for proportion
sample_size_prop <- function(p = NULL, margin_error, conf_level = 0.95) {
  z <- qnorm((1 + conf_level) / 2)
  if (is.null(p)) p <- 0.5  # Conservative estimate when p is unknown
  n <- ceiling((z^2 * p * (1 - p)) / margin_error^2)
  return(n)
}
10.7 Practical Considerations
- Pilot studies: Small preliminary studies to estimate parameters
- Resource constraints: Budget, time, personnel
- Non-response: Plan for 10-30% non-response rate
- Stratification: May require different sample sizes per stratum
- Cluster sampling: Requires larger samples due to design effect
Part 5: Hypothesis Testing Framework
11. Introduction to Hypothesis Testing
11.1 What is Hypothesis Testing?
Hypothesis testing is a formal procedure for using sample data to evaluate claims about population parameters. It provides a framework for making decisions under uncertainty.
Key Components
- Null hypothesis (H₀): The claim being tested (usually status quo)
- Alternative hypothesis (H₁ or Hₐ): The competing claim
- Test statistic: A value calculated from sample data
- p-value: Probability of observing data as extreme as ours, assuming H₀ is true
- Decision rule: Criteria for rejecting or failing to reject H₀
The Criminal Trial Analogy
| Criminal Trial | Hypothesis Test |
| -------------- | --------------- |
| Defendant presumed innocent | H₀ presumed true |
| Prosecution must prove guilt | Must show evidence against H₀ |
| "Beyond reasonable doubt" | Significance level α |
| Guilty verdict | Reject H₀ |
| Not guilty verdict | Fail to reject H₀ |
11.2 Null and Alternative Hypotheses
Null Hypothesis (H₀)
- Statement of “no effect” or “no difference”
- Represents the status quo
- Must contain equality (=, ≤, or ≥)
- What we assume true until proven otherwise
Alternative Hypothesis (H₁ or Hₐ)
- Statement we’re trying to find evidence for
- Represents the research claim
- Determines if test is one-tailed or two-tailed
Types of Tests
- Two-tailed test: H₁: μ ≠ μ₀
  - Tests for difference in either direction
  - Example: "The mean is different from 50"
- Right-tailed test: H₁: μ > μ₀
  - Tests if parameter is greater
  - Example: "The mean is greater than 50"
- Left-tailed test: H₁: μ < μ₀
  - Tests if parameter is less
  - Example: "The mean is less than 50"
Examples
Claim: Average response time is 5 minutes
H₀: μ = 5
H₁: μ ≠ 5 (two-tailed)
Claim: New drug lowers blood pressure
H₀: μ₁ ≥ μ₂ (new drug mean ≥ placebo mean)
H₁: μ₁ < μ₂ (new drug mean < placebo mean)
Claim: More than 60% support the proposal
H₀: p ≤ 0.60
H₁: p > 0.60
11.3 Type I and Type II Errors
Error Types
| Reality | Decision: Reject H₀ | Decision: Fail to Reject H₀ |
| ------- | ------------------- | --------------------------- |
| H₀ True | Type I Error (α) | Correct Decision |
| H₀ False | Correct Decision | Type II Error (β) |
Type I Error (False Positive)
- Rejecting H₀ when it’s actually true
- Probability = α (significance level)
- Example: Convicting an innocent person
Type II Error (False Negative)
- Failing to reject H₀ when it’s actually false
- Probability = β
- Example: Acquitting a guilty person
Power of a Test
- Power = 1 - β
- Probability of correctly rejecting false H₀
- Affected by:
- Sample size (larger n → higher power)
- Effect size (larger effect → higher power)
- Significance level α (larger α → higher power)
- Population variability (smaller σ → higher power)
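These relationships can be checked numerically with base R's `power.t.test()`. A minimal sketch, using arbitrary illustration values (the effect size d = 0.5 and the grids of n and α are ours, not from the text):

```r
# How power responds to sample size and to alpha, via base R's power.t.test()

# Larger n -> higher power (d = 0.5 and alpha = 0.05 held fixed)
power_by_n <- sapply(c(10, 30, 100), function(n) {
  power.t.test(n = n, delta = 0.5, sd = 1,
               sig.level = 0.05, type = "one.sample")$power
})
round(power_by_n, 3)

# Larger alpha -> higher power (n = 30 and d = 0.5 held fixed)
power_by_alpha <- sapply(c(0.01, 0.05, 0.10), function(a) {
  power.t.test(n = 30, delta = 0.5, sd = 1,
               sig.level = a, type = "one.sample")$power
})
round(power_by_alpha, 3)
```

Both vectors increase monotonically, matching the bullet points above.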
11.4 Significance Level and p-values
Significance Level (α)
- Threshold for rejecting H₀
- Pre-specified before testing
- Common values: 0.05, 0.01, 0.10
- Represents acceptable Type I error rate
p-value
- Probability of observing data as extreme as ours (or more extreme), assuming H₀ is true
- Smaller p-value = stronger evidence against H₀
- Compare to α to make decision
Decision Rule
- If p-value ≤ α: Reject H₀
- If p-value > α: Fail to reject H₀
Interpreting p-values
- p < 0.01: Very strong evidence against H₀
- 0.01 ≤ p < 0.05: Strong evidence against H₀
- 0.05 ≤ p < 0.10: Weak evidence against H₀
- p ≥ 0.10: Little or no evidence against H₀
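The wording above can be encoded as a small lookup; the `evidence()` helper below is purely illustrative, not a standard function:

```r
# Illustrative helper mapping a p-value to the evidence wording above
evidence <- function(p) {
  cut(p,
      breaks = c(0, 0.01, 0.05, 0.10, 1),
      labels = c("very strong", "strong", "weak", "little or no"),
      right = FALSE, include.lowest = TRUE)
}
evidence(c(0.003, 0.02, 0.07, 0.40))
```

Note `right = FALSE`, so the boundaries match the table exactly (e.g. p = 0.01 falls in the "strong" bin, since very strong requires p < 0.01).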
11.5 Test Statistics
General Form
\[\text{Test Statistic} = \frac{\text{Point Estimate - Hypothesized Value}}{\text{Standard Error}}\]
Common Test Statistics
- Z-statistic (known σ or large samples): \(Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}\)
- t-statistic (unknown σ): \(t = \frac{\bar{X} - \mu_0}{s/\sqrt{n}}\)
- Z-statistic for proportions: \(Z = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}}\)
Critical Values
Values that separate rejection and non-rejection regions:
- Two-tailed: ±z_{α/2} or ±t_{α/2,df}
- Right-tailed: z_α or t_{α,df}
- Left-tailed: -z_α or -t_{α,df}
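In R, these critical values come from the quantile functions `qnorm()` and `qt()`; α = 0.05 and df = 24 below are arbitrary example values:

```r
alpha <- 0.05
df <- 24   # example degrees of freedom

# Two-tailed critical values
qnorm(1 - alpha/2)     # z_{alpha/2}, about 1.96
qt(1 - alpha/2, df)    # t_{alpha/2, df}

# One-tailed critical values
qnorm(1 - alpha)       # right-tailed z_alpha, about 1.645
qnorm(alpha)           # left-tailed -z_alpha, about -1.645
```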
11.6 Steps in Hypothesis Testing
1. State the hypotheses
   - Identify H₀ and H₁
   - Determine if one-tailed or two-tailed
2. Set significance level
   - Choose α (typically 0.05)
3. Check conditions
   - Random sampling
   - Independence
   - Normality (if required)
4. Calculate test statistic
   - Use appropriate formula
5. Find p-value
   - Use distribution tables or software
6. Make decision
   - Compare p-value to α
7. State conclusion
   - In context of the problem
Example: Testing Mean Response Time
A company claims average response time is 5 minutes. Sample of 100 calls shows \(\bar{x} = 5.5\) minutes, \(s = 2\) minutes.
# 1. Hypotheses
# H₀: μ = 5
# H₁: μ ≠ 5
# 2. Significance level
alpha <- 0.05
# 3. Check conditions (assumed met)
# 4. Calculate test statistic
x_bar <- 5.5
mu_0 <- 5
s <- 2
n <- 100
t_stat <- (x_bar - mu_0) / (s / sqrt(n))
# 5. Find p-value
p_value <- 2 * pt(abs(t_stat), df = n - 1, lower.tail = FALSE)
# 6. Decision
if (p_value <= alpha) {
decision <- "Reject H₀"
} else {
decision <- "Fail to reject H₀"
}
# 7. Conclusion
cat("t-statistic:", t_stat, "\n")
cat("p-value:", p_value, "\n")
cat("Decision:", decision, "\n")
11.7 Relationship Between Confidence Intervals and Hypothesis Tests
For two-tailed tests at significance level α:
- H₀ is rejected if and only if the (1-α)100% CI doesn’t contain the hypothesized value
- The CI contains all values that would not be rejected by the hypothesis test
Example
95% CI for μ: [52, 58]
- H₀: μ = 50 → Reject (50 not in CI)
- H₀: μ = 55 → Fail to reject (55 in CI)
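The duality can be demonstrated on simulated data; the sample below is hypothetical, drawn just for illustration:

```r
# Checking the CI / two-tailed test duality on a simulated sample
set.seed(123)
x <- rnorm(40, mean = 55, sd = 8)

ci <- t.test(x, conf.level = 0.95)$conf.int

# A two-tailed test at alpha = 0.05 rejects mu0 exactly when mu0 lies outside the 95% CI
for (mu0 in c(50, 55)) {
  p <- t.test(x, mu = mu0)$p.value
  inside <- mu0 >= ci[1] && mu0 <= ci[2]
  cat("mu0 =", mu0, "| p =", round(p, 4),
      "| inside CI:", inside, "| agree:", (p > 0.05) == inside, "\n")
}
```

The `agree` flag is TRUE for every `mu0`, whatever the data happen to be, because both criteria are algebraically equivalent.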
11.8 Statistical vs. Practical Significance
Statistical Significance
- p-value < α
- Evidence against H₀
- Affected by sample size
Practical Significance
- Size of the effect
- Real-world importance
- Context-dependent
Key Point
Large samples can make tiny differences statistically significant, but they may not be practically important.
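A quick simulation illustrates the point; the data are synthetic, with a true mean only 0.05 above the hypothesized value of 50:

```r
# Synthetic illustration: a tiny true effect (0.05 on sd = 5) with n = 1,000,000
set.seed(1)
x <- rnorm(1e6, mean = 50.05, sd = 5)

res <- t.test(x, mu = 50)
res$p.value                     # statistically significant at any common alpha
abs(mean(x) - 50) / sd(x)       # Cohen's d near 0.01: practically negligible
```

Whether a shift of 0.05 units matters is a subject-matter question the p-value cannot answer.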
11.9 Common Misconceptions
What p-values are NOT
- ❌ Probability that H₀ is true
- ❌ Probability that H₁ is true
- ❌ Probability of making a Type I error (that’s α)
- ❌ Measure of effect size
What “fail to reject H₀” does NOT mean
- ❌ H₀ is proven true
- ❌ H₀ is accepted
- ❌ There is no effect
11.10 Using R for Hypothesis Testing
Traditional Methods
# One-sample t-test
t.test(data, mu = hypothesized_value, alternative = "two.sided")
# Two-sample t-test
t.test(group1, group2, alternative = "two.sided")
# Paired t-test
t.test(x, y, paired = TRUE, alternative = "two.sided")
# Proportion test
prop.test(x = successes, n = total, p = hypothesized_prop,
alternative = "two.sided")
Modern Approach with infer
library(infer)
# Generate null distribution
null_dist <- data %>%
specify(response = variable) %>%
hypothesize(null = "point", mu = hypothesized_value) %>%
generate(reps = 1000, type = "bootstrap") %>%
calculate(stat = "mean")
# Calculate p-value
p_value <- null_dist %>%
get_p_value(obs_stat = observed_stat, direction = "two-sided")
# Visualize
null_dist %>%
visualize() +
shade_p_value(obs_stat = observed_stat, direction = "two-sided")
11.11 Reporting Results
A complete hypothesis test report should include:
- Statement of hypotheses
- Significance level used
- Test statistic value
- p-value
- Decision (reject/fail to reject)
- Conclusion in context
Example Report
“We tested whether the mean response time differs from 5 minutes (H₀: μ = 5 vs. H₁: μ ≠ 5) at the α = 0.05 significance level. Based on a sample of 100 calls with mean 5.5 minutes and standard deviation 2 minutes, the test statistic was t = 2.5 with p-value = 0.014. Since p < 0.05, we reject H₀ and conclude there is sufficient evidence that the mean response time differs from 5 minutes.”
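The figures quoted in this report can be reproduced directly from its summary statistics:

```r
# Reproducing the reported t = 2.5 and p = 0.014 from the summary statistics
x_bar <- 5.5; mu_0 <- 5; s <- 2; n <- 100

t_stat <- (x_bar - mu_0) / (s / sqrt(n))    # 2.5
p_value <- 2 * pt(-abs(t_stat), df = n - 1)
round(p_value, 3)                            # 0.014
```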
Part 6: One-Sample Tests
12. One-Sample Tests
12.1 One-Sample Z-Test
The z-test is used when:
- Population standard deviation (σ) is known
- Sample size is large (n ≥ 30)
- Population is normally distributed (or large sample)
Test Statistic
\[Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}\]
Decision Rule
- Two-tailed: Reject H₀ if |Z| > z_{α/2}
- Right-tailed: Reject H₀ if Z > z_α
- Left-tailed: Reject H₀ if Z < -z_α
Example: Rice Consumption
Testing if average rice consumption differs from 3.9 kg/year, with known σ = 4.5:
# Given data
x_bar <- 4.2 # Sample mean
mu_0 <- 3.9 # Hypothesized mean
sigma <- 4.5 # Known population SD
n <- 150 # Sample size
alpha <- 0.05 # Significance level
# Calculate z-statistic
z_stat <- (x_bar - mu_0) / (sigma / sqrt(n))
# Calculate p-value (two-tailed)
p_value <- 2 * (1 - pnorm(abs(z_stat)))
# Critical values
z_critical <- qnorm(1 - alpha/2)
# Decision
if (abs(z_stat) > z_critical) {
print("Reject H₀")
} else {
print("Fail to reject H₀")
}
# Using R's built-in function (requires the BSDA package);
# sample_data stands for the raw vector of observations
library(BSDA)
z.test(x = sample_data, mu = 3.9, sigma.x = 4.5,
alternative = "two.sided")
12.2 One-Sample t-Test
The t-test is used when:
- Population standard deviation is unknown
- Sample standard deviation (s) is used instead
- Population is approximately normal (or large sample)
Test Statistic
\[t = \frac{\bar{X} - \mu_0}{s/\sqrt{n}}\]
Degrees of Freedom
df = n - 1
Example: Call Center Response Time
Testing if mean response time equals 120 minutes:
# Sample data
response_times <- c(115, 125, 118, 122, 130, 119, 121, 117, 124, 116)
mu_0 <- 120
alpha <- 0.05
# Using t.test function
result <- t.test(response_times, mu = mu_0, alternative = "two.sided")
print(result)
# Manual calculation
x_bar <- mean(response_times)
s <- sd(response_times)
n <- length(response_times)
se <- s / sqrt(n)
t_stat <- (x_bar - mu_0) / se
df <- n - 1
p_value <- 2 * pt(-abs(t_stat), df)
# Confidence interval
t_critical <- qt(1 - alpha/2, df)
margin_error <- t_critical * se
ci_lower <- x_bar - margin_error
ci_upper <- x_bar + margin_error
cat("t-statistic:", t_stat, "\n")
cat("p-value:", p_value, "\n")
cat("95% CI: [", ci_lower, ",", ci_upper, "]\n")
Using infer Package
library(infer)
# infer expects a data frame, so wrap the vector first
response_df <- data.frame(response_times = response_times)
# Calculate observed statistic
obs_stat <- response_df %>%
specify(response = response_times) %>%
calculate(stat = "t", mu = mu_0)
# Generate null distribution
null_dist <- response_df %>%
specify(response = response_times) %>%
hypothesize(null = "point", mu = mu_0) %>%
generate(reps = 1000, type = "bootstrap") %>%
calculate(stat = "t", mu = mu_0)
# Get p-value
p_value <- null_dist %>%
get_p_value(obs_stat = obs_stat, direction = "two-sided")
# Visualize
null_dist %>%
visualize() +
shade_p_value(obs_stat = obs_stat, direction = "two-sided")
12.3 One-Sample Test for Proportion
Used to test claims about population proportions.
Test Statistic
\[Z = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}}\]
Conditions
- Random sample
- Independence
- Success-failure condition: np₀ ≥ 10 and n(1-p₀) ≥ 10
Example: Customer Satisfaction
Testing if satisfaction rate equals 80%:
# Given data
n <- 100 # Sample size
x <- 73 # Number of successes
p_0 <- 0.80 # Hypothesized proportion
alpha <- 0.05
# Calculate sample proportion
p_hat <- x / n
# Calculate test statistic
z_stat <- (p_hat - p_0) / sqrt(p_0 * (1 - p_0) / n)
# Calculate p-value (two-tailed)
p_value <- 2 * pnorm(-abs(z_stat))
# Using prop.test
result <- prop.test(x = x, n = n, p = p_0,
alternative = "two.sided",
correct = FALSE)
# Using infer
satisfaction_data <- data.frame(
satisfied = c(rep("yes", x), rep("no", n - x))
)
# Observed statistic
obs_stat <- satisfaction_data %>%
specify(response = satisfied, success = "yes") %>%
calculate(stat = "prop")
# Null distribution
null_dist <- satisfaction_data %>%
specify(response = satisfied, success = "yes") %>%
hypothesize(null = "point", p = p_0) %>%
generate(reps = 1000, type = "draw") %>%
calculate(stat = "prop")
# P-value
p_value <- null_dist %>%
get_p_value(obs_stat = obs_stat, direction = "two-sided")
12.4 One-Tailed Tests
Right-Tailed Test (Greater Than)
H₀: μ ≤ μ₀
H₁: μ > μ₀
# Right-tailed t-test
t.test(data, mu = mu_0, alternative = "greater")
# Right-tailed proportion test
prop.test(x, n, p = p_0, alternative = "greater")
Left-Tailed Test (Less Than)
H₀: μ ≥ μ₀
H₁: μ < μ₀
# Left-tailed t-test
t.test(data, mu = mu_0, alternative = "less")
# Left-tailed proportion test
prop.test(x, n, p = p_0, alternative = "less")
12.5 Chi-Square Test for Variance
Testing claims about population variance σ²:
Test Statistic
\[\chi^2 = \frac{(n-1)s^2}{\sigma_0^2}\]
Degrees of Freedom
df = n - 1
Example
# Sample data
sample_data <- c(10, 12, 9, 11, 13, 10, 12, 11, 9, 10)
sigma_0_squared <- 4 # Hypothesized variance
alpha <- 0.05
# Calculate test statistic
n <- length(sample_data)
s_squared <- var(sample_data)
chi_squared <- (n - 1) * s_squared / sigma_0_squared
# Calculate p-value (two-tailed)
p_value_lower <- pchisq(chi_squared, df = n - 1)
p_value_upper <- 1 - pchisq(chi_squared, df = n - 1)
p_value <- 2 * min(p_value_lower, p_value_upper)
# Critical values
chi_lower <- qchisq(alpha/2, df = n - 1)
chi_upper <- qchisq(1 - alpha/2, df = n - 1)
12.6 Sign Test (Non-parametric)
Used when normality assumption is violated.
Procedure
- Count values above and below hypothesized median
- Use binomial distribution with p = 0.5
# Sign test for median
library(BSDA)
SIGN.test(data, md = hypothesized_median,
alternative = "two.sided")
# Manual implementation
above <- sum(data > hypothesized_median)
below <- sum(data < hypothesized_median)
n_eff <- above + below
p_value <- 2 * pbinom(min(above, below), n_eff, 0.5)
12.7 Practical Examples
Example 1: Age at First Marriage
Testing if mean age at first marriage > 23 years:
# Load data
age_data <- read.csv("age_at_marriage.csv")
mu_0 <- 23
# Check normality
hist(age_data$age, main = "Distribution of Age at First Marriage")
qqnorm(age_data$age)
qqline(age_data$age)
# Perform t-test
result <- t.test(age_data$age, mu = mu_0, alternative = "greater")
print(result)
# Using infer
obs_stat <- age_data %>%
specify(response = age) %>%
calculate(stat = "mean")
null_dist <- age_data %>%
specify(response = age) %>%
hypothesize(null = "point", mu = mu_0) %>%
generate(reps = 1000, type = "bootstrap") %>%
calculate(stat = "mean")
p_value <- null_dist %>%
get_p_value(obs_stat = obs_stat, direction = "greater")
# Visualize
null_dist %>%
visualize() +
shade_p_value(obs_stat = obs_stat, direction = "greater")
Example 2: Proportion of Female Students
Testing if proportion of females = 0.50:
# Survey data
total_students <- 500
female_students <- 237
# Traditional approach
prop.test(x = female_students, n = total_students,
p = 0.50, alternative = "two.sided")
# Using infer
gender_data <- data.frame(
gender = c(rep("female", female_students),
rep("male", total_students - female_students))
)
obs_stat <- gender_data %>%
specify(response = gender, success = "female") %>%
calculate(stat = "prop")
null_dist <- gender_data %>%
specify(response = gender, success = "female") %>%
hypothesize(null = "point", p = 0.50) %>%
generate(reps = 1000, type = "draw") %>%
calculate(stat = "prop")
p_value <- null_dist %>%
get_p_value(obs_stat = obs_stat, direction = "two-sided")
12.8 Effect Size for One-Sample Tests
Cohen’s d for Means
\[d = \frac{|\bar{X} - \mu_0|}{s}\]
Interpretation:
- Small effect: d ≈ 0.2
- Medium effect: d ≈ 0.5
- Large effect: d ≈ 0.8
# Calculate Cohen's d
cohens_d <- abs(mean(data) - mu_0) / sd(data)
Effect Size for Proportions
\[h = 2 \arcsin(\sqrt{\hat{p}}) - 2 \arcsin(\sqrt{p_0})\]
# Calculate h
h <- 2 * asin(sqrt(p_hat)) - 2 * asin(sqrt(p_0))
12.9 Power Analysis
For One-Sample t-Test
library(pwr)
# Power calculation (power achieved with a given n, e.g. n = 30)
pwr.t.test(n = 30, d = 0.5, sig.level = 0.05,
type = "one.sample")
# Sample size calculation (n needed for 80% power)
pwr.t.test(d = 0.5, sig.level = 0.05, power = 0.80,
type = "one.sample")
For One-Sample Proportion Test
# Sample size calculation
pwr.p.test(h = ES.h(p1 = 0.75, p2 = 0.80),
sig.level = 0.05, power = 0.80)
12.10 Reporting Results
Template for Reporting
“A one-sample [test type] was conducted to determine whether [parameter] differed from [hypothesized value]. The sample [statistic] was [value] ([measure of spread]). The test was [significant/not significant], t(df) = [test statistic], p = [p-value]. [Effect size measure] indicated a [small/medium/large] effect size. We [reject/fail to reject] the null hypothesis and conclude that [contextual conclusion].”
Example Report
“A one-sample t-test was conducted to determine whether the mean response time differed from 120 minutes. The sample mean was 118.7 minutes (SD = 4.2). The test was not significant, t(9) = -0.98, p = 0.353. Cohen’s d = 0.31 indicated a small effect size. We fail to reject the null hypothesis and conclude that there is insufficient evidence to suggest the mean response time differs from 120 minutes.”
Part 7: Two-Sample Tests
13. Two-Sample Tests
13.1 Independent Samples t-Test
The independent samples t-test compares means from two unrelated groups.
Assumptions
- Independent random samples
- Approximately normal distributions (or large samples)
- Equal variances (classical t-test) or unequal variances (Welch’s t-test)
Test Statistic (Equal Variances)
\[t = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}\]
where pooled standard deviation: \(s_p = \sqrt{\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1 + n_2 - 2}}\)
Test Statistic (Unequal Variances - Welch’s)
\[t = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}\]
Degrees of Freedom (Welch’s)
\[df = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{(s_1^2/n_1)^2}{n_1-1} + \frac{(s_2^2/n_2)^2}{n_2-1}}\]
Example: Income Comparison
Comparing average income between Cleveland and Sacramento:
# Load data
income_data <- read.delim("cleSac.txt")
cleveland <- income_data$income[income_data$metro_area == "Cleveland_OH"]
sacramento <- income_data$income[income_data$metro_area == "Sacramento_CA"]
# Check assumptions
# Normality
par(mfrow = c(1, 2))
hist(cleveland, main = "Cleveland Income")
hist(sacramento, main = "Sacramento Income")
# Equal variances
var.test(cleveland, sacramento)
# Perform t-test (Welch's by default)
t_test_result <- t.test(cleveland, sacramento,
alternative = "two.sided",
var.equal = FALSE)
print(t_test_result)
# Manual calculation
x1_bar <- mean(cleveland)
x2_bar <- mean(sacramento)
s1 <- sd(cleveland)
s2 <- sd(sacramento)
n1 <- length(cleveland)
n2 <- length(sacramento)
t_stat <- (x1_bar - x2_bar) / sqrt(s1^2/n1 + s2^2/n2)
df <- ((s1^2/n1 + s2^2/n2)^2) /
((s1^2/n1)^2/(n1-1) + (s2^2/n2)^2/(n2-1))
p_value <- 2 * pt(-abs(t_stat), df)
# Using infer
library(infer)
obs_diff <- income_data %>%
specify(income ~ metro_area) %>%
calculate(stat = "diff in means",
order = c("Cleveland_OH", "Sacramento_CA"))
null_dist <- income_data %>%
specify(income ~ metro_area) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "diff in means",
order = c("Cleveland_OH", "Sacramento_CA"))
p_value <- null_dist %>%
get_p_value(obs_stat = obs_diff, direction = "two-sided")
# Visualize
null_dist %>%
visualize() +
shade_p_value(obs_stat = obs_diff, direction = "two-sided")
13.2 Paired Samples t-Test
Used when comparing two related measurements (e.g., before-after, matched pairs).
Test Statistic
\[t = \frac{\bar{d} - \mu_d}{s_d/\sqrt{n}}\]
where:
- \(\bar{d}\) = mean of differences
- \(s_d\) = standard deviation of differences
- \(\mu_d\) = hypothesized mean difference (usually 0)
Example: Weight Loss Program
# Before and after weights
before <- c(65, 60, 63, 58, 70, 72, 80, 81, 83, 69)
after <- c(55, 52, 60, 55, 60, 66, 70, 66, 63, 60)
# Calculate differences
differences <- after - before
# Check normality of differences
hist(differences, main = "Distribution of Weight Differences")
qqnorm(differences)
qqline(differences)
# Paired t-test
paired_result <- t.test(after, before,
paired = TRUE,
alternative = "less")
print(paired_result)
# Manual calculation
d_bar <- mean(differences)
s_d <- sd(differences)
n <- length(differences)
t_stat <- d_bar / (s_d / sqrt(n))
p_value <- pt(t_stat, df = n - 1)
# Using infer
weight_data <- data.frame(
id = rep(1:10, 2),
weight = c(before, after),
time = rep(c("before", "after"), each = 10)
)
obs_diff <- weight_data %>%
specify(weight ~ time) %>%
calculate(stat = "diff in means",
order = c("after", "before"))
null_dist <- weight_data %>%
specify(weight ~ time) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "diff in means",
order = c("after", "before"))
p_value <- null_dist %>%
get_p_value(obs_stat = obs_diff, direction = "less")
Example: Water Quality Testing
Comparing zinc concentration in surface vs. bottom water:
# Load and prepare data
zinc_data <- read.csv("zinc_tidy.csv")
zinc_diff <- zinc_data %>%
group_by(loc_id) %>%
summarize(pair_diff = diff(concentration))
# Test if surface water has lower concentration
t.test(zinc_diff$pair_diff,
mu = 0,
alternative = "less")
# Visual comparison
library(ggplot2)
ggplot(zinc_data, aes(x = depth, y = concentration)) +
geom_boxplot() +
geom_line(aes(group = loc_id), alpha = 0.3) +
geom_point(aes(color = depth), size = 3) +
labs(title = "Zinc Concentration by Depth",
x = "Water Depth",
y = "Zinc Concentration")
13.3 Two-Sample Test for Proportions
Compares proportions between two independent groups.
Test Statistic
\[Z = \frac{(\hat{p}_1 - \hat{p}_2) - 0}{\sqrt{\hat{p}(1-\hat{p})(\frac{1}{n_1} + \frac{1}{n_2})}}\]
where pooled proportion: \(\hat{p} = \frac{x_1 + x_2}{n_1 + n_2}\)
Example: Gender and Weight Loss
Testing if weight loss rates differ by gender:
# Data
male_success <- 45
male_total <- 100
female_success <- 62
female_total <- 100
# Two-proportion test
prop.test(x = c(male_success, female_success),
n = c(male_total, female_total),
alternative = "two.sided",
correct = FALSE)
# Manual calculation
p1_hat <- male_success / male_total
p2_hat <- female_success / female_total
p_pool <- (male_success + female_success) / (male_total + female_total)
se_diff <- sqrt(p_pool * (1 - p_pool) * (1/male_total + 1/female_total))
z_stat <- (p1_hat - p2_hat) / se_diff
p_value <- 2 * pnorm(-abs(z_stat))
# Using infer
gender_data <- data.frame(
gender = c(rep("male", male_total), rep("female", female_total)),
success = c(rep(c("yes", "no"), c(male_success, male_total - male_success)),
rep(c("yes", "no"), c(female_success, female_total - female_success)))
)
obs_diff <- gender_data %>%
specify(success ~ gender, success = "yes") %>%
calculate(stat = "diff in props",
order = c("male", "female"))
null_dist <- gender_data %>%
specify(success ~ gender, success = "yes") %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "diff in props",
order = c("male", "female"))
p_value <- null_dist %>%
get_p_value(obs_stat = obs_diff, direction = "two-sided")
13.4 Mann-Whitney U Test (Wilcoxon Rank-Sum)
Non-parametric alternative to independent samples t-test.
When to Use
- Data not normally distributed
- Ordinal data
- Small sample sizes
- Presence of outliers
# Mann-Whitney U test
wilcox.test(cleveland, sacramento,
alternative = "two.sided")
# With continuity correction
wilcox.test(cleveland, sacramento,
alternative = "two.sided",
correct = TRUE)
# Visual comparison using violin plots
library(ggplot2)
ggplot(income_data, aes(x = metro_area, y = income)) +
geom_violin(fill = "lightblue") +
geom_boxplot(width = 0.1) +
labs(title = "Income Distribution by City",
x = "Metropolitan Area",
y = "Income")
13.5 Wilcoxon Signed-Rank Test
Non-parametric alternative to paired t-test.
# Wilcoxon signed-rank test
wilcox.test(after, before,
paired = TRUE,
alternative = "less")
# Manual implementation
differences <- after - before
ranks <- rank(abs(differences))
signed_ranks <- ranks * sign(differences)
W <- sum(signed_ranks[signed_ranks > 0])
13.6 Chi-Square Test for Independence
Tests whether two categorical variables are independent.
Test Statistic
\[\chi^2 = \sum \frac{(O - E)^2}{E}\]
where:
- O = observed frequency
- E = expected frequency
Example: Species and Petal Width
# Prepare data
iris_subset <- iris[iris$Species != "virginica", ]
iris_subset$Species <- factor(iris_subset$Species)
iris_subset$Petal.Width.Cat <- cut(iris_subset$Petal.Width,
breaks = quantile(iris_subset$Petal.Width,
probs = c(0, 0.5, 1)),
include.lowest = TRUE,
labels = c("Below", "Above"))
# Create contingency table
cont_table <- table(iris_subset$Petal.Width.Cat, iris_subset$Species)
print(cont_table)
# Chi-square test
chi_test <- chisq.test(cont_table)
print(chi_test)
# Expected values
print(chi_test$expected)
# Residuals
print(chi_test$residuals)
# Visualization
library(ggplot2)
ggplot(iris_subset, aes(x = Species, fill = Petal.Width.Cat)) +
geom_bar(position = "dodge") +
labs(title = "Distribution of Petal Width by Species",
x = "Species",
y = "Count",
fill = "Petal Width")
13.7 Effect Size for Two-Sample Tests
Cohen’s d for Independent Samples
\[d = \frac{|\bar{X}_1 - \bar{X}_2|}{s_p}\]
Cohen’s d for Paired Samples
\[d = \frac{|\bar{d}|}{s_d}\]
# Effect size for independent samples
library(effsize)
cohen.d(cleveland, sacramento)
# Effect size for paired samples
cohen.d(after, before, paired = TRUE)
# Manual calculation
d_independent <- abs(mean(cleveland) - mean(sacramento)) /
sqrt(((length(cleveland) - 1) * var(cleveland) +
(length(sacramento) - 1) * var(sacramento)) /
(length(cleveland) + length(sacramento) - 2))
d_paired <- abs(mean(differences)) / sd(differences)
13.8 Power Analysis for Two-Sample Tests
library(pwr)
# Power for independent t-test
pwr.t2n.test(n1 = 100, n2 = 120, d = 0.5,
sig.level = 0.05,
alternative = "two.sided")
# Sample size for paired t-test
pwr.t.test(d = 0.3, sig.level = 0.05,
power = 0.80, type = "paired")
# Power for two proportions
pwr.2p.test(h = ES.h(p1 = 0.45, p2 = 0.60),
sig.level = 0.05, power = 0.80)
13.9 Comprehensive Example: Medicine Effectiveness
# Load data
medicine_data <- read.csv("medicine_survey.txt")
# Research question: Do females have less weight than males before treatment?
# Subset data
female_weight <- medicine_data$KgBefore[medicine_data$Gender == "Female"]
male_weight <- medicine_data$KgBefore[medicine_data$Gender == "Male"]
# Exploratory analysis
library(ggplot2)
ggplot(medicine_data, aes(x = Gender, y = KgBefore, fill = Gender)) +
geom_boxplot() +
geom_jitter(width = 0.2, alpha = 0.5) +
labs(title = "Weight Distribution by Gender",
x = "Gender",
y = "Weight Before Treatment (kg)")
# Check assumptions
shapiro.test(female_weight)
shapiro.test(male_weight)
var.test(female_weight, male_weight)
# Perform t-test
result <- t.test(female_weight, male_weight,
alternative = "less",
var.equal = FALSE)
print(result)
# Effect size
cohen.d(female_weight, male_weight)
# Bootstrap confidence interval
library(infer)
boot_dist <- medicine_data %>%
specify(KgBefore ~ Gender) %>%
generate(reps = 1000, type = "bootstrap") %>%
calculate(stat = "diff in means",
order = c("Female", "Male"))
ci_boot <- boot_dist %>%
get_confidence_interval(level = 0.95, type = "percentile")
# Report results
cat("Mean weight (Female):", mean(female_weight), "kg\n")
cat("Mean weight (Male):", mean(male_weight), "kg\n")
cat("Difference:", mean(female_weight) - mean(male_weight), "kg\n")
cat("95% CI (traditional):", result$$conf.int, "\n")
cat("95% CI (bootstrap):", ci_boot$$lower_ci, "to", ci_boot$$upper_ci, "\n")
cat("p-value:", result$$p.value, "\n")
13.10 Reporting Two-Sample Test Results
Template
“A [test type] was conducted to compare [variable] between [group 1] and [group 2]. There was a [significant/non-significant] difference in [variable] between [group 1] (M = [mean1], SD = [sd1]) and [group 2] (M = [mean2], SD = [sd2]); t(df) = [test statistic], p = [p-value]. The effect size was [small/medium/large] (d = [effect size]). These results suggest that [contextual conclusion].”
Example Report
“An independent samples t-test was conducted to compare weight before treatment between females and males. There was a significant difference in weight between females (M = 65.4, SD = 8.2) and males (M = 72.1, SD = 9.5); t(198) = -5.33, p < .001. The effect size was medium (d = 0.75). These results suggest that females had significantly lower weight than males before treatment.”
Part 8: R Implementation & Case Studies
14. R Implementation: Traditional vs Simulation Methods
14.1 Traditional Methods Overview
Traditional statistical methods rely on mathematical theory and distributional assumptions.
Key Functions in Base R
# t-tests
t.test() # One-sample, two-sample, paired t-tests
var.test() # F-test for equality of variances
# Proportion tests
prop.test() # One-sample, two-sample proportion tests
binom.test() # Exact binomial test
# Non-parametric tests
wilcox.test() # Wilcoxon rank-sum/signed-rank tests
chisq.test() # Chi-square test
# ANOVA
aov() # Analysis of variance
summary() # ANOVA table
Example: Traditional t-test workflow
# Load data
data <- read.csv("sample_data.csv")
# Check assumptions
shapiro.test(data$response) # Normality
var.test(group1, group2) # Equal variances
# Perform test
result <- t.test(response ~ group, data = data)
# Extract results
result$statistic # Test statistic
result$p.value # p-value
result$conf.int # Confidence interval
14.2 Simulation-Based Methods with infer
The infer package provides a unified grammar for statistical inference using simulation methods.
The infer Workflow
- specify(): Specify variable(s) of interest
- hypothesize(): Declare null hypothesis
- generate(): Generate data reflecting null hypothesis
- calculate(): Calculate summary statistic
- visualize(): Visualize the null distribution
Basic infer Structure
library(infer)
# General workflow
data %>%
specify(formula) %>% # Specify variables
hypothesize(null = "...") %>% # State null hypothesis
generate(reps = n) %>% # Generate null distribution
calculate(stat = "...") %>% # Calculate statistics
visualize() # Visualize results
14.3 Comparing Traditional and Simulation Approaches
One-Sample Mean Test
Traditional Approach:
# Traditional t-test
t_result <- t.test(data$variable, mu = null_value)
Simulation Approach:
# Calculate observed statistic
obs_stat <- data %>%
specify(response = variable) %>%
calculate(stat = "mean")
# Generate null distribution
null_dist <- data %>%
specify(response = variable) %>%
hypothesize(null = "point", mu = null_value) %>%
generate(reps = 1000, type = "bootstrap") %>%
calculate(stat = "mean")
# Get p-value
p_value <- null_dist %>%
get_p_value(obs_stat = obs_stat, direction = "two-sided")
# Visualize
null_dist %>%
visualize() +
shade_p_value(obs_stat = obs_stat, direction = "two-sided")
Two-Sample Mean Test
Traditional Approach:
# Traditional independent t-test
t_result <- t.test(value ~ group, data = data)
Simulation Approach:
# Calculate observed difference
obs_diff <- data %>%
specify(value ~ group) %>%
calculate(stat = "diff in means", order = c("A", "B"))
# Generate null distribution
null_dist <- data %>%
specify(value ~ group) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "diff in means", order = c("A", "B"))
# Get p-value and visualize
p_value <- null_dist %>%
get_p_value(obs_stat = obs_diff, direction = "two-sided")
14.4 Case Study 1: Bowl Sampling Activity
Estimating the proportion of red balls in a bowl using physical and virtual sampling.
library(moderndive)
library(tidyverse)
library(infer)
# Physical sampling results
tactile_prop_red <- read.csv("tactile_sampling.csv")
# Virtual sampling
virtual_samples <- bowl %>%
rep_sample_n(size = 50, reps = 33)
# Calculate proportions
virtual_prop_red <- virtual_samples %>%
group_by(replicate) %>%
summarize(red = sum(color == "red")) %>%
mutate(prop_red = red / 50)
# Compare distributions
combined_data <- bind_rows(
virtual_prop_red %>% mutate(type = "Virtual"),
tactile_prop_red %>% mutate(type = "Tactile")
)
ggplot(combined_data, aes(x = prop_red, fill = type)) +
geom_histogram(binwidth = 0.05, position = "dodge", alpha = 0.7) +
labs(title = "Comparison of Sampling Methods",
x = "Proportion of Red Balls",
y = "Count",
fill = "Sampling Type") +
theme_minimal()
# Effect of sample size
sample_sizes <- c(25, 50, 100, 200)
results <- list()
for (n in sample_sizes) {
samples <- bowl %>%
rep_sample_n(size = n, reps = 1000)
props <- samples %>%
group_by(replicate) %>%
summarize(prop_red = mean(color == "red"))
results[[paste0("n_", n)]] <- data.frame(
n = n,
sd = sd(props$prop_red),
mean = mean(props$prop_red)
)
}
results_df <- do.call(rbind, results)
ggplot(results_df, aes(x = n, y = sd)) +
geom_line() +
geom_point(size = 3) +
labs(title = "Standard Error vs Sample Size",
x = "Sample Size",
y = "Standard Error") +
theme_minimal()
14.5 Case Study 2: Cell Phone Usage Analysis
Analyzing student cell phone usage patterns from the “usecell” dataset.
# Load data
usecell <- read.csv("usecell.txt")
# Research question: Is mean calling time > 120 minutes?
# Traditional approach
t_result <- t.test(usecell$CallTime, mu = 120, alternative = "greater")
print(t_result)
# Simulation approach
obs_mean <- usecell %>%
specify(response = CallTime) %>%
calculate(stat = "mean")
null_dist <- usecell %>%
specify(response = CallTime) %>%
hypothesize(null = "point", mu = 120) %>%
generate(reps = 1000, type = "bootstrap") %>%
calculate(stat = "mean")
p_value <- null_dist %>%
get_p_value(obs_stat = obs_mean, direction = "greater")
# Visualization
null_dist %>%
visualize() +
shade_p_value(obs_stat = obs_mean, direction = "greater") +
labs(title = "Null Distribution for Mean Call Time",
subtitle = "H0: μ = 120 minutes")
# Proportion analysis: Games as favorite app
# specify() expects a factor/character response, so code the indicator as "yes"/"no"
games_prop <- usecell %>%
mutate(is_games = ifelse(FavoriteApp == "games", "yes", "no")) %>%
specify(response = is_games, success = "yes") %>%
calculate(stat = "prop")
# Bootstrap CI for proportion
boot_dist <- usecell %>%
mutate(is_games = ifelse(FavoriteApp == "games", "yes", "no")) %>%
specify(response = is_games, success = "yes") %>%
generate(reps = 1000, type = "bootstrap") %>%
calculate(stat = "prop")
ci_90 <- boot_dist %>%
get_confidence_interval(level = 0.90, type = "percentile")
# Visualize with CI
boot_dist %>%
visualize() +
shade_confidence_interval(endpoints = ci_90) +
labs(title = "Bootstrap Distribution of Games Proportion",
subtitle = "90% Confidence Interval")
14.6 Case Study 3: Income Comparison Study
Comparing income between different cities using multiple approaches.
# Load data
cle_sac <- read.delim("cleSac.txt")
# Exploratory analysis
ggplot(cle_sac, aes(x = metro_area, y = income)) +
geom_violin(fill = "lightblue") +
geom_boxplot(width = 0.1) +
stat_summary(fun = mean, geom = "point", color = "red", size = 3) +
labs(title = "Income Distribution by Metropolitan Area",
x = "City",
y = "Income")
# Traditional t-test
t_result <- t.test(income ~ metro_area, data = cle_sac)
# Simulation-based approach
obs_diff <- cle_sac %>%
specify(income ~ metro_area) %>%
calculate(stat = "diff in means",
order = c("Cleveland_OH", "Sacramento_CA"))
null_dist <- cle_sac %>%
specify(income ~ metro_area) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "diff in means",
order = c("Cleveland_OH", "Sacramento_CA"))
# Compare approaches
cat("Traditional p-value:", t_result$p.value, "\n")
cat("Simulation p-value:",
get_p_value(null_dist, obs_stat = obs_diff, direction = "two-sided")$p_value, "\n")
# Bootstrap confidence interval
boot_dist <- cle_sac %>%
specify(income ~ metro_area) %>%
generate(reps = 1000, type = "bootstrap") %>%
calculate(stat = "diff in means",
order = c("Cleveland_OH", "Sacramento_CA"))
ci_boot <- boot_dist %>%
get_confidence_interval(level = 0.95, type = "percentile")
# Visualization of results
p1 <- null_dist %>%
visualize() +
shade_p_value(obs_stat = obs_diff, direction = "two-sided") +
labs(title = "Null Distribution")
p2 <- boot_dist %>%
visualize() +
shade_confidence_interval(endpoints = ci_boot) +
labs(title = "Bootstrap Distribution")
gridExtra::grid.arrange(p1, p2, ncol = 2)
14.7 Case Study 4: Pizza Delivery Analysis
Analyzing pizza delivery times with various statistical approaches.
library(DescTools)
data(d.pizza)
# Filter for driver Carter
carter_data <- d.pizza %>%
filter(driver == "Carter")
# Traditional confidence interval
t_result <- t.test(carter_data$delivery_time)
traditional_ci <- t_result$conf.int
# Bootstrap confidence interval (percentile method)
boot_dist_carter <- carter_data %>%
specify(response = delivery_time) %>%
generate(reps = 1000, type = "bootstrap") %>%
calculate(stat = "mean")
bootstrap_ci <- boot_dist_carter %>%
get_confidence_interval(level = 0.99, type = "percentile")
# Standard error method
se_ci <- boot_dist_carter %>%
get_confidence_interval(level = 0.99, type = "se",
point_estimate = mean(carter_data$delivery_time))
# Compare confidence intervals
ci_comparison <- data.frame(
Method = c("Traditional", "Bootstrap Percentile", "Bootstrap SE"),
Lower = c(traditional_ci[1], bootstrap_ci$lower_ci, se_ci$lower_ci),
Upper = c(traditional_ci[2], bootstrap_ci$upper_ci, se_ci$upper_ci)
)
print(ci_comparison)
# Visualize distributions
p1 <- ggplot(carter_data, aes(x = delivery_time)) +
geom_histogram(binwidth = 2, fill = "skyblue", color = "black") +
labs(title = "Distribution of Delivery Times",
x = "Delivery Time (minutes)",
y = "Count")
p2 <- boot_dist_carter %>%
visualize() +
shade_confidence_interval(endpoints = bootstrap_ci) +
labs(title = "Bootstrap Distribution",
x = "Mean Delivery Time")
gridExtra::grid.arrange(p1, p2, ncol = 2)
14.8 Case Study 5: Weight Loss Program Evaluation
Evaluating effectiveness of a weight loss program using paired data.
# Load data
weight_data <- read.csv("Weight.txt")
# Traditional paired t-test
t_result <- t.test(weight_data$Weight.after,
weight_data$Weight.before,
paired = TRUE,
alternative = "less")
# Calculate differences
weight_data <- weight_data %>%
mutate(difference = Weight.after - Weight.before)
# Simulation approach
obs_mean_diff <- weight_data %>%
specify(response = difference) %>%
calculate(stat = "mean")
null_dist <- weight_data %>%
specify(response = difference) %>%
hypothesize(null = "point", mu = 0) %>%
generate(reps = 1000, type = "bootstrap") %>%
calculate(stat = "mean")
# Visualization
p1 <- ggplot(weight_data, aes(x = difference)) +
geom_histogram(binwidth = 2, fill = "coral", color = "black") +
geom_vline(xintercept = 0, linetype = "dashed", color = "red") +
labs(title = "Distribution of Weight Differences",
x = "Weight Change (kg)",
y = "Count")
p2 <- null_dist %>%
visualize() +
shade_p_value(obs_stat = obs_mean_diff, direction = "less") +
labs(title = "Null Distribution",
subtitle = "H0: μd = 0")
# Before-after plot
p3 <- ggplot(weight_data, aes(x = Weight.before, y = Weight.after)) +
geom_point(size = 3) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "red") +
labs(title = "Before vs After Weight",
x = "Weight Before (kg)",
y = "Weight After (kg)")
gridExtra::grid.arrange(p1, p2, p3, ncol = 2)
14.9 Creating Comprehensive Reports
# Function to create comprehensive analysis report
create_analysis_report <- function(data, response_var, group_var = NULL,
test_type = "mean", null_value = NULL,
confidence_level = 0.95) {
# Descriptive statistics
if (is.null(group_var)) {
desc_stats <- data %>%
summarize(
n = n(),
mean = mean(!!sym(response_var), na.rm = TRUE),
sd = sd(!!sym(response_var), na.rm = TRUE),
median = median(!!sym(response_var), na.rm = TRUE),
IQR = IQR(!!sym(response_var), na.rm = TRUE)
)
} else {
desc_stats <- data %>%
group_by(!!sym(group_var)) %>%
summarize(
n = n(),
mean = mean(!!sym(response_var), na.rm = TRUE),
sd = sd(!!sym(response_var), na.rm = TRUE),
median = median(!!sym(response_var), na.rm = TRUE),
IQR = IQR(!!sym(response_var), na.rm = TRUE)
)
}
# Traditional test
if (is.null(group_var)) {
traditional_result <- t.test(data[[response_var]], mu = null_value)
} else {
formula <- as.formula(paste(response_var, "~", group_var))
traditional_result <- t.test(formula, data = data)
}
# Simulation-based approach
if (is.null(group_var)) {
obs_stat <- data %>%
specify(response = !!sym(response_var)) %>%
calculate(stat = test_type)
null_dist <- data %>%
specify(response = !!sym(response_var)) %>%
hypothesize(null = "point", mu = null_value) %>%
generate(reps = 1000, type = "bootstrap") %>%
calculate(stat = test_type)
} else {
obs_stat <- data %>%
specify(as.formula(paste(response_var, "~", group_var))) %>%
calculate(stat = paste0("diff in ", test_type, "s"))
null_dist <- data %>%
specify(as.formula(paste(response_var, "~", group_var))) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = paste0("diff in ", test_type, "s"))
}
# Generate report
report <- list(
descriptive_stats = desc_stats,
traditional_test = traditional_result,
simulation_pvalue = get_p_value(null_dist, obs_stat = obs_stat,
direction = "two-sided"),
observed_statistic = obs_stat,
null_distribution = null_dist
)
return(report)
}
# Example usage
report <- create_analysis_report(
data = cle_sac,
response_var = "income",
group_var = "metro_area",
test_type = "mean"
)
# Print report
print(report$descriptive_stats)
print(report$traditional_test)
print(report$simulation_pvalue)
14.10 Best Practices and Guidelines
When to Use Each Approach
Traditional Methods:
- When assumptions are clearly met
- For standard, well-established tests
- When computational resources are limited
- For formal reporting requirements
Simulation Methods:
- When assumptions are violated
- For complex statistics
- For educational purposes
- When visualization aids understanding
Workflow Recommendations
- Always start with exploratory data analysis
- Check assumptions before choosing method
- Use both approaches when possible for validation
- Focus on practical significance, not just statistical
- Create reproducible analyses with clear documentation
Code Organization
# Recommended project structure
project/
├── data/
│ ├── raw/
│ └── processed/
├── scripts/
│ ├── 01_data_cleaning.R
│ ├── 02_exploratory_analysis.R
│ ├── 03_statistical_tests.R
│ └── 04_visualization.R
├── output/
│ ├── figures/
│ └── tables/
└── reports/
└── final_report.Rmd
Part 9: Sample Questions & Solutions
15. Sample Questions and Solutions
15.1 Probability and Distributions
Question 1
Four fair coins are flipped. If the outcomes are assumed independent, what is the probability that two heads and two tails are obtained?
Solution:
# This is a binomial probability problem
# n = 4 trials, k = 2 successes, p = 0.5
# Using binomial formula
prob <- choose(4, 2) * (0.5)^2 * (0.5)^2
print(prob) # 0.375
# Using R function
dbinom(2, size = 4, prob = 0.5) # 0.375
The probability is 0.375 or 37.5%.
Question 2
Between 2 p.m. and 4 p.m., telephone calls arrive at a switchboard at an average rate of 2.5 per minute. Find the probability that in a minute, there will be: a) Zero calls b) Exactly two calls c) More than six calls
Solution:
# This is a Poisson distribution problem
# λ = 2.5 calls per minute
# a) Zero calls
prob_zero <- dpois(0, lambda = 2.5)
print(prob_zero) # 0.082
# b) Two calls
prob_two <- dpois(2, lambda = 2.5)
print(prob_two) # 0.257
# c) More than six calls
prob_more_six <- 1 - ppois(6, lambda = 2.5)
print(prob_more_six) # 0.014
Question 3
The level of blood pressure of a population of women has mean 125 mm Hg and standard deviation 12. A sample of 36 women is selected. What is the probability that the sample mean will be between 120 and 130?
Solution:
# Parameters
mu <- 125
sigma <- 12
n <- 36
# Standard error
se <- sigma / sqrt(n)
# Calculate z-scores
z1 <- (120 - mu) / se
z2 <- (130 - mu) / se
# Probability
prob <- pnorm(z2) - pnorm(z1)
print(prob) # 0.988
# Alternative approach
pnorm(130, mean = mu, sd = se) - pnorm(120, mean = mu, sd = se)
15.2 Point Estimation
Question 4
Find the maximum likelihood estimator for the parameter λ of a Poisson distribution based on a random sample x₁, x₂, …, xₙ.
Solution: The likelihood function for Poisson is: \(L(\lambda) = \prod_{i=1}^n \frac{e^{-\lambda}\lambda^{x_i}}{x_i!}\)
Taking the log: \(\ell(\lambda) = -n\lambda + \sum_{i=1}^n x_i \log(\lambda) - \sum_{i=1}^n \log(x_i!)\)
Taking derivative and setting to zero: \(\frac{d\ell}{d\lambda} = -n + \frac{\sum x_i}{\lambda} = 0\)
Therefore: \(\hat{\lambda}_{MLE} = \bar{x}\)
# Example with data
taxi_arrivals <- c(23, 15, 23, 21, 13)
lambda_mle <- mean(taxi_arrivals)
print(lambda_mle) # 19
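As a quick numerical check (using base R's `optimize`, not part of the original derivation), maximizing the Poisson log-likelihood directly should recover the sample mean:

```r
# Verify numerically that the Poisson MLE equals the sample mean
taxi_arrivals <- c(23, 15, 23, 21, 13)

# Log-likelihood up to the constant -sum(log(x_i!)) term
loglik <- function(lambda, x) -length(x) * lambda + sum(x) * log(lambda)

opt <- optimize(loglik, interval = c(0.01, 100),
                x = taxi_arrivals, maximum = TRUE)
opt$maximum  # agrees with mean(taxi_arrivals) = 19
```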
Question 5
The following sample represents M&M chocolate weights. Calculate point estimates for the population mean and standard deviation.
Solution:
# Sample data
weights <- c(111.2, 110.8, 112.1, 111.5, 110.9, 111.7, 110.6, 111.3, 112.0, 110.7)
# Point estimate for mean
mean_estimate <- mean(weights)
print(mean_estimate) # 111.28
# Point estimate for standard deviation
sd_estimate <- sd(weights)
print(sd_estimate) # 0.537
# Standard error
se <- sd_estimate / sqrt(length(weights))
print(se) # 0.170
15.3 Confidence Intervals
Question 6
An accounting firm surveys 100 people about time to complete tax forms. The sample mean is 22.1 hours with known standard deviation of 7.9 hours. Construct a 90% confidence interval for the population mean.
Solution:
# Given information
n <- 100
x_bar <- 22.1
sigma <- 7.9
conf_level <- 0.90
# Calculate margin of error
z_star <- qnorm((1 + conf_level) / 2)
margin_error <- z_star * sigma / sqrt(n)
# Confidence interval
ci_lower <- x_bar - margin_error
ci_upper <- x_bar + margin_error
cat("90% CI: [", ci_lower, ",", ci_upper, "]\n")
# 90% CI: [ 20.8, 23.4 ]
# Error bound
error_bound <- margin_error
print(error_bound) # 1.3
Question 7
In a survey of 1000 students, 7.3% prefer “games” as their favorite app. Find a 95% confidence interval for the population proportion.
Solution:
# Given information
n <- 1000
p_hat <- 0.073
# Traditional method
se <- sqrt(p_hat * (1 - p_hat) / n)
z_star <- qnorm(0.975)
margin_error <- z_star * se
ci_lower <- p_hat - margin_error
ci_upper <- p_hat + margin_error
cat("95% CI: [", ci_lower, ",", ci_upper, "]\n")
# 95% CI: [ 0.057, 0.089 ]
# Using prop.test
prop.test(x = 73, n = 1000, conf.level = 0.95, correct = FALSE)$conf.int
# Bootstrap method
library(infer)
set.seed(123)
games_data <- data.frame(
favorite = c(rep("games", 73), rep("other", 927))
)
boot_dist <- games_data %>%
specify(response = favorite, success = "games") %>%
generate(reps = 1000, type = "bootstrap") %>%
calculate(stat = "prop")
boot_ci <- boot_dist %>%
get_ci(level = 0.95, type = "percentile")
print(boot_ci)
15.4 Hypothesis Testing - One Sample
Question 8
A CEO claims that 80% of customers are satisfied. A survey of 100 customers finds 73 satisfied. Test this claim at α = 0.05.
Solution:
# Hypotheses:
# H₀: p = 0.80
# H₁: p ≠ 0.80
# Traditional approach
n <- 100
x <- 73
p_0 <- 0.80
alpha <- 0.05
# Test statistic
p_hat <- x/n
z_stat <- (p_hat - p_0) / sqrt(p_0 * (1 - p_0) / n)
p_value <- 2 * pnorm(-abs(z_stat))
cat("z-statistic:", z_stat, "\n")
cat("p-value:", p_value, "\n")
# z-statistic: -1.75
# p-value: 0.08
# Decision
if (p_value < alpha) {
print("Reject H₀")
} else {
print("Fail to reject H₀")
}
# Fail to reject H₀
# Simulation approach
satisfaction_data <- data.frame(
status = c(rep("satisfied", 73), rep("unsatisfied", 27))
)
obs_prop <- satisfaction_data %>%
specify(response = status, success = "satisfied") %>%
calculate(stat = "prop")
null_dist <- satisfaction_data %>%
specify(response = status, success = "satisfied") %>%
hypothesize(null = "point", p = 0.80) %>%
generate(reps = 1000, type = "draw") %>%
calculate(stat = "prop")
p_value_sim <- null_dist %>%
get_p_value(obs_stat = obs_prop, direction = "two-sided")
null_dist %>%
visualize() +
shade_p_value(obs_stat = obs_prop, direction = "two-sided")
Question 9
Average rice consumption is claimed to be 3.9 kg with σ = 4.5. A sample of 16 countries shows a mean of 4.2 kg. Test whether consumption differs from 3.9 kg at α = 0.05.
Solution:
# Given information
n <- 16
x_bar <- 4.2
mu_0 <- 3.9
sigma <- 4.5
alpha <- 0.05
# Hypotheses:
# H₀: μ = 3.9
# H₁: μ ≠ 3.9
# Z-test (known σ)
z_stat <- (x_bar - mu_0) / (sigma / sqrt(n))
p_value <- 2 * (1 - pnorm(abs(z_stat)))
cat("z-statistic:", z_stat, "\n")
cat("p-value:", p_value, "\n")
# z-statistic: 0.267
# p-value: 0.79
# Decision
if (p_value < alpha) {
print("Reject H₀")
} else {
print("Fail to reject H₀")
}
# Fail to reject H₀
15.5 Hypothesis Testing - Two Samples
Question 10
Test whether income differs between Cleveland and Sacramento. Cleveland: n₁=212, x̄₁=45,000, s₁=15,000. Sacramento: n₂=175, x̄₂=48,000, s₂=18,000. Use α=0.05.
Solution:
# Given information
n1 <- 212; x1_bar <- 45000; s1 <- 15000
n2 <- 175; x2_bar <- 48000; s2 <- 18000
alpha <- 0.05
# Hypotheses:
# H₀: μ₁ = μ₂
# H₁: μ₁ ≠ μ₂
# Welch's t-test (unequal variances)
se_diff <- sqrt(s1^2/n1 + s2^2/n2)
t_stat <- (x1_bar - x2_bar) / se_diff
# Degrees of freedom (Welch's formula)
df <- ((s1^2/n1 + s2^2/n2)^2) /
((s1^2/n1)^2/(n1-1) + (s2^2/n2)^2/(n2-1))
p_value <- 2 * pt(-abs(t_stat), df)
cat("t-statistic:", t_stat, "\n")
cat("degrees of freedom:", df, "\n")
cat("p-value:", p_value, "\n")
# t-statistic: -1.76
# degrees of freedom: 338.8
# p-value: 0.080
# Confidence interval
margin_error <- qt(0.975, df) * se_diff
ci_lower <- (x1_bar - x2_bar) - margin_error
ci_upper <- (x1_bar - x2_bar) + margin_error
cat("95% CI for difference: [", ci_lower, ",", ci_upper, "]\n")
# 95% CI for difference: [ -6357, 357 ]
Question 11
A weight loss program has before/after data for 10 participants. Test if the program is effective at α = 0.05.
Solution:
# Data
before <- c(65, 60, 63, 58, 70, 72, 80, 81, 83, 69)
after <- c(55, 52, 60, 55, 60, 66, 70, 66, 63, 60)
# Hypotheses:
# H₀: μd = 0
# H₁: μd < 0 (weight decreases)
# Paired t-test
differences <- after - before
d_bar <- mean(differences)
s_d <- sd(differences)
n <- length(differences)
t_stat <- d_bar / (s_d / sqrt(n))
p_value <- pt(t_stat, df = n - 1)
cat("Mean difference:", d_bar, "\n")
cat("t-statistic:", t_stat, "\n")
cat("p-value:", p_value, "\n")
# Mean difference: -9.4
# t-statistic: -5.75
# p-value: 0.000145
# Decision
if (p_value < 0.05) {
print("Reject H₀ - Program is effective")
} else {
print("Fail to reject H₀")
}
# Confidence interval
t.test(after, before, paired = TRUE,
alternative = "less", conf.level = 0.95)$conf.int
15.6 Chi-Square Tests
Question 12
Test if species and petal width category are independent in the iris dataset at α = 0.05.
Solution:
# Prepare data
iris_subset <- iris[iris$Species != "virginica", ]
iris_subset$Species <- droplevels(iris_subset$Species)  # drop unused level to avoid an empty row
iris_subset$Petal.Width.Cat <- cut(iris_subset$Petal.Width,
breaks = quantile(iris_subset$Petal.Width,
probs = c(0, 0.5, 1)),
labels = c("Below", "Above"),
include.lowest = TRUE)
# Create contingency table
cont_table <- table(iris_subset$Species, iris_subset$Petal.Width.Cat)
print(cont_table)
# Chi-square test
chi_result <- chisq.test(cont_table)
print(chi_result)
# Expected frequencies
print(chi_result$expected)
# Decision
if (chi_result$p.value < 0.05) {
print("Reject H₀ - Variables are dependent")
} else {
print("Fail to reject H₀")
}
15.7 Sample Size Determination
Question 13
How many customers should be surveyed to estimate satisfaction proportion within 3% margin of error with 95% confidence? Assume no prior estimate of p.
Solution:
# Given information
margin_error <- 0.03
conf_level <- 0.95
p_estimate <- 0.5 # Conservative estimate
# Calculate sample size
z_star <- qnorm((1 + conf_level) / 2)
n <- ceiling((z_star^2 * p_estimate * (1 - p_estimate)) / margin_error^2)
print(n) # 1068
# Function for sample size calculation
sample_size_prop <- function(margin_error, conf_level = 0.95, p = 0.5) {
z_star <- qnorm((1 + conf_level) / 2)
n <- ceiling((z_star^2 * p * (1 - p)) / margin_error^2)
return(n)
}
# With different confidence levels
cat("90% confidence:", sample_size_prop(0.03, 0.90), "\n")
cat("95% confidence:", sample_size_prop(0.03, 0.95), "\n")
cat("99% confidence:", sample_size_prop(0.03, 0.99), "\n")
15.8 Comprehensive Practice Problem
Question 14
A study examines the effect of a new medicine on cholesterol levels. Data includes 50 patients with measurements before and after treatment. Perform a complete analysis including descriptive statistics, visualization, hypothesis testing, and confidence intervals.
Solution:
# Simulate data
set.seed(123)
n <- 50
before <- rnorm(n, mean = 220, sd = 30)
after <- before - rnorm(n, mean = 25, sd = 15) # Treatment effect
cholesterol_data <- data.frame(
patient_id = 1:n,
before = before,
after = after,
difference = after - before
)
# 1. Descriptive Statistics
desc_stats <- cholesterol_data %>%
summarize(
mean_before = mean(before),
sd_before = sd(before),
mean_after = mean(after),
sd_after = sd(after),
mean_diff = mean(difference),
sd_diff = sd(difference)
)
print(desc_stats)
# 2. Visualization
library(ggplot2)
library(gridExtra)
p1 <- ggplot(cholesterol_data) +
geom_histogram(aes(x = difference), bins = 15,
fill = "lightblue", color = "black") +
geom_vline(xintercept = 0, linetype = "dashed", color = "red") +
labs(title = "Distribution of Cholesterol Changes",
x = "Change in Cholesterol (mg/dL)",
y = "Count")
p2 <- ggplot(cholesterol_data, aes(x = before, y = after)) +
geom_point() +
geom_abline(slope = 1, intercept = 0,
linetype = "dashed", color = "red") +
labs(title = "Before vs After Treatment",
x = "Before Treatment (mg/dL)",
y = "After Treatment (mg/dL)")
grid.arrange(p1, p2, ncol = 2)
# 3. Hypothesis Testing
# H₀: μd = 0, H₁: μd < 0 (treatment decreases cholesterol)
# Traditional approach
t_result <- t.test(cholesterol_data$after,
cholesterol_data$before,
paired = TRUE,
alternative = "less")
print(t_result)
# Simulation approach
obs_mean_diff <- mean(cholesterol_data$difference)
null_dist <- cholesterol_data %>%
specify(response = difference) %>%
hypothesize(null = "point", mu = 0) %>%
generate(reps = 1000, type = "bootstrap") %>%
calculate(stat = "mean")
p_value_sim <- null_dist %>%
get_p_value(obs_stat = obs_mean_diff, direction = "less")
# 4. Confidence Intervals
# Traditional CI
ci_traditional <- t.test(cholesterol_data$difference)$conf.int
# Bootstrap CI
boot_dist <- cholesterol_data %>%
specify(response = difference) %>%
generate(reps = 1000, type = "bootstrap") %>%
calculate(stat = "mean")
ci_bootstrap <- boot_dist %>%
get_ci(level = 0.95, type = "percentile")
# 5. Effect Size
library(effsize)
d <- cohen.d(cholesterol_data$after,
cholesterol_data$before,
paired = TRUE)
# 6. Complete Report
cat("CHOLESTEROL REDUCTION ANALYSIS\n")
cat("==============================\n\n")
cat("Sample size:", n, "\n")
cat("Mean reduction:", round(mean(cholesterol_data$difference), 2), "mg/dL\n")
cat("Standard deviation:", round(sd(cholesterol_data$difference), 2), "\n")
cat("\nHypothesis Test Results:\n")
cat("t-statistic:", round(t_result$statistic, 3), "\n")
cat("p-value:", format.pval(t_result$p.value), "\n")
cat("Decision: Reject H₀ at α = 0.05\n")
cat("\n95% Confidence Intervals:\n")
cat("Traditional:", round(ci_traditional[1], 2), "to",
round(ci_traditional[2], 2), "\n")
cat("Bootstrap:", round(ci_bootstrap$lower_ci, 2), "to",
round(ci_bootstrap$upper_ci, 2), "\n")
cat("\nEffect Size (Cohen's d):", round(d$estimate, 2), "\n")
cat("\nConclusion: The medicine significantly reduces cholesterol levels.")
15.9 Exam-Style Questions
Question 15 (Multiple Choice)
Which of the following increases the width of a confidence interval? a) Increasing sample size b) Decreasing confidence level c) Increasing population variance d) Decreasing margin of error
Answer: c) Increasing population variance
Explanation: Width = 2 × margin of error = 2 × z* × σ/√n. Increasing σ increases width.
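The formula can be checked numerically; a small sketch (the helper `ci_width` is illustrative, not from the original answer):

```r
# Width of a z-based CI for the mean: 2 * z* * sigma / sqrt(n)
ci_width <- function(sigma, n, conf_level = 0.95) {
  z_star <- qnorm((1 + conf_level) / 2)
  2 * z_star * sigma / sqrt(n)
}

ci_width(sigma = 10, n = 25)                     # baseline
ci_width(sigma = 20, n = 25)                     # larger variance -> wider
ci_width(sigma = 10, n = 100)                    # larger n -> narrower
ci_width(sigma = 10, n = 25, conf_level = 0.90)  # lower confidence -> narrower
```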
Question 16 (Short Answer)
Explain the difference between Type I and Type II errors in hypothesis testing.
Answer:
- Type I Error: Rejecting H₀ when it’s actually true (false positive)
- Type II Error: Failing to reject H₀ when it’s actually false (false negative)
- Type I error rate = α (significance level)
- Type II error rate = β (related to power: Power = 1 - β)
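The Type I error rate can be illustrated by simulation: when H₀ is true, a level-α test should reject about α of the time (a sketch, not part of the original answer):

```r
# Estimate the Type I error rate by repeatedly testing a true H0
set.seed(42)
alpha <- 0.05
rejections <- replicate(5000, {
  x <- rnorm(30, mean = 0, sd = 1)   # H0: mu = 0 is actually true here
  t.test(x, mu = 0)$p.value < alpha
})
mean(rejections)  # close to 0.05 by construction
```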
Question 17 (Calculation)
Given: X ~ N(μ, σ²), sample of n=16, x̄=50, s=8. Test H₀: μ=45 vs H₁: μ≠45 at α=0.05.
Solution:
# Given information
n <- 16
x_bar <- 50
s <- 8
mu_0 <- 45
alpha <- 0.05
# Test statistic
t_stat <- (x_bar - mu_0) / (s / sqrt(n))
df <- n - 1
p_value <- 2 * pt(-abs(t_stat), df)
cat("t =", t_stat, "\n") # t = 2.5
cat("df =", df, "\n") # df = 15
cat("p-value =", p_value, "\n") # p-value = 0.0247
# Critical value
t_critical <- qt(1 - alpha/2, df)
cat("Critical value:", t_critical, "\n") # 2.131
# Decision
if (abs(t_stat) > t_critical) {
print("Reject H₀")
} else {
print("Fail to reject H₀")
}
# Reject H₀
Part 10: Appendix & Reference
16. Appendix and Reference Materials
16.1 Statistical Tables
Standard Normal (Z) Table
Common critical values:
Confidence Level | α | Two-tailed z* | One-tailed z* |
---|---|---|---|
90% | 0.10 | 1.645 | 1.282 |
95% | 0.05 | 1.960 | 1.645 |
99% | 0.01 | 2.576 | 2.326 |
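These critical values can be reproduced directly with `qnorm()`:

```r
# Reproduce the z* critical values from the table
alpha <- c(0.10, 0.05, 0.01)
round(qnorm(1 - alpha / 2), 3)  # two-tailed: 1.645, 1.960, 2.576
round(qnorm(1 - alpha), 3)      # one-tailed: 1.282, 1.645, 2.326
```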
t-Distribution Critical Values
Selected values for common degrees of freedom:
df | α = 0.10 | α = 0.05 | α = 0.01 |
---|---|---|---|
5 | 2.015 | 2.571 | 4.032 |
10 | 1.812 | 2.228 | 3.169 |
15 | 1.753 | 2.131 | 2.947 |
20 | 1.725 | 2.086 | 2.845 |
30 | 1.697 | 2.042 | 2.750 |
∞ | 1.645 | 1.960 | 2.576 |
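Likewise, the t rows come from `qt()`, and converge to the z* values as df grows:

```r
# Reproduce the two-tailed t critical values from the table
df <- c(5, 10, 15, 20, 30)
sapply(c(0.10, 0.05, 0.01), function(a) round(qt(1 - a / 2, df), 3))
qt(0.975, df = 1e6)  # essentially qnorm(0.975) = 1.96
```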
16.2 R Function Reference
Base R Statistical Functions
# Probability distributions
dnorm(x, mean, sd) # Normal density
pnorm(q, mean, sd) # Normal cumulative probability
qnorm(p, mean, sd) # Normal quantile
rnorm(n, mean, sd) # Generate normal random values
# Similar functions exist for:
# t-distribution: dt(), pt(), qt(), rt()
# Chi-square: dchisq(), pchisq(), qchisq(), rchisq()
# Binomial: dbinom(), pbinom(), qbinom(), rbinom()
# Poisson: dpois(), ppois(), qpois(), rpois()
# Hypothesis tests
t.test() # t-tests (one-sample, two-sample, paired)
prop.test() # Proportion tests
chisq.test() # Chi-square test
var.test() # F-test for variance
wilcox.test() # Wilcoxon tests
# Confidence intervals
confint() # Generic confidence interval function
dplyr Package Functions
# Data manipulation
filter() # Filter rows
select() # Select columns
mutate() # Create new variables
summarize() # Compute summaries
group_by() # Group data
arrange() # Sort data
%>% # Pipe operator
infer Package Workflow
# Statistical inference pipeline
specify() # Specify variables
hypothesize() # State null hypothesis
generate() # Generate null distribution
calculate() # Calculate statistics
visualize() # Visualize results
get_p_value() # Extract p-value
get_confidence_interval() # Extract CI
shade_p_value() # Shade p-value on plot
shade_confidence_interval() # Shade CI on plot
16.3 Formula Sheet
Descriptive Statistics
- Mean: \(\bar{x} = \frac{\sum x_i}{n}\)
- Variance: \(s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1}\)
- Standard Deviation: \(s = \sqrt{s^2}\)
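These definitions match base R's built-ins, which can be confirmed on any sample:

```r
# Verify the formula-sheet definitions against mean(), var(), sd()
x <- c(4, 8, 6, 5, 3)
n <- length(x)
x_bar <- sum(x) / n
s2 <- sum((x - x_bar)^2) / (n - 1)

all.equal(x_bar, mean(x))   # TRUE
all.equal(s2, var(x))       # TRUE
all.equal(sqrt(s2), sd(x))  # TRUE
```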
Standard Error Formulas
- Mean: \(SE(\bar{X}) = \frac{\sigma}{\sqrt{n}}\) or \(\frac{s}{\sqrt{n}}\)
- Proportion: \(SE(\hat{p}) = \sqrt{\frac{p(1-p)}{n}}\) or \(\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\)
- Difference in Means: \(SE(\bar{X}_1 - \bar{X}_2) = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}\)
- Difference in Proportions: \(SE(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}\)
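Written as small helpers (the function names are illustrative), these formulas reproduce values used earlier in the notes:

```r
# Standard-error helpers matching the formulas above
se_mean <- function(s, n) s / sqrt(n)
se_prop <- function(p, n) sqrt(p * (1 - p) / n)
se_diff_means <- function(s1, n1, s2, n2) sqrt(s1^2 / n1 + s2^2 / n2)
se_diff_props <- function(p1, n1, p2, n2) {
  sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
}

se_mean(7.9, 100)     # 0.79 (tax-form survey, Question 6)
se_prop(0.073, 1000)  # about 0.0082 (games-app proportion, Question 7)
```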
Confidence Intervals
- General form: \(\text{Point Estimate} \pm \text{Critical Value} \times \text{Standard Error}\)
- Mean (σ known): \(\bar{x} \pm z_{\alpha/2} \times \frac{\sigma}{\sqrt{n}}\)
- Mean (σ unknown): \(\bar{x} \pm t_{\alpha/2,n-1} \times \frac{s}{\sqrt{n}}\)
- Proportion: \(\hat{p} \pm z_{\alpha/2} \times \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\)
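The general form translates into a one-line function (illustrative), reproducing the intervals from Questions 6 and 7:

```r
# Point estimate +/- critical value * standard error
ci <- function(estimate, critical, se) estimate + c(-1, 1) * critical * se

# Mean, sigma known (Question 6): 90% CI, about [20.8, 23.4]
ci(22.1, qnorm(0.95), 7.9 / sqrt(100))
# Proportion (Question 7): 95% CI, about [0.057, 0.089]
ci(0.073, qnorm(0.975), sqrt(0.073 * 0.927 / 1000))
```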
Test Statistics
- Z-test: \(Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}\)
- t-test: \(t = \frac{\bar{X} - \mu_0}{s/\sqrt{n}}\)
- Proportion Z-test: \(Z = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}}\)
- Chi-square: \(\chi^2 = \sum \frac{(O - E)^2}{E}\)
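Encoded as functions (names illustrative), these reproduce statistics computed in the worked questions:

```r
# Test statistics from the formulas above
z_stat      <- function(x_bar, mu0, sigma, n) (x_bar - mu0) / (sigma / sqrt(n))
t_stat      <- function(x_bar, mu0, s, n) (x_bar - mu0) / (s / sqrt(n))
prop_z_stat <- function(p_hat, p0, n) (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
chisq_stat  <- function(obs, exp) sum((obs - exp)^2 / exp)

z_stat(4.2, 3.9, 4.5, 16)     # 0.267 (Question 9)
prop_z_stat(0.73, 0.80, 100)  # -1.75 (Question 8)
t_stat(50, 45, 8, 16)         # 2.5 (Question 17)
```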
Sample Size Formulas
- For mean: \(n = \left(\frac{z_{\alpha/2} \times \sigma}{E}\right)^2\)
- For proportion: \(n = \left(\frac{z_{\alpha/2}}{E}\right)^2 \times p(1-p)\)
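The mean version of the formula (the proportion version is already coded in Question 13) might look like:

```r
# Sample size to estimate a mean within margin E, sigma assumed known
sample_size_mean <- function(E, sigma, conf_level = 0.95) {
  z_star <- qnorm((1 + conf_level) / 2)
  ceiling((z_star * sigma / E)^2)
}

sample_size_mean(E = 2, sigma = 10)  # 97
```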
16.4 Common R Programming Patterns
Data Import
# CSV files
data <- read.csv("filename.csv")
data <- read_csv("filename.csv") # tidyverse
# Excel files
library(readxl)
data <- read_excel("filename.xlsx")
# Tab-delimited files
data <- read.delim("filename.txt")
Data Cleaning
# Remove missing values
clean_data <- na.omit(data)
# Filter out specific values
filtered_data <- data %>%
filter(!is.na(variable)) %>%
filter(variable > 0)
# Create new variables
data <- data %>%
mutate(new_var = old_var * 2,
category = ifelse(value > threshold, "High", "Low"))
Visualization Templates
# Histogram with normal curve
ggplot(data, aes(x = variable)) +
geom_histogram(aes(y = after_stat(density)), bins = 30) +
stat_function(fun = dnorm,
args = list(mean = mean(data$variable),
sd = sd(data$variable)),
color = "red", size = 1)
# Box plot with points
ggplot(data, aes(x = group, y = value)) +
geom_boxplot() +
geom_jitter(width = 0.2, alpha = 0.5)
# Before-after plot
ggplot(data, aes(x = before, y = after)) +
geom_point() +
geom_abline(slope = 1, intercept = 0,
linetype = "dashed", color = "red")
16.5 Quick Reference Guide
Choosing the Right Test
Scenario | Parametric Test | Non-parametric Alternative |
---|---|---|
One sample mean | One-sample t-test | Wilcoxon signed-rank test |
Two independent means | Independent t-test | Mann-Whitney U test |
Two paired means | Paired t-test | Wilcoxon signed-rank test |
One proportion | Z-test for proportion | Binomial test |
Two proportions | Z-test for two proportions | Chi-square test |
Multiple categories | Chi-square test | Fisher’s exact test |
Variance | F-test | Levene’s test |
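As a sketch of how one row of the table translates into R calls (simulated data, illustrative only):

```r
# Parametric test and its non-parametric alternative on the same data
set.seed(1)
g1 <- rnorm(20, mean = 10, sd = 2)
g2 <- rnorm(20, mean = 12, sd = 2)

t.test(g1, g2)$p.value       # independent t-test
wilcox.test(g1, g2)$p.value  # Mann-Whitney U alternative
```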
Assumptions Checklist
- Independence: observations are independent
- Normality: data follow a normal distribution (check with a histogram, Q-Q plot, or Shapiro-Wilk test)
- Equal variances: groups have similar spread (check with an F-test or Levene's test)
- Random sampling: data were collected randomly
- Large enough sample: n ≥ 30 for the CLT to apply
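A minimal sketch of running these checks in R (simulated data; assumes two independent groups):

```r
# Quick assumption checks before a two-sample t-test
set.seed(7)
g1 <- rnorm(30, mean = 5, sd = 1)
g2 <- rnorm(30, mean = 6, sd = 1.2)

shapiro.test(g1)$p.value   # normality: a small p suggests non-normality
qqnorm(g1); qqline(g1)     # visual normality check
var.test(g1, g2)$p.value   # equal variances (F-test)
```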
16.6 Troubleshooting Guide
Common Errors and Solutions
Error: “object not found”
# Solution: Check spelling, ensure object exists
ls() # List all objects in environment
Error: “non-numeric argument to binary operator”
# Solution: Convert to numeric
data$variable <- as.numeric(data$variable)
Error: “missing values in object”
# Solution: Handle NA values
mean(data$variable, na.rm = TRUE)
complete_cases <- na.omit(data)
Error: Package not found
# Solution: Install package
install.packages("package_name")
library(package_name)
16.7 Best Practices Checklist
Before Analysis
- Check data quality and completeness
- Explore data visually
- Verify assumptions
- Define hypotheses clearly
- Choose appropriate significance level
During Analysis
- Document all steps
- Use both traditional and simulation methods when possible
- Check for outliers and influential points
- Validate results with different approaches
- Calculate effect sizes
After Analysis
- Interpret results in context
- Report both statistical and practical significance
- Include confidence intervals
- Create clear visualizations
- Write comprehensive conclusions
16.8 Additional Resources
Online Resources
- R Documentation: https://www.rdocumentation.org/
- CRAN Task Views: https://cran.r-project.org/web/views/
- Stack Overflow R tag: https://stackoverflow.com/questions/tagged/r
- RStudio Cheatsheets: https://www.rstudio.com/resources/cheatsheets/
Recommended Textbooks
- “Modern Dive: Statistical Inference via Data Science” - Ismay & Kim
- “Introduction to Statistical Learning with R” - James et al.
- “R for Data Science” - Wickham & Grolemund
- “OpenIntro Statistics” - Diez, Barr, & Çetinkaya-Rundel
R Packages for Statistical Analysis
# Core packages
install.packages(c("tidyverse", "infer", "moderndive"))
# Additional useful packages
install.packages(c(
"broom", # Tidy model outputs
"car", # Regression diagnostics
"effsize", # Effect size calculations
"pwr",      # Power analysis
"ggpubr", # Publication-ready plots
"rstatix" # Pipe-friendly statistics
))
16.9 Summary
We covered:
- Foundations: Probability, distributions, and R programming
- Estimation: Point estimates and confidence intervals
- Hypothesis Testing: One-sample and two-sample tests
- Modern Methods: Bootstrap and simulation approaches
- Practical Applications: Real-world case studies
- Best Practices: Reproducible research and clear reporting
Key takeaways:
- Always start with exploratory data analysis
- Check assumptions before applying tests
- Use both traditional and modern methods
- Focus on practical significance, not just statistical
- Report results clearly and in context
See also
- Regression
- [[ statistics/Statistical Learning ]]