Survey sample size determination, precision analysis, optimal allocation, stratification, and power analysis for R.
Sample sizes
library(svyplan)
# Proportion with margin of error
n_prop(p = 0.3, moe = 0.05)
#> Sample size for proportion (wald)
#> n = 323 (p = 0.30, moe = 0.050)
# Mean with finite population and design effect
n_mean(var = 100, moe = 2, N = 5000, deff = 1.5)
#> Sample size for mean
#> n = 142 (var = 100.00, moe = 2.000, deff = 1.50)Response rate adjustment
All functions accept a resp_rate parameter. The sample size is inflated by 1 / resp_rate to account for expected non-response:
n_prop(p = 0.3, moe = 0.05, deff = 1.5, resp_rate = 0.8)
#> Sample size for proportion (wald)
#> n = 606 (net: 485) (p = 0.30, moe = 0.050, deff = 1.50, resp_rate = 0.80)Precision analysis
Given a sample size, how precise will your estimates be? The prec_*() functions are the inverse of n_*():
prec_prop(p = 0.3, n = 400)
#> Sampling precision for proportion (wald)
#> n = 400
#> se = 0.0229, moe = 0.0449, cv = 0.0764
prec_mean(var = 100, n = 400, mu = 50)
#> Sampling precision for mean
#> n = 400
#> se = 0.5000, moe = 0.9800, cv = 0.0100Round-trip between size and precision
All n_*() and prec_*() functions are S3 generics. Pass a precision result to n_*() to recover the sample size, or pass a sample size result to prec_*() to compute the achieved precision:
# Start with a precision target
s <- n_prop(p = 0.3, moe = 0.05, deff = 1.5)
# What precision does this n achieve?
p <- prec_prop(s)
p
#> Sampling precision for proportion (wald)
#> n = 485
#> se = 0.0255, moe = 0.0500, cv = 0.0850
# Recover the original n
n_prop(p)
#> Sample size for proportion (wald)
#> n = 485 (p = 0.30, moe = 0.050, deff = 1.50)Multi-indicator surveys
Household surveys track many indicators at once. n_multi() finds the sample size that satisfies all precision targets simultaneously.
targets <- data.frame(
name = c("stunting", "vaccination", "anemia"),
p = c(0.25, 0.70, 0.12),
moe = c(0.05, 0.05, 0.03),
deff = c(2.0, 1.5, 2.5)
)
n_multi(targets)
#> Multi-indicator sample size
#> n = 1127 (binding: anemia)
#> ---
#> name .n .binding
#> stunting 577
#> vaccination 485
#> anemia 1127 *Per-domain optimization works by adding domain columns to the targets data frame.
Multistage cluster designs
# Optimal 2-stage allocation within a budget
n_cluster(cost = c(500, 50), delta = 0.05, budget = 100000)
#> Optimal 2-stage allocation
#> n_psu = 85 | psu_size = 14 -> total n = 1190 (unrounded: 1159.1)
#> cv = 0.0376, cost = 100000
# Precision for a given allocation
prec_cluster(n = c(50, 12), delta = 0.05)
#> Sampling precision for 2-stage cluster
#> n_psu = 50 | psu_size = 12 -> total n = 600
#> cv = 0.0508Variance components can be estimated from frame data and passed directly to n_cluster():
set.seed(1)
frame <- data.frame(
district = rep(1:40, each = 20),
income = rep(rnorm(40, 500, 100), each = 20) + rnorm(800, 0, 50)
)
vc <- varcomp(income ~ district, data = frame)
vc
#> Variance components (2-stage)
#> varb = 0.0307, varw = 0.0105
#> delta = 0.7448
#> k = 1.0311
#> Unit relvariance = 0.0400
n_cluster(cost = c(500, 50), delta = vc, cv = 0.05)
#> Optimal 2-stage allocation
#> n_psu = 15 | psu_size = 2 -> total n = 30 (unrounded: 26.97443)
#> cv = 0.0500, cost = 8635Sensitivity analysis
predict() evaluates a result at new parameter combinations, returning a data frame suitable for plotting:
x <- n_prop(p = 0.3, moe = 0.05, deff = 1.5)
predict(x, expand.grid(
deff = c(1, 1.5, 2, 2.5),
resp_rate = c(0.7, 0.8, 0.9, 1.0)
))
#> deff resp_rate n se moe cv
#> 1 1.0 0.7 460.9751 0.02551067 0.05 0.08503558
#> 2 1.5 0.7 691.4626 0.02551067 0.05 0.08503558
#> 3 2.0 0.7 921.9501 0.02551067 0.05 0.08503558
#> 4 2.5 0.7 1152.4376 0.02551067 0.05 0.08503558
#> 5 1.0 0.8 403.3532 0.02551067 0.05 0.08503558
#> 6 1.5 0.8 605.0298 0.02551067 0.05 0.08503558
#> 7 2.0 0.8 806.7064 0.02551067 0.05 0.08503558
#> 8 2.5 0.8 1008.3829 0.02551067 0.05 0.08503558
#> 9 1.0 0.9 358.5362 0.02551067 0.05 0.08503558
#> 10 1.5 0.9 537.8042 0.02551067 0.05 0.08503558
#> 11 2.0 0.9 717.0723 0.02551067 0.05 0.08503558
#> 12 2.5 0.9 896.3404 0.02551067 0.05 0.08503558
#> 13 1.0 1.0 322.6825 0.02551067 0.05 0.08503558
#> 14 1.5 1.0 484.0238 0.02551067 0.05 0.08503558
#> 15 2.0 1.0 645.3651 0.02551067 0.05 0.08503558
#> 16 2.5 1.0 806.7064 0.02551067 0.05 0.08503558Works for all object types: sample size, precision, cluster, and power.
Strata boundaries
strata_bound() finds optimal boundaries for a continuous stratification variable.
set.seed(1)
x <- rlnorm(5000, meanlog = 6, sdlog = 1.2)
strata_bound(x, n_strata = 4, n = 300, method = "cumrootf")
#> Strata boundaries (Dalenius-Hodges, 4 strata)
#> Boundaries: 400.0, 1300.0, 3300.0
#> n = 300, cv = 0.0211
#> Allocation: neyman
#> ---
#> stratum lower upper N_h W_h S_h n_h
#> 1 4.92557 400.00 2523 0.505 107.1 44
#> 2 400.00000 1300.00 1610 0.322 250.3 66
#> 3 1300.00000 3300.00 647 0.129 544.0 58
#> 4 3300.00000 39039.61 220 0.044 3675.1 132Four methods are available: Dalenius-Hodges ("cumrootf"), geometric ("geo"), Lavallee-Hidiroglou ("lh"), and Kozak ("kozak").
Power analysis
Solve for sample size, power, or minimum detectable effect. Supports design effects, finite population correction, response rate adjustment, and panel overlap.
# Sample size to detect a 5pp change from 70% with deff = 2
power_prop(p1 = 0.70, p2 = 0.75, deff = 2.0)
#> Power analysis for proportions (solved for sample size)
#> n = 2496 (per group), power = 0.800, effect = 0.0500
#> (p1 = 0.700, p2 = 0.750, alpha = 0.05, deff = 2.00)
# MDE with n = 1500 per group
power_prop(p1 = 0.70, n = 1500, deff = 2.0)
#> Power analysis for proportions (solved for minimum detectable effect)
#> n = 1500 (per group), power = 0.800, effect = 0.0639
#> (p1 = 0.700, p2 = 0.764, alpha = 0.05, deff = 2.00)
# Means
power_mean(effect = 5, var = 200)
#> Power analysis for means (solved for sample size)
#> n = 126 (per group), power = 0.800, effect = 5.0000
#> (alpha = 0.05)plot() draws the power-vs-sample-size curve with reference lines at the solved point:
pw <- power_prop(p1 = 0.70, p2 = 0.75, power = 0.80, deff = 2.0)
plot(pw)
Design effects
# Planning: expected cluster design effect
design_effect(delta = 0.05, psu_size = 20, method = "cluster")
#> [1] 1.95
# Diagnostic: Kish design effect from weights
set.seed(1)
w <- runif(500, 0.5, 4)
design_effect(w, method = "kish")
#> [1] 1.196494
effective_n(w, method = "kish")
#> [1] 417.8876