Overview
Before drawing a sample, you need to answer two questions: how many
units to select, and how to divide them across strata. The
svyplan package provides tools for both.
samplyr integrates with svyplan so that
planning results flow directly into draw() and
execute().
The three packages form a pipeline:
- svyplan determines sample sizes, evaluates precision, and computes strata boundaries (planning).
- samplyr specifies and executes the sampling design (selection).
- sondage provides the low-level sampling algorithms (engine).
This vignette uses the ken_enterprises dataset, a
synthetic frame of 6,823 Kenyan establishments with revenue, employee
counts, and export status.
data(ken_enterprises)
glimpse(ken_enterprises)
#> Rows: 6,823
#> Columns: 9
#> $ enterprise_id <chr> "KEN_00001", "KEN_00002", "KEN_00003", "KEN_00004", "…
#> $ county <fct> Kiambu, Kiambu, Kiambu, Kiambu, Kiambu, Kiambu, Kiamb…
#> $ region <fct> Kiambu, Kiambu, Kiambu, Kiambu, Kiambu, Kiambu, Kiamb…
#> $ sector <fct> Chemicals & Plastics, Chemicals & Plastics, Chemicals…
#> $ size_class <fct> Small, Medium, Medium, Medium, Large, Large, Small, S…
#> $ employees <int> 13, 54, 37, 24, 107, 136, 10, 14, 43, 30, 70, 42, 84,…
#> $ revenue_millions <dbl> 25.1, 57.8, 51.7, 40.5, 73.7, 507.9, 25.5, 16.8, 39.9…
#> $ year_established <int> 1993, 2005, 2002, 2014, 1994, 2007, 2005, 1997, 1991,…
#> $ exporter <lgl> TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, …

Sample Size for a Proportion
Suppose we want to estimate the proportion of exporting firms. We know from the census that about 17% of firms export. We want a margin of error of 3 percentage points at 95% confidence.
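For intuition, the closed-form Wald calculation (which n_prop() presumably implements) can be done by hand, and it matches the call below:

```r
# Wald sample size for a proportion: n = z^2 * p * (1 - p) / moe^2,
# rounded up to the next integer
p   <- 0.17
moe <- 0.03
z   <- qnorm(0.975)            # two-sided 95% confidence
ceiling(z^2 * p * (1 - p) / moe^2)
#> [1] 603
```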
n_export <- n_prop(p = 0.17, moe = 0.03)
n_export
#> Sample size for proportion (wald)
#> n = 603 (p = 0.17, moe = 0.030)

The result is a svyplan_n object. You can pass it directly to draw():
samp <- sampling_design() |>
draw(n = n_export) |>
execute(ken_enterprises, seed = 1)
nrow(samp)
#> [1] 603

Sample Size for a Mean
We also want to estimate mean revenue with a margin of error of 30 million KES. We compute the population variance from the frame:
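The same Wald logic applies to a mean, presumably as n = z² σ² / moe²; a hand check using the frame variance of revenue_millions (about 230035.94) gives the size reported below:

```r
# Wald sample size for a mean: n = z^2 * sigma^2 / moe^2
# sigma^2 = var(ken_enterprises$revenue_millions), about 230035.94
z <- qnorm(0.975)
ceiling(z^2 * 230035.94 / 30^2)
#> [1] 982
```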
rev_var <- var(ken_enterprises$revenue_millions)
n_rev <- n_mean(var = rev_var, moe = 30)
n_rev
#> Sample size for mean
#> n = 982 (var = 230035.94, moe = 30.000)

Multiple Indicators
In practice, a survey measures several things at once.
n_multi() takes the binding constraint across
indicators:
targets <- data.frame(
name = c("exporter_rate", "mean_revenue"),
p = c(0.17, NA),
var = c(NA, rev_var),
moe = c(0.03, 30)
)
n_survey <- n_multi(targets)
n_survey
#> Multi-indicator sample size
#> n = 982 (binding: mean_revenue)
#> ---
#> name .n .binding
#> exporter_rate 603
#> mean_revenue 982 *

The sample size is driven by whichever indicator requires the most units. Pass it to draw() like any other sample size:
samp <- sampling_design() |>
draw(n = n_survey) |>
execute(ken_enterprises, seed = 2)
nrow(samp)
#> [1] 982

Power Analysis
If the goal is to detect a difference (for example, comparing
exporter rates between two regions), power_prop() gives the
required per-group sample size:
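As a hand check, the standard unpooled two-proportion formula reproduces the per-group size (it is an assumption that this is the closed form power_prop() uses):

```r
# Per-group n to detect p1 vs p2:
# n = (z_{1-alpha/2} + z_{power})^2 * (p1*(1-p1) + p2*(1-p2)) / (p1 - p2)^2
p1 <- 0.17; p2 <- 0.25
z_alpha <- qnorm(0.975)   # alpha = 0.05, two-sided
z_beta  <- qnorm(0.80)    # power = 0.80
ceiling((z_alpha + z_beta)^2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p1 - p2)^2)
#> [1] 403
```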
n_power <- power_prop(p1 = 0.17, p2 = 0.25, power = 0.8)
n_power
#> Power analysis for proportions (solved for sample size)
#> n = 403 (per group), power = 0.800, effect = 0.0800
#> (p1 = 0.170, p2 = 0.250, alpha = 0.05)

This also works directly with draw():
samp <- sampling_design() |>
draw(n = n_power) |>
execute(ken_enterprises, seed = 3)
nrow(samp)
#> [1] 403

Response Rate Adjustment
All svyplan functions accept a resp_rate parameter. The
sample size is inflated by 1 / resp_rate to account for
expected non-response:
n_prop(p = 0.17, moe = 0.03, resp_rate = 0.85)
#> Sample size for proportion (wald)
#> n = 709 (net: 603) (p = 0.17, moe = 0.030, resp_rate = 0.85)

The net sample size (after non-response) matches the original target.
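The gross figure is consistent with inflating the unrounded size before rounding up (an assumption about the order of operations; a naive ceiling(603 / 0.85) would give 710):

```r
# Inflate the unrounded Wald size by 1 / resp_rate, then round up
n_unrounded <- qnorm(0.975)^2 * 0.17 * (1 - 0.17) / 0.03^2   # 602.2554
ceiling(n_unrounded / 0.85)
#> [1] 709
```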
This works in n_mean(), n_multi(),
n_cluster(), and power_*() functions as
well.
Precision Analysis
Given a fixed sample size, how precise will the estimates be? The
prec_*() functions are the inverse of
n_*():
prec_prop(p = 0.17, n = 600)
#> Sampling precision for proportion (wald)
#> n = 600
#> se = 0.0153, moe = 0.0301, cv = 0.0902
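These figures follow from the usual Wald expressions, se = sqrt(p(1 − p)/n), moe = z · se, and cv = se/p; a quick hand check:

```r
# Wald precision for a proportion at a fixed n
p <- 0.17; n <- 600
se <- sqrt(p * (1 - p) / n)
round(c(se = se, moe = qnorm(0.975) * se, cv = se / p), 4)
```

which matches the printed 0.0153, 0.0301, and 0.0902.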
prec_mean(var = rev_var, n = 300, mu = mean(ken_enterprises$revenue_millions))
#> Sampling precision for mean
#> n = 300
#> se = 27.6909, moe = 54.2732, cv = 0.2044

Round-trip between size and precision
All n_*() and prec_*() functions are S3
generics. Pass a precision result to n_*() to recover the
sample size, or a sample size result to prec_*() to compute
the achieved precision:
s <- n_prop(p = 0.17, moe = 0.03)
p <- prec_prop(s)
p
#> Sampling precision for proportion (wald)
#> n = 603
#> se = 0.0153, moe = 0.0300, cv = 0.0900
# Recover the original n
n_prop(p)
#> Sample size for proportion (wald)
#> n = 603 (p = 0.17, moe = 0.030)

You can override a parameter on the way back:
n_prop(p, cv = 0.10)
#> Sample size for proportion (wald)
#> n = 489 (p = 0.17, cv = 0.100)

Sensitivity Analysis
How sensitive is a sample size to assumptions about the design effect
or response rate? The predict() method evaluates any
svyplan result at new parameter combinations:
x <- n_prop(p = 0.17, moe = 0.03, deff = 1.5)
predict(x, expand.grid(
deff = c(1.0, 1.5, 2.0, 2.5),
resp_rate = c(0.8, 0.9, 1.0)
))
#> deff resp_rate n se moe cv
#> 1 1.0 0.8 752.8192 0.0153064 0.03 0.09003767
#> 2 1.5 0.8 1129.2288 0.0153064 0.03 0.09003767
#> 3 2.0 0.8 1505.6384 0.0153064 0.03 0.09003767
#> 4 2.5 0.8 1882.0481 0.0153064 0.03 0.09003767
#> 5 1.0 0.9 669.1726 0.0153064 0.03 0.09003767
#> 6 1.5 0.9 1003.7590 0.0153064 0.03 0.09003767
#> 7 2.0 0.9 1338.3453 0.0153064 0.03 0.09003767
#> 8 2.5 0.9 1672.9316 0.0153064 0.03 0.09003767
#> 9 1.0 1.0 602.2554 0.0153064 0.03 0.09003767
#> 10 1.5 1.0 903.3831 0.0153064 0.03 0.09003767
#> 11 2.0 1.0 1204.5108 0.0153064 0.03 0.09003767
#> 12 2.5 1.0 1505.6384 0.0153064 0.03 0.09003767

This works for all svyplan object types: sample size, precision, cluster, and power.
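The grid values are consistent with simple scaling of the unrounded SRS size, n = n_srs × deff / resp_rate. For example, the deff = 1.5, resp_rate = 0.8 row:

```r
# Row check: deff = 1.5, resp_rate = 0.8
n_srs <- qnorm(0.975)^2 * 0.17 * (1 - 0.17) / 0.03^2   # 602.2554 (deff = 1, full response)
n_srs * 1.5 / 0.8
#> [1] 1129.229
```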
Multistage Cluster Planning
For cluster designs, n_cluster() finds the optimal
allocation across stages given costs and the degree of clustering.
Variance components
If a frame with cluster identifiers is available,
varcomp() estimates the between-cluster and within-cluster
variance components:
vc <- varcomp(
revenue_millions ~ county,
data = ken_enterprises
)
vc
#> Variance components (2-stage)
#> varb = 5.6221, varw = 70.8033
#> delta = 0.0736
#> k = 6.0984
#> Unit relvariance = 12.5320

The delta value (a measure of homogeneity) feeds directly into n_cluster() and design_effect().
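The printed delta is consistent with defining it as the between-cluster share of total variance (an assumption about varcomp()'s definition, but it matches the output above):

```r
# delta = varb / (varb + varw), using the components printed above
varb <- 5.6221; varw <- 70.8033
round(varb / (varb + varw), 4)
#> [1] 0.0736
```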
Optimal allocation
# Minimize cost to achieve CV = 0.05
n_cluster(cost = c(500, 50), delta = vc, cv = 0.05)
#> Optimal 2-stage allocation
#> n_psu = 4773 | psu_size = 12 -> total n = 57276 (unrounded: 53558)
#> cv = 0.0500, cost = 5064188
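The reported PSU size of 12 is consistent with the classical optimum m* = sqrt((c_psu / c_ssu) × (1 − δ)/δ), rounded up; whether n_cluster() uses exactly this form is an assumption:

```r
# Classical optimal cluster size under a simple two-stage cost model
delta <- 0.0736                          # measure of homogeneity from varcomp() above
cost_ratio <- 500 / 50                   # PSU cost over per-unit cost
sqrt(cost_ratio * (1 - delta) / delta)   # about 11.2, consistent with psu_size = 12
```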
# Maximize precision within a budget
n_cluster(cost = c(500, 50), delta = vc, budget = 100000)
#> Optimal 2-stage allocation
#> n_psu = 95 | psu_size = 12 -> total n = 1140 (unrounded: 1057)
#> cv = 0.3558, cost = 100000

The total sample size from n_cluster() can be passed to draw():
cl <- n_cluster(cost = c(500, 50), delta = vc, budget = 100000)
samp <- sampling_design() |>
draw(n = cl) |>
execute(ken_enterprises, seed = 5)
nrow(samp)
#> [1] 1140

Stratification Boundaries
Enterprise surveys typically stratify by size. Rather than using
predefined size classes, strata_bound() finds optimal
boundaries on a continuous variable. The cumulative square root of
frequency method minimizes the coefficient of variation for a given
number of strata and total sample size:
bounds <- strata_bound(
ken_enterprises$revenue_millions,
n_strata = 4,
method = "cumrootf",
cv = 0.05
)
bounds
#> Strata boundaries (Dalenius-Hodges, 4 strata)
#> Boundaries: 60.0, 230.0, 910.0
#> n = 101, cv = 0.0500
#> Allocation: neyman
#> ---
#> stratum lower upper N_h W_h S_h n_h
#> 1 1.3 60.0 4692 0.688 15.0 14
#> 2 60.0 230.0 1492 0.219 44.0 13
#> 3 230.0 910.0 448 0.066 179.9 16
#> 4 910.0 8725.7 191 0.028 1565.2 58

The result includes jointly optimized boundaries and stratum sample sizes. The predict() method assigns each frame unit to a stratum:
labels <- paste0("S", 1:4)
ken_enterprises$rev_stratum <- predict(
bounds,
ken_enterprises$revenue_millions,
labels = labels
)
table(ken_enterprises$rev_stratum)
#>
#> S1 S2 S3 S4
#> 4692 1492 448 191

Use the stratum allocations from strata_bound() directly as a named vector for draw(). This preserves the joint optimization between boundaries and allocation:
n_alloc <- setNames(bounds$strata$n_h, labels)
n_alloc
#> S1 S2 S3 S4
#> 14 13 16 58
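These counts follow Neyman allocation, n_h ∝ W_h × S_h. Recomputing from the rounded table values gives 13.8, 12.9, 15.9, 58.5, which round (preserving the total of 101) to 14, 13, 16, 58:

```r
# Neyman allocation from the strata_bound() table above
W_h <- c(0.688, 0.219, 0.066, 0.028)   # stratum weights
S_h <- c(15.0, 44.0, 179.9, 1565.2)    # stratum standard deviations
round(101 * W_h * S_h / sum(W_h * S_h), 1)
#> [1] 13.8 12.9 15.9 58.5
```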
samp <- sampling_design() |>
stratify_by(rev_stratum) |>
draw(n = n_alloc) |>
execute(ken_enterprises, seed = 4)
samp |>
count(rev_stratum, name = "n_sampled")
#> # A tibble: 4 × 2
#> rev_stratum n_sampled
#> <fct> <int>
#> 1 S1 14
#> 2 S2 13
#> 3 S3 16
#> 4 S4 58

Note that the boundaries and allocation are coupled. The
cumrootf method optimizes them together to achieve the
target CV. Using strata_bound() for boundaries but a
different allocation in samplyr (for example, proportional
instead of Neyman) might break this optimality. If you need a different
allocation method, compute boundaries and allocation separately.
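For intuition, the cumulative square root of frequency rule can be sketched on its own. This is a minimal generic illustration on synthetic data, not the package's implementation (which also optimizes the allocation jointly):

```r
# Dalenius-Hodges cum-sqrt(f) sketch: bin the variable, accumulate the square
# roots of the bin frequencies, and cut that cumulative scale into equal parts.
cumrootf_bounds <- function(x, n_strata, n_bins = 100) {
  br  <- seq(min(x), max(x), length.out = n_bins + 1)
  f   <- tabulate(cut(x, br, include.lowest = TRUE, labels = FALSE), n_bins)
  csf <- cumsum(sqrt(f))
  # equally spaced targets on the cum-sqrt(f) scale
  targets <- csf[n_bins] * seq_len(n_strata - 1) / n_strata
  # upper bin edge where the cumulative curve first crosses each target
  br[sapply(targets, function(t) which(csf >= t)[1]) + 1]
}

set.seed(1)
cumrootf_bounds(rlnorm(2000, meanlog = 0, sdlog = 0.5), n_strata = 4)
```

The returned cut points roughly equalize the strata on the cum-√f scale, which is what makes the method work well for skewed size variables like revenue.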
Design Effects After Sampling
After executing a design, you can evaluate its efficiency using
design_effect() and effective_n(). These are
re-exported from svyplan and work directly on
tbl_sample objects.
Kish Design Effect
The Kish design effect measures the loss of precision due to unequal weights. It equals 1 for self-weighting designs:
design_effect(samp)
#> [1] 3.811689
effective_n(samp)
#> [1] 26.49744

The effective sample size is the number of observations an SRS would need to achieve the same precision.
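Kish's design effect is deff = n Σw² / (Σw)², and effective n is n / deff. Reconstructing the design weights N_h / n_h for the stratified sample drawn above reproduces both printed values:

```r
# Design weights N_h / n_h for the four revenue strata sampled above
N_h <- c(4692, 1492, 448, 191)
n_h <- c(14, 13, 16, 58)
w   <- rep(N_h / n_h, times = n_h)
n   <- length(w)                       # 101
deff <- n * sum(w^2) / sum(w)^2        # Kish design effect, about 3.81
c(deff = deff, n_eff = n / deff)       # effective n, about 26.5
```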
Spencer Method
The Spencer method accounts for the relationship between the outcome
variable and the selection probabilities. Pass the outcome as a bare
column name. The selection probabilities are extracted automatically
from .weight_1:
design_effect(samp, y = revenue_millions, method = "spencer")
#> [1] 0.102805

Henry Method
The Henry method also accounts for a calibration covariate. Both
y and x_cal are column names in the
sample:
design_effect(samp, y = revenue_millions, x_cal = employees,
method = "henry")
#> [1] 0.04859152

Chen-Rust Decomposition
For stratified or clustered designs, the Chen-Rust (CR) method
decomposes the design effect into weighting (deff_w),
clustering (deff_c), and stratification
(deff_s) components:
cr <- design_effect(samp, y = revenue_millions, method = "cr")
cr$strata
#> stratum n_h cv2_w deff_w deff_s
#> 1 S1 14 0 1 0.003603710
#> 2 S2 13 0 1 0.004195195
#> 3 S3 16 0 1 0.006554564
#> 4 S4 58 0 1 0.009419020
cr$overall
#> [1] 0.02377249

The tbl_sample method extracts stratification and clustering variables from the stored design metadata. You only need to supply the outcome.
Cluster Planning
Before collecting data, you can anticipate the design effect for a
cluster design using the intraclass correlation coefficient (ICC) and
cluster size. This uses svyplan::design_effect() directly
(not through a tbl_sample):
design_effect(delta = 0.05, psu_size = 10)
#> [1] 1.45
design_effect(delta = 0.15, psu_size = 10)
#> [1] 2.35

With an ICC of 0.05 and 10 units per cluster, the design effect is 1.45. An ICC of 0.15 raises it to 2.35. This helps decide the number of clusters and units per cluster during the planning stage.
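These values come from the standard approximation deff = 1 + (m − 1)ρ, where m is the cluster take and ρ the ICC:

```r
# deff = 1 + (psu_size - 1) * icc
1 + (10 - 1) * 0.05
#> [1] 1.45
1 + (10 - 1) * 0.15
#> [1] 2.35
```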
Summary
A typical planning workflow looks like this:
- Determine the sample size with n_prop(), n_mean(), n_multi(), or power_prop()/power_mean().
- Evaluate precision at the chosen n with prec_prop() or prec_mean().
- Run a sensitivity analysis with predict() to check robustness.
- Optionally compute stratification boundaries with strata_bound() and assign strata with predict().
- For cluster designs, estimate variance components with varcomp() and optimize the allocation with n_cluster().
- Pass the sample size to draw() and execute the design.
- Evaluate the realized design with design_effect() and effective_n().
All svyplan sample size objects (svyplan_n,
svyplan_power, svyplan_cluster) work directly
with draw(), so the planning and execution steps connect
without manual conversion.