Overview
Before drawing a sample, you need to answer two questions: how many
units to select, and how to divide them across strata. The
svyplan package provides tools for both.
samplyr integrates with svyplan so that
planning results flow directly into draw() and
execute().
The three packages form a pipeline:
- svyplan determines sample sizes, evaluates precision, and computes strata boundaries (planning).
- samplyr specifies and executes the sampling design (selection).
- sondage provides the low-level sampling algorithms (engine).
This vignette uses the ken_enterprises dataset, a
synthetic frame of 17,004 Kenyan establishments with revenue, employee
counts, and export status.
data(ken_enterprises)
glimpse(ken_enterprises)
#> Rows: 17,004
#> Columns: 9
#> $ enterprise_id <chr> "KEN_00001", "KEN_00002", "KEN_00003", "KEN_00004", "…
#> $ county <fct> Kiambu, Kiambu, Kiambu, Kiambu, Kiambu, Kirinyaga, Mu…
#> $ region <fct> Central, Central, Central, Central, Central, Central,…
#> $ sector <fct> Chemicals & Chemical Products, Chemicals & Chemical P…
#> $ size_class <fct> Large, Medium, Small, Small, Small, Small, Small, Med…
#> $ employees <int> 179, 43, 13, 9, 15, 8, 17, 44, 10, 42, 6, 11, 424, 39…
#> $ revenue_millions <dbl> 265.5, 108.0, 25.6, 32.9, 45.7, 15.0, 40.0, 67.3, 40.…
#> $ year_established <int> 1982, 1994, 1982, 2002, 2014, 2011, 1992, 2016, 2012,…
#> $ exporter         <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE,…

Sample Size for a Proportion
Suppose we want to estimate the proportion of exporting firms. We know from the census that about 17% of firms export. We want a margin of error of 3 percentage points at 95% confidence.
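Under the Wald approximation, the required size is n = z²·p(1 − p)/moe², rounded up. A quick base-R check of the arithmetic (this reproduces the figure below, assuming no finite population correction):

```r
# Wald sample size for a proportion: n = z^2 * p * (1 - p) / moe^2
z <- qnorm(0.975)   # two-sided 95% confidence
p <- 0.17
moe <- 0.03
ceiling(z^2 * p * (1 - p) / moe^2)
#> [1] 603
```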
n_export <- n_prop(p = 0.17, moe = 0.03)
n_export
#> Sample size for proportion (wald)
#> n = 603 (p = 0.17, moe = 0.030)

The result is a svyplan_n object. You can pass it
directly to draw():
samp <- sampling_design() |>
draw(n = n_export) |>
execute(ken_enterprises, seed = 1)
nrow(samp)
#> [1] 603

Sample Size for a Mean
We also want to estimate mean revenue with a margin of error of 30 million KES. We compute the population variance from the frame:
rev_var <- var(ken_enterprises$revenue_millions)
n_rev <- n_mean(var = rev_var, moe = 30)
n_rev
#> Sample size for mean
#> n = 611 (var = 143091.44, moe = 30.000)

Multiple Indicators
In practice, a survey measures several things at once.
n_multi() finds the binding constraint across the
indicators:
targets <- data.frame(
name = c("exporter_rate", "mean_revenue"),
p = c(0.17, NA),
var = c(NA, rev_var),
moe = c(0.03, 30)
)
n_survey <- n_multi(targets)
n_survey
#> Multi-indicator sample size
#> n = 611 (binding: mean_revenue)
#> ---
#> name .n .cv_target .cv_achieved .binding
#> exporter_rate 603 0.09003767 0.08940893
#> mean_revenue 611 NA NA *

The sample size is driven by whichever indicator requires the most
units. Pass it to draw() like any other sample size:
samp <- sampling_design() |>
draw(n = n_survey) |>
execute(ken_enterprises, seed = 2)
nrow(samp)
#> [1] 611

Power Analysis
If the goal is to detect a difference (for example, comparing
exporter rates between two regions), power_prop() gives the
required per-group sample size:
n_power <- power_prop(p1 = 0.17, p2 = 0.25, power = 0.8)
n_power
#> Power analysis for proportions (solved for sample size)
#> n = 403 (per group), power = 0.800, effect = 0.0800
#> (p1 = 0.170, p2 = 0.250, alpha = 0.05)

This also works directly with draw():
samp <- sampling_design() |>
draw(n = n_power) |>
execute(ken_enterprises, seed = 3)
nrow(samp)
#> [1] 403

Response Rate Adjustment
All svyplan functions accept a resp_rate parameter. The
sample size is inflated by 1 / resp_rate to account for
expected non-response:
n_prop(p = 0.17, moe = 0.03, resp_rate = 0.85)
#> Sample size for proportion (wald)
#> n = 709 (net: 603) (p = 0.17, moe = 0.030, resp_rate = 0.85)

The net sample size (after non-response) matches the original target.
This works in n_mean(), n_multi(),
n_cluster(), and power_*() functions as
well.
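The inflation appears to be applied to the unrounded size before the final ceiling, which is why the gross figure above is 709 rather than ceiling(603 / 0.85) = 710. A base-R sketch, assuming that order of operations:

```r
# Inflate the unrounded Wald n by 1 / resp_rate, then round up
z <- qnorm(0.975)
n_raw <- z^2 * 0.17 * (1 - 0.17) / 0.03^2   # unrounded n = 602.2554
ceiling(n_raw / 0.85)                        # gross sample to field
#> [1] 709
```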
Precision Analysis
Given a fixed sample size, how precise will the estimates be? The
prec_*() functions are the inverse of
n_*():
prec_prop(p = 0.17, n = 600)
#> Sampling precision for proportion (wald)
#> n = 600
#> se = 0.0153, moe = 0.0301, cv = 0.0902
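These figures follow from the standard SRS formulas se = sqrt(p(1 − p)/n), moe = z·se, and cv = se/p, which you can verify directly in base R:

```r
# SRS precision for a proportion at n = 600
p <- 0.17; n <- 600
se <- sqrt(p * (1 - p) / n)
round(c(se, qnorm(0.975) * se, se / p), 4)   # se, moe, cv
#> [1] 0.0153 0.0301 0.0902
```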
prec_mean(var = rev_var, n = 300, mu = mean(ken_enterprises$revenue_millions))
#> Sampling precision for mean
#> n = 300
#> se = 21.8397, moe = 42.8050, cv = 0.2485

Round-trip between size and precision
All n_*() and prec_*() functions are S3
generics. Pass a precision result to n_*() to recover the
sample size, or a sample size result to prec_*() to compute
the achieved precision:
s <- n_prop(p = 0.17, moe = 0.03)
p <- prec_prop(s)
p
#> Sampling precision for proportion (wald)
#> n = 603
#> se = 0.0153, moe = 0.0300, cv = 0.0900
# Recover the original n
n_prop(p)
#> Sample size for proportion (wald)
#> n = 603 (p = 0.17, moe = 0.030)

You can override a parameter on the way back:
n_prop(p, cv = 0.10)
#> Sample size for proportion (wald)
#> n = 489 (p = 0.17, cv = 0.100)

Sensitivity Analysis
How sensitive is a sample size to assumptions about the design effect
or response rate? The predict() method evaluates any
svyplan result at new parameter combinations:
x <- n_prop(p = 0.17, moe = 0.03, deff = 1.5)
predict(x, expand.grid(
deff = c(1.0, 1.5, 2.0, 2.5),
resp_rate = c(0.8, 0.9, 1.0)
))
#> deff resp_rate n se moe cv
#> 1 1.0 0.8 752.8192 0.0153064 0.03 0.09003767
#> 2 1.5 0.8 1129.2288 0.0153064 0.03 0.09003767
#> 3 2.0 0.8 1505.6384 0.0153064 0.03 0.09003767
#> 4 2.5 0.8 1882.0481 0.0153064 0.03 0.09003767
#> 5 1.0 0.9 669.1726 0.0153064 0.03 0.09003767
#> 6 1.5 0.9 1003.7590 0.0153064 0.03 0.09003767
#> 7 2.0 0.9 1338.3453 0.0153064 0.03 0.09003767
#> 8 2.5 0.9 1672.9316 0.0153064 0.03 0.09003767
#> 9 1.0 1.0 602.2554 0.0153064 0.03 0.09003767
#> 10 1.5 1.0 903.3831 0.0153064 0.03 0.09003767
#> 11 2.0 1.0 1204.5108 0.0153064 0.03 0.09003767
#> 12 2.5 1.0 1505.6384 0.0153064 0.03 0.09003767

This works for all svyplan object types: sample size, precision, cluster, and power.
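The pattern in the table is simple: n scales linearly in deff and in 1/resp_rate, while se, moe, and cv stay at their targets. For instance, the deff = 1.5, resp_rate = 0.8 row:

```r
# Base Wald n at deff = 1, then scale by deff / resp_rate
n_base <- qnorm(0.975)^2 * 0.17 * 0.83 / 0.03^2   # 602.2554
n_base * 1.5 / 0.8
#> [1] 1129.229
```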
Multistage Cluster Planning
For cluster designs, n_cluster() finds the optimal
allocation across stages given costs and the degree of clustering.
Variance components
If a frame with cluster identifiers is available,
varcomp() estimates the between-cluster and within-cluster
variance components:
vc <- varcomp(
revenue_millions ~ county,
data = ken_enterprises
)
vc
#> Variance components (2-stage)
#> varb = 16.8567, varw = 309.8482
#> delta = 0.0516
#> k = 17.6378
#> Unit relvariance = 18.5229

The delta value (measure of homogeneity) feeds directly
into n_cluster() and design_effect().
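Here delta is the intraclass correlation implied by the components, δ = varb/(varb + varw). Reproducing it from the printed values:

```r
# delta (measure of homogeneity) from the variance components
varb <- 16.8567; varw <- 309.8482
round(varb / (varb + varw), 4)
#> [1] 0.0516
```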
Optimal allocation
# Minimize cost to achieve CV = 0.05
n_cluster(stage_cost = c(500, 50), delta = vc, cv = 0.05)
#> Optimal 2-stage allocation
#> n_psu = 15885 | psu_size = 14 -> total n = 222390 (unrounded: 215354.9)
#> cv = 0.0500, cost = 18709868
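The optimal PSU size follows the classical cost-variance trade-off, m_opt = sqrt((c1/c2)·(1 − δ)/δ). It depends only on the stage-cost ratio and delta, not on the precision target, which is why both calls report psu_size = 14:

```r
# Optimal units per PSU from stage costs and delta
c1 <- 500; c2 <- 50; delta <- 0.0516
round(sqrt((c1 / c2) * (1 - delta) / delta), 2)   # rounds to 14 per PSU
#> [1] 13.56
```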
# Maximize precision within a budget
n_cluster(stage_cost = c(500, 50), delta = vc, budget = 100000)
#> Optimal 2-stage allocation
#> n_psu = 85 | psu_size = 14 -> total n = 1190 (unrounded: 1151.023)
#> cv = 0.6839, cost = 100000

The total sample size from n_cluster() can be passed to
draw():
cl <- n_cluster(stage_cost = c(500, 50), delta = vc, budget = 100000)
samp <- sampling_design() |>
draw(n = cl) |>
execute(ken_enterprises, seed = 5)
nrow(samp)
#> [1] 1190

Stratification Boundaries
Enterprise surveys typically stratify by size. Rather than using
predefined size classes, strata_bound() finds optimal
boundaries on a continuous variable. The cumulative square root
frequency (cum√f) method minimizes the coefficient of variation for a
given number of strata and total sample size:
bounds <- strata_bound(
ken_enterprises$revenue_millions,
n_strata = 4,
method = "cumrootf",
cv = 0.05
)
bounds
#> Strata boundaries (Dalenius-Hodges, 4 strata)
#> Boundaries: 40.0, 165.0, 745.0
#> n = 126, cv = 0.0500
#> Allocation: neyman
#> ---
#> stratum lower upper N share sd n
#> 1 1.1 40 12262 0.721 9.3 16
#> 2 40.0 165 3464 0.204 31.8 15
#> 3 165.0 745 929 0.055 145.9 19
#> 4 745.0 10104 349 0.021 1553.5 76

The result includes jointly optimized boundaries and stratum sample
sizes. The predict() method assigns each frame unit to a
stratum:
labels <- paste0("S", 1:4)
ken_enterprises$rev_stratum <- predict(
bounds,
ken_enterprises$revenue_millions,
labels = labels
)
table(ken_enterprises$rev_stratum)
#>
#> S1 S2 S3 S4
#> 12262 3464 929 349

Use the stratum allocations from strata_bound() directly
as a named vector for draw(). This preserves the joint
optimization between boundaries and allocation:
n_alloc <- setNames(bounds$strata$n, labels)
n_alloc
#> S1 S2 S3 S4
#> 16 15 19 76
samp <- sampling_design() |>
stratify_by(rev_stratum) |>
draw(n = n_alloc) |>
execute(ken_enterprises, seed = 4)
samp |>
count(rev_stratum, name = "n_sampled")
#> # A tibble: 4 × 2
#> rev_stratum n_sampled
#> <fct> <int>
#> 1 S1 16
#> 2 S2 15
#> 3 S3 19
#> 4 S4 76

Note that the boundaries and allocation are coupled. The
cumrootf method optimizes them together to achieve the
target CV. Using strata_bound() for boundaries but a
different allocation in samplyr (for example, proportional
instead of Neyman) might break this optimality. If you need a different
allocation method, compute boundaries and allocation separately.
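For intuition, the cum√f cut rule itself can be sketched in a few lines of base R: bin the variable, accumulate the square roots of the bin frequencies, and split that cumulative scale into equal-width intervals. This is a simplified illustration only; strata_bound() additionally optimizes the allocation jointly with the boundaries:

```r
# Simplified cumulative-sqrt(f) boundaries (Dalenius-Hodges style)
cum_sqrt_f_bounds <- function(x, n_strata, n_bins = 100) {
  h <- hist(x, breaks = n_bins, plot = FALSE)
  csf <- cumsum(sqrt(h$counts))              # cumulative sqrt frequency
  # interior cut points that split the csf scale into equal widths
  targets <- seq(0, max(csf), length.out = n_strata + 1)[2:n_strata]
  sapply(targets, function(t) h$breaks[which(csf >= t)[1] + 1])
}
```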
Stratified Allocation
When strata are defined by administrative or domain boundaries rather
than optimal boundaries from strata_bound(), use
n_alloc() to compute the allocation. It supports Neyman,
proportional, optimal (cost-weighted), and power allocation with
optional constraints on minimum stratum size.
First, summarize the frame to get stratum population sizes and variabilities:
alloc_frame <- ken_enterprises |>
summarize(
N = n(),
sd = sd(revenue_millions),
mean = mean(revenue_millions),
.by = sector
) |>
rename(stratum = sector)
alloc_frame
#> # A tibble: 7 × 4
#> stratum N sd mean
#> <fct> <int> <dbl> <dbl>
#> 1 Chemicals & Chemical Products 282 878. 273.
#> 2 Construction 2343 315. 63.4
#> 3 Food 856 698. 228.
#> 4 Hotels and Restaurants 3065 190. 48.9
#> 5 Other Manufacturing 2113 537. 177.
#> 6 Other Services 6187 246. 58.2
#> 7 Retail 2158 440. 87.8

Then compute the allocation. Here we use Neyman allocation targeting 600 units with a minimum of 20 per stratum:
alloc <- n_alloc(alloc_frame, n = 600, alloc = "neyman", min_n = 20)
alloc
#> Stratum allocation (neyman, 7 strata)
#> n = 600, cv = 0.1543, se = 13.5612
#> (min_n = 20)
alloc$detail[, c("stratum", "N", "n_int", "weight")]
#> stratum N n_int weight
#> 1 Chemicals & Chemical Products 282 26 10.95380
#> 2 Construction 2343 77 30.54684
#> 3 Food 856 62 13.79354
#> 4 Hotels and Restaurants 3065 60 50.63273
#> 5 Other Manufacturing 2113 118 17.91983
#> 6 Other Services 6187 158 39.08919
#> 7 Retail 2158 99 21.84907

The result passes directly to draw(). The per-stratum
allocations are matched by stratum name to the values of the
stratification variable:
samp <- sampling_design() |>
stratify_by(sector) |>
draw(n = alloc) |>
execute(ken_enterprises, seed = 10)
samp |>
count(sector, name = "n_sampled")
#> # A tibble: 7 × 2
#> sector n_sampled
#> <fct> <int>
#> 1 Food 62
#> 2 Chemicals & Chemical Products 26
#> 3 Other Manufacturing 118
#> 4 Construction 77
#> 5 Retail 99
#> 6 Hotels and Restaurants 60
#> 7 Other Services 158

Because n_alloc() already provides per-stratum sizes, do
not combine it with alloc in stratify_by().
The two allocation mechanisms are mutually exclusive: either let
svyplan allocate (via n_alloc()) or let
samplyr allocate (via alloc in
stratify_by()), but not both.
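Neyman allocation assigns n_h ∝ N_h·S_h. Plugging in the (rounded) N and sd printed above approximately reproduces the allocation, up to integer rounding of the stratum sizes:

```r
# Neyman shares from the printed stratum sizes and sds
N  <- c(282, 2343, 856, 3065, 2113, 6187, 2158)
sd <- c(878, 315, 698, 190, 537, 246, 440)
round(600 * N * sd / sum(N * sd))   # n_h proportional to N_h * S_h
```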
Budget-constrained allocation
When field costs vary across strata, n_alloc() can
optimize under a budget constraint using cost-weighted (optimal)
allocation:
alloc_frame_cost <- alloc_frame |>
mutate(cost = c(120, 200, 150, 130, 80, 100, 90))
budget_alloc <- n_alloc(
alloc_frame_cost,
budget = 50000,
alloc = "optimal",
min_n = 20
)
budget_alloc
#> Stratum allocation (optimal, 7 strata)
#> n = 451, cv = 0.1811, se = 15.9212
#> (min_n = 20)
budget_alloc$detail[, c("stratum", "N", "cost", "n_int", "weight")]
#> stratum N cost n_int weight
#> 1 Chemicals & Chemical Products 282 120 20 14.10000
#> 2 Construction 2343 200 42 55.44653
#> 3 Food 856 150 40 21.68276
#> 4 Hotels and Restaurants 3065 130 41 74.09623
#> 5 Other Manufacturing 2113 80 103 20.57178
#> 6 Other Services 6187 100 123 50.17066
#> 7 Retail 2158 90 81 26.60402
samp_budget <- sampling_design() |>
stratify_by(sector) |>
draw(n = budget_alloc) |>
execute(ken_enterprises, seed = 11)
samp_budget |>
count(sector, name = "n_sampled")
#> # A tibble: 7 × 2
#> sector n_sampled
#> <fct> <int>
#> 1 Food 40
#> 2 Chemicals & Chemical Products 20
#> 3 Other Manufacturing 103
#> 4 Construction 42
#> 5 Retail 81
#> 6 Hotels and Restaurants 41
#> 7 Other Services 123

Design Effects After Sampling
After executing a design, you can evaluate its efficiency using
design_effect() and effective_n(). These are
re-exported from svyplan and work directly on
tbl_sample objects.
Kish Design Effect
The Kish design effect measures the loss of precision due to unequal weights. It equals 1 for self-weighting designs:
design_effect(samp)
#> [1] 1.182607
effective_n(samp)
#> [1] 507.3536

The effective sample size is the number of observations an SRS would need to achieve the same precision.
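The Kish design effect is n·Σw²/(Σw)², i.e. 1 plus the squared coefficient of variation of the weights, and the effective sample size is just n/deff (here 600/1.1826 ≈ 507). A base-R illustration with toy weights:

```r
# Kish design effect: 1 for equal weights, > 1 otherwise
kish_deff <- function(w) length(w) * sum(w^2) / sum(w)^2
kish_deff(rep(2, 10))                 # self-weighting
#> [1] 1
kish_deff(c(rep(1, 5), rep(3, 5)))    # unequal weights inflate variance
#> [1] 1.25
```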
Spencer Method
The Spencer method accounts for the relationship between the outcome
variable and the selection probabilities. Pass the outcome as a bare
column name. The selection probabilities are extracted automatically
from .weight_1:
design_effect(samp, y = revenue_millions, method = "spencer")
#> [1] 0.8225433

Henry Method
The Henry method additionally accounts for a calibration covariate.
Both y and x_cal are column names in the
sample:
design_effect(samp, y = revenue_millions, x_cal = employees,
method = "henry")
#> [1] 0.3754297

Chen-Rust Decomposition
For stratified or clustered designs, the Chen-Rust (CR) method
decomposes the design effect into weighting (deff_w),
clustering (deff_c), and stratification
(deff_s) components:
cr <- design_effect(samp, y = revenue_millions, method = "cr")
cr$strata
#> stratum n_h cv2_w deff_w deff_s
#> 1 Food 62 0 1 0.020888198
#> 2 Chemicals & Chemical Products 26 0 1 0.023398167
#> 3 Other Manufacturing 118 0 1 0.127381354
#> 4 Construction 77 0 1 0.006778693
#> 5 Retail 99 0 1 0.374431784
#> 6 Hotels and Restaurants 60 0 1 0.014352966
#> 7 Other Services 158 0 1 0.256829835
cr$overall
#> [1] 0.824061The tbl_sample method extracts stratification and
clustering variables from the stored design metadata. You only need to
supply the outcome.
Cluster Planning
Before collecting data, you can anticipate the design effect for a
cluster design using the intraclass correlation coefficient (ICC) and
cluster size. This uses svyplan::design_effect() directly
(not through a tbl_sample):
design_effect(delta = 0.05, psu_size = 10)
#> [1] 1.45
design_effect(delta = 0.15, psu_size = 10)
#> [1] 2.35

With an ICC of 0.05 and 10 units per cluster, the design effect is 1.45. An ICC of 0.15 raises it to 2.35. This helps decide the number of clusters and units per cluster during the planning stage.
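This is the standard cluster design effect formula, deff = 1 + (m − 1)·δ, where m is the PSU size and δ the ICC:

```r
# Anticipated design effect for a cluster design
deff_cluster <- function(delta, psu_size) 1 + (psu_size - 1) * delta
deff_cluster(0.05, 10)
#> [1] 1.45
deff_cluster(0.15, 10)
#> [1] 2.35
```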
Summary
The typical planning workflow:

- Determine sample size with n_prop(), n_mean(), n_multi(), or power_prop()/power_mean().
- Evaluate precision at the chosen n with prec_prop() and prec_mean().
- Run a sensitivity analysis with predict() to check robustness.
- Optionally create stratification boundaries with strata_bound() and assign strata with predict().
- Allocate across strata with n_alloc() (Neyman, proportional, optimal, or power allocation, with optional constraints).
- For cluster designs, estimate variance components with varcomp() and optimize the allocation with n_cluster().
- Pass the sample size or allocation to draw() and execute the design.
- Evaluate the realized design with design_effect() and effective_n().
All svyplan sample size objects (svyplan_n,
svyplan_power, svyplan_cluster) work directly
with draw(). For n_alloc(), the per-stratum
allocations are passed through automatically. For other objects, the
scalar total is extracted. No manual conversion is needed.