Overview
Before drawing a sample, you need to answer two questions: how many
units to select, and how to divide them across strata. The
svyplan package provides tools for both.
samplyr integrates with svyplan so that
planning results flow directly into draw() and
execute().
The three packages form a pipeline:
- svyplan determines sample sizes, evaluates precision, and computes strata boundaries (planning).
- samplyr specifies and executes the sampling design (selection).
- sondage provides the low-level sampling algorithms (engine).
This vignette uses the ken_enterprises dataset, a
synthetic frame of 17,004 Kenyan establishments with revenue, employee
counts, and export status.
data(ken_enterprises)
glimpse(ken_enterprises)
#> Rows: 17,004
#> Columns: 9
#> $ enterprise_id <chr> "KEN_00001", "KEN_00002", "KEN_00003", "KEN_00004", "…
#> $ county <fct> Kiambu, Kiambu, Kiambu, Kiambu, Kiambu, Kirinyaga, Mu…
#> $ region <fct> Central, Central, Central, Central, Central, Central,…
#> $ sector <fct> Chemicals & Chemical Products, Chemicals & Chemical P…
#> $ size_class <fct> Large, Medium, Small, Small, Small, Small, Small, Med…
#> $ employees <int> 179, 43, 13, 9, 15, 8, 17, 44, 10, 42, 6, 11, 424, 39…
#> $ revenue_millions <dbl> 265.5, 108.0, 25.6, 32.9, 45.7, 15.0, 40.0, 67.3, 40.…
#> $ year_established <int> 1982, 1994, 1982, 2002, 2014, 2011, 1992, 2016, 2012,…
#> $ exporter         <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE,…

Sample Size for a Proportion
Suppose we want to estimate the proportion of exporting firms. We know from the census that about 17% of firms export. We want a margin of error of 3 percentage points at 95% confidence.
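Under the Wald approximation, the required size is n = z²·p(1 − p)/moe², rounded up. A quick base-R check of the arithmetic (this reproduces the figure below, assuming no finite population correction):

```r
# Wald sample size for a proportion: n = z^2 * p * (1 - p) / moe^2
z <- qnorm(0.975)   # two-sided 95% confidence
p <- 0.17
moe <- 0.03
ceiling(z^2 * p * (1 - p) / moe^2)
#> [1] 603
```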
n_export <- n_prop(p = 0.17, moe = 0.03)
n_export
#> Sample size for proportion (wald)
#> n = 603 (p = 0.17, moe = 0.030)

The result is a svyplan_n object. You can pass it
directly to draw():
samp <- sampling_design() |>
draw(n = n_export) |>
execute(ken_enterprises, seed = 1)
nrow(samp)
#> [1] 603

Sample Size for a Mean
We also want to estimate mean revenue with a margin of error of 30 million KES. We compute the population variance from the frame:
rev_var <- var(ken_enterprises$revenue_millions)
n_rev <- n_mean(var = rev_var, moe = 30)
n_rev
#> Sample size for mean
#> n = 611 (var = 143091.44, moe = 30.000)

Multiple Indicators
In practice, a survey measures several things at once.
n_multi() finds the binding constraint across the
indicators:
targets <- data.frame(
name = c("exporter_rate", "mean_revenue"),
p = c(0.17, NA),
var = c(NA, rev_var),
moe = c(0.03, 30)
)
n_survey <- n_multi(targets)
n_survey
#> Multi-indicator sample size
#> n = 611 (binding: mean_revenue)
#> ---
#> name .n .cv_target .cv_achieved .binding
#> exporter_rate 603 0.09003767 0.08940893
#> mean_revenue 611 NA NA *

The sample size is driven by whichever indicator requires the most
units. Pass it to draw() like any other sample size:
samp <- sampling_design() |>
draw(n = n_survey) |>
execute(ken_enterprises, seed = 2)
nrow(samp)
#> [1] 611

Power Analysis
If the goal is to detect a difference (for example, comparing
exporter rates between two regions), power_prop() gives the
required per-group sample size:
n_power <- power_prop(p1 = 0.17, p2 = 0.25, power = 0.8)
n_power
#> Power analysis for proportions (solved for sample size)
#> n = 403 (per group), power = 0.800, effect = 0.0800
#> (p1 = 0.170, p2 = 0.250, alpha = 0.05)

This also works directly with draw():
samp <- sampling_design() |>
draw(n = n_power) |>
execute(ken_enterprises, seed = 3)
nrow(samp)
#> [1] 403

Response Rate Adjustment
All svyplan functions accept a resp_rate parameter. The
sample size is inflated by 1 / resp_rate to account for
expected non-response:
n_prop(p = 0.17, moe = 0.03, resp_rate = 0.85)
#> Sample size for proportion (wald)
#> n = 709 (net: 603) (p = 0.17, moe = 0.030, resp_rate = 0.85)

The net sample size (after non-response) matches the original target.
This works in n_mean(), n_multi(),
n_cluster(), and power_*() functions as
well.
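The inflation appears to be applied to the unrounded size before the final ceiling, which is why the gross figure above is 709 rather than ceiling(603 / 0.85) = 710. A base-R sketch, assuming that order of operations:

```r
# Inflate the unrounded Wald n by 1 / resp_rate, then round up
z <- qnorm(0.975)
n_raw <- z^2 * 0.17 * (1 - 0.17) / 0.03^2   # unrounded n = 602.2554
ceiling(n_raw / 0.85)                        # gross sample to field
#> [1] 709
```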
Precision Analysis
Given a fixed sample size, how precise will the estimates be? The
prec_*() functions are the inverse of
n_*():
prec_prop(p = 0.17, n = 600)
#> Sampling precision for proportion (wald)
#> n = 600
#> se = 0.0153, moe = 0.0301, cv = 0.0902
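These figures follow from the standard SRS formulas se = sqrt(p(1 − p)/n), moe = z·se, and cv = se/p, which you can verify directly in base R:

```r
# SRS precision for a proportion at n = 600
p <- 0.17; n <- 600
se <- sqrt(p * (1 - p) / n)
round(c(se, qnorm(0.975) * se, se / p), 4)   # se, moe, cv
#> [1] 0.0153 0.0301 0.0902
```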
prec_mean(var = rev_var, n = 300, mu = mean(ken_enterprises$revenue_millions))
#> Sampling precision for mean
#> n = 300
#> se = 21.8397, moe = 42.8050, cv = 0.2485

Round-trip between size and precision
All n_*() and prec_*() functions are S3
generics. Pass a precision result to n_*() to recover the
sample size, or a sample size result to prec_*() to compute
the achieved precision:
s <- n_prop(p = 0.17, moe = 0.03)
p <- prec_prop(s)
p
#> Sampling precision for proportion (wald)
#> n = 603
#> se = 0.0153, moe = 0.0300, cv = 0.0900
# Recover the original n
n_prop(p)
#> Sample size for proportion (wald)
#> n = 603 (p = 0.17, moe = 0.030)

You can override a parameter on the way back:
n_prop(p, cv = 0.10)
#> Sample size for proportion (wald)
#> n = 489 (p = 0.17, cv = 0.100)

Sensitivity Analysis
How sensitive is a sample size to assumptions about the design effect
or response rate? The predict() method evaluates any
svyplan result at new parameter combinations:
x <- n_prop(p = 0.17, moe = 0.03, deff = 1.5)
predict(x, expand.grid(
deff = c(1.0, 1.5, 2.0, 2.5),
resp_rate = c(0.8, 0.9, 1.0)
))
#> deff resp_rate n se moe cv
#> 1 1.0 0.8 752.8192 0.0153064 0.03 0.09003767
#> 2 1.5 0.8 1129.2288 0.0153064 0.03 0.09003767
#> 3 2.0 0.8 1505.6384 0.0153064 0.03 0.09003767
#> 4 2.5 0.8 1882.0481 0.0153064 0.03 0.09003767
#> 5 1.0 0.9 669.1726 0.0153064 0.03 0.09003767
#> 6 1.5 0.9 1003.7590 0.0153064 0.03 0.09003767
#> 7 2.0 0.9 1338.3453 0.0153064 0.03 0.09003767
#> 8 2.5 0.9 1672.9316 0.0153064 0.03 0.09003767
#> 9 1.0 1.0 602.2554 0.0153064 0.03 0.09003767
#> 10 1.5 1.0 903.3831 0.0153064 0.03 0.09003767
#> 11 2.0 1.0 1204.5108 0.0153064 0.03 0.09003767
#> 12 2.5 1.0 1505.6384 0.0153064 0.03 0.09003767

This works for all svyplan object types: sample size, precision, cluster, and power.
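The pattern in the table is simple: n scales linearly in deff and in 1/resp_rate, while se, moe, and cv stay at their targets. For instance, the deff = 1.5, resp_rate = 0.8 row:

```r
# Base Wald n at deff = 1, then scale by deff / resp_rate
n_base <- qnorm(0.975)^2 * 0.17 * 0.83 / 0.03^2   # 602.2554
n_base * 1.5 / 0.8
#> [1] 1129.229
```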
Multistage Cluster Planning
For cluster designs, n_cluster() finds the optimal
allocation across stages given costs and the degree of clustering.
Variance components
If a frame with cluster identifiers is available,
varcomp() estimates the between-cluster and within-cluster
variance components:
vc <- varcomp(
revenue_millions ~ county,
data = ken_enterprises
)
vc
#> Variance components (2-stage)
#> varb = 16.8567, varw = 309.8482
#> delta = 0.0516
#> k = 17.6378
#> Unit relvariance = 18.5229

The delta value (measure of homogeneity) feeds directly
into n_cluster() and design_effect().
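Here delta is the intraclass correlation implied by the components, δ = varb/(varb + varw). Reproducing it from the printed values:

```r
# delta (measure of homogeneity) from the variance components
varb <- 16.8567; varw <- 309.8482
round(varb / (varb + varw), 4)
#> [1] 0.0516
```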
Optimal allocation
# Minimize cost to achieve CV = 0.05
n_cluster(stage_cost = c(500, 50), delta = vc, cv = 0.05)
#> Optimal 2-stage allocation
#> n_psu = 15885 | psu_size = 14 -> total n = 222390 (unrounded: 215354.9)
#> cv = 0.0500, cost = 18709868
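The optimal PSU size follows the classical cost-variance trade-off, m_opt = sqrt((c1/c2)·(1 − δ)/δ). It depends only on the stage-cost ratio and delta, not on the precision target, which is why both calls report psu_size = 14:

```r
# Optimal units per PSU from stage costs and delta
c1 <- 500; c2 <- 50; delta <- 0.0516
round(sqrt((c1 / c2) * (1 - delta) / delta), 2)   # rounds to 14 per PSU
#> [1] 13.56
```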
# Maximize precision within a budget
n_cluster(stage_cost = c(500, 50), delta = vc, budget = 100000)
#> Optimal 2-stage allocation
#> n_psu = 85 | psu_size = 14 -> total n = 1190 (unrounded: 1151.023)
#> cv = 0.6839, cost = 100000

The total sample size from n_cluster() can be passed to
draw():
cl <- n_cluster(stage_cost = c(500, 50), delta = vc, budget = 100000)
samp <- sampling_design() |>
draw(n = cl) |>
execute(ken_enterprises, seed = 5)
nrow(samp)
#> [1] 1190

Stratification Boundaries
Enterprise surveys typically stratify by size. Rather than using
predefined size classes, strata_bound() finds optimal
boundaries on a continuous variable. The cumulative square root
frequency (cum√f) method minimizes the coefficient of variation for a
given number of strata and total sample size:
bounds <- strata_bound(
ken_enterprises$revenue_millions,
n_strata = 4,
method = "cumrootf",
cv = 0.05
)
bounds
#> Strata boundaries (Dalenius-Hodges, 4 strata)
#> Boundaries: 40.0, 165.0, 745.0
#> n = 126, cv = 0.0500
#> Allocation: neyman
#> ---
#> stratum lower upper N share sd n
#> 1 1.1 40 12262 0.721 9.3 16
#> 2 40.0 165 3464 0.204 31.8 15
#> 3 165.0 745 929 0.055 145.9 19
#> 4 745.0 10104 349 0.021 1553.5 76

The result includes jointly optimized boundaries and stratum sample
sizes. The predict() method assigns each frame unit to a
stratum:
labels <- paste0("S", 1:4)
ken_enterprises$rev_stratum <- predict(
bounds,
ken_enterprises$revenue_millions,
labels = labels
)
table(ken_enterprises$rev_stratum)
#>
#> S1 S2 S3 S4
#> 12262 3464 929 349

Use the stratum allocations from strata_bound() directly
as a named vector for draw(). This preserves the joint
optimization between boundaries and allocation:
n_alloc <- setNames(bounds$strata$n, labels)
n_alloc
#> S1 S2 S3 S4
#> 16 15 19 76
samp <- sampling_design() |>
stratify_by(rev_stratum) |>
draw(n = n_alloc) |>
execute(ken_enterprises, seed = 4)
samp |>
count(rev_stratum, name = "n_sampled")
#> # A tibble: 4 × 2
#> rev_stratum n_sampled
#> <fct> <int>
#> 1 S1 16
#> 2 S2 15
#> 3 S3 19
#> 4 S4 76

Note that the boundaries and allocation are coupled. The
cumrootf method optimizes them together to achieve the
target CV. Using strata_bound() for boundaries but a
different allocation in samplyr (for example, proportional
instead of Neyman) might break this optimality. If you need a different
allocation method, compute boundaries and allocation separately.
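For intuition, the cum√f cut rule itself can be sketched in a few lines of base R: bin the variable, accumulate the square roots of the bin frequencies, and split that cumulative scale into equal-width intervals. This is a simplified illustration only; strata_bound() additionally optimizes the allocation jointly with the boundaries:

```r
# Simplified cumulative-sqrt(f) boundaries (Dalenius-Hodges style)
cum_sqrt_f_bounds <- function(x, n_strata, n_bins = 100) {
  h <- hist(x, breaks = n_bins, plot = FALSE)
  csf <- cumsum(sqrt(h$counts))              # cumulative sqrt frequency
  # interior cut points that split the csf scale into equal widths
  targets <- seq(0, max(csf), length.out = n_strata + 1)[2:n_strata]
  sapply(targets, function(t) h$breaks[which(csf >= t)[1] + 1])
}
```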
Stratified Allocation
When strata are defined by administrative or domain boundaries rather
than optimal boundaries from strata_bound(), use
n_alloc() to compute the allocation. It supports Neyman,
proportional, optimal (cost-weighted), and power allocation with
optional constraints on minimum stratum size.
First, summarize the frame to get stratum population sizes and variabilities:
alloc_frame <- ken_enterprises |>
summarize(
N = n(),
sd = sd(revenue_millions),
mean = mean(revenue_millions),
.by = sector
) |>
rename(stratum = sector)
alloc_frame
#> # A tibble: 7 × 4
#> stratum N sd mean
#> <fct> <int> <dbl> <dbl>
#> 1 Chemicals & Chemical Products 282 878. 273.
#> 2 Construction 2343 315. 63.4
#> 3 Food 856 698. 228.
#> 4 Hotels and Restaurants 3065 190. 48.9
#> 5 Other Manufacturing 2113 537. 177.
#> 6 Other Services 6187 246. 58.2
#> 7 Retail 2158 440. 87.8

Then compute the allocation. Here we use Neyman allocation targeting 600 units with a minimum of 20 per stratum:
alloc <- n_alloc(alloc_frame, n = 600, alloc = "neyman", min_n = 20)
alloc
#> Stratum allocation (neyman, 7 strata)
#> n = 600, cv = 0.1543, se = 13.5612
#> (min_n = 20)
alloc$detail[, c("stratum", "N", "n_int", "weight")]
#> stratum N n_int weight
#> 1 Chemicals & Chemical Products 282 26 10.95380
#> 2 Construction 2343 77 30.54684
#> 3 Food 856 62 13.79354
#> 4 Hotels and Restaurants 3065 60 50.63273
#> 5 Other Manufacturing 2113 118 17.91983
#> 6 Other Services 6187 158 39.08919
#> 7 Retail 2158 99 21.84907

The result passes directly to draw(). The per-stratum
allocations are matched by stratum name to the values of the
stratification variable:
samp <- sampling_design() |>
stratify_by(sector) |>
draw(n = alloc) |>
execute(ken_enterprises, seed = 10)
samp |>
count(sector, name = "n_sampled")
#> # A tibble: 7 × 2
#> sector n_sampled
#> <fct> <int>
#> 1 Food 62
#> 2 Chemicals & Chemical Products 26
#> 3 Other Manufacturing 118
#> 4 Construction 77
#> 5 Retail 99
#> 6 Hotels and Restaurants 60
#> 7 Other Services 158

Because n_alloc() already provides per-stratum sizes, do
not combine it with alloc in stratify_by().
The two allocation mechanisms are mutually exclusive: either let
svyplan allocate (via n_alloc()) or let
samplyr allocate (via alloc in
stratify_by()), but not both.
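Neyman allocation assigns n_h ∝ N_h·S_h. Plugging in the (rounded) N and sd printed above approximately reproduces the allocation, up to integer rounding of the stratum sizes:

```r
# Neyman shares from the printed stratum sizes and sds
N  <- c(282, 2343, 856, 3065, 2113, 6187, 2158)
sd <- c(878, 315, 698, 190, 537, 246, 440)
round(600 * N * sd / sum(N * sd))   # n_h proportional to N_h * S_h
```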
Budget-constrained allocation
When field costs vary across strata, n_alloc() can
optimize under a budget constraint using cost-weighted (optimal)
allocation:
alloc_frame_cost <- alloc_frame |>
mutate(cost = c(120, 200, 150, 130, 80, 100, 90))
budget_alloc <- n_alloc(
alloc_frame_cost,
budget = 50000,
alloc = "optimal",
min_n = 20
)
budget_alloc
#> Stratum allocation (optimal, 7 strata)
#> n = 451, cv = 0.1811, se = 15.9212
#> (min_n = 20)
budget_alloc$detail[, c("stratum", "N", "cost", "n_int", "weight")]
#> stratum N cost n_int weight
#> 1 Chemicals & Chemical Products 282 120 20 14.10000
#> 2 Construction 2343 200 42 55.44653
#> 3 Food 856 150 40 21.68276
#> 4 Hotels and Restaurants 3065 130 41 74.09623
#> 5 Other Manufacturing 2113 80 103 20.57178
#> 6 Other Services 6187 100 123 50.17066
#> 7 Retail 2158 90 81 26.60402
samp_budget <- sampling_design() |>
stratify_by(sector) |>
draw(n = budget_alloc) |>
execute(ken_enterprises, seed = 11)
samp_budget |>
count(sector, name = "n_sampled")
#> # A tibble: 7 × 2
#> sector n_sampled
#> <fct> <int>
#> 1 Food 40
#> 2 Chemicals & Chemical Products 20
#> 3 Other Manufacturing 103
#> 4 Construction 42
#> 5 Retail 81
#> 6 Hotels and Restaurants 41
#> 7 Other Services 123

Design Effects After Sampling
After executing a design, you can evaluate its efficiency using
design_effect() and effective_n(). These are
re-exported from svyplan and work directly on
tbl_sample objects.
Kish Design Effect
The Kish design effect measures the loss of precision due to unequal weights. It equals 1 for self-weighting designs:
design_effect(samp)
#> [1] 1.182607
effective_n(samp)
#> [1] 507.3536

The effective sample size is the number of observations an SRS would need to achieve the same precision.
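The Kish design effect is n·Σw²/(Σw)², i.e. 1 plus the squared coefficient of variation of the weights, and the effective sample size is just n/deff (here 600/1.1826 ≈ 507). A base-R illustration with toy weights:

```r
# Kish design effect: 1 for equal weights, > 1 otherwise
kish_deff <- function(w) length(w) * sum(w^2) / sum(w)^2
kish_deff(rep(2, 10))                 # self-weighting
#> [1] 1
kish_deff(c(rep(1, 5), rep(3, 5)))    # unequal weights inflate variance
#> [1] 1.25
```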
Spencer Method
The Spencer method accounts for the relationship between the outcome
variable and the selection probabilities. Pass the outcome as a bare
column name. The selection probabilities are extracted automatically
from .weight_1:
design_effect(samp, y = revenue_millions, method = "spencer")
#> [1] 0.8225433

Henry Method
The Henry method additionally accounts for a calibration covariate.
Both y and x_cal are column names in the
sample:
design_effect(samp, y = revenue_millions, x_cal = employees,
method = "henry")
#> [1] 0.3754297

Chen-Rust Decomposition
For stratified or clustered designs, the Chen-Rust (CR) method
decomposes the design effect into weighting (deff_w),
clustering (deff_c), and stratification
(deff_s) components:
cr <- design_effect(samp, y = revenue_millions, method = "cr")
cr$strata
#> stratum n_h cv2_w deff_w deff_s
#> 1 Food 62 0 1 0.020888198
#> 2 Chemicals & Chemical Products 26 0 1 0.023398167
#> 3 Other Manufacturing 118 0 1 0.127381354
#> 4 Construction 77 0 1 0.006778693
#> 5 Retail 99 0 1 0.374431784
#> 6 Hotels and Restaurants 60 0 1 0.014352966
#> 7 Other Services 158 0 1 0.256829835
cr$overall
#> [1] 0.824061The tbl_sample method extracts stratification and
clustering variables from the stored design metadata. You only need to
supply the outcome.
Cluster Planning
Before collecting data, you can anticipate the design effect for a
cluster design using the intraclass correlation coefficient (ICC) and
cluster size. This uses svyplan::design_effect() directly
(not through a tbl_sample):
design_effect(delta = 0.05, psu_size = 10)
#> [1] 1.45
design_effect(delta = 0.15, psu_size = 10)
#> [1] 2.35

With an ICC of 0.05 and 10 units per cluster, the design effect is 1.45. An ICC of 0.15 raises it to 2.35. This helps decide the number of clusters and units per cluster during the planning stage.
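This is the standard cluster design effect formula, deff = 1 + (m − 1)·δ, where m is the PSU size and δ the ICC:

```r
# Anticipated design effect for a cluster design
deff_cluster <- function(delta, psu_size) 1 + (psu_size - 1) * delta
deff_cluster(0.05, 10)
#> [1] 1.45
deff_cluster(0.15, 10)
#> [1] 2.35
```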
Summary
The typical planning workflow:

- Determine sample size with n_prop(), n_mean(), n_multi(), or power_prop()/power_mean().
- Evaluate precision at the chosen n with prec_prop() and prec_mean().
- Run a sensitivity analysis with predict() to check robustness.
- Optionally create stratification boundaries with strata_bound() and assign strata with predict().
- Allocate across strata with n_alloc() (Neyman, proportional, optimal, or power allocation, with optional constraints).
- For cluster designs, estimate variance components with varcomp() and optimize the allocation with n_cluster().
- Pass the sample size or allocation to draw() and execute the design.
- Evaluate the realized design with design_effect() and effective_n().
All svyplan sample size objects (svyplan_n,
svyplan_power, svyplan_cluster) work directly
with draw(). For n_alloc(), the per-stratum
allocations are passed through automatically. For other objects, the
scalar total is extracted. No manual conversion is needed.