Overview

Before drawing a sample, you need to answer two questions: how many units to select, and how to divide them across strata. The svyplan package provides tools for both. samplyr integrates with svyplan so that planning results flow directly into draw() and execute().

The three packages form a pipeline:

  • svyplan determines sample sizes, evaluates precision, and computes strata boundaries (planning).
  • samplyr specifies and executes the sampling design (selection).
  • sondage provides the low-level sampling algorithms (engine).

This vignette uses the ken_enterprises dataset, a synthetic frame of 6,823 Kenyan establishments with revenue, employee counts, and export status.

data(ken_enterprises)
glimpse(ken_enterprises)
#> Rows: 6,823
#> Columns: 9
#> $ enterprise_id    <chr> "KEN_00001", "KEN_00002", "KEN_00003", "KEN_00004", "…
#> $ county           <fct> Kiambu, Kiambu, Kiambu, Kiambu, Kiambu, Kiambu, Kiamb…
#> $ region           <fct> Kiambu, Kiambu, Kiambu, Kiambu, Kiambu, Kiambu, Kiamb…
#> $ sector           <fct> Chemicals & Plastics, Chemicals & Plastics, Chemicals…
#> $ size_class       <fct> Small, Medium, Medium, Medium, Large, Large, Small, S…
#> $ employees        <int> 13, 54, 37, 24, 107, 136, 10, 14, 43, 30, 70, 42, 84,
#> $ revenue_millions <dbl> 25.1, 57.8, 51.7, 40.5, 73.7, 507.9, 25.5, 16.8, 39.9…
#> $ year_established <int> 1993, 2005, 2002, 2014, 1994, 2007, 2005, 1997, 1991,
#> $ exporter         <lgl> TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, 

Sample Size for a Proportion

Suppose we want to estimate the proportion of exporting firms. We know from the census that about 17% of firms export. We want a margin of error of 3 percentage points at 95% confidence.

n_export <- n_prop(p = 0.17, moe = 0.03)
n_export
#> Sample size for proportion (wald)
#> n = 603 (p = 0.17, moe = 0.030)
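
As the header indicates, this is the standard Wald formula, n = z^2 p(1 - p) / moe^2, rounded up. A quick base-R check, without svyplan:

```r
# Wald sample size for a proportion: n = z^2 * p * (1 - p) / moe^2
p <- 0.17
moe <- 0.03
z <- qnorm(0.975)  # 95% confidence
ceiling(z^2 * p * (1 - p) / moe^2)
#> [1] 603
```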

The result is a svyplan_n object. You can pass it directly to draw():

samp <- sampling_design() |>
  draw(n = n_export) |>
  execute(ken_enterprises, seed = 1)
nrow(samp)
#> [1] 603

Sample Size for a Mean

We also want to estimate mean revenue with a margin of error of 30 million KES. We compute the population variance from the frame:

rev_var <- var(ken_enterprises$revenue_millions)
n_rev <- n_mean(var = rev_var, moe = 30)
n_rev
#> Sample size for mean
#> n = 982 (var = 230035.94, moe = 30.000)
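
The mean version substitutes the variance for p(1 - p): n = z^2 * var / moe^2. Reproducing the result in base R with the printed variance:

```r
# Sample size for a mean: n = z^2 * var / moe^2
rev_var <- 230035.94  # population variance printed above
z <- qnorm(0.975)     # 95% confidence
ceiling(z^2 * rev_var / 30^2)
#> [1] 982
```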

Multiple Indicators

In practice, a survey measures several things at once. n_multi() evaluates each indicator's requirement and returns the binding constraint, i.e. the largest sample size any single target demands:

targets <- data.frame(
  name = c("exporter_rate", "mean_revenue"),
  p = c(0.17, NA),
  var = c(NA, rev_var),
  moe = c(0.03, 30)
)
n_survey <- n_multi(targets)
n_survey
#> Multi-indicator sample size
#> n = 982 (binding: mean_revenue)
#> ---
#>  name          .n  .binding
#>  exporter_rate 603         
#>  mean_revenue  982 *
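
Conceptually, n_multi() computes each indicator's requirement separately and keeps the maximum; a base-R sketch of the same logic:

```r
# The binding constraint is the largest per-indicator sample size
z <- qnorm(0.975)
n_exporter <- ceiling(z^2 * 0.17 * (1 - 0.17) / 0.03^2)  # 603
n_revenue  <- ceiling(z^2 * 230035.94 / 30^2)            # 982
max(n_exporter, n_revenue)
#> [1] 982
```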

The sample size is driven by whichever indicator requires the most units. Pass it to draw() like any other sample size:

samp <- sampling_design() |>
  draw(n = n_survey) |>
  execute(ken_enterprises, seed = 2)
nrow(samp)
#> [1] 982

Power Analysis

If the goal is to detect a difference (for example, comparing exporter rates between two regions), power_prop() gives the required per-group sample size:

n_power <- power_prop(p1 = 0.17, p2 = 0.25, power = 0.8)
n_power
#> Power analysis for proportions (solved for sample size)
#> n = 403 (per group), power = 0.800, effect = 0.0800
#> (p1 = 0.170, p2 = 0.250, alpha = 0.05)
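
This matches the textbook two-sample formula for proportions; a base-R check:

```r
# Per-group n: (z_alpha + z_beta)^2 * (p1*(1-p1) + p2*(1-p2)) / (p1 - p2)^2
p1 <- 0.17
p2 <- 0.25
z_alpha <- qnorm(0.975)  # two-sided alpha = 0.05
z_beta  <- qnorm(0.80)   # power = 0.80
ceiling((z_alpha + z_beta)^2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p1 - p2)^2)
#> [1] 403
```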

This also works directly with draw():

samp <- sampling_design() |>
  draw(n = n_power) |>
  execute(ken_enterprises, seed = 3)
nrow(samp)
#> [1] 403

Response Rate Adjustment

All svyplan functions accept a resp_rate parameter. The sample size is inflated by 1 / resp_rate to account for expected non-response:

n_prop(p = 0.17, moe = 0.03, resp_rate = 0.85)
#> Sample size for proportion (wald)
#> n = 709 (net: 603) (p = 0.17, moe = 0.030, resp_rate = 0.85)
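
Judging by the printed output, the inflation is applied to the unrounded sample size before rounding up, which is why the gross figure is 709 rather than ceiling(603 / 0.85) = 710:

```r
# Inflate the unrounded net n by 1 / resp_rate, then round up
n_net <- qnorm(0.975)^2 * 0.17 * (1 - 0.17) / 0.03^2  # ~602.26, unrounded
ceiling(n_net / 0.85)
#> [1] 709
```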

The net sample size (after non-response) matches the original target. This works in n_mean(), n_multi(), n_cluster(), and power_*() functions as well.

Precision Analysis

Given a fixed sample size, how precise will the estimates be? The prec_*() functions are the inverse of n_*():

prec_prop(p = 0.17, n = 600)
#> Sampling precision for proportion (wald)
#> n = 600
#> se = 0.0153, moe = 0.0301, cv = 0.0902
prec_mean(var = rev_var, n = 300, mu = mean(ken_enterprises$revenue_millions))
#> Sampling precision for mean
#> n = 300
#> se = 27.6909, moe = 54.2732, cv = 0.2044
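
All three precision measures derive from the standard error; for a proportion, se = sqrt(p(1 - p)/n), moe = z * se, and cv = se/p. Reproducing the proportion result in base R:

```r
# Precision for a proportion at a fixed n
p <- 0.17
n <- 600
se <- sqrt(p * (1 - p) / n)
round(c(se = se, moe = qnorm(0.975) * se, cv = se / p), 4)
#>     se    moe     cv 
#> 0.0153 0.0301 0.0902
```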

Round-trip between size and precision

All n_*() and prec_*() functions are S3 generics. Pass a precision result to n_*() to recover the sample size, or a sample size result to prec_*() to compute the achieved precision:

s <- n_prop(p = 0.17, moe = 0.03)
p <- prec_prop(s)
p
#> Sampling precision for proportion (wald)
#> n = 603
#> se = 0.0153, moe = 0.0300, cv = 0.0900

# Recover the original n
n_prop(p)
#> Sample size for proportion (wald)
#> n = 603 (p = 0.17, moe = 0.030)

You can override a parameter on the way back:

n_prop(p, cv = 0.10)
#> Sample size for proportion (wald)
#> n = 489 (p = 0.17, cv = 0.100)
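
The cv target converts to a sample size via se = cv * p, so n = p(1 - p)/(cv * p)^2 = (1 - p)/(p * cv^2):

```r
# Sample size from a CV target
p <- 0.17
cv <- 0.10
ceiling((1 - p) / (p * cv^2))
#> [1] 489
```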

Confidence intervals

confint() extracts the expected confidence interval from any sample size or precision object:

confint(s)
#>  2.5 % 97.5 %
#>   0.14    0.2
confint(p, level = 0.99)
#>      0.5 %    99.5 %
#>  0.1305733 0.2094267
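
These intervals follow the Wald form p +/- z * se, where the standard error implied by the design is moe / z at the original confidence level. Reproducing the 99% interval in base R:

```r
# 99% Wald interval at the design sample size
p <- 0.17
se <- 0.03 / qnorm(0.975)  # se implied by moe = 0.03 at 95% confidence
p + c(-1, 1) * qnorm(0.995) * se
#> [1] 0.1305733 0.2094267
```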

Sensitivity Analysis

How sensitive is a sample size to assumptions about the design effect or response rate? The predict() method evaluates any svyplan result at new parameter combinations:

x <- n_prop(p = 0.17, moe = 0.03, deff = 1.5)
predict(x, expand.grid(
  deff = c(1.0, 1.5, 2.0, 2.5),
  resp_rate = c(0.8, 0.9, 1.0)
))
#>    deff resp_rate         n        se  moe         cv
#> 1   1.0       0.8  752.8192 0.0153064 0.03 0.09003767
#> 2   1.5       0.8 1129.2288 0.0153064 0.03 0.09003767
#> 3   2.0       0.8 1505.6384 0.0153064 0.03 0.09003767
#> 4   2.5       0.8 1882.0481 0.0153064 0.03 0.09003767
#> 5   1.0       0.9  669.1726 0.0153064 0.03 0.09003767
#> 6   1.5       0.9 1003.7590 0.0153064 0.03 0.09003767
#> 7   2.0       0.9 1338.3453 0.0153064 0.03 0.09003767
#> 8   2.5       0.9 1672.9316 0.0153064 0.03 0.09003767
#> 9   1.0       1.0  602.2554 0.0153064 0.03 0.09003767
#> 10  1.5       1.0  903.3831 0.0153064 0.03 0.09003767
#> 11  2.0       1.0 1204.5108 0.0153064 0.03 0.09003767
#> 12  2.5       1.0 1505.6384 0.0153064 0.03 0.09003767
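
Notice that se, moe, and cv are constant across rows: the precision target is held fixed while n scales linearly, as n = n0 * deff / resp_rate with n0 the unrounded SRS sample size. Checking row 2 in base R:

```r
# n scales linearly in deff and 1 / resp_rate
n0 <- qnorm(0.975)^2 * 0.17 * (1 - 0.17) / 0.03^2  # unrounded SRS n, ~602.26
n0 * 1.5 / 0.8  # deff = 1.5, resp_rate = 0.8 (row 2)
#> [1] 1129.229
```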

This works for all svyplan object types: sample size, precision, cluster, and power.

Multistage Cluster Planning

For cluster designs, n_cluster() finds the optimal allocation across stages given costs and the degree of clustering.

Variance components

If a frame with cluster identifiers is available, varcomp() estimates the between-cluster and within-cluster variance components:

vc <- varcomp(
  revenue_millions ~ county,
  data = ken_enterprises
)
vc
#> Variance components (2-stage)
#> varb = 5.6221, varw = 70.8033
#> delta = 0.0736
#> k = 6.0984
#> Unit relvariance = 12.5320
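
The homogeneity measure is the between-cluster share of total variance, delta = varb / (varb + varw):

```r
# delta from the printed variance components
varb <- 5.6221
varw <- 70.8033
round(varb / (varb + varw), 4)
#> [1] 0.0736
```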

The delta value (measure of homogeneity) feeds directly into n_cluster() and design_effect().

Optimal allocation

# Minimize cost to achieve CV = 0.05
n_cluster(cost = c(500, 50), delta = vc, cv = 0.05)
#> Optimal 2-stage allocation
#> n_psu = 4773 | psu_size = 12 -> total n = 57276 (unrounded: 53558)
#> cv = 0.0500, cost = 5064188

# Maximize precision within a budget
n_cluster(cost = c(500, 50), delta = vc, budget = 100000)
#> Optimal 2-stage allocation
#> n_psu = 95 | psu_size = 12 -> total n = 1140 (unrounded: 1057)
#> cv = 0.3558, cost = 100000
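
Both runs land on the same subsample size per PSU, consistent with the classic two-stage optimum m = sqrt((c1/c2) * (1 - delta)/delta), which depends only on the cost ratio and delta, not on the CV target or budget. Using the printed delta:

```r
# Optimal units per PSU for a two-stage design
c1 <- 500  # cost per PSU
c2 <- 50   # cost per unit within a PSU
delta <- 0.0736
round(sqrt((c1 / c2) * (1 - delta) / delta), 1)  # rounded up to psu_size = 12 above
#> [1] 11.2
```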

The total sample size from n_cluster() can be passed to draw():

cl <- n_cluster(cost = c(500, 50), delta = vc, budget = 100000)
samp <- sampling_design() |>
  draw(n = cl) |>
  execute(ken_enterprises, seed = 5)
nrow(samp)
#> [1] 1140

Stratification Boundaries

Enterprise surveys typically stratify by size. Rather than relying on predefined size classes, strata_bound() finds optimal boundaries on a continuous variable. The cumulative square root of frequency (Dalenius-Hodges) method chooses boundaries for a given number of strata and, together with Neyman allocation, determines the smallest total sample size that achieves the target coefficient of variation:

bounds <- strata_bound(
  ken_enterprises$revenue_millions,
  n_strata = 4,
  method = "cumrootf",
  cv = 0.05
)
bounds
#> Strata boundaries (Dalenius-Hodges, 4 strata)
#> Boundaries: 60.0, 230.0, 910.0
#> n = 101, cv = 0.0500
#> Allocation: neyman
#> ---
#>  stratum lower upper  N_h  W_h   S_h    n_h
#>  1         1.3   60.0 4692 0.688 15.0   14 
#>  2        60.0  230.0 1492 0.219 44.0   13 
#>  3       230.0  910.0  448 0.066 179.9  16 
#>  4       910.0 8725.7  191 0.028 1565.2 58
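
The n_h column reflects Neyman allocation, with n_h proportional to N_h * S_h. Recomputing it from the printed table (plain round() gives 59 in the last stratum; svyplan evidently rounds so the strata sum to n = 101):

```r
# Neyman allocation from the printed stratum sizes and SDs
N_h <- c(4692, 1492, 448, 191)
S_h <- c(15.0, 44.0, 179.9, 1565.2)
round(101 * N_h * S_h / sum(N_h * S_h))
#> [1] 14 13 16 59
```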

The result includes jointly optimized boundaries and stratum sample sizes. The predict() method assigns each frame unit to a stratum:

labels <- paste0("S", 1:4)
ken_enterprises$rev_stratum <- predict(
  bounds,
  ken_enterprises$revenue_millions,
  labels = labels
)
table(ken_enterprises$rev_stratum)
#> 
#>   S1   S2   S3   S4 
#> 4692 1492  448  191

Use the stratum allocations from strata_bound() directly as a named vector for draw(). This preserves the joint optimization between boundaries and allocation:

n_alloc <- setNames(bounds$strata$n_h, labels)
n_alloc
#> S1 S2 S3 S4 
#> 14 13 16 58
samp <- sampling_design() |>
  stratify_by(rev_stratum) |>
  draw(n = n_alloc) |>
  execute(ken_enterprises, seed = 4)
samp |>
  count(rev_stratum, name = "n_sampled")
#> # A tibble: 4 × 2
#>   rev_stratum n_sampled
#>   <fct>           <int>
#> 1 S1                 14
#> 2 S2                 13
#> 3 S3                 16
#> 4 S4                 58

Note that the boundaries and allocation are coupled. The cumrootf method optimizes them together to achieve the target CV. Using strata_bound() for boundaries but a different allocation in samplyr (for example, proportional instead of Neyman) might break this optimality. If you need a different allocation method, compute boundaries and allocation separately.

Design Effects After Sampling

After executing a design, you can evaluate its efficiency using design_effect() and effective_n(). These are re-exported from svyplan and work directly on tbl_sample objects.

Kish Design Effect

The Kish design effect measures the loss of precision due to unequal weights. It equals 1 for self-weighting designs:

design_effect(samp)
#> [1] 3.811689
effective_n(samp)
#> [1] 26.49744

The effective sample size is the number of observations an SRS would need to achieve the same precision.
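
Kish's formula is deff = n * sum(w^2) / (sum(w))^2, and effective_n() is n / deff (here 101 / 3.8117 is about 26.5). A self-contained sketch on hypothetical weights, not the weights stored in samp:

```r
# Kish design effect and effective sample size for a weight vector
kish_deff <- function(w) length(w) * sum(w^2) / sum(w)^2
w <- c(rep(2, 50), rep(10, 50))  # hypothetical unequal weights
deff <- kish_deff(w)
round(c(deff = deff, n_eff = length(w) / deff), 3)
#>   deff  n_eff 
#>  1.444 69.231
```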

Spencer Method

The Spencer method accounts for the relationship between the outcome variable and the selection probabilities. Pass the outcome as a bare column name. The selection probabilities are extracted automatically from .weight_1:

design_effect(samp, y = revenue_millions, method = "spencer")
#> [1] 0.102805

Henry Method

The Henry method also accounts for a calibration covariate. Both y and x_cal are column names in the sample:

design_effect(samp, y = revenue_millions, x_cal = employees,
              method = "henry")
#> [1] 0.04859152

Chen-Rust Decomposition

For stratified or clustered designs, the Chen-Rust (CR) method decomposes the design effect into weighting (deff_w), clustering (deff_c), and stratification (deff_s) components:

cr <- design_effect(samp, y = revenue_millions, method = "cr")
cr$strata
#>   stratum n_h cv2_w deff_w      deff_s
#> 1      S1  14     0      1 0.003603710
#> 2      S2  13     0      1 0.004195195
#> 3      S3  16     0      1 0.006554564
#> 4      S4  58     0      1 0.009419020
cr$overall
#> [1] 0.02377249

The tbl_sample method extracts stratification and clustering variables from the stored design metadata. You only need to supply the outcome.

Cluster Planning

Before collecting data, you can anticipate the design effect for a cluster design using the intraclass correlation coefficient (ICC) and cluster size. This uses svyplan::design_effect() directly (not through a tbl_sample):

design_effect(delta = 0.05, psu_size = 10)
#> [1] 1.45
design_effect(delta = 0.15, psu_size = 10)
#> [1] 2.35

With an ICC of 0.05 and 10 units per cluster, the design effect is 1.45. An ICC of 0.15 raises it to 2.35. This helps decide the number of clusters and units per cluster during the planning stage.
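
These numbers follow the standard cluster design-effect formula, deff = 1 + (psu_size - 1) * ICC:

```r
# Cluster design effect: deff = 1 + (psu_size - 1) * delta
deff_cluster <- function(delta, psu_size) 1 + (psu_size - 1) * delta
deff_cluster(0.05, 10)
#> [1] 1.45
deff_cluster(0.15, 10)
#> [1] 2.35
```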

Summary

A typical planning workflow looks like this:

  1. Determine sample size with n_prop(), n_mean(), n_multi(), or power_prop() / power_mean().
  2. Evaluate precision at the chosen n with prec_prop() or prec_mean().
  3. Run sensitivity analysis with predict() to check robustness.
  4. Optionally create stratification boundaries with strata_bound() and assign strata with predict().
  5. For cluster designs, estimate variance components with varcomp() and optimize the allocation with n_cluster().
  6. Pass the sample size to draw() and execute the design.
  7. Evaluate the realized design with design_effect() and effective_n().

All svyplan sample size objects (svyplan_n, svyplan_power, svyplan_cluster) work directly with draw(), so the planning and execution steps connect without manual conversion.