Skip to contents

Overview

samplyr provides a tidy grammar for specifying and executing survey sampling designs. The package is built around a minimal set of composable verbs that handle stratified, clustered, and multi-stage sampling.

For the statistical semantics and assumptions underlying each operation, see vignette("design-semantics").

The core idea is that sampling code should read like its English description:

library(samplyr)
library(dplyr)

# "Stratify by region, proportionally allocate 500 samples, execute"
sampling_design(title = "My sampling design") |>
  stratify_by(region, alloc = "proportional") |>
  draw(n = 500) |>
  execute(frame, seed = 1)

To see how this works in practice, consider a real survey design from Lohr (2022) (Example 7.1), based on a 1991 study of bed net use in rural Gambia (D’Alessandro et al., 1994):

Malaria morbidity can be reduced by using bed nets impregnated with insecticide, but this is only effective if the bed nets are in widespread use. In 1991, a nationwide survey was designed to estimate the prevalence of bed net use in rural areas of the Gambia (D’Alessandro et al., 1994).

The sampling frame consisted of all rural villages of fewer than 3,000 people. The villages were stratified by three geographic regions (eastern, central, and western) and by whether the village had a public health clinic (PHC) or not. In each region five districts were chosen with probability proportional to the district population. In each district four villages were chosen, again with probability proportional to census population: two PHC villages and two non-PHC villages. Finally, six compounds were chosen more or less randomly from each village.

In samplyr, this three-stage stratified cluster design translates directly into code:

design <- sampling_design(title = "Gambia bed nets") |>
  add_stage() |>
    stratify_by(region) |>
    cluster_by(district) |>
    draw(n = 5, method = "pps_brewer", mos = population) |>
  add_stage() |>
    stratify_by(phc) |>
    cluster_by(village) |>
    draw(n = 2, method = "pps_brewer", mos = population) |>
  add_stage() |>
    draw(n = 6)
design
#> ── Sampling Design: Gambia bed nets ────────────────────────────────────────────
#> 
#> ℹ 3 stages
#> 
#> ── Stage 1 ─────────────────────────────────────────────────────────────────────
#> • Strata: region
#> • Cluster: district
#> • Draw: n = 5, method = pps_brewer, mos = population
#> 
#> ── Stage 2 ─────────────────────────────────────────────────────────────────────
#> • Strata: phc
#> • Cluster: village
#> • Draw: n = 2, method = pps_brewer, mos = population
#> 
#> ── Stage 3 ─────────────────────────────────────────────────────────────────────
#> • Draw: n = 6, method = srswor

And you can then use execute to get your sample.

sampling_design(title = "Gambia bed net") |>
  add_stage() |>
    stratify_by(region) |>
    cluster_by(district) |>
    draw(n = 5, method = "pps_brewer", mos = population) |>
  add_stage() |>
    stratify_by(phc) |>
    cluster_by(village) |>
    draw(n = 2, method = "pps_brewer", mos = population) |>
  add_stage() |>
    draw(n = 6) |>
  execute(frame, seed = 1991)

# or
execute(design, frame, seed = 1991)

The samplyr code mirrors the verbal description verb for verb.

The Grammar

samplyr uses 5 verbs and 1 modifier:

Function Purpose
sampling_design() Create a new sampling design
stratify_by() Define stratification variables and allocation
cluster_by() Define cluster/PSU variable
draw() Specify sample size, method
execute() Run the design on a frame
add_stage() Delimit stages in multi-stage designs

Deferred Column Resolution

stratify_by() and cluster_by() accept bare column names and store them in the design object. Column names are validated only when a frame is available (execute(), as_svydesign()), which keeps design specification separate from execution.

design <- sampling_design(title = "Stratified cluster sampling") |>
  stratify_by(region, alloc = "proportional") |>
  cluster_by(ea_id) |>
  draw(n = 200)

sample <- execute(design, bfa_eas, seed = 101)

Example Data

We’ll use the bfa_eas dataset throughout this vignette. It is an enumeration area frame for Burkina Faso, with ~14,934 EAs across 13 regions.

data(bfa_eas)

bfa_eas |>
  glimpse()
#> Rows: 14,934
#> Columns: 12
#> $ ea_id               <chr> "EA_00245", "EA_00246", "EA_00247", "EA_00248", "E…
#> $ region              <fct> Boucle du Mouhoun, Boucle du Mouhoun, Boucle du Mo…
#> $ province            <fct> Bale, Bale, Bale, Bale, Bale, Bale, Bale, Bale, Ba…
#> $ commune             <fct> Bagassi, Bagassi, Bagassi, Bagassi, Bagassi, Bagas…
#> $ urban_rural         <fct> Rural, Rural, Rural, Rural, Rural, Rural, Rural, R…
#> $ population          <dbl> 927, 2184, 1666, 592, 1467, 1607, 709, 1461, 1053,
#> $ households          <int> 111, 262, 200, 71, 176, 193, 85, 176, 127, 122, 16…
#> $ area_km2            <dbl> 57.27, 2.57, 33.44, 31.73, 12.07, 25.53, 12.63, 16…
#> $ accessible          <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TR…
#> $ dist_road_km        <dbl> 11.6, 10.7, 23.3, 19.1, 27.1, 5.7, 21.1, 3.5, 11.5…
#> $ food_insecurity_pct <dbl> 27.9, 15.9, 19.1, 19.7, 14.2, 45.5, 15.4, 18.8, 34…
#> $ cost                <dbl> 407, 243, 510, 360, 546, 428, 724, 408, 219, 387, 

Simple Random Sampling

The most basic design selects n units at random from the frame.

design <- sampling_design(title = "Simple Random Sampling") |>
  draw(n = 100)
design
#> ── Sampling Design: Simple Random Sampling ─────────────────────────────────────
#> 
#> ℹ 1 stage
#> 
#> ── Stage 1 ─────────────────────────────────────────────────────────────────────
#> • Draw: n = 100, method = srswor

The sampling_design object stores design metadata and can be reused.

sample <- execute(design, bfa_eas, seed = 24)
nrow(sample)
#> [1] 100

The result includes the original columns plus sampling metadata: .weight (sampling weight), .weight_1, .weight_2, etc. (per-stage weights). For with-replacement methods, .draw_1, .draw_2, etc. each independent draw.

sample |>
  select(ea_id, region, urban_rural, .weight, .weight_1) |>
  head()
#> # A tbl_sample: 6 × 5 | Simple Random Sampling
#> # Weights:      149.34 [149.34, 149.34]
#>   ea_id    region            urban_rural .weight .weight_1
#> * <chr>    <fct>             <fct>         <dbl>     <dbl>
#> 1 EA_09278 Centre            Urban          149.      149.
#> 2 EA_04429 Sud-Ouest         Rural          149.      149.
#> 3 EA_04891 Sahel             Rural          149.      149.
#> 4 EA_14044 Boucle du Mouhoun Rural          149.      149.
#> 5 EA_10483 Hauts-Bassins     Rural          149.      149.
#> 6 EA_04316 Hauts-Bassins     Rural          149.      149.

Selection Methods

The method argument controls how units are selected. By default, samplyr uses simple random sampling without replacement (srswor).

Systematic sampling selects units at fixed intervals, which can improve precision when the frame is ordered.

sample_sys <- sampling_design(title = "Systematic") |>
  draw(n = 100, method = "systematic") |>
  execute(bfa_eas, seed = 2021)
head(sample_sys)
#> # A tbl_sample: 6 × 17 | Systematic
#> # Weights:      149.34 [149.34, 149.34]
#>   ea_id    region    province commune urban_rural population households area_km2
#> * <chr>    <fct>     <fct>    <fct>   <fct>            <dbl>      <int>    <dbl>
#> 1 EA_02176 Boucle d… Bale     Boromo  Rural              792        103     0.42
#> 2 EA_14063 Boucle d… Bale     Yaho    Rural             1147        171     8.77
#> 3 EA_12395 Boucle d… Banwa    Solenzo Rural             1867        251    19.7 
#> 4 EA_00757 Boucle d… Kossi    Barani  Rural              431         56    16.4 
#> 5 EA_03968 Boucle d… Kossi    Doumba… Rural              922        113    30.7 
#> 6 EA_02113 Boucle d… Mouhoun  Bondok… Rural             1700        259    48.7 
#> # ℹ 9 more variables: accessible <lgl>, dist_road_km <dbl>,
#> #   food_insecurity_pct <dbl>, cost <dbl>, .weight <dbl>, .sample_id <int>,
#> #   .stage <int>, .weight_1 <dbl>, .fpc_1 <int>

Sampling with replacement allows the same unit to be selected multiple times, useful for bootstrap procedures.

sample_wr <- sampling_design(title = "SRS WR") |>
  draw(n = 100, method = "srswr") |>
  execute(bfa_eas, seed = 123)
head(sample_wr)
#> # A tbl_sample: 6 × 18 | SRS WR
#> # Weights:      149.34 [149.34, 149.34]
#>   ea_id    region    province commune urban_rural population households area_km2
#> * <chr>    <fct>     <fct>    <fct>   <fct>            <dbl>      <int>    <dbl>
#> 1 EA_08950 Centre    Kadiogo  Ouagad… Urban             1618        244     0.2 
#> 2 EA_08998 Centre    Kadiogo  Ouagad… Urban             1955        294     0.62
#> 3 EA_02967 Hauts-Ba… Houet    Dande   Rural             1132        123    31.1 
#> 4 EA_07871 Est       Gnagna   Mani    Rural             1096        159    10.4 
#> 5 EA_05028 Nord      Zondoma  Goursi  Rural              845         95    23.3 
#> 6 EA_09473 Centre    Kadiogo  Ouagad… Urban             2069        312     0.31
#> # ℹ 10 more variables: accessible <lgl>, dist_road_km <dbl>,
#> #   food_insecurity_pct <dbl>, cost <dbl>, .weight <dbl>, .sample_id <int>,
#> #   .stage <int>, .weight_1 <dbl>, .fpc_1 <dbl>, .draw_1 <int>

Using Sampling Fractions

Instead of a fixed sample size, you can specify a sampling fraction with frac.

sample <- sampling_design() |>
  draw(frac = 0.05) |>
  execute(bfa_eas, seed = 1789)
nrow(sample)
#> [1] 747

Bernoulli sampling selects each unit independently with the specified probability. This gives a random sample size, which can simplify field protocols.

sample <- sampling_design() |>
  draw(frac = 0.05, method = "bernoulli") |>
  execute(bfa_eas, seed = 1960)
nrow(sample)
#> [1] 716

Rounding Behavior

When using frac, the computed sample size (N × frac) must be rounded to an integer. By default, samplyr rounds up (ceiling), which matches SAS SURVEYSELECT’s default behavior. Use the round argument to control this:

  • "up" (default): round up, ensures adequate sample size
  • "down": round down, minimum sample size
  • "nearest": Standard mathematical rounding
# Frame of 105 units, frac = 0.1 → 10.5
frame_105 <- head(bfa_eas, 105)

# Default: rounds up to 11
nrow(sampling_design() |> draw(frac = 0.1) |> execute(frame_105, seed = 3))
#> [1] 11

# Round down to 10
nrow(sampling_design() |> draw(frac = 0.1, round = "down") |> execute(frame_105, seed = 3))
#> [1] 10

# Round to nearest (10)
nrow(sampling_design() |> draw(frac = 0.1, round = "nearest") |> execute(frame_105, seed = 3))
#> [1] 10

Stratified Sampling

Stratification partitions the frame into non-overlapping groups (strata) and samples from each. This ensures representation from all subgroups and often improves precision.

Per-Stratum Sample Size

Without an allocation method, n specifies the sample size within each stratum.

sample <- sampling_design() |>
  stratify_by(region) |>
  draw(n = 20) |>
  execute(bfa_eas, seed = 123)

sample |>
  count(region)
#> # A tibble: 13 × 2
#>    region                n
#>    <fct>             <int>
#>  1 Boucle du Mouhoun    20
#>  2 Cascades             20
#>  3 Centre               20
#>  4 Centre-Est           20
#>  5 Centre-Nord          20
#>  6 Centre-Ouest         20
#>  7 Centre-Sud           20
#>  8 Est                  20
#>  9 Hauts-Bassins        20
#> 10 Nord                 20
#> 11 Plateau-Central      20
#> 12 Sahel                20
#> 13 Sud-Ouest            20

Allocation Methods

With an allocation method, n becomes the total sample size to distribute across strata. Allocation methods are defined for total sample-size allocation, so draw(frac = ...) is not supported when alloc is set.

Proportional allocation distributes the sample proportionally to stratum sizes, ensuring the sample mirrors the population structure.

sample <- sampling_design(title = "Proportional Allocation") |>
  stratify_by(region, alloc = "proportional") |>
  draw(n = 300) |>
  execute(bfa_eas, seed = 42)

sample |>
  count(region) |>
  mutate(pct = n / sum(n))
#> # A tibble: 13 × 3
#>    region                n    pct
#>    <fct>             <int>  <dbl>
#>  1 Boucle du Mouhoun    30 0.1   
#>  2 Cascades             14 0.0467
#>  3 Centre               31 0.103 
#>  4 Centre-Est           25 0.0833
#>  5 Centre-Nord          28 0.0933
#>  6 Centre-Ouest         26 0.0867
#>  7 Centre-Sud           12 0.04  
#>  8 Est                  32 0.107 
#>  9 Hauts-Bassins        30 0.1   
#> 10 Nord                 24 0.08  
#> 11 Plateau-Central      15 0.05  
#> 12 Sahel                18 0.06  
#> 13 Sud-Ouest            15 0.05

Equal allocation assigns the same sample size to each stratum, regardless of population size. This maximizes precision for comparisons between strata.

sample <- sampling_design(title = "Equal Allocation") |>
  stratify_by(region, alloc = "equal") |>
  draw(n = 160) |>
  execute(bfa_eas, seed = 2026)

sample |>
  count(region)
#> # A tibble: 13 × 2
#>    region                n
#>    <fct>             <int>
#>  1 Boucle du Mouhoun    13
#>  2 Cascades             13
#>  3 Centre               13
#>  4 Centre-Est           13
#>  5 Centre-Nord          12
#>  6 Centre-Ouest         12
#>  7 Centre-Sud           12
#>  8 Est                  12
#>  9 Hauts-Bassins        12
#> 10 Nord                 12
#> 11 Plateau-Central      12
#> 12 Sahel                12
#> 13 Sud-Ouest            12

Neyman allocation minimizes the variance of the overall estimate by allocating more sample to strata with higher variability. It requires prior information about stratum variances. Use explicit stratification variable names when passing variance.

data(bfa_eas_variance)

bfa_eas_variance
#> # A tibble: 13 × 2
#>    region              var
#>    <fct>             <dbl>
#>  1 Boucle du Mouhoun  96.9
#>  2 Cascades           63.6
#>  3 Centre             34.7
#>  4 Centre-Est         58.5
#>  5 Centre-Nord        94.9
#>  6 Centre-Ouest       32.4
#>  7 Centre-Sud         31.9
#>  8 Est               140. 
#>  9 Hauts-Bassins      57.7
#> 10 Nord              155. 
#> 11 Plateau-Central    55.9
#> 12 Sahel             145. 
#> 13 Sud-Ouest          63.1

sample <- sampling_design(title = "Neyman Allocation") |>
  stratify_by(region, alloc = "neyman", variance = bfa_eas_variance) |>
  draw(n = 300) |>
  execute(bfa_eas, seed = 12345)

sample |>
  count(region)
#> # A tibble: 13 × 2
#>    region                n
#>    <fct>             <int>
#>  1 Boucle du Mouhoun    34
#>  2 Cascades             12
#>  3 Centre               21
#>  4 Centre-Est           22
#>  5 Centre-Nord          31
#>  6 Centre-Ouest         17
#>  7 Centre-Sud            8
#>  8 Est                  43
#>  9 Hauts-Bassins        26
#> 10 Nord                 35
#> 11 Plateau-Central      13
#> 12 Sahel                25
#> 13 Sud-Ouest            13

Optimal allocation extends Neyman allocation by also accounting for differential costs across strata. It minimizes variance for a fixed budget (or cost for fixed precision). Use explicit stratification variable names when passing variance/cost.

data(bfa_eas_cost)

bfa_eas_cost
#> # A tibble: 13 × 2
#>    region             cost
#>    <fct>             <dbl>
#>  1 Boucle du Mouhoun   483
#>  2 Cascades            427
#>  3 Centre              240
#>  4 Centre-Est          400
#>  5 Centre-Nord         444
#>  6 Centre-Ouest        399
#>  7 Centre-Sud          418
#>  8 Est                 631
#>  9 Hauts-Bassins       358
#> 10 Nord                583
#> 11 Plateau-Central     408
#> 12 Sahel               639
#> 13 Sud-Ouest           416

sample <- sampling_design(title = "Optimal Allocation") |>
  stratify_by(region, alloc = "optimal",
              variance = bfa_eas_variance,
              cost = bfa_eas_cost) |>
  draw(n = 300) |>
  execute(bfa_eas, seed = 9876)

sample |>
  count(region)
#> # A tibble: 13 × 2
#>    region                n
#>    <fct>             <int>
#>  1 Boucle du Mouhoun    33
#>  2 Cascades             13
#>  3 Centre               29
#>  4 Centre-Est           23
#>  5 Centre-Nord          31
#>  6 Centre-Ouest         18
#>  7 Centre-Sud            8
#>  8 Est                  37
#>  9 Hauts-Bassins        29
#> 10 Nord                 30
#> 11 Plateau-Central      14
#> 12 Sahel                21
#> 13 Sud-Ouest            14

Power allocation (Bankier 1988) uses \(n_h \propto C_h X_h^q\), where cv supplies \(C_h\), importance supplies \(X_h\), and power is \(q \in [0, 1]\).

cv_df <- data.frame(
  region = levels(bfa_eas$region),
  cv = c(0.40, 0.35, 0.12, 0.20, 0.30, 0.18,
         0.15, 0.38, 0.22, 0.32, 0.17, 0.45, 0.25)
)

importance_df <- data.frame(
  region = levels(bfa_eas$region),
  importance = c(60, 40, 120, 70, 80, 65,
                 50, 55, 90, 75, 45, 35, 30)
)

sample <- sampling_design(title = "Power Allocation") |>
  stratify_by(
    region,
    alloc = "power",
    cv = cv_df,
    importance = importance_df,
    power = 0.5
  ) |>
  draw(n = 300) |>
  execute(bfa_eas, seed = 777)

sample |>
  count(region)
#> # A tibble: 13 × 2
#>    region                n
#>    <fct>             <int>
#>  1 Boucle du Mouhoun    35
#>  2 Cascades             25
#>  3 Centre               15
#>  4 Centre-Est           19
#>  5 Centre-Nord          31
#>  6 Centre-Ouest         16
#>  7 Centre-Sud           12
#>  8 Est                  32
#>  9 Hauts-Bassins        24
#> 10 Nord                 32
#> 11 Plateau-Central      13
#> 12 Sahel                30
#> 13 Sud-Ouest            16

Sample Size Bounds

When using allocation methods, you can constrain stratum sample sizes with min_n and max_n in draw().

Minimum sample size ensures every stratum gets at least the specified number of units. This is essential for variance estimation (requires n ≥ 2 per stratum) or when you need reliable subgroup estimates.

# Neyman allocation with minimum 5 per stratum
sample <- sampling_design() |>
  stratify_by(region, alloc = "neyman", variance = bfa_eas_variance) |>
  draw(n = 300, min_n = 20) |>
  execute(bfa_eas, seed = 1)

sample |>
  count(region)
#> # A tibble: 13 × 2
#>    region                n
#>    <fct>             <int>
#>  1 Boucle du Mouhoun    27
#>  2 Cascades             20
#>  3 Centre               20
#>  4 Centre-Est           21
#>  5 Centre-Nord          26
#>  6 Centre-Ouest         20
#>  7 Centre-Sud           20
#>  8 Est                  32
#>  9 Hauts-Bassins        23
#> 10 Nord                 28
#> 11 Plateau-Central      20
#> 12 Sahel                23
#> 13 Sud-Ouest            20

Maximum sample size caps large strata to prevent them from dominating the sample. This is useful when operational constraints limit how many units you can handle in any one stratum.

# Proportional allocation but cap at 60 per stratum
sample <- sampling_design() |>
  stratify_by(region, alloc = "proportional") |>
  draw(n = 300, max_n = 55) |>
  execute(bfa_eas, seed = 40)

sample |>
  count(region)
#> # A tibble: 13 × 2
#>    region                n
#>    <fct>             <int>
#>  1 Boucle du Mouhoun    30
#>  2 Cascades             14
#>  3 Centre               31
#>  4 Centre-Est           25
#>  5 Centre-Nord          28
#>  6 Centre-Ouest         26
#>  7 Centre-Sud           12
#>  8 Est                  32
#>  9 Hauts-Bassins        30
#> 10 Nord                 24
#> 11 Plateau-Central      15
#> 12 Sahel                18
#> 13 Sud-Ouest            15

Both bounds together create a feasible range for each stratum:

sample <- sampling_design() |>
  stratify_by(region, alloc = "proportional") |>
  draw(n = 300, min_n = 20, max_n = 55) |>
  execute(bfa_eas, seed = 2003)

sample |>
  count(region)
#> # A tibble: 13 × 2
#>    region                n
#>    <fct>             <int>
#>  1 Boucle du Mouhoun    26
#>  2 Cascades             20
#>  3 Centre               27
#>  4 Centre-Est           23
#>  5 Centre-Nord          25
#>  6 Centre-Ouest         23
#>  7 Centre-Sud           20
#>  8 Est                  27
#>  9 Hauts-Bassins        26
#> 10 Nord                 23
#> 11 Plateau-Central      20
#> 12 Sahel                20
#> 13 Sud-Ouest            20

When a stratum’s population is smaller than min_n, the entire stratum is selected (capped at population size).

Custom allocation lets you specify exact sample sizes or rates per stratum by passing a data frame to draw().

sizes_df <- data.frame(
  region = levels(bfa_eas$region),
  n = c(20, 12, 25, 18, 22, 16, 14, 15, 20, 18, 12, 10, 8)
)

sample <- sampling_design() |>
  stratify_by(region) |>
  draw(n = sizes_df) |>
  execute(bfa_eas, seed = 101)

sample |>
  count(region)
#> # A tibble: 13 × 2
#>    region                n
#>    <fct>             <int>
#>  1 Boucle du Mouhoun    20
#>  2 Cascades             12
#>  3 Centre               25
#>  4 Centre-Est           18
#>  5 Centre-Nord          22
#>  6 Centre-Ouest         16
#>  7 Centre-Sud           14
#>  8 Est                  15
#>  9 Hauts-Bassins        20
#> 10 Nord                 18
#> 11 Plateau-Central      12
#> 12 Sahel                10
#> 13 Sud-Ouest             8

Multiple Stratification Variables

You can stratify by multiple variables to create crossed strata. Here we stratify by both region and urban/rural status.

sample <- sampling_design() |>
  stratify_by(region, urban_rural, alloc = "proportional") |>
  draw(n = 300) |>
  execute(bfa_eas, seed = 42)

sample |>
  count(region, urban_rural) |>
  head(10)
#> # A tibble: 10 × 3
#>    region            urban_rural     n
#>    <fct>             <fct>       <int>
#>  1 Boucle du Mouhoun Rural          30
#>  2 Cascades          Rural          13
#>  3 Centre            Rural           2
#>  4 Centre            Urban          29
#>  5 Centre-Est        Rural          23
#>  6 Centre-Est        Urban           3
#>  7 Centre-Nord       Rural          24
#>  8 Centre-Nord       Urban           3
#>  9 Centre-Ouest      Rural          22
#> 10 Centre-Ouest      Urban           4

Cluster Sampling

Cluster sampling selects groups (clusters) rather than individual units. This is practical when a complete list of individuals isn’t available or when travel costs make dispersed sampling expensive.

Use cluster_by() to specify the cluster identifier. All units within selected clusters are included.

sample <- sampling_design() |>
  cluster_by(ea_id) |>
  draw(n = 50) |>
  execute(bfa_eas, seed = 120)

sample |>
  summarise(
    n_clusters = n_distinct(ea_id),
    n_units = n()
  )
#> # A tibble: 1 × 2
#>   n_clusters n_units
#>        <int>   <int>
#> 1         50      50

PPS Sampling

Probability Proportional to Size (PPS) sampling selects units with probability proportional to a size measure. This is standard for cluster sampling where clusters vary in size.

Use the mos argument to specify the measure of size variable. Here we select EAs with probability proportional to their household count.

sample_pps <- sampling_design() |>
  cluster_by(ea_id) |>
  draw(n = 50, method = "pps_brewer", mos = households) |>
  execute(bfa_eas, seed = 365)

sample_pps |>
  summarise(
    mean_hh = mean(households),
    median_hh = median(households)
  )
#> # A tibble: 1 × 2
#>   mean_hh median_hh
#>     <dbl>     <dbl>
#> 1    232.       213

Compared to simple random sampling, PPS tends to select larger clusters:

sample_srs <- sampling_design() |>
  cluster_by(ea_id) |>
  draw(n = 50) |>
  execute(bfa_eas, seed = 42)

bind_rows(
  sample_pps |> summarise(method = "PPS", mean_hh = mean(households)),
  sample_srs |> summarise(method = "SRS", mean_hh = mean(households))
)
#> # A tibble: 2 × 2
#>   method mean_hh
#>   <chr>    <dbl>
#> 1 PPS       232.
#> 2 SRS       203.

PPS Methods

samplyr supports several PPS methods:

Method Description
pps_brewer Brewer’s method - fast with good properties (recommended)
pps_systematic PPS systematic - simple but can have periodicity issues
pps_cps Conditional Poisson sampling (maximum entropy)
pps_poisson PPS Poisson - random sample size (requires frac and supports prn)
pps_sps Sequential Poisson sampling (supports prn)
pps_pareto Pareto sampling (supports prn)
pps_multinomial PPS with replacement
pps_chromy Chromy’s sequential PPS / minimum replacement

Certainty Selection

In PPS sampling, very large units can have expected inclusion probabilities above 1. Certainty selection handles this by selecting those units with probability 1 and sampling the rest normally.

Use certainty_size for an absolute threshold or certainty_prop for a proportional one.

sample_cert <- sampling_design(title = "Certainty selection") |>
  draw(n = 50, method = "pps_brewer", mos = households,
       certainty_size = 800) |>
  execute(bfa_eas, seed = 123)

# How many EAs with more than 800 households
filter(bfa_eas, households > 800)
#> # A tibble: 8 × 12
#>   ea_id    region    province commune urban_rural population households area_km2
#>   <chr>    <fct>     <fct>    <fct>   <fct>            <dbl>      <int>    <dbl>
#> 1 EA_01319 Centre-E… Boulgou  Bitou   Rural             4559        815     8.31
#> 2 EA_07029 Centre-E… Kourite… Koupela Urban             3971        805    26.9 
#> 3 EA_07045 Centre-E… Kourite… Koupela Urban             4640        941     3.62
#> 4 EA_13207 Centre-O… Boulkie… Thiou   Rural             5273        843    23.8 
#> 5 EA_11261 Centre-O… Sanguie  Reo     Urban             5171       1111    24.5 
#> 6 EA_11263 Centre-O… Sanguie  Reo     Urban             3743        804    17.4 
#> 7 EA_00797 Nord      Yatenga  Barga   Urban             5840       1032     6.03
#> 8 EA_02277 Plateau-… Ganzour… Boudri  Rural             5273        807    23.0 
#> # ℹ 4 more variables: accessible <lgl>, dist_road_km <dbl>,
#> #   food_insecurity_pct <dbl>, cost <dbl>

# Check which units were selected with certainty
count(sample_cert, .certainty_1)
#> # A tibble: 2 × 2
#>   .certainty_1     n
#>   <lgl>        <int>
#> 1 FALSE           42
#> 2 TRUE             8

With certainty_prop, units whose MOS proportion exceeds the threshold are taken with certainty. The process is iterative: after removing certainty units, proportions are recomputed.

sample_certp <- sampling_design() |>
  stratify_by(region) |>
  draw(n = 50, method = "pps_systematic", mos = households,
       certainty_prop = 0.05) |>
  execute(bfa_eas, seed = 12345)

# How many EAs with more than 5% of the total hh by region
bfa_eas |>
  mutate(mosprop = households/sum(households),
         .by = region) |>
  filter(mosprop >= 0.05)
#> # A tibble: 0 × 13
#> # ℹ 13 variables: ea_id <chr>, region <fct>, province <fct>, commune <fct>,
#> #   urban_rural <fct>, population <dbl>, households <int>, area_km2 <dbl>,
#> #   accessible <lgl>, dist_road_km <dbl>, food_insecurity_pct <dbl>,
#> #   cost <dbl>, mosprop <dbl>

filter(sample_certp, .certainty_1)
#> Warning in min(w): no non-missing arguments to min; returning Inf
#> Warning in max(w): no non-missing arguments to max; returning -Inf
#> # A tbl_sample: 0 × 18
#> # Weights:      NaN [Inf, -Inf]
#> # ℹ 18 variables: ea_id <chr>, region <fct>, province <fct>, commune <fct>,
#> #   urban_rural <fct>, population <dbl>, households <int>, area_km2 <dbl>,
#> #   accessible <lgl>, dist_road_km <dbl>, food_insecurity_pct <dbl>,
#> #   cost <dbl>, .weight <dbl>, .sample_id <int>, .stage <int>, .weight_1 <dbl>,
#> #   .fpc_1 <int>, .certainty_1 <lgl>

Permanent Random Numbers (PRN)

Some PPS methods support permanent random numbers for sample coordination across survey waves. Assign a stable U(0,1) value to each frame unit, then pass it via prn:

set.seed(1)
bfa_eas$prn <- runif(nrow(bfa_eas))

sample_prn <- sampling_design(title = "Samples coordination with PRN") |>
  cluster_by(ea_id) |>
  draw(n = 50, method = "pps_sps", mos = households, prn = prn) |>
  execute(bfa_eas, seed = 1)
sample_prn
#> # A tbl_sample: 50 × 19 | Samples coordination with PRN
#> # Weights:      286.17 [118.08, 856.92]
#>    ea_id    region   province commune urban_rural population households area_km2
#>  * <chr>    <fct>    <fct>    <fct>   <fct>            <dbl>      <int>    <dbl>
#>  1 EA_12900 Boucle … Banwa    Tansila Rural             1164        140    39.1 
#>  2 EA_03170 Boucle … Mouhoun  Dedoug… Rural             1209        215     3.81
#>  3 EA_12936 Boucle … Mouhoun  Tcheri… Rural             1252        200    43.2 
#>  4 EA_03269 Boucle … Sourou   Di      Rural             2337        305     1.15
#>  5 EA_12213 Cascades Comoe    Sidera… Rural              763        122     9.4 
#>  6 EA_12558 Cascades Comoe    Soubak… Rural             2114        266    70.0 
#>  7 EA_13387 Cascades Comoe    Tiefora Rural             1129        145    26.0 
#>  8 EA_08816 Centre   Kadiogo  Ouagad… Urban             1748        263     0.26
#>  9 EA_09321 Centre   Kadiogo  Ouagad… Urban             1754        264     0.27
#> 10 EA_09370 Centre   Kadiogo  Ouagad… Urban             2573        387     0.35
#> # ℹ 40 more rows
#> # ℹ 11 more variables: accessible <lgl>, dist_road_km <dbl>,
#> #   food_insecurity_pct <dbl>, cost <dbl>, prn <dbl>, .weight <dbl>,
#> #   .sample_id <int>, .stage <int>, .weight_1 <dbl>, .fpc_1 <int>,
#> #   .certainty_1 <lgl>

Because the PRN values are stable, re-executing the same design on the same frame produces highly overlapping samples (positive coordination). PRN is supported for bernoulli, pps_poisson, pps_sps, and pps_pareto. You can learn more about coordination with permanent random numbers in the following vignette vignette("sampling-coordination").


Control Sorting and Serpentine Ordering

Control sorting orders the frame before selection, providing implicit stratification. This is effective with systematic and sequential methods, where it ensures the sample spreads evenly across the sorted variables.

# Nested sorting: standard ascending order
sample_sort <- sampling_design() |>
  draw(n = 100, method = "systematic",
       control = c(region, province)) |>
  execute(bfa_eas, seed = 98765)
head(sample_sort)
#> # A tbl_sample: 6 × 18
#> # Weights:      149.34 [149.34, 149.34]
#>   ea_id    region    province commune urban_rural population households area_km2
#> * <chr>    <fct>     <fct>    <fct>   <fct>            <dbl>      <int>    <dbl>
#> 1 EA_04258 Boucle d… Bale     Fara    Rural             1722        252    21.6 
#> 2 EA_06885 Boucle d… Banwa    Kouka   Rural             1446        199    19.5 
#> 3 EA_12448 Boucle d… Banwa    Solenzo Rural              579         78     8.59
#> 4 EA_02527 Boucle d… Kossi    Bouras… Rural             1268        143    42.5 
#> 5 EA_08665 Boucle d… Kossi    Nouna   Rural             1298        192    23.8 
#> 6 EA_03147 Boucle d… Mouhoun  Dedoug… Rural              869        154     1.29
#> # ℹ 10 more variables: accessible <lgl>, dist_road_km <dbl>,
#> #   food_insecurity_pct <dbl>, cost <dbl>, prn <dbl>, .weight <dbl>,
#> #   .sample_id <int>, .stage <int>, .weight_1 <dbl>, .fpc_1 <int>

serp() implements serpentine (snake) sorting, which alternates direction at each level of the hierarchy. This minimizes jumps at boundaries, matching SAS PROC SURVEYSELECT’s SORT=SERP.

# Serpentine sorting
sample_serp <- sampling_design() |>
  draw(n = 100, method = "systematic",
       control = serp(region, province)) |>
  execute(bfa_eas, seed = 1)
head(sample_serp)
#> # A tbl_sample: 6 × 18
#> # Weights:      149.34 [149.34, 149.34]
#>   ea_id    region    province commune urban_rural population households area_km2
#> * <chr>    <fct>     <fct>    <fct>   <fct>            <dbl>      <int>    <dbl>
#> 1 EA_00461 Boucle d… Bale     Bana    Rural             1348        176      7.3
#> 2 EA_11130 Boucle d… Bale     Poura   Urban             1660        267     19.0
#> 3 EA_12367 Boucle d… Banwa    Solenzo Rural              984        132     23.2
#> 4 EA_12925 Boucle d… Banwa    Tansila Rural              824         99     18.1
#> 5 EA_03758 Boucle d… Kossi    Dokui   Rural             1277        158     25.8
#> 6 EA_12543 Boucle d… Kossi    Sono    Rural             1246        188     78.8
#> # ℹ 10 more variables: accessible <lgl>, dist_road_km <dbl>,
#> #   food_insecurity_pct <dbl>, cost <dbl>, prn <dbl>, .weight <dbl>,
#> #   .sample_id <int>, .stage <int>, .weight_1 <dbl>, .fpc_1 <int>

Control sorting can be combined with explicit stratification. The sorting is then applied within each stratum.

sample_serp2 <- sampling_design() |>
  stratify_by(urban_rural) |>
  draw(n = 100, method = "systematic",
       control = serp(region, province)) |>
  execute(bfa_eas, seed = 2)
head(sample_serp2)
#> # A tbl_sample: 6 × 18
#> # Weights:      122.78 [122.78, 122.78]
#>   ea_id    region    province commune urban_rural population households area_km2
#> * <chr>    <fct>     <fct>    <fct>   <fct>            <dbl>      <int>    <dbl>
#> 1 EA_00267 Boucle d… Bale     Bagassi Rural             1319        159    17.1 
#> 2 EA_10369 Boucle d… Bale     Ouri    Rural             1522        195     2.25
#> 3 EA_06892 Boucle d… Banwa    Kouka   Rural             1447        199     7.21
#> 4 EA_12429 Boucle d… Banwa    Solenzo Rural              192         26     7.78
#> 5 EA_00764 Boucle d… Kossi    Barani  Rural              393         51    47.7 
#> 6 EA_03766 Boucle d… Kossi    Dokui   Rural             1093        135    18.2 
#> # ℹ 10 more variables: accessible <lgl>, dist_road_km <dbl>,
#> #   food_insecurity_pct <dbl>, cost <dbl>, prn <dbl>, .weight <dbl>,
#> #   .sample_id <int>, .stage <int>, .weight_1 <dbl>, .fpc_1 <int>

Balanced Sampling

Balanced sampling selects a sample whose Horvitz-Thompson estimates of auxiliary totals match (or nearly match) the population totals. This improves precision for any variable correlated with the auxiliary variables. samplyr implements the cube method (Deville & Tille 2004) via method = "balanced".

Specify auxiliary variables with aux. Without mos, inclusion probabilities are equal; with mos, they are proportional to size:

# Equal-probability balanced sample
balanced_eq <- sampling_design() |>
  draw(n = 100, method = "balanced",
       aux = c(population, households)) |>
  execute(bfa_eas, seed = 42)
balanced_eq
#> # A tbl_sample: 100 × 18
#> # Weights:      149.34 [149.34, 149.34]
#>    ea_id    region   province commune urban_rural population households area_km2
#>  * <chr>    <fct>    <fct>    <fct>   <fct>            <dbl>      <int>    <dbl>
#>  1 EA_02157 Boucle … Bale     Boromo  Rural              129         17     8.12
#>  2 EA_08673 Boucle … Kossi    Nouna   Rural             1071        158     5.65
#>  3 EA_03203 Boucle … Mouhoun  Dedoug… Rural             1810        321    28.9 
#>  4 EA_06454 Boucle … Mouhoun  Kona    Rural             1400        194    33.1 
#>  5 EA_04530 Boucle … Nayala   Gassan  Rural             1336        147    67.3 
#>  6 EA_13871 Boucle … Sourou   Tougan  Rural             1204        165     7   
#>  7 EA_08512 Cascades Comoe    Niango… Rural             1146        157    83.9 
#>  8 EA_12128 Cascades Comoe    Sidera… Rural             1292        206    24.0 
#>  9 EA_02946 Cascades Leraba   Dakoro  Rural             1254        160    33.6 
#> 10 EA_12340 Cascades Leraba   Sindou  Rural              554         68     7.33
#> # ℹ 90 more rows
#> # ℹ 10 more variables: accessible <lgl>, dist_road_km <dbl>,
#> #   food_insecurity_pct <dbl>, cost <dbl>, prn <dbl>, .weight <dbl>,
#> #   .sample_id <int>, .stage <int>, .weight_1 <dbl>, .fpc_1 <int>

The balancing property means the HT estimate of auxiliary totals is close to the true total:

pop_total <- sum(bfa_eas$population)
ht_total <- sum(balanced_eq$population * balanced_eq$.weight)
c(population = pop_total, ht_estimate = ht_total)
#>  population ht_estimate 
#>    20487532    20655813

Balanced sampling combines naturally with stratification. When stratified, samplyr calls the stratified cube algorithm (Chauvet 2009) in a single pass, preserving the allocation computed by stratify_by():

# Stratified PPS balanced sample
balanced_strat <- sampling_design() |>
  stratify_by(region, alloc = "proportional") |>
  draw(n = 300, method = "balanced", mos = households,
       aux = c(population, area_km2)) |>
  execute(bfa_eas, seed = 1960)
balanced_strat
#> # A tbl_sample: 300 × 18
#> # Weights:      48.57 [7.35, 169.35]
#>    ea_id    region   province commune urban_rural population households area_km2
#>  * <chr>    <fct>    <fct>    <fct>   <fct>            <dbl>      <int>    <dbl>
#>  1 EA_00467 Boucle … Bale     Bana    Rural             1994        261     5.99
#>  2 EA_04220 Boucle … Bale     Fara    Rural              790        116    16.6 
#>  3 EA_04249 Boucle … Bale     Fara    Rural             2148        314    24.8 
#>  4 EA_10391 Boucle … Bale     Pa      Rural             1066        153    17.3 
#>  5 EA_06858 Boucle … Banwa    Kouka   Rural             1133        156    17.5 
#>  6 EA_06864 Boucle … Banwa    Kouka   Rural             1557        214    25.1 
#>  7 EA_06892 Boucle … Banwa    Kouka   Rural             1447        199     7.21
#>  8 EA_12390 Boucle … Banwa    Solenzo Rural             1799        242    19.2 
#>  9 EA_12438 Boucle … Banwa    Solenzo Rural             1341        180     0.46
#> 10 EA_12471 Boucle … Banwa    Solenzo Rural             1857        249    53.4 
#> # ℹ 290 more rows
#> # ℹ 10 more variables: accessible <lgl>, dist_road_km <dbl>,
#> #   food_insecurity_pct <dbl>, cost <dbl>, prn <dbl>, .weight <dbl>,
#> #   .sample_id <int>, .stage <int>, .weight_1 <dbl>, .fpc_1 <dbl>

When used with cluster_by(), auxiliary values are automatically aggregated (summed) to the cluster level before selection.

Balanced sampling supports up to 2 stages. At stage 1, element-level auxiliary variables are aggregated to PSU totals; at stage 2, the cube method runs on the element-level frame within each selected PSU.


Multi-Stage Sampling

Multi-stage designs sample in stages: first select primary sampling units (PSUs), then sample within them. This is the standard approach for large-scale surveys when a complete frame of ultimate units isn’t available upfront.

Use add_stage() to delimit each stage of the design.

Two-Stage Design

For this example, we’ll use zwe_eas to demonstrate a two-stage design: first select districts as PSUs, then sample EAs within selected districts.

data(zwe_eas)

# First, let's see the structure: EAs nested within districts
zwe_eas |>
  count(province, district) |>
  head(10)
#> # A tibble: 10 × 3
#>    province   district           n
#>    <fct>      <fct>          <int>
#>  1 Bulawayo   Bulawayo         453
#>  2 Harare     Chitungwiza      161
#>  3 Harare     Epworth          125
#>  4 Harare     Harare           969
#>  5 Harare     Harare Rural     224
#>  6 Manicaland Buhera           556
#>  7 Manicaland Chimanimani      206
#>  8 Manicaland Chipinge         489
#>  9 Manicaland Chipinge Urban    22
#> 10 Manicaland Makoni           668

We need to create a measure of size for districts. Here we’ll use total households per district.

# Add district-level households as MOS
zwe_frame <- zwe_eas |>
  mutate(district_hh = sum(households),
         .by = district)

Now we can run a two-stage design: select 10 districts with PPS, then sample 5 EAs within each.

twostage_design <- sampling_design(title = "Two-stage cluster sampling") |>
  add_stage(label = "Districts") |>
    cluster_by(district) |>
    draw(n = 10, method = "pps_brewer", mos = district_hh) |>
  add_stage(label = "EAs") |>
  draw(n = 5)
twostage_design
#> ── Sampling Design: Two-stage cluster sampling ─────────────────────────────────
#> 
#> ℹ 2 stages
#> 
#> ── Stage 1: Districts ──────────────────────────────────────────────────────────
#> • Cluster: district
#> • Draw: n = 10, method = pps_brewer, mos = district_hh
#> 
#> ── Stage 2: EAs ────────────────────────────────────────────────────────────────
#> • Draw: n = 5, method = srswor

Let’s apply it to the frame.

sample_eas <- execute(twostage_design,
                      zwe_frame, seed = 3)

sample_eas |>
  summarise(n_districts = n_distinct(district),
            n_eas = n())
#> # A tibble: 1 × 2
#>   n_districts n_eas
#>         <int> <int>
#> 1          10    50

Stratified Multi-Stage Design

Combining stratification with multi-stage sampling is standard for national surveys. Here we stratify by province (which is constant within districts), then select districts and EAs within each stratum.

sample_eas2 <- sampling_design(title = "Stratified two-stage cluster sampling") |>
  add_stage(label = "Districts") |>
    stratify_by(province) |>
    cluster_by(district) |>
    draw(n = 2, method = "pps_brewer", mos = district_hh) |>
  add_stage(label = "EAs") |>
    draw(n = 3) |>
  execute(zwe_frame, seed = 4)
#> Warning: Sample size capped to population in 1 stratum/strata: "Bulawayo".
#>  Requested total: 20. Actual total: 19.

sample_eas2 |>
  count(province, district)
#> # A tibble: 19 × 3
#>    province            district         n
#>    <fct>               <fct>        <int>
#>  1 Bulawayo            Bulawayo         3
#>  2 Harare              Harare           3
#>  3 Harare              Harare Rural     3
#>  4 Manicaland          Buhera           3
#>  5 Manicaland          Makoni           3
#>  6 Mashonaland Central Guruve           3
#>  7 Mashonaland Central Mount Darwin     3
#>  8 Mashonaland East    Goromonzi        3
#>  9 Mashonaland East    Murehwa          3
#> 10 Mashonaland West    Sanyati          3
#> 11 Mashonaland West    Zvimba           3
#> 12 Masvingo            Bikita           3
#> 13 Masvingo            Masvingo         3
#> 14 Matabeleland North  Binga            3
#> 15 Matabeleland North  Bubi             3
#> 16 Matabeleland South  Gwanda           3
#> 17 Matabeleland South  Umzingwane       3
#> 18 Midlands            Gokwe South      3
#> 19 Midlands            Zvishavane       3

Operational Multi-Stage Sampling

In practice, multi-stage surveys often execute stages at different times. After selecting PSUs, field teams may need to conduct listing or other operations before the second stage can proceed.

The stages argument lets you execute only specific stages.

design <- sampling_design() |>
  add_stage(label = "Districts") |>
    stratify_by(province) |>
    cluster_by(district) |>
    draw(n = 2, method = "pps_brewer", mos = district_hh) |>
  add_stage(label = "EAs") |>
    draw(n = 4)

# Execute stage 1 only
selected_districts <- execute(design, zwe_frame, stages = 1, seed = 42)
#> Warning: Sample size capped to population in 1 stratum/strata: "Bulawayo".
#>  Requested total: 20. Actual total: 19.

selected_districts |>
  distinct(province, district)
#> # A tibble: 19 × 2
#>    province            district     
#>    <fct>               <fct>        
#>  1 Bulawayo            Bulawayo     
#>  2 Harare              Harare       
#>  3 Harare              Harare Rural 
#>  4 Manicaland          Chipinge     
#>  5 Manicaland          Nyanga       
#>  6 Mashonaland Central Mazowe       
#>  7 Mashonaland Central Mount Darwin 
#>  8 Mashonaland East    Mudzi        
#>  9 Mashonaland East    Murehwa      
#> 10 Mashonaland West    Chinhoyi     
#> 11 Mashonaland West    Kariba       
#> 12 Masvingo            Chivi        
#> 13 Masvingo            Masvingo     
#> 14 Matabeleland North  Tsholotsho   
#> 15 Matabeleland North  Umguza       
#> 16 Matabeleland South  Bulilima     
#> 17 Matabeleland South  Gwanda       
#> 18 Midlands            Shurugwi Town
#> 19 Midlands            Zvishavane

After fieldwork or preparation is complete, execute the remaining stages.

final_sample <- selected_districts |>
  execute(zwe_frame, seed = 43)

final_sample |>
  count(province, district)
#> # A tibble: 19 × 3
#>    province            district          n
#>    <fct>               <fct>         <int>
#>  1 Bulawayo            Bulawayo          4
#>  2 Harare              Harare            4
#>  3 Harare              Harare Rural      4
#>  4 Manicaland          Chipinge          4
#>  5 Manicaland          Nyanga            4
#>  6 Mashonaland Central Mazowe            4
#>  7 Mashonaland Central Mount Darwin      4
#>  8 Mashonaland East    Mudzi             4
#>  9 Mashonaland East    Murehwa           4
#> 10 Mashonaland West    Chinhoyi          4
#> 11 Mashonaland West    Kariba            4
#> 12 Masvingo            Chivi             4
#> 13 Masvingo            Masvingo          4
#> 14 Matabeleland North  Tsholotsho        4
#> 15 Matabeleland North  Umguza            4
#> 16 Matabeleland South  Bulilima          4
#> 17 Matabeleland South  Gwanda            4
#> 18 Midlands            Shurugwi Town     4
#> 19 Midlands            Zvishavane        4

Replicated Sampling

The reps parameter in execute() draws multiple independent samples from the same frame under the same design. The output is a single stacked tbl_sample with a .replicate column identifying each replicate. This is useful for simulation studies, repeated-sampling variance estimation, or quality control.

rep_sample <- sampling_design() |>
  stratify_by(region, alloc = "proportional") |>
  draw(n = 300) |>
  execute(bfa_eas, seed = 42, reps = 5)

rep_sample |>
  count(.replicate)
#> # A tibble: 5 × 2
#>   .replicate     n
#>        <int> <int>
#> 1          1   300
#> 2          2   300
#> 3          3   300
#> 4          4   300
#> 5          5   300

Each replicate is a fully independent draw. Replicate r uses seed seed + r - 1, so the first replicate matches a standalone execution with the same seed:

standalone <- sampling_design() |>
  stratify_by(region, alloc = "proportional") |>
  draw(n = 300) |>
  execute(bfa_eas, seed = 42)

rep1 <- rep_sample |> filter(.replicate == 1)
identical(sort(rep1$ea_id), sort(standalone$ea_id))
#> [1] TRUE

Replicated sampling also works with multi-stage designs. When continuing from a replicated partial execution, each replicate is continued independently:

design <- sampling_design() |>
  add_stage(label = "Districts") |>
    cluster_by(district) |>
    draw(n = 10, method = "pps_brewer", mos = district_hh) |>
  add_stage(label = "EAs") |>
    draw(n = 5)

stage1 <- execute(design, zwe_frame, stages = 1, seed = 1, reps = 3)
final <- execute(stage1, zwe_frame, seed = 100)

final |>
  summarise(n_districts = n_distinct(district), n_eas = n(),
            .by = .replicate)
#> # A tibble: 3 × 3
#>   .replicate n_districts n_eas
#>        <int>       <int> <int>
#> 1          1          10    50
#> 2          2          10    50
#> 3          3          10    50

Survey export functions require a single replicate. Filter first:

one_rep <- rep_sample |> filter(.replicate == 1)
svy <- as_svydesign(one_rep)

Note: reps cannot be combined with panels or with stages that use permanent random numbers (PRN). PRN produces identical samples across replicates, so coordination across waves should use an explicit loop with different PRN vectors instead.


Working with Results

Accessing the Design

The sampling design is attached to the result and can be retrieved with get_design().

sample <- sampling_design() |>
  stratify_by(region, alloc = "proportional") |>
  draw(n = 200) |>
  execute(bfa_eas, seed = 2024)

design <- get_design(sample)
design
#> ── Sampling Design ─────────────────────────────────────────────────────────────
#> 
#> ℹ 1 stage
#> 
#> ── Stage 1 ─────────────────────────────────────────────────────────────────────
#> • Strata: region (proportional)
#> • Draw: n = 200, method = srswor

and accessing element of the design

design$stages[[1]]$strata$vars
#> [1] "region"

Weight Diagnostics

The summary() method reports weight diagnostics including Kish’s effective sample size and design effect due to unequal weighting. For the full design effect (including clustering and stratification), use survey::svymean() with deff = TRUE after converting with as_svydesign().

summary(sample)
#> ── Sample Summary ──────────────────────────────────────────────────────────────
#> 
#> ℹ n = 200 | stages = 1/1 | seed = 2024
#> 
#> ── Design: Stage 1 ─────────────────────────────────────────────────────────────
#> • Strata: region (proportional)
#> • Method: srswor
#> 
#> ── Allocation: Stage 1 ─────────────────────────────────────────────────────────
#>   region             N_h    n_h  f_h   
#>   Boucle du Mouhoun  1510   20   0.0132
#>   Cascades           667    9    0.0135
#>   Centre             1556   21   0.0135
#>   Centre-Est         1259   17   0.0135
#>   Centre-Nord        1375   19   0.0138
#>   Centre-Ouest       1287   17   0.0132
#>   Centre-Sud         615    8    0.0130
#>   Est                1590   21   0.0132
#>   Hauts-Bassins      1483   20   0.0135
#>   Nord               1211   16   0.0132
#>   Plateau-Central    757    10   0.0132
#>   Sahel              902    12   0.0133
#>   Sud-Ouest          722    10   0.0139
#>                      ─────  ───  ──────
#>   Total              14934  200  0.0134
#> 
#> ── Weights ─────────────────────────────────────────────────────────────────────
#> • Range: [72.2, 76.88]
#> • Mean:  74.67 · CV: 0.02
#> • DEFF:  1 · n_eff: 200

Method Reference

Selection Methods

Method Sample Size Replacement Notes
srswor Fixed No Default, general purpose
srswr Fixed Yes Bootstrap, rare populations
systematic Fixed No Ordered frames
bernoulli Random No Simple field protocols
pps_brewer Fixed No Recommended PPS method
pps_systematic Fixed No Simple PPS
pps_cps Fixed No Maximum entropy, exact joint inclusion probs
pps_poisson Random No PPS with random size
pps_sps Fixed No Sequential Poisson, supports PRN
pps_pareto Fixed No Pareto sampling, supports PRN
pps_multinomial Fixed Yes PPS with replacement
pps_chromy Fixed PMR.1 Chromy’s sequential PPS
balanced Fixed No Cube method, balances on aux variables

Allocation Methods

Method Description Requirements
(none) n per stratum
equal Same n per stratum
proportional n ∝ stratum size
neyman Minimize variance variance data frame
optimal Minimize variance/cost variance + cost data frames
power Compromise: \(n_h \propto C_h X_h^q\) cv + importance (+ optional power)

For custom stratum-specific sizes or rates, pass a data frame directly to n or frac in draw(). When alloc is specified, use n (not frac).

Sample Size Bounds

Argument Description
min_n Minimum sample size per stratum (for variance estimation, set ≥ 2)
max_n Maximum sample size per stratum (to cap dominant strata)

These only apply when using allocation methods (equal, proportional, neyman, optimal, power).

Rounding Methods

Value Description
"up" Round up (ceiling), default, matches SAS
"down" Round down (floor)
"nearest" Standard mathematical rounding

These apply when using frac to specify sampling rates. When computed sample size < 1, it is always rounded up to 1.


Best Practices

  1. Always set a seed for reproducibility
  2. Use meaningful stage labels for documentation
  3. Validate designs with validate_frame() before execution
  4. Check weight distributions with summary() on the sample
design <- sampling_design(title = "National Health Survey 2024") |>
  add_stage(label = "Enumeration Areas") |>
    stratify_by(region, urban_rural) |>
    cluster_by(ea_id) |>
    draw(n = 5, method = "pps_brewer", mos = households) |>
  add_stage(label = "Households") |>
    draw(n = 20)

print(design)
#> ── Sampling Design: National Health Survey 2024 ────────────────────────────────
#> 
#> ℹ 2 stages
#> 
#> ── Stage 1: Enumeration Areas ──────────────────────────────────────────────────
#> • Strata: region, urban_rural
#> • Cluster: ea_id
#> • Draw: n = 5, method = pps_brewer, mos = households
#> 
#> ── Stage 2: Households ─────────────────────────────────────────────────────────
#> • Draw: n = 20, method = srswor

Frame Validation

Use validate_frame() to check that a frame has all the variables a design requires before executing.

# This passes
validate_frame(design, bfa_eas)

# This fails with a clear message
bad_frame <- data.frame(id = 1:100, value = rnorm(100))
validate_frame(design, bad_frame)
#> Error:
#> ! Frame validation failed:
#>  Enumeration Areas: missing stratification variable: "region" and
#>   "urban_rural"
#>  Enumeration Areas: missing cluster variable: "ea_id"
#>  Enumeration Areas: missing MOS variable: `households`

Reference

Bankier, Michael D. 1988. “Power Allocations: Determining Sample Sizes for Subnational Areas.” The American Statistician 42 (3): 174–77.
Lohr, Sharon L. 2022. Sampling: Design and Analysis. 3rd ed. CRC Press.