Introduction to samplyr • samplyr

Overview

samplyr provides a tidy grammar for specifying and executing survey sampling designs. The package is built around a minimal set of composable verbs that handle stratified, clustered, and multi-stage sampling.

For the statistical semantics and assumptions underlying each operation, see vignette("design-semantics").

The core idea is that sampling code should read like its English description:

library(samplyr)
library(dplyr)

# "Stratify by region, proportionally allocate 500 samples, execute"
sampling_design(title = "My sampling design") |>
  stratify_by(region, alloc = "proportional") |>
  draw(n = 500) |>
  execute(frame, seed = 1)

To see how this works in practice, consider a real survey design from Lohr (2022) (Example 7.1), based on a 1991 study of bed net use in rural Gambia (D’Alessandro et al., 1994):

Malaria morbidity can be reduced by using bed nets impregnated with insecticide, but this is only effective if the bed nets are in widespread use. In 1991, a nationwide survey was designed to estimate the prevalence of bed net use in rural areas of the Gambia (D’Alessandro et al., 1994).

The sampling frame consisted of all rural villages of fewer than 3,000 people. The villages were stratified by three geographic regions (eastern, central, and western) and by whether the village had a public health clinic (PHC) or not. In each region five districts were chosen with probability proportional to the district population. In each district four villages were chosen, again with probability proportional to census population: two PHC villages and two non-PHC villages. Finally, six compounds were chosen more or less randomly from each village.

In samplyr, this three-stage stratified cluster design translates directly into code:

design <- sampling_design(title = "Gambia bed nets") |>
  add_stage() |>
    stratify_by(region) |>
    cluster_by(district) |>
    draw(n = 5, method = "pps_brewer", mos = population) |>
  add_stage() |>
    stratify_by(phc) |>
    cluster_by(village) |>
    draw(n = 2, method = "pps_brewer", mos = population) |>
  add_stage() |>
    draw(n = 6)
design
#> ── Sampling Design: Gambia bed nets ────────────────────────────────────────────
#> 
#> ℹ 3 stages
#> 
#> ── Stage 1 ─────────────────────────────────────────────────────────────────────
#> • Strata: region
#> • Cluster: district
#> • Draw: n = 5, method = pps_brewer, mos = population
#> 
#> ── Stage 2 ─────────────────────────────────────────────────────────────────────
#> • Strata: phc
#> • Cluster: village
#> • Draw: n = 2, method = pps_brewer, mos = population
#> 
#> ── Stage 3 ─────────────────────────────────────────────────────────────────────
#> • Draw: n = 6, method = srswor

And you can then use execute to get your sample.

sampling_design(title = "Gambia bed net") |>
  add_stage() |>
    stratify_by(region) |>
    cluster_by(district) |>
    draw(n = 5, method = "pps_brewer", mos = population) |>
  add_stage() |>
    stratify_by(phc) |>
    cluster_by(village) |>
    draw(n = 2, method = "pps_brewer", mos = population) |>
  add_stage() |>
    draw(n = 6) |>
  execute(frame, seed = 1991)

# or
execute(design, frame, seed = 1991)

The samplyr code mirrors the verbal description verb for verb.

The Grammar

samplyr uses 5 verbs and 1 modifier:

Function	Purpose
`sampling_design()`	Create a new sampling design
`stratify_by()`	Define stratification variables and allocation
`cluster_by()`	Define cluster/PSU variable
`draw()`	Specify sample size, method
`execute()`	Run the design on a frame
`add_stage()`	Delimit stages in multi-stage designs

Deferred Column Resolution

stratify_by() and cluster_by() accept bare column names and store them in the design object. Column names are validated only when a frame is available (execute(), as_svydesign()), which keeps design specification separate from execution.

design <- sampling_design(title = "Stratified cluster sampling") |>
  stratify_by(region, alloc = "proportional") |>
  cluster_by(ea_id) |>
  draw(n = 200)

sample <- execute(design, bfa_eas, seed = 101)

Example Data

We’ll use the bfa_eas dataset throughout this vignette. It is an enumeration area frame for Burkina Faso, with ~14,900 EAs across 13 regions.

data(bfa_eas)

bfa_eas |>
  glimpse()
#> Rows: 14,900
#> Columns: 12
#> $ ea_id               <chr> "EA_00245", "EA_00246", "EA_00247", "EA_00248", "E…
#> $ region              <fct> Boucle du Mouhoun, Boucle du Mouhoun, Boucle du Mo…
#> $ province            <fct> Bale, Bale, Bale, Bale, Bale, Bale, Bale, Bale, Ba…
#> $ commune             <fct> Bagassi, Bagassi, Bagassi, Bagassi, Bagassi, Bagas…
#> $ urban_rural         <fct> Rural, Rural, Rural, Rural, Rural, Rural, Rural, R…
#> $ population          <dbl> 927, 2184, 1666, 592, 1467, 1607, 709, 1461, 1053,…
#> $ households          <int> 111, 262, 200, 71, 176, 193, 85, 176, 127, 122, 16…
#> $ area_km2            <dbl> 57.27, 2.57, 33.44, 31.73, 12.07, 25.53, 12.63, 16…
#> $ accessible          <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TR…
#> $ dist_road_km        <dbl> 17.5, 11.1, 17.3, 21.6, 4.2, 6.9, 27.4, 10.4, 10.9…
#> $ food_insecurity_pct <dbl> 9.5, 16.3, 28.0, 15.7, 31.4, 15.1, 26.9, 24.8, 17.…
#> $ cost                <dbl> 497, 415, 498, 440, 408, 211, 409, 430, 329, 1101,…

Simple Random Sampling

The most basic design selects n units at random from the frame.

design <- sampling_design(title = "Simple Random Sampling") |>
  draw(n = 100)
design
#> ── Sampling Design: Simple Random Sampling ─────────────────────────────────────
#> 
#> ℹ 1 stage
#> 
#> ── Stage 1 ─────────────────────────────────────────────────────────────────────
#> • Draw: n = 100, method = srswor

The sampling_design object stores design metadata and can be reused.

sample <- execute(design, bfa_eas, seed = 24)
nrow(sample)
#> [1] 100

The result includes the original columns plus sampling metadata: .weight (sampling weight), .weight_1, .weight_2, etc. (per-stage weights). For with-replacement methods, .draw_1, .draw_2, etc. each independent draw.

sample |>
  select(ea_id, region, urban_rural, .weight, .weight_1) |>
  head()
#> # A tbl_sample: 6 × 5 | Simple Random Sampling
#> # Weights:      149 [149, 149]
#>   ea_id    region            urban_rural .weight .weight_1
#> * <chr>    <fct>             <fct>         <dbl>     <dbl>
#> 1 EA_09278 Centre            Urban           149       149
#> 2 EA_04606 Sud-Ouest         Rural           149       149
#> 3 EA_07895 Sahel             Rural           149       149
#> 4 EA_14017 Boucle du Mouhoun Rural           149       149
#> 5 EA_10646 Hauts-Bassins     Rural           149       149
#> 6 EA_05166 Hauts-Bassins     Rural           149       149

Selection Methods

The method argument controls how units are selected. By default, samplyr uses simple random sampling without replacement (srswor).

Systematic sampling selects units at fixed intervals, which can improve precision when the frame is ordered.

sample_sys <- sampling_design(title = "Systematic") |>
  draw(n = 100, method = "systematic") |>
  execute(bfa_eas, seed = 2021)
head(sample_sys)
#> # A tbl_sample: 6 × 17 | Systematic
#> # Weights:      149 [149, 149]
#>   ea_id    region    province commune urban_rural population households area_km2
#> * <chr>    <fct>     <fct>    <fct>   <fct>            <dbl>      <int>    <dbl>
#> 1 EA_02176 Boucle d… Bale     Boromo  Rural              792        103     0.42
#> 2 EA_14036 Boucle d… Bale     Yaho    Rural             1147        151     8.77
#> 3 EA_12367 Boucle d… Banwa    Solenzo Rural              968        116     0.92
#> 4 EA_00756 Boucle d… Kossi    Barani  Rural             1121        147     9.28
#> 5 EA_03940 Boucle d… Kossi    Doumba… Rural             1151        156    30.9 
#> 6 EA_02111 Boucle d… Mouhoun  Bondok… Rural              845        129    16.3 
#> # ℹ 9 more variables: accessible <lgl>, dist_road_km <dbl>,
#> #   food_insecurity_pct <dbl>, cost <dbl>, .weight <dbl>, .sample_id <int>,
#> #   .stage <int>, .weight_1 <dbl>, .fpc_1 <int>

Sampling with replacement allows the same unit to be selected multiple times, useful for bootstrap procedures.

sample_wr <- sampling_design(title = "SRS WR") |>
  draw(n = 100, method = "srswr") |>
  execute(bfa_eas, seed = 123)
head(sample_wr)
#> # A tbl_sample: 6 × 18 | SRS WR
#> # Weights:      149 [149, 149]
#>   ea_id    region    province commune urban_rural population households area_km2
#> * <chr>    <fct>     <fct>    <fct>   <fct>            <dbl>      <int>    <dbl>
#> 1 EA_08950 Centre    Kadiogo  Ouagad… Urban             1445        274     0.19
#> 2 EA_08998 Centre    Kadiogo  Ouagad… Urban             1909        362     0.57
#> 3 EA_04259 Hauts-Ba… Houet    Fo      Rural              334         45     7.61
#> 4 EA_10777 Est       Gnagna   Piela   Rural              709         82     8.57
#> 5 EA_07238 Nord      Zondoma  Leba    Rural             1476        146    27.6 
#> 6 EA_09473 Centre    Kadiogo  Ouagad… Urban             1562        296     0.35
#> # ℹ 10 more variables: accessible <lgl>, dist_road_km <dbl>,
#> #   food_insecurity_pct <dbl>, cost <dbl>, .weight <dbl>, .sample_id <int>,
#> #   .stage <int>, .weight_1 <dbl>, .fpc_1 <dbl>, .draw_1 <int>

Using Sampling Fractions

Instead of a fixed sample size, you can specify a sampling fraction with frac.

sample <- sampling_design() |>
  draw(frac = 0.05) |>
  execute(bfa_eas, seed = 1789)
nrow(sample)
#> [1] 745

Bernoulli sampling selects each unit independently with the specified probability. This gives a random sample size, which can simplify field protocols.

sample <- sampling_design() |>
  draw(frac = 0.05, method = "bernoulli") |>
  execute(bfa_eas, seed = 1960)
nrow(sample)
#> [1] 715

Rounding Behavior

When using frac, the computed sample size (N × frac) must be rounded to an integer. By default, samplyr rounds up (ceiling), which matches SAS SURVEYSELECT’s default behavior. Use the round argument to control this:

"up" (default): round up, ensures adequate sample size
"down": round down, minimum sample size
"nearest": Standard mathematical rounding

# Frame of 105 units, frac = 0.1 → 10.5
frame_105 <- head(bfa_eas, 105)

# Default: rounds up to 11
nrow(sampling_design() |> draw(frac = 0.1) |> execute(frame_105, seed = 3))
#> [1] 11

# Round down to 10
nrow(sampling_design() |> draw(frac = 0.1, round = "down") |> execute(frame_105, seed = 3))
#> [1] 10

# Round to nearest (10)
nrow(sampling_design() |> draw(frac = 0.1, round = "nearest") |> execute(frame_105, seed = 3))
#> [1] 10

Stratified Sampling

Stratification partitions the frame into non-overlapping groups (strata) and samples from each. This ensures representation from all subgroups and often improves precision.

Per-Stratum Sample Size

Without an allocation method, n specifies the sample size within each stratum.

sample <- sampling_design() |>
  stratify_by(region) |>
  draw(n = 20) |>
  execute(bfa_eas, seed = 123)

sample |>
  count(region)
#> # A tibble: 13 × 2
#>    region                n
#>    <fct>             <int>
#>  1 Boucle du Mouhoun    20
#>  2 Cascades             20
#>  3 Centre               20
#>  4 Centre-Est           20
#>  5 Centre-Nord          20
#>  6 Centre-Ouest         20
#>  7 Centre-Sud           20
#>  8 Est                  20
#>  9 Hauts-Bassins        20
#> 10 Nord                 20
#> 11 Plateau-Central      20
#> 12 Sahel                20
#> 13 Sud-Ouest            20

Allocation Methods

With an allocation method, n becomes the total sample size to distribute across strata. Allocation methods are defined for total sample-size allocation, so draw(frac = ...) is not supported when alloc is set.

Proportional allocation distributes the sample proportionally to stratum sizes, ensuring the sample mirrors the population structure.

sample <- sampling_design(title = "Proportional Allocation") |>
  stratify_by(region, alloc = "proportional") |>
  draw(n = 300) |>
  execute(bfa_eas, seed = 42)

sample |>
  count(region) |>
  mutate(pct = n / sum(n))
#> # A tibble: 13 × 3
#>    region                n    pct
#>    <fct>             <int>  <dbl>
#>  1 Boucle du Mouhoun    30 0.1   
#>  2 Cascades             14 0.0467
#>  3 Centre               31 0.103 
#>  4 Centre-Est           25 0.0833
#>  5 Centre-Nord          28 0.0933
#>  6 Centre-Ouest         26 0.0867
#>  7 Centre-Sud           12 0.04  
#>  8 Est                  32 0.107 
#>  9 Hauts-Bassins        30 0.1   
#> 10 Nord                 24 0.08  
#> 11 Plateau-Central      15 0.05  
#> 12 Sahel                18 0.06  
#> 13 Sud-Ouest            15 0.05

Equal allocation assigns the same sample size to each stratum, regardless of population size. This maximizes precision for comparisons between strata.

sample <- sampling_design(title = "Equal Allocation") |>
  stratify_by(region, alloc = "equal") |>
  draw(n = 160) |>
  execute(bfa_eas, seed = 2026)

sample |>
  count(region)
#> # A tibble: 13 × 2
#>    region                n
#>    <fct>             <int>
#>  1 Boucle du Mouhoun    13
#>  2 Cascades             13
#>  3 Centre               13
#>  4 Centre-Est           13
#>  5 Centre-Nord          12
#>  6 Centre-Ouest         12
#>  7 Centre-Sud           12
#>  8 Est                  12
#>  9 Hauts-Bassins        12
#> 10 Nord                 12
#> 11 Plateau-Central      12
#> 12 Sahel                12
#> 13 Sud-Ouest            12

Neyman allocation minimizes the variance of the overall estimate by allocating more sample to strata with higher variability. It requires prior information about stratum variances. Use explicit stratification variable names when passing variance.

data(bfa_eas_variance)

bfa_eas_variance
#> # A tibble: 13 × 2
#>    region              var
#>    <fct>             <dbl>
#>  1 Boucle du Mouhoun  99.2
#>  2 Cascades           57.7
#>  3 Centre             32.1
#>  4 Centre-Est         61.1
#>  5 Centre-Nord        97.4
#>  6 Centre-Ouest       33.7
#>  7 Centre-Sud         31.7
#>  8 Est               144. 
#>  9 Hauts-Bassins      60.0
#> 10 Nord              137. 
#> 11 Plateau-Central    59.4
#> 12 Sahel             142. 
#> 13 Sud-Ouest          60.8

sample <- sampling_design(title = "Neyman Allocation") |>
  stratify_by(region, alloc = "neyman", variance = bfa_eas_variance) |>
  draw(n = 300) |>
  execute(bfa_eas, seed = 12345)

sample |>
  count(region)
#> # A tibble: 13 × 2
#>    region                n
#>    <fct>             <int>
#>  1 Boucle du Mouhoun    34
#>  2 Cascades             12
#>  3 Centre               20
#>  4 Centre-Est           23
#>  5 Centre-Nord          31
#>  6 Centre-Ouest         17
#>  7 Centre-Sud            8
#>  8 Est                  44
#>  9 Hauts-Bassins        27
#> 10 Nord                 33
#> 11 Plateau-Central      13
#> 12 Sahel                25
#> 13 Sud-Ouest            13

Optimal allocation extends Neyman allocation by also accounting for differential costs across strata. It minimizes variance for a fixed budget (or cost for fixed precision). Use explicit stratification variable names when passing variance/cost.

data(bfa_eas_cost)

bfa_eas_cost
#> # A tibble: 13 × 2
#>    region             cost
#>    <fct>             <dbl>
#>  1 Boucle du Mouhoun   491
#>  2 Cascades            429
#>  3 Centre              240
#>  4 Centre-Est          407
#>  5 Centre-Nord         449
#>  6 Centre-Ouest        396
#>  7 Centre-Sud          419
#>  8 Est                 624
#>  9 Hauts-Bassins       365
#> 10 Nord                592
#> 11 Plateau-Central     408
#> 12 Sahel               638
#> 13 Sud-Ouest           418

sample <- sampling_design(title = "Optimal Allocation") |>
  stratify_by(region, alloc = "optimal",
              variance = bfa_eas_variance,
              cost = bfa_eas_cost) |>
  draw(n = 300) |>
  execute(bfa_eas, seed = 9876)

sample |>
  count(region)
#> # A tibble: 13 × 2
#>    region                n
#>    <fct>             <int>
#>  1 Boucle du Mouhoun    33
#>  2 Cascades             12
#>  3 Centre               28
#>  4 Centre-Est           24
#>  5 Centre-Nord          31
#>  6 Centre-Ouest         18
#>  7 Centre-Sud            8
#>  8 Est                  38
#>  9 Hauts-Bassins        30
#> 10 Nord                 29
#> 11 Plateau-Central      14
#> 12 Sahel                21
#> 13 Sud-Ouest            14

Power allocation (Bankier 1988) uses \(n_h \propto C_h X_h^q\), where cv supplies \(C_h\), importance supplies \(X_h\), and power is \(q \in [0, 1]\).

cv_df <- data.frame(
  region = levels(bfa_eas$region),
  cv = c(0.40, 0.35, 0.12, 0.20, 0.30, 0.18,
         0.15, 0.38, 0.22, 0.32, 0.17, 0.45, 0.25)
)

importance_df <- data.frame(
  region = levels(bfa_eas$region),
  importance = c(60, 40, 120, 70, 80, 65,
                 50, 55, 90, 75, 45, 35, 30)
)

sample <- sampling_design(title = "Power Allocation") |>
  stratify_by(
    region,
    alloc = "power",
    cv = cv_df,
    importance = importance_df,
    power = 0.5
  ) |>
  draw(n = 300) |>
  execute(bfa_eas, seed = 777)

sample |>
  count(region)
#> # A tibble: 13 × 2
#>    region                n
#>    <fct>             <int>
#>  1 Boucle du Mouhoun    35
#>  2 Cascades             25
#>  3 Centre               15
#>  4 Centre-Est           19
#>  5 Centre-Nord          31
#>  6 Centre-Ouest         16
#>  7 Centre-Sud           12
#>  8 Est                  32
#>  9 Hauts-Bassins        24
#> 10 Nord                 32
#> 11 Plateau-Central      13
#> 12 Sahel                30
#> 13 Sud-Ouest            16

Sample Size Bounds

When using allocation methods, you can constrain stratum sample sizes with min_n and max_n in draw().

Minimum sample size ensures every stratum gets at least the specified number of units. This is essential for variance estimation (requires n ≥ 2 per stratum) or when you need reliable subgroup estimates.

# Neyman allocation with minimum 5 per stratum
sample <- sampling_design() |>
  stratify_by(region, alloc = "neyman", variance = bfa_eas_variance) |>
  draw(n = 300, min_n = 20) |>
  execute(bfa_eas, seed = 1)

sample |>
  count(region)
#> # A tibble: 13 × 2
#>    region                n
#>    <fct>             <int>
#>  1 Boucle du Mouhoun    27
#>  2 Cascades             20
#>  3 Centre               20
#>  4 Centre-Est           21
#>  5 Centre-Nord          26
#>  6 Centre-Ouest         20
#>  7 Centre-Sud           20
#>  8 Est                  33
#>  9 Hauts-Bassins        23
#> 10 Nord                 27
#> 11 Plateau-Central      20
#> 12 Sahel                23
#> 13 Sud-Ouest            20

Maximum sample size caps large strata to prevent them from dominating the sample. This is useful when operational constraints limit how many units you can handle in any one stratum.

# Proportional allocation but cap at 60 per stratum
sample <- sampling_design() |>
  stratify_by(region, alloc = "proportional") |>
  draw(n = 300, max_n = 55) |>
  execute(bfa_eas, seed = 40)

sample |>
  count(region)
#> # A tibble: 13 × 2
#>    region                n
#>    <fct>             <int>
#>  1 Boucle du Mouhoun    30
#>  2 Cascades             14
#>  3 Centre               31
#>  4 Centre-Est           25
#>  5 Centre-Nord          28
#>  6 Centre-Ouest         26
#>  7 Centre-Sud           12
#>  8 Est                  32
#>  9 Hauts-Bassins        30
#> 10 Nord                 24
#> 11 Plateau-Central      15
#> 12 Sahel                18
#> 13 Sud-Ouest            15

Both bounds together create a feasible range for each stratum:

sample <- sampling_design() |>
  stratify_by(region, alloc = "proportional") |>
  draw(n = 300, min_n = 20, max_n = 55) |>
  execute(bfa_eas, seed = 2003)

sample |>
  count(region)
#> # A tibble: 13 × 2
#>    region                n
#>    <fct>             <int>
#>  1 Boucle du Mouhoun    26
#>  2 Cascades             20
#>  3 Centre               27
#>  4 Centre-Est           23
#>  5 Centre-Nord          25
#>  6 Centre-Ouest         23
#>  7 Centre-Sud           20
#>  8 Est                  27
#>  9 Hauts-Bassins        26
#> 10 Nord                 23
#> 11 Plateau-Central      20
#> 12 Sahel                20
#> 13 Sud-Ouest            20

When a stratum’s population is smaller than min_n, the entire stratum is selected (capped at population size).

Custom allocation lets you specify exact sample sizes or rates per stratum by passing a data frame to draw().

sizes_df <- data.frame(
  region = levels(bfa_eas$region),
  n = c(20, 12, 25, 18, 22, 16, 14, 15, 20, 18, 12, 10, 8)
)

sample <- sampling_design() |>
  stratify_by(region) |>
  draw(n = sizes_df) |>
  execute(bfa_eas, seed = 101)

sample |>
  count(region)
#> # A tibble: 13 × 2
#>    region                n
#>    <fct>             <int>
#>  1 Boucle du Mouhoun    20
#>  2 Cascades             12
#>  3 Centre               25
#>  4 Centre-Est           18
#>  5 Centre-Nord          22
#>  6 Centre-Ouest         16
#>  7 Centre-Sud           14
#>  8 Est                  15
#>  9 Hauts-Bassins        20
#> 10 Nord                 18
#> 11 Plateau-Central      12
#> 12 Sahel                10
#> 13 Sud-Ouest             8

Multiple Stratification Variables

You can stratify by multiple variables to create crossed strata. Here we stratify by both region and urban/rural status.

sample <- sampling_design() |>
  stratify_by(region, urban_rural, alloc = "proportional") |>
  draw(n = 300) |>
  execute(bfa_eas, seed = 42)

sample |>
  count(region, urban_rural) |>
  head(10)
#> # A tibble: 10 × 3
#>    region            urban_rural     n
#>    <fct>             <fct>       <int>
#>  1 Boucle du Mouhoun Rural          30
#>  2 Cascades          Rural          13
#>  3 Centre            Rural           2
#>  4 Centre            Urban          29
#>  5 Centre-Est        Rural          23
#>  6 Centre-Est        Urban           3
#>  7 Centre-Nord       Rural          24
#>  8 Centre-Nord       Urban           3
#>  9 Centre-Ouest      Rural          22
#> 10 Centre-Ouest      Urban           4

Cluster Sampling

Cluster sampling selects groups (clusters) rather than individual units. This is practical when a complete list of individuals isn’t available or when travel costs make dispersed sampling expensive.

Use cluster_by() to specify the cluster identifier. All units within selected clusters are included.

sample <- sampling_design() |>
  cluster_by(ea_id) |>
  draw(n = 50) |>
  execute(bfa_eas, seed = 120)

sample |>
  summarise(
    n_clusters = n_distinct(ea_id),
    n_units = n()
  )
#> # A tibble: 1 × 2
#>   n_clusters n_units
#>        <int>   <int>
#> 1         50      50

PPS Sampling

Probability Proportional to Size (PPS) sampling selects units with probability proportional to a size measure. This is standard for cluster sampling where clusters vary in size.

Use the mos argument to specify the measure of size variable. Here we select EAs with probability proportional to their household count.

sample_pps <- sampling_design() |>
  cluster_by(ea_id) |>
  draw(n = 50, method = "pps_brewer", mos = households) |>
  execute(bfa_eas, seed = 365)

sample_pps |>
  summarise(
    mean_hh = mean(households),
    median_hh = median(households)
  )
#> # A tibble: 1 × 2
#>   mean_hh median_hh
#>     <dbl>     <dbl>
#> 1    241.      192.

Compared to simple random sampling, PPS tends to select larger clusters:

sample_srs <- sampling_design() |>
  cluster_by(ea_id) |>
  draw(n = 50) |>
  execute(bfa_eas, seed = 42)

bind_rows(
  sample_pps |> summarise(method = "PPS", mean_hh = mean(households)),
  sample_srs |> summarise(method = "SRS", mean_hh = mean(households))
)
#> # A tibble: 2 × 2
#>   method mean_hh
#>   <chr>    <dbl>
#> 1 PPS       241.
#> 2 SRS       213

PPS Methods

samplyr supports several PPS methods:

Method	Description
`pps_brewer`	Brewer’s method - fast with good properties (recommended)
`pps_systematic`	PPS systematic - simple but can have periodicity issues
`pps_cps`	Conditional Poisson sampling (maximum entropy)
`pps_poisson`	PPS Poisson - random sample size (requires `frac` and supports `prn`)
`pps_sps`	Sequential Poisson sampling (supports `prn`)
`pps_pareto`	Pareto sampling (supports `prn`)
`pps_multinomial`	PPS with replacement
`pps_chromy`	Chromy’s sequential PPS / minimum replacement

Certainty Selection

In PPS sampling, very large units can have expected inclusion probabilities above 1. Certainty selection handles this by selecting those units with probability 1 and sampling the rest normally.

Use certainty_size for an absolute threshold or certainty_prop for a proportional one.

sample_cert <- sampling_design(title = "Certainty selection") |>
  draw(n = 50, method = "pps_brewer", mos = households,
       certainty_size = 800) |>
  execute(bfa_eas, seed = 123)

# How many EAs with more than 800 households
filter(bfa_eas, households > 800)
#> # A tibble: 10 × 12
#>    ea_id    region   province commune urban_rural population households area_km2
#>    <chr>    <fct>    <fct>    <fct>   <fct>            <dbl>      <int>    <dbl>
#>  1 EA_08824 Centre   Kadiogo  Ouagad… Urban             4500        853     1.39
#>  2 EA_01319 Centre-… Boulgou  Bitou   Rural             4559        815     8.31
#>  3 EA_08542 Centre-… Boulgou  Niaogho Rural             3505        865     4.38
#>  4 EA_03315 Centre-… Kourite… Dialga… Rural             3617        807     9.26
#>  5 EA_07018 Centre-… Kourite… Koupela Urban             4640        910     3.62
#>  6 EA_05858 Centre-… Sanmate… Kaya    Urban             4983        809    10.1 
#>  7 EA_13180 Centre-… Boulkie… Thiou   Rural             5273        875    23.8 
#>  8 EA_11234 Centre-… Sanguie  Reo     Urban             5171        886    24.5 
#>  9 EA_00797 Nord     Yatenga  Barga   Urban             5840       1032     6.03
#> 10 EA_02277 Plateau… Ganzour… Boudri  Rural             5273        807    23.0 
#> # ℹ 4 more variables: accessible <lgl>, dist_road_km <dbl>,
#> #   food_insecurity_pct <dbl>, cost <dbl>

# Check which units were selected with certainty
count(sample_cert, .certainty_1)
#> # A tibble: 2 × 2
#>   .certainty_1     n
#>   <lgl>        <int>
#> 1 FALSE           40
#> 2 TRUE            10

With certainty_prop, units whose MOS proportion exceeds the threshold are taken with certainty. The process is iterative: after removing certainty units, proportions are recomputed.

sample_certp <- sampling_design() |>
  stratify_by(region) |>
  draw(n = 50, method = "pps_systematic", mos = households,
       certainty_prop = 0.05) |>
  execute(bfa_eas, seed = 12345)

# How many EAs with more than 5% of the total hh by region
bfa_eas |>
  mutate(mosprop = households/sum(households),
         .by = region) |>
  filter(mosprop >= 0.05)
#> # A tibble: 0 × 13
#> # ℹ 13 variables: ea_id <chr>, region <fct>, province <fct>, commune <fct>,
#> #   urban_rural <fct>, population <dbl>, households <int>, area_km2 <dbl>,
#> #   accessible <lgl>, dist_road_km <dbl>, food_insecurity_pct <dbl>,
#> #   cost <dbl>, mosprop <dbl>

filter(sample_certp, .certainty_1)
#> Warning in min(w): no non-missing arguments to min; returning Inf
#> Warning in max(w): no non-missing arguments to max; returning -Inf
#> # A tbl_sample: 0 × 18
#> # Weights:      NaN [Inf, -Inf]
#> # ℹ 18 variables: ea_id <chr>, region <fct>, province <fct>, commune <fct>,
#> #   urban_rural <fct>, population <dbl>, households <int>, area_km2 <dbl>,
#> #   accessible <lgl>, dist_road_km <dbl>, food_insecurity_pct <dbl>,
#> #   cost <dbl>, .weight <dbl>, .sample_id <int>, .stage <int>, .weight_1 <dbl>,
#> #   .fpc_1 <int>, .certainty_1 <lgl>

Permanent Random Numbers (PRN)

Some PPS methods support permanent random numbers for sample coordination across survey waves. Assign a stable U(0,1) value to each frame unit, then pass it via prn:

set.seed(1)
bfa_eas$prn <- runif(nrow(bfa_eas))

sample_prn <- sampling_design(title = "Samples coordination with PRN") |>
  cluster_by(ea_id) |>
  draw(n = 50, method = "pps_sps", mos = households, prn = prn) |>
  execute(bfa_eas, seed = 1)
sample_prn
#> # A tbl_sample: 50 × 19 | Samples coordination with PRN
#> # Weights:      301.91 [121.15, 635.71]
#>    ea_id    region   province commune urban_rural population households area_km2
#>  * <chr>    <fct>    <fct>    <fct>   <fct>            <dbl>      <int>    <dbl>
#>  1 EA_12873 Boucle … Banwa    Tansila Rural             1164        148    39.1 
#>  2 EA_03170 Boucle … Mouhoun  Dedoug… Rural             1209        215     3.81
#>  3 EA_12909 Boucle … Mouhoun  Tcheri… Rural             1252        150    43.2 
#>  4 EA_05765 Boucle … Sourou   Kassoum Rural             1092        148    25.1 
#>  5 EA_12525 Cascades Comoe    Soubak… Rural             1224        173    19.1 
#>  6 EA_13358 Cascades Comoe    Tiefora Rural             1345        161    25.3 
#>  7 EA_13387 Cascades Comoe    Tiefora Rural              931        112    22.0 
#>  8 EA_08816 Centre   Kadiogo  Ouagad… Urban             1745        331     0.22
#>  9 EA_09321 Centre   Kadiogo  Ouagad… Urban             1936        367     0.26
#> 10 EA_09370 Centre   Kadiogo  Ouagad… Urban             2686        509     0.41
#> # ℹ 40 more rows
#> # ℹ 11 more variables: accessible <lgl>, dist_road_km <dbl>,
#> #   food_insecurity_pct <dbl>, cost <dbl>, prn <dbl>, .weight <dbl>,
#> #   .sample_id <int>, .stage <int>, .weight_1 <dbl>, .fpc_1 <int>,
#> #   .certainty_1 <lgl>

Because the PRN values are stable, re-executing the same design on the same frame produces highly overlapping samples (positive coordination). PRN is supported for bernoulli, pps_poisson, pps_sps, and pps_pareto. You can learn more about coordination with permanent random numbers in the following vignette vignette("sampling-coordination").

Control Sorting and Serpentine Ordering

Control sorting orders the frame before selection, providing implicit stratification. This is effective with systematic and sequential methods, where it ensures the sample spreads evenly across the sorted variables.

# Nested sorting: standard ascending order
sample_sort <- sampling_design() |>
  draw(n = 100, method = "systematic",
       control = c(region, province)) |>
  execute(bfa_eas, seed = 98765)
head(sample_sort)
#> # A tbl_sample: 6 × 18
#> # Weights:      149 [149, 149]
#>   ea_id    region    province commune urban_rural population households area_km2
#> * <chr>    <fct>     <fct>    <fct>   <fct>            <dbl>      <int>    <dbl>
#> 1 EA_04231 Boucle d… Bale     Fara    Rural             1722        275    21.6 
#> 2 EA_06858 Boucle d… Banwa    Kouka   Rural             1446        166    19.5 
#> 3 EA_12421 Boucle d… Banwa    Solenzo Rural              579         69     8.59
#> 4 EA_02526 Boucle d… Kossi    Bouras… Rural             1460        165     2.38
#> 5 EA_08637 Boucle d… Kossi    Nouna   Rural             1261        164     1.4 
#> 6 EA_03146 Boucle d… Mouhoun  Dedoug… Rural             1266        225    15.5 
#> # ℹ 10 more variables: accessible <lgl>, dist_road_km <dbl>,
#> #   food_insecurity_pct <dbl>, cost <dbl>, prn <dbl>, .weight <dbl>,
#> #   .sample_id <int>, .stage <int>, .weight_1 <dbl>, .fpc_1 <int>

serp() implements serpentine (snake) sorting, which alternates direction at each level of the hierarchy. This minimizes jumps at boundaries, matching SAS PROC SURVEYSELECT’s SORT=SERP.

# Serpentine sorting
sample_serp <- sampling_design() |>
  draw(n = 100, method = "systematic",
       control = serp(region, province)) |>
  execute(bfa_eas, seed = 1)
head(sample_serp)
#> # A tbl_sample: 6 × 18
#> # Weights:      149 [149, 149]
#>   ea_id    region    province commune urban_rural population households area_km2
#> * <chr>    <fct>     <fct>    <fct>   <fct>            <dbl>      <int>    <dbl>
#> 1 EA_00461 Boucle d… Bale     Bana    Rural             1348        176      7.3
#> 2 EA_11103 Boucle d… Bale     Poura   Urban             1660        225     19.0
#> 3 EA_12339 Boucle d… Banwa    Solenzo Rural             1077        129     20.5
#> 4 EA_12897 Boucle d… Banwa    Tansila Rural             1205        154     26.6
#> 5 EA_03729 Boucle d… Kossi    Dokui   Rural             1476        236     18.2
#> 6 EA_12514 Boucle d… Kossi    Sono    Rural             1296        187     23.6
#> # ℹ 10 more variables: accessible <lgl>, dist_road_km <dbl>,
#> #   food_insecurity_pct <dbl>, cost <dbl>, prn <dbl>, .weight <dbl>,
#> #   .sample_id <int>, .stage <int>, .weight_1 <dbl>, .fpc_1 <int>

Control sorting can be combined with explicit stratification. The sorting is then applied within each stratum.

sample_serp2 <- sampling_design() |>
  stratify_by(urban_rural) |>
  draw(n = 100, method = "systematic",
       control = serp(region, province)) |>
  execute(bfa_eas, seed = 2)
head(sample_serp2)
#> # A tbl_sample: 6 × 18
#> # Weights:      122.51 [122.51, 122.51]
#>   ea_id    region    province commune urban_rural population households area_km2
#> * <chr>    <fct>     <fct>    <fct>   <fct>            <dbl>      <int>    <dbl>
#> 1 EA_00267 Boucle d… Bale     Bagassi Rural             1319        159    17.1 
#> 2 EA_10342 Boucle d… Bale     Ouri    Rural             1522        182     2.25
#> 3 EA_06864 Boucle d… Banwa    Kouka   Rural              990        114    19.1 
#> 4 EA_12401 Boucle d… Banwa    Solenzo Rural             1062        127    29.3 
#> 5 EA_00763 Boucle d… Kossi    Barani  Rural             1638        215    11.5 
#> 6 EA_03738 Boucle d… Kossi    Dokui   Rural             1314        210    22.0 
#> # ℹ 10 more variables: accessible <lgl>, dist_road_km <dbl>,
#> #   food_insecurity_pct <dbl>, cost <dbl>, prn <dbl>, .weight <dbl>,
#> #   .sample_id <int>, .stage <int>, .weight_1 <dbl>, .fpc_1 <int>

Balanced Sampling

Balanced sampling selects a sample whose Horvitz-Thompson estimates of auxiliary totals match (or nearly match) the population totals. This improves precision for any variable correlated with the auxiliary variables. samplyr implements the cube method (Deville & Tille 2004) via method = "balanced".

Specify auxiliary variables with aux. Without mos, inclusion probabilities are equal; with mos, they are proportional to size:

# Equal-probability balanced sample
balanced_eq <- sampling_design() |>
  draw(n = 100, method = "balanced",
       aux = c(population, households)) |>
  execute(bfa_eas, seed = 42)
balanced_eq
#> # A tbl_sample: 100 × 18
#> # Weights:      149 [149, 149]
#>    ea_id    region   province commune urban_rural population households area_km2
#>  * <chr>    <fct>    <fct>    <fct>   <fct>            <dbl>      <int>    <dbl>
#>  1 EA_04217 Boucle … Bale     Fara    Rural              682        109     8.85
#>  2 EA_14036 Boucle … Bale     Yaho    Rural             1147        151     8.77
#>  3 EA_12381 Boucle … Banwa    Solenzo Rural             3444        411     3.48
#>  4 EA_03598 Boucle … Kossi    Djibas… Rural             1039        123     9.6 
#>  5 EA_10134 Boucle … Mouhoun  Ouarko… Rural              943        131    19.3 
#>  6 EA_04520 Boucle … Nayala   Gassan  Rural             1344        164    70.2 
#>  7 EA_13843 Boucle … Sourou   Tougan  Rural             1973        273   100.  
#>  8 EA_10283 Cascades Comoe    Ouo     Rural             1221        160    74.6 
#>  9 EA_12524 Cascades Comoe    Soubak… Rural             1532        216    17.4 
#> 10 EA_13400 Cascades Comoe    Tiefora Rural             1275        153    14.0 
#> # ℹ 90 more rows
#> # ℹ 10 more variables: accessible <lgl>, dist_road_km <dbl>,
#> #   food_insecurity_pct <dbl>, cost <dbl>, prn <dbl>, .weight <dbl>,
#> #   .sample_id <int>, .stage <int>, .weight_1 <dbl>, .fpc_1 <int>

The balancing property means the HT estimate of auxiliary totals is close to the true total:

pop_total <- sum(bfa_eas$population)
ht_total <- sum(balanced_eq$population * balanced_eq$.weight)
c(population = pop_total, ht_estimate = ht_total)
#>  population ht_estimate 
#>    20436039    20546504

Balanced sampling combines naturally with stratification. When stratified, samplyr calls the stratified cube algorithm (Chauvet 2009) in a single pass, preserving the allocation computed by stratify_by():

# Stratified PPS balanced sample
balanced_strat <- sampling_design() |>
  stratify_by(region, alloc = "proportional") |>
  draw(n = 300, method = "balanced", mos = households,
       aux = c(population, area_km2)) |>
  execute(bfa_eas, seed = 1960)
balanced_strat
#> # A tbl_sample: 300 × 18
#> # Weights:      50.6 [14.67, 228.32]
#>    ea_id    region   province commune urban_rural population households area_km2
#>  * <chr>    <fct>    <fct>    <fct>   <fct>            <dbl>      <int>    <dbl>
#>  1 EA_02177 Boucle … Bale     Boromo  Rural             3387        442    42.2 
#>  2 EA_10382 Boucle … Bale     Pa      Rural              825        116     0.66
#>  3 EA_11051 Boucle … Bale     Pompoi  Rural             1021        161    35.6 
#>  4 EA_11060 Boucle … Bale     Pompoi  Rural              802        126     0.72
#>  5 EA_00369 Boucle … Banwa    Balave  Rural             2049        323    36.5 
#>  6 EA_12378 Boucle … Banwa    Solenzo Rural             1296        155    33.9 
#>  7 EA_12398 Boucle … Banwa    Solenzo Rural             1813        216    34.9 
#>  8 EA_12405 Boucle … Banwa    Solenzo Rural             1463        175    19.2 
#>  9 EA_12438 Boucle … Banwa    Solenzo Rural             1394        166    33.8 
#> 10 EA_02076 Boucle … Kossi    Bombor… Rural             1013        133     8.85
#> # ℹ 290 more rows
#> # ℹ 10 more variables: accessible <lgl>, dist_road_km <dbl>,
#> #   food_insecurity_pct <dbl>, cost <dbl>, prn <dbl>, .weight <dbl>,
#> #   .sample_id <int>, .stage <int>, .weight_1 <dbl>, .fpc_1 <dbl>

When used with cluster_by(), auxiliary values are automatically aggregated (summed) to the cluster level before selection.

Balanced sampling supports up to 2 stages. At stage 1, element-level auxiliary variables are aggregated to PSU totals; at stage 2, the cube method runs on the element-level frame within each selected PSU.

Multi-Stage Sampling

Multi-stage designs sample in stages: first select primary sampling units (PSUs), then sample within them. This is the standard approach for large-scale surveys when a complete frame of ultimate units isn’t available upfront.

Use add_stage() to delimit each stage of the design.

Two-Stage Design

For this example, we’ll use zwe_eas to demonstrate a two-stage design: first select districts as PSUs, then sample EAs within selected districts.

data(zwe_eas)

# First, let's see the structure: EAs nested within districts
zwe_eas |>
  count(province, district) |>
  head(10)
#> # A tibble: 10 × 3
#>    province   district           n
#>    <fct>      <fct>          <int>
#>  1 Bulawayo   Bulawayo         453
#>  2 Harare     Chitungwiza      161
#>  3 Harare     Epworth          125
#>  4 Harare     Harare           969
#>  5 Harare     Harare Rural     224
#>  6 Manicaland Buhera           556
#>  7 Manicaland Chimanimani      206
#>  8 Manicaland Chipinge         489
#>  9 Manicaland Chipinge Urban    22
#> 10 Manicaland Makoni           668

We need to create a measure of size for districts. Here we’ll use total households per district.

# Add district-level households as MOS
zwe_frame <- zwe_eas |>
  mutate(district_hh = sum(households),
         .by = district)

Now we can run a two-stage design: select 10 districts with PPS, then sample 5 EAs within each.

twostage_design <- sampling_design(title = "Two-stage cluster sampling") |>
  add_stage(label = "Districts") |>
    cluster_by(district) |>
    draw(n = 10, method = "pps_brewer", mos = district_hh) |>
  add_stage(label = "EAs") |>
  draw(n = 5)
twostage_design
#> ── Sampling Design: Two-stage cluster sampling ─────────────────────────────────
#> 
#> ℹ 2 stages
#> 
#> ── Stage 1: Districts ──────────────────────────────────────────────────────────
#> • Cluster: district
#> • Draw: n = 10, method = pps_brewer, mos = district_hh
#> 
#> ── Stage 2: EAs ────────────────────────────────────────────────────────────────
#> • Draw: n = 5, method = srswor

Let’s apply it to the frame.

sample_eas <- execute(twostage_design,
                      zwe_frame, seed = 3)

sample_eas |>
  summarise(n_districts = n_distinct(district),
            n_eas = n())
#> # A tibble: 1 × 2
#>   n_districts n_eas
#>         <int> <int>
#> 1          10    50

Stratified Multi-Stage Design

Combining stratification with multi-stage sampling is standard for national surveys. Here we stratify by province (which is constant within districts), then select districts and EAs within each stratum.

sample_eas2 <- sampling_design(title = "Stratified two-stage cluster sampling") |>
  add_stage(label = "Districts") |>
    stratify_by(province) |>
    cluster_by(district) |>
    draw(n = 2, method = "pps_brewer", mos = district_hh) |>
  add_stage(label = "EAs") |>
    draw(n = 3) |>
  execute(zwe_frame, seed = 4)
#> Warning: Sample size capped to population in 1 stratum/strata: "Bulawayo".
#> ℹ Requested total: 20. Actual total: 19.

sample_eas2 |>
  count(province, district)
#> # A tibble: 19 × 3
#>    province            district         n
#>    <fct>               <fct>        <int>
#>  1 Bulawayo            Bulawayo         3
#>  2 Harare              Harare           3
#>  3 Harare              Harare Rural     3
#>  4 Manicaland          Buhera           3
#>  5 Manicaland          Makoni           3
#>  6 Mashonaland Central Guruve           3
#>  7 Mashonaland Central Mount Darwin     3
#>  8 Mashonaland East    Goromonzi        3
#>  9 Mashonaland East    Murehwa          3
#> 10 Mashonaland West    Sanyati          3
#> 11 Mashonaland West    Zvimba           3
#> 12 Masvingo            Bikita           3
#> 13 Masvingo            Masvingo         3
#> 14 Matabeleland North  Binga            3
#> 15 Matabeleland North  Bubi             3
#> 16 Matabeleland South  Gwanda           3
#> 17 Matabeleland South  Umzingwane       3
#> 18 Midlands            Gokwe South      3
#> 19 Midlands            Zvishavane       3

Operational Multi-Stage Sampling

In practice, multi-stage surveys often execute stages at different times. After selecting PSUs, field teams may need to conduct listing or other operations before the second stage can proceed.

The stages argument lets you execute only specific stages.

design <- sampling_design() |>
  add_stage(label = "Districts") |>
    stratify_by(province) |>
    cluster_by(district) |>
    draw(n = 2, method = "pps_brewer", mos = district_hh) |>
  add_stage(label = "EAs") |>
    draw(n = 4)

# Execute stage 1 only
selected_districts <- execute(design, zwe_frame, stages = 1, seed = 42)
#> Warning: Sample size capped to population in 1 stratum/strata: "Bulawayo".
#> ℹ Requested total: 20. Actual total: 19.

selected_districts |>
  distinct(province, district)
#> # A tibble: 19 × 2
#>    province            district     
#>    <fct>               <fct>        
#>  1 Bulawayo            Bulawayo     
#>  2 Harare              Harare       
#>  3 Harare              Harare Rural 
#>  4 Manicaland          Chipinge     
#>  5 Manicaland          Nyanga       
#>  6 Mashonaland Central Mazowe       
#>  7 Mashonaland Central Mount Darwin 
#>  8 Mashonaland East    Mudzi        
#>  9 Mashonaland East    Murehwa      
#> 10 Mashonaland West    Chinhoyi     
#> 11 Mashonaland West    Kariba       
#> 12 Masvingo            Chivi        
#> 13 Masvingo            Masvingo     
#> 14 Matabeleland North  Tsholotsho   
#> 15 Matabeleland North  Umguza       
#> 16 Matabeleland South  Bulilima     
#> 17 Matabeleland South  Gwanda       
#> 18 Midlands            Shurugwi Town
#> 19 Midlands            Zvishavane

After fieldwork or preparation is complete, execute the remaining stages.

final_sample <- selected_districts |>
  execute(zwe_frame, seed = 43)

final_sample |>
  count(province, district)
#> # A tibble: 19 × 3
#>    province            district          n
#>    <fct>               <fct>         <int>
#>  1 Bulawayo            Bulawayo          4
#>  2 Harare              Harare            4
#>  3 Harare              Harare Rural      4
#>  4 Manicaland          Chipinge          4
#>  5 Manicaland          Nyanga            4
#>  6 Mashonaland Central Mazowe            4
#>  7 Mashonaland Central Mount Darwin      4
#>  8 Mashonaland East    Mudzi             4
#>  9 Mashonaland East    Murehwa           4
#> 10 Mashonaland West    Chinhoyi          4
#> 11 Mashonaland West    Kariba            4
#> 12 Masvingo            Chivi             4
#> 13 Masvingo            Masvingo          4
#> 14 Matabeleland North  Tsholotsho        4
#> 15 Matabeleland North  Umguza            4
#> 16 Matabeleland South  Bulilima          4
#> 17 Matabeleland South  Gwanda            4
#> 18 Midlands            Shurugwi Town     4
#> 19 Midlands            Zvishavane        4

Working with Results

Accessing the Design

The sampling design is attached to the result and can be retrieved with get_design().

sample <- sampling_design() |>
  stratify_by(region, alloc = "proportional") |>
  draw(n = 200) |>
  execute(bfa_eas, seed = 2024)

design <- get_design(sample)
design
#> ── Sampling Design ─────────────────────────────────────────────────────────────
#> 
#> ℹ 1 stage
#> 
#> ── Stage 1 ─────────────────────────────────────────────────────────────────────
#> • Strata: region (proportional)
#> • Draw: n = 200, method = srswor

and accessing element of the design

design$stages[[1]]$strata$vars
#> [1] "region"

Weight Diagnostics

The summary() method reports weight diagnostics including Kish’s effective sample size and design effect due to unequal weighting. For the full design effect (including clustering and stratification), use survey::svymean() with deff = TRUE after converting with as_svydesign().

summary(sample)
#> ── Sample Summary ──────────────────────────────────────────────────────────────
#> 
#> ℹ n = 200 | stages = 1/1 | seed = 2024
#> 
#> ── Design: Stage 1 ─────────────────────────────────────────────────────────────
#> • Strata: region (proportional)
#> • Method: srswor
#> 
#> ── Allocation: Stage 1 ─────────────────────────────────────────────────────────
#>   region             N_h    n_h  f_h   
#>   Boucle du Mouhoun  1483   20   0.0135
#>   Cascades           667    9    0.0135
#>   Centre             1556   21   0.0135
#>   Centre-Est         1259   17   0.0135
#>   Centre-Nord        1375   19   0.0138
#>   Centre-Ouest       1287   17   0.0132
#>   Centre-Sud         608    8    0.0132
#>   Est                1590   21   0.0132
#>   Hauts-Bassins      1483   20   0.0135
#>   Nord               1211   16   0.0132
#>   Plateau-Central    757    10   0.0132
#>   Sahel              902    12   0.0133
#>   Sud-Ouest          722    10   0.0139
#>                      ─────  ───  ──────
#>   Total              14900  200  0.0134
#> 
#> ── Weights ─────────────────────────────────────────────────────────────────────
#> • Range: [72.2, 76]
#> • Mean:  74.5 · CV: 0.02
#> • DEFF:  1 · n_eff: 200

Method Reference

Selection Methods

Method	Sample Size	Replacement	Notes
`srswor`	Fixed	No	Default, general purpose
`srswr`	Fixed	Yes	Bootstrap, rare populations
`systematic`	Fixed	No	Ordered frames
`bernoulli`	Random	No	Simple field protocols
`pps_brewer`	Fixed	No	Recommended PPS method
`pps_systematic`	Fixed	No	Simple PPS
`pps_cps`	Fixed	No	Maximum entropy, exact joint inclusion probs
`pps_poisson`	Random	No	PPS with random size
`pps_sps`	Fixed	No	Sequential Poisson, supports PRN
`pps_pareto`	Fixed	No	Pareto sampling, supports PRN
`pps_multinomial`	Fixed	Yes	PPS with replacement
`pps_chromy`	Fixed	PMR.¹	Chromy’s sequential PPS
`balanced`	Fixed	No	Cube method, balances on `aux` variables

Allocation Methods

Method	Description	Requirements
(none)	n per stratum
`equal`	Same n per stratum
`proportional`	n ∝ stratum size
`neyman`	Minimize variance	`variance` data frame
`optimal`	Minimize variance/cost	`variance` + `cost` data frames
`power`	Compromise: \(n_h \propto C_h X_h^q\)	`cv` + `importance` (+ optional `power`)

For custom stratum-specific sizes or rates, pass a data frame directly to n or frac in draw(). When alloc is specified, use n (not frac).

Sample Size Bounds

Argument	Description
`min_n`	Minimum sample size per stratum (for variance estimation, set ≥ 2)
`max_n`	Maximum sample size per stratum (to cap dominant strata)

These only apply when using allocation methods (equal, proportional, neyman, optimal, power).

Rounding Methods

Value	Description
`"up"`	Round up (ceiling), default, matches SAS
`"down"`	Round down (floor)
`"nearest"`	Standard mathematical rounding

These apply when using frac to specify sampling rates. When computed sample size < 1, it is always rounded up to 1.

Best Practices

Always set a seed for reproducibility
Use meaningful stage labels for documentation
Validate designs with validate_frame() before execution
Check weight distributions with summary() on the sample

design <- sampling_design(title = "National Health Survey 2024") |>
  add_stage(label = "Enumeration Areas") |>
    stratify_by(region, urban_rural) |>
    cluster_by(ea_id) |>
    draw(n = 5, method = "pps_brewer", mos = households) |>
  add_stage(label = "Households") |>
    draw(n = 20)

print(design)
#> ── Sampling Design: National Health Survey 2024 ────────────────────────────────
#> 
#> ℹ 2 stages
#> 
#> ── Stage 1: Enumeration Areas ──────────────────────────────────────────────────
#> • Strata: region, urban_rural
#> • Cluster: ea_id
#> • Draw: n = 5, method = pps_brewer, mos = households
#> 
#> ── Stage 2: Households ─────────────────────────────────────────────────────────
#> • Draw: n = 20, method = srswor

Frame Validation

Use validate_frame() to check that a frame has all the variables a design requires before executing.

# This passes
validate_frame(design, bfa_eas)

# This fails with a clear message
bad_frame <- data.frame(id = 1:100, value = rnorm(100))
validate_frame(design, bad_frame)
#> Error:
#> ! Frame validation failed:
#> ✖ Enumeration Areas: missing stratification variable: "region" and
#>   "urban_rural"
#> ✖ Enumeration Areas: missing cluster variable: "ea_id"
#> ✖ Enumeration Areas: missing MOS variable: `households`

Reference

Bankier, Michael D. 1988. “Power Allocations: Determining Sample Sizes for Subnational Areas.” The American Statistician 42 (3): 174–77.

Lohr, Sharon L. 2022. Sampling: Design and Analysis. 3rd ed. CRC Press.