Skip to contents

Overview

samplyr provides a tidy grammar for specifying and executing survey sampling designs. The package is built around a minimal set of composable verbs that handle stratified, clustered, and multi-stage sampling.

For the statistical semantics and assumptions underlying each operation, see vignette("design-semantics").

The core idea is that sampling code should read like its English description:

library(samplyr)
library(dplyr)

# "Stratify by region, proportionally allocate 500 samples, execute"
sampling_design(title = "My sampling design") |>
  stratify_by(region, alloc = "proportional") |>
  draw(n = 500) |>
  execute(frame, seed = 1)

To see how this works in practice, consider a real survey design from Lohr (2022) (Example 7.1), based on a 1991 study of bed net use in rural Gambia (D’Alessandro et al., 1994):

Malaria morbidity can be reduced by using bed nets impregnated with insecticide, but this is only effective if the bed nets are in widespread use. In 1991, a nationwide survey was designed to estimate the prevalence of bed net use in rural areas of the Gambia (D’Alessandro et al., 1994).

The sampling frame consisted of all rural villages of fewer than 3,000 people. The villages were stratified by three geographic regions (eastern, central, and western) and by whether the village had a public health clinic (PHC) or not. In each region five districts were chosen with probability proportional to the district population. In each district four villages were chosen, again with probability proportional to census population: two PHC villages and two non-PHC villages. Finally, six compounds were chosen more or less randomly from each village.

In samplyr, this three-stage stratified cluster design translates directly into code:

design <- sampling_design(title = "Gambia bed nets") |>
  add_stage() |>
    stratify_by(region) |>
    cluster_by(district) |>
    draw(n = 5, method = "pps_brewer", mos = population) |>
  add_stage() |>
    stratify_by(phc) |>
    cluster_by(village) |>
    draw(n = 2, method = "pps_brewer", mos = population) |>
  add_stage() |>
    draw(n = 6)
design
#> ── Sampling Design: Gambia bed nets ────────────────────────────────────────────
#> 
#> ℹ 3 stages
#> 
#> ── Stage 1 ─────────────────────────────────────────────────────────────────────
#> • Strata: region
#> • Cluster: district
#> • Draw: n = 5, method = pps_brewer, mos = population
#> 
#> ── Stage 2 ─────────────────────────────────────────────────────────────────────
#> • Strata: phc
#> • Cluster: village
#> • Draw: n = 2, method = pps_brewer, mos = population
#> 
#> ── Stage 3 ─────────────────────────────────────────────────────────────────────
#> • Draw: n = 6, method = srswor

And you can then use execute to get your sample.

sampling_design(title = "Gambia bed net") |>
  add_stage() |>
    stratify_by(region) |>
    cluster_by(district) |>
    draw(n = 5, method = "pps_brewer", mos = population) |>
  add_stage() |>
    stratify_by(phc) |>
    cluster_by(village) |>
    draw(n = 2, method = "pps_brewer", mos = population) |>
  add_stage() |>
    draw(n = 6) |>
  execute(frame, seed = 1991)

# or
execute(design, frame, seed = 1991)

The samplyr code mirrors the verbal description verb for verb.

The Grammar

samplyr uses 5 verbs and 1 modifier:

Function Purpose
sampling_design() Create a new sampling design
stratify_by() Define stratification variables and allocation
cluster_by() Define cluster/PSU variable
draw() Specify sample size, method
execute() Run the design on a frame
add_stage() Delimit stages in multi-stage designs

Deferred Column Resolution

stratify_by() and cluster_by() accept bare column names and store them in the design object. Column names are validated only when a frame is available (execute(), as_svydesign()), which keeps design specification separate from execution.

design <- sampling_design(title = "Stratified cluster sampling") |>
  stratify_by(region, alloc = "proportional") |>
  cluster_by(ea_id) |>
  draw(n = 200)

sample <- execute(design, bfa_eas, seed = 101)

Example Data

We’ll use the bfa_eas dataset throughout this vignette. It is an enumeration area frame for Burkina Faso, with ~14,934 EAs across 13 regions.

data(bfa_eas)

bfa_eas |>
  glimpse()
#> Rows: 44,570
#> Columns: 12
#> $ ea_id               <int> 11759, 11760, 11761, 11762, 11763, 11764, 11765, 1…
#> $ region              <fct> Boucle du Mouhoun, Boucle du Mouhoun, Boucle du Mo…
#> $ province            <fct> Bale, Bale, Bale, Bale, Bale, Bale, Bale, Bale, Ba…
#> $ commune             <fct> Bagassi, Bagassi, Bagassi, Bagassi, Bagassi, Bagas…
#> $ urban_rural         <fct> Rural, Rural, Rural, Rural, Rural, Rural, Rural, R…
#> $ population          <dbl> 56, 204, 63, 257, 48, 139, 160, 2184, 947, 28, 326…
#> $ households          <int> 7, 25, 8, 31, 6, 17, 19, 262, 114, 3, 39, 44, 28, 
#> $ area_km2            <dbl> 9.21, 8.75, 8.54, 8.92, 4.89, 8.51, 8.46, 2.57, 1.…
#> $ accessible          <lgl> TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, T…
#> $ dist_road_km        <dbl> 19.7, 41.0, 13.9, 39.3, 28.8, 100.1, 34.9, 24.8, 6…
#> $ food_insecurity_pct <dbl> 1.8, 1.3, 1.9, 5.0, 0.4, 1.6, 1.6, 1.6, 0.2, 2.8, 
#> $ cost                <dbl> 272, 739, 216, 771, 744, 436, 733, 728, 316, 622, 

Simple Random Sampling

The most basic design selects n units at random from the frame.

design <- sampling_design(title = "Simple Random Sampling") |>
  draw(n = 100)
design
#> ── Sampling Design: Simple Random Sampling ─────────────────────────────────────
#> 
#> ℹ 1 stage
#> 
#> ── Stage 1 ─────────────────────────────────────────────────────────────────────
#> • Draw: n = 100, method = srswor

The sampling_design object stores design metadata and can be reused.

sample <- execute(design, bfa_eas, seed = 24)
nrow(sample)
#> [1] 100

The result includes the original columns plus sampling metadata: .weight (sampling weight), .weight_1, .weight_2, etc. (per-stage weights). For with-replacement methods, .draw_1, .draw_2, etc. each independent draw.

sample |>
  select(ea_id, region, urban_rural, .weight, .weight_1) |>
  head()
#> # A tbl_sample: 6 × 5 | Simple Random Sampling
#> # Weights:      445.7 [445.7, 445.7]
#>   ea_id region            urban_rural .weight .weight_1
#> * <int> <fct>             <fct>         <dbl>     <dbl>
#> 1  5866 Centre-Nord       Urban          446.      446.
#> 2 24905 Nord              Rural          446.      446.
#> 3  5145 Centre-Nord       Urban          446.      446.
#> 4 32861 Est               Rural          446.      446.
#> 5  4797 Sud-Ouest         Rural          446.      446.
#> 6 25986 Boucle du Mouhoun Rural          446.      446.

Selection Methods

The method argument controls how units are selected. By default, samplyr uses simple random sampling without replacement (srswor).

Systematic sampling selects units at fixed intervals, which can improve precision when the frame is ordered.

sample_sys <- sampling_design(title = "Systematic") |>
  draw(n = 100, method = "systematic") |>
  execute(bfa_eas, seed = 2021)
head(sample_sys)
#> # A tbl_sample: 6 × 17 | Systematic
#> # Weights:      445.7 [445.7, 445.7]
#>   ea_id region       province commune urban_rural population households area_km2
#> * <int> <fct>        <fct>    <fct>   <fct>            <dbl>      <int>    <dbl>
#> 1  1182 Boucle du M… Bale     Boromo  Rural              177         23     0.69
#> 2 11703 Boucle du M… Bale     Yaho    Rural              402         60     9.18
#> 3  9724 Boucle du M… Banwa    Sanaba  Rural               51          6     8.02
#> 4 11018 Boucle du M… Banwa    Tansila Rural              117         14     8.6 
#> 5 12763 Boucle du M… Kossi    Djibas… Rural             1238        158     0.95
#> 6 33147 Boucle du M… Kossi    Nouna   Rural              468         69    13.2 
#> # ℹ 9 more variables: accessible <lgl>, dist_road_km <dbl>,
#> #   food_insecurity_pct <dbl>, cost <dbl>, .weight <dbl>, .sample_id <int>,
#> #   .stage <int>, .weight_1 <dbl>, .fpc_1 <int>

Sampling with replacement allows the same unit to be selected multiple times, useful for bootstrap procedures.

sample_wr <- sampling_design(title = "SRS WR") |>
  draw(n = 100, method = "srswr") |>
  execute(bfa_eas, seed = 123)
head(sample_wr)
#> # A tbl_sample: 6 × 18 | SRS WR
#> # Weights:      445.7 [445.7, 445.7]
#>   ea_id region       province commune urban_rural population households area_km2
#> * <int> <fct>        <fct>    <fct>   <fct>            <dbl>      <int>    <dbl>
#> 1 26077 Boucle du M… Mouhoun  Dedoug… Rural              176         31     8.96
#> 2 41311 Hauts-Bassi… Houet    Bobo-D… Urban              844        135     0.11
#> 3 41094 Hauts-Bassi… Houet    Bobo-D… Urban             1121        180     0.79
#> 4 41555 Plateau-Cen… Oubrite… Dapeol… Rural              418         54     3.19
#> 5 21009 Boucle du M… Mouhoun  Bondok… Rural              556         85     9.16
#> 6 11914 Sahel        Seno     Bani    Rural              419         90     7.95
#> # ℹ 10 more variables: accessible <lgl>, dist_road_km <dbl>,
#> #   food_insecurity_pct <dbl>, cost <dbl>, .weight <dbl>, .sample_id <int>,
#> #   .stage <int>, .weight_1 <dbl>, .fpc_1 <dbl>, .draw_1 <int>

Using Sampling Fractions

Instead of a fixed sample size, you can specify a sampling fraction with frac.

sample <- sampling_design() |>
  draw(frac = 0.05) |>
  execute(bfa_eas, seed = 1789)
nrow(sample)
#> [1] 2229

Bernoulli sampling selects each unit independently with the specified probability. This gives a random sample size, which can simplify field protocols.

sample <- sampling_design() |>
  draw(frac = 0.05, method = "bernoulli") |>
  execute(bfa_eas, seed = 1960)
nrow(sample)
#> [1] 2247

Rounding Behavior

When using frac, the computed sample size (N × frac) must be rounded to an integer. By default, samplyr rounds up (ceiling), which matches SAS SURVEYSELECT’s default behavior. Use the round argument to control this:

  • "up" (default): round up, ensures adequate sample size
  • "down": round down, minimum sample size
  • "nearest": Standard mathematical rounding
# Frame of 105 units, frac = 0.1 → 10.5
frame_105 <- head(bfa_eas, 105)

# Default: rounds up to 11
nrow(sampling_design() |> draw(frac = 0.1) |> execute(frame_105, seed = 3))
#> [1] 11

# Round down to 10
nrow(sampling_design() |> draw(frac = 0.1, round = "down") |> execute(frame_105, seed = 3))
#> [1] 10

# Round to nearest (10)
nrow(sampling_design() |> draw(frac = 0.1, round = "nearest") |> execute(frame_105, seed = 3))
#> [1] 10

Stratified Sampling

Stratification partitions the frame into non-overlapping groups (strata) and samples from each. This ensures representation from all subgroups and often improves precision.

Per-Stratum Sample Size

Without an allocation method, n specifies the sample size within each stratum.

sample <- sampling_design() |>
  stratify_by(region) |>
  draw(n = 20) |>
  execute(bfa_eas, seed = 123)

sample |>
  count(region)
#> # A tibble: 13 × 2
#>    region                n
#>    <fct>             <int>
#>  1 Boucle du Mouhoun    20
#>  2 Cascades             20
#>  3 Centre               20
#>  4 Centre-Est           20
#>  5 Centre-Nord          20
#>  6 Centre-Ouest         20
#>  7 Centre-Sud           20
#>  8 Est                  20
#>  9 Hauts-Bassins        20
#> 10 Nord                 20
#> 11 Plateau-Central      20
#> 12 Sahel                20
#> 13 Sud-Ouest            20

Allocation Methods

With an allocation method, n becomes the total sample size to distribute across strata. Allocation methods are defined for total sample-size allocation, so draw(frac = ...) is not supported when alloc is set.

Proportional allocation distributes the sample proportionally to stratum sizes, ensuring the sample mirrors the population structure.

sample <- sampling_design(title = "Proportional Allocation") |>
  stratify_by(region, alloc = "proportional") |>
  draw(n = 300) |>
  execute(bfa_eas, seed = 42)

sample |>
  count(region) |>
  mutate(pct = n / sum(n))
#> # A tibble: 13 × 3
#>    region                n    pct
#>    <fct>             <int>  <dbl>
#>  1 Boucle du Mouhoun    34 0.113 
#>  2 Cascades             17 0.0567
#>  3 Centre               26 0.0867
#>  4 Centre-Est           20 0.0667
#>  5 Centre-Nord          23 0.0767
#>  6 Centre-Ouest         25 0.0833
#>  7 Centre-Sud           11 0.0367
#>  8 Est                  37 0.123 
#>  9 Hauts-Bassins        32 0.107 
#> 10 Nord                 20 0.0667
#> 11 Plateau-Central      11 0.0367
#> 12 Sahel                28 0.0933
#> 13 Sud-Ouest            16 0.0533

Equal allocation assigns the same sample size to each stratum, regardless of population size. This maximizes precision for comparisons between strata.

sample <- sampling_design(title = "Equal Allocation") |>
  stratify_by(region, alloc = "equal") |>
  draw(n = 160) |>
  execute(bfa_eas, seed = 2026)

sample |>
  count(region)
#> # A tibble: 13 × 2
#>    region                n
#>    <fct>             <int>
#>  1 Boucle du Mouhoun    13
#>  2 Cascades             13
#>  3 Centre               13
#>  4 Centre-Est           13
#>  5 Centre-Nord          12
#>  6 Centre-Ouest         12
#>  7 Centre-Sud           12
#>  8 Est                  12
#>  9 Hauts-Bassins        12
#> 10 Nord                 12
#> 11 Plateau-Central      12
#> 12 Sahel                12
#> 13 Sud-Ouest            12

Neyman allocation minimizes the variance of the overall estimate by allocating more sample to strata with higher variability. It requires prior information about stratum variances. Use explicit stratification variable names when passing variance.

data(bfa_eas_variance)

bfa_eas_variance
#> # A tibble: 13 × 2
#>    region                var
#>    <fct>               <dbl>
#>  1 Boucle du Mouhoun  17.7  
#>  2 Cascades            1.32 
#>  3 Centre              1.02 
#>  4 Centre-Est         18.6  
#>  5 Centre-Nord        87.0  
#>  6 Centre-Ouest        7.42 
#>  7 Centre-Sud          3.03 
#>  8 Est               134.   
#>  9 Hauts-Bassins       0.411
#> 10 Nord              264.   
#> 11 Plateau-Central     7.75 
#> 12 Sahel             253.   
#> 13 Sud-Ouest           2.49

sample <- sampling_design(title = "Neyman Allocation") |>
  stratify_by(region, alloc = "neyman", variance = bfa_eas_variance) |>
  draw(n = 300) |>
  execute(bfa_eas, seed = 12345)

sample |>
  count(region)
#> # A tibble: 13 × 2
#>    region                n
#>    <fct>             <int>
#>  1 Boucle du Mouhoun    23
#>  2 Cascades              3
#>  3 Centre                4
#>  4 Centre-Est           14
#>  5 Centre-Nord          35
#>  6 Centre-Ouest         11
#>  7 Centre-Sud            3
#>  8 Est                  70
#>  9 Hauts-Bassins         4
#> 10 Nord                 52
#> 11 Plateau-Central       5
#> 12 Sahel                72
#> 13 Sud-Ouest             4

Optimal allocation extends Neyman allocation by also accounting for differential costs across strata. It minimizes variance for a fixed budget (or cost for fixed precision). Use explicit stratification variable names when passing variance/cost.

data(bfa_eas_cost)

bfa_eas_cost
#> # A tibble: 13 × 2
#>    region             cost
#>    <fct>             <dbl>
#>  1 Boucle du Mouhoun   484
#>  2 Cascades            422
#>  3 Centre              237
#>  4 Centre-Est          405
#>  5 Centre-Nord         446
#>  6 Centre-Ouest        400
#>  7 Centre-Sud          417
#>  8 Est                 631
#>  9 Hauts-Bassins       374
#> 10 Nord                579
#> 11 Plateau-Central     410
#> 12 Sahel               626
#> 13 Sud-Ouest           420

sample <- sampling_design(title = "Optimal Allocation") |>
  stratify_by(region, alloc = "optimal",
              variance = bfa_eas_variance,
              cost = bfa_eas_cost) |>
  draw(n = 300) |>
  execute(bfa_eas, seed = 9876)

sample |>
  count(region)
#> # A tibble: 13 × 2
#>    region                n
#>    <fct>             <int>
#>  1 Boucle du Mouhoun    24
#>  2 Cascades              4
#>  3 Centre                6
#>  4 Centre-Est           16
#>  5 Centre-Nord          38
#>  6 Centre-Ouest         13
#>  7 Centre-Sud            3
#>  8 Est                  64
#>  9 Hauts-Bassins         4
#> 10 Nord                 50
#> 11 Plateau-Central       6
#> 12 Sahel                67
#> 13 Sud-Ouest             5

Power allocation (Bankier 1988) uses \(n_h \propto C_h X_h^q\), where cv supplies \(C_h\), importance supplies \(X_h\), and power is \(q \in [0, 1]\).

cv_df <- data.frame(
  region = levels(bfa_eas$region),
  cv = c(0.40, 0.35, 0.12, 0.20, 0.30, 0.18,
         0.15, 0.38, 0.22, 0.32, 0.17, 0.45, 0.25)
)

importance_df <- data.frame(
  region = levels(bfa_eas$region),
  importance = c(60, 40, 120, 70, 80, 65,
                 50, 55, 90, 75, 45, 35, 30)
)

sample <- sampling_design(title = "Power Allocation") |>
  stratify_by(
    region,
    alloc = "power",
    cv = cv_df,
    importance = importance_df,
    power = 0.5
  ) |>
  draw(n = 300) |>
  execute(bfa_eas, seed = 777)

sample |>
  count(region)
#> # A tibble: 13 × 2
#>    region                n
#>    <fct>             <int>
#>  1 Boucle du Mouhoun    35
#>  2 Cascades             25
#>  3 Centre               15
#>  4 Centre-Est           19
#>  5 Centre-Nord          31
#>  6 Centre-Ouest         16
#>  7 Centre-Sud           12
#>  8 Est                  32
#>  9 Hauts-Bassins        24
#> 10 Nord                 32
#> 11 Plateau-Central      13
#> 12 Sahel                30
#> 13 Sud-Ouest            16

Sample Size Bounds

When using allocation methods, you can constrain stratum sample sizes with min_n and max_n in draw().

Minimum sample size ensures every stratum gets at least the specified number of units. This is essential for variance estimation (requires n ≥ 2 per stratum) or when you need reliable subgroup estimates.

# Neyman allocation with minimum 5 per stratum
sample <- sampling_design() |>
  stratify_by(region, alloc = "neyman", variance = bfa_eas_variance) |>
  draw(n = 300, min_n = 20) |>
  execute(bfa_eas, seed = 1)

sample |>
  count(region)
#> # A tibble: 13 × 2
#>    region                n
#>    <fct>             <int>
#>  1 Boucle du Mouhoun    21
#>  2 Cascades             20
#>  3 Centre               20
#>  4 Centre-Est           20
#>  5 Centre-Nord          24
#>  6 Centre-Ouest         20
#>  7 Centre-Sud           20
#>  8 Est                  33
#>  9 Hauts-Bassins        20
#> 10 Nord                 28
#> 11 Plateau-Central      20
#> 12 Sahel                34
#> 13 Sud-Ouest            20

Maximum sample size caps large strata to prevent them from dominating the sample. This is useful when operational constraints limit how many units you can handle in any one stratum.

# Proportional allocation but cap at 60 per stratum
sample <- sampling_design() |>
  stratify_by(region, alloc = "proportional") |>
  draw(n = 300, max_n = 55) |>
  execute(bfa_eas, seed = 40)

sample |>
  count(region)
#> # A tibble: 13 × 2
#>    region                n
#>    <fct>             <int>
#>  1 Boucle du Mouhoun    34
#>  2 Cascades             17
#>  3 Centre               26
#>  4 Centre-Est           20
#>  5 Centre-Nord          23
#>  6 Centre-Ouest         25
#>  7 Centre-Sud           11
#>  8 Est                  37
#>  9 Hauts-Bassins        32
#> 10 Nord                 20
#> 11 Plateau-Central      11
#> 12 Sahel                28
#> 13 Sud-Ouest            16

Both bounds together create a feasible range for each stratum:

sample <- sampling_design() |>
  stratify_by(region, alloc = "proportional") |>
  draw(n = 300, min_n = 20, max_n = 55) |>
  execute(bfa_eas, seed = 2003)

sample |>
  count(region)
#> # A tibble: 13 × 2
#>    region                n
#>    <fct>             <int>
#>  1 Boucle du Mouhoun    28
#>  2 Cascades             20
#>  3 Centre               24
#>  4 Centre-Est           20
#>  5 Centre-Nord          22
#>  6 Centre-Ouest         23
#>  7 Centre-Sud           20
#>  8 Est                  30
#>  9 Hauts-Bassins        28
#> 10 Nord                 20
#> 11 Plateau-Central      20
#> 12 Sahel                25
#> 13 Sud-Ouest            20

When a stratum’s population is smaller than min_n, the entire stratum is selected (capped at population size).

Custom allocation lets you specify exact sample sizes or rates per stratum by passing a data frame to draw().

sizes_df <- data.frame(
  region = levels(bfa_eas$region),
  n = c(20, 12, 25, 18, 22, 16, 14, 15, 20, 18, 12, 10, 8)
)

sample <- sampling_design() |>
  stratify_by(region) |>
  draw(n = sizes_df) |>
  execute(bfa_eas, seed = 101)

sample |>
  count(region)
#> # A tibble: 13 × 2
#>    region                n
#>    <fct>             <int>
#>  1 Boucle du Mouhoun    20
#>  2 Cascades             12
#>  3 Centre               25
#>  4 Centre-Est           18
#>  5 Centre-Nord          22
#>  6 Centre-Ouest         16
#>  7 Centre-Sud           14
#>  8 Est                  15
#>  9 Hauts-Bassins        20
#> 10 Nord                 18
#> 11 Plateau-Central      12
#> 12 Sahel                10
#> 13 Sud-Ouest             8

Multiple Stratification Variables

You can stratify by multiple variables to create crossed strata. Here we stratify by both region and urban/rural status.

sample <- sampling_design() |>
  stratify_by(region, urban_rural, alloc = "proportional") |>
  draw(n = 300) |>
  execute(bfa_eas, seed = 42)

sample |>
  count(region, urban_rural) |>
  head(10)
#> # A tibble: 10 × 3
#>    region            urban_rural     n
#>    <fct>             <fct>       <int>
#>  1 Boucle du Mouhoun Rural          34
#>  2 Cascades          Rural          17
#>  3 Centre            Rural           2
#>  4 Centre            Urban          25
#>  5 Centre-Est        Rural          18
#>  6 Centre-Est        Urban           2
#>  7 Centre-Nord       Rural          20
#>  8 Centre-Nord       Urban           3
#>  9 Centre-Ouest      Rural          22
#> 10 Centre-Ouest      Urban           3

Cluster Sampling

Cluster sampling selects groups (clusters) rather than individual units. This is practical when a complete list of individuals isn’t available or when travel costs make dispersed sampling expensive.

Use cluster_by() to specify the cluster identifier. All units within selected clusters are included.

sample <- sampling_design() |>
  cluster_by(ea_id) |>
  draw(n = 50) |>
  execute(bfa_eas, seed = 120)

sample |>
  summarise(
    n_clusters = n_distinct(ea_id),
    n_units = n()
  )
#> # A tibble: 1 × 2
#>   n_clusters n_units
#>        <int>   <int>
#> 1         50      50

PPS Sampling

Probability Proportional to Size (PPS) sampling selects units with probability proportional to a size measure. This is standard for cluster sampling where clusters vary in size.

Use the mos argument to specify the measure of size variable. Here we select EAs with probability proportional to their household count.

sample_pps <- sampling_design() |>
  cluster_by(ea_id) |>
  draw(n = 50, method = "pps_brewer", mos = households) |>
  execute(bfa_eas, seed = 365)

sample_pps |>
  summarise(
    mean_hh = mean(households),
    median_hh = median(households)
  )
#> # A tibble: 1 × 2
#>   mean_hh median_hh
#>     <dbl>     <dbl>
#> 1    104.      97.5

Compared to simple random sampling, PPS tends to select larger clusters:

sample_srs <- sampling_design() |>
  cluster_by(ea_id) |>
  draw(n = 50) |>
  execute(bfa_eas, seed = 42)

bind_rows(
  sample_pps |> summarise(method = "PPS", mean_hh = mean(households)),
  sample_srs |> summarise(method = "SRS", mean_hh = mean(households))
)
#> # A tibble: 2 × 2
#>   method mean_hh
#>   <chr>    <dbl>
#> 1 PPS      104. 
#> 2 SRS       65.6

PPS Methods

samplyr supports several PPS methods:

Method Description
pps_brewer Brewer’s method - fast with good properties (recommended)
pps_systematic PPS systematic - simple but can have periodicity issues
pps_cps Conditional Poisson sampling (maximum entropy)
pps_poisson PPS Poisson - random sample size (requires frac and supports prn)
pps_sps Sequential Poisson sampling (supports prn)
pps_pareto Pareto sampling (supports prn)
pps_multinomial PPS with replacement
pps_chromy Chromy’s sequential PPS / minimum replacement

Certainty Selection

In PPS sampling, very large units can have expected inclusion probabilities above 1. Certainty selection handles this by selecting those units with probability 1 and sampling the rest normally.

Use certainty_size for an absolute threshold or certainty_prop for a proportional one.

sample_cert <- sampling_design(title = "Certainty selection") |>
  draw(n = 50, method = "pps_brewer", mos = households,
       certainty_size = 800) |>
  execute(bfa_eas, seed = 123)

# How many EAs with more than 800 households
filter(bfa_eas, households > 800)
#> # A tibble: 4 × 12
#>   ea_id region       province commune urban_rural population households area_km2
#>   <int> <fct>        <fct>    <fct>   <fct>            <dbl>      <int>    <dbl>
#> 1   888 Centre-Est   Boulgou  Bitou   Rural             4559        815     8.31
#> 2 26995 Centre-Est   Kourite… Koupela Urban             4640        941     3.62
#> 3 28122 Centre-Ouest Sanguie  Reo     Urban             3930        844     8.22
#> 4 40067 Nord         Yatenga  Barga   Urban             4644        821     2.19
#> # ℹ 4 more variables: accessible <lgl>, dist_road_km <dbl>,
#> #   food_insecurity_pct <dbl>, cost <dbl>

# Check which units were selected with certainty
count(sample_cert, .certainty_1)
#> # A tibble: 2 × 2
#>   .certainty_1     n
#>   <lgl>        <int>
#> 1 FALSE           46
#> 2 TRUE             4

With certainty_prop, units whose MOS proportion exceeds the threshold are taken with certainty. The process is iterative: after removing certainty units, proportions are recomputed.

sample_certp <- sampling_design() |>
  stratify_by(region) |>
  draw(n = 50, method = "pps_systematic", mos = households,
       certainty_prop = 0.05) |>
  execute(bfa_eas, seed = 12345)

# How many EAs with more than 5% of the total hh by region
bfa_eas |>
  mutate(mosprop = households/sum(households),
         .by = region) |>
  filter(mosprop >= 0.05)
#> # A tibble: 0 × 13
#> # ℹ 13 variables: ea_id <int>, region <fct>, province <fct>, commune <fct>,
#> #   urban_rural <fct>, population <dbl>, households <int>, area_km2 <dbl>,
#> #   accessible <lgl>, dist_road_km <dbl>, food_insecurity_pct <dbl>,
#> #   cost <dbl>, mosprop <dbl>

filter(sample_certp, .certainty_1)
#> Warning in min(w): no non-missing arguments to min; returning Inf
#> Warning in max(w): no non-missing arguments to max; returning -Inf
#> # A tbl_sample: 0 × 18
#> # Weights:      NaN [Inf, -Inf]
#> # ℹ 18 variables: ea_id <int>, region <fct>, province <fct>, commune <fct>,
#> #   urban_rural <fct>, population <dbl>, households <int>, area_km2 <dbl>,
#> #   accessible <lgl>, dist_road_km <dbl>, food_insecurity_pct <dbl>,
#> #   cost <dbl>, .weight <dbl>, .sample_id <int>, .stage <int>, .weight_1 <dbl>,
#> #   .fpc_1 <int>, .certainty_1 <lgl>

Permanent Random Numbers (PRN)

Some PPS methods support permanent random numbers for sample coordination across survey waves. Assign a stable U(0,1) value to each frame unit, then pass it via prn:

set.seed(1)
bfa_eas$prn <- runif(nrow(bfa_eas))

sample_prn <- sampling_design(title = "Samples coordination with PRN") |>
  cluster_by(ea_id) |>
  draw(n = 50, method = "pps_sps", mos = households, prn = prn) |>
  execute(bfa_eas, seed = 1)
sample_prn
#> # A tbl_sample: 50 × 19 | Samples coordination with PRN
#> # Weights:      610.1 [90.47, 2399.38]
#>    ea_id region      province commune urban_rural population households area_km2
#>  * <int> <fct>       <fct>    <fct>   <fct>            <dbl>      <int>    <dbl>
#>  1  6360 Boucle du … Banwa    Kouka   Rural              592         82     0.54
#>  2 23944 Boucle du … Banwa    Solenzo Rural              758        102     0.77
#>  3 37423 Boucle du … Kossi    Kombori Rural              688         88     1.37
#>  4 25972 Boucle du … Mouhoun  Dedoug… Rural              804        143     1.58
#>  5 34838 Boucle du … Sourou   Tougan  Rural              876        120     1.25
#>  6 34864 Boucle du … Sourou   Tougan  Rural             2824        387     2.07
#>  7 34885 Boucle du … Sourou   Tougan  Rural              804        110     8.88
#>  8  7750 Cascades    Leraba   Nianko… Rural              633        125     1.36
#>  9 34253 Cascades    Leraba   Sindou  Rural              499         61     0.68
#> 10 11585 Cascades    Leraba   Wolonk… Rural              741         92     5.72
#> # ℹ 40 more rows
#> # ℹ 11 more variables: accessible <lgl>, dist_road_km <dbl>,
#> #   food_insecurity_pct <dbl>, cost <dbl>, prn <dbl>, .weight <dbl>,
#> #   .sample_id <int>, .stage <int>, .weight_1 <dbl>, .fpc_1 <int>,
#> #   .certainty_1 <lgl>

Because the PRN values are stable, re-executing the same design on the same frame produces highly overlapping samples (positive coordination). PRN is supported for bernoulli, pps_poisson, pps_sps, and pps_pareto. You can learn more about coordination with permanent random numbers in the following vignette vignette("sampling-coordination").


Control Sorting and Serpentine Ordering

Control sorting orders the frame before selection, providing implicit stratification. This is effective with systematic and sequential methods, where it ensures the sample spreads evenly across the sorted variables.

# Nested sorting: standard ascending order
sample_sort <- sampling_design() |>
  draw(n = 100, method = "systematic",
       control = c(region, province)) |>
  execute(bfa_eas, seed = 98765)
head(sample_sort)
#> # A tbl_sample: 6 × 18
#> # Weights:      445.7 [445.7, 445.7]
#>   ea_id region       province commune urban_rural population households area_km2
#> * <int> <fct>        <fct>    <fct>   <fct>            <dbl>      <int>    <dbl>
#> 1 36793 Boucle du M… Bale     Fara    Rural              212         31     0.1 
#> 2  6276 Boucle du M… Banwa    Kouka   Rural              544         75     4.3 
#> 3 23886 Boucle du M… Banwa    Solenzo Rural               79         11     5.68
#> 4   568 Boucle du M… Kossi    Barani  Rural              371         49     0.35
#> 5 41792 Boucle du M… Kossi    Dokui   Rural              821        102     0.83
#> 6 33311 Boucle du M… Kossi    Nouna   Rural              989        146     0.35
#> # ℹ 10 more variables: accessible <lgl>, dist_road_km <dbl>,
#> #   food_insecurity_pct <dbl>, cost <dbl>, prn <dbl>, .weight <dbl>,
#> #   .sample_id <int>, .stage <int>, .weight_1 <dbl>, .fpc_1 <int>

serp() implements serpentine (snake) sorting, which alternates direction at each level of the hierarchy. This minimizes jumps at boundaries, matching SAS PROC SURVEYSELECT’s SORT=SERP.

# Serpentine sorting
sample_serp <- sampling_design() |>
  draw(n = 100, method = "systematic",
       control = serp(region, province)) |>
  execute(bfa_eas, seed = 1)
head(sample_serp)
#> # A tbl_sample: 6 × 18
#> # Weights:      445.7 [445.7, 445.7]
#>   ea_id region       province commune urban_rural population households area_km2
#> * <int> <fct>        <fct>    <fct>   <fct>            <dbl>      <int>    <dbl>
#> 1 29502 Boucle du M… Bale     Bana    Rural              168         22     6.72
#> 2  8912 Boucle du M… Bale     Pompoi  Rural              802        109     0.72
#> 3  9641 Boucle du M… Banwa    Sanaba  Rural              825        105     8.83
#> 4 10935 Boucle du M… Banwa    Tansila Rural              173         21     9.23
#> 5 21106 Boucle du M… Kossi    Bouras… Rural               39          4     3.34
#> 6  7017 Boucle du M… Kossi    Madouba Rural              599         74     0.74
#> # ℹ 10 more variables: accessible <lgl>, dist_road_km <dbl>,
#> #   food_insecurity_pct <dbl>, cost <dbl>, prn <dbl>, .weight <dbl>,
#> #   .sample_id <int>, .stage <int>, .weight_1 <dbl>, .fpc_1 <int>

Control sorting can be combined with explicit stratification. The sorting is then applied within each stratum.

sample_serp2 <- sampling_design() |>
  stratify_by(urban_rural) |>
  draw(n = 100, method = "systematic",
       control = serp(region, province)) |>
  execute(bfa_eas, seed = 2)
head(sample_serp2)
#> # A tbl_sample: 6 × 18
#> # Weights:      376.87 [376.87, 376.87]
#>   ea_id region       province commune urban_rural population households area_km2
#> * <int> <fct>        <fct>    <fct>   <fct>            <dbl>      <int>    <dbl>
#> 1 11828 Boucle du M… Bale     Bagassi Rural               38          5     6.16
#> 2  8436 Boucle du M… Bale     Ouri    Rural               83         11     4.09
#> 3  6321 Boucle du M… Banwa    Kouka   Rural              527         73     8.97
#> 4 23862 Boucle du M… Banwa    Solenzo Rural             1006        135     1.17
#> 5   475 Boucle du M… Kossi    Barani  Rural               51          7     8.84
#> 6 12763 Boucle du M… Kossi    Djibas… Rural             1238        158     0.95
#> # ℹ 10 more variables: accessible <lgl>, dist_road_km <dbl>,
#> #   food_insecurity_pct <dbl>, cost <dbl>, prn <dbl>, .weight <dbl>,
#> #   .sample_id <int>, .stage <int>, .weight_1 <dbl>, .fpc_1 <int>

Balanced Sampling

Balanced sampling selects a sample whose Horvitz-Thompson estimates of auxiliary totals match (or nearly match) the population totals. This improves precision for any variable correlated with the auxiliary variables. samplyr implements the cube method (Deville & Tille 2004) via method = "balanced".

Specify auxiliary variables with aux. Without mos, inclusion probabilities are equal; with mos, they are proportional to size:

# Equal-probability balanced sample
balanced_eq <- sampling_design() |>
  draw(n = 100, method = "balanced",
       aux = c(population, households)) |>
  execute(bfa_eas, seed = 42)
balanced_eq
#> # A tbl_sample: 100 × 18
#> # Weights:      445.7 [445.7, 445.7]
#>    ea_id region      province commune urban_rural population households area_km2
#>  * <int> <fct>       <fct>    <fct>   <fct>            <dbl>      <int>    <dbl>
#>  1  8354 Boucle du … Bale     Ouri    Rural               78         10     7.56
#>  2  8901 Boucle du … Bale     Pompoi  Rural              220         30     8.8 
#>  3  9031 Boucle du … Bale     Poura   Urban              941        151     1.06
#>  4  6328 Boucle du … Banwa    Kouka   Rural              889        122     8.66
#>  5 21118 Boucle du … Kossi    Bouras… Rural               28          3     8.47
#>  6 12858 Boucle du … Kossi    Djibas… Rural              431         55     4.95
#>  7  7021 Boucle du … Kossi    Madouba Rural              926        115     0.66
#>  8 26106 Boucle du … Mouhoun  Dedoug… Rural              237         42     8.72
#>  9 26496 Boucle du … Nayala   Gossina Rural              116         14     6.51
#> 10  6237 Boucle du … Nayala   Kougny  Rural               97         11     8.71
#> # ℹ 90 more rows
#> # ℹ 10 more variables: accessible <lgl>, dist_road_km <dbl>,
#> #   food_insecurity_pct <dbl>, cost <dbl>, prn <dbl>, .weight <dbl>,
#> #   .sample_id <int>, .stage <int>, .weight_1 <dbl>, .fpc_1 <int>

The balancing property means the HT estimate of auxiliary totals is close to the true total:

pop_total <- sum(bfa_eas$population)
ht_total <- sum(balanced_eq$population * balanced_eq$.weight)
c(population = pop_total, ht_estimate = ht_total)
#>  population ht_estimate 
#>    20487532    20444259

Balanced sampling combines naturally with stratification. When stratified, samplyr calls the stratified cube algorithm (Chauvet 2009) in a single pass, preserving the allocation computed by stratify_by():

# Stratified PPS balanced sample
balanced_strat <- sampling_design() |>
  stratify_by(region, alloc = "proportional") |>
  draw(n = 300, method = "balanced", mos = households,
       aux = c(population, area_km2)) |>
  execute(bfa_eas, seed = 1960)
balanced_strat
#> # A tbl_sample: 300 × 18
#> # Weights:      148.8 [10.29, 2065.22]
#>    ea_id region      province commune urban_rural population households area_km2
#>  * <int> <fct>       <fct>    <fct>   <fct>            <dbl>      <int>    <dbl>
#>  1 11787 Boucle du … Bale     Bagassi Rural              874        105     1.4 
#>  2 11855 Boucle du … Bale     Bagassi Rural              220         26     8.48
#>  3  1165 Boucle du … Bale     Boromo  Rural             1257        164     0.96
#>  4 36701 Boucle du … Bale     Fara    Rural              638         93     0.81
#>  5 39987 Boucle du … Banwa    Balave  Rural              298         47     9.01
#>  6  6362 Boucle du … Banwa    Kouka   Rural              890        123     6.91
#>  7 34030 Boucle du … Banwa    Sami    Rural              139         21     8.93
#>  8 34038 Boucle du … Banwa    Sami    Rural              214         32    10.4 
#>  9 34054 Boucle du … Banwa    Sami    Rural              306         46     5.13
#> 10 23901 Boucle du … Banwa    Solenzo Rural              383         51     8.71
#> # ℹ 290 more rows
#> # ℹ 10 more variables: accessible <lgl>, dist_road_km <dbl>,
#> #   food_insecurity_pct <dbl>, cost <dbl>, prn <dbl>, .weight <dbl>,
#> #   .sample_id <int>, .stage <int>, .weight_1 <dbl>, .fpc_1 <dbl>

When used with cluster_by(), auxiliary values are automatically aggregated (summed) to the cluster level before selection.

Balanced sampling supports up to 2 stages. At stage 1, element-level auxiliary variables are aggregated to PSU totals; at stage 2, the cube method runs on the element-level frame within each selected PSU.


Custom Sampling Methods

Any unequal probability method can be plugged into samplyr by registering it with sondage::register_method(). The method is then available in draw() using the pps_<name> convention.

For example, the elimination procedure of Tillé (1996) is available in the sampling package. We can register it as a custom method along with its exact joint inclusion probabilities:

# Wrap sampling::UPtille to return selected indices
tille_fn <- function(pik, n = NULL, prn = NULL, ...) {
  which(as.logical(sampling::UPtille(pik)))
}

# Joint inclusion probabilities, restricted to the sampled units
tille_joint_fn <- function(pik, sample_idx = NULL, ...) {
  pi2 <- sampling::UPtillepi2(pik)
  if (!is.null(sample_idx))
    pi2 <- pi2[sample_idx, sample_idx, drop = FALSE]
  pi2
}

sondage::register_method(
  "tille", type = "wor",
  sample_fn = tille_fn,
  joint_fn = tille_joint_fn
)

Once registered, the method works like any built-in PPS method:

set.seed(1)
frame <- bfa_eas[sample(seq_len(nrow(bfa_eas)), 500), ]

sample_tille <- sampling_design() |>
  stratify_by(region, alloc = "proportional") |>
  draw(n = 50, method = "pps_tille", mos = households) |>
  execute(frame, seed = 1)

sample_tille
#> # A tbl_sample: 50 × 19
#> # Weights:      12.2 [1.31, 111.8]
#>    ea_id region      province commune urban_rural population households area_km2
#>  * <int> <fct>       <fct>    <fct>   <fct>            <dbl>      <int>    <dbl>
#>  1 43475 Est         Gnagna   Piela   Rural              971        114     8.95
#>  2 25151 Est         Gnagna   Bilanga Rural              359         54     9.07
#>  3 27796 Est         Tapoa    Partia… Rural              742        102     8.15
#>  4 37813 Est         Tapoa    Logbou  Rural              965        123     5.96
#>  5  3098 Est         Gourma   Fada-N… Rural              985        119     0.63
#>  6 25163 Est         Gnagna   Bilanga Rural              546         83     9.01
#>  7 30933 Est         Komandj… Gayeri  Rural              278         35     8.9 
#>  8  3884 Sud-Ouest   Poni     Gaoua   Rural              146         22    11.3 
#>  9 32486 Sud-Ouest   Poni     Lorope… Rural               37          5     3.97
#> 10 29504 Boucle du … Bale     Bana    Rural              817        107     8.21
#> # ℹ 40 more rows
#> # ℹ 11 more variables: accessible <lgl>, dist_road_km <dbl>,
#> #   food_insecurity_pct <dbl>, cost <dbl>, prn <dbl>, .weight <dbl>,
#> #   .sample_id <int>, .stage <int>, .weight_1 <dbl>, .fpc_1 <int>,
#> #   .certainty_1 <lgl>

Because we provided a joint_fn, exact variance estimation is available via joint_expectation():

jip <- joint_expectation(sample_tille, frame)
svy <- as_svydesign(sample_tille, pps = survey::ppsmat(jip$stage_1))
survey::svymean(~population, svy)
#>              mean     SE
#> population 382.58 72.167

Custom methods flow through the full pipeline: stratification, certainty selection, multi-stage designs, survey export, and summary diagnostics. See ?sondage::register_method for the full contract including WR methods and PRN support.


Multi-Stage Sampling

Multi-stage designs sample in stages: first select primary sampling units (PSUs), then sample within them. This is the standard approach for large-scale surveys when a complete frame of ultimate units isn’t available upfront.

Use add_stage() to delimit each stage of the design.

Two-Stage Design

For this example, we’ll use zwe_eas to demonstrate a two-stage design: first select districts as PSUs, then sample EAs within selected districts.

data(zwe_eas)

# First, let's see the structure: EAs nested within districts
zwe_eas |>
  count(province, district) |>
  head(10)
#> # A tibble: 10 × 3
#>    province   district           n
#>    <fct>      <fct>          <int>
#>  1 Bulawayo   Bulawayo        2164
#>  2 Harare     Chitungwiza      729
#>  3 Harare     Epworth          564
#>  4 Harare     Harare          3922
#>  5 Harare     Harare Rural     851
#>  6 Manicaland Buhera          2790
#>  7 Manicaland Chimanimani      991
#>  8 Manicaland Chipinge        2038
#>  9 Manicaland Chipinge Urban   104
#> 10 Manicaland Makoni          3177

We need to create a measure of size for districts. Here we’ll use total households per district.

# Add district-level households as MOS
zwe_frame <- zwe_eas |>
  mutate(district_hh = sum(households),
         .by = district)

Now we can run a two-stage design: select 10 districts with PPS, then sample 5 EAs within each.

twostage_design <- sampling_design(title = "Two-stage cluster sampling") |>
  add_stage(label = "Districts") |>
    cluster_by(district) |>
    draw(n = 10, method = "pps_brewer", mos = district_hh) |>
  add_stage(label = "EAs") |>
  draw(n = 5)
twostage_design
#> ── Sampling Design: Two-stage cluster sampling ─────────────────────────────────
#> 
#> ℹ 2 stages
#> 
#> ── Stage 1: Districts ──────────────────────────────────────────────────────────
#> • Cluster: district
#> • Draw: n = 10, method = pps_brewer, mos = district_hh
#> 
#> ── Stage 2: EAs ────────────────────────────────────────────────────────────────
#> • Draw: n = 5, method = srswor

Let’s apply it to the frame.

sample_eas <- execute(twostage_design,
                      zwe_frame, seed = 3)

sample_eas |>
  summarise(n_districts = n_distinct(district),
            n_eas = n())
#> # A tibble: 1 × 2
#>   n_districts n_eas
#>         <int> <int>
#> 1          10    50

Stratified Multi-Stage Design

Combining stratification with multi-stage sampling is standard for national surveys. Here we stratify by province (which is constant within districts), then select districts and EAs within each stratum.

sample_eas2 <- sampling_design(title = "Stratified two-stage cluster sampling") |>
  add_stage(label = "Districts") |>
    stratify_by(province) |>
    cluster_by(district) |>
    draw(n = 2, method = "pps_brewer", mos = district_hh) |>
  add_stage(label = "EAs") |>
    draw(n = 3) |>
  execute(zwe_frame, seed = 4)
#> Warning: Sample size capped to population in 1 stratum/strata: "Bulawayo".
#>  Requested total: 20. Actual total: 19.

sample_eas2 |>
  count(province, district)
#> # A tibble: 19 × 3
#>    province            district         n
#>    <fct>               <fct>        <int>
#>  1 Bulawayo            Bulawayo         3
#>  2 Harare              Epworth          3
#>  3 Harare              Harare           3
#>  4 Manicaland          Buhera           3
#>  5 Manicaland          Chipinge         3
#>  6 Mashonaland Central Guruve           3
#>  7 Mashonaland Central Mount Darwin     3
#>  8 Mashonaland East    Goromonzi        3
#>  9 Mashonaland East    Mutoko           3
#> 10 Mashonaland West    Sanyati          3
#> 11 Mashonaland West    Zvimba           3
#> 12 Masvingo            Bikita           3
#> 13 Masvingo            Masvingo         3
#> 14 Matabeleland North  Binga            3
#> 15 Matabeleland North  Bubi             3
#> 16 Matabeleland South  Gwanda           3
#> 17 Matabeleland South  Umzingwane       3
#> 18 Midlands            Gweru Urban      3
#> 19 Midlands            Zvishavane       3

Operational Multi-Stage Sampling

In practice, multi-stage surveys often execute stages at different times. After selecting PSUs, field teams may need to conduct listing or other operations before the second stage can proceed.

The stages argument lets you execute only specific stages.

design <- sampling_design() |>
  add_stage(label = "EA") |>
    stratify_by(urban_rural) |>
    cluster_by(ea_id) |>
    draw(n = 10, method = "pps_brewer", mos = households) |>
  add_stage(label = "HH") |>
    draw(n = 5)

# Execute stage 1 only
selected_eas <- execute(design, zwe_eas, stages = 1, seed = 1)
head(selected_eas)
#> # A tbl_sample: 6 × 18
#> # Stages:       1/2
#> # Weights:      3641.26 [591.31, 8821.67]
#>   ea_id province district ward_pcode urban_rural population households buildings
#> * <int> <fct>    <fct>    <chr>      <fct>            <int>      <int>     <int>
#> 1 47209 Bulawayo Bulawayo ZW102127   Urban              462        121       122
#> 2 35161 Harare   Harare   ZW192109   Urban              140         38        91
#> 3 86782 Harare   Harare   ZW192130   Urban              974        263       170
#> 4 88462 Harare   Harare   ZW192103   Urban              993        302        98
#> 5 93947 Mashona… Bindura  ZW120105   Rural               95         23        91
#> 6 35770 Mashona… Guruve   ZW120307   Urban              121         32       122
#> # ℹ 10 more variables: women_15_49 <int>, men_15_49 <int>,
#> #   children_under5 <int>, area_km2 <dbl>, .weight <dbl>, .sample_id <int>,
#> #   .stage <int>, .weight_1 <dbl>, .fpc_1 <int>, .certainty_1 <lgl>

Once the EAs are selected, households are listed in each of them.

selected_eas_list <- selected_eas |>
  slice(rep(seq_len(n()), households)) |>
  mutate(hh_id = row_number())
head(selected_eas_list)
#> # A tbl_sample: 6 × 19
#> # Stages:       1/2
#> # Weights:      1475.82 [1475.82, 1475.82]
#>   ea_id province district ward_pcode urban_rural population households buildings
#> * <int> <fct>    <fct>    <chr>      <fct>            <int>      <int>     <int>
#> 1 47209 Bulawayo Bulawayo ZW102127   Urban              462        121       122
#> 2 47209 Bulawayo Bulawayo ZW102127   Urban              462        121       122
#> 3 47209 Bulawayo Bulawayo ZW102127   Urban              462        121       122
#> 4 47209 Bulawayo Bulawayo ZW102127   Urban              462        121       122
#> 5 47209 Bulawayo Bulawayo ZW102127   Urban              462        121       122
#> 6 47209 Bulawayo Bulawayo ZW102127   Urban              462        121       122
#> # ℹ 11 more variables: women_15_49 <int>, men_15_49 <int>,
#> #   children_under5 <int>, area_km2 <dbl>, .weight <dbl>, .sample_id <int>,
#> #   .stage <int>, .weight_1 <dbl>, .fpc_1 <int>, .certainty_1 <lgl>,
#> #   hh_id <int>

You can also use tidyr::uncount() for the listing step.

selected_eas_list <- selected_eas |>
  tidyr::uncount(weights = households, .id = "hh_id", .remove = FALSE)

After fieldwork or preparation is complete, execute the remaining stages by piping the stage-1 result into execute() with the listing frame. This continuation pattern preserves the full design metadata needed for survey export.

final_sample <- selected_eas |>
  execute(selected_eas_list, seed = 2)
head(final_sample)
#> # A tbl_sample: 6 × 21
#> # Weights:      35714.84 [35714.84, 35714.84]
#>   ea_id province district ward_pcode urban_rural population households buildings
#> * <int> <fct>    <fct>    <chr>      <fct>            <int>      <int>     <int>
#> 1 47209 Bulawayo Bulawayo ZW102127   Urban              462        121       122
#> 2 47209 Bulawayo Bulawayo ZW102127   Urban              462        121       122
#> 3 47209 Bulawayo Bulawayo ZW102127   Urban              462        121       122
#> 4 47209 Bulawayo Bulawayo ZW102127   Urban              462        121       122
#> 5 47209 Bulawayo Bulawayo ZW102127   Urban              462        121       122
#> 6 35161 Harare   Harare   ZW192109   Urban              140         38        91
#> # ℹ 13 more variables: women_15_49 <int>, men_15_49 <int>,
#> #   children_under5 <int>, area_km2 <dbl>, hh_id <int>, .weight <dbl>,
#> #   .sample_id <int>, .stage <int>, .weight_2 <dbl>, .fpc_2 <int>,
#> #   .weight_1 <dbl>, .fpc_1 <int>, .certainty_1 <lgl>

Note: All dplyr verbs (mutate(), filter(), *_join(), etc.) preserve the tbl_sample class automatically. If you manipulate a sample with base R functions like merge() or cbind(), both the class and its attributes are lost. Prefer the dplyr equivalents (left_join(), bind_cols()). See ?as_tbl_sample for the full list.


Replicated Sampling

The reps parameter in execute() draws multiple independent samples from the same frame under the same design. The output is a single stacked tbl_sample with a .replicate column identifying each replicate. This is useful for simulation studies, repeated-sampling variance estimation, or quality control.

rep_sample <- sampling_design() |>
  stratify_by(region, alloc = "proportional") |>
  draw(n = 300) |>
  execute(bfa_eas, seed = 42, reps = 5)

rep_sample |>
  count(.replicate)
#> # A tibble: 5 × 2
#>   .replicate     n
#>        <int> <int>
#> 1          1   300
#> 2          2   300
#> 3          3   300
#> 4          4   300
#> 5          5   300

Each replicate is a fully independent draw. Replicate r uses seed seed + r - 1, so the first replicate matches a standalone execution with the same seed:

standalone <- sampling_design() |>
  stratify_by(region, alloc = "proportional") |>
  draw(n = 300) |>
  execute(bfa_eas, seed = 42)

rep1 <- rep_sample |> filter(.replicate == 1)
identical(sort(rep1$ea_id), sort(standalone$ea_id))
#> [1] TRUE

Replicated sampling also works with multi-stage designs. When continuing from a replicated partial execution, each replicate is continued independently:

design <- sampling_design() |>
  add_stage(label = "Districts") |>
    cluster_by(district) |>
    draw(n = 10, method = "pps_brewer", mos = district_hh) |>
  add_stage(label = "EAs") |>
    draw(n = 5)

stage1 <- execute(design, zwe_frame, stages = 1, seed = 1, reps = 3)
final <- execute(stage1, zwe_frame, seed = 100)

final |>
  summarise(n_districts = n_distinct(district), n_eas = n(),
            .by = .replicate)
#> # A tibble: 3 × 3
#>   .replicate n_districts n_eas
#>        <int>       <int> <int>
#> 1          1          10    50
#> 2          2          10    50
#> 3          3          10    50

Survey export functions require a single replicate. Filter first:

one_rep <- rep_sample |> filter(.replicate == 1)
svy <- as_svydesign(one_rep)

Note: reps cannot be combined with panels or with stages that use permanent random numbers (PRN). PRN produces identical samples across replicates, so coordination across waves should use an explicit loop with different PRN vectors instead.


Working with Results

Accessing the Design

The sampling design is attached to the result and can be retrieved with get_design().

sample <- sampling_design() |>
  stratify_by(region, alloc = "proportional") |>
  draw(n = 200) |>
  execute(bfa_eas, seed = 2024)

design <- get_design(sample)
design
#> ── Sampling Design ─────────────────────────────────────────────────────────────
#> 
#> ℹ 1 stage
#> 
#> ── Stage 1 ─────────────────────────────────────────────────────────────────────
#> • Strata: region (proportional)
#> • Draw: n = 200, method = srswor

and accessing element of the design

design$stages[[1]]$strata$vars
#> [1] "region"

Weight Diagnostics

The summary() method reports weight diagnostics including Kish’s effective sample size and design effect due to unequal weighting. For the full design effect (including clustering and stratification), use survey::svymean() with deff = TRUE after converting with as_svydesign().

summary(sample)
#> ── Sample Summary ──────────────────────────────────────────────────────────────
#> 
#> ℹ n = 200 | stages = 1/1 | seed = 2024
#> 
#> ── Design: Stage 1 ─────────────────────────────────────────────────────────────
#> • Strata: region (proportional)
#> • Method: srswor
#> 
#> ── Allocation: Stage 1 ─────────────────────────────────────────────────────────
#>   region             N_h    n_h  f_h   
#>   Boucle du Mouhoun  5009   23   0.0046
#>   Cascades           2508   11   0.0044
#>   Centre             3888   17   0.0044
#>   Centre-Est         2941   13   0.0044
#>   Centre-Nord        3402   15   0.0044
#>   Centre-Ouest       3723   17   0.0046
#>   Centre-Sud         1612   7    0.0043
#>   Est                5505   25   0.0045
#>   Hauts-Bassins      4839   22   0.0045
#>   Nord               2930   13   0.0044
#>   Plateau-Central    1662   7    0.0042
#>   Sahel              4144   19   0.0046
#>   Sud-Ouest          2407   11   0.0046
#>                      ─────  ───  ──────
#>   Total              44570  200  0.0045
#> 
#> ── Weights ─────────────────────────────────────────────────────────────────────
#> • Range: [217.78, 237.43]
#> • Mean:  222.85 · CV: 0.02
#> • DEFF:  1 · n_eff: 200

Method Reference

Selection Methods

Method Sample Size Replacement Notes
srswor Fixed No Default, general purpose
srswr Fixed Yes Bootstrap, rare populations
systematic Fixed No Ordered frames
bernoulli Random No Simple field protocols
pps_brewer Fixed No Recommended PPS method
pps_systematic Fixed No Simple PPS
pps_cps Fixed No Maximum entropy, exact joint inclusion probs
pps_poisson Random No PPS with random size
pps_sps Fixed No Sequential Poisson, supports PRN
pps_pareto Fixed No Pareto sampling, supports PRN
pps_multinomial Fixed Yes PPS with replacement
pps_chromy Fixed PMR.1 Chromy’s sequential PPS
balanced Fixed No Cube method, balances on aux variables

Allocation Methods

Method Description Requirements
(none) n per stratum
equal Same n per stratum
proportional n ∝ stratum size
neyman Minimize variance variance data frame
optimal Minimize variance/cost variance + cost data frames
power Compromise: \(n_h \propto C_h X_h^q\) cv + importance (+ optional power)

For custom stratum-specific sizes or rates, pass a data frame directly to n or frac in draw(). When alloc is specified, use n (not frac).

Sample Size Bounds

Argument Description
min_n Minimum sample size per stratum (for variance estimation, set ≥ 2)
max_n Maximum sample size per stratum (to cap dominant strata)

These only apply when using allocation methods (equal, proportional, neyman, optimal, power).

Rounding Methods

Value Description
"up" Round up (ceiling), default, matches SAS
"down" Round down (floor)
"nearest" Standard mathematical rounding

These apply when using frac to specify sampling rates. When computed sample size < 1, it is always rounded up to 1.


Best Practices

  1. Always set a seed for reproducibility
  2. Use meaningful stage labels for documentation
  3. Validate designs with validate_frame() before execution
  4. Check weight distributions with summary() on the sample
design <- sampling_design(title = "National Health Survey 2024") |>
  add_stage(label = "Enumeration Areas") |>
    stratify_by(region, urban_rural) |>
    cluster_by(ea_id) |>
    draw(n = 5, method = "pps_brewer", mos = households) |>
  add_stage(label = "Households") |>
    draw(n = 20)

print(design)
#> ── Sampling Design: National Health Survey 2024 ────────────────────────────────
#> 
#> ℹ 2 stages
#> 
#> ── Stage 1: Enumeration Areas ──────────────────────────────────────────────────
#> • Strata: region, urban_rural
#> • Cluster: ea_id
#> • Draw: n = 5, method = pps_brewer, mos = households
#> 
#> ── Stage 2: Households ─────────────────────────────────────────────────────────
#> • Draw: n = 20, method = srswor

Frame Validation

Use validate_frame() to check that a frame has all the variables a design requires before executing.

# This passes
validate_frame(design, bfa_eas)

# This fails with a clear message
bad_frame <- data.frame(id = 1:100, value = rnorm(100))
validate_frame(design, bad_frame)
#> Error:
#> ! Frame validation failed:
#>  Enumeration Areas: missing stratification variable: "region" and
#>   "urban_rural"
#>  Enumeration Areas: missing cluster variable: "ea_id"
#>  Enumeration Areas: missing MOS variable: `households`

Reference

Bankier, Michael D. 1988. “Power Allocations: Determining Sample Sizes for Subnational Areas.” The American Statistician 42 (3): 174–77.
Lohr, Sharon L. 2022. Sampling: Design and Analysis. 3rd ed. CRC Press.
Tillé, Yves. 1996. “An Elimination Procedure of Unequal Probability Sampling Without Replacement.” Biometrika 83 (1): 238–41.