Overview

samplyr provides a tidy grammar for specifying and executing survey sampling designs. The package is built around a minimal set of composable verbs that handle stratified, clustered, and multi-stage sampling.

The core idea is that sampling code should read like its English description:

# "Stratify by region, proportionally allocate 500 samples, execute"
sampling_design() |>
  stratify_by(region, alloc = "proportional") |>
  draw(n = 500) |>
  execute(frame, seed = 42)

The Grammar

samplyr uses 5 verbs and 1 modifier:

Function Purpose
sampling_design() Create a new sampling design
stratify_by() Define stratification variables and allocation
cluster_by() Define cluster/PSU variable
draw() Specify sample size, method, and measure of size
execute() Run the design on a frame
stage() Delimit stages in multi-stage designs

Example Data

We’ll use the niger_eas dataset throughout this vignette. It’s a synthetic enumeration area frame inspired by DHS household surveys, with ~1,500 EAs across 8 regions.

data(niger_eas)

niger_eas |>
  glimpse()
#> Rows: 1,536
#> Columns: 6
#> $ ea_id        <chr> "Aga_Aga_0001", "Aga_Aga_0002", "Aga_Aga_0003", "Aga_Aga_…
#> $ region       <fct> Agadez, Agadez, Agadez, Agadez, Agadez, Agadez, Agadez, A…
#> $ department   <fct> Agadez, Agadez, Agadez, Agadez, Agadez, Agadez, Agadez, A…
#> $ strata       <fct> Rural, Urban, Urban, Rural, Urban, Rural, Rural, Urban, R…
#> $ hh_count     <dbl> 59, 157, 124, 146, 112, 182, 166, 54, 97, 192, 54, 109, 6…
#> $ pop_estimate <dbl> 413, 942, 868, 1022, 896, 1092, 1162, 432, 582, 1344, 378…

Simple Random Sampling

The most basic design selects n units at random from the frame.

sample <- sampling_design() |>
  draw(n = 100) |>
  execute(niger_eas, seed = 42)

nrow(sample)
#> [1] 100

The result includes the original columns plus sampling metadata: .weight (sampling weight) and .prob (inclusion probability).

sample |>
  select(ea_id, region, strata, .weight, .prob) |>
  head()
#> == tbl_sample ==
#> Weights: 15.36 - 15.36 (mean: 15.36 )
#> 
#> # A tibble: 6 × 5
#>   ea_id        region    strata .weight  .prob
#>   <chr>        <fct>     <fct>    <dbl>  <dbl>
#> 1 Mar_Tes_0023 Maradi    Rural     15.4 0.0651
#> 2 Mar_Agu_0027 Maradi    Urban     15.4 0.0651
#> 3 Til_Tér_0015 Tillabéri Rural     15.4 0.0651
#> 4 Til_Oua_0012 Tillabéri Rural     15.4 0.0651
#> 5 Zin_Dun_0021 Zinder    Urban     15.4 0.0651
#> 6 Til_Tér_0008 Tillabéri Rural     15.4 0.0651

Selection Methods

The method argument controls how units are selected. By default, samplyr uses simple random sampling without replacement (srswor).

Systematic sampling selects units at fixed intervals, which can improve precision when the frame is ordered.

sample_sys <- sampling_design() |>
  draw(n = 100, method = "systematic") |>
  execute(niger_eas, seed = 42)

Sampling with replacement allows the same unit to be selected multiple times, useful for bootstrap procedures.

sample_wr <- sampling_design() |>
  draw(n = 100, method = "srswr") |>
  execute(niger_eas, seed = 42)

Using Sampling Fractions

Instead of a fixed sample size, you can specify a sampling fraction with frac.

sample <- sampling_design() |>
  draw(frac = 0.05) |>
  execute(niger_eas, seed = 42)

nrow(sample)
#> [1] 77

Bernoulli sampling selects each unit independently with the specified probability. This gives a random sample size, which can simplify field protocols.

sample <- sampling_design() |>
  draw(frac = 0.05, method = "bernoulli") |>
  execute(niger_eas, seed = 42)

nrow(sample)
#> [1] 64

Stratified Sampling

Stratification partitions the frame into non-overlapping groups (strata) and samples from each. This ensures representation from all subgroups and often improves precision.

Per-Stratum Sample Size

Without an allocation method, n specifies the sample size within each stratum.

sample <- sampling_design() |>
  stratify_by(region) |>
  draw(n = 20) |>
  execute(niger_eas, seed = 42)

sample |>
  count(region)
#> # A tibble: 8 × 2
#>   region        n
#>   <fct>     <int>
#> 1 Agadez       20
#> 2 Diffa        20
#> 3 Dosso        20
#> 4 Maradi       20
#> 5 Niamey       20
#> 6 Tahoua       20
#> 7 Tillabéri    20
#> 8 Zinder       20

Allocation Methods

With an allocation method, n becomes the total sample size to distribute across strata.

Proportional allocation distributes the sample proportionally to stratum sizes, ensuring the sample mirrors the population structure.

sample <- sampling_design() |>
  stratify_by(region, alloc = "proportional") |>
  draw(n = 300) |>
  execute(niger_eas, seed = 42)

sample |>
  count(region) |>
  mutate(pct = n / sum(n))
#> # A tibble: 8 × 3
#>   region        n    pct
#>   <fct>     <int>  <dbl>
#> 1 Agadez       10 0.0333
#> 2 Diffa        13 0.0433
#> 3 Dosso        35 0.117 
#> 4 Maradi       56 0.187 
#> 5 Niamey       30 0.1   
#> 6 Tahoua       54 0.18  
#> 7 Tillabéri    43 0.143 
#> 8 Zinder       59 0.197

Equal allocation assigns the same sample size to each stratum, regardless of population size. This maximizes precision for comparisons between strata.

sample <- sampling_design() |>
  stratify_by(region, alloc = "equal") |>
  draw(n = 160) |>
  execute(niger_eas, seed = 42)

sample |>
  count(region)
#> # A tibble: 8 × 2
#>   region        n
#>   <fct>     <int>
#> 1 Agadez       20
#> 2 Diffa        20
#> 3 Dosso        20
#> 4 Maradi       20
#> 5 Niamey       20
#> 6 Tahoua       20
#> 7 Tillabéri    20
#> 8 Zinder       20

Neyman allocation minimizes the variance of the overall estimate by allocating more sample to strata with higher variability. It requires prior information about stratum variances.

data(niger_eas_variance)

niger_eas_variance
#> # A tibble: 8 × 2
#>   region      var
#>   <fct>     <dbl>
#> 1 Agadez    6249.
#> 2 Diffa     1950.
#> 3 Dosso     2771.
#> 4 Maradi    3207.
#> 5 Niamey    5728.
#> 6 Tahoua    2487.
#> 7 Tillabéri 2506.
#> 8 Zinder    2926.

sample <- sampling_design() |>
  stratify_by(region, alloc = "neyman", variance = niger_eas_variance) |>
  draw(n = 300) |>
  execute(niger_eas, seed = 42)

sample |>
  count(region)
#> # A tibble: 8 × 2
#>   region        n
#>   <fct>     <int>
#> 1 Agadez       14
#> 2 Diffa        10
#> 3 Dosso        33
#> 4 Maradi       57
#> 5 Niamey       41
#> 6 Tahoua       48
#> 7 Tillabéri    39
#> 8 Zinder       58

Optimal allocation extends Neyman allocation by also accounting for differential costs across strata. It minimizes variance for a fixed budget (or cost for fixed precision).

data(niger_eas_cost)

niger_eas_cost
#> # A tibble: 8 × 2
#>   region     cost
#>   <chr>     <dbl>
#> 1 Agadez      180
#> 2 Diffa       150
#> 3 Dosso        60
#> 4 Maradi       55
#> 5 Niamey       70
#> 6 Tahoua       50
#> 7 Tillabéri    65
#> 8 Zinder       40

sample <- sampling_design() |>
  stratify_by(region, alloc = "optimal",
              variance = niger_eas_variance,
              cost = niger_eas_cost) |>
  draw(n = 300) |>
  execute(niger_eas, seed = 42)

sample |>
  count(region)
#> # A tibble: 8 × 2
#>   region        n
#>   <fct>     <int>
#> 1 Agadez        8
#> 2 Diffa         6
#> 3 Dosso        32
#> 4 Maradi       58
#> 5 Niamey       38
#> 6 Tahoua       52
#> 7 Tillabéri    36
#> 8 Zinder       70

Sample Size Bounds

When using allocation methods, you can constrain stratum sample sizes with min_n and max_n in draw().

Minimum sample size ensures every stratum gets at least the specified number of units. This is essential for variance estimation (requires n ≥ 2 per stratum) or when you need reliable subgroup estimates.

# Neyman allocation with minimum 5 per stratum
sample <- sampling_design() |>
  stratify_by(region, alloc = "neyman", variance = niger_eas_variance) |>
  draw(n = 300, min_n = 20) |>
  execute(niger_eas, seed = 42)

sample |>
  count(region)
#> # A tibble: 8 × 2
#>   region        n
#>   <fct>     <int>
#> 1 Agadez       20
#> 2 Diffa        20
#> 3 Dosso        32
#> 4 Maradi       53
#> 5 Niamey       39
#> 6 Tahoua       45
#> 7 Tillabéri    37
#> 8 Zinder       54

Maximum sample size caps large strata to prevent them from dominating the sample. This is useful when operational constraints limit how many units you can handle in any one stratum.

# Proportional allocation but cap at 60 per stratum
sample <- sampling_design() |>
  stratify_by(region, alloc = "proportional") |>
  draw(n = 300, max_n = 55) |>
  execute(niger_eas, seed = 42)

sample |>
  count(region)
#> # A tibble: 8 × 2
#>   region        n
#>   <fct>     <int>
#> 1 Agadez       12
#> 2 Diffa        14
#> 3 Dosso        36
#> 4 Maradi       55
#> 5 Niamey       31
#> 6 Tahoua       54
#> 7 Tillabéri    43
#> 8 Zinder       55

Both bounds together create a feasible range for each stratum:

sample <- sampling_design() |>
  stratify_by(region, alloc = "proportional") |>
  draw(n = 300, min_n = 20, max_n = 55) |>
  execute(niger_eas, seed = 42)

sample |>
  count(region)
#> # A tibble: 8 × 2
#>   region        n
#>   <fct>     <int>
#> 1 Agadez       20
#> 2 Diffa        20
#> 3 Dosso        33
#> 4 Maradi       52
#> 5 Niamey       30
#> 6 Tahoua       50
#> 7 Tillabéri    40
#> 8 Zinder       55

When a stratum’s population is smaller than min_n, the entire stratum is selected (capped at population size).

Custom allocation lets you specify exact sample sizes or rates per stratum by passing a data frame to draw().

sizes_df <- data.frame(
  region = levels(niger_eas$region),
  n = c(15, 25, 50, 60, 55, 45, 70, 30)
)

sample <- sampling_design() |>
  stratify_by(region) |>
  draw(n = sizes_df) |>
  execute(niger_eas, seed = 42)

sample |>
  count(region)
#> # A tibble: 8 × 2
#>   region        n
#>   <fct>     <int>
#> 1 Agadez       15
#> 2 Diffa        25
#> 3 Dosso        50
#> 4 Maradi       60
#> 5 Niamey       55
#> 6 Tahoua       45
#> 7 Tillabéri    70
#> 8 Zinder       30

Multiple Stratification Variables

You can stratify by multiple variables to create crossed strata. Here we stratify by both region and urban/rural status.

sample <- sampling_design() |>
  stratify_by(region, strata, alloc = "proportional") |>
  draw(n = 300) |>
  execute(niger_eas, seed = 42)

sample |>
  count(region, strata) |>
  head(10)
#> # A tibble: 10 × 3
#>    region strata     n
#>    <fct>  <fct>  <int>
#>  1 Agadez Urban      5
#>  2 Agadez Rural      5
#>  3 Diffa  Urban      1
#>  4 Diffa  Rural     11
#>  5 Dosso  Urban      3
#>  6 Dosso  Rural     32
#>  7 Maradi Urban     11
#>  8 Maradi Rural     45
#>  9 Niamey Urban     28
#> 10 Niamey Rural      2

Cluster Sampling

Cluster sampling selects groups (clusters) rather than individual units. This is practical when a complete list of individuals isn’t available or when travel costs make dispersed sampling expensive.

Use cluster_by() to specify the cluster identifier. All units within selected clusters are included.

sample <- sampling_design() |>
  cluster_by(ea_id) |>
  draw(n = 50) |>
  execute(niger_eas, seed = 42)

sample |>
  summarise(
    n_clusters = n_distinct(ea_id),
    n_units = n()
  )
#> # A tibble: 1 × 2
#>   n_clusters n_units
#>        <int>   <int>
#> 1         50      54

PPS Sampling

Probability Proportional to Size (PPS) sampling selects units with probability proportional to a size measure. This is standard for cluster sampling where clusters vary in size.

Use the mos argument to specify the measure of size variable. Here we select EAs with probability proportional to their household count.

sample <- sampling_design() |>
  cluster_by(ea_id) |>
  draw(n = 50, method = "pps_brewer", mos = hh_count) |>
  execute(niger_eas, seed = 42)

sample |>
  summarise(
    mean_hh = mean(hh_count),
    median_hh = median(hh_count)
  )
#> # A tibble: 1 × 2
#>   mean_hh median_hh
#>     <dbl>     <dbl>
#> 1    123.       112

Compared to simple random sampling, PPS tends to select larger clusters:

sample_srs <- sampling_design() |>
  cluster_by(ea_id) |>
  draw(n = 50) |>
  execute(niger_eas, seed = 42)

bind_rows(
  sample |> summarise(method = "PPS", mean_hh = mean(hh_count)),
  sample_srs |> summarise(method = "SRS", mean_hh = mean(hh_count))
)
#> # A tibble: 2 × 2
#>   method mean_hh
#>   <chr>    <dbl>
#> 1 PPS       123.
#> 2 SRS       106.

PPS Methods

samplyr supports several PPS methods:

Method Description
pps_brewer Brewer’s method - fast with good properties (recommended)
pps_systematic PPS systematic - simple but can have periodicity issues
pps_maxent Maximum entropy - highest entropy among fixed-size methods
pps_poisson PPS Poisson - random sample size (requires frac)
pps_multinomial PPS with replacement

Multi-Stage Sampling

Multi-stage designs sample in stages: first select primary sampling units (PSUs), then sample within them. This is the standard approach for large-scale surveys when a complete frame of ultimate units isn’t available upfront.

Use stage() to delimit each stage of the design.

Two-Stage Design

For this example, we’ll use tanzania_schools to demonstrate a two-stage design: first select districts as PSUs, then sample schools within selected districts.

data(tanzania_schools)

# First, let's see the structure: schools nested within districts
tanzania_schools |>
  count(region, district) |>
  head(10)
#> # A tibble: 10 × 3
#>    region        district            n
#>    <fct>         <fct>           <int>
#>  1 Arusha        Arusha City        82
#>  2 Arusha        Arusha District    79
#>  3 Arusha        Meru              101
#>  4 Arusha        Monduli            85
#>  5 Dar es Salaam Ilala             163
#>  6 Dar es Salaam Kinondoni         177
#>  7 Dar es Salaam Temeke            152
#>  8 Dodoma        Chamwino           77
#>  9 Dodoma        Dodoma Urban       88
#> 10 Dodoma        Kondoa             87

We need to create a measure of size for districts. Here we’ll use total enrollment per district.

# Add district-level enrollment as MOS
schools_frame <- tanzania_schools |>
  group_by(district) |>
  mutate(district_enrollment = sum(enrollment)) |>
  ungroup()

Now we can run a two-stage design: select 10 districts with PPS, then sample 5 schools within each.

sample <- sampling_design() |>
  stage(label = "Districts") |>
    cluster_by(district) |>
    draw(n = 10, method = "pps_brewer", mos = district_enrollment) |>
  stage(label = "Schools") |>
    draw(n = 5) |>
  execute(schools_frame, seed = 42)

sample |>
  summarise(
    n_districts = n_distinct(district),
    n_schools = n()
  )
#> # A tibble: 1 × 2
#>   n_districts n_schools
#>         <int>     <int>
#> 1          10        50

Stratified Multi-Stage Design

Combining stratification with multi-stage sampling is standard for national surveys. Here we stratify by school level, then select districts and schools within each stratum.

sample <- sampling_design() |>
  stage(label = "Districts") |>
    stratify_by(school_level) |>
    cluster_by(district) |>
    draw(n = 5, method = "pps_brewer", mos = district_enrollment) |>
  stage(label = "Schools") |>
    draw(n = 3) |>
  execute(schools_frame, seed = 42)

sample |>
  count(school_level, district)
#> # A tibble: 17 × 3
#>    school_level district           n
#>    <fct>        <fct>          <int>
#>  1 Primary      Ilala              2
#>  2 Primary      Ilemela            1
#>  3 Primary      Kilombero          2
#>  4 Primary      Kondoa             3
#>  5 Primary      Meru               2
#>  6 Primary      Morogoro Rural     2
#>  7 Primary      Moshi Rural        3
#>  8 Primary      Muheza             1
#>  9 Primary      Sengerema          2
#> 10 Primary      Tanga City         3
#> 11 Secondary    Ilala              1
#> 12 Secondary    Ilemela            2
#> 13 Secondary    Kilombero          1
#> 14 Secondary    Meru               1
#> 15 Secondary    Morogoro Rural     1
#> 16 Secondary    Muheza             2
#> 17 Secondary    Sengerema          1

Operational Multi-Stage Sampling

In practice, multi-stage surveys often execute stages at different times. After selecting PSUs, field teams may need to conduct listing or other operations before the second stage can proceed.

The stages argument lets you execute only specific stages.

design <- sampling_design() |>
  stage(label = "Districts") |>
    stratify_by(region) |>
    cluster_by(district) |>
    draw(n = 2, method = "pps_brewer", mos = district_enrollment) |>
  stage(label = "Schools") |>
    draw(n = 4)

# Execute stage 1 only
selected_districts <- execute(design, schools_frame, stages = 1, seed = 42)

selected_districts |>
  distinct(region, district)
#> # A tibble: 14 × 2
#>    region        district      
#>    <fct>         <fct>         
#>  1 Arusha        Meru          
#>  2 Arusha        Monduli       
#>  3 Dar es Salaam Ilala         
#>  4 Dar es Salaam Kinondoni     
#>  5 Dodoma        Dodoma Urban  
#>  6 Dodoma        Kondoa        
#>  7 Kilimanjaro   Hai           
#>  8 Kilimanjaro   Moshi Urban   
#>  9 Morogoro      Morogoro Urban
#> 10 Morogoro      Ulanga        
#> 11 Mwanza        Magu          
#> 12 Mwanza        Nyamagana     
#> 13 Tanga         Korogwe       
#> 14 Tanga         Tanga City

After fieldwork or preparation is complete, execute the remaining stages.

final_sample <- selected_districts |>
  execute(schools_frame, seed = 43)

final_sample |>
  count(region, district)
#> # A tibble: 14 × 3
#>    region        district           n
#>    <fct>         <fct>          <int>
#>  1 Arusha        Meru               4
#>  2 Arusha        Monduli            4
#>  3 Dar es Salaam Ilala              4
#>  4 Dar es Salaam Kinondoni          4
#>  5 Dodoma        Dodoma Urban       4
#>  6 Dodoma        Kondoa             4
#>  7 Kilimanjaro   Hai                4
#>  8 Kilimanjaro   Moshi Urban        4
#>  9 Morogoro      Morogoro Urban     4
#> 10 Morogoro      Ulanga             4
#> 11 Mwanza        Magu               4
#> 12 Mwanza        Nyamagana          4
#> 13 Tanga         Korogwe            4
#> 14 Tanga         Tanga City         4

Working with Results

Accessing the Design

The sampling design is attached to the result and can be retrieved with get_design().

sample <- sampling_design() |>
  stratify_by(region, alloc = "proportional") |>
  draw(n = 200) |>
  execute(niger_eas, seed = 42)

design <- get_design(sample)
design$stages[[1]]$strata$vars
#> [1] "region"

Weight Diagnostics

Unequal weights reduce effective sample size. Use effective_n() to compute the effective sample size and design_effect() for the design effect (ratio of actual variance to variance under SRS).

effective_n(sample$.weight)
#> [1] 199.9336

design_effect(sample$.weight)
#> [1] 1.000332

Method Reference

Selection Methods

Method Sample Size Replacement Notes
srswor Fixed No Default, general purpose
srswr Fixed Yes Bootstrap, rare populations
systematic Fixed No Ordered frames
bernoulli Random No Simple field protocols
pps_brewer Fixed No Recommended PPS method
pps_systematic Fixed No Simple PPS
pps_maxent Fixed No Joint inclusion probs available
pps_poisson Random No PPS with random size
pps_multinomial Fixed Yes PPS with replacement

Allocation Methods

Method Description Requirements
(none) n per stratum
equal Same n per stratum
proportional n ∝ stratum size
neyman Minimize variance variance data frame
optimal Minimize variance/cost variance + cost data frames

For custom stratum-specific sizes or rates, pass a data frame directly to n or frac in draw().

Sample Size Bounds

Argument Description
min_n Minimum sample size per stratum (for variance estimation, set ≥ 2)
max_n Maximum sample size per stratum (to cap dominant strata)

These only apply when using allocation methods (equal, proportional, neyman, optimal).


Best Practices

  1. Always set a seed for reproducibility
  2. Use meaningful stage labels for documentation
  3. Validate designs with validate_frame() before execution
  4. Check weight distributions with effective_n() and design_effect()
design <- sampling_design(title = "National Health Survey 2024") |>
  stage(label = "Enumeration Areas") |>
    stratify_by(region, strata) |>
    cluster_by(ea_id) |>
    draw(n = 5, method = "pps_brewer", mos = hh_count) |>
  stage(label = "Households") |>
    draw(n = 20)

print(design)
#> == Sampling Design ==
#> Title: National Health Survey 2024 
#> Stages: 2 
#> 
#> -- Stage 1 : Enumeration Areas --
#> * Stratify by: region, strata 
#> * Cluster by: ea_id 
#> * Draw: n = 5, method = pps_brewer, mos = hh_count 
#> 
#> -- Stage 2 : Households --
#> * Draw: n = 20, method = srswor