samplyr provides a tidy grammar for specifying and executing survey sampling designs. The package is built around a minimal set of composable verbs that handle stratified, clustered, and multi-stage sampling.
The core idea is that sampling code should read like its English description:
# "Stratify by region, proportionally allocate 500 samples, execute"
sampling_design() |>
stratify_by(region, alloc = "proportional") |>
draw(n = 500) |>
execute(frame, seed = 42)samplyr uses 5 verbs and 1 modifier:
| Function | Purpose |
|---|---|
sampling_design() |
Create a new sampling design |
stratify_by() |
Define stratification variables and allocation |
cluster_by() |
Define cluster/PSU variable |
draw() |
Specify sample size, method, and measure of size |
execute() |
Run the design on a frame |
stage() |
Delimit stages in multi-stage designs |
We’ll use the niger_eas dataset throughout this
vignette. It’s a synthetic enumeration area frame inspired by DHS
household surveys, with ~1,500 EAs across 8 regions.
data(niger_eas)
niger_eas |>
glimpse()
#> Rows: 1,536
#> Columns: 6
#> $ ea_id <chr> "Aga_Aga_0001", "Aga_Aga_0002", "Aga_Aga_0003", "Aga_Aga_…
#> $ region <fct> Agadez, Agadez, Agadez, Agadez, Agadez, Agadez, Agadez, A…
#> $ department <fct> Agadez, Agadez, Agadez, Agadez, Agadez, Agadez, Agadez, A…
#> $ strata <fct> Rural, Urban, Urban, Rural, Urban, Rural, Rural, Urban, R…
#> $ hh_count <dbl> 59, 157, 124, 146, 112, 182, 166, 54, 97, 192, 54, 109, 6…
#> $ pop_estimate <dbl> 413, 942, 868, 1022, 896, 1092, 1162, 432, 582, 1344, 378…The most basic design selects n units at random from the frame.
sample <- sampling_design() |>
draw(n = 100) |>
execute(niger_eas, seed = 42)
nrow(sample)
#> [1] 100The result includes the original columns plus sampling metadata:
.weight (sampling weight) and .prob (inclusion
probability).
sample |>
select(ea_id, region, strata, .weight, .prob) |>
head()
#> == tbl_sample ==
#> Weights: 15.36 - 15.36 (mean: 15.36 )
#>
#> # A tibble: 6 × 5
#> ea_id region strata .weight .prob
#> <chr> <fct> <fct> <dbl> <dbl>
#> 1 Mar_Tes_0023 Maradi Rural 15.4 0.0651
#> 2 Mar_Agu_0027 Maradi Urban 15.4 0.0651
#> 3 Til_Tér_0015 Tillabéri Rural 15.4 0.0651
#> 4 Til_Oua_0012 Tillabéri Rural 15.4 0.0651
#> 5 Zin_Dun_0021 Zinder Urban 15.4 0.0651
#> 6 Til_Tér_0008 Tillabéri Rural 15.4 0.0651The method argument controls how units are selected. By
default, samplyr uses simple random sampling without replacement
(srswor).
Systematic sampling selects units at fixed intervals, which can improve precision when the frame is ordered.
sample_sys <- sampling_design() |>
draw(n = 100, method = "systematic") |>
execute(niger_eas, seed = 42)Sampling with replacement allows the same unit to be selected multiple times, useful for bootstrap procedures.
sample_wr <- sampling_design() |>
draw(n = 100, method = "srswr") |>
execute(niger_eas, seed = 42)Instead of a fixed sample size, you can specify a sampling fraction
with frac.
sample <- sampling_design() |>
draw(frac = 0.05) |>
execute(niger_eas, seed = 42)
nrow(sample)
#> [1] 77Bernoulli sampling selects each unit independently with the specified probability. This gives a random sample size, which can simplify field protocols.
sample <- sampling_design() |>
draw(frac = 0.05, method = "bernoulli") |>
execute(niger_eas, seed = 42)
nrow(sample)
#> [1] 64Stratification partitions the frame into non-overlapping groups (strata) and samples from each. This ensures representation from all subgroups and often improves precision.
Without an allocation method, n specifies the sample
size within each stratum.
sample <- sampling_design() |>
stratify_by(region) |>
draw(n = 20) |>
execute(niger_eas, seed = 42)
sample |>
count(region)
#> # A tibble: 8 × 2
#> region n
#> <fct> <int>
#> 1 Agadez 20
#> 2 Diffa 20
#> 3 Dosso 20
#> 4 Maradi 20
#> 5 Niamey 20
#> 6 Tahoua 20
#> 7 Tillabéri 20
#> 8 Zinder 20With an allocation method, n becomes the total sample
size to distribute across strata.
Proportional allocation distributes the sample proportionally to stratum sizes, ensuring the sample mirrors the population structure.
sample <- sampling_design() |>
stratify_by(region, alloc = "proportional") |>
draw(n = 300) |>
execute(niger_eas, seed = 42)
sample |>
count(region) |>
mutate(pct = n / sum(n))
#> # A tibble: 8 × 3
#> region n pct
#> <fct> <int> <dbl>
#> 1 Agadez 10 0.0333
#> 2 Diffa 13 0.0433
#> 3 Dosso 35 0.117
#> 4 Maradi 56 0.187
#> 5 Niamey 30 0.1
#> 6 Tahoua 54 0.18
#> 7 Tillabéri 43 0.143
#> 8 Zinder 59 0.197Equal allocation assigns the same sample size to each stratum, regardless of population size. This maximizes precision for comparisons between strata.
sample <- sampling_design() |>
stratify_by(region, alloc = "equal") |>
draw(n = 160) |>
execute(niger_eas, seed = 42)
sample |>
count(region)
#> # A tibble: 8 × 2
#> region n
#> <fct> <int>
#> 1 Agadez 20
#> 2 Diffa 20
#> 3 Dosso 20
#> 4 Maradi 20
#> 5 Niamey 20
#> 6 Tahoua 20
#> 7 Tillabéri 20
#> 8 Zinder 20Neyman allocation minimizes the variance of the overall estimate by allocating more sample to strata with higher variability. It requires prior information about stratum variances.
data(niger_eas_variance)
niger_eas_variance
#> # A tibble: 8 × 2
#> region var
#> <fct> <dbl>
#> 1 Agadez 6249.
#> 2 Diffa 1950.
#> 3 Dosso 2771.
#> 4 Maradi 3207.
#> 5 Niamey 5728.
#> 6 Tahoua 2487.
#> 7 Tillabéri 2506.
#> 8 Zinder 2926.
sample <- sampling_design() |>
stratify_by(region, alloc = "neyman", variance = niger_eas_variance) |>
draw(n = 300) |>
execute(niger_eas, seed = 42)
sample |>
count(region)
#> # A tibble: 8 × 2
#> region n
#> <fct> <int>
#> 1 Agadez 14
#> 2 Diffa 10
#> 3 Dosso 33
#> 4 Maradi 57
#> 5 Niamey 41
#> 6 Tahoua 48
#> 7 Tillabéri 39
#> 8 Zinder 58Optimal allocation extends Neyman allocation by also accounting for differential costs across strata. It minimizes variance for a fixed budget (or cost for fixed precision).
data(niger_eas_cost)
niger_eas_cost
#> # A tibble: 8 × 2
#> region cost
#> <chr> <dbl>
#> 1 Agadez 180
#> 2 Diffa 150
#> 3 Dosso 60
#> 4 Maradi 55
#> 5 Niamey 70
#> 6 Tahoua 50
#> 7 Tillabéri 65
#> 8 Zinder 40
sample <- sampling_design() |>
stratify_by(region, alloc = "optimal",
variance = niger_eas_variance,
cost = niger_eas_cost) |>
draw(n = 300) |>
execute(niger_eas, seed = 42)
sample |>
count(region)
#> # A tibble: 8 × 2
#> region n
#> <fct> <int>
#> 1 Agadez 8
#> 2 Diffa 6
#> 3 Dosso 32
#> 4 Maradi 58
#> 5 Niamey 38
#> 6 Tahoua 52
#> 7 Tillabéri 36
#> 8 Zinder 70When using allocation methods, you can constrain stratum sample sizes
with min_n and max_n in
draw().
Minimum sample size ensures every stratum gets at least the specified number of units. This is essential for variance estimation (requires n ≥ 2 per stratum) or when you need reliable subgroup estimates.
# Neyman allocation with minimum 5 per stratum
sample <- sampling_design() |>
stratify_by(region, alloc = "neyman", variance = niger_eas_variance) |>
draw(n = 300, min_n = 20) |>
execute(niger_eas, seed = 42)
sample |>
count(region)
#> # A tibble: 8 × 2
#> region n
#> <fct> <int>
#> 1 Agadez 20
#> 2 Diffa 20
#> 3 Dosso 32
#> 4 Maradi 53
#> 5 Niamey 39
#> 6 Tahoua 45
#> 7 Tillabéri 37
#> 8 Zinder 54Maximum sample size caps large strata to prevent them from dominating the sample. This is useful when operational constraints limit how many units you can handle in any one stratum.
# Proportional allocation but cap at 60 per stratum
sample <- sampling_design() |>
stratify_by(region, alloc = "proportional") |>
draw(n = 300, max_n = 55) |>
execute(niger_eas, seed = 42)
sample |>
count(region)
#> # A tibble: 8 × 2
#> region n
#> <fct> <int>
#> 1 Agadez 12
#> 2 Diffa 14
#> 3 Dosso 36
#> 4 Maradi 55
#> 5 Niamey 31
#> 6 Tahoua 54
#> 7 Tillabéri 43
#> 8 Zinder 55Both bounds together create a feasible range for each stratum:
sample <- sampling_design() |>
stratify_by(region, alloc = "proportional") |>
draw(n = 300, min_n = 20, max_n = 55) |>
execute(niger_eas, seed = 42)
sample |>
count(region)
#> # A tibble: 8 × 2
#> region n
#> <fct> <int>
#> 1 Agadez 20
#> 2 Diffa 20
#> 3 Dosso 33
#> 4 Maradi 52
#> 5 Niamey 30
#> 6 Tahoua 50
#> 7 Tillabéri 40
#> 8 Zinder 55When a stratum’s population is smaller than min_n, the
entire stratum is selected (capped at population size).
Custom allocation lets you specify exact sample
sizes or rates per stratum by passing a data frame to
draw().
sizes_df <- data.frame(
region = levels(niger_eas$region),
n = c(15, 25, 50, 60, 55, 45, 70, 30)
)
sample <- sampling_design() |>
stratify_by(region) |>
draw(n = sizes_df) |>
execute(niger_eas, seed = 42)
sample |>
count(region)
#> # A tibble: 8 × 2
#> region n
#> <fct> <int>
#> 1 Agadez 15
#> 2 Diffa 25
#> 3 Dosso 50
#> 4 Maradi 60
#> 5 Niamey 55
#> 6 Tahoua 45
#> 7 Tillabéri 70
#> 8 Zinder 30You can stratify by multiple variables to create crossed strata. Here we stratify by both region and urban/rural status.
sample <- sampling_design() |>
stratify_by(region, strata, alloc = "proportional") |>
draw(n = 300) |>
execute(niger_eas, seed = 42)
sample |>
count(region, strata) |>
head(10)
#> # A tibble: 10 × 3
#> region strata n
#> <fct> <fct> <int>
#> 1 Agadez Urban 5
#> 2 Agadez Rural 5
#> 3 Diffa Urban 1
#> 4 Diffa Rural 11
#> 5 Dosso Urban 3
#> 6 Dosso Rural 32
#> 7 Maradi Urban 11
#> 8 Maradi Rural 45
#> 9 Niamey Urban 28
#> 10 Niamey Rural 2Cluster sampling selects groups (clusters) rather than individual units. This is practical when a complete list of individuals isn’t available or when travel costs make dispersed sampling expensive.
Use cluster_by() to specify the cluster identifier. All
units within selected clusters are included.
sample <- sampling_design() |>
cluster_by(ea_id) |>
draw(n = 50) |>
execute(niger_eas, seed = 42)
sample |>
summarise(
n_clusters = n_distinct(ea_id),
n_units = n()
)
#> # A tibble: 1 × 2
#> n_clusters n_units
#> <int> <int>
#> 1 50 54Probability Proportional to Size (PPS) sampling selects units with probability proportional to a size measure. This is standard for cluster sampling where clusters vary in size.
Use the mos argument to specify the measure of size
variable. Here we select EAs with probability proportional to their
household count.
sample <- sampling_design() |>
cluster_by(ea_id) |>
draw(n = 50, method = "pps_brewer", mos = hh_count) |>
execute(niger_eas, seed = 42)
sample |>
summarise(
mean_hh = mean(hh_count),
median_hh = median(hh_count)
)
#> # A tibble: 1 × 2
#> mean_hh median_hh
#> <dbl> <dbl>
#> 1 123. 112Compared to simple random sampling, PPS tends to select larger clusters:
sample_srs <- sampling_design() |>
cluster_by(ea_id) |>
draw(n = 50) |>
execute(niger_eas, seed = 42)
bind_rows(
sample |> summarise(method = "PPS", mean_hh = mean(hh_count)),
sample_srs |> summarise(method = "SRS", mean_hh = mean(hh_count))
)
#> # A tibble: 2 × 2
#> method mean_hh
#> <chr> <dbl>
#> 1 PPS 123.
#> 2 SRS 106.samplyr supports several PPS methods:
| Method | Description |
|---|---|
pps_brewer |
Brewer’s method - fast with good properties (recommended) |
pps_systematic |
PPS systematic - simple but can have periodicity issues |
pps_maxent |
Maximum entropy - highest entropy among fixed-size methods |
pps_poisson |
PPS Poisson - random sample size (requires frac) |
pps_multinomial |
PPS with replacement |
Multi-stage designs sample in stages: first select primary sampling units (PSUs), then sample within them. This is the standard approach for large-scale surveys when a complete frame of ultimate units isn’t available upfront.
Use stage() to delimit each stage of the design.
For this example, we’ll use tanzania_schools to
demonstrate a two-stage design: first select districts as PSUs, then
sample schools within selected districts.
data(tanzania_schools)
# First, let's see the structure: schools nested within districts
tanzania_schools |>
count(region, district) |>
head(10)
#> # A tibble: 10 × 3
#> region district n
#> <fct> <fct> <int>
#> 1 Arusha Arusha City 82
#> 2 Arusha Arusha District 79
#> 3 Arusha Meru 101
#> 4 Arusha Monduli 85
#> 5 Dar es Salaam Ilala 163
#> 6 Dar es Salaam Kinondoni 177
#> 7 Dar es Salaam Temeke 152
#> 8 Dodoma Chamwino 77
#> 9 Dodoma Dodoma Urban 88
#> 10 Dodoma Kondoa 87We need to create a measure of size for districts. Here we’ll use total enrollment per district.
# Add district-level enrollment as MOS
schools_frame <- tanzania_schools |>
group_by(district) |>
mutate(district_enrollment = sum(enrollment)) |>
ungroup()Now we can run a two-stage design: select 10 districts with PPS, then sample 5 schools within each.
sample <- sampling_design() |>
stage(label = "Districts") |>
cluster_by(district) |>
draw(n = 10, method = "pps_brewer", mos = district_enrollment) |>
stage(label = "Schools") |>
draw(n = 5) |>
execute(schools_frame, seed = 42)
sample |>
summarise(
n_districts = n_distinct(district),
n_schools = n()
)
#> # A tibble: 1 × 2
#> n_districts n_schools
#> <int> <int>
#> 1 10 50Combining stratification with multi-stage sampling is standard for national surveys. Here we stratify by school level, then select districts and schools within each stratum.
sample <- sampling_design() |>
stage(label = "Districts") |>
stratify_by(school_level) |>
cluster_by(district) |>
draw(n = 5, method = "pps_brewer", mos = district_enrollment) |>
stage(label = "Schools") |>
draw(n = 3) |>
execute(schools_frame, seed = 42)
sample |>
count(school_level, district)
#> # A tibble: 17 × 3
#> school_level district n
#> <fct> <fct> <int>
#> 1 Primary Ilala 2
#> 2 Primary Ilemela 1
#> 3 Primary Kilombero 2
#> 4 Primary Kondoa 3
#> 5 Primary Meru 2
#> 6 Primary Morogoro Rural 2
#> 7 Primary Moshi Rural 3
#> 8 Primary Muheza 1
#> 9 Primary Sengerema 2
#> 10 Primary Tanga City 3
#> 11 Secondary Ilala 1
#> 12 Secondary Ilemela 2
#> 13 Secondary Kilombero 1
#> 14 Secondary Meru 1
#> 15 Secondary Morogoro Rural 1
#> 16 Secondary Muheza 2
#> 17 Secondary Sengerema 1In practice, multi-stage surveys often execute stages at different times. After selecting PSUs, field teams may need to conduct listing or other operations before the second stage can proceed.
The stages argument lets you execute only specific
stages.
design <- sampling_design() |>
stage(label = "Districts") |>
stratify_by(region) |>
cluster_by(district) |>
draw(n = 2, method = "pps_brewer", mos = district_enrollment) |>
stage(label = "Schools") |>
draw(n = 4)
# Execute stage 1 only
selected_districts <- execute(design, schools_frame, stages = 1, seed = 42)
selected_districts |>
distinct(region, district)
#> # A tibble: 14 × 2
#> region district
#> <fct> <fct>
#> 1 Arusha Meru
#> 2 Arusha Monduli
#> 3 Dar es Salaam Ilala
#> 4 Dar es Salaam Kinondoni
#> 5 Dodoma Dodoma Urban
#> 6 Dodoma Kondoa
#> 7 Kilimanjaro Hai
#> 8 Kilimanjaro Moshi Urban
#> 9 Morogoro Morogoro Urban
#> 10 Morogoro Ulanga
#> 11 Mwanza Magu
#> 12 Mwanza Nyamagana
#> 13 Tanga Korogwe
#> 14 Tanga Tanga CityAfter fieldwork or preparation is complete, execute the remaining stages.
final_sample <- selected_districts |>
execute(schools_frame, seed = 43)
final_sample |>
count(region, district)
#> # A tibble: 14 × 3
#> region district n
#> <fct> <fct> <int>
#> 1 Arusha Meru 4
#> 2 Arusha Monduli 4
#> 3 Dar es Salaam Ilala 4
#> 4 Dar es Salaam Kinondoni 4
#> 5 Dodoma Dodoma Urban 4
#> 6 Dodoma Kondoa 4
#> 7 Kilimanjaro Hai 4
#> 8 Kilimanjaro Moshi Urban 4
#> 9 Morogoro Morogoro Urban 4
#> 10 Morogoro Ulanga 4
#> 11 Mwanza Magu 4
#> 12 Mwanza Nyamagana 4
#> 13 Tanga Korogwe 4
#> 14 Tanga Tanga City 4The sampling design is attached to the result and can be retrieved
with get_design().
sample <- sampling_design() |>
stratify_by(region, alloc = "proportional") |>
draw(n = 200) |>
execute(niger_eas, seed = 42)
design <- get_design(sample)
design$stages[[1]]$strata$vars
#> [1] "region"Unequal weights reduce effective sample size. Use
effective_n() to compute the effective sample size and
design_effect() for the design effect (ratio of actual
variance to variance under SRS).
effective_n(sample$.weight)
#> [1] 199.9336
design_effect(sample$.weight)
#> [1] 1.000332| Method | Sample Size | Replacement | Notes |
|---|---|---|---|
srswor |
Fixed | No | Default, general purpose |
srswr |
Fixed | Yes | Bootstrap, rare populations |
systematic |
Fixed | No | Ordered frames |
bernoulli |
Random | No | Simple field protocols |
pps_brewer |
Fixed | No | Recommended PPS method |
pps_systematic |
Fixed | No | Simple PPS |
pps_maxent |
Fixed | No | Joint inclusion probs available |
pps_poisson |
Random | No | PPS with random size |
pps_multinomial |
Fixed | Yes | PPS with replacement |
| Method | Description | Requirements |
|---|---|---|
| (none) | n per stratum | — |
equal |
Same n per stratum | — |
proportional |
n ∝ stratum size | — |
neyman |
Minimize variance |
variance data frame |
optimal |
Minimize variance/cost |
variance + cost data frames |
For custom stratum-specific sizes or rates, pass a data frame
directly to n or frac in
draw().
validate_frame()
before executioneffective_n() and design_effect()
design <- sampling_design(title = "National Health Survey 2024") |>
stage(label = "Enumeration Areas") |>
stratify_by(region, strata) |>
cluster_by(ea_id) |>
draw(n = 5, method = "pps_brewer", mos = hh_count) |>
stage(label = "Households") |>
draw(n = 20)
print(design)
#> == Sampling Design ==
#> Title: National Health Survey 2024
#> Stages: 2
#>
#> -- Stage 1 : Enumeration Areas --
#> * Stratify by: region, strata
#> * Cluster by: ea_id
#> * Draw: n = 5, method = pps_brewer, mos = hh_count
#>
#> -- Stage 2 : Households --
#> * Draw: n = 20, method = srswor