A tidy grammar for survey sampling in R. samplyr provides a minimal set of composable verbs for stratified, clustered, multi-stage, and multi-phase sampling designs with PPS methods, sample coordination, and panel rotation.
Why samplyr?
samplyr is built around a simple idea: sampling code should read like its English description.
library(samplyr)
data(bfa_eas)
# "Stratify by region, proportionally allocate 500 units, execute"
sampling_design() |>
stratify_by(region, alloc = "proportional") |>
draw(n = 500) |>
execute(bfa_eas, seed = 1)
#> # A tbl_sample: 500 × 17
#> # Weights: 29.8 [28.95, 30.32]
#> ea_id region province commune urban_rural population households area_km2
#> * <chr> <fct> <fct> <fct> <fct> <dbl> <int> <dbl>
#> 1 EA_10155 Boucle … Mouhoun Ouarko… Rural 1347 187 33.0
#> 2 EA_03955 Boucle … Kossi Doumba… Rural 1558 211 16.6
#> 3 EA_10325 Boucle … Bale Ouri Rural 767 92 17.6
#> 4 EA_03209 Boucle … Mouhoun Dedoug… Rural 446 79 18.1
#> 5 EA_12881 Boucle … Banwa Tansila Rural 1076 137 9.45
#> 6 EA_11613 Boucle … Banwa Sami Rural 912 118 43.2
#> 7 EA_06857 Boucle … Banwa Kouka Rural 1642 189 12.5
#> 8 EA_13730 Boucle … Nayala Toma Rural 973 101 0.88
#> 9 EA_05972 Boucle … Sourou Kiemba… Rural 1359 216 36.6
#> 10 EA_03571 Boucle … Kossi Djibas… Rural 725 86 1.01
#> # ℹ 490 more rows
#> # ℹ 9 more variables: accessible <lgl>, dist_road_km <dbl>,
#> # food_insecurity_pct <dbl>, cost <dbl>, .weight <dbl>, .sample_id <int>,
#> # .stage <int>, .weight_1 <dbl>, .fpc_1 <int>
Consider a real survey design from Lohr (2022, Example 7.1), based on a 1991 study of bed net use in rural Gambia (D’Alessandro et al., 1994):
Malaria morbidity can be reduced by using bed nets impregnated with insecticide, but this is only effective if the bed nets are in widespread use. In 1991, a nationwide survey was designed to estimate the prevalence of bed net use in rural areas of the Gambia (D’Alessandro et al., 1994).
The sampling frame consisted of all rural villages of fewer than 3,000 people. The villages were stratified by three geographic regions (eastern, central, and western) and by whether the village had a public health clinic (PHC) or not. In each region five districts were chosen with probability proportional to the district population. In each district four villages were chosen, again with probability proportional to census population: two PHC villages and two non-PHC villages. Finally, six compounds were chosen more or less randomly from each village.
In samplyr, this three-stage stratified cluster design translates directly into code:
design <- sampling_design(title = "Gambia bed nets") |>
add_stage() |>
stratify_by(region) |>
cluster_by(district) |>
draw(n = 5, method = "pps_brewer", mos = population) |>
add_stage() |>
stratify_by(phc) |>
cluster_by(village) |>
draw(n = 2, method = "pps_brewer", mos = population) |>
add_stage() |>
draw(n = 6)
design
#> ── Sampling Design: Gambia bed nets ──────────────────────────────────────────────────────────────────
#>
#> ℹ 3 stages
#>
#> ── Stage 1 ───────────────────────────────────────────────────────────────────────────────────────────
#> • Strata: region
#> • Cluster: district
#> • Draw: n = 5, method = pps_brewer, mos = population
#>
#> ── Stage 2 ───────────────────────────────────────────────────────────────────────────────────────────
#> • Strata: phc
#> • Cluster: village
#> • Draw: n = 2, method = pps_brewer, mos = population
#>
#> ── Stage 3 ───────────────────────────────────────────────────────────────────────────────────────────
#> • Draw: n = 6, method = srswor
The samplyr code mirrors the verbal description verb for verb.
Lohr, S. L. (2022). Sampling: Design and Analysis (3rd ed.). CRC Press.
Installation
# Install sondage first (sampling algorithms backend)
pak::pkg_install("gitlab::dickoa/sondage")
# Install svyplan (sample size, precision, power, and stratification)
pak::pkg_install("gitlab::dickoa/svyplan")
# Install samplyr
pak::pkg_install("gitlab::dickoa/samplyr")
The Grammar
samplyr uses 5 verbs and 1 modifier:
| Function | Purpose |
|---|---|
| sampling_design() | Create a new sampling design |
| stratify_by() | Define stratification and allocation |
| cluster_by() | Define cluster/PSU variable |
| draw() | Specify sample size and method |
| execute() | Run the design on a frame |
| add_stage() | Delimit stages in multi-stage designs |
Frame-Independent Design
stratify_by() and cluster_by() take bare column names. The design is stored as a specification and resolved only when a frame is available (validate_frame(), execute(), as_svydesign()), so design specification stays separate from execution.
design <- sampling_design() |>
stratify_by(region, alloc = "proportional") |>
cluster_by(ea_id) |>
draw(n = 300)
sample <- execute(design, bfa_eas, seed = 2)
sample
#> # A tbl_sample: 300 × 17
#> # Weights: 49.67 [47.64, 50.67]
#> ea_id region province commune urban_rural population households area_km2
#> * <chr> <fct> <fct> <fct> <fct> <dbl> <int> <dbl>
#> 1 EA_00466 Boucle … Bale Bana Rural 1272 166 1.13
#> 2 EA_10340 Boucle … Bale Ouri Rural 2128 254 11.1
#> 3 EA_00373 Boucle … Banwa Balave Rural 1624 256 57.0
#> 4 EA_00380 Boucle … Banwa Balave Rural 1005 158 26.5
#> 5 EA_06860 Boucle … Banwa Kouka Rural 1748 201 18.1
#> 6 EA_12372 Boucle … Banwa Solenzo Rural 1012 121 27.3
#> 7 EA_12393 Boucle … Banwa Solenzo Rural 79 9 5.68
#> 8 EA_12417 Boucle … Banwa Solenzo Rural 1316 157 15.5
#> 9 EA_12874 Boucle … Banwa Tansila Rural 1267 162 44.5
#> 10 EA_12890 Boucle … Banwa Tansila Rural 789 101 15.3
#> # ℹ 290 more rows
#> # ℹ 9 more variables: accessible <lgl>, dist_road_km <dbl>,
#> # food_insecurity_pct <dbl>, cost <dbl>, .weight <dbl>, .sample_id <int>,
#> # .stage <int>, .weight_1 <dbl>, .fpc_1 <int>
Quick Start
library(samplyr)
data(bfa_eas)
# Simple random sample
srs_smpl <- sampling_design() |>
draw(n = 100) |>
execute(bfa_eas, seed = 321)
srs_smpl
#> # A tbl_sample: 100 × 17
#> # Weights: 149 [149, 149]
#> ea_id region province commune urban_rural population households area_km2
#> * <chr> <fct> <fct> <fct> <fct> <dbl> <int> <dbl>
#> 1 EA_03776 Sahel Seno Dori Rural 1813 235 18.7
#> 2 EA_12039 Nord Yatenga Sengue… Rural 1017 126 3.23
#> 3 EA_11709 Boucle … Banwa Sanaba Rural 1074 156 25.2
#> 4 EA_04563 Est Komandj… Gayeri Rural 1302 140 32.2
#> 5 EA_06603 Centre-… Sanmate… Korsim… Rural 632 97 20.2
#> 6 EA_14769 Plateau… Oubrite… Zitenga Rural 1628 232 10.3
#> 7 EA_09105 Centre Kadiogo Ouagad… Urban 1900 360 0.34
#> 8 EA_10304 Plateau… Oubrite… Ourgou… Rural 1309 161 10.6
#> 9 EA_00281 Centre-… Boulgou Bagre Rural 1609 255 30
#> 10 EA_04646 Centre-… Sanguie Godyr Rural 720 109 7.1
#> # ℹ 90 more rows
#> # ℹ 9 more variables: accessible <lgl>, dist_road_km <dbl>,
#> # food_insecurity_pct <dbl>, cost <dbl>, .weight <dbl>, .sample_id <int>,
#> # .stage <int>, .weight_1 <dbl>, .fpc_1 <int>
# Stratified proportional allocation
strata_smpl <- sampling_design() |>
stratify_by(region, alloc = "proportional") |>
draw(n = 300) |>
execute(bfa_eas, seed = 12)
strata_smpl
#> # A tbl_sample: 300 × 17
#> # Weights: 49.67 [47.64, 50.67]
#> ea_id region province commune urban_rural population households area_km2
#> * <chr> <fct> <fct> <fct> <fct> <dbl> <int> <dbl>
#> 1 EA_12451 Boucle … Banwa Solenzo Rural 2360 282 27.0
#> 2 EA_12347 Boucle … Banwa Solenzo Rural 1929 230 30.2
#> 3 EA_11712 Boucle … Banwa Sanaba Rural 816 118 34.3
#> 4 EA_14295 Boucle … Nayala Ye Rural 1281 226 25.9
#> 5 EA_03138 Boucle … Mouhoun Dedoug… Rural 1461 259 18.2
#> 6 EA_11050 Boucle … Bale Pompoi Rural 917 144 26.8
#> 7 EA_13839 Boucle … Sourou Tougan Rural 1073 148 12.2
#> 8 EA_12930 Boucle … Mouhoun Tcheri… Rural 1243 149 1.15
#> 9 EA_02077 Boucle … Kossi Bombor… Rural 1036 136 7.59
#> 10 EA_04889 Boucle … Nayala Gossina Rural 1032 166 27.0
#> # ℹ 290 more rows
#> # ℹ 9 more variables: accessible <lgl>, dist_road_km <dbl>,
#> # food_insecurity_pct <dbl>, cost <dbl>, .weight <dbl>, .sample_id <int>,
#> # .stage <int>, .weight_1 <dbl>, .fpc_1 <int>
# PPS cluster sampling
cluster_smpl <- sampling_design() |>
cluster_by(ea_id) |>
draw(n = 50, method = "pps_brewer", mos = households) |>
execute(bfa_eas, seed = 123)
cluster_smpl
#> # A tbl_sample: 50 × 18
#> # Weights: 319.78 [126.62, 1185.84]
#> ea_id region province commune urban_rural population households area_km2
#> * <chr> <fct> <fct> <fct> <fct> <dbl> <int> <dbl>
#> 1 EA_12440 Boucle … Banwa Solenzo Rural 1481 177 11.5
#> 2 EA_12513 Boucle … Kossi Sono Rural 1566 225 41.1
#> 3 EA_03121 Boucle … Mouhoun Dedoug… Rural 1061 188 11.9
#> 4 EA_12173 Cascades Comoe Sidera… Rural 2610 360 6.97
#> 5 EA_08851 Centre Kadiogo Ouagad… Urban 1635 310 0.21
#> 6 EA_08884 Centre Kadiogo Ouagad… Urban 1514 287 0.32
#> 7 EA_08920 Centre Kadiogo Ouagad… Urban 1647 312 0.48
#> 8 EA_08962 Centre Kadiogo Ouagad… Urban 1726 327 0.27
#> 9 EA_09481 Centre Kadiogo Ouagad… Urban 1556 295 0.18
#> 10 EA_09604 Centre Kadiogo Ouagad… Urban 2568 487 0.97
#> # ℹ 40 more rows
#> # ℹ 10 more variables: accessible <lgl>, dist_road_km <dbl>,
#> # food_insecurity_pct <dbl>, cost <dbl>, .weight <dbl>, .sample_id <int>,
#> # .stage <int>, .weight_1 <dbl>, .fpc_1 <int>, .certainty_1 <lgl>
Multi-Stage Sampling
Use add_stage() to define multi-stage designs. This example selects districts with PPS, then samples EAs within each:
library(dplyr, warn.conflicts = FALSE)
data(zwe_eas)
# Add district-level measure of size
zwe_frame <- zwe_eas |>
mutate(district_hh = sum(households), .by = district)
# Two-stage design: 10 districts, 5 EAs per district
sample <- sampling_design() |>
add_stage(label = "Districts") |>
cluster_by(district) |>
draw(n = 10, method = "pps_brewer", mos = district_hh) |>
add_stage(label = "EAs") |>
draw(n = 5) |>
execute(zwe_frame, seed = 12345)
sample
#> # A tbl_sample: 50 × 16
#> # Weights: 472.68 [196.93, 541.06]
#> ea_id province district urban_rural population households area_km2
#> * <chr> <fct> <fct> <fct> <int> <int> <dbl>
#> 1 EA_00955 Harare Harare Urban 1379 407 0.52
#> 2 EA_01509 Harare Harare Urban 1417 377 0.22
#> 3 EA_00825 Harare Harare Urban 1273 374 0.72
#> 4 EA_00814 Harare Harare Urban 1332 400 0.5
#> 5 EA_00777 Harare Harare Urban 1471 443 0.16
#> 6 EA_02797 Manicaland Chipinge Rural 673 152 12.4
#> 7 EA_03044 Manicaland Chipinge Rural 769 180 9.57
#> 8 EA_02960 Manicaland Chipinge Rural 681 156 9.52
#> 9 EA_03159 Manicaland Chipinge Rural 366 83 9.62
#> 10 EA_03110 Manicaland Chipinge Urban 1011 283 3.78
#> # ℹ 40 more rows
#> # ℹ 9 more variables: district_hh <int>, .weight <dbl>, .sample_id <int>,
#> # .stage <int>, .weight_2 <dbl>, .fpc_2 <int>, .weight_1 <dbl>, .fpc_1 <int>,
#> # .certainty_1 <lgl>
Operational Sampling
Execute stages separately when fieldwork happens between stages:
design <- sampling_design() |>
add_stage(label = "Districts") |>
cluster_by(district) |>
draw(n = 10, method = "pps_brewer", mos = district_hh) |>
add_stage(label = "EAs") |>
draw(n = 5)
# Add district-level measure of size
zwe_frame_agg <- zwe_eas |>
summarize(district_hh = sum(households),
m = n(),
.by = district)
# Execute stage 1 only
selected_districts <- execute(design, zwe_frame_agg, stages = 1, seed = 1)
# ... fieldwork ...
# Rebuild the EA-level frame with the same district-level columns for stage 2
zwe_frame <- zwe_eas |>
mutate(district_hh = sum(households),
m = n(),
.by = district)
# Execute stage 2
final_sample <- selected_districts |> execute(zwe_frame, seed = 2)
Selection Methods
Equal Probability
| Method | Sample Size | Description |
|---|---|---|
| srswor | Fixed | Simple random sampling without replacement (default) |
| srswr | Fixed | Simple random sampling with replacement |
| systematic | Fixed | Systematic sampling |
| bernoulli | Random | Bernoulli sampling |
Probability Proportional to Size (PPS)
| Method | Sample Size | Description |
|---|---|---|
| pps_brewer | Fixed | Brewer’s method (recommended) |
| pps_systematic | Fixed | PPS systematic |
| pps_cps | Fixed | Conditional Poisson sampling |
| pps_poisson | Random | PPS Poisson (PRN) |
| pps_sps | Fixed | Sequential Poisson sampling (PRN) |
| pps_pareto | Fixed | Pareto piPS sampling (PRN) |
| pps_multinomial | Fixed | PPS with replacement |
| pps_chromy | Fixed | PPS with minimum replacement |
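As a rough illustration of what "proportional to size" means for the fixed-size, without-replacement methods, the base R sketch below computes first-order inclusion probabilities as n times each unit's share of the total measure of size, and caps any unit whose probability would exceed 1 (a certainty selection). The helper name pps_inclusion_probs is hypothetical and is not part of samplyr; the package's own handling of certainty units may differ.

```r
# Textbook PPS inclusion probabilities with certainty units (base R only).
# `pps_inclusion_probs` is a hypothetical helper, not a samplyr function.
pps_inclusion_probs <- function(mos, n) {
  stopifnot(all(mos > 0), n <= length(mos))
  pik <- numeric(length(mos))
  certain <- rep(FALSE, length(mos))
  repeat {
    # Spread the remaining sample size over the non-certainty units
    n_left <- n - sum(certain)
    pik[!certain] <- n_left * mos[!certain] / sum(mos[!certain])
    # Units with probability >= 1 are taken with certainty
    newly <- !certain & pik >= 1
    if (!any(newly)) break
    certain <- certain | newly
    pik[certain] <- 1
  }
  pik
}

mos <- c(10, 20, 30, 40, 900)  # one very large unit
pik <- pps_inclusion_probs(mos, n = 2)
pik       # 0.1 0.2 0.3 0.4 1.0
sum(pik)  # equals the requested sample size, 2
```

The large unit is selected with certainty, and the remaining draw is spread over the other four units in proportion to their size.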
Allocation Methods
When stratifying, control how the total sample is distributed:
| Method | Description |
|---|---|
| (none) | n applies per stratum |
| equal | Same sample size in each stratum |
| proportional | Proportional to stratum size |
| neyman | Minimize variance (requires variance) |
| optimal | Minimize cost-variance (requires variance and cost) |
| power | Compromise allocation (requires cv and importance) |
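To make the first two allocation rules concrete, here is a small base R sketch of the textbook formulas: proportional allocation uses n_h = n * N_h / N, and Neyman allocation uses n_h proportional to N_h * S_h. The stratum sizes and standard deviations are made-up toy values, and the code does not call samplyr, so the package's rounding and bound handling may differ.

```r
# Toy strata: population sizes and (assumed) standard deviations
N_h <- c(A = 500, B = 300, C = 200)
S_h <- c(A = 10, B = 20, C = 40)
n <- 100

# Proportional allocation: share of the sample equals share of the population
prop <- n * N_h / sum(N_h)
prop  # A = 50, B = 30, C = 20

# Neyman allocation: more sample where strata are larger or more variable
neyman <- n * N_h * S_h / sum(N_h * S_h)
round(neyman)  # A = 26, B = 32, C = 42
```

Rounding real-valued allocations does not always preserve the total, which is one reason a package has to apply an explicit rounding rule.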
Sample Size Bounds
Use min_n and max_n in draw() to constrain stratum sample sizes when using allocation methods:
data(bfa_eas_variance)
# Ensure at least 2 per stratum (minimum for variance estimation)
sampling_design() |>
stratify_by(region, alloc = "neyman", variance = bfa_eas_variance) |>
draw(n = 300, min_n = 2) |>
execute(bfa_eas, seed = 321)
#> # A tbl_sample: 300 × 17
#> # Weights: 49.67 [36.08, 77.8]
#> ea_id region province commune urban_rural population households area_km2
#> * <chr> <fct> <fct> <fct> <fct> <dbl> <int> <dbl>
#> 1 EA_13725 Boucle … Nayala Toma Rural 1235 128 17.3
#> 2 EA_14301 Boucle … Nayala Ye Rural 1231 217 29.9
#> 3 EA_11709 Boucle … Banwa Sanaba Rural 1074 156 25.2
#> 4 EA_10155 Boucle … Mouhoun Ouarko… Rural 1347 187 33.0
#> 5 EA_06444 Boucle … Mouhoun Kona Rural 1148 181 16.6
#> 6 EA_02527 Boucle … Kossi Bouras… Rural 1268 143 42.5
#> 7 EA_03143 Boucle … Mouhoun Dedoug… Rural 1286 228 12.4
#> 8 EA_12514 Boucle … Kossi Sono Rural 1296 187 23.6
#> 9 EA_14276 Boucle … Nayala Ye Rural 1314 232 2.27
#> 10 EA_12908 Boucle … Mouhoun Tcheri… Rural 1170 140 9.7
#> # ℹ 290 more rows
#> # ℹ 9 more variables: accessible <lgl>, dist_road_km <dbl>,
#> # food_insecurity_pct <dbl>, cost <dbl>, .weight <dbl>, .sample_id <int>,
#> # .stage <int>, .weight_1 <dbl>, .fpc_1 <int>
Custom Allocation
For custom stratum-specific sizes or rates, pass a data frame to n or frac in draw():
# Custom allocation with a data frame
# (`frame` and the region labels are placeholders for your own frame)
sizes_df <- data.frame(
region = c("North", "South", "East", "West"),
n = c(100, 200, 150, 100)
)
sample <- sampling_design() |>
stratify_by(region) |>
draw(n = sizes_df) |>
execute(frame, seed = 42)
# Neyman allocation
data(bfa_eas_variance)
sample <- sampling_design() |>
stratify_by(region, alloc = "neyman", variance = bfa_eas_variance) |>
draw(n = 300) |>
execute(bfa_eas, seed = 4321)
sample
#> # A tbl_sample: 300 × 17
#> # Weights: 49.67 [36.08, 77.8]
#> ea_id region province commune urban_rural population households area_km2
#> * <chr> <fct> <fct> <fct> <fct> <dbl> <int> <dbl>
#> 1 EA_13823 Boucle … Sourou Tougan Rural 1153 159 21.0
#> 2 EA_11104 Boucle … Bale Poura Urban 1287 174 0.95
#> 3 EA_12348 Boucle … Banwa Solenzo Rural 528 63 8.98
#> 4 EA_02124 Boucle … Mouhoun Bondok… Rural 1372 209 29.8
#> 5 EA_03164 Boucle … Mouhoun Dedoug… Rural 650 115 14.6
#> 6 EA_00273 Boucle … Bale Bagassi Rural 1459 175 16.3
#> 7 EA_14004 Boucle … Nayala Yaba Rural 2013 271 30.6
#> 8 EA_03733 Boucle … Kossi Dokui Rural 1017 163 24.4
#> 9 EA_08669 Boucle … Kossi Nouna Rural 1470 191 89.2
#> 10 EA_14009 Boucle … Nayala Yaba Rural 1299 175 24.7
#> # ℹ 290 more rows
#> # ℹ 9 more variables: accessible <lgl>, dist_road_km <dbl>,
#> # food_insecurity_pct <dbl>, cost <dbl>, .weight <dbl>, .sample_id <int>,
#> # .stage <int>, .weight_1 <dbl>, .fpc_1 <int>
Sample Coordination (PRN)
Permanent random numbers enable coordinated sampling across survey waves. Assign a stable uniform random number to each frame unit, then pass it via prn:
frame$prn <- runif(nrow(frame))
# Wave 1: sequential Poisson sampling with PRN
wave1 <- sampling_design() |>
draw(n = 500, method = "pps_sps", mos = size, prn = prn) |>
execute(frame, seed = 10)
# Wave 2: same PRN -> high overlap (positive coordination)
wave2 <- sampling_design() |>
draw(n = 500, method = "pps_sps", mos = size, prn = prn) |>
execute(frame, seed = 20)
PRN is supported for bernoulli, pps_poisson, pps_sps, and pps_pareto.
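The mechanism behind PRN coordination can be sketched in base R. The example below is a simplified version of sequential Poisson sampling: each unit gets the ranking value prn / p, where p is its share of the size measure, and the n smallest values are selected. The helper sps_select and the toy frame are assumptions for illustration; the sketch ignores certainty units and is not the implementation used by samplyr or sondage.

```r
set.seed(1)
N <- 1000
# Toy frame: a skewed size measure and one permanent random number per unit
frame <- data.frame(id = seq_len(N), size = rgamma(N, shape = 2, scale = 50))
frame$prn <- runif(N)

# Simplified sequential Poisson sampling (hypothetical helper):
# rank units by prn / p and keep the n smallest
sps_select <- function(size, prn, n) {
  p <- size / sum(size)
  order(prn / p)[seq_len(n)]
}

wave1 <- sps_select(frame$size, frame$prn, n = 100)

# Wave 2: sizes drift by up to 10%, PRNs are reused
size2 <- frame$size * runif(N, 0.9, 1.1)
wave2 <- sps_select(size2, frame$prn, n = 100)

# For comparison: same sizes, but fresh random numbers
wave2_indep <- sps_select(size2, runif(N), n = 100)

length(intersect(wave1, wave2))        # overlap with coordination
length(intersect(wave1, wave2_indep))  # overlap without coordination
```

Because the ranking values change only as much as the sizes do, reusing the PRNs keeps most units in the sample from one wave to the next, while fresh random numbers give an overlap driven by the inclusion probabilities alone.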
Balanced Sampling
The cube method (Deville & Tillé 2004) produces samples whose Horvitz-Thompson estimates of auxiliary totals match the population totals, improving precision:
# Balanced sample using population and area as auxiliary variables
balanced_smpl <- sampling_design() |>
stratify_by(region, alloc = "proportional") |>
draw(n = 300, method = "balanced", mos = households,
aux = c(population, area_km2)) |>
execute(bfa_eas, seed = 24)
balanced_smpl
#> # A tbl_sample: 300 × 17
#> # Weights: 48.19 [16.47, 157.71]
#> ea_id region province commune urban_rural population households area_km2
#> * <chr> <fct> <fct> <fct> <fct> <dbl> <int> <dbl>
#> 1 EA_00253 Boucle … Bale Bagassi Rural 1053 127 15.1
#> 2 EA_04189 Boucle … Bale Fara Rural 1093 175 26.1
#> 3 EA_10340 Boucle … Bale Ouri Rural 2128 254 11.1
#> 4 EA_10382 Boucle … Bale Pa Rural 825 116 0.66
#> 5 EA_14033 Boucle … Bale Yaho Rural 946 125 18.5
#> 6 EA_06882 Boucle … Banwa Kouka Rural 1024 118 3.26
#> 7 EA_11693 Boucle … Banwa Sanaba Rural 1548 225 16.9
#> 8 EA_00758 Boucle … Kossi Barani Rural 1786 234 49.7
#> 9 EA_00769 Boucle … Kossi Barani Rural 1448 190 40.7
#> 10 EA_03727 Boucle … Kossi Dokui Rural 2183 349 10.0
#> # ℹ 290 more rows
#> # ℹ 9 more variables: accessible <lgl>, dist_road_km <dbl>,
#> # food_insecurity_pct <dbl>, cost <dbl>, .weight <dbl>, .sample_id <int>,
#> # .stage <int>, .weight_1 <dbl>, .fpc_1 <dbl>
The aux parameter accepts bare column names. When combined with mos, inclusion probabilities are proportional to size while the sample remains balanced on the auxiliary variables. Stratified designs use the stratified cube algorithm (Chauvet 2009) in a single call. Balanced sampling is supported for up to 2 stages.
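One way to see what "balanced" means is to compare the Horvitz-Thompson estimate of each auxiliary total with the known frame total. The base R sketch below does this on a toy frame with a plain simple random sample; balance_report is a hypothetical helper, not a samplyr function, and it assumes the sample carries its weights in a .weight column as in the outputs above.

```r
# Compare weighted sample totals with frame totals (hypothetical helper)
balance_report <- function(sample, frame, vars, weight = ".weight") {
  est <- vapply(vars, function(v) sum(sample[[weight]] * sample[[v]]), numeric(1))
  pop <- vapply(vars, function(v) sum(frame[[v]]), numeric(1))
  data.frame(variable = vars, ht_estimate = est, population = pop,
             rel_diff = est / pop - 1, row.names = NULL)
}

set.seed(42)
# Toy frame: a constant (balances on population size) and a skewed variable
frame <- data.frame(one = 1, x = rgamma(200, shape = 2, scale = 100))
smp <- frame[sample(nrow(frame), 20), ]
smp$.weight <- nrow(frame) / 20  # SRS weights: N / n

report <- balance_report(smp, frame, c("one", "x"))
report
```

A simple random sample reproduces the frame size exactly (rel_diff is 0 for one) but generally not the total of x; a balanced sample aims to bring rel_diff close to 0 for every auxiliary variable. With the objects from the example above, the analogous call would be balance_report(balanced_smpl, bfa_eas, c("population", "area_km2")), assuming a tbl_sample can be indexed like an ordinary data frame.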
Survey Export
Convert samples to survey or srvyr objects for analysis:
svy <- as_svydesign(sample)
survey::svymean(~y, svy)
# Exact or high-entropy joint inclusion probabilities
jip <- joint_expectation(sample, frame, stage = 1)
svy_exact <- as_svydesign(sample, pps = survey::ppsmat(jip[[1]]))
By default as_svydesign() uses Brewer’s variance approximation. For tighter variance estimates, joint_expectation() computes pairwise joint inclusion probabilities from the original frame. The result is exact for CPS, systematic, and Poisson; for Brewer, SPS, and Pareto it uses the O(N^2) high-entropy approximation (exact recursive formulas exist but are O(N^3) and impractical for large frames). See ?joint_expectation for the full method-by-method breakdown.
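For readers unfamiliar with joint inclusion probabilities, the simplest case can be written down in closed form. Under simple random sampling without replacement, every unit has inclusion probability n / N and every pair has joint probability n(n - 1) / (N(N - 1)). The base R sketch below builds that matrix; srswor_jip is a hypothetical helper and does not call joint_expectation().

```r
# Joint inclusion probability matrix for SRSWOR (hypothetical helper).
# Diagonal: first-order probabilities; off-diagonal: pairwise probabilities.
srswor_jip <- function(N, n) {
  stopifnot(n >= 2, n <= N)
  jip <- matrix(n * (n - 1) / (N * (N - 1)), nrow = N, ncol = N)
  diag(jip) <- n / N
  jip
}

jip <- srswor_jip(N = 10, n = 4)
jip[1, 1]  # 0.4, the first-order inclusion probability
jip[1, 2]  # 12 / 90, the probability that units 1 and 2 are both selected

# For any fixed-size design, each row sums to n times the unit's own probability
all.equal(rowSums(jip), rep(4 * 0.4, 10))
```

The row-sum identity is a quick sanity check that also applies to matrices obtained for unequal-probability, fixed-size designs.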
Panel Partitioning
Partition the sample into rotation groups for panel surveys:
panel_sample <- sampling_design() |>
stratify_by(region) |>
draw(n = 200) |>
execute(bfa_eas, seed = 1, panels = 4)
table(panel_sample$.panel)
#>
#> 1 2 3 4
#> 650 650 650 650
Panels are assigned by systematic interleaving within strata. For multi-stage designs, panels are assigned at the PSU level and propagated to all units. Weights reflect the full-sample inclusion probability; for per-panel analysis, multiply by the number of panels.
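A minimal base R sketch of the idea, under the assumption that "systematic interleaving" means cycling through panel labels 1..k in order within each stratum. The helper assign_panels is hypothetical; samplyr may order units or choose the starting panel differently.

```r
# Cycle panel labels 1..k within each stratum (hypothetical helper)
assign_panels <- function(stratum, k) {
  pos <- ave(seq_along(stratum), stratum, FUN = seq_along)  # position within stratum
  (pos - 1) %% k + 1
}

# Toy sample: 8 units in stratum A, 6 in stratum B, equal full-sample weights
smp <- data.frame(stratum = rep(c("A", "B"), times = c(8, 6)), .weight = 25)
smp$.panel <- assign_panels(smp$stratum, k = 4)

table(smp$stratum, smp$.panel)

# Per-panel analysis: each panel is a 1-in-k subsample, so scale the weights by k
smp$.weight_panel <- smp$.weight * 4
```

Every panel receives units from every stratum, so each panel is itself a stratified subsample, and panel sizes differ by at most one unit per stratum.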
Two-Phase Sampling
Pipe a tbl_sample into execute() for multi-phase designs:
# Phase 1: large screening sample
phase1 <- sampling_design() |>
draw(n = 500) |>
execute(bfa_eas, seed = 1)
# Phase 2: subsample from phase 1
phase2 <- sampling_design() |>
draw(n = 50) |>
execute(phase1, seed = 2)
phase2
#> # A tbl_sample: 50 × 17
#> # Weights: 298 [298, 298]
#> ea_id region province commune urban_rural population households area_km2
#> * <chr> <fct> <fct> <fct> <fct> <dbl> <int> <dbl>
#> 1 EA_13110 Centre-… Boulgou Tenkod… Rural 1761 348 3.96
#> 2 EA_11301 Centre-… Bam Rouko Urban 2578 374 19.5
#> 3 EA_07031 Hauts-B… Kenedou… Kourin… Rural 1149 152 35.0
#> 4 EA_10184 Sud-Oue… Ioba Ouessa Rural 2061 306 33.6
#> 5 EA_05021 Nord Zondoma Goursi Rural 747 91 0.27
#> 6 EA_02164 Boucle … Bale Boromo Rural 528 69 12.5
#> 7 EA_03618 Sahel Soum Djibo Rural 1237 163 32.8
#> 8 EA_00184 Nord Passore Arbolle Rural 2026 234 20.1
#> 9 EA_06727 Centre-… Boulkie… Koudou… Urban 1916 368 3.39
#> 10 EA_09532 Centre Kadiogo Ouagad… Urban 2282 433 0.4
#> # ℹ 40 more rows
#> # ℹ 9 more variables: accessible <lgl>, dist_road_km <dbl>,
#> # food_insecurity_pct <dbl>, cost <dbl>, .weight <dbl>, .sample_id <int>,
#> # .stage <int>, .weight_1 <dbl>, .fpc_1 <int>
Weights compound automatically across phases.
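The weight of 298 shown in the header can be reproduced by hand. This is plain arithmetic for the equal-probability case in the example, not a call into samplyr:

```r
N  <- 14900  # rows in bfa_eas
n1 <- 500    # phase 1 sample size
n2 <- 50     # phase 2 sample size

w_phase1 <- N / n1    # 29.8: each phase 1 unit represents 29.8 frame units
w_phase2 <- n1 / n2   # 10: each phase 2 unit represents 10 phase 1 units
w_final  <- w_phase1 * w_phase2
w_final               # 298
```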
Diagnostics
summary(strata_smpl)
#> ── Sample Summary ────────────────────────────────────────────────────────────────────────────────────
#>
#> ℹ n = 300 | stages = 1/1 | seed = 12
#>
#> ── Design: Stage 1 ───────────────────────────────────────────────────────────────────────────────────
#> • Strata: region (proportional)
#> • Method: srswor
#>
#> ── Allocation: Stage 1 ───────────────────────────────────────────────────────────────────────────────
#> region N_h n_h f_h
#> Boucle du Mouhoun 1483 30 0.0202
#> Cascades 667 14 0.0210
#> Centre 1556 31 0.0199
#> Centre-Est 1259 25 0.0199
#> Centre-Nord 1375 28 0.0204
#> Centre-Ouest 1287 26 0.0202
#> Centre-Sud 608 12 0.0197
#> Est 1590 32 0.0201
#> Hauts-Bassins 1483 30 0.0202
#> Nord 1211 24 0.0198
#> Plateau-Central 757 15 0.0198
#> Sahel 902 18 0.0200
#> Sud-Ouest 722 15 0.0208
#> ───── ─── ──────
#> Total 14900 300 0.0201
#>
#> ── Weights ───────────────────────────────────────────────────────────────────────────────────────────
#> • Range: [47.64, 50.67]
#> • Mean: 49.67 · CV: 0.02
#> • DEFF: 1 · n_eff: 300
Included Datasets
Synthetic sampling frames for learning and testing:
| Dataset | Description | Rows |
|---|---|---|
| bfa_eas | LSMS-style EA frame (Burkina Faso) | 14,900 |
| zwe_eas | DHS-style EA frame (Zimbabwe) | 22,600 |
| zwe_households | Household frame for two-phase sampling (Zimbabwe) | 379,326 |
| ken_enterprises | Enterprise survey frame (Kenya) | 6,823 |
Plus auxiliary data: bfa_eas_variance, bfa_eas_cost
Comparison with SAS and SPSS
SAS PROC SURVEYSELECT
proc surveyselect data=frame method=pps n=50 seed=12345;
strata region;
cluster school;
size enrollment;
run;
sampling_design() |>
stratify_by(region) |>
cluster_by(school) |>
draw(n = 50, method = "pps_brewer", mos = enrollment) |>
execute(frame, seed = 1)
SAS Allocation with Bounds
proc surveyselect data=frame method=srs n=500 seed=42;
strata region / alloc=neyman var=variance_data allocmin=2 allocmax=100;
run;
sampling_design() |>
stratify_by(region, alloc = "neyman", variance = variance_data) |>
draw(n = 500, min_n = 2, max_n = 100) |>
execute(frame, seed = 2)
SAS Rounding Control
proc surveyselect data=frame method=sys samprate=0.02 seed=2 round=nearest;
strata State;
run;
sampling_design() |>
stratify_by(State) |>
draw(frac = 0.02, method = "systematic", round = "nearest") |>
execute(frame, seed = 3)
SPSS CSPLAN
CSPLAN SAMPLE
/PLAN FILE='myplan.csplan'
/DESIGN STRATA=region CLUSTER=school
/METHOD TYPE=PPS_WOR
/SIZE VALUE=50
/MOS VARIABLE=enrollment.
sampling_design() |>
stratify_by(region) |>
cluster_by(school) |>
draw(n = 50, method = "pps_brewer", mos = enrollment) |>
execute(frame, seed = 4)