A tidy grammar for survey sampling in R. samplyr provides a minimal set of composable verbs for stratified, clustered, multi-stage, and multi-phase sampling designs with PPS methods, sample coordination, and panel rotation.
Ecosystem Positioning
samplyr sits in the middle of a three-layer workflow:
- Use
svyplanfor planning (n, precision, power, allocation, budget). - Use
samplyrto specify designs, draw samples, and carry design metadata through fieldwork. - Use
survey/srvyrafter data collection for estimation and inference.
These packages are complementary, not competing. Handoffs are explicit (for example, svyplan outputs can feed draw(n = ...), and samplyr outputs convert via as_svydesign(), as_svrepdesign(), and as_survey_design()).
Why samplyr?
samplyr is built around a simple idea: sampling code should read like its English description.
library(samplyr)
data(bfa_eas)
# "Stratify by region, proportionally allocate 500 samples, execute"
sampling_design() |>
stratify_by(region, alloc = "proportional") |>
draw(n = 500) |>
execute(bfa_eas, seed = 1)
#> # A tbl_sample: 500 × 17
#> # Weights: 89.14 [87.47, 90.09]
#> ea_id region province commune urban_rural population households area_km2
#> * <int> <fct> <fct> <fct> <fct> <dbl> <int> <dbl>
#> 1 9648 Boucle du … Banwa Sanaba Rural 279 35 8.37
#> 2 11547 Boucle du … Sourou Toeni Rural 49 5 21.4
#> 3 41824 Boucle du … Kossi Dokui Rural 75 9 15.8
#> 4 11012 Boucle du … Banwa Tansila Rural 592 71 0.95
#> 5 32308 Boucle du … Sourou Lanfie… Rural 57 8 9.49
#> 6 7017 Boucle du … Kossi Madouba Rural 599 74 0.74
#> 7 36700 Boucle du … Bale Fara Rural 402 59 6.36
#> 8 11611 Boucle du … Nayala Yaba Rural 111 15 8.97
#> 9 8342 Boucle du … Mouhoun Ouarko… Rural 94 13 8.1
#> 10 11626 Boucle du … Nayala Yaba Rural 56 7 8.31
#> # ℹ 490 more rows
#> # ℹ 9 more variables: accessible <lgl>, dist_road_km <dbl>,
#> # food_insecurity_pct <dbl>, cost <dbl>, .weight <dbl>, .sample_id <int>,
#> # .stage <int>, .weight_1 <dbl>, .fpc_1 <int>Consider a real survey design from Lohr (2022, Example 7.1), based on a 1991 study of bed net use in rural Gambia (D’Alessandro et al., 1994):
Malaria morbidity can be reduced by using bed nets impregnated with insecticide, but this is only effective if the bed nets are in widespread use. In 1991, a nationwide survey was designed to estimate the prevalence of bed net use in rural areas of the Gambia (D’Alessandro et al., 1994).
The sampling frame consisted of all rural villages of fewer than 3,000 people. The villages were stratified by three geographic regions (eastern, central, and western) and by whether the village had a public health clinic (PHC) or not. In each region five districts were chosen with probability proportional to the district population. In each district four villages were chosen, again with probability proportional to census population: two PHC villages and two non-PHC villages. Finally, six compounds were chosen more or less randomly from each village.
In samplyr, this three-stage stratified cluster design translates directly into code:
design <- sampling_design(title = "Gambia bed nets") |>
add_stage() |>
stratify_by(region) |>
cluster_by(district) |>
draw(n = 5, method = "pps_brewer", mos = population) |>
add_stage() |>
stratify_by(phc) |>
cluster_by(village) |>
draw(n = 2, method = "pps_brewer", mos = population) |>
add_stage() |>
draw(n = 6)
design
#> ── Sampling Design: Gambia bed nets ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
#>
#> ℹ 3 stages
#>
#> ── Stage 1 ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> • Strata: region
#> • Cluster: district
#> • Draw: n = 5, method = pps_brewer, mos = population
#>
#> ── Stage 2 ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> • Strata: phc
#> • Cluster: village
#> • Draw: n = 2, method = pps_brewer, mos = population
#>
#> ── Stage 3 ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> • Draw: n = 6, method = srsworThe samplyr code mirrors the verbal description verb for verb.
Lohr, S. L. (2022). Sampling: Design and Analysis (3rd ed.). CRC Press.
Installation
# Install sondage first (sampling algorithms backend)
pak::pkg_install("gitlab::dickoa/sondage")
# Install svyplan (sample size, precision, power, and stratification)
pak::pkg_install("gitlab::dickoa/svyplan")
# Install samplyr
pak::pkg_install("gitlab::dickoa/samplyr")The Grammar
samplyr uses 5 verbs and 1 modifier:
| Function | Purpose |
|---|---|
sampling_design() |
Create a new sampling design |
stratify_by() |
Define stratification and allocation |
cluster_by() |
Define cluster/PSU variable |
draw() |
Specify sample size and method |
execute() |
Run the design on a frame |
add_stage() |
Delimit stages in multi-stage designs |
Frame-Independent Design
stratify_by() and cluster_by() take bare column names. The design is stored as a specification and resolved only when a frame is available (validate_frame(), execute(), as_svydesign()), so design specification stays separate from execution.
design <- sampling_design() |>
stratify_by(region, alloc = "proportional") |>
cluster_by(ea_id) |>
draw(n = 300)
sample <- execute(design, bfa_eas, seed = 2)
sample
#> # A tbl_sample: 300 × 17
#> # Weights: 148.57 [146.5, 151.22]
#> ea_id region province commune urban_rural population households area_km2
#> * <int> <fct> <fct> <fct> <fct> <dbl> <int> <dbl>
#> 1 29513 Boucle du … Bale Bana Rural 181 24 4.89
#> 2 29527 Boucle du … Bale Bana Rural 193 25 0.14
#> 3 36703 Boucle du … Bale Fara Rural 975 143 1.31
#> 4 8455 Boucle du … Bale Ouri Rural 57 7 5.95
#> 5 8580 Boucle du … Bale Pa Rural 171 25 5.12
#> 6 11739 Boucle du … Bale Yaho Rural 509 76 8.8
#> 7 11746 Boucle du … Bale Yaho Rural 98 15 8.72
#> 8 6291 Boucle du … Banwa Kouka Rural 563 78 7.96
#> 9 34031 Boucle du … Banwa Sami Rural 41 6 20.2
#> 10 34058 Boucle du … Banwa Sami Rural 516 78 7.6
#> # ℹ 290 more rows
#> # ℹ 9 more variables: accessible <lgl>, dist_road_km <dbl>,
#> # food_insecurity_pct <dbl>, cost <dbl>, .weight <dbl>, .sample_id <int>,
#> # .stage <int>, .weight_1 <dbl>, .fpc_1 <int>Quick Start
library(samplyr)
data(bfa_eas)
# Simple random sample
srs_smpl <- sampling_design() |>
draw(n = 100) |>
execute(bfa_eas, seed = 321)
srs_smpl
#> # A tbl_sample: 100 × 17
#> # Weights: 445.7 [445.7, 445.7]
#> ea_id region province commune urban_rural population households area_km2
#> * <int> <fct> <fct> <fct> <fct> <dbl> <int> <dbl>
#> 1 5168 Centre-Nord Sanmate… Kaya Urban 794 117 0.18
#> 2 11281 Centre-Sud Nahouri Tiebele Rural 123 19 8.83
#> 3 30149 Centre-Oue… Sanguie Dassa Rural 813 127 7.52
#> 4 13590 Est Gnagna Koala Rural 353 44 6.04
#> 5 26045 Boucle du … Mouhoun Dedoug… Rural 378 67 8.36
#> 6 44140 Centre-Est Koulpel… Soudou… Rural 258 39 8.95
#> 7 30931 Est Komandj… Gayeri Rural 150 19 8.8
#> 8 10745 Sahel Yagha Tankou… Rural 312 43 17.0
#> 9 13990 Hauts-Bass… Kenedou… Kourou… Rural 406 64 8.98
#> 10 27757 Est Tapoa Partia… Rural 571 78 8.82
#> # ℹ 90 more rows
#> # ℹ 9 more variables: accessible <lgl>, dist_road_km <dbl>,
#> # food_insecurity_pct <dbl>, cost <dbl>, .weight <dbl>, .sample_id <int>,
#> # .stage <int>, .weight_1 <dbl>, .fpc_1 <int>
# Stratified proportional allocation
strata_smpl <- sampling_design() |>
stratify_by(region, alloc = "proportional") |>
draw(n = 300) |>
execute(bfa_eas, seed = 12)
strata_smpl
#> # A tbl_sample: 300 × 17
#> # Weights: 148.57 [146.5, 151.22]
#> ea_id region province commune urban_rural population households area_km2
#> * <int> <fct> <fct> <fct> <fct> <dbl> <int> <dbl>
#> 1 31987 Boucle du … Sourou Kiemba… Rural 24 3 7.24
#> 2 4968 Boucle du … Sourou Kassoum Rural 303 45 4.13
#> 3 4958 Boucle du … Sourou Kassoum Rural 142 21 8.24
#> 4 23903 Boucle du … Banwa Solenzo Rural 94 13 13.5
#> 5 25996 Boucle du … Mouhoun Dedoug… Rural 538 95 8.63
#> 6 21501 Boucle du … Kossi Doumba… Rural 247 30 7.93
#> 7 43794 Boucle du … Mouhoun Safane Rural 87 12 8.96
#> 8 9724 Boucle du … Banwa Sanaba Rural 51 6 8.02
#> 9 10979 Boucle du … Banwa Tansila Rural 71 9 7.72
#> 10 8893 Boucle du … Bale Pompoi Rural 522 71 0.48
#> # ℹ 290 more rows
#> # ℹ 9 more variables: accessible <lgl>, dist_road_km <dbl>,
#> # food_insecurity_pct <dbl>, cost <dbl>, .weight <dbl>, .sample_id <int>,
#> # .stage <int>, .weight_1 <dbl>, .fpc_1 <int>
# PPS cluster sampling
cluster_smpl <- sampling_design() |>
cluster_by(ea_id) |>
draw(n = 50, method = "pps_brewer", mos = households) |>
execute(bfa_eas, seed = 123)
cluster_smpl
#> # A tbl_sample: 50 × 18
#> # Weights: 812.6 [113.61, 3998.97]
#> ea_id region province commune urban_rural population households area_km2
#> * <int> <fct> <fct> <fct> <fct> <dbl> <int> <dbl>
#> 1 23926 Boucle du … Banwa Solenzo Rural 843 113 7.84
#> 2 33198 Boucle du … Kossi Nouna Rural 1011 149 15.9
#> 3 20930 Boucle du … Mouhoun Bondok… Rural 792 121 9.76
#> 4 23139 Cascades Comoe Ouo Rural 275 32 7.57
#> 5 15175 Centre Kadiogo Ouagad… Urban 1287 194 0.2
#> 6 15267 Centre Kadiogo Ouagad… Urban 3506 528 0.88
#> 7 15367 Centre Kadiogo Ouagad… Urban 993 150 0.32
#> 8 15491 Centre Kadiogo Ouagad… Urban 854 129 0.11
#> 9 17031 Centre Kadiogo Ouagad… Urban 802 121 1.93
#> 10 17414 Centre Kadiogo Ouagad… Urban 874 132 0.12
#> # ℹ 40 more rows
#> # ℹ 10 more variables: accessible <lgl>, dist_road_km <dbl>,
#> # food_insecurity_pct <dbl>, cost <dbl>, .weight <dbl>, .sample_id <int>,
#> # .stage <int>, .weight_1 <dbl>, .fpc_1 <int>, .certainty_1 <lgl>Multi-Stage Sampling
Use add_stage() to define multi-stage designs. This example selects districts with PPS, then samples EAs within each:
library(dplyr, warn.conflicts = FALSE)
data(zwe_eas)
# Add district-level measure of size
zwe_frame <- zwe_eas |>
mutate(district_hh = sum(households), .by = district)
# Two-stage design: 10 districts, 5 EAs per district
sample <- sampling_design() |>
add_stage(label = "Districts") |>
cluster_by(district) |>
draw(n = 10, method = "pps_brewer", mos = district_hh) |>
add_stage(label = "EAs") |>
draw(n = 5) |>
execute(zwe_frame, seed = 12345)
sample
#> # A tbl_sample: 50 × 21
#> # Weights: 2514.13 [784.4, 4121.21]
#> ea_id province district ward_pcode urban_rural population households
#> * <int> <fct> <fct> <chr> <fct> <int> <int>
#> 1 88821 Harare Harare ZW192102 Urban 197 57
#> 2 87420 Harare Harare ZW192116 Urban 1378 353
#> 3 86941 Harare Harare ZW192108 Urban 103 38
#> 4 88444 Harare Harare ZW192129 Urban 538 142
#> 5 34696 Harare Harare ZW192120 Urban 452 124
#> 6 14864 Manicaland Chimanimani ZW110201 Rural 111 29
#> 7 59668 Manicaland Chimanimani ZW110204 Rural 142 36
#> 8 60166 Manicaland Chimanimani ZW110203 Rural 114 29
#> 9 60082 Manicaland Chimanimani ZW110216 Rural 73 19
#> 10 58991 Manicaland Chimanimani ZW110219 Rural 73 18
#> # ℹ 40 more rows
#> # ℹ 14 more variables: buildings <int>, women_15_49 <int>, men_15_49 <int>,
#> # children_under5 <int>, area_km2 <dbl>, district_hh <int>, .weight <dbl>,
#> # .sample_id <int>, .stage <int>, .weight_2 <dbl>, .fpc_2 <int>,
#> # .weight_1 <dbl>, .fpc_1 <int>, .certainty_1 <lgl>Operational Sampling
Execute stages separately when fieldwork happens between stages:
design <- sampling_design() |>
add_stage(label = "EA") |>
stratify_by(urban_rural) |>
cluster_by(ea_id) |>
draw(n = 10, method = "pps_brewer", mos = households) |>
add_stage(label = "HH") |>
draw(n = 5)
# Execute stage 1 only
selected_eas <- execute(design, zwe_eas, stages = 1, seed = 1)
selected_eas
#> # A tbl_sample: 20 × 18
#> # Stages: 1/2
#> # Weights: 6492.06 [591.31, 14492.74]
#> ea_id province district ward_pcode urban_rural population households
#> * <int> <fct> <fct> <chr> <fct> <int> <int>
#> 1 47209 Bulawayo Bulawayo ZW102127 Urban 462 121
#> 2 35161 Harare Harare ZW192109 Urban 140 38
#> 3 86782 Harare Harare ZW192130 Urban 974 263
#> 4 88462 Harare Harare ZW192103 Urban 993 302
#> 5 93947 Mashonaland Cent… Bindura ZW120105 Rural 95 23
#> 6 35770 Mashonaland Cent… Guruve ZW120307 Urban 121 32
#> 7 35831 Mashonaland Cent… Guruve ZW120322 Rural 322 80
#> 8 83626 Mashonaland East Goromon… ZW130225 Urban 277 72
#> 9 37885 Mashonaland East Mudzi ZW130514 Urban 294 80
#> 10 83520 Mashonaland East Murehwa ZW130627 Rural 55 15
#> 11 20895 Mashonaland East Mutoko ZW130726 Rural 96 23
#> 12 36189 Mashonaland West Makonde ZW140513 Rural 94 21
#> 13 42748 Masvingo Mwenezi ZW180610 Rural 73 15
#> 14 76867 Matabeleland Nor… Binga ZW150107 Rural 74 20
#> 15 5602 Matabeleland Nor… Tsholot… ZW150603 Rural 62 14
#> 16 61444 Matabeleland Nor… Tsholot… ZW150605 Rural 75 18
#> 17 28253 Matabeleland Sou… Gwanda … ZW162103 Urban 106 35
#> 18 51722 Matabeleland Sou… Umzingw… ZW160701 Urban 357 81
#> 19 54466 Midlands Gweru U… ZW172114 Urban 510 136
#> 20 866 Midlands Zvishav… ZW170818 Rural 91 19
#> # ℹ 11 more variables: buildings <int>, women_15_49 <int>, men_15_49 <int>,
#> # children_under5 <int>, area_km2 <dbl>, .weight <dbl>, .sample_id <int>,
#> # .stage <int>, .weight_1 <dbl>, .fpc_1 <int>, .certainty_1 <lgl>
# ... fieldwork ...
# list all households in selected EAs
selected_eas_list <- selected_eas |>
slice(rep(seq_len(n()), households)) |>
mutate(hh_id = row_number())
# Execute stage 2
final_sample <- design |>
execute(selected_eas_list, seed = 2)
final_sample
#> # A tbl_sample: 100 × 21
#> # Weights: 38147.25 [35714.84, 40579.66]
#> ea_id province district ward_pcode urban_rural population households
#> * <int> <fct> <fct> <chr> <fct> <int> <int>
#> 1 47209 Bulawayo Bulawayo ZW102127 Urban 462 121
#> 2 47209 Bulawayo Bulawayo ZW102127 Urban 462 121
#> 3 47209 Bulawayo Bulawayo ZW102127 Urban 462 121
#> 4 47209 Bulawayo Bulawayo ZW102127 Urban 462 121
#> 5 47209 Bulawayo Bulawayo ZW102127 Urban 462 121
#> 6 35161 Harare Harare ZW192109 Urban 140 38
#> 7 35161 Harare Harare ZW192109 Urban 140 38
#> 8 35161 Harare Harare ZW192109 Urban 140 38
#> 9 35161 Harare Harare ZW192109 Urban 140 38
#> 10 35161 Harare Harare ZW192109 Urban 140 38
#> # ℹ 90 more rows
#> # ℹ 14 more variables: buildings <int>, women_15_49 <int>, men_15_49 <int>,
#> # children_under5 <int>, area_km2 <dbl>, hh_id <int>, .weight <dbl>,
#> # .sample_id <int>, .stage <int>, .weight_2 <dbl>, .fpc_2 <int>,
#> # .weight_1 <dbl>, .fpc_1 <int>, .certainty_1 <lgl>Selection Methods
Equal Probability
| Method | Sample Size | Description |
|---|---|---|
srswor |
Fixed | Simple random sampling without replacement (default) |
srswr |
Fixed | Simple random sampling with replacement |
systematic |
Fixed | Systematic sampling |
bernoulli |
Random | Bernoulli sampling |
Probability Proportional to Size (PPS)
| Method | Sample Size | Description |
|---|---|---|
pps_brewer |
Fixed | Brewer’s method (recommended) |
pps_systematic |
Fixed | PPS systematic |
pps_cps |
Fixed | Conditional Poisson sampling |
pps_poisson |
Random | PPS Poisson (PRN) |
pps_sps |
Fixed | Sequential Poisson sampling (PRN) |
pps_pareto |
Fixed | Pareto piPS sampling (PRN) |
pps_multinomial |
Fixed | PPS with replacement |
pps_chromy |
Fixed | PPS with minimum replacement |
Allocation Methods
When stratifying, control how the total sample is distributed:
| Method | Description |
|---|---|
| (none) |
n applies per stratum |
equal |
Same sample size in each stratum |
proportional |
Proportional to stratum size |
neyman |
Minimize variance (requires variance) |
optimal |
Minimize cost-variance (requires variance and cost) |
power |
Compromise allocation (requires cv and importance) |
Sample Size Bounds
Use min_n and max_n in draw() to constrain stratum sample sizes when using allocation methods:
data(bfa_eas_variance)
# Ensure at least 2 per stratum (minimum for variance estimation)
sampling_design() |>
stratify_by(region, alloc = "neyman", variance = bfa_eas_variance) |>
draw(n = 300, min_n = 2) |>
execute(bfa_eas, seed = 321)
#> # A tbl_sample: 300 × 17
#> # Weights: 148.57 [56.35, 1209.75]
#> ea_id region province commune urban_rural population households area_km2
#> * <int> <fct> <fct> <fct> <fct> <dbl> <int> <dbl>
#> 1 11643 Boucle du … Nayala Yaba Rural 396 52 8.75
#> 2 36763 Boucle du … Bale Fara Rural 79 12 2.36
#> 3 9648 Boucle du … Banwa Sanaba Rural 279 35 8.37
#> 4 10555 Boucle du … Kossi Sono Rural 164 25 8.38
#> 5 34926 Boucle du … Sourou Tougan Rural 1239 170 1.15
#> 6 44501 Boucle du … Mouhoun Tcheri… Rural 221 35 0.22
#> 7 9702 Boucle du … Banwa Sanaba Rural 260 33 6.92
#> 8 26045 Boucle du … Mouhoun Dedoug… Rural 378 67 8.36
#> 9 8605 Boucle du … Bale Pa Rural 283 41 8.78
#> 10 21078 Boucle du … Kossi Bouras… Rural 93 11 7.08
#> # ℹ 290 more rows
#> # ℹ 9 more variables: accessible <lgl>, dist_road_km <dbl>,
#> # food_insecurity_pct <dbl>, cost <dbl>, .weight <dbl>, .sample_id <int>,
#> # .stage <int>, .weight_1 <dbl>, .fpc_1 <int>Custom Allocation
For custom stratum-specific sizes or rates, pass a data frame to n or frac in draw():
# Custom allocation with data frame
sizes_df <- data.frame(
region = c("North", "South", "East", "West"),
n = c(100, 200, 150, 100)
)
sample <- sampling_design() |>
stratify_by(region) |>
draw(n = sizes_df) |>
execute(frame, seed = 42)
# Neyman allocation
data(bfa_eas_variance)
sample <- sampling_design() |>
stratify_by(region, alloc = "neyman", variance = bfa_eas_variance) |>
draw(n = 300) |>
execute(bfa_eas, seed = 4321)
sample
#> # A tbl_sample: 300 × 17
#> # Weights: 148.57 [56.35, 1209.75]
#> ea_id region province commune urban_rural population households area_km2
#> * <int> <fct> <fct> <fct> <fct> <dbl> <int> <dbl>
#> 1 21517 Boucle du … Kossi Doumba… Rural 354 43 8.72
#> 2 33112 Boucle du … Kossi Nouna Rural 28 4 5.47
#> 3 25963 Boucle du … Mouhoun Dedoug… Rural 416 74 0.67
#> 4 6354 Boucle du … Banwa Kouka Rural 296 41 8.95
#> 5 11787 Boucle du … Bale Bagassi Rural 874 105 1.4
#> 6 8233 Boucle du … Mouhoun Ouarko… Rural 183 25 9.54
#> 7 9925 Boucle du … Bale Siby Rural 204 30 0.36
#> 8 34812 Boucle du … Sourou Tougan Rural 1072 147 0.3
#> 9 11441 Boucle du … Sourou Toeni Rural 64 7 23.9
#> 10 8883 Boucle du … Bale Pompoi Rural 192 26 9.42
#> # ℹ 290 more rows
#> # ℹ 9 more variables: accessible <lgl>, dist_road_km <dbl>,
#> # food_insecurity_pct <dbl>, cost <dbl>, .weight <dbl>, .sample_id <int>,
#> # .stage <int>, .weight_1 <dbl>, .fpc_1 <int>Sample Coordination (PRN)
Permanent random numbers enable coordinated sampling across survey waves. Assign a stable uniform random number to each frame unit, then pass it via prn:
frame$prn <- runif(nrow(frame))
# Wave 1: sequential Poisson sampling with PRN
wave1 <- sampling_design() |>
draw(n = 500, method = "pps_sps", mos = size, prn = prn) |>
execute(frame, seed = 10)
# Wave 2: same PRN -> high overlap (positive coordination)
wave2 <- sampling_design() |>
draw(n = 500, method = "pps_sps", mos = size, prn = prn) |>
execute(frame, seed = 20)PRN is supported for bernoulli, pps_poisson, pps_sps, and pps_pareto.
Balanced Sampling
The cube method (Deville & Tille 2004) produces samples whose Horvitz-Thompson estimates of auxiliary totals match the population totals, improving precision:
# Balanced sample using population and area as auxiliary variables
balanced_smpl <- sampling_design() |>
stratify_by(region, alloc = "proportional") |>
draw(n = 300, method = "balanced", mos = households,
aux = c(population, area_km2)) |>
execute(bfa_eas, seed = 24)
balanced_smpl
#> # A tbl_sample: 300 × 17
#> # Weights: 144.02 [13.68, 2034.38]
#> ea_id region province commune urban_rural population households area_km2
#> * <int> <fct> <fct> <fct> <fct> <dbl> <int> <dbl>
#> 1 11819 Boucle du … Bale Bagassi Rural 386 46 0.57
#> 2 1136 Boucle du … Bale Boromo Rural 3613 471 4.23
#> 3 1205 Boucle du … Bale Boromo Rural 4324 564 3.13
#> 4 8453 Boucle du … Bale Ouri Rural 809 103 0.62
#> 5 39991 Boucle du … Banwa Balave Rural 865 136 0.81
#> 6 39992 Boucle du … Banwa Balave Rural 947 149 1.05
#> 7 6360 Boucle du … Banwa Kouka Rural 592 82 0.54
#> 8 23749 Boucle du … Banwa Solenzo Rural 235 32 8.79
#> 9 23855 Boucle du … Banwa Solenzo Rural 831 112 6.92
#> 10 24020 Boucle du … Banwa Solenzo Rural 466 63 8.71
#> # ℹ 290 more rows
#> # ℹ 9 more variables: accessible <lgl>, dist_road_km <dbl>,
#> # food_insecurity_pct <dbl>, cost <dbl>, .weight <dbl>, .sample_id <int>,
#> # .stage <int>, .weight_1 <dbl>, .fpc_1 <dbl>The aux parameter accepts bare column names. When combined with mos, inclusion probabilities are proportional to size while the sample remains balanced on the auxiliary variables. Stratified designs use the stratified cube algorithm (Chauvet 2009) in a single call. Balanced sampling is supported for up to 2 stages.
Custom Sampling Methods
Any unequal probability sampling method can be plugged into samplyr by registering it with sondage::register_method(). The registered method is then available in draw() using the pps_<name> convention.
For example, to use the elimination procedure of Tille (1996) from the sampling package:
# Wrap sampling::UPtille to return selected indices
tille_fn <- function(pik, n = NULL, prn = NULL, ...) {
which(as.logical(sampling::UPtille(pik)))
}
# Joint inclusion probabilities, restricted to the sampled units
tille_joint_fn <- function(pik, sample_idx = NULL, ...) {
pi2 <- sampling::UPtillepi2(pik)
if (!is.null(sample_idx))
pi2 <- pi2[sample_idx, sample_idx, drop = FALSE]
pi2
}
# Register with both sampling and joint probability functions
sondage::register_method(
"tille", type = "wor",
sample_fn = tille_fn,
joint_fn = tille_joint_fn
)
# Use it like any built-in method
sample_tille <- sampling_design() |>
stratify_by(region, alloc = "proportional") |>
draw(n = 300, method = "pps_tille", mos = households) |>
execute(bfa_eas, seed = 1)
sample_tille
#> # A tbl_sample: 300 × 18
#> # Weights: 146.13 [13.68, 1719.7]
#> ea_id region province commune urban_rural population households area_km2
#> * <int> <fct> <fct> <fct> <fct> <dbl> <int> <dbl>
#> 1 29508 Boucle du … Bale Bana Rural 998 130 2.14
#> 2 1205 Boucle du … Bale Boromo Rural 4324 564 3.13
#> 3 36751 Boucle du … Bale Fara Rural 581 85 8.69
#> 4 8609 Boucle du … Bale Pa Rural 778 112 0.88
#> 5 9895 Boucle du … Bale Siby Rural 1813 268 1.7
#> 6 11739 Boucle du … Bale Yaho Rural 509 76 8.8
#> 7 39999 Boucle du … Banwa Balave Rural 1820 287 2.04
#> 8 6256 Boucle du … Banwa Kouka Rural 493 68 8.55
#> 9 6304 Boucle du … Banwa Kouka Rural 540 74 8.06
#> 10 6366 Boucle du … Banwa Kouka Rural 717 99 0.5
#> # ℹ 290 more rows
#> # ℹ 10 more variables: accessible <lgl>, dist_road_km <dbl>,
#> # food_insecurity_pct <dbl>, cost <dbl>, .weight <dbl>, .sample_id <int>,
#> # .stage <int>, .weight_1 <dbl>, .fpc_1 <int>, .certainty_1 <lgl>
# Clean up
sondage::unregister_method("tille")Custom methods flow through the full pipeline: stratification, multi-stage designs, certainty selection, survey export, and joint probabilities. Providing a joint_fn enables exact variance estimation via joint_expectation() and survey::ppsmat(). See ?sondage::register_method for the full contract.
Survey Export
Convert samples to survey or srvyr objects for analysis:
svy <- as_svydesign(sample)
survey::svymean(~y, svy)
# Exact or high-entropy joint inclusion probabilities
jip <- joint_expectation(sample, frame, stage = 1)
svy_exact <- as_svydesign(sample, pps = survey::ppsmat(jip[[1]]))By default as_svydesign() uses Brewer variance approximation. For tighter variance estimates, joint_expectation() computes pairwise joint inclusion probabilities from the original frame. The result is exact for CPS, systematic, and Poisson whereas high-entropy approximation O(N^2) is by Brewer, SPS, Pareto and Balanced sampling using the Cube algorithm (exact recursive formulas exist for some of these methods but are usually O(N^3) and impractical for large frames). See ?joint_expectation for the full method-by-method breakdown.
Panel Partitioning
Partition the sample into rotation groups for panel surveys:
panel_sample <- sampling_design() |>
stratify_by(region) |>
draw(n = 200) |>
execute(bfa_eas, seed = 1, panels = 4)
table(panel_sample$.panel)
#>
#> 1 2 3 4
#> 650 650 650 650Panels are assigned by systematic interleaving within strata. For multi-stage designs, panels are assigned at the PSU level and propagated to all units. Per-panel analysis is also possible by multiplying the weights by the number of panels since the weights reflect the full-sample inclusion probability.
Replicated Sampling
Draw multiple independent samples from the same frame and design with reps. This is useful for simulation studies, variance estimation via repeated sampling, or quality control:
rep_sample <- sampling_design() |>
stratify_by(region, alloc = "proportional") |>
draw(n = 300) |>
execute(bfa_eas, seed = 42, reps = 5)
table(rep_sample$.replicate)
#>
#> 1 2 3 4 5
#> 300 300 300 300 300Each replicate is an independent draw. Replicate r uses seed seed + r - 1. To analyse a single replicate, filter first:
rep1 <- rep_sample |> filter(.replicate == 1)
as_svydesign(rep1)Two-Phase Sampling
Pipe a tbl_sample into execute() for multi-phase designs:
# Phase 1: large screening sample
phase1 <- sampling_design() |>
draw(n = 500) |>
execute(bfa_eas, seed = 1)
# Phase 2: subsample from phase 1
phase2 <- sampling_design() |>
draw(n = 50) |>
execute(phase1, seed = 2)
phase2
#> # A tbl_sample: 50 × 17
#> # Weights: 891.4 [891.4, 891.4]
#> ea_id region province commune urban_rural population households area_km2
#> * <int> <fct> <fct> <fct> <fct> <dbl> <int> <dbl>
#> 1 33282 Boucle du … Kossi Nouna Rural 327 48 4.35
#> 2 25236 Est Gnagna Bilanga Rural 581 88 8.72
#> 3 33467 Est Kompien… Pama Rural 46 6 34.1
#> 4 19744 Centre-Est Boulgou Zabre Rural 433 79 8.46
#> 5 16081 Centre Kadiogo Ouagad… Urban 830 125 0.11
#> 6 20688 Cascades Comoe Banfora Rural 823 101 0.18
#> 7 36869 Boucle du … Nayala Gassan Rural 1172 129 1.18
#> 8 32034 Centre-Est Koulpel… Komin-… Rural 444 62 7.97
#> 9 15189 Centre Kadiogo Ouagad… Urban 471 71 0.08
#> 10 11082 Sud-Ouest Bougour… Tianko… Rural 43 6 10.1
#> # ℹ 40 more rows
#> # ℹ 9 more variables: accessible <lgl>, dist_road_km <dbl>,
#> # food_insecurity_pct <dbl>, cost <dbl>, .weight <dbl>, .sample_id <int>,
#> # .stage <int>, .weight_1 <dbl>, .fpc_1 <int>Weights compound automatically across phases.
Diagnostics
summary(strata_smpl)
#> ── Sample Summary ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
#>
#> ℹ n = 300 | stages = 1/1 | seed = 12
#>
#> ── Design: Stage 1 ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> • Strata: region (proportional)
#> • Method: srswor
#>
#> ── Allocation: Stage 1 ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> region N_h n_h f_h
#> Boucle du Mouhoun 5009 34 0.0068
#> Cascades 2508 17 0.0068
#> Centre 3888 26 0.0067
#> Centre-Est 2941 20 0.0068
#> Centre-Nord 3402 23 0.0068
#> Centre-Ouest 3723 25 0.0067
#> Centre-Sud 1612 11 0.0068
#> Est 5505 37 0.0067
#> Hauts-Bassins 4839 32 0.0066
#> Nord 2930 20 0.0068
#> Plateau-Central 1662 11 0.0066
#> Sahel 4144 28 0.0068
#> Sud-Ouest 2407 16 0.0066
#> ───── ─── ──────
#> Total 44570 300 0.0067
#>
#> ── Weights ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> • Range: [146.5, 151.22]
#> • Mean: 148.57 · CV: 0.01
#> • DEFF: 1 · n_eff: 300Validation
For statistical validation on synthetic populations with known truths, see vignette("validation"). It combines deterministic invariants (weights, FPC, certainty, stage compounding) with Monte Carlo checks of bias, standard-error calibration, and coverage.
Included Datasets
Synthetic sampling frames for learning and testing:
| Dataset | Description | Rows |
|---|---|---|
bfa_eas |
Household survey EA frame (Burkina Faso) | 44,570 |
zwe_eas |
Two-stage cluster survey EA frame (Zimbabwe) | 107,250 |
ken_enterprises |
Enterprise survey frame (Kenya) | 17,004 |
Plus auxiliary data: bfa_eas_variance, bfa_eas_cost
Comparison with SAS and SPSS
SAS PROC SURVEYSELECT
proc surveyselect data=frame method=pps n=50 seed=12345;
strata region;
cluster school;
size enrollment;
run;
sampling_design() |>
stratify_by(region) |>
cluster_by(school) |>
draw(n = 50, method = "pps_brewer", mos = enrollment) |>
execute(frame, seed = 1)SAS Allocation with Bounds
proc surveyselect data=frame method=srs n=500 seed=42;
strata region / alloc=neyman var=variance_data allocmin=2 allocmax=100;
run;
sampling_design() |>
stratify_by(region, alloc = "neyman", variance = variance_data) |>
draw(n = 500, min_n = 2, max_n = 100) |>
execute(frame, seed = 2)SAS Rounding Control
proc surveyselect data=frame method=sys samprate=0.02 seed=2 round=nearest;
strata State;
run;
sampling_design() |>
stratify_by(State) |>
draw(frac = 0.02, method = "systematic", round = "nearest") |>
execute(frame, seed = 3)SPSS CSPLAN
CSPLAN SAMPLE
/PLAN FILE='myplan.csplan'
/DESIGN STRATA=region CLUSTER=school
/METHOD TYPE=PPS_WOR
/SIZE VALUE=50
/MOS VARIABLE=enrollment.
sampling_design() |>
stratify_by(region) |>
cluster_by(school) |>
draw(n = 50, method = "pps_brewer", mos = enrollment) |>
execute(frame, seed = 4)