A tidy grammar for survey sampling in R. samplyr provides a minimal set of composable verbs for stratified, clustered, and multi-stage sampling designs.
```r
# Install sondage first (sampling algorithms backend)
remotes::install_gitlab("dickoa/sondage")
remotes::install_gitlab("dickoa/samplyr")
pak::pkg_install("dickoa/samplyr") # GitHub mirror
```

samplyr is built around a simple idea: sampling code should read like its English description.
```r
library(samplyr)
data(kenya_health)

# "Stratify by region, proportionally allocate 500 samples, execute"
sampling_design() |>
  stratify_by(region, alloc = "proportional") |>
  draw(n = 500) |>
  execute(kenya_health, seed = 1)
```

The package uses 5 verbs and 1 modifier:
| Function | Purpose |
|---|---|
| `sampling_design()` | Create a new sampling design |
| `stratify_by()` | Define stratification and allocation |
| `cluster_by()` | Define cluster/PSU variable |
| `draw()` | Specify sample size and method |
| `execute()` | Run the design on a frame |
| `stage()` | Delimit stages in multi-stage designs |
```r
library(samplyr)
data(niger_eas)

# Simple random sample
srs_smpl <- sampling_design() |>
  draw(n = 100) |>
  execute(niger_eas, seed = 42)

# Stratified proportional allocation
strata_smpl <- sampling_design() |>
  stratify_by(region, alloc = "proportional") |>
  draw(n = 300) |>
  execute(niger_eas, seed = 42)

# PPS cluster sampling
cluster_smpl <- sampling_design() |>
  cluster_by(ea_id) |>
  draw(n = 50, method = "pps_brewer", mos = hh_count) |>
  execute(niger_eas, seed = 42)
```

Use `stage()` to define multi-stage designs. This example selects districts with PPS, then samples schools within each:
```r
library(dplyr)
data(tanzania_schools)

# Add district-level measure of size
schools_frame <- tanzania_schools |>
  mutate(district_enrollment = sum(enrollment),
         .by = district)

# Two-stage design: 10 districts, 5 schools per district
sample <- sampling_design() |>
  stage(label = "Districts") |>
    cluster_by(district) |>
    draw(n = 10, method = "pps_brewer", mos = district_enrollment) |>
  stage(label = "Schools") |>
    draw(n = 5) |>
  execute(schools_frame, seed = 123)
```

Execute stages separately when fieldwork happens between stages:
```r
design <- sampling_design() |>
  stage(label = "Districts") |>
    cluster_by(district) |>
    draw(n = 10, method = "pps_brewer", mos = district_enrollment) |>
  stage(label = "Schools") |>
    draw(n = 5)

# Aggregate to one row per district, with the measure of size
schools_frame_agg <- tanzania_schools |>
  summarize(district_enrollment = sum(enrollment),
            m = n(),
            .by = district)

# Execute stage 1 only
selected_districts <- execute(design, schools_frame_agg, stages = 1, seed = 1)

# ... fieldwork ...

# School-level frame carrying the same district-level columns
schools_frame <- tanzania_schools |>
  mutate(district_enrollment = sum(enrollment),
         m = n(),
         .by = district)

# Execute stage 2
final_sample <- selected_districts |> execute(schools_frame, seed = 2)
```

Equal-probability selection methods for `draw()`:

| Method | Sample Size | Description |
|---|---|---|
| `srswor` | Fixed | Simple random sampling without replacement (default) |
| `srswr` | Fixed | Simple random sampling with replacement |
| `systematic` | Fixed | Systematic sampling |
| `bernoulli` | Random | Bernoulli sampling (requires `frac`) |
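To make the "Fixed" versus "Random" sample-size column concrete, here is a minimal base R sketch of the four equal-probability methods. It is an illustration only and does not reproduce how samplyr or sondage implement them; the frame size `N`, the target `n`, and all object names are hypothetical.

```r
set.seed(42)
N <- 1000               # hypothetical frame size
n <- 50                 # target sample size
frame_ids <- seq_len(N)

# srswor: n distinct units, each with inclusion probability n / N
srswor <- sample(frame_ids, size = n, replace = FALSE)

# srswr: n independent draws, so a unit can be selected more than once
srswr <- sample(frame_ids, size = n, replace = TRUE)

# systematic: random start in the first interval, then every k-th unit
k <- N / n
start <- sample.int(k, 1)
systematic <- seq(from = start, by = k, length.out = n)

# bernoulli: each unit is kept independently with probability frac,
# so the realised sample size varies around N * frac
frac <- n / N
bernoulli <- frame_ids[runif(N) < frac]

lengths(list(srswor = srswor, srswr = srswr,
             systematic = systematic, bernoulli = bernoulli))
```

The first three lengths always equal `n`; the Bernoulli length changes from seed to seed, which is why that method is specified through a sampling fraction rather than a fixed size.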
Probability-proportional-to-size (PPS) methods, which also need a measure of size (`mos`):

| Method | Sample Size | Description |
|---|---|---|
| `pps_brewer` | Fixed | Brewer’s method (recommended) |
| `pps_systematic` | Fixed | PPS systematic |
| `pps_maxent` | Fixed | Maximum entropy |
| `pps_poisson` | Random | PPS Poisson (requires `frac`) |
| `pps_multinomial` | Fixed | PPS with replacement |
| `pps_chromy` | Fixed | PPS with minimum replacement |
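As a rough illustration of what "proportional to size" means, the base R sketch below computes target inclusion probabilities from a measure of size and then draws one fixed-size sample (textbook PPS systematic) and one random-size sample (Poisson). The `mos` values and object names are hypothetical, and this is not a description of the algorithms used by sondage.

```r
set.seed(1)

# Hypothetical measure of size for 8 clusters (e.g. household counts)
mos <- c(120, 80, 300, 45, 60, 210, 95, 90)
n <- 3

# Target inclusion probabilities: proportional to size, summing to n
pik <- n * mos / sum(mos)
stopifnot(all(pik <= 1))   # no cluster is large enough to be a certainty unit

# Fixed size: PPS systematic on the cumulated probabilities
cum <- cumsum(pik)
points <- runif(1) + 0:(n - 1)               # n selection points, one unit apart
pps_sys <- pmin(findInterval(points, cum) + 1, length(pik))

# Random size: Poisson sampling keeps each unit independently with prob pik
pps_poisson <- which(runif(length(mos)) < pik)

pps_sys
pps_poisson
```

Because every `pik` is at most 1 and the selection points are exactly one unit apart, the systematic draw always returns `n` distinct clusters. The Poisson draw has expected size `n` but can return more or fewer.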
When stratifying, control how the total sample is distributed:
| Method | Description |
|---|---|
| (none) | `n` applies per stratum |
| `equal` | Same sample size in each stratum |
| `proportional` | Proportional to stratum size |
| `neyman` | Minimize variance (requires `variance`) |
| `optimal` | Minimize cost-variance (requires `variance` and `cost`) |
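The allocation rules in the table follow the classical textbook formulas, sketched here in base R before any rounding or minimum/maximum constraints. The stratum sizes, variances, and costs are made-up values, and the code is an assumption-laden illustration, not samplyr's internal implementation.

```r
# Hypothetical strata: population size, variance, and per-unit cost
strata <- data.frame(
  region   = c("North", "South", "East", "West"),
  N_h      = c(400, 300, 200, 100),
  variance = c(25, 100, 4, 49),
  cost     = c(1, 4, 1, 1)
)
n <- 100

# equal: the same share of n in every stratum
n_equal <- rep(n / nrow(strata), nrow(strata))

# proportional: n_h = n * N_h / sum(N_h)
n_prop <- n * strata$N_h / sum(strata$N_h)   # 40 30 20 10

# neyman: n_h proportional to N_h * S_h, with S_h = sqrt(variance)
w_neyman <- strata$N_h * sqrt(strata$variance)
n_neyman <- n * w_neyman / sum(w_neyman)

# optimal: n_h proportional to N_h * S_h / sqrt(cost)
w_optimal <- w_neyman / sqrt(strata$cost)
n_optimal <- n * w_optimal / sum(w_optimal)

data.frame(region = strata$region,
           equal = n_equal, proportional = n_prop,
           neyman = round(n_neyman, 1), optimal = round(n_optimal, 1))
```

In this example the South stratum is not the largest but has the highest variance, so Neyman allocation gives it the largest share; its higher cost pulls that share back down under optimal allocation.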
Use `min_n` and `max_n` in `draw()` to constrain stratum sample sizes when using allocation methods:
```r
# Ensure at least 2 per stratum (minimum for variance estimation)
sampling_design() |>
  stratify_by(region, alloc = "neyman", variance = niger_eas_variance) |>
  draw(n = 300, min_n = 2) |>
  execute(niger_eas, seed = 42)

# Cap large strata, ensure minimum representation
# (`frame` is a placeholder for your own sampling frame)
sampling_design() |>
  stratify_by(region, alloc = "proportional") |>
  draw(n = 500, min_n = 10, max_n = 100) |>
  execute(frame, seed = 42)
```

For custom stratum-specific sizes or rates, pass a data frame to `n` or `frac` in `draw()`:
```r
# Custom allocation with data frame
# (`frame` is a placeholder for a sampling frame with these four regions)
sizes_df <- data.frame(
  region = c("North", "South", "East", "West"),
  n = c(100, 200, 150, 100)
)
sample <- sampling_design() |>
  stratify_by(region) |>
  draw(n = sizes_df) |>
  execute(frame, seed = 42)

# Neyman allocation
data(niger_eas_variance)
sample <- sampling_design() |>
  stratify_by(region, alloc = "neyman", variance = niger_eas_variance) |>
  draw(n = 300) |>
  execute(niger_eas, seed = 42)
```

Synthetic datasets for learning and testing:
| Dataset | Description | Rows |
|---|---|---|
| `niger_eas` | DHS-style enumeration areas (Niger) | ~1,500 |
| `uganda_farms` | LSMS-style agricultural frame (Uganda) | ~800 |
| `kenya_health` | Health facility frame (Kenya) | ~3,000 |
| `tanzania_schools` | School survey frame (Tanzania) | ~2,500 |
| `nigeria_business` | Enterprise survey frame (Nigeria) | ~10,000 |
Plus auxiliary data: `niger_eas_variance`, `niger_eas_cost`.
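The examples below pair SAS `PROC SURVEYSELECT` and SPSS `CSPLAN` code with a comparable samplyr design.

```sas
proc surveyselect data=frame method=pps n=50 seed=42;
  strata region;
  cluster school;
  size enrollment;
run;
```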
```r
sampling_design() |>
  stratify_by(region) |>
  cluster_by(school) |>
  draw(n = 50, method = "pps_brewer", mos = enrollment) |>
  execute(frame, seed = 1)
```

```sas
proc surveyselect data=frame method=srs n=500 seed=42;
  strata region / alloc=neyman var=variance_data allocmin=2 allocmax=100;
run;
```
```r
sampling_design() |>
  stratify_by(region, alloc = "neyman", variance = variance_data) |>
  draw(n = 500, min_n = 2, max_n = 100) |>
  execute(frame, seed = 2)
```

```sas
proc surveyselect data=frame method=sys samprate=0.02 seed=42 round=nearest;
  strata State;
run;
```
```r
sampling_design() |>
  stratify_by(State) |>
  draw(frac = 0.02, method = "systematic", round = "nearest") |>
  execute(frame, seed = 3)
```

```
CSPLAN SAMPLE
  /PLAN FILE='myplan.csplan'
  /DESIGN STRATA=region CLUSTER=school
  /METHOD TYPE=PPS_WOR
  /SIZE VALUE=50
  /MOS VARIABLE=enrollment.
```
```r
sampling_design() |>
  stratify_by(region) |>
  cluster_by(school) |>
  draw(n = 50, method = "pps_brewer", mos = enrollment) |>
  execute(frame, seed = 4)
```