A tidy grammar for survey sampling in R. samplyr provides a minimal set of composable verbs for stratified, clustered, and multi-stage sampling designs.
# Install sondage first (sampling algorithms backend)
remotes::install_gitlab("dickoa/sondage")
remotes::install_gitlab("dickoa/samplyr")
pak::pkg_install("dickoa/samplyr") # GitHub mirror

samplyr is built around a simple idea: sampling code should read like its English description.
library(samplyr)
# "Stratify by region, proportionally allocate 500 samples, execute"
sampling_design() |>
  stratify_by(region, alloc = "proportional") |>
  draw(n = 500) |>
  execute(frame, seed = 42)

The package uses 5 verbs and 1 modifier:
| Function | Purpose |
|---|---|
| sampling_design() | Create a new sampling design |
| stratify_by() | Define stratification and allocation |
| cluster_by() | Define cluster/PSU variable |
| draw() | Specify sample size and method |
| execute() | Run the design on a frame |
| stage() | Delimit stages in multi-stage designs |

library(samplyr)
data(niger_eas)
# Simple random sample
sample <- sampling_design() |>
  draw(n = 100) |>
  execute(niger_eas, seed = 42)

# Stratified proportional allocation
sample <- sampling_design() |>
  stratify_by(region, alloc = "proportional") |>
  draw(n = 300) |>
  execute(niger_eas, seed = 42)

# PPS cluster sampling
sample <- sampling_design() |>
  cluster_by(ea_id) |>
  draw(n = 50, method = "pps_brewer", mos = hh_count) |>
  execute(niger_eas, seed = 42)

Use stage() to define multi-stage designs. This example selects districts with PPS, then samples schools within each selected district:
data(tanzania_schools)
# Add district-level measure of size
schools_frame <- tanzania_schools |>
  dplyr::group_by(district) |>
  dplyr::mutate(district_enrollment = sum(enrollment)) |>
  dplyr::ungroup()

# Two-stage design: 10 districts, 5 schools per district
sample <- sampling_design() |>
  stage(label = "Districts") |>
  cluster_by(district) |>
  draw(n = 10, method = "pps_brewer", mos = district_enrollment) |>
  stage(label = "Schools") |>
  draw(n = 5) |>
  execute(schools_frame, seed = 42)

Execute stages separately when fieldwork happens between stages:
design <- sampling_design() |>
  stage(label = "Districts") |>
  cluster_by(district) |>
  draw(n = 10, method = "pps_brewer", mos = district_enrollment) |>
  stage(label = "Schools") |>
  draw(n = 5)

# Execute stage 1 only
selected_districts <- execute(design, schools_frame, stages = 1, seed = 42)

# ... fieldwork ...

# Execute stage 2
final_sample <- selected_districts |> execute(schools_frame, seed = 43)

When stratifying, control how the total sample is distributed:
| Method | Description |
|---|---|
| (none) | n applies per stratum |
| equal | Same sample size in each stratum |
| proportional | Proportional to stratum size |
| neyman | Minimize variance (requires variance) |
| optimal | Minimize cost-variance (requires variance and cost) |

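To make the proportional and neyman rows concrete, here is a minimal base R sketch of the textbook formulas, n_h = n * N_h / N and n_h = n * N_h * S_h / sum(N_h * S_h). It is independent of samplyr: the `strata` data frame, its values, and the `round_to_total()` helper are hypothetical, and the package's own rounding and input conventions may differ. Note that the textbook Neyman formula uses standard deviations (square roots of the stratum variances).

```r
# Hypothetical stratum summaries (illustrative values, not package data)
strata <- data.frame(
  region = c("North", "South", "East", "West"),
  N      = c(400, 300, 200, 100),  # number of frame units per stratum
  sd     = c(10, 20, 5, 15)        # assumed within-stratum standard deviations
)
n_total <- 100

# Hypothetical helper: largest-remainder rounding, so that the integer
# stratum sizes add up exactly to the requested total
round_to_total <- function(x, total) {
  base  <- floor(x)
  short <- total - sum(base)
  if (short > 0) {
    idx <- order(x - base, decreasing = TRUE)[seq_len(short)]
    base[idx] <- base[idx] + 1
  }
  base
}

# Proportional allocation: n_h = n * N_h / N
prop_alloc <- round_to_total(n_total * strata$N / sum(strata$N), n_total)

# Neyman allocation: n_h = n * N_h * S_h / sum(N_h * S_h)
w <- strata$N * strata$sd
neyman_alloc <- round_to_total(n_total * w / sum(w), n_total)

prop_alloc    # 40 30 20 10
neyman_alloc  # 32 48 8 12
```

With these numbers, Neyman allocation moves sample away from the large but homogeneous North stratum toward the more variable South stratum, which is the effect the variance input is meant to produce.
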
Use min_n and max_n in draw() to constrain stratum sample sizes when using allocation methods:
# Ensure at least 2 per stratum (minimum for variance estimation)
sampling_design() |>
  stratify_by(region, alloc = "neyman", variance = niger_eas_variance) |>
  draw(n = 300, min_n = 2) |>
  execute(niger_eas, seed = 42)

# Cap large strata, ensure minimum representation
sampling_design() |>
  stratify_by(region, alloc = "proportional") |>
  draw(n = 500, min_n = 10, max_n = 100) |>
  execute(frame, seed = 42)

For custom stratum-specific sizes or rates, pass a data frame to n or frac in draw():
# Custom allocation with data frame
sizes_df <- data.frame(
  region = c("North", "South", "East", "West"),
  n = c(100, 200, 150, 100)
)
sample <- sampling_design() |>
  stratify_by(region) |>
  draw(n = sizes_df) |>
  execute(frame, seed = 42)

# Neyman allocation
data(niger_eas_variance)
sample <- sampling_design() |>
  stratify_by(region, alloc = "neyman", variance = niger_eas_variance) |>
  draw(n = 300) |>
  execute(niger_eas, seed = 42)

Synthetic datasets for learning and testing:
| Dataset | Description | Rows |
|---|---|---|
| niger_eas | DHS-style enumeration areas (Niger) | ~1,500 |
| uganda_farms | LSMS-style agricultural frame (Uganda) | ~800 |
| kenya_health | Health facility frame (Kenya) | ~3,000 |
| tanzania_schools | School survey frame (Tanzania) | ~2,500 |
| nigeria_business | Enterprise survey frame (Nigeria) | ~10,000 |

Plus auxiliary data: niger_eas_variance, niger_eas_cost

proc surveyselect data=frame method=pps n=50 seed=42;
  strata region;
  cluster school;
  size enrollment;
run;

sampling_design() |>
  stratify_by(region) |>
  cluster_by(school) |>
  draw(n = 50, method = "pps_brewer", mos = enrollment) |>
  execute(frame, seed = 42)

proc surveyselect data=frame method=srs n=500 seed=42;
  strata region / alloc=neyman var=variance_data allocmin=2 allocmax=100;
run;

sampling_design() |>
  stratify_by(region, alloc = "neyman", variance = variance_data) |>
  draw(n = 500, min_n = 2, max_n = 100) |>
  execute(frame, seed = 42)

CSPLAN SAMPLE
  /PLAN FILE='myplan.csplan'
  /DESIGN STRATA=region CLUSTER=school
  /METHOD TYPE=PPS_WOR
  /SIZE VALUE=50
  /MOS VARIABLE=enrollment.

sampling_design() |>
  stratify_by(region) |>
  cluster_by(school) |>
  draw(n = 50, method = "pps_brewer", mos = enrollment) |>
  execute(frame, seed = 42)
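
The PPS draws in the comparisons above select clusters with probability proportional to a measure of size. As a rough illustration of what that implies for any fixed-size PPS design without replacement, the base R sketch below computes the target first-order inclusion probabilities, pi_i = n * mos_i / sum(mos_i), and the corresponding base weights. It does not reproduce the selection algorithm behind pps_brewer, and the `mos` values are hypothetical.

```r
# Hypothetical measures of size (e.g. enrollment) for six clusters
mos <- c(120, 80, 300, 50, 250, 200)
n   <- 3  # number of clusters to select

# Target inclusion probabilities: pi_i = n * mos_i / sum(mos).
# This simple form assumes no unit is large enough to push pi_i above 1.
pik <- n * mos / sum(mos)
stopifnot(all(pik <= 1))

# Base (design) weights are the inverse inclusion probabilities
base_weight <- 1 / pik

data.frame(mos = mos, pik = pik, base_weight = base_weight)
```

The inclusion probabilities sum to n, so larger clusters are more likely to be selected and carry proportionally smaller weights.
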