Overview
samplyr provides a tidy grammar for specifying and
executing survey sampling designs. The package is built around a minimal
set of composable verbs that handle stratified, clustered, and
multi-stage sampling.
For the statistical semantics and assumptions underlying each
operation, see vignette("design-semantics").
The core idea is that sampling code should read like its English description:
library(samplyr)
library(dplyr)
# "Stratify by region, proportionally allocate 500 samples, execute"
sampling_design(title = "My sampling design") |>
stratify_by(region, alloc = "proportional") |>
draw(n = 500) |>
execute(frame, seed = 1)To see how this works in practice, consider a real survey design from Lohr (2022) (Example 7.1), based on a 1991 study of bed net use in rural Gambia (D’Alessandro et al., 1994):
Malaria morbidity can be reduced by using bed nets impregnated with insecticide, but this is only effective if the bed nets are in widespread use. In 1991, a nationwide survey was designed to estimate the prevalence of bed net use in rural areas of the Gambia (D’Alessandro et al., 1994).
The sampling frame consisted of all rural villages of fewer than 3,000 people. The villages were stratified by three geographic regions (eastern, central, and western) and by whether the village had a public health clinic (PHC) or not. In each region five districts were chosen with probability proportional to the district population. In each district four villages were chosen, again with probability proportional to census population: two PHC villages and two non-PHC villages. Finally, six compounds were chosen more or less randomly from each village.
In samplyr, this three-stage stratified cluster design
translates directly into code:
design <- sampling_design(title = "Gambia bed nets") |>
add_stage() |>
stratify_by(region) |>
cluster_by(district) |>
draw(n = 5, method = "pps_brewer", mos = population) |>
add_stage() |>
stratify_by(phc) |>
cluster_by(village) |>
draw(n = 2, method = "pps_brewer", mos = population) |>
add_stage() |>
draw(n = 6)
design
#> ── Sampling Design: Gambia bed nets ────────────────────────────────────────────
#>
#> ℹ 3 stages
#>
#> ── Stage 1 ─────────────────────────────────────────────────────────────────────
#> • Strata: region
#> • Cluster: district
#> • Draw: n = 5, method = pps_brewer, mos = population
#>
#> ── Stage 2 ─────────────────────────────────────────────────────────────────────
#> • Strata: phc
#> • Cluster: village
#> • Draw: n = 2, method = pps_brewer, mos = population
#>
#> ── Stage 3 ─────────────────────────────────────────────────────────────────────
#> • Draw: n = 6, method = srsworAnd you can then use execute to get your sample.
sampling_design(title = "Gambia bed net") |>
add_stage() |>
stratify_by(region) |>
cluster_by(district) |>
draw(n = 5, method = "pps_brewer", mos = population) |>
add_stage() |>
stratify_by(phc) |>
cluster_by(village) |>
draw(n = 2, method = "pps_brewer", mos = population) |>
add_stage() |>
draw(n = 6) |>
execute(frame, seed = 1991)
# or
execute(design, frame, seed = 1991)The samplyr code mirrors the verbal description verb for
verb.
The Grammar
samplyr uses 5 verbs and 1 modifier:
| Function | Purpose |
|---|---|
sampling_design() |
Create a new sampling design |
stratify_by() |
Define stratification variables and allocation |
cluster_by() |
Define cluster/PSU variable |
draw() |
Specify sample size, method |
execute() |
Run the design on a frame |
add_stage() |
Delimit stages in multi-stage designs |
Deferred Column Resolution
stratify_by() and cluster_by() accept bare
column names and store them in the design object. Column names are
validated only when a frame is available (execute(),
as_svydesign()), which keeps design specification separate
from execution.
design <- sampling_design(title = "Stratified cluster sampling") |>
stratify_by(region, alloc = "proportional") |>
cluster_by(ea_id) |>
draw(n = 200)
sample <- execute(design, bfa_eas, seed = 101)Example Data
We’ll use the bfa_eas dataset throughout this vignette.
It is an enumeration area frame for Burkina Faso, with ~14,934 EAs
across 13 regions.
data(bfa_eas)
bfa_eas |>
glimpse()
#> Rows: 14,934
#> Columns: 12
#> $ ea_id <chr> "EA_00245", "EA_00246", "EA_00247", "EA_00248", "E…
#> $ region <fct> Boucle du Mouhoun, Boucle du Mouhoun, Boucle du Mo…
#> $ province <fct> Bale, Bale, Bale, Bale, Bale, Bale, Bale, Bale, Ba…
#> $ commune <fct> Bagassi, Bagassi, Bagassi, Bagassi, Bagassi, Bagas…
#> $ urban_rural <fct> Rural, Rural, Rural, Rural, Rural, Rural, Rural, R…
#> $ population <dbl> 927, 2184, 1666, 592, 1467, 1607, 709, 1461, 1053,…
#> $ households <int> 111, 262, 200, 71, 176, 193, 85, 176, 127, 122, 16…
#> $ area_km2 <dbl> 57.27, 2.57, 33.44, 31.73, 12.07, 25.53, 12.63, 16…
#> $ accessible <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TR…
#> $ dist_road_km <dbl> 11.6, 10.7, 23.3, 19.1, 27.1, 5.7, 21.1, 3.5, 11.5…
#> $ food_insecurity_pct <dbl> 27.9, 15.9, 19.1, 19.7, 14.2, 45.5, 15.4, 18.8, 34…
#> $ cost <dbl> 407, 243, 510, 360, 546, 428, 724, 408, 219, 387, …Simple Random Sampling
The most basic design selects n units at random from the frame.
design <- sampling_design(title = "Simple Random Sampling") |>
draw(n = 100)
design
#> ── Sampling Design: Simple Random Sampling ─────────────────────────────────────
#>
#> ℹ 1 stage
#>
#> ── Stage 1 ─────────────────────────────────────────────────────────────────────
#> • Draw: n = 100, method = srsworThe sampling_design object stores design metadata and
can be reused.
The result includes the original columns plus sampling metadata:
.weight (sampling weight), .weight_1,
.weight_2, etc. (per-stage weights). For with-replacement
methods, .draw_1, .draw_2, etc. each
independent draw.
sample |>
select(ea_id, region, urban_rural, .weight, .weight_1) |>
head()
#> # A tbl_sample: 6 × 5 | Simple Random Sampling
#> # Weights: 149.34 [149.34, 149.34]
#> ea_id region urban_rural .weight .weight_1
#> * <chr> <fct> <fct> <dbl> <dbl>
#> 1 EA_09278 Centre Urban 149. 149.
#> 2 EA_04429 Sud-Ouest Rural 149. 149.
#> 3 EA_04891 Sahel Rural 149. 149.
#> 4 EA_14044 Boucle du Mouhoun Rural 149. 149.
#> 5 EA_10483 Hauts-Bassins Rural 149. 149.
#> 6 EA_04316 Hauts-Bassins Rural 149. 149.Selection Methods
The method argument controls how units are selected. By
default, samplyr uses simple random sampling without
replacement (srswor).
Systematic sampling selects units at fixed intervals, which can improve precision when the frame is ordered.
sample_sys <- sampling_design(title = "Systematic") |>
draw(n = 100, method = "systematic") |>
execute(bfa_eas, seed = 2021)
head(sample_sys)
#> # A tbl_sample: 6 × 17 | Systematic
#> # Weights: 149.34 [149.34, 149.34]
#> ea_id region province commune urban_rural population households area_km2
#> * <chr> <fct> <fct> <fct> <fct> <dbl> <int> <dbl>
#> 1 EA_02176 Boucle d… Bale Boromo Rural 792 103 0.42
#> 2 EA_14063 Boucle d… Bale Yaho Rural 1147 171 8.77
#> 3 EA_12395 Boucle d… Banwa Solenzo Rural 1867 251 19.7
#> 4 EA_00757 Boucle d… Kossi Barani Rural 431 56 16.4
#> 5 EA_03968 Boucle d… Kossi Doumba… Rural 922 113 30.7
#> 6 EA_02113 Boucle d… Mouhoun Bondok… Rural 1700 259 48.7
#> # ℹ 9 more variables: accessible <lgl>, dist_road_km <dbl>,
#> # food_insecurity_pct <dbl>, cost <dbl>, .weight <dbl>, .sample_id <int>,
#> # .stage <int>, .weight_1 <dbl>, .fpc_1 <int>Sampling with replacement allows the same unit to be selected multiple times, useful for bootstrap procedures.
sample_wr <- sampling_design(title = "SRS WR") |>
draw(n = 100, method = "srswr") |>
execute(bfa_eas, seed = 123)
head(sample_wr)
#> # A tbl_sample: 6 × 18 | SRS WR
#> # Weights: 149.34 [149.34, 149.34]
#> ea_id region province commune urban_rural population households area_km2
#> * <chr> <fct> <fct> <fct> <fct> <dbl> <int> <dbl>
#> 1 EA_08950 Centre Kadiogo Ouagad… Urban 1618 244 0.2
#> 2 EA_08998 Centre Kadiogo Ouagad… Urban 1955 294 0.62
#> 3 EA_02967 Hauts-Ba… Houet Dande Rural 1132 123 31.1
#> 4 EA_07871 Est Gnagna Mani Rural 1096 159 10.4
#> 5 EA_05028 Nord Zondoma Goursi Rural 845 95 23.3
#> 6 EA_09473 Centre Kadiogo Ouagad… Urban 2069 312 0.31
#> # ℹ 10 more variables: accessible <lgl>, dist_road_km <dbl>,
#> # food_insecurity_pct <dbl>, cost <dbl>, .weight <dbl>, .sample_id <int>,
#> # .stage <int>, .weight_1 <dbl>, .fpc_1 <dbl>, .draw_1 <int>Using Sampling Fractions
Instead of a fixed sample size, you can specify a sampling fraction
with frac.
sample <- sampling_design() |>
draw(frac = 0.05) |>
execute(bfa_eas, seed = 1789)
nrow(sample)
#> [1] 747Bernoulli sampling selects each unit independently with the specified probability. This gives a random sample size, which can simplify field protocols.
sample <- sampling_design() |>
draw(frac = 0.05, method = "bernoulli") |>
execute(bfa_eas, seed = 1960)
nrow(sample)
#> [1] 716Rounding Behavior
When using frac, the computed sample size (N × frac)
must be rounded to an integer. By default, samplyr rounds
up (ceiling), which matches SAS SURVEYSELECT’s default behavior. Use the
round argument to control this:
-
"up"(default): round up, ensures adequate sample size -
"down": round down, minimum sample size -
"nearest": Standard mathematical rounding
# Frame of 105 units, frac = 0.1 → 10.5
frame_105 <- head(bfa_eas, 105)
# Default: rounds up to 11
nrow(sampling_design() |> draw(frac = 0.1) |> execute(frame_105, seed = 3))
#> [1] 11
# Round down to 10
nrow(sampling_design() |> draw(frac = 0.1, round = "down") |> execute(frame_105, seed = 3))
#> [1] 10
# Round to nearest (10)
nrow(sampling_design() |> draw(frac = 0.1, round = "nearest") |> execute(frame_105, seed = 3))
#> [1] 10Stratified Sampling
Stratification partitions the frame into non-overlapping groups (strata) and samples from each. This ensures representation from all subgroups and often improves precision.
Per-Stratum Sample Size
Without an allocation method, n specifies the sample
size within each stratum.
sample <- sampling_design() |>
stratify_by(region) |>
draw(n = 20) |>
execute(bfa_eas, seed = 123)
sample |>
count(region)
#> # A tibble: 13 × 2
#> region n
#> <fct> <int>
#> 1 Boucle du Mouhoun 20
#> 2 Cascades 20
#> 3 Centre 20
#> 4 Centre-Est 20
#> 5 Centre-Nord 20
#> 6 Centre-Ouest 20
#> 7 Centre-Sud 20
#> 8 Est 20
#> 9 Hauts-Bassins 20
#> 10 Nord 20
#> 11 Plateau-Central 20
#> 12 Sahel 20
#> 13 Sud-Ouest 20Allocation Methods
With an allocation method, n becomes the total sample
size to distribute across strata. Allocation methods are defined for
total sample-size allocation, so draw(frac = ...) is not
supported when alloc is set.
Proportional allocation distributes the sample proportionally to stratum sizes, ensuring the sample mirrors the population structure.
sample <- sampling_design(title = "Proportional Allocation") |>
stratify_by(region, alloc = "proportional") |>
draw(n = 300) |>
execute(bfa_eas, seed = 42)
sample |>
count(region) |>
mutate(pct = n / sum(n))
#> # A tibble: 13 × 3
#> region n pct
#> <fct> <int> <dbl>
#> 1 Boucle du Mouhoun 30 0.1
#> 2 Cascades 14 0.0467
#> 3 Centre 31 0.103
#> 4 Centre-Est 25 0.0833
#> 5 Centre-Nord 28 0.0933
#> 6 Centre-Ouest 26 0.0867
#> 7 Centre-Sud 12 0.04
#> 8 Est 32 0.107
#> 9 Hauts-Bassins 30 0.1
#> 10 Nord 24 0.08
#> 11 Plateau-Central 15 0.05
#> 12 Sahel 18 0.06
#> 13 Sud-Ouest 15 0.05Equal allocation assigns the same sample size to each stratum, regardless of population size. This maximizes precision for comparisons between strata.
sample <- sampling_design(title = "Equal Allocation") |>
stratify_by(region, alloc = "equal") |>
draw(n = 160) |>
execute(bfa_eas, seed = 2026)
sample |>
count(region)
#> # A tibble: 13 × 2
#> region n
#> <fct> <int>
#> 1 Boucle du Mouhoun 13
#> 2 Cascades 13
#> 3 Centre 13
#> 4 Centre-Est 13
#> 5 Centre-Nord 12
#> 6 Centre-Ouest 12
#> 7 Centre-Sud 12
#> 8 Est 12
#> 9 Hauts-Bassins 12
#> 10 Nord 12
#> 11 Plateau-Central 12
#> 12 Sahel 12
#> 13 Sud-Ouest 12Neyman allocation minimizes the variance of the
overall estimate by allocating more sample to strata with higher
variability. It requires prior information about stratum variances. Use
explicit stratification variable names when passing
variance.
data(bfa_eas_variance)
bfa_eas_variance
#> # A tibble: 13 × 2
#> region var
#> <fct> <dbl>
#> 1 Boucle du Mouhoun 96.9
#> 2 Cascades 63.6
#> 3 Centre 34.7
#> 4 Centre-Est 58.5
#> 5 Centre-Nord 94.9
#> 6 Centre-Ouest 32.4
#> 7 Centre-Sud 31.9
#> 8 Est 140.
#> 9 Hauts-Bassins 57.7
#> 10 Nord 155.
#> 11 Plateau-Central 55.9
#> 12 Sahel 145.
#> 13 Sud-Ouest 63.1
sample <- sampling_design(title = "Neyman Allocation") |>
stratify_by(region, alloc = "neyman", variance = bfa_eas_variance) |>
draw(n = 300) |>
execute(bfa_eas, seed = 12345)
sample |>
count(region)
#> # A tibble: 13 × 2
#> region n
#> <fct> <int>
#> 1 Boucle du Mouhoun 34
#> 2 Cascades 12
#> 3 Centre 21
#> 4 Centre-Est 22
#> 5 Centre-Nord 31
#> 6 Centre-Ouest 17
#> 7 Centre-Sud 8
#> 8 Est 43
#> 9 Hauts-Bassins 26
#> 10 Nord 35
#> 11 Plateau-Central 13
#> 12 Sahel 25
#> 13 Sud-Ouest 13Optimal allocation extends Neyman allocation by also
accounting for differential costs across strata. It minimizes variance
for a fixed budget (or cost for fixed precision). Use explicit
stratification variable names when passing
variance/cost.
data(bfa_eas_cost)
bfa_eas_cost
#> # A tibble: 13 × 2
#> region cost
#> <fct> <dbl>
#> 1 Boucle du Mouhoun 483
#> 2 Cascades 427
#> 3 Centre 240
#> 4 Centre-Est 400
#> 5 Centre-Nord 444
#> 6 Centre-Ouest 399
#> 7 Centre-Sud 418
#> 8 Est 631
#> 9 Hauts-Bassins 358
#> 10 Nord 583
#> 11 Plateau-Central 408
#> 12 Sahel 639
#> 13 Sud-Ouest 416
sample <- sampling_design(title = "Optimal Allocation") |>
stratify_by(region, alloc = "optimal",
variance = bfa_eas_variance,
cost = bfa_eas_cost) |>
draw(n = 300) |>
execute(bfa_eas, seed = 9876)
sample |>
count(region)
#> # A tibble: 13 × 2
#> region n
#> <fct> <int>
#> 1 Boucle du Mouhoun 33
#> 2 Cascades 13
#> 3 Centre 29
#> 4 Centre-Est 23
#> 5 Centre-Nord 31
#> 6 Centre-Ouest 18
#> 7 Centre-Sud 8
#> 8 Est 37
#> 9 Hauts-Bassins 29
#> 10 Nord 30
#> 11 Plateau-Central 14
#> 12 Sahel 21
#> 13 Sud-Ouest 14Power allocation (Bankier 1988) uses \(n_h \propto C_h X_h^q\), where
cv supplies \(C_h\),
importance supplies \(X_h\), and power is \(q \in [0, 1]\).
cv_df <- data.frame(
region = levels(bfa_eas$region),
cv = c(0.40, 0.35, 0.12, 0.20, 0.30, 0.18,
0.15, 0.38, 0.22, 0.32, 0.17, 0.45, 0.25)
)
importance_df <- data.frame(
region = levels(bfa_eas$region),
importance = c(60, 40, 120, 70, 80, 65,
50, 55, 90, 75, 45, 35, 30)
)
sample <- sampling_design(title = "Power Allocation") |>
stratify_by(
region,
alloc = "power",
cv = cv_df,
importance = importance_df,
power = 0.5
) |>
draw(n = 300) |>
execute(bfa_eas, seed = 777)
sample |>
count(region)
#> # A tibble: 13 × 2
#> region n
#> <fct> <int>
#> 1 Boucle du Mouhoun 35
#> 2 Cascades 25
#> 3 Centre 15
#> 4 Centre-Est 19
#> 5 Centre-Nord 31
#> 6 Centre-Ouest 16
#> 7 Centre-Sud 12
#> 8 Est 32
#> 9 Hauts-Bassins 24
#> 10 Nord 32
#> 11 Plateau-Central 13
#> 12 Sahel 30
#> 13 Sud-Ouest 16Sample Size Bounds
When using allocation methods, you can constrain stratum sample sizes
with min_n and max_n in
draw().
Minimum sample size ensures every stratum gets at least the specified number of units. This is essential for variance estimation (requires n ≥ 2 per stratum) or when you need reliable subgroup estimates.
# Neyman allocation with minimum 5 per stratum
sample <- sampling_design() |>
stratify_by(region, alloc = "neyman", variance = bfa_eas_variance) |>
draw(n = 300, min_n = 20) |>
execute(bfa_eas, seed = 1)
sample |>
count(region)
#> # A tibble: 13 × 2
#> region n
#> <fct> <int>
#> 1 Boucle du Mouhoun 27
#> 2 Cascades 20
#> 3 Centre 20
#> 4 Centre-Est 21
#> 5 Centre-Nord 26
#> 6 Centre-Ouest 20
#> 7 Centre-Sud 20
#> 8 Est 32
#> 9 Hauts-Bassins 23
#> 10 Nord 28
#> 11 Plateau-Central 20
#> 12 Sahel 23
#> 13 Sud-Ouest 20Maximum sample size caps large strata to prevent them from dominating the sample. This is useful when operational constraints limit how many units you can handle in any one stratum.
# Proportional allocation but cap at 60 per stratum
sample <- sampling_design() |>
stratify_by(region, alloc = "proportional") |>
draw(n = 300, max_n = 55) |>
execute(bfa_eas, seed = 40)
sample |>
count(region)
#> # A tibble: 13 × 2
#> region n
#> <fct> <int>
#> 1 Boucle du Mouhoun 30
#> 2 Cascades 14
#> 3 Centre 31
#> 4 Centre-Est 25
#> 5 Centre-Nord 28
#> 6 Centre-Ouest 26
#> 7 Centre-Sud 12
#> 8 Est 32
#> 9 Hauts-Bassins 30
#> 10 Nord 24
#> 11 Plateau-Central 15
#> 12 Sahel 18
#> 13 Sud-Ouest 15Both bounds together create a feasible range for each stratum:
sample <- sampling_design() |>
stratify_by(region, alloc = "proportional") |>
draw(n = 300, min_n = 20, max_n = 55) |>
execute(bfa_eas, seed = 2003)
sample |>
count(region)
#> # A tibble: 13 × 2
#> region n
#> <fct> <int>
#> 1 Boucle du Mouhoun 26
#> 2 Cascades 20
#> 3 Centre 27
#> 4 Centre-Est 23
#> 5 Centre-Nord 25
#> 6 Centre-Ouest 23
#> 7 Centre-Sud 20
#> 8 Est 27
#> 9 Hauts-Bassins 26
#> 10 Nord 23
#> 11 Plateau-Central 20
#> 12 Sahel 20
#> 13 Sud-Ouest 20When a stratum’s population is smaller than min_n, the
entire stratum is selected (capped at population size).
Custom allocation lets you specify exact sample
sizes or rates per stratum by passing a data frame to
draw().
sizes_df <- data.frame(
region = levels(bfa_eas$region),
n = c(20, 12, 25, 18, 22, 16, 14, 15, 20, 18, 12, 10, 8)
)
sample <- sampling_design() |>
stratify_by(region) |>
draw(n = sizes_df) |>
execute(bfa_eas, seed = 101)
sample |>
count(region)
#> # A tibble: 13 × 2
#> region n
#> <fct> <int>
#> 1 Boucle du Mouhoun 20
#> 2 Cascades 12
#> 3 Centre 25
#> 4 Centre-Est 18
#> 5 Centre-Nord 22
#> 6 Centre-Ouest 16
#> 7 Centre-Sud 14
#> 8 Est 15
#> 9 Hauts-Bassins 20
#> 10 Nord 18
#> 11 Plateau-Central 12
#> 12 Sahel 10
#> 13 Sud-Ouest 8Multiple Stratification Variables
You can stratify by multiple variables to create crossed strata. Here we stratify by both region and urban/rural status.
sample <- sampling_design() |>
stratify_by(region, urban_rural, alloc = "proportional") |>
draw(n = 300) |>
execute(bfa_eas, seed = 42)
sample |>
count(region, urban_rural) |>
head(10)
#> # A tibble: 10 × 3
#> region urban_rural n
#> <fct> <fct> <int>
#> 1 Boucle du Mouhoun Rural 30
#> 2 Cascades Rural 13
#> 3 Centre Rural 2
#> 4 Centre Urban 29
#> 5 Centre-Est Rural 23
#> 6 Centre-Est Urban 3
#> 7 Centre-Nord Rural 24
#> 8 Centre-Nord Urban 3
#> 9 Centre-Ouest Rural 22
#> 10 Centre-Ouest Urban 4Cluster Sampling
Cluster sampling selects groups (clusters) rather than individual units. This is practical when a complete list of individuals isn’t available or when travel costs make dispersed sampling expensive.
Use cluster_by() to specify the cluster identifier. All
units within selected clusters are included.
sample <- sampling_design() |>
cluster_by(ea_id) |>
draw(n = 50) |>
execute(bfa_eas, seed = 120)
sample |>
summarise(
n_clusters = n_distinct(ea_id),
n_units = n()
)
#> # A tibble: 1 × 2
#> n_clusters n_units
#> <int> <int>
#> 1 50 50PPS Sampling
Probability Proportional to Size (PPS) sampling selects units with probability proportional to a size measure. This is standard for cluster sampling where clusters vary in size.
Use the mos argument to specify the measure of size
variable. Here we select EAs with probability proportional to their
household count.
sample_pps <- sampling_design() |>
cluster_by(ea_id) |>
draw(n = 50, method = "pps_brewer", mos = households) |>
execute(bfa_eas, seed = 365)
sample_pps |>
summarise(
mean_hh = mean(households),
median_hh = median(households)
)
#> # A tibble: 1 × 2
#> mean_hh median_hh
#> <dbl> <dbl>
#> 1 232. 213Compared to simple random sampling, PPS tends to select larger clusters:
sample_srs <- sampling_design() |>
cluster_by(ea_id) |>
draw(n = 50) |>
execute(bfa_eas, seed = 42)
bind_rows(
sample_pps |> summarise(method = "PPS", mean_hh = mean(households)),
sample_srs |> summarise(method = "SRS", mean_hh = mean(households))
)
#> # A tibble: 2 × 2
#> method mean_hh
#> <chr> <dbl>
#> 1 PPS 232.
#> 2 SRS 203.PPS Methods
samplyr supports several PPS methods:
| Method | Description |
|---|---|
pps_brewer |
Brewer’s method - fast with good properties (recommended) |
pps_systematic |
PPS systematic - simple but can have periodicity issues |
pps_cps |
Conditional Poisson sampling (maximum entropy) |
pps_poisson |
PPS Poisson - random sample size (requires frac and
supports prn) |
pps_sps |
Sequential Poisson sampling (supports prn) |
pps_pareto |
Pareto sampling (supports prn) |
pps_multinomial |
PPS with replacement |
pps_chromy |
Chromy’s sequential PPS / minimum replacement |
Certainty Selection
In PPS sampling, very large units can have expected inclusion probabilities above 1. Certainty selection handles this by selecting those units with probability 1 and sampling the rest normally.
Use certainty_size for an absolute threshold or
certainty_prop for a proportional one.
sample_cert <- sampling_design(title = "Certainty selection") |>
draw(n = 50, method = "pps_brewer", mos = households,
certainty_size = 800) |>
execute(bfa_eas, seed = 123)
# How many EAs with more than 800 households
filter(bfa_eas, households > 800)
#> # A tibble: 8 × 12
#> ea_id region province commune urban_rural population households area_km2
#> <chr> <fct> <fct> <fct> <fct> <dbl> <int> <dbl>
#> 1 EA_01319 Centre-E… Boulgou Bitou Rural 4559 815 8.31
#> 2 EA_07029 Centre-E… Kourite… Koupela Urban 3971 805 26.9
#> 3 EA_07045 Centre-E… Kourite… Koupela Urban 4640 941 3.62
#> 4 EA_13207 Centre-O… Boulkie… Thiou Rural 5273 843 23.8
#> 5 EA_11261 Centre-O… Sanguie Reo Urban 5171 1111 24.5
#> 6 EA_11263 Centre-O… Sanguie Reo Urban 3743 804 17.4
#> 7 EA_00797 Nord Yatenga Barga Urban 5840 1032 6.03
#> 8 EA_02277 Plateau-… Ganzour… Boudri Rural 5273 807 23.0
#> # ℹ 4 more variables: accessible <lgl>, dist_road_km <dbl>,
#> # food_insecurity_pct <dbl>, cost <dbl>
# Check which units were selected with certainty
count(sample_cert, .certainty_1)
#> # A tibble: 2 × 2
#> .certainty_1 n
#> <lgl> <int>
#> 1 FALSE 42
#> 2 TRUE 8With certainty_prop, units whose MOS proportion exceeds
the threshold are taken with certainty. The process is iterative: after
removing certainty units, proportions are recomputed.
sample_certp <- sampling_design() |>
stratify_by(region) |>
draw(n = 50, method = "pps_systematic", mos = households,
certainty_prop = 0.05) |>
execute(bfa_eas, seed = 12345)
# How many EAs with more than 5% of the total hh by region
bfa_eas |>
mutate(mosprop = households/sum(households),
.by = region) |>
filter(mosprop >= 0.05)
#> # A tibble: 0 × 13
#> # ℹ 13 variables: ea_id <chr>, region <fct>, province <fct>, commune <fct>,
#> # urban_rural <fct>, population <dbl>, households <int>, area_km2 <dbl>,
#> # accessible <lgl>, dist_road_km <dbl>, food_insecurity_pct <dbl>,
#> # cost <dbl>, mosprop <dbl>
filter(sample_certp, .certainty_1)
#> Warning in min(w): no non-missing arguments to min; returning Inf
#> Warning in max(w): no non-missing arguments to max; returning -Inf
#> # A tbl_sample: 0 × 18
#> # Weights: NaN [Inf, -Inf]
#> # ℹ 18 variables: ea_id <chr>, region <fct>, province <fct>, commune <fct>,
#> # urban_rural <fct>, population <dbl>, households <int>, area_km2 <dbl>,
#> # accessible <lgl>, dist_road_km <dbl>, food_insecurity_pct <dbl>,
#> # cost <dbl>, .weight <dbl>, .sample_id <int>, .stage <int>, .weight_1 <dbl>,
#> # .fpc_1 <int>, .certainty_1 <lgl>Permanent Random Numbers (PRN)
Some PPS methods support permanent random numbers for sample
coordination across survey waves. Assign a stable U(0,1) value to each
frame unit, then pass it via prn:
set.seed(1)
bfa_eas$prn <- runif(nrow(bfa_eas))
sample_prn <- sampling_design(title = "Samples coordination with PRN") |>
cluster_by(ea_id) |>
draw(n = 50, method = "pps_sps", mos = households, prn = prn) |>
execute(bfa_eas, seed = 1)
sample_prn
#> # A tbl_sample: 50 × 19 | Samples coordination with PRN
#> # Weights: 286.17 [118.08, 856.92]
#> ea_id region province commune urban_rural population households area_km2
#> * <chr> <fct> <fct> <fct> <fct> <dbl> <int> <dbl>
#> 1 EA_12900 Boucle … Banwa Tansila Rural 1164 140 39.1
#> 2 EA_03170 Boucle … Mouhoun Dedoug… Rural 1209 215 3.81
#> 3 EA_12936 Boucle … Mouhoun Tcheri… Rural 1252 200 43.2
#> 4 EA_03269 Boucle … Sourou Di Rural 2337 305 1.15
#> 5 EA_12213 Cascades Comoe Sidera… Rural 763 122 9.4
#> 6 EA_12558 Cascades Comoe Soubak… Rural 2114 266 70.0
#> 7 EA_13387 Cascades Comoe Tiefora Rural 1129 145 26.0
#> 8 EA_08816 Centre Kadiogo Ouagad… Urban 1748 263 0.26
#> 9 EA_09321 Centre Kadiogo Ouagad… Urban 1754 264 0.27
#> 10 EA_09370 Centre Kadiogo Ouagad… Urban 2573 387 0.35
#> # ℹ 40 more rows
#> # ℹ 11 more variables: accessible <lgl>, dist_road_km <dbl>,
#> # food_insecurity_pct <dbl>, cost <dbl>, prn <dbl>, .weight <dbl>,
#> # .sample_id <int>, .stage <int>, .weight_1 <dbl>, .fpc_1 <int>,
#> # .certainty_1 <lgl>Because the PRN values are stable, re-executing the same design on
the same frame produces highly overlapping samples (positive
coordination). PRN is supported for bernoulli,
pps_poisson, pps_sps, and
pps_pareto. You can learn more about coordination with
permanent random numbers in the following vignette
vignette("sampling-coordination").
Control Sorting and Serpentine Ordering
Control sorting orders the frame before selection, providing implicit stratification. This is effective with systematic and sequential methods, where it ensures the sample spreads evenly across the sorted variables.
# Nested sorting: standard ascending order
sample_sort <- sampling_design() |>
draw(n = 100, method = "systematic",
control = c(region, province)) |>
execute(bfa_eas, seed = 98765)
head(sample_sort)
#> # A tbl_sample: 6 × 18
#> # Weights: 149.34 [149.34, 149.34]
#> ea_id region province commune urban_rural population households area_km2
#> * <chr> <fct> <fct> <fct> <fct> <dbl> <int> <dbl>
#> 1 EA_04258 Boucle d… Bale Fara Rural 1722 252 21.6
#> 2 EA_06885 Boucle d… Banwa Kouka Rural 1446 199 19.5
#> 3 EA_12448 Boucle d… Banwa Solenzo Rural 579 78 8.59
#> 4 EA_02527 Boucle d… Kossi Bouras… Rural 1268 143 42.5
#> 5 EA_08665 Boucle d… Kossi Nouna Rural 1298 192 23.8
#> 6 EA_03147 Boucle d… Mouhoun Dedoug… Rural 869 154 1.29
#> # ℹ 10 more variables: accessible <lgl>, dist_road_km <dbl>,
#> # food_insecurity_pct <dbl>, cost <dbl>, prn <dbl>, .weight <dbl>,
#> # .sample_id <int>, .stage <int>, .weight_1 <dbl>, .fpc_1 <int>serp() implements serpentine (snake) sorting, which
alternates direction at each level of the hierarchy. This minimizes
jumps at boundaries, matching SAS PROC SURVEYSELECT’s
SORT=SERP.
# Serpentine sorting
sample_serp <- sampling_design() |>
draw(n = 100, method = "systematic",
control = serp(region, province)) |>
execute(bfa_eas, seed = 1)
head(sample_serp)
#> # A tbl_sample: 6 × 18
#> # Weights: 149.34 [149.34, 149.34]
#> ea_id region province commune urban_rural population households area_km2
#> * <chr> <fct> <fct> <fct> <fct> <dbl> <int> <dbl>
#> 1 EA_00461 Boucle d… Bale Bana Rural 1348 176 7.3
#> 2 EA_11130 Boucle d… Bale Poura Urban 1660 267 19.0
#> 3 EA_12367 Boucle d… Banwa Solenzo Rural 984 132 23.2
#> 4 EA_12925 Boucle d… Banwa Tansila Rural 824 99 18.1
#> 5 EA_03758 Boucle d… Kossi Dokui Rural 1277 158 25.8
#> 6 EA_12543 Boucle d… Kossi Sono Rural 1246 188 78.8
#> # ℹ 10 more variables: accessible <lgl>, dist_road_km <dbl>,
#> # food_insecurity_pct <dbl>, cost <dbl>, prn <dbl>, .weight <dbl>,
#> # .sample_id <int>, .stage <int>, .weight_1 <dbl>, .fpc_1 <int>Control sorting can be combined with explicit stratification. The sorting is then applied within each stratum.
sample_serp2 <- sampling_design() |>
stratify_by(urban_rural) |>
draw(n = 100, method = "systematic",
control = serp(region, province)) |>
execute(bfa_eas, seed = 2)
head(sample_serp2)
#> # A tbl_sample: 6 × 18
#> # Weights: 122.78 [122.78, 122.78]
#> ea_id region province commune urban_rural population households area_km2
#> * <chr> <fct> <fct> <fct> <fct> <dbl> <int> <dbl>
#> 1 EA_00267 Boucle d… Bale Bagassi Rural 1319 159 17.1
#> 2 EA_10369 Boucle d… Bale Ouri Rural 1522 195 2.25
#> 3 EA_06892 Boucle d… Banwa Kouka Rural 1447 199 7.21
#> 4 EA_12429 Boucle d… Banwa Solenzo Rural 192 26 7.78
#> 5 EA_00764 Boucle d… Kossi Barani Rural 393 51 47.7
#> 6 EA_03766 Boucle d… Kossi Dokui Rural 1093 135 18.2
#> # ℹ 10 more variables: accessible <lgl>, dist_road_km <dbl>,
#> # food_insecurity_pct <dbl>, cost <dbl>, prn <dbl>, .weight <dbl>,
#> # .sample_id <int>, .stage <int>, .weight_1 <dbl>, .fpc_1 <int>Balanced Sampling
Balanced sampling selects a sample whose Horvitz-Thompson estimates
of auxiliary totals match (or nearly match) the population totals. This
improves precision for any variable correlated with the auxiliary
variables. samplyr implements the cube method (Deville
& Tille 2004) via method = "balanced".
Specify auxiliary variables with aux. Without
mos, inclusion probabilities are equal; with
mos, they are proportional to size:
# Equal-probability balanced sample
balanced_eq <- sampling_design() |>
draw(n = 100, method = "balanced",
aux = c(population, households)) |>
execute(bfa_eas, seed = 42)
balanced_eq
#> # A tbl_sample: 100 × 18
#> # Weights: 149.34 [149.34, 149.34]
#> ea_id region province commune urban_rural population households area_km2
#> * <chr> <fct> <fct> <fct> <fct> <dbl> <int> <dbl>
#> 1 EA_02157 Boucle … Bale Boromo Rural 129 17 8.12
#> 2 EA_08673 Boucle … Kossi Nouna Rural 1071 158 5.65
#> 3 EA_03203 Boucle … Mouhoun Dedoug… Rural 1810 321 28.9
#> 4 EA_06454 Boucle … Mouhoun Kona Rural 1400 194 33.1
#> 5 EA_04530 Boucle … Nayala Gassan Rural 1336 147 67.3
#> 6 EA_13871 Boucle … Sourou Tougan Rural 1204 165 7
#> 7 EA_08512 Cascades Comoe Niango… Rural 1146 157 83.9
#> 8 EA_12128 Cascades Comoe Sidera… Rural 1292 206 24.0
#> 9 EA_02946 Cascades Leraba Dakoro Rural 1254 160 33.6
#> 10 EA_12340 Cascades Leraba Sindou Rural 554 68 7.33
#> # ℹ 90 more rows
#> # ℹ 10 more variables: accessible <lgl>, dist_road_km <dbl>,
#> # food_insecurity_pct <dbl>, cost <dbl>, prn <dbl>, .weight <dbl>,
#> # .sample_id <int>, .stage <int>, .weight_1 <dbl>, .fpc_1 <int>The balancing property means the HT estimate of auxiliary totals is close to the true total:
pop_total <- sum(bfa_eas$population)
ht_total <- sum(balanced_eq$population * balanced_eq$.weight)
c(population = pop_total, ht_estimate = ht_total)
#> population ht_estimate
#> 20487532 20655813Balanced sampling combines naturally with stratification. When
stratified, samplyr calls the stratified cube algorithm
(Chauvet 2009) in a single pass, preserving the allocation computed by
stratify_by():
# Stratified PPS balanced sample
balanced_strat <- sampling_design() |>
stratify_by(region, alloc = "proportional") |>
draw(n = 300, method = "balanced", mos = households,
aux = c(population, area_km2)) |>
execute(bfa_eas, seed = 1960)
balanced_strat
#> # A tbl_sample: 300 × 18
#> # Weights: 48.57 [7.35, 169.35]
#> ea_id region province commune urban_rural population households area_km2
#> * <chr> <fct> <fct> <fct> <fct> <dbl> <int> <dbl>
#> 1 EA_00467 Boucle … Bale Bana Rural 1994 261 5.99
#> 2 EA_04220 Boucle … Bale Fara Rural 790 116 16.6
#> 3 EA_04249 Boucle … Bale Fara Rural 2148 314 24.8
#> 4 EA_10391 Boucle … Bale Pa Rural 1066 153 17.3
#> 5 EA_06858 Boucle … Banwa Kouka Rural 1133 156 17.5
#> 6 EA_06864 Boucle … Banwa Kouka Rural 1557 214 25.1
#> 7 EA_06892 Boucle … Banwa Kouka Rural 1447 199 7.21
#> 8 EA_12390 Boucle … Banwa Solenzo Rural 1799 242 19.2
#> 9 EA_12438 Boucle … Banwa Solenzo Rural 1341 180 0.46
#> 10 EA_12471 Boucle … Banwa Solenzo Rural 1857 249 53.4
#> # ℹ 290 more rows
#> # ℹ 10 more variables: accessible <lgl>, dist_road_km <dbl>,
#> # food_insecurity_pct <dbl>, cost <dbl>, prn <dbl>, .weight <dbl>,
#> # .sample_id <int>, .stage <int>, .weight_1 <dbl>, .fpc_1 <dbl>When used with cluster_by(), auxiliary values are
automatically aggregated (summed) to the cluster level before
selection.
Balanced sampling supports up to 2 stages. At stage 1, element-level auxiliary variables are aggregated to PSU totals; at stage 2, the cube method runs on the element-level frame within each selected PSU.
Multi-Stage Sampling
Multi-stage designs sample in stages: first select primary sampling units (PSUs), then sample within them. This is the standard approach for large-scale surveys when a complete frame of ultimate units isn’t available upfront.
Use add_stage() to delimit each stage of the design.
Two-Stage Design
For this example, we’ll use zwe_eas to demonstrate a
two-stage design: first select districts as PSUs, then sample EAs within
selected districts.
data(zwe_eas)
# First, let's see the structure: EAs nested within districts
zwe_eas |>
count(province, district) |>
head(10)
#> # A tibble: 10 × 3
#> province district n
#> <fct> <fct> <int>
#> 1 Bulawayo Bulawayo 453
#> 2 Harare Chitungwiza 161
#> 3 Harare Epworth 125
#> 4 Harare Harare 969
#> 5 Harare Harare Rural 224
#> 6 Manicaland Buhera 556
#> 7 Manicaland Chimanimani 206
#> 8 Manicaland Chipinge 489
#> 9 Manicaland Chipinge Urban 22
#> 10 Manicaland Makoni 668We need to create a measure of size for districts. Here we’ll use total households per district.
# Add district-level households as MOS
zwe_frame <- zwe_eas |>
mutate(district_hh = sum(households),
.by = district)Now we can run a two-stage design: select 10 districts with PPS, then sample 5 EAs within each.
twostage_design <- sampling_design(title = "Two-stage cluster sampling") |>
add_stage(label = "Districts") |>
cluster_by(district) |>
draw(n = 10, method = "pps_brewer", mos = district_hh) |>
add_stage(label = "EAs") |>
draw(n = 5)
twostage_design
#> ── Sampling Design: Two-stage cluster sampling ─────────────────────────────────
#>
#> ℹ 2 stages
#>
#> ── Stage 1: Districts ──────────────────────────────────────────────────────────
#> • Cluster: district
#> • Draw: n = 10, method = pps_brewer, mos = district_hh
#>
#> ── Stage 2: EAs ────────────────────────────────────────────────────────────────
#> • Draw: n = 5, method = srsworLet’s apply it to the frame.
sample_eas <- execute(twostage_design,
zwe_frame, seed = 3)
sample_eas |>
summarise(n_districts = n_distinct(district),
n_eas = n())
#> # A tibble: 1 × 2
#> n_districts n_eas
#> <int> <int>
#> 1 10 50Stratified Multi-Stage Design
Combining stratification with multi-stage sampling is standard for national surveys. Here we stratify by province (which is constant within districts), then select districts and EAs within each stratum.
sample_eas2 <- sampling_design(title = "Stratified two-stage cluster sampling") |>
add_stage(label = "Districts") |>
stratify_by(province) |>
cluster_by(district) |>
draw(n = 2, method = "pps_brewer", mos = district_hh) |>
add_stage(label = "EAs") |>
draw(n = 3) |>
execute(zwe_frame, seed = 4)
#> Warning: Sample size capped to population in 1 stratum/strata: "Bulawayo".
#> ℹ Requested total: 20. Actual total: 19.
sample_eas2 |>
count(province, district)
#> # A tibble: 19 × 3
#> province district n
#> <fct> <fct> <int>
#> 1 Bulawayo Bulawayo 3
#> 2 Harare Harare 3
#> 3 Harare Harare Rural 3
#> 4 Manicaland Buhera 3
#> 5 Manicaland Makoni 3
#> 6 Mashonaland Central Guruve 3
#> 7 Mashonaland Central Mount Darwin 3
#> 8 Mashonaland East Goromonzi 3
#> 9 Mashonaland East Murehwa 3
#> 10 Mashonaland West Sanyati 3
#> 11 Mashonaland West Zvimba 3
#> 12 Masvingo Bikita 3
#> 13 Masvingo Masvingo 3
#> 14 Matabeleland North Binga 3
#> 15 Matabeleland North Bubi 3
#> 16 Matabeleland South Gwanda 3
#> 17 Matabeleland South Umzingwane 3
#> 18 Midlands Gokwe South 3
#> 19 Midlands Zvishavane 3Operational Multi-Stage Sampling
In practice, multi-stage surveys often execute stages at different times. After selecting PSUs, field teams may need to conduct listing or other operations before the second stage can proceed.
The stages argument lets you execute only specific
stages.
design <- sampling_design() |>
add_stage(label = "Districts") |>
stratify_by(province) |>
cluster_by(district) |>
draw(n = 2, method = "pps_brewer", mos = district_hh) |>
add_stage(label = "EAs") |>
draw(n = 4)
# Execute stage 1 only
selected_districts <- execute(design, zwe_frame, stages = 1, seed = 42)
#> Warning: Sample size capped to population in 1 stratum/strata: "Bulawayo".
#> ℹ Requested total: 20. Actual total: 19.
selected_districts |>
distinct(province, district)
#> # A tibble: 19 × 2
#> province district
#> <fct> <fct>
#> 1 Bulawayo Bulawayo
#> 2 Harare Harare
#> 3 Harare Harare Rural
#> 4 Manicaland Chipinge
#> 5 Manicaland Nyanga
#> 6 Mashonaland Central Mazowe
#> 7 Mashonaland Central Mount Darwin
#> 8 Mashonaland East Mudzi
#> 9 Mashonaland East Murehwa
#> 10 Mashonaland West Chinhoyi
#> 11 Mashonaland West Kariba
#> 12 Masvingo Chivi
#> 13 Masvingo Masvingo
#> 14 Matabeleland North Tsholotsho
#> 15 Matabeleland North Umguza
#> 16 Matabeleland South Bulilima
#> 17 Matabeleland South Gwanda
#> 18 Midlands Shurugwi Town
#> 19 Midlands ZvishavaneAfter fieldwork or preparation is complete, execute the remaining stages.
final_sample <- selected_districts |>
execute(zwe_frame, seed = 43)
final_sample |>
count(province, district)
#> # A tibble: 19 × 3
#> province district n
#> <fct> <fct> <int>
#> 1 Bulawayo Bulawayo 4
#> 2 Harare Harare 4
#> 3 Harare Harare Rural 4
#> 4 Manicaland Chipinge 4
#> 5 Manicaland Nyanga 4
#> 6 Mashonaland Central Mazowe 4
#> 7 Mashonaland Central Mount Darwin 4
#> 8 Mashonaland East Mudzi 4
#> 9 Mashonaland East Murehwa 4
#> 10 Mashonaland West Chinhoyi 4
#> 11 Mashonaland West Kariba 4
#> 12 Masvingo Chivi 4
#> 13 Masvingo Masvingo 4
#> 14 Matabeleland North Tsholotsho 4
#> 15 Matabeleland North Umguza 4
#> 16 Matabeleland South Bulilima 4
#> 17 Matabeleland South Gwanda 4
#> 18 Midlands Shurugwi Town 4
#> 19 Midlands Zvishavane 4Replicated Sampling
The reps parameter in execute() draws
multiple independent samples from the same frame under the same design.
The output is a single stacked tbl_sample with a
.replicate column identifying each replicate. This is
useful for simulation studies, repeated-sampling variance estimation, or
quality control.
rep_sample <- sampling_design() |>
stratify_by(region, alloc = "proportional") |>
draw(n = 300) |>
execute(bfa_eas, seed = 42, reps = 5)
rep_sample |>
count(.replicate)
#> # A tibble: 5 × 2
#> .replicate n
#> <int> <int>
#> 1 1 300
#> 2 2 300
#> 3 3 300
#> 4 4 300
#> 5 5 300Each replicate is a fully independent draw. Replicate r
uses seed seed + r - 1, so the first replicate matches a
standalone execution with the same seed:
standalone <- sampling_design() |>
stratify_by(region, alloc = "proportional") |>
draw(n = 300) |>
execute(bfa_eas, seed = 42)
rep1 <- rep_sample |> filter(.replicate == 1)
identical(sort(rep1$ea_id), sort(standalone$ea_id))
#> [1] TRUEReplicated sampling also works with multi-stage designs. When continuing from a replicated partial execution, each replicate is continued independently:
design <- sampling_design() |>
add_stage(label = "Districts") |>
cluster_by(district) |>
draw(n = 10, method = "pps_brewer", mos = district_hh) |>
add_stage(label = "EAs") |>
draw(n = 5)
stage1 <- execute(design, zwe_frame, stages = 1, seed = 1, reps = 3)
final <- execute(stage1, zwe_frame, seed = 100)
final |>
summarise(n_districts = n_distinct(district), n_eas = n(),
.by = .replicate)
#> # A tibble: 3 × 3
#> .replicate n_districts n_eas
#> <int> <int> <int>
#> 1 1 10 50
#> 2 2 10 50
#> 3 3 10 50Survey export functions require a single replicate. Filter first:
one_rep <- rep_sample |> filter(.replicate == 1)
svy <- as_svydesign(one_rep)Note: reps cannot be combined with panels
or with stages that use permanent random numbers (PRN). PRN produces
identical samples across replicates, so coordination across waves should
use an explicit loop with different PRN vectors instead.
Working with Results
Accessing the Design
The sampling design is attached to the result and can be retrieved
with get_design().
sample <- sampling_design() |>
stratify_by(region, alloc = "proportional") |>
draw(n = 200) |>
execute(bfa_eas, seed = 2024)
design <- get_design(sample)
design
#> ── Sampling Design ─────────────────────────────────────────────────────────────
#>
#> ℹ 1 stage
#>
#> ── Stage 1 ─────────────────────────────────────────────────────────────────────
#> • Strata: region (proportional)
#> • Draw: n = 200, method = srsworand accessing element of the design
design$stages[[1]]$strata$vars
#> [1] "region"Weight Diagnostics
The summary() method reports weight diagnostics
including Kish’s effective sample size and design effect due to unequal
weighting. For the full design effect (including clustering and
stratification), use survey::svymean() with
deff = TRUE after converting with
as_svydesign().
summary(sample)
#> ── Sample Summary ──────────────────────────────────────────────────────────────
#>
#> ℹ n = 200 | stages = 1/1 | seed = 2024
#>
#> ── Design: Stage 1 ─────────────────────────────────────────────────────────────
#> • Strata: region (proportional)
#> • Method: srswor
#>
#> ── Allocation: Stage 1 ─────────────────────────────────────────────────────────
#> region N_h n_h f_h
#> Boucle du Mouhoun 1510 20 0.0132
#> Cascades 667 9 0.0135
#> Centre 1556 21 0.0135
#> Centre-Est 1259 17 0.0135
#> Centre-Nord 1375 19 0.0138
#> Centre-Ouest 1287 17 0.0132
#> Centre-Sud 615 8 0.0130
#> Est 1590 21 0.0132
#> Hauts-Bassins 1483 20 0.0135
#> Nord 1211 16 0.0132
#> Plateau-Central 757 10 0.0132
#> Sahel 902 12 0.0133
#> Sud-Ouest 722 10 0.0139
#> ───── ─── ──────
#> Total 14934 200 0.0134
#>
#> ── Weights ─────────────────────────────────────────────────────────────────────
#> • Range: [72.2, 76.88]
#> • Mean: 74.67 · CV: 0.02
#> • DEFF: 1 · n_eff: 200Method Reference
Selection Methods
| Method | Sample Size | Replacement | Notes |
|---|---|---|---|
srswor |
Fixed | No | Default, general purpose |
srswr |
Fixed | Yes | Bootstrap, rare populations |
systematic |
Fixed | No | Ordered frames |
bernoulli |
Random | No | Simple field protocols |
pps_brewer |
Fixed | No | Recommended PPS method |
pps_systematic |
Fixed | No | Simple PPS |
pps_cps |
Fixed | No | Maximum entropy, exact joint inclusion probs |
pps_poisson |
Random | No | PPS with random size |
pps_sps |
Fixed | No | Sequential Poisson, supports PRN |
pps_pareto |
Fixed | No | Pareto sampling, supports PRN |
pps_multinomial |
Fixed | Yes | PPS with replacement |
pps_chromy |
Fixed | PMR.1 | Chromy’s sequential PPS |
balanced |
Fixed | No | Cube method, balances on aux variables |
Allocation Methods
| Method | Description | Requirements |
|---|---|---|
| (none) | n per stratum | |
equal |
Same n per stratum | |
proportional |
n ∝ stratum size | |
neyman |
Minimize variance |
variance data frame |
optimal |
Minimize variance/cost |
variance + cost data frames |
power |
Compromise: \(n_h \propto C_h X_h^q\) |
cv + importance (+ optional
power) |
For custom stratum-specific sizes or rates, pass a data frame
directly to n or frac in draw().
When alloc is specified, use n (not
frac).
Best Practices
- Always set a seed for reproducibility
- Use meaningful stage labels for documentation
-
Validate designs with
validate_frame()before execution -
Check weight distributions with
summary()on the sample
design <- sampling_design(title = "National Health Survey 2024") |>
add_stage(label = "Enumeration Areas") |>
stratify_by(region, urban_rural) |>
cluster_by(ea_id) |>
draw(n = 5, method = "pps_brewer", mos = households) |>
add_stage(label = "Households") |>
draw(n = 20)
print(design)
#> ── Sampling Design: National Health Survey 2024 ────────────────────────────────
#>
#> ℹ 2 stages
#>
#> ── Stage 1: Enumeration Areas ──────────────────────────────────────────────────
#> • Strata: region, urban_rural
#> • Cluster: ea_id
#> • Draw: n = 5, method = pps_brewer, mos = households
#>
#> ── Stage 2: Households ─────────────────────────────────────────────────────────
#> • Draw: n = 20, method = srsworFrame Validation
Use validate_frame() to check that a frame has all the
variables a design requires before executing.
# This passes
validate_frame(design, bfa_eas)
# This fails with a clear message
bad_frame <- data.frame(id = 1:100, value = rnorm(100))
validate_frame(design, bad_frame)
#> Error:
#> ! Frame validation failed:
#> ✖ Enumeration Areas: missing stratification variable: "region" and
#> "urban_rural"
#> ✖ Enumeration Areas: missing cluster variable: "ea_id"
#> ✖ Enumeration Areas: missing MOS variable: `households`