Overview
samplyr provides a tidy grammar for specifying and
executing survey sampling designs. The package is built around a minimal
set of composable verbs that handle stratified, clustered, and
multi-stage sampling.
For the statistical semantics and assumptions underlying each
operation, see vignette("design-semantics").
The core idea is that sampling code should read like its English description:
library(samplyr)
library(dplyr)
# "Stratify by region, proportionally allocate 500 samples, execute"
sampling_design(title = "My sampling design") |>
stratify_by(region, alloc = "proportional") |>
draw(n = 500) |>
execute(frame, seed = 1)To see how this works in practice, consider a real survey design from Lohr (2022) (Example 7.1), based on a 1991 study of bed net use in rural Gambia (D’Alessandro et al., 1994):
Malaria morbidity can be reduced by using bed nets impregnated with insecticide, but this is only effective if the bed nets are in widespread use. In 1991, a nationwide survey was designed to estimate the prevalence of bed net use in rural areas of the Gambia (D’Alessandro et al., 1994).
The sampling frame consisted of all rural villages of fewer than 3,000 people. The villages were stratified by three geographic regions (eastern, central, and western) and by whether the village had a public health clinic (PHC) or not. In each region five districts were chosen with probability proportional to the district population. In each district four villages were chosen, again with probability proportional to census population: two PHC villages and two non-PHC villages. Finally, six compounds were chosen more or less randomly from each village.
In samplyr, this three-stage stratified cluster design
translates directly into code:
design <- sampling_design(title = "Gambia bed nets") |>
add_stage() |>
stratify_by(region) |>
cluster_by(district) |>
draw(n = 5, method = "pps_brewer", mos = population) |>
add_stage() |>
stratify_by(phc) |>
cluster_by(village) |>
draw(n = 2, method = "pps_brewer", mos = population) |>
add_stage() |>
draw(n = 6)
design
#> ── Sampling Design: Gambia bed nets ────────────────────────────────────────────
#>
#> ℹ 3 stages
#>
#> ── Stage 1 ─────────────────────────────────────────────────────────────────────
#> • Strata: region
#> • Cluster: district
#> • Draw: n = 5, method = pps_brewer, mos = population
#>
#> ── Stage 2 ─────────────────────────────────────────────────────────────────────
#> • Strata: phc
#> • Cluster: village
#> • Draw: n = 2, method = pps_brewer, mos = population
#>
#> ── Stage 3 ─────────────────────────────────────────────────────────────────────
#> • Draw: n = 6, method = srsworAnd you can then use execute to get your sample.
sampling_design(title = "Gambia bed net") |>
add_stage() |>
stratify_by(region) |>
cluster_by(district) |>
draw(n = 5, method = "pps_brewer", mos = population) |>
add_stage() |>
stratify_by(phc) |>
cluster_by(village) |>
draw(n = 2, method = "pps_brewer", mos = population) |>
add_stage() |>
draw(n = 6) |>
execute(frame, seed = 1991)
# or
execute(design, frame, seed = 1991)The samplyr code mirrors the verbal description verb for
verb.
The Grammar
samplyr uses 5 verbs and 1 modifier:
| Function | Purpose |
|---|---|
sampling_design() |
Create a new sampling design |
stratify_by() |
Define stratification variables and allocation |
cluster_by() |
Define cluster/PSU variable |
draw() |
Specify sample size, method |
execute() |
Run the design on a frame |
add_stage() |
Delimit stages in multi-stage designs |
Deferred Column Resolution
stratify_by() and cluster_by() accept bare
column names and store them in the design object. Column names are
validated only when a frame is available (execute(),
as_svydesign()), which keeps design specification separate
from execution.
design <- sampling_design(title = "Stratified cluster sampling") |>
stratify_by(region, alloc = "proportional") |>
cluster_by(ea_id) |>
draw(n = 200)
sample <- execute(design, bfa_eas, seed = 101)Example Data
We’ll use the bfa_eas dataset throughout this vignette.
It is an enumeration area frame for Burkina Faso, with ~14,934 EAs
across 13 regions.
data(bfa_eas)
bfa_eas |>
glimpse()
#> Rows: 44,570
#> Columns: 12
#> $ ea_id <int> 11759, 11760, 11761, 11762, 11763, 11764, 11765, 1…
#> $ region <fct> Boucle du Mouhoun, Boucle du Mouhoun, Boucle du Mo…
#> $ province <fct> Bale, Bale, Bale, Bale, Bale, Bale, Bale, Bale, Ba…
#> $ commune <fct> Bagassi, Bagassi, Bagassi, Bagassi, Bagassi, Bagas…
#> $ urban_rural <fct> Rural, Rural, Rural, Rural, Rural, Rural, Rural, R…
#> $ population <dbl> 56, 204, 63, 257, 48, 139, 160, 2184, 947, 28, 326…
#> $ households <int> 7, 25, 8, 31, 6, 17, 19, 262, 114, 3, 39, 44, 28, …
#> $ area_km2 <dbl> 9.21, 8.75, 8.54, 8.92, 4.89, 8.51, 8.46, 2.57, 1.…
#> $ accessible <lgl> TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, T…
#> $ dist_road_km <dbl> 19.7, 41.0, 13.9, 39.3, 28.8, 100.1, 34.9, 24.8, 6…
#> $ food_insecurity_pct <dbl> 1.8, 1.3, 1.9, 5.0, 0.4, 1.6, 1.6, 1.6, 0.2, 2.8, …
#> $ cost <dbl> 272, 739, 216, 771, 744, 436, 733, 728, 316, 622, …Simple Random Sampling
The most basic design selects n units at random from the frame.
design <- sampling_design(title = "Simple Random Sampling") |>
draw(n = 100)
design
#> ── Sampling Design: Simple Random Sampling ─────────────────────────────────────
#>
#> ℹ 1 stage
#>
#> ── Stage 1 ─────────────────────────────────────────────────────────────────────
#> • Draw: n = 100, method = srsworThe sampling_design object stores design metadata and
can be reused.
The result includes the original columns plus sampling metadata:
.weight (sampling weight), .weight_1,
.weight_2, etc. (per-stage weights). For with-replacement
methods, .draw_1, .draw_2, etc. each
independent draw.
sample |>
select(ea_id, region, urban_rural, .weight, .weight_1) |>
head()
#> # A tbl_sample: 6 × 5 | Simple Random Sampling
#> # Weights: 445.7 [445.7, 445.7]
#> ea_id region urban_rural .weight .weight_1
#> * <int> <fct> <fct> <dbl> <dbl>
#> 1 5866 Centre-Nord Urban 446. 446.
#> 2 24905 Nord Rural 446. 446.
#> 3 5145 Centre-Nord Urban 446. 446.
#> 4 32861 Est Rural 446. 446.
#> 5 4797 Sud-Ouest Rural 446. 446.
#> 6 25986 Boucle du Mouhoun Rural 446. 446.Selection Methods
The method argument controls how units are selected. By
default, samplyr uses simple random sampling without
replacement (srswor).
Systematic sampling selects units at fixed intervals, which can improve precision when the frame is ordered.
sample_sys <- sampling_design(title = "Systematic") |>
draw(n = 100, method = "systematic") |>
execute(bfa_eas, seed = 2021)
head(sample_sys)
#> # A tbl_sample: 6 × 17 | Systematic
#> # Weights: 445.7 [445.7, 445.7]
#> ea_id region province commune urban_rural population households area_km2
#> * <int> <fct> <fct> <fct> <fct> <dbl> <int> <dbl>
#> 1 1182 Boucle du M… Bale Boromo Rural 177 23 0.69
#> 2 11703 Boucle du M… Bale Yaho Rural 402 60 9.18
#> 3 9724 Boucle du M… Banwa Sanaba Rural 51 6 8.02
#> 4 11018 Boucle du M… Banwa Tansila Rural 117 14 8.6
#> 5 12763 Boucle du M… Kossi Djibas… Rural 1238 158 0.95
#> 6 33147 Boucle du M… Kossi Nouna Rural 468 69 13.2
#> # ℹ 9 more variables: accessible <lgl>, dist_road_km <dbl>,
#> # food_insecurity_pct <dbl>, cost <dbl>, .weight <dbl>, .sample_id <int>,
#> # .stage <int>, .weight_1 <dbl>, .fpc_1 <int>Sampling with replacement allows the same unit to be selected multiple times, useful for bootstrap procedures.
sample_wr <- sampling_design(title = "SRS WR") |>
draw(n = 100, method = "srswr") |>
execute(bfa_eas, seed = 123)
head(sample_wr)
#> # A tbl_sample: 6 × 18 | SRS WR
#> # Weights: 445.7 [445.7, 445.7]
#> ea_id region province commune urban_rural population households area_km2
#> * <int> <fct> <fct> <fct> <fct> <dbl> <int> <dbl>
#> 1 26077 Boucle du M… Mouhoun Dedoug… Rural 176 31 8.96
#> 2 41311 Hauts-Bassi… Houet Bobo-D… Urban 844 135 0.11
#> 3 41094 Hauts-Bassi… Houet Bobo-D… Urban 1121 180 0.79
#> 4 41555 Plateau-Cen… Oubrite… Dapeol… Rural 418 54 3.19
#> 5 21009 Boucle du M… Mouhoun Bondok… Rural 556 85 9.16
#> 6 11914 Sahel Seno Bani Rural 419 90 7.95
#> # ℹ 10 more variables: accessible <lgl>, dist_road_km <dbl>,
#> # food_insecurity_pct <dbl>, cost <dbl>, .weight <dbl>, .sample_id <int>,
#> # .stage <int>, .weight_1 <dbl>, .fpc_1 <dbl>, .draw_1 <int>Using Sampling Fractions
Instead of a fixed sample size, you can specify a sampling fraction
with frac.
sample <- sampling_design() |>
draw(frac = 0.05) |>
execute(bfa_eas, seed = 1789)
nrow(sample)
#> [1] 2229Bernoulli sampling selects each unit independently with the specified probability. This gives a random sample size, which can simplify field protocols.
sample <- sampling_design() |>
draw(frac = 0.05, method = "bernoulli") |>
execute(bfa_eas, seed = 1960)
nrow(sample)
#> [1] 2247Rounding Behavior
When using frac, the computed sample size (N × frac)
must be rounded to an integer. By default, samplyr rounds
up (ceiling), which matches SAS SURVEYSELECT’s default behavior. Use the
round argument to control this:
-
"up"(default): round up, ensures adequate sample size -
"down": round down, minimum sample size -
"nearest": Standard mathematical rounding
# Frame of 105 units, frac = 0.1 → 10.5
frame_105 <- head(bfa_eas, 105)
# Default: rounds up to 11
nrow(sampling_design() |> draw(frac = 0.1) |> execute(frame_105, seed = 3))
#> [1] 11
# Round down to 10
nrow(sampling_design() |> draw(frac = 0.1, round = "down") |> execute(frame_105, seed = 3))
#> [1] 10
# Round to nearest (10)
nrow(sampling_design() |> draw(frac = 0.1, round = "nearest") |> execute(frame_105, seed = 3))
#> [1] 10Stratified Sampling
Stratification partitions the frame into non-overlapping groups (strata) and samples from each. This ensures representation from all subgroups and often improves precision.
Per-Stratum Sample Size
Without an allocation method, n specifies the sample
size within each stratum.
sample <- sampling_design() |>
stratify_by(region) |>
draw(n = 20) |>
execute(bfa_eas, seed = 123)
sample |>
count(region)
#> # A tibble: 13 × 2
#> region n
#> <fct> <int>
#> 1 Boucle du Mouhoun 20
#> 2 Cascades 20
#> 3 Centre 20
#> 4 Centre-Est 20
#> 5 Centre-Nord 20
#> 6 Centre-Ouest 20
#> 7 Centre-Sud 20
#> 8 Est 20
#> 9 Hauts-Bassins 20
#> 10 Nord 20
#> 11 Plateau-Central 20
#> 12 Sahel 20
#> 13 Sud-Ouest 20Allocation Methods
With an allocation method, n becomes the total sample
size to distribute across strata. Allocation methods are defined for
total sample-size allocation, so draw(frac = ...) is not
supported when alloc is set.
Proportional allocation distributes the sample proportionally to stratum sizes, ensuring the sample mirrors the population structure.
sample <- sampling_design(title = "Proportional Allocation") |>
stratify_by(region, alloc = "proportional") |>
draw(n = 300) |>
execute(bfa_eas, seed = 42)
sample |>
count(region) |>
mutate(pct = n / sum(n))
#> # A tibble: 13 × 3
#> region n pct
#> <fct> <int> <dbl>
#> 1 Boucle du Mouhoun 34 0.113
#> 2 Cascades 17 0.0567
#> 3 Centre 26 0.0867
#> 4 Centre-Est 20 0.0667
#> 5 Centre-Nord 23 0.0767
#> 6 Centre-Ouest 25 0.0833
#> 7 Centre-Sud 11 0.0367
#> 8 Est 37 0.123
#> 9 Hauts-Bassins 32 0.107
#> 10 Nord 20 0.0667
#> 11 Plateau-Central 11 0.0367
#> 12 Sahel 28 0.0933
#> 13 Sud-Ouest 16 0.0533Equal allocation assigns the same sample size to each stratum, regardless of population size. This maximizes precision for comparisons between strata.
sample <- sampling_design(title = "Equal Allocation") |>
stratify_by(region, alloc = "equal") |>
draw(n = 160) |>
execute(bfa_eas, seed = 2026)
sample |>
count(region)
#> # A tibble: 13 × 2
#> region n
#> <fct> <int>
#> 1 Boucle du Mouhoun 13
#> 2 Cascades 13
#> 3 Centre 13
#> 4 Centre-Est 13
#> 5 Centre-Nord 12
#> 6 Centre-Ouest 12
#> 7 Centre-Sud 12
#> 8 Est 12
#> 9 Hauts-Bassins 12
#> 10 Nord 12
#> 11 Plateau-Central 12
#> 12 Sahel 12
#> 13 Sud-Ouest 12Neyman allocation minimizes the variance of the
overall estimate by allocating more sample to strata with higher
variability. It requires prior information about stratum variances. Use
explicit stratification variable names when passing
variance.
data(bfa_eas_variance)
bfa_eas_variance
#> # A tibble: 13 × 2
#> region var
#> <fct> <dbl>
#> 1 Boucle du Mouhoun 17.7
#> 2 Cascades 1.32
#> 3 Centre 1.02
#> 4 Centre-Est 18.6
#> 5 Centre-Nord 87.0
#> 6 Centre-Ouest 7.42
#> 7 Centre-Sud 3.03
#> 8 Est 134.
#> 9 Hauts-Bassins 0.411
#> 10 Nord 264.
#> 11 Plateau-Central 7.75
#> 12 Sahel 253.
#> 13 Sud-Ouest 2.49
sample <- sampling_design(title = "Neyman Allocation") |>
stratify_by(region, alloc = "neyman", variance = bfa_eas_variance) |>
draw(n = 300) |>
execute(bfa_eas, seed = 12345)
sample |>
count(region)
#> # A tibble: 13 × 2
#> region n
#> <fct> <int>
#> 1 Boucle du Mouhoun 23
#> 2 Cascades 3
#> 3 Centre 4
#> 4 Centre-Est 14
#> 5 Centre-Nord 35
#> 6 Centre-Ouest 11
#> 7 Centre-Sud 3
#> 8 Est 70
#> 9 Hauts-Bassins 4
#> 10 Nord 52
#> 11 Plateau-Central 5
#> 12 Sahel 72
#> 13 Sud-Ouest 4Optimal allocation extends Neyman allocation by also
accounting for differential costs across strata. It minimizes variance
for a fixed budget (or cost for fixed precision). Use explicit
stratification variable names when passing
variance/cost.
data(bfa_eas_cost)
bfa_eas_cost
#> # A tibble: 13 × 2
#> region cost
#> <fct> <dbl>
#> 1 Boucle du Mouhoun 484
#> 2 Cascades 422
#> 3 Centre 237
#> 4 Centre-Est 405
#> 5 Centre-Nord 446
#> 6 Centre-Ouest 400
#> 7 Centre-Sud 417
#> 8 Est 631
#> 9 Hauts-Bassins 374
#> 10 Nord 579
#> 11 Plateau-Central 410
#> 12 Sahel 626
#> 13 Sud-Ouest 420
sample <- sampling_design(title = "Optimal Allocation") |>
stratify_by(region, alloc = "optimal",
variance = bfa_eas_variance,
cost = bfa_eas_cost) |>
draw(n = 300) |>
execute(bfa_eas, seed = 9876)
sample |>
count(region)
#> # A tibble: 13 × 2
#> region n
#> <fct> <int>
#> 1 Boucle du Mouhoun 24
#> 2 Cascades 4
#> 3 Centre 6
#> 4 Centre-Est 16
#> 5 Centre-Nord 38
#> 6 Centre-Ouest 13
#> 7 Centre-Sud 3
#> 8 Est 64
#> 9 Hauts-Bassins 4
#> 10 Nord 50
#> 11 Plateau-Central 6
#> 12 Sahel 67
#> 13 Sud-Ouest 5Power allocation (Bankier 1988) uses \(n_h \propto C_h X_h^q\), where
cv supplies \(C_h\),
importance supplies \(X_h\), and power is \(q \in [0, 1]\).
cv_df <- data.frame(
region = levels(bfa_eas$region),
cv = c(0.40, 0.35, 0.12, 0.20, 0.30, 0.18,
0.15, 0.38, 0.22, 0.32, 0.17, 0.45, 0.25)
)
importance_df <- data.frame(
region = levels(bfa_eas$region),
importance = c(60, 40, 120, 70, 80, 65,
50, 55, 90, 75, 45, 35, 30)
)
sample <- sampling_design(title = "Power Allocation") |>
stratify_by(
region,
alloc = "power",
cv = cv_df,
importance = importance_df,
power = 0.5
) |>
draw(n = 300) |>
execute(bfa_eas, seed = 777)
sample |>
count(region)
#> # A tibble: 13 × 2
#> region n
#> <fct> <int>
#> 1 Boucle du Mouhoun 35
#> 2 Cascades 25
#> 3 Centre 15
#> 4 Centre-Est 19
#> 5 Centre-Nord 31
#> 6 Centre-Ouest 16
#> 7 Centre-Sud 12
#> 8 Est 32
#> 9 Hauts-Bassins 24
#> 10 Nord 32
#> 11 Plateau-Central 13
#> 12 Sahel 30
#> 13 Sud-Ouest 16Sample Size Bounds
When using allocation methods, you can constrain stratum sample sizes
with min_n and max_n in
draw().
Minimum sample size ensures every stratum gets at least the specified number of units. This is essential for variance estimation (requires n ≥ 2 per stratum) or when you need reliable subgroup estimates.
# Neyman allocation with minimum 5 per stratum
sample <- sampling_design() |>
stratify_by(region, alloc = "neyman", variance = bfa_eas_variance) |>
draw(n = 300, min_n = 20) |>
execute(bfa_eas, seed = 1)
sample |>
count(region)
#> # A tibble: 13 × 2
#> region n
#> <fct> <int>
#> 1 Boucle du Mouhoun 21
#> 2 Cascades 20
#> 3 Centre 20
#> 4 Centre-Est 20
#> 5 Centre-Nord 24
#> 6 Centre-Ouest 20
#> 7 Centre-Sud 20
#> 8 Est 33
#> 9 Hauts-Bassins 20
#> 10 Nord 28
#> 11 Plateau-Central 20
#> 12 Sahel 34
#> 13 Sud-Ouest 20Maximum sample size caps large strata to prevent them from dominating the sample. This is useful when operational constraints limit how many units you can handle in any one stratum.
# Proportional allocation but cap at 60 per stratum
sample <- sampling_design() |>
stratify_by(region, alloc = "proportional") |>
draw(n = 300, max_n = 55) |>
execute(bfa_eas, seed = 40)
sample |>
count(region)
#> # A tibble: 13 × 2
#> region n
#> <fct> <int>
#> 1 Boucle du Mouhoun 34
#> 2 Cascades 17
#> 3 Centre 26
#> 4 Centre-Est 20
#> 5 Centre-Nord 23
#> 6 Centre-Ouest 25
#> 7 Centre-Sud 11
#> 8 Est 37
#> 9 Hauts-Bassins 32
#> 10 Nord 20
#> 11 Plateau-Central 11
#> 12 Sahel 28
#> 13 Sud-Ouest 16Both bounds together create a feasible range for each stratum:
sample <- sampling_design() |>
stratify_by(region, alloc = "proportional") |>
draw(n = 300, min_n = 20, max_n = 55) |>
execute(bfa_eas, seed = 2003)
sample |>
count(region)
#> # A tibble: 13 × 2
#> region n
#> <fct> <int>
#> 1 Boucle du Mouhoun 28
#> 2 Cascades 20
#> 3 Centre 24
#> 4 Centre-Est 20
#> 5 Centre-Nord 22
#> 6 Centre-Ouest 23
#> 7 Centre-Sud 20
#> 8 Est 30
#> 9 Hauts-Bassins 28
#> 10 Nord 20
#> 11 Plateau-Central 20
#> 12 Sahel 25
#> 13 Sud-Ouest 20When a stratum’s population is smaller than min_n, the
entire stratum is selected (capped at population size).
Custom allocation lets you specify exact sample
sizes or rates per stratum by passing a data frame to
draw().
sizes_df <- data.frame(
region = levels(bfa_eas$region),
n = c(20, 12, 25, 18, 22, 16, 14, 15, 20, 18, 12, 10, 8)
)
sample <- sampling_design() |>
stratify_by(region) |>
draw(n = sizes_df) |>
execute(bfa_eas, seed = 101)
sample |>
count(region)
#> # A tibble: 13 × 2
#> region n
#> <fct> <int>
#> 1 Boucle du Mouhoun 20
#> 2 Cascades 12
#> 3 Centre 25
#> 4 Centre-Est 18
#> 5 Centre-Nord 22
#> 6 Centre-Ouest 16
#> 7 Centre-Sud 14
#> 8 Est 15
#> 9 Hauts-Bassins 20
#> 10 Nord 18
#> 11 Plateau-Central 12
#> 12 Sahel 10
#> 13 Sud-Ouest 8Multiple Stratification Variables
You can stratify by multiple variables to create crossed strata. Here we stratify by both region and urban/rural status.
sample <- sampling_design() |>
stratify_by(region, urban_rural, alloc = "proportional") |>
draw(n = 300) |>
execute(bfa_eas, seed = 42)
sample |>
count(region, urban_rural) |>
head(10)
#> # A tibble: 10 × 3
#> region urban_rural n
#> <fct> <fct> <int>
#> 1 Boucle du Mouhoun Rural 34
#> 2 Cascades Rural 17
#> 3 Centre Rural 2
#> 4 Centre Urban 25
#> 5 Centre-Est Rural 18
#> 6 Centre-Est Urban 2
#> 7 Centre-Nord Rural 20
#> 8 Centre-Nord Urban 3
#> 9 Centre-Ouest Rural 22
#> 10 Centre-Ouest Urban 3Cluster Sampling
Cluster sampling selects groups (clusters) rather than individual units. This is practical when a complete list of individuals isn’t available or when travel costs make dispersed sampling expensive.
Use cluster_by() to specify the cluster identifier. All
units within selected clusters are included.
sample <- sampling_design() |>
cluster_by(ea_id) |>
draw(n = 50) |>
execute(bfa_eas, seed = 120)
sample |>
summarise(
n_clusters = n_distinct(ea_id),
n_units = n()
)
#> # A tibble: 1 × 2
#> n_clusters n_units
#> <int> <int>
#> 1 50 50PPS Sampling
Probability Proportional to Size (PPS) sampling selects units with probability proportional to a size measure. This is standard for cluster sampling where clusters vary in size.
Use the mos argument to specify the measure of size
variable. Here we select EAs with probability proportional to their
household count.
sample_pps <- sampling_design() |>
cluster_by(ea_id) |>
draw(n = 50, method = "pps_brewer", mos = households) |>
execute(bfa_eas, seed = 365)
sample_pps |>
summarise(
mean_hh = mean(households),
median_hh = median(households)
)
#> # A tibble: 1 × 2
#> mean_hh median_hh
#> <dbl> <dbl>
#> 1 104. 97.5Compared to simple random sampling, PPS tends to select larger clusters:
sample_srs <- sampling_design() |>
cluster_by(ea_id) |>
draw(n = 50) |>
execute(bfa_eas, seed = 42)
bind_rows(
sample_pps |> summarise(method = "PPS", mean_hh = mean(households)),
sample_srs |> summarise(method = "SRS", mean_hh = mean(households))
)
#> # A tibble: 2 × 2
#> method mean_hh
#> <chr> <dbl>
#> 1 PPS 104.
#> 2 SRS 65.6PPS Methods
samplyr supports several PPS methods:
| Method | Description |
|---|---|
pps_brewer |
Brewer’s method - fast with good properties (recommended) |
pps_systematic |
PPS systematic - simple but can have periodicity issues |
pps_cps |
Conditional Poisson sampling (maximum entropy) |
pps_poisson |
PPS Poisson - random sample size (requires frac and
supports prn) |
pps_sps |
Sequential Poisson sampling (supports prn) |
pps_pareto |
Pareto sampling (supports prn) |
pps_multinomial |
PPS with replacement |
pps_chromy |
Chromy’s sequential PPS / minimum replacement |
Certainty Selection
In PPS sampling, very large units can have expected inclusion probabilities above 1. Certainty selection handles this by selecting those units with probability 1 and sampling the rest normally.
Use certainty_size for an absolute threshold or
certainty_prop for a proportional one.
sample_cert <- sampling_design(title = "Certainty selection") |>
draw(n = 50, method = "pps_brewer", mos = households,
certainty_size = 800) |>
execute(bfa_eas, seed = 123)
# How many EAs with more than 800 households
filter(bfa_eas, households > 800)
#> # A tibble: 4 × 12
#> ea_id region province commune urban_rural population households area_km2
#> <int> <fct> <fct> <fct> <fct> <dbl> <int> <dbl>
#> 1 888 Centre-Est Boulgou Bitou Rural 4559 815 8.31
#> 2 26995 Centre-Est Kourite… Koupela Urban 4640 941 3.62
#> 3 28122 Centre-Ouest Sanguie Reo Urban 3930 844 8.22
#> 4 40067 Nord Yatenga Barga Urban 4644 821 2.19
#> # ℹ 4 more variables: accessible <lgl>, dist_road_km <dbl>,
#> # food_insecurity_pct <dbl>, cost <dbl>
# Check which units were selected with certainty
count(sample_cert, .certainty_1)
#> # A tibble: 2 × 2
#> .certainty_1 n
#> <lgl> <int>
#> 1 FALSE 46
#> 2 TRUE 4With certainty_prop, units whose MOS proportion exceeds
the threshold are taken with certainty. The process is iterative: after
removing certainty units, proportions are recomputed.
sample_certp <- sampling_design() |>
stratify_by(region) |>
draw(n = 50, method = "pps_systematic", mos = households,
certainty_prop = 0.05) |>
execute(bfa_eas, seed = 12345)
# How many EAs with more than 5% of the total hh by region
bfa_eas |>
mutate(mosprop = households/sum(households),
.by = region) |>
filter(mosprop >= 0.05)
#> # A tibble: 0 × 13
#> # ℹ 13 variables: ea_id <int>, region <fct>, province <fct>, commune <fct>,
#> # urban_rural <fct>, population <dbl>, households <int>, area_km2 <dbl>,
#> # accessible <lgl>, dist_road_km <dbl>, food_insecurity_pct <dbl>,
#> # cost <dbl>, mosprop <dbl>
filter(sample_certp, .certainty_1)
#> Warning in min(w): no non-missing arguments to min; returning Inf
#> Warning in max(w): no non-missing arguments to max; returning -Inf
#> # A tbl_sample: 0 × 18
#> # Weights: NaN [Inf, -Inf]
#> # ℹ 18 variables: ea_id <int>, region <fct>, province <fct>, commune <fct>,
#> # urban_rural <fct>, population <dbl>, households <int>, area_km2 <dbl>,
#> # accessible <lgl>, dist_road_km <dbl>, food_insecurity_pct <dbl>,
#> # cost <dbl>, .weight <dbl>, .sample_id <int>, .stage <int>, .weight_1 <dbl>,
#> # .fpc_1 <int>, .certainty_1 <lgl>Permanent Random Numbers (PRN)
Some PPS methods support permanent random numbers for sample
coordination across survey waves. Assign a stable U(0,1) value to each
frame unit, then pass it via prn:
set.seed(1)
bfa_eas$prn <- runif(nrow(bfa_eas))
sample_prn <- sampling_design(title = "Samples coordination with PRN") |>
cluster_by(ea_id) |>
draw(n = 50, method = "pps_sps", mos = households, prn = prn) |>
execute(bfa_eas, seed = 1)
sample_prn
#> # A tbl_sample: 50 × 19 | Samples coordination with PRN
#> # Weights: 610.1 [90.47, 2399.38]
#> ea_id region province commune urban_rural population households area_km2
#> * <int> <fct> <fct> <fct> <fct> <dbl> <int> <dbl>
#> 1 6360 Boucle du … Banwa Kouka Rural 592 82 0.54
#> 2 23944 Boucle du … Banwa Solenzo Rural 758 102 0.77
#> 3 37423 Boucle du … Kossi Kombori Rural 688 88 1.37
#> 4 25972 Boucle du … Mouhoun Dedoug… Rural 804 143 1.58
#> 5 34838 Boucle du … Sourou Tougan Rural 876 120 1.25
#> 6 34864 Boucle du … Sourou Tougan Rural 2824 387 2.07
#> 7 34885 Boucle du … Sourou Tougan Rural 804 110 8.88
#> 8 7750 Cascades Leraba Nianko… Rural 633 125 1.36
#> 9 34253 Cascades Leraba Sindou Rural 499 61 0.68
#> 10 11585 Cascades Leraba Wolonk… Rural 741 92 5.72
#> # ℹ 40 more rows
#> # ℹ 11 more variables: accessible <lgl>, dist_road_km <dbl>,
#> # food_insecurity_pct <dbl>, cost <dbl>, prn <dbl>, .weight <dbl>,
#> # .sample_id <int>, .stage <int>, .weight_1 <dbl>, .fpc_1 <int>,
#> # .certainty_1 <lgl>Because the PRN values are stable, re-executing the same design on
the same frame produces highly overlapping samples (positive
coordination). PRN is supported for bernoulli,
pps_poisson, pps_sps, and
pps_pareto. You can learn more about coordination with
permanent random numbers in the following vignette
vignette("sampling-coordination").
Control Sorting and Serpentine Ordering
Control sorting orders the frame before selection, providing implicit stratification. This is effective with systematic and sequential methods, where it ensures the sample spreads evenly across the sorted variables.
# Nested sorting: standard ascending order
sample_sort <- sampling_design() |>
draw(n = 100, method = "systematic",
control = c(region, province)) |>
execute(bfa_eas, seed = 98765)
head(sample_sort)
#> # A tbl_sample: 6 × 18
#> # Weights: 445.7 [445.7, 445.7]
#> ea_id region province commune urban_rural population households area_km2
#> * <int> <fct> <fct> <fct> <fct> <dbl> <int> <dbl>
#> 1 36793 Boucle du M… Bale Fara Rural 212 31 0.1
#> 2 6276 Boucle du M… Banwa Kouka Rural 544 75 4.3
#> 3 23886 Boucle du M… Banwa Solenzo Rural 79 11 5.68
#> 4 568 Boucle du M… Kossi Barani Rural 371 49 0.35
#> 5 41792 Boucle du M… Kossi Dokui Rural 821 102 0.83
#> 6 33311 Boucle du M… Kossi Nouna Rural 989 146 0.35
#> # ℹ 10 more variables: accessible <lgl>, dist_road_km <dbl>,
#> # food_insecurity_pct <dbl>, cost <dbl>, prn <dbl>, .weight <dbl>,
#> # .sample_id <int>, .stage <int>, .weight_1 <dbl>, .fpc_1 <int>serp() implements serpentine (snake) sorting, which
alternates direction at each level of the hierarchy. This minimizes
jumps at boundaries, matching SAS PROC SURVEYSELECT’s
SORT=SERP.
# Serpentine sorting
sample_serp <- sampling_design() |>
draw(n = 100, method = "systematic",
control = serp(region, province)) |>
execute(bfa_eas, seed = 1)
head(sample_serp)
#> # A tbl_sample: 6 × 18
#> # Weights: 445.7 [445.7, 445.7]
#> ea_id region province commune urban_rural population households area_km2
#> * <int> <fct> <fct> <fct> <fct> <dbl> <int> <dbl>
#> 1 29502 Boucle du M… Bale Bana Rural 168 22 6.72
#> 2 8912 Boucle du M… Bale Pompoi Rural 802 109 0.72
#> 3 9641 Boucle du M… Banwa Sanaba Rural 825 105 8.83
#> 4 10935 Boucle du M… Banwa Tansila Rural 173 21 9.23
#> 5 21106 Boucle du M… Kossi Bouras… Rural 39 4 3.34
#> 6 7017 Boucle du M… Kossi Madouba Rural 599 74 0.74
#> # ℹ 10 more variables: accessible <lgl>, dist_road_km <dbl>,
#> # food_insecurity_pct <dbl>, cost <dbl>, prn <dbl>, .weight <dbl>,
#> # .sample_id <int>, .stage <int>, .weight_1 <dbl>, .fpc_1 <int>Control sorting can be combined with explicit stratification. The sorting is then applied within each stratum.
sample_serp2 <- sampling_design() |>
stratify_by(urban_rural) |>
draw(n = 100, method = "systematic",
control = serp(region, province)) |>
execute(bfa_eas, seed = 2)
head(sample_serp2)
#> # A tbl_sample: 6 × 18
#> # Weights: 376.87 [376.87, 376.87]
#> ea_id region province commune urban_rural population households area_km2
#> * <int> <fct> <fct> <fct> <fct> <dbl> <int> <dbl>
#> 1 11828 Boucle du M… Bale Bagassi Rural 38 5 6.16
#> 2 8436 Boucle du M… Bale Ouri Rural 83 11 4.09
#> 3 6321 Boucle du M… Banwa Kouka Rural 527 73 8.97
#> 4 23862 Boucle du M… Banwa Solenzo Rural 1006 135 1.17
#> 5 475 Boucle du M… Kossi Barani Rural 51 7 8.84
#> 6 12763 Boucle du M… Kossi Djibas… Rural 1238 158 0.95
#> # ℹ 10 more variables: accessible <lgl>, dist_road_km <dbl>,
#> # food_insecurity_pct <dbl>, cost <dbl>, prn <dbl>, .weight <dbl>,
#> # .sample_id <int>, .stage <int>, .weight_1 <dbl>, .fpc_1 <int>Balanced Sampling
Balanced sampling selects a sample whose Horvitz-Thompson estimates
of auxiliary totals match (or nearly match) the population totals. This
improves precision for any variable correlated with the auxiliary
variables. samplyr implements the cube method (Deville
& Tille 2004) via method = "balanced".
Specify auxiliary variables with aux. Without
mos, inclusion probabilities are equal; with
mos, they are proportional to size:
# Equal-probability balanced sample
balanced_eq <- sampling_design() |>
draw(n = 100, method = "balanced",
aux = c(population, households)) |>
execute(bfa_eas, seed = 42)
balanced_eq
#> # A tbl_sample: 100 × 18
#> # Weights: 445.7 [445.7, 445.7]
#> ea_id region province commune urban_rural population households area_km2
#> * <int> <fct> <fct> <fct> <fct> <dbl> <int> <dbl>
#> 1 8354 Boucle du … Bale Ouri Rural 78 10 7.56
#> 2 8901 Boucle du … Bale Pompoi Rural 220 30 8.8
#> 3 9031 Boucle du … Bale Poura Urban 941 151 1.06
#> 4 6328 Boucle du … Banwa Kouka Rural 889 122 8.66
#> 5 21118 Boucle du … Kossi Bouras… Rural 28 3 8.47
#> 6 12858 Boucle du … Kossi Djibas… Rural 431 55 4.95
#> 7 7021 Boucle du … Kossi Madouba Rural 926 115 0.66
#> 8 26106 Boucle du … Mouhoun Dedoug… Rural 237 42 8.72
#> 9 26496 Boucle du … Nayala Gossina Rural 116 14 6.51
#> 10 6237 Boucle du … Nayala Kougny Rural 97 11 8.71
#> # ℹ 90 more rows
#> # ℹ 10 more variables: accessible <lgl>, dist_road_km <dbl>,
#> # food_insecurity_pct <dbl>, cost <dbl>, prn <dbl>, .weight <dbl>,
#> # .sample_id <int>, .stage <int>, .weight_1 <dbl>, .fpc_1 <int>The balancing property means the HT estimate of auxiliary totals is close to the true total:
pop_total <- sum(bfa_eas$population)
ht_total <- sum(balanced_eq$population * balanced_eq$.weight)
c(population = pop_total, ht_estimate = ht_total)
#> population ht_estimate
#> 20487532 20444259Balanced sampling combines naturally with stratification. When
stratified, samplyr calls the stratified cube algorithm
(Chauvet 2009) in a single pass, preserving the allocation computed by
stratify_by():
# Stratified PPS balanced sample
balanced_strat <- sampling_design() |>
stratify_by(region, alloc = "proportional") |>
draw(n = 300, method = "balanced", mos = households,
aux = c(population, area_km2)) |>
execute(bfa_eas, seed = 1960)
balanced_strat
#> # A tbl_sample: 300 × 18
#> # Weights: 148.8 [10.29, 2065.22]
#> ea_id region province commune urban_rural population households area_km2
#> * <int> <fct> <fct> <fct> <fct> <dbl> <int> <dbl>
#> 1 11787 Boucle du … Bale Bagassi Rural 874 105 1.4
#> 2 11855 Boucle du … Bale Bagassi Rural 220 26 8.48
#> 3 1165 Boucle du … Bale Boromo Rural 1257 164 0.96
#> 4 36701 Boucle du … Bale Fara Rural 638 93 0.81
#> 5 39987 Boucle du … Banwa Balave Rural 298 47 9.01
#> 6 6362 Boucle du … Banwa Kouka Rural 890 123 6.91
#> 7 34030 Boucle du … Banwa Sami Rural 139 21 8.93
#> 8 34038 Boucle du … Banwa Sami Rural 214 32 10.4
#> 9 34054 Boucle du … Banwa Sami Rural 306 46 5.13
#> 10 23901 Boucle du … Banwa Solenzo Rural 383 51 8.71
#> # ℹ 290 more rows
#> # ℹ 10 more variables: accessible <lgl>, dist_road_km <dbl>,
#> # food_insecurity_pct <dbl>, cost <dbl>, prn <dbl>, .weight <dbl>,
#> # .sample_id <int>, .stage <int>, .weight_1 <dbl>, .fpc_1 <dbl>When used with cluster_by(), auxiliary values are
automatically aggregated (summed) to the cluster level before
selection.
Balanced sampling supports up to 2 stages. At stage 1, element-level auxiliary variables are aggregated to PSU totals; at stage 2, the cube method runs on the element-level frame within each selected PSU.
Custom Sampling Methods
Any unequal probability method can be plugged into samplyr by
registering it with sondage::register_method(). The method
is then available in draw() using the
pps_<name> convention.
For example, the elimination procedure of Tillé (1996) is available in the sampling package. We can register it as a custom method along with its exact joint inclusion probabilities:
# Wrap sampling::UPtille to return selected indices
tille_fn <- function(pik, n = NULL, prn = NULL, ...) {
which(as.logical(sampling::UPtille(pik)))
}
# Joint inclusion probabilities, restricted to the sampled units
tille_joint_fn <- function(pik, sample_idx = NULL, ...) {
pi2 <- sampling::UPtillepi2(pik)
if (!is.null(sample_idx))
pi2 <- pi2[sample_idx, sample_idx, drop = FALSE]
pi2
}
sondage::register_method(
"tille", type = "wor",
sample_fn = tille_fn,
joint_fn = tille_joint_fn
)Once registered, the method works like any built-in PPS method:
set.seed(1)
frame <- bfa_eas[sample(seq_len(nrow(bfa_eas)), 500), ]
sample_tille <- sampling_design() |>
stratify_by(region, alloc = "proportional") |>
draw(n = 50, method = "pps_tille", mos = households) |>
execute(frame, seed = 1)
sample_tille
#> # A tbl_sample: 50 × 19
#> # Weights: 12.2 [1.31, 111.8]
#> ea_id region province commune urban_rural population households area_km2
#> * <int> <fct> <fct> <fct> <fct> <dbl> <int> <dbl>
#> 1 43475 Est Gnagna Piela Rural 971 114 8.95
#> 2 25151 Est Gnagna Bilanga Rural 359 54 9.07
#> 3 27796 Est Tapoa Partia… Rural 742 102 8.15
#> 4 37813 Est Tapoa Logbou Rural 965 123 5.96
#> 5 3098 Est Gourma Fada-N… Rural 985 119 0.63
#> 6 25163 Est Gnagna Bilanga Rural 546 83 9.01
#> 7 30933 Est Komandj… Gayeri Rural 278 35 8.9
#> 8 3884 Sud-Ouest Poni Gaoua Rural 146 22 11.3
#> 9 32486 Sud-Ouest Poni Lorope… Rural 37 5 3.97
#> 10 29504 Boucle du … Bale Bana Rural 817 107 8.21
#> # ℹ 40 more rows
#> # ℹ 11 more variables: accessible <lgl>, dist_road_km <dbl>,
#> # food_insecurity_pct <dbl>, cost <dbl>, prn <dbl>, .weight <dbl>,
#> # .sample_id <int>, .stage <int>, .weight_1 <dbl>, .fpc_1 <int>,
#> # .certainty_1 <lgl>Because we provided a joint_fn, exact variance
estimation is available via joint_expectation():
jip <- joint_expectation(sample_tille, frame)
svy <- as_svydesign(sample_tille, pps = survey::ppsmat(jip$stage_1))
survey::svymean(~population, svy)
#> mean SE
#> population 382.58 72.167Custom methods flow through the full pipeline: stratification,
certainty selection, multi-stage designs, survey export, and summary
diagnostics. See ?sondage::register_method for the full
contract including WR methods and PRN support.
Multi-Stage Sampling
Multi-stage designs sample in stages: first select primary sampling units (PSUs), then sample within them. This is the standard approach for large-scale surveys when a complete frame of ultimate units isn’t available upfront.
Use add_stage() to delimit each stage of the design.
Two-Stage Design
For this example, we’ll use zwe_eas to demonstrate a
two-stage design: first select districts as PSUs, then sample EAs within
selected districts.
data(zwe_eas)
# First, let's see the structure: EAs nested within districts
zwe_eas |>
count(province, district) |>
head(10)
#> # A tibble: 10 × 3
#> province district n
#> <fct> <fct> <int>
#> 1 Bulawayo Bulawayo 2164
#> 2 Harare Chitungwiza 729
#> 3 Harare Epworth 564
#> 4 Harare Harare 3922
#> 5 Harare Harare Rural 851
#> 6 Manicaland Buhera 2790
#> 7 Manicaland Chimanimani 991
#> 8 Manicaland Chipinge 2038
#> 9 Manicaland Chipinge Urban 104
#> 10 Manicaland Makoni 3177We need to create a measure of size for districts. Here we’ll use total households per district.
# Add district-level households as MOS
zwe_frame <- zwe_eas |>
mutate(district_hh = sum(households),
.by = district)Now we can run a two-stage design: select 10 districts with PPS, then sample 5 EAs within each.
twostage_design <- sampling_design(title = "Two-stage cluster sampling") |>
add_stage(label = "Districts") |>
cluster_by(district) |>
draw(n = 10, method = "pps_brewer", mos = district_hh) |>
add_stage(label = "EAs") |>
draw(n = 5)
twostage_design
#> ── Sampling Design: Two-stage cluster sampling ─────────────────────────────────
#>
#> ℹ 2 stages
#>
#> ── Stage 1: Districts ──────────────────────────────────────────────────────────
#> • Cluster: district
#> • Draw: n = 10, method = pps_brewer, mos = district_hh
#>
#> ── Stage 2: EAs ────────────────────────────────────────────────────────────────
#> • Draw: n = 5, method = srsworLet’s apply it to the frame.
sample_eas <- execute(twostage_design,
zwe_frame, seed = 3)
sample_eas |>
summarise(n_districts = n_distinct(district),
n_eas = n())
#> # A tibble: 1 × 2
#> n_districts n_eas
#> <int> <int>
#> 1 10 50Stratified Multi-Stage Design
Combining stratification with multi-stage sampling is standard for national surveys. Here we stratify by province (which is constant within districts), then select districts and EAs within each stratum.
sample_eas2 <- sampling_design(title = "Stratified two-stage cluster sampling") |>
add_stage(label = "Districts") |>
stratify_by(province) |>
cluster_by(district) |>
draw(n = 2, method = "pps_brewer", mos = district_hh) |>
add_stage(label = "EAs") |>
draw(n = 3) |>
execute(zwe_frame, seed = 4)
#> Warning: Sample size capped to population in 1 stratum/strata: "Bulawayo".
#> ℹ Requested total: 20. Actual total: 19.
sample_eas2 |>
count(province, district)
#> # A tibble: 19 × 3
#> province district n
#> <fct> <fct> <int>
#> 1 Bulawayo Bulawayo 3
#> 2 Harare Epworth 3
#> 3 Harare Harare 3
#> 4 Manicaland Buhera 3
#> 5 Manicaland Chipinge 3
#> 6 Mashonaland Central Guruve 3
#> 7 Mashonaland Central Mount Darwin 3
#> 8 Mashonaland East Goromonzi 3
#> 9 Mashonaland East Mutoko 3
#> 10 Mashonaland West Sanyati 3
#> 11 Mashonaland West Zvimba 3
#> 12 Masvingo Bikita 3
#> 13 Masvingo Masvingo 3
#> 14 Matabeleland North Binga 3
#> 15 Matabeleland North Bubi 3
#> 16 Matabeleland South Gwanda 3
#> 17 Matabeleland South Umzingwane 3
#> 18 Midlands Gweru Urban 3
#> 19 Midlands Zvishavane 3Operational Multi-Stage Sampling
In practice, multi-stage surveys often execute stages at different times. After selecting PSUs, field teams may need to conduct listing or other operations before the second stage can proceed.
The stages argument lets you execute only specific
stages.
design <- sampling_design() |>
add_stage(label = "EA") |>
stratify_by(urban_rural) |>
cluster_by(ea_id) |>
draw(n = 10, method = "pps_brewer", mos = households) |>
add_stage(label = "HH") |>
draw(n = 5)
# Execute stage 1 only
selected_eas <- execute(design, zwe_eas, stages = 1, seed = 1)
head(selected_eas)
#> # A tbl_sample: 6 × 18
#> # Stages: 1/2
#> # Weights: 3641.26 [591.31, 8821.67]
#> ea_id province district ward_pcode urban_rural population households buildings
#> * <int> <fct> <fct> <chr> <fct> <int> <int> <int>
#> 1 47209 Bulawayo Bulawayo ZW102127 Urban 462 121 122
#> 2 35161 Harare Harare ZW192109 Urban 140 38 91
#> 3 86782 Harare Harare ZW192130 Urban 974 263 170
#> 4 88462 Harare Harare ZW192103 Urban 993 302 98
#> 5 93947 Mashona… Bindura ZW120105 Rural 95 23 91
#> 6 35770 Mashona… Guruve ZW120307 Urban 121 32 122
#> # ℹ 10 more variables: women_15_49 <int>, men_15_49 <int>,
#> # children_under5 <int>, area_km2 <dbl>, .weight <dbl>, .sample_id <int>,
#> # .stage <int>, .weight_1 <dbl>, .fpc_1 <int>, .certainty_1 <lgl>Once the EAs are selected, households are listed in each of them.
selected_eas_list <- selected_eas |>
slice(rep(seq_len(n()), households)) |>
mutate(hh_id = row_number())
head(selected_eas_list)
#> # A tbl_sample: 6 × 19
#> # Stages: 1/2
#> # Weights: 1475.82 [1475.82, 1475.82]
#> ea_id province district ward_pcode urban_rural population households buildings
#> * <int> <fct> <fct> <chr> <fct> <int> <int> <int>
#> 1 47209 Bulawayo Bulawayo ZW102127 Urban 462 121 122
#> 2 47209 Bulawayo Bulawayo ZW102127 Urban 462 121 122
#> 3 47209 Bulawayo Bulawayo ZW102127 Urban 462 121 122
#> 4 47209 Bulawayo Bulawayo ZW102127 Urban 462 121 122
#> 5 47209 Bulawayo Bulawayo ZW102127 Urban 462 121 122
#> 6 47209 Bulawayo Bulawayo ZW102127 Urban 462 121 122
#> # ℹ 11 more variables: women_15_49 <int>, men_15_49 <int>,
#> # children_under5 <int>, area_km2 <dbl>, .weight <dbl>, .sample_id <int>,
#> # .stage <int>, .weight_1 <dbl>, .fpc_1 <int>, .certainty_1 <lgl>,
#> # hh_id <int>You can also use tidyr::uncount() for the listing
step.
selected_eas_list <- selected_eas |>
tidyr::uncount(weights = households, .id = "hh_id", .remove = FALSE)After fieldwork or preparation is complete, execute the remaining
stages by piping the stage-1 result into execute() with the
listing frame. This continuation pattern preserves the full design
metadata needed for survey export.
final_sample <- selected_eas |>
execute(selected_eas_list, seed = 2)
head(final_sample)
#> # A tbl_sample: 6 × 21
#> # Weights: 35714.84 [35714.84, 35714.84]
#> ea_id province district ward_pcode urban_rural population households buildings
#> * <int> <fct> <fct> <chr> <fct> <int> <int> <int>
#> 1 47209 Bulawayo Bulawayo ZW102127 Urban 462 121 122
#> 2 47209 Bulawayo Bulawayo ZW102127 Urban 462 121 122
#> 3 47209 Bulawayo Bulawayo ZW102127 Urban 462 121 122
#> 4 47209 Bulawayo Bulawayo ZW102127 Urban 462 121 122
#> 5 47209 Bulawayo Bulawayo ZW102127 Urban 462 121 122
#> 6 35161 Harare Harare ZW192109 Urban 140 38 91
#> # ℹ 13 more variables: women_15_49 <int>, men_15_49 <int>,
#> # children_under5 <int>, area_km2 <dbl>, hh_id <int>, .weight <dbl>,
#> # .sample_id <int>, .stage <int>, .weight_2 <dbl>, .fpc_2 <int>,
#> # .weight_1 <dbl>, .fpc_1 <int>, .certainty_1 <lgl>Note: All dplyr verbs (
mutate(),filter(),*_join(), etc.) preserve thetbl_sampleclass automatically. If you manipulate a sample with base R functions likemerge()orcbind(), both the class and its attributes are lost. Prefer the dplyr equivalents (left_join(),bind_cols()). See?as_tbl_samplefor the full list.
Replicated Sampling
The reps parameter in execute() draws
multiple independent samples from the same frame under the same design.
The output is a single stacked tbl_sample with a
.replicate column identifying each replicate. This is
useful for simulation studies, repeated-sampling variance estimation, or
quality control.
rep_sample <- sampling_design() |>
stratify_by(region, alloc = "proportional") |>
draw(n = 300) |>
execute(bfa_eas, seed = 42, reps = 5)
rep_sample |>
count(.replicate)
#> # A tibble: 5 × 2
#> .replicate n
#> <int> <int>
#> 1 1 300
#> 2 2 300
#> 3 3 300
#> 4 4 300
#> 5 5 300Each replicate is a fully independent draw. Replicate r
uses seed seed + r - 1, so the first replicate matches a
standalone execution with the same seed:
standalone <- sampling_design() |>
stratify_by(region, alloc = "proportional") |>
draw(n = 300) |>
execute(bfa_eas, seed = 42)
rep1 <- rep_sample |> filter(.replicate == 1)
identical(sort(rep1$ea_id), sort(standalone$ea_id))
#> [1] TRUEReplicated sampling also works with multi-stage designs. When continuing from a replicated partial execution, each replicate is continued independently:
design <- sampling_design() |>
add_stage(label = "Districts") |>
cluster_by(district) |>
draw(n = 10, method = "pps_brewer", mos = district_hh) |>
add_stage(label = "EAs") |>
draw(n = 5)
stage1 <- execute(design, zwe_frame, stages = 1, seed = 1, reps = 3)
final <- execute(stage1, zwe_frame, seed = 100)
final |>
summarise(n_districts = n_distinct(district), n_eas = n(),
.by = .replicate)
#> # A tibble: 3 × 3
#> .replicate n_districts n_eas
#> <int> <int> <int>
#> 1 1 10 50
#> 2 2 10 50
#> 3 3 10 50Survey export functions require a single replicate. Filter first:
one_rep <- rep_sample |> filter(.replicate == 1)
svy <- as_svydesign(one_rep)Note: reps cannot be combined with panels
or with stages that use permanent random numbers (PRN). PRN produces
identical samples across replicates, so coordination across waves should
use an explicit loop with different PRN vectors instead.
Working with Results
Accessing the Design
The sampling design is attached to the result and can be retrieved
with get_design().
sample <- sampling_design() |>
stratify_by(region, alloc = "proportional") |>
draw(n = 200) |>
execute(bfa_eas, seed = 2024)
design <- get_design(sample)
design
#> ── Sampling Design ─────────────────────────────────────────────────────────────
#>
#> ℹ 1 stage
#>
#> ── Stage 1 ─────────────────────────────────────────────────────────────────────
#> • Strata: region (proportional)
#> • Draw: n = 200, method = srsworand accessing element of the design
design$stages[[1]]$strata$vars
#> [1] "region"Weight Diagnostics
The summary() method reports weight diagnostics
including Kish’s effective sample size and design effect due to unequal
weighting. For the full design effect (including clustering and
stratification), use survey::svymean() with
deff = TRUE after converting with
as_svydesign().
summary(sample)
#> ── Sample Summary ──────────────────────────────────────────────────────────────
#>
#> ℹ n = 200 | stages = 1/1 | seed = 2024
#>
#> ── Design: Stage 1 ─────────────────────────────────────────────────────────────
#> • Strata: region (proportional)
#> • Method: srswor
#>
#> ── Allocation: Stage 1 ─────────────────────────────────────────────────────────
#> region N_h n_h f_h
#> Boucle du Mouhoun 5009 23 0.0046
#> Cascades 2508 11 0.0044
#> Centre 3888 17 0.0044
#> Centre-Est 2941 13 0.0044
#> Centre-Nord 3402 15 0.0044
#> Centre-Ouest 3723 17 0.0046
#> Centre-Sud 1612 7 0.0043
#> Est 5505 25 0.0045
#> Hauts-Bassins 4839 22 0.0045
#> Nord 2930 13 0.0044
#> Plateau-Central 1662 7 0.0042
#> Sahel 4144 19 0.0046
#> Sud-Ouest 2407 11 0.0046
#> ───── ─── ──────
#> Total 44570 200 0.0045
#>
#> ── Weights ─────────────────────────────────────────────────────────────────────
#> • Range: [217.78, 237.43]
#> • Mean: 222.85 · CV: 0.02
#> • DEFF: 1 · n_eff: 200Method Reference
Selection Methods
| Method | Sample Size | Replacement | Notes |
|---|---|---|---|
srswor |
Fixed | No | Default, general purpose |
srswr |
Fixed | Yes | Bootstrap, rare populations |
systematic |
Fixed | No | Ordered frames |
bernoulli |
Random | No | Simple field protocols |
pps_brewer |
Fixed | No | Recommended PPS method |
pps_systematic |
Fixed | No | Simple PPS |
pps_cps |
Fixed | No | Maximum entropy, exact joint inclusion probs |
pps_poisson |
Random | No | PPS with random size |
pps_sps |
Fixed | No | Sequential Poisson, supports PRN |
pps_pareto |
Fixed | No | Pareto sampling, supports PRN |
pps_multinomial |
Fixed | Yes | PPS with replacement |
pps_chromy |
Fixed | PMR.1 | Chromy’s sequential PPS |
balanced |
Fixed | No | Cube method, balances on aux variables |
Allocation Methods
| Method | Description | Requirements |
|---|---|---|
| (none) | n per stratum | |
equal |
Same n per stratum | |
proportional |
n ∝ stratum size | |
neyman |
Minimize variance |
variance data frame |
optimal |
Minimize variance/cost |
variance + cost data frames |
power |
Compromise: \(n_h \propto C_h X_h^q\) |
cv + importance (+ optional
power) |
For custom stratum-specific sizes or rates, pass a data frame
directly to n or frac in draw().
When alloc is specified, use n (not
frac).
Best Practices
- Always set a seed for reproducibility
- Use meaningful stage labels for documentation
-
Validate designs with
validate_frame()before execution -
Check weight distributions with
summary()on the sample
design <- sampling_design(title = "National Health Survey 2024") |>
add_stage(label = "Enumeration Areas") |>
stratify_by(region, urban_rural) |>
cluster_by(ea_id) |>
draw(n = 5, method = "pps_brewer", mos = households) |>
add_stage(label = "Households") |>
draw(n = 20)
print(design)
#> ── Sampling Design: National Health Survey 2024 ────────────────────────────────
#>
#> ℹ 2 stages
#>
#> ── Stage 1: Enumeration Areas ──────────────────────────────────────────────────
#> • Strata: region, urban_rural
#> • Cluster: ea_id
#> • Draw: n = 5, method = pps_brewer, mos = households
#>
#> ── Stage 2: Households ─────────────────────────────────────────────────────────
#> • Draw: n = 20, method = srsworFrame Validation
Use validate_frame() to check that a frame has all the
variables a design requires before executing.
# This passes
validate_frame(design, bfa_eas)
# This fails with a clear message
bad_frame <- data.frame(id = 1:100, value = rnorm(100))
validate_frame(design, bad_frame)
#> Error:
#> ! Frame validation failed:
#> ✖ Enumeration Areas: missing stratification variable: "region" and
#> "urban_rural"
#> ✖ Enumeration Areas: missing cluster variable: "ea_id"
#> ✖ Enumeration Areas: missing MOS variable: `households`