Overview
samplyr provides a tidy grammar for specifying and
executing survey sampling designs. The package is built around a minimal
set of composable verbs that handle stratified, clustered, and
multi-stage sampling.
For the statistical semantics and assumptions underlying each
operation, see vignette("design-semantics").
The core idea is that sampling code should read like its English description:
library(samplyr)
library(dplyr)
# "Stratify by region, proportionally allocate 500 samples, execute"
sampling_design(title = "My sampling design") |>
stratify_by(region, alloc = "proportional") |>
draw(n = 500) |>
execute(frame, seed = 1)To see how this works in practice, consider a real survey design from Lohr (2022) (Example 7.1), based on a 1991 study of bed net use in rural Gambia (D’Alessandro et al., 1994):
Malaria morbidity can be reduced by using bed nets impregnated with insecticide, but this is only effective if the bed nets are in widespread use. In 1991, a nationwide survey was designed to estimate the prevalence of bed net use in rural areas of the Gambia (D’Alessandro et al., 1994).
The sampling frame consisted of all rural villages of fewer than 3,000 people. The villages were stratified by three geographic regions (eastern, central, and western) and by whether the village had a public health clinic (PHC) or not. In each region five districts were chosen with probability proportional to the district population. In each district four villages were chosen, again with probability proportional to census population: two PHC villages and two non-PHC villages. Finally, six compounds were chosen more or less randomly from each village.
In samplyr, this three-stage stratified cluster design
translates directly into code:
design <- sampling_design(title = "Gambia bed nets") |>
add_stage() |>
stratify_by(region) |>
cluster_by(district) |>
draw(n = 5, method = "pps_brewer", mos = population) |>
add_stage() |>
stratify_by(phc) |>
cluster_by(village) |>
draw(n = 2, method = "pps_brewer", mos = population) |>
add_stage() |>
draw(n = 6)
design
#> ── Sampling Design: Gambia bed nets ────────────────────────────────────────────
#>
#> ℹ 3 stages
#>
#> ── Stage 1 ─────────────────────────────────────────────────────────────────────
#> • Strata: region
#> • Cluster: district
#> • Draw: n = 5, method = pps_brewer, mos = population
#>
#> ── Stage 2 ─────────────────────────────────────────────────────────────────────
#> • Strata: phc
#> • Cluster: village
#> • Draw: n = 2, method = pps_brewer, mos = population
#>
#> ── Stage 3 ─────────────────────────────────────────────────────────────────────
#> • Draw: n = 6, method = srsworAnd you can then use execute to get your sample.
sampling_design(title = "Gambia bed net") |>
add_stage() |>
stratify_by(region) |>
cluster_by(district) |>
draw(n = 5, method = "pps_brewer", mos = population) |>
add_stage() |>
stratify_by(phc) |>
cluster_by(village) |>
draw(n = 2, method = "pps_brewer", mos = population) |>
add_stage() |>
draw(n = 6) |>
execute(frame, seed = 1991)
# or
execute(design, frame, seed = 1991)The samplyr code mirrors the verbal description verb for
verb.
The Grammar
samplyr uses 5 verbs and 1 modifier:
| Function | Purpose |
|---|---|
sampling_design() |
Create a new sampling design |
stratify_by() |
Define stratification variables and allocation |
cluster_by() |
Define cluster/PSU variable |
draw() |
Specify sample size, method |
execute() |
Run the design on a frame |
add_stage() |
Delimit stages in multi-stage designs |
Deferred Column Resolution
stratify_by() and cluster_by() accept bare
column names and store them in the design object. Column names are
validated only when a frame is available (execute(),
as_svydesign()), which keeps design specification separate
from execution.
design <- sampling_design(title = "Stratified cluster sampling") |>
stratify_by(region, alloc = "proportional") |>
cluster_by(ea_id) |>
draw(n = 200)
sample <- execute(design, bfa_eas, seed = 101)Example Data
We’ll use the bfa_eas dataset throughout this vignette.
It is an enumeration area frame for Burkina Faso, with ~14,900 EAs
across 13 regions.
data(bfa_eas)
bfa_eas |>
glimpse()
#> Rows: 14,900
#> Columns: 12
#> $ ea_id <chr> "EA_00245", "EA_00246", "EA_00247", "EA_00248", "E…
#> $ region <fct> Boucle du Mouhoun, Boucle du Mouhoun, Boucle du Mo…
#> $ province <fct> Bale, Bale, Bale, Bale, Bale, Bale, Bale, Bale, Ba…
#> $ commune <fct> Bagassi, Bagassi, Bagassi, Bagassi, Bagassi, Bagas…
#> $ urban_rural <fct> Rural, Rural, Rural, Rural, Rural, Rural, Rural, R…
#> $ population <dbl> 927, 2184, 1666, 592, 1467, 1607, 709, 1461, 1053,…
#> $ households <int> 111, 262, 200, 71, 176, 193, 85, 176, 127, 122, 16…
#> $ area_km2 <dbl> 57.27, 2.57, 33.44, 31.73, 12.07, 25.53, 12.63, 16…
#> $ accessible <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TR…
#> $ dist_road_km <dbl> 17.5, 11.1, 17.3, 21.6, 4.2, 6.9, 27.4, 10.4, 10.9…
#> $ food_insecurity_pct <dbl> 9.5, 16.3, 28.0, 15.7, 31.4, 15.1, 26.9, 24.8, 17.…
#> $ cost <dbl> 497, 415, 498, 440, 408, 211, 409, 430, 329, 1101,…Simple Random Sampling
The most basic design selects n units at random from the frame.
design <- sampling_design(title = "Simple Random Sampling") |>
draw(n = 100)
design
#> ── Sampling Design: Simple Random Sampling ─────────────────────────────────────
#>
#> ℹ 1 stage
#>
#> ── Stage 1 ─────────────────────────────────────────────────────────────────────
#> • Draw: n = 100, method = srsworThe sampling_design object stores design metadata and
can be reused.
The result includes the original columns plus sampling metadata:
.weight (sampling weight), .weight_1,
.weight_2, etc. (per-stage weights). For with-replacement
methods, .draw_1, .draw_2, etc. each
independent draw.
sample |>
select(ea_id, region, urban_rural, .weight, .weight_1) |>
head()
#> # A tbl_sample: 6 × 5 | Simple Random Sampling
#> # Weights: 149 [149, 149]
#> ea_id region urban_rural .weight .weight_1
#> * <chr> <fct> <fct> <dbl> <dbl>
#> 1 EA_09278 Centre Urban 149 149
#> 2 EA_04606 Sud-Ouest Rural 149 149
#> 3 EA_07895 Sahel Rural 149 149
#> 4 EA_14017 Boucle du Mouhoun Rural 149 149
#> 5 EA_10646 Hauts-Bassins Rural 149 149
#> 6 EA_05166 Hauts-Bassins Rural 149 149Selection Methods
The method argument controls how units are selected. By
default, samplyr uses simple random sampling without
replacement (srswor).
Systematic sampling selects units at fixed intervals, which can improve precision when the frame is ordered.
sample_sys <- sampling_design(title = "Systematic") |>
draw(n = 100, method = "systematic") |>
execute(bfa_eas, seed = 2021)
head(sample_sys)
#> # A tbl_sample: 6 × 17 | Systematic
#> # Weights: 149 [149, 149]
#> ea_id region province commune urban_rural population households area_km2
#> * <chr> <fct> <fct> <fct> <fct> <dbl> <int> <dbl>
#> 1 EA_02176 Boucle d… Bale Boromo Rural 792 103 0.42
#> 2 EA_14036 Boucle d… Bale Yaho Rural 1147 151 8.77
#> 3 EA_12367 Boucle d… Banwa Solenzo Rural 968 116 0.92
#> 4 EA_00756 Boucle d… Kossi Barani Rural 1121 147 9.28
#> 5 EA_03940 Boucle d… Kossi Doumba… Rural 1151 156 30.9
#> 6 EA_02111 Boucle d… Mouhoun Bondok… Rural 845 129 16.3
#> # ℹ 9 more variables: accessible <lgl>, dist_road_km <dbl>,
#> # food_insecurity_pct <dbl>, cost <dbl>, .weight <dbl>, .sample_id <int>,
#> # .stage <int>, .weight_1 <dbl>, .fpc_1 <int>Sampling with replacement allows the same unit to be selected multiple times, useful for bootstrap procedures.
sample_wr <- sampling_design(title = "SRS WR") |>
draw(n = 100, method = "srswr") |>
execute(bfa_eas, seed = 123)
head(sample_wr)
#> # A tbl_sample: 6 × 18 | SRS WR
#> # Weights: 149 [149, 149]
#> ea_id region province commune urban_rural population households area_km2
#> * <chr> <fct> <fct> <fct> <fct> <dbl> <int> <dbl>
#> 1 EA_08950 Centre Kadiogo Ouagad… Urban 1445 274 0.19
#> 2 EA_08998 Centre Kadiogo Ouagad… Urban 1909 362 0.57
#> 3 EA_04259 Hauts-Ba… Houet Fo Rural 334 45 7.61
#> 4 EA_10777 Est Gnagna Piela Rural 709 82 8.57
#> 5 EA_07238 Nord Zondoma Leba Rural 1476 146 27.6
#> 6 EA_09473 Centre Kadiogo Ouagad… Urban 1562 296 0.35
#> # ℹ 10 more variables: accessible <lgl>, dist_road_km <dbl>,
#> # food_insecurity_pct <dbl>, cost <dbl>, .weight <dbl>, .sample_id <int>,
#> # .stage <int>, .weight_1 <dbl>, .fpc_1 <dbl>, .draw_1 <int>Using Sampling Fractions
Instead of a fixed sample size, you can specify a sampling fraction
with frac.
sample <- sampling_design() |>
draw(frac = 0.05) |>
execute(bfa_eas, seed = 1789)
nrow(sample)
#> [1] 745Bernoulli sampling selects each unit independently with the specified probability. This gives a random sample size, which can simplify field protocols.
sample <- sampling_design() |>
draw(frac = 0.05, method = "bernoulli") |>
execute(bfa_eas, seed = 1960)
nrow(sample)
#> [1] 715Rounding Behavior
When using frac, the computed sample size (N × frac)
must be rounded to an integer. By default, samplyr rounds
up (ceiling), which matches SAS SURVEYSELECT’s default behavior. Use the
round argument to control this:
-
"up"(default): round up, ensures adequate sample size -
"down": round down, minimum sample size -
"nearest": Standard mathematical rounding
# Frame of 105 units, frac = 0.1 → 10.5
frame_105 <- head(bfa_eas, 105)
# Default: rounds up to 11
nrow(sampling_design() |> draw(frac = 0.1) |> execute(frame_105, seed = 3))
#> [1] 11
# Round down to 10
nrow(sampling_design() |> draw(frac = 0.1, round = "down") |> execute(frame_105, seed = 3))
#> [1] 10
# Round to nearest (10)
nrow(sampling_design() |> draw(frac = 0.1, round = "nearest") |> execute(frame_105, seed = 3))
#> [1] 10Stratified Sampling
Stratification partitions the frame into non-overlapping groups (strata) and samples from each. This ensures representation from all subgroups and often improves precision.
Per-Stratum Sample Size
Without an allocation method, n specifies the sample
size within each stratum.
sample <- sampling_design() |>
stratify_by(region) |>
draw(n = 20) |>
execute(bfa_eas, seed = 123)
sample |>
count(region)
#> # A tibble: 13 × 2
#> region n
#> <fct> <int>
#> 1 Boucle du Mouhoun 20
#> 2 Cascades 20
#> 3 Centre 20
#> 4 Centre-Est 20
#> 5 Centre-Nord 20
#> 6 Centre-Ouest 20
#> 7 Centre-Sud 20
#> 8 Est 20
#> 9 Hauts-Bassins 20
#> 10 Nord 20
#> 11 Plateau-Central 20
#> 12 Sahel 20
#> 13 Sud-Ouest 20Allocation Methods
With an allocation method, n becomes the total sample
size to distribute across strata. Allocation methods are defined for
total sample-size allocation, so draw(frac = ...) is not
supported when alloc is set.
Proportional allocation distributes the sample proportionally to stratum sizes, ensuring the sample mirrors the population structure.
sample <- sampling_design(title = "Proportional Allocation") |>
stratify_by(region, alloc = "proportional") |>
draw(n = 300) |>
execute(bfa_eas, seed = 42)
sample |>
count(region) |>
mutate(pct = n / sum(n))
#> # A tibble: 13 × 3
#> region n pct
#> <fct> <int> <dbl>
#> 1 Boucle du Mouhoun 30 0.1
#> 2 Cascades 14 0.0467
#> 3 Centre 31 0.103
#> 4 Centre-Est 25 0.0833
#> 5 Centre-Nord 28 0.0933
#> 6 Centre-Ouest 26 0.0867
#> 7 Centre-Sud 12 0.04
#> 8 Est 32 0.107
#> 9 Hauts-Bassins 30 0.1
#> 10 Nord 24 0.08
#> 11 Plateau-Central 15 0.05
#> 12 Sahel 18 0.06
#> 13 Sud-Ouest 15 0.05Equal allocation assigns the same sample size to each stratum, regardless of population size. This maximizes precision for comparisons between strata.
sample <- sampling_design(title = "Equal Allocation") |>
stratify_by(region, alloc = "equal") |>
draw(n = 160) |>
execute(bfa_eas, seed = 2026)
sample |>
count(region)
#> # A tibble: 13 × 2
#> region n
#> <fct> <int>
#> 1 Boucle du Mouhoun 13
#> 2 Cascades 13
#> 3 Centre 13
#> 4 Centre-Est 13
#> 5 Centre-Nord 12
#> 6 Centre-Ouest 12
#> 7 Centre-Sud 12
#> 8 Est 12
#> 9 Hauts-Bassins 12
#> 10 Nord 12
#> 11 Plateau-Central 12
#> 12 Sahel 12
#> 13 Sud-Ouest 12Neyman allocation minimizes the variance of the
overall estimate by allocating more sample to strata with higher
variability. It requires prior information about stratum variances. Use
explicit stratification variable names when passing
variance.
data(bfa_eas_variance)
bfa_eas_variance
#> # A tibble: 13 × 2
#> region var
#> <fct> <dbl>
#> 1 Boucle du Mouhoun 99.2
#> 2 Cascades 57.7
#> 3 Centre 32.1
#> 4 Centre-Est 61.1
#> 5 Centre-Nord 97.4
#> 6 Centre-Ouest 33.7
#> 7 Centre-Sud 31.7
#> 8 Est 144.
#> 9 Hauts-Bassins 60.0
#> 10 Nord 137.
#> 11 Plateau-Central 59.4
#> 12 Sahel 142.
#> 13 Sud-Ouest 60.8
sample <- sampling_design(title = "Neyman Allocation") |>
stratify_by(region, alloc = "neyman", variance = bfa_eas_variance) |>
draw(n = 300) |>
execute(bfa_eas, seed = 12345)
sample |>
count(region)
#> # A tibble: 13 × 2
#> region n
#> <fct> <int>
#> 1 Boucle du Mouhoun 34
#> 2 Cascades 12
#> 3 Centre 20
#> 4 Centre-Est 23
#> 5 Centre-Nord 31
#> 6 Centre-Ouest 17
#> 7 Centre-Sud 8
#> 8 Est 44
#> 9 Hauts-Bassins 27
#> 10 Nord 33
#> 11 Plateau-Central 13
#> 12 Sahel 25
#> 13 Sud-Ouest 13Optimal allocation extends Neyman allocation by also
accounting for differential costs across strata. It minimizes variance
for a fixed budget (or cost for fixed precision). Use explicit
stratification variable names when passing
variance/cost.
data(bfa_eas_cost)
bfa_eas_cost
#> # A tibble: 13 × 2
#> region cost
#> <fct> <dbl>
#> 1 Boucle du Mouhoun 491
#> 2 Cascades 429
#> 3 Centre 240
#> 4 Centre-Est 407
#> 5 Centre-Nord 449
#> 6 Centre-Ouest 396
#> 7 Centre-Sud 419
#> 8 Est 624
#> 9 Hauts-Bassins 365
#> 10 Nord 592
#> 11 Plateau-Central 408
#> 12 Sahel 638
#> 13 Sud-Ouest 418
sample <- sampling_design(title = "Optimal Allocation") |>
stratify_by(region, alloc = "optimal",
variance = bfa_eas_variance,
cost = bfa_eas_cost) |>
draw(n = 300) |>
execute(bfa_eas, seed = 9876)
sample |>
count(region)
#> # A tibble: 13 × 2
#> region n
#> <fct> <int>
#> 1 Boucle du Mouhoun 33
#> 2 Cascades 12
#> 3 Centre 28
#> 4 Centre-Est 24
#> 5 Centre-Nord 31
#> 6 Centre-Ouest 18
#> 7 Centre-Sud 8
#> 8 Est 38
#> 9 Hauts-Bassins 30
#> 10 Nord 29
#> 11 Plateau-Central 14
#> 12 Sahel 21
#> 13 Sud-Ouest 14Power allocation (Bankier 1988) uses \(n_h \propto C_h X_h^q\), where
cv supplies \(C_h\),
importance supplies \(X_h\), and power is \(q \in [0, 1]\).
cv_df <- data.frame(
region = levels(bfa_eas$region),
cv = c(0.40, 0.35, 0.12, 0.20, 0.30, 0.18,
0.15, 0.38, 0.22, 0.32, 0.17, 0.45, 0.25)
)
importance_df <- data.frame(
region = levels(bfa_eas$region),
importance = c(60, 40, 120, 70, 80, 65,
50, 55, 90, 75, 45, 35, 30)
)
sample <- sampling_design(title = "Power Allocation") |>
stratify_by(
region,
alloc = "power",
cv = cv_df,
importance = importance_df,
power = 0.5
) |>
draw(n = 300) |>
execute(bfa_eas, seed = 777)
sample |>
count(region)
#> # A tibble: 13 × 2
#> region n
#> <fct> <int>
#> 1 Boucle du Mouhoun 35
#> 2 Cascades 25
#> 3 Centre 15
#> 4 Centre-Est 19
#> 5 Centre-Nord 31
#> 6 Centre-Ouest 16
#> 7 Centre-Sud 12
#> 8 Est 32
#> 9 Hauts-Bassins 24
#> 10 Nord 32
#> 11 Plateau-Central 13
#> 12 Sahel 30
#> 13 Sud-Ouest 16Sample Size Bounds
When using allocation methods, you can constrain stratum sample sizes
with min_n and max_n in
draw().
Minimum sample size ensures every stratum gets at least the specified number of units. This is essential for variance estimation (requires n ≥ 2 per stratum) or when you need reliable subgroup estimates.
# Neyman allocation with minimum 5 per stratum
sample <- sampling_design() |>
stratify_by(region, alloc = "neyman", variance = bfa_eas_variance) |>
draw(n = 300, min_n = 20) |>
execute(bfa_eas, seed = 1)
sample |>
count(region)
#> # A tibble: 13 × 2
#> region n
#> <fct> <int>
#> 1 Boucle du Mouhoun 27
#> 2 Cascades 20
#> 3 Centre 20
#> 4 Centre-Est 21
#> 5 Centre-Nord 26
#> 6 Centre-Ouest 20
#> 7 Centre-Sud 20
#> 8 Est 33
#> 9 Hauts-Bassins 23
#> 10 Nord 27
#> 11 Plateau-Central 20
#> 12 Sahel 23
#> 13 Sud-Ouest 20Maximum sample size caps large strata to prevent them from dominating the sample. This is useful when operational constraints limit how many units you can handle in any one stratum.
# Proportional allocation but cap at 60 per stratum
sample <- sampling_design() |>
stratify_by(region, alloc = "proportional") |>
draw(n = 300, max_n = 55) |>
execute(bfa_eas, seed = 40)
sample |>
count(region)
#> # A tibble: 13 × 2
#> region n
#> <fct> <int>
#> 1 Boucle du Mouhoun 30
#> 2 Cascades 14
#> 3 Centre 31
#> 4 Centre-Est 25
#> 5 Centre-Nord 28
#> 6 Centre-Ouest 26
#> 7 Centre-Sud 12
#> 8 Est 32
#> 9 Hauts-Bassins 30
#> 10 Nord 24
#> 11 Plateau-Central 15
#> 12 Sahel 18
#> 13 Sud-Ouest 15Both bounds together create a feasible range for each stratum:
sample <- sampling_design() |>
stratify_by(region, alloc = "proportional") |>
draw(n = 300, min_n = 20, max_n = 55) |>
execute(bfa_eas, seed = 2003)
sample |>
count(region)
#> # A tibble: 13 × 2
#> region n
#> <fct> <int>
#> 1 Boucle du Mouhoun 26
#> 2 Cascades 20
#> 3 Centre 27
#> 4 Centre-Est 23
#> 5 Centre-Nord 25
#> 6 Centre-Ouest 23
#> 7 Centre-Sud 20
#> 8 Est 27
#> 9 Hauts-Bassins 26
#> 10 Nord 23
#> 11 Plateau-Central 20
#> 12 Sahel 20
#> 13 Sud-Ouest 20When a stratum’s population is smaller than min_n, the
entire stratum is selected (capped at population size).
Custom allocation lets you specify exact sample
sizes or rates per stratum by passing a data frame to
draw().
sizes_df <- data.frame(
region = levels(bfa_eas$region),
n = c(20, 12, 25, 18, 22, 16, 14, 15, 20, 18, 12, 10, 8)
)
sample <- sampling_design() |>
stratify_by(region) |>
draw(n = sizes_df) |>
execute(bfa_eas, seed = 101)
sample |>
count(region)
#> # A tibble: 13 × 2
#> region n
#> <fct> <int>
#> 1 Boucle du Mouhoun 20
#> 2 Cascades 12
#> 3 Centre 25
#> 4 Centre-Est 18
#> 5 Centre-Nord 22
#> 6 Centre-Ouest 16
#> 7 Centre-Sud 14
#> 8 Est 15
#> 9 Hauts-Bassins 20
#> 10 Nord 18
#> 11 Plateau-Central 12
#> 12 Sahel 10
#> 13 Sud-Ouest 8Multiple Stratification Variables
You can stratify by multiple variables to create crossed strata. Here we stratify by both region and urban/rural status.
sample <- sampling_design() |>
stratify_by(region, urban_rural, alloc = "proportional") |>
draw(n = 300) |>
execute(bfa_eas, seed = 42)
sample |>
count(region, urban_rural) |>
head(10)
#> # A tibble: 10 × 3
#> region urban_rural n
#> <fct> <fct> <int>
#> 1 Boucle du Mouhoun Rural 30
#> 2 Cascades Rural 13
#> 3 Centre Rural 2
#> 4 Centre Urban 29
#> 5 Centre-Est Rural 23
#> 6 Centre-Est Urban 3
#> 7 Centre-Nord Rural 24
#> 8 Centre-Nord Urban 3
#> 9 Centre-Ouest Rural 22
#> 10 Centre-Ouest Urban 4Cluster Sampling
Cluster sampling selects groups (clusters) rather than individual units. This is practical when a complete list of individuals isn’t available or when travel costs make dispersed sampling expensive.
Use cluster_by() to specify the cluster identifier. All
units within selected clusters are included.
sample <- sampling_design() |>
cluster_by(ea_id) |>
draw(n = 50) |>
execute(bfa_eas, seed = 120)
sample |>
summarise(
n_clusters = n_distinct(ea_id),
n_units = n()
)
#> # A tibble: 1 × 2
#> n_clusters n_units
#> <int> <int>
#> 1 50 50PPS Sampling
Probability Proportional to Size (PPS) sampling selects units with probability proportional to a size measure. This is standard for cluster sampling where clusters vary in size.
Use the mos argument to specify the measure of size
variable. Here we select EAs with probability proportional to their
household count.
sample_pps <- sampling_design() |>
cluster_by(ea_id) |>
draw(n = 50, method = "pps_brewer", mos = households) |>
execute(bfa_eas, seed = 365)
sample_pps |>
summarise(
mean_hh = mean(households),
median_hh = median(households)
)
#> # A tibble: 1 × 2
#> mean_hh median_hh
#> <dbl> <dbl>
#> 1 241. 192.Compared to simple random sampling, PPS tends to select larger clusters:
sample_srs <- sampling_design() |>
cluster_by(ea_id) |>
draw(n = 50) |>
execute(bfa_eas, seed = 42)
bind_rows(
sample_pps |> summarise(method = "PPS", mean_hh = mean(households)),
sample_srs |> summarise(method = "SRS", mean_hh = mean(households))
)
#> # A tibble: 2 × 2
#> method mean_hh
#> <chr> <dbl>
#> 1 PPS 241.
#> 2 SRS 213PPS Methods
samplyr supports several PPS methods:
| Method | Description |
|---|---|
pps_brewer |
Brewer’s method - fast with good properties (recommended) |
pps_systematic |
PPS systematic - simple but can have periodicity issues |
pps_cps |
Conditional Poisson sampling (maximum entropy) |
pps_poisson |
PPS Poisson - random sample size (requires frac and
supports prn) |
pps_sps |
Sequential Poisson sampling (supports prn) |
pps_pareto |
Pareto sampling (supports prn) |
pps_multinomial |
PPS with replacement |
pps_chromy |
Chromy’s sequential PPS / minimum replacement |
Certainty Selection
In PPS sampling, very large units can have expected inclusion probabilities above 1. Certainty selection handles this by selecting those units with probability 1 and sampling the rest normally.
Use certainty_size for an absolute threshold or
certainty_prop for a proportional one.
sample_cert <- sampling_design(title = "Certainty selection") |>
draw(n = 50, method = "pps_brewer", mos = households,
certainty_size = 800) |>
execute(bfa_eas, seed = 123)
# How many EAs with more than 800 households
filter(bfa_eas, households > 800)
#> # A tibble: 10 × 12
#> ea_id region province commune urban_rural population households area_km2
#> <chr> <fct> <fct> <fct> <fct> <dbl> <int> <dbl>
#> 1 EA_08824 Centre Kadiogo Ouagad… Urban 4500 853 1.39
#> 2 EA_01319 Centre-… Boulgou Bitou Rural 4559 815 8.31
#> 3 EA_08542 Centre-… Boulgou Niaogho Rural 3505 865 4.38
#> 4 EA_03315 Centre-… Kourite… Dialga… Rural 3617 807 9.26
#> 5 EA_07018 Centre-… Kourite… Koupela Urban 4640 910 3.62
#> 6 EA_05858 Centre-… Sanmate… Kaya Urban 4983 809 10.1
#> 7 EA_13180 Centre-… Boulkie… Thiou Rural 5273 875 23.8
#> 8 EA_11234 Centre-… Sanguie Reo Urban 5171 886 24.5
#> 9 EA_00797 Nord Yatenga Barga Urban 5840 1032 6.03
#> 10 EA_02277 Plateau… Ganzour… Boudri Rural 5273 807 23.0
#> # ℹ 4 more variables: accessible <lgl>, dist_road_km <dbl>,
#> # food_insecurity_pct <dbl>, cost <dbl>
# Check which units were selected with certainty
count(sample_cert, .certainty_1)
#> # A tibble: 2 × 2
#> .certainty_1 n
#> <lgl> <int>
#> 1 FALSE 40
#> 2 TRUE 10With certainty_prop, units whose MOS proportion exceeds
the threshold are taken with certainty. The process is iterative: after
removing certainty units, proportions are recomputed.
sample_certp <- sampling_design() |>
stratify_by(region) |>
draw(n = 50, method = "pps_systematic", mos = households,
certainty_prop = 0.05) |>
execute(bfa_eas, seed = 12345)
# How many EAs with more than 5% of the total hh by region
bfa_eas |>
mutate(mosprop = households/sum(households),
.by = region) |>
filter(mosprop >= 0.05)
#> # A tibble: 0 × 13
#> # ℹ 13 variables: ea_id <chr>, region <fct>, province <fct>, commune <fct>,
#> # urban_rural <fct>, population <dbl>, households <int>, area_km2 <dbl>,
#> # accessible <lgl>, dist_road_km <dbl>, food_insecurity_pct <dbl>,
#> # cost <dbl>, mosprop <dbl>
filter(sample_certp, .certainty_1)
#> Warning in min(w): no non-missing arguments to min; returning Inf
#> Warning in max(w): no non-missing arguments to max; returning -Inf
#> # A tbl_sample: 0 × 18
#> # Weights: NaN [Inf, -Inf]
#> # ℹ 18 variables: ea_id <chr>, region <fct>, province <fct>, commune <fct>,
#> # urban_rural <fct>, population <dbl>, households <int>, area_km2 <dbl>,
#> # accessible <lgl>, dist_road_km <dbl>, food_insecurity_pct <dbl>,
#> # cost <dbl>, .weight <dbl>, .sample_id <int>, .stage <int>, .weight_1 <dbl>,
#> # .fpc_1 <int>, .certainty_1 <lgl>Permanent Random Numbers (PRN)
Some PPS methods support permanent random numbers for sample
coordination across survey waves. Assign a stable U(0,1) value to each
frame unit, then pass it via prn:
set.seed(1)
bfa_eas$prn <- runif(nrow(bfa_eas))
sample_prn <- sampling_design(title = "Samples coordination with PRN") |>
cluster_by(ea_id) |>
draw(n = 50, method = "pps_sps", mos = households, prn = prn) |>
execute(bfa_eas, seed = 1)
sample_prn
#> # A tbl_sample: 50 × 19 | Samples coordination with PRN
#> # Weights: 301.91 [121.15, 635.71]
#> ea_id region province commune urban_rural population households area_km2
#> * <chr> <fct> <fct> <fct> <fct> <dbl> <int> <dbl>
#> 1 EA_12873 Boucle … Banwa Tansila Rural 1164 148 39.1
#> 2 EA_03170 Boucle … Mouhoun Dedoug… Rural 1209 215 3.81
#> 3 EA_12909 Boucle … Mouhoun Tcheri… Rural 1252 150 43.2
#> 4 EA_05765 Boucle … Sourou Kassoum Rural 1092 148 25.1
#> 5 EA_12525 Cascades Comoe Soubak… Rural 1224 173 19.1
#> 6 EA_13358 Cascades Comoe Tiefora Rural 1345 161 25.3
#> 7 EA_13387 Cascades Comoe Tiefora Rural 931 112 22.0
#> 8 EA_08816 Centre Kadiogo Ouagad… Urban 1745 331 0.22
#> 9 EA_09321 Centre Kadiogo Ouagad… Urban 1936 367 0.26
#> 10 EA_09370 Centre Kadiogo Ouagad… Urban 2686 509 0.41
#> # ℹ 40 more rows
#> # ℹ 11 more variables: accessible <lgl>, dist_road_km <dbl>,
#> # food_insecurity_pct <dbl>, cost <dbl>, prn <dbl>, .weight <dbl>,
#> # .sample_id <int>, .stage <int>, .weight_1 <dbl>, .fpc_1 <int>,
#> # .certainty_1 <lgl>Because the PRN values are stable, re-executing the same design on
the same frame produces highly overlapping samples (positive
coordination). PRN is supported for bernoulli,
pps_poisson, pps_sps, and
pps_pareto. You can learn more about coordination with
permanent random numbers in the following vignette
vignette("sampling-coordination").
Control Sorting and Serpentine Ordering
Control sorting orders the frame before selection, providing implicit stratification. This is effective with systematic and sequential methods, where it ensures the sample spreads evenly across the sorted variables.
# Nested sorting: standard ascending order
sample_sort <- sampling_design() |>
draw(n = 100, method = "systematic",
control = c(region, province)) |>
execute(bfa_eas, seed = 98765)
head(sample_sort)
#> # A tbl_sample: 6 × 18
#> # Weights: 149 [149, 149]
#> ea_id region province commune urban_rural population households area_km2
#> * <chr> <fct> <fct> <fct> <fct> <dbl> <int> <dbl>
#> 1 EA_04231 Boucle d… Bale Fara Rural 1722 275 21.6
#> 2 EA_06858 Boucle d… Banwa Kouka Rural 1446 166 19.5
#> 3 EA_12421 Boucle d… Banwa Solenzo Rural 579 69 8.59
#> 4 EA_02526 Boucle d… Kossi Bouras… Rural 1460 165 2.38
#> 5 EA_08637 Boucle d… Kossi Nouna Rural 1261 164 1.4
#> 6 EA_03146 Boucle d… Mouhoun Dedoug… Rural 1266 225 15.5
#> # ℹ 10 more variables: accessible <lgl>, dist_road_km <dbl>,
#> # food_insecurity_pct <dbl>, cost <dbl>, prn <dbl>, .weight <dbl>,
#> # .sample_id <int>, .stage <int>, .weight_1 <dbl>, .fpc_1 <int>serp() implements serpentine (snake) sorting, which
alternates direction at each level of the hierarchy. This minimizes
jumps at boundaries, matching SAS PROC SURVEYSELECT’s
SORT=SERP.
# Serpentine sorting
sample_serp <- sampling_design() |>
draw(n = 100, method = "systematic",
control = serp(region, province)) |>
execute(bfa_eas, seed = 1)
head(sample_serp)
#> # A tbl_sample: 6 × 18
#> # Weights: 149 [149, 149]
#> ea_id region province commune urban_rural population households area_km2
#> * <chr> <fct> <fct> <fct> <fct> <dbl> <int> <dbl>
#> 1 EA_00461 Boucle d… Bale Bana Rural 1348 176 7.3
#> 2 EA_11103 Boucle d… Bale Poura Urban 1660 225 19.0
#> 3 EA_12339 Boucle d… Banwa Solenzo Rural 1077 129 20.5
#> 4 EA_12897 Boucle d… Banwa Tansila Rural 1205 154 26.6
#> 5 EA_03729 Boucle d… Kossi Dokui Rural 1476 236 18.2
#> 6 EA_12514 Boucle d… Kossi Sono Rural 1296 187 23.6
#> # ℹ 10 more variables: accessible <lgl>, dist_road_km <dbl>,
#> # food_insecurity_pct <dbl>, cost <dbl>, prn <dbl>, .weight <dbl>,
#> # .sample_id <int>, .stage <int>, .weight_1 <dbl>, .fpc_1 <int>Control sorting can be combined with explicit stratification. The sorting is then applied within each stratum.
sample_serp2 <- sampling_design() |>
stratify_by(urban_rural) |>
draw(n = 100, method = "systematic",
control = serp(region, province)) |>
execute(bfa_eas, seed = 2)
head(sample_serp2)
#> # A tbl_sample: 6 × 18
#> # Weights: 122.51 [122.51, 122.51]
#> ea_id region province commune urban_rural population households area_km2
#> * <chr> <fct> <fct> <fct> <fct> <dbl> <int> <dbl>
#> 1 EA_00267 Boucle d… Bale Bagassi Rural 1319 159 17.1
#> 2 EA_10342 Boucle d… Bale Ouri Rural 1522 182 2.25
#> 3 EA_06864 Boucle d… Banwa Kouka Rural 990 114 19.1
#> 4 EA_12401 Boucle d… Banwa Solenzo Rural 1062 127 29.3
#> 5 EA_00763 Boucle d… Kossi Barani Rural 1638 215 11.5
#> 6 EA_03738 Boucle d… Kossi Dokui Rural 1314 210 22.0
#> # ℹ 10 more variables: accessible <lgl>, dist_road_km <dbl>,
#> # food_insecurity_pct <dbl>, cost <dbl>, prn <dbl>, .weight <dbl>,
#> # .sample_id <int>, .stage <int>, .weight_1 <dbl>, .fpc_1 <int>Balanced Sampling
Balanced sampling selects a sample whose Horvitz-Thompson estimates
of auxiliary totals match (or nearly match) the population totals. This
improves precision for any variable correlated with the auxiliary
variables. samplyr implements the cube method (Deville
& Tille 2004) via method = "balanced".
Specify auxiliary variables with aux. Without
mos, inclusion probabilities are equal; with
mos, they are proportional to size:
# Equal-probability balanced sample
balanced_eq <- sampling_design() |>
draw(n = 100, method = "balanced",
aux = c(population, households)) |>
execute(bfa_eas, seed = 42)
balanced_eq
#> # A tbl_sample: 100 × 18
#> # Weights: 149 [149, 149]
#> ea_id region province commune urban_rural population households area_km2
#> * <chr> <fct> <fct> <fct> <fct> <dbl> <int> <dbl>
#> 1 EA_04217 Boucle … Bale Fara Rural 682 109 8.85
#> 2 EA_14036 Boucle … Bale Yaho Rural 1147 151 8.77
#> 3 EA_12381 Boucle … Banwa Solenzo Rural 3444 411 3.48
#> 4 EA_03598 Boucle … Kossi Djibas… Rural 1039 123 9.6
#> 5 EA_10134 Boucle … Mouhoun Ouarko… Rural 943 131 19.3
#> 6 EA_04520 Boucle … Nayala Gassan Rural 1344 164 70.2
#> 7 EA_13843 Boucle … Sourou Tougan Rural 1973 273 100.
#> 8 EA_10283 Cascades Comoe Ouo Rural 1221 160 74.6
#> 9 EA_12524 Cascades Comoe Soubak… Rural 1532 216 17.4
#> 10 EA_13400 Cascades Comoe Tiefora Rural 1275 153 14.0
#> # ℹ 90 more rows
#> # ℹ 10 more variables: accessible <lgl>, dist_road_km <dbl>,
#> # food_insecurity_pct <dbl>, cost <dbl>, prn <dbl>, .weight <dbl>,
#> # .sample_id <int>, .stage <int>, .weight_1 <dbl>, .fpc_1 <int>The balancing property means the HT estimate of auxiliary totals is close to the true total:
pop_total <- sum(bfa_eas$population)
ht_total <- sum(balanced_eq$population * balanced_eq$.weight)
c(population = pop_total, ht_estimate = ht_total)
#> population ht_estimate
#> 20436039 20546504Balanced sampling combines naturally with stratification. When
stratified, samplyr calls the stratified cube algorithm
(Chauvet 2009) in a single pass, preserving the allocation computed by
stratify_by():
# Stratified PPS balanced sample
balanced_strat <- sampling_design() |>
stratify_by(region, alloc = "proportional") |>
draw(n = 300, method = "balanced", mos = households,
aux = c(population, area_km2)) |>
execute(bfa_eas, seed = 1960)
balanced_strat
#> # A tbl_sample: 300 × 18
#> # Weights: 50.6 [14.67, 228.32]
#> ea_id region province commune urban_rural population households area_km2
#> * <chr> <fct> <fct> <fct> <fct> <dbl> <int> <dbl>
#> 1 EA_02177 Boucle … Bale Boromo Rural 3387 442 42.2
#> 2 EA_10382 Boucle … Bale Pa Rural 825 116 0.66
#> 3 EA_11051 Boucle … Bale Pompoi Rural 1021 161 35.6
#> 4 EA_11060 Boucle … Bale Pompoi Rural 802 126 0.72
#> 5 EA_00369 Boucle … Banwa Balave Rural 2049 323 36.5
#> 6 EA_12378 Boucle … Banwa Solenzo Rural 1296 155 33.9
#> 7 EA_12398 Boucle … Banwa Solenzo Rural 1813 216 34.9
#> 8 EA_12405 Boucle … Banwa Solenzo Rural 1463 175 19.2
#> 9 EA_12438 Boucle … Banwa Solenzo Rural 1394 166 33.8
#> 10 EA_02076 Boucle … Kossi Bombor… Rural 1013 133 8.85
#> # ℹ 290 more rows
#> # ℹ 10 more variables: accessible <lgl>, dist_road_km <dbl>,
#> # food_insecurity_pct <dbl>, cost <dbl>, prn <dbl>, .weight <dbl>,
#> # .sample_id <int>, .stage <int>, .weight_1 <dbl>, .fpc_1 <dbl>When used with cluster_by(), auxiliary values are
automatically aggregated (summed) to the cluster level before
selection.
Balanced sampling supports up to 2 stages. At stage 1, element-level auxiliary variables are aggregated to PSU totals; at stage 2, the cube method runs on the element-level frame within each selected PSU.
Multi-Stage Sampling
Multi-stage designs sample in stages: first select primary sampling units (PSUs), then sample within them. This is the standard approach for large-scale surveys when a complete frame of ultimate units isn’t available upfront.
Use add_stage() to delimit each stage of the design.
Two-Stage Design
For this example, we’ll use zwe_eas to demonstrate a
two-stage design: first select districts as PSUs, then sample EAs within
selected districts.
data(zwe_eas)
# First, let's see the structure: EAs nested within districts
zwe_eas |>
count(province, district) |>
head(10)
#> # A tibble: 10 × 3
#> province district n
#> <fct> <fct> <int>
#> 1 Bulawayo Bulawayo 453
#> 2 Harare Chitungwiza 161
#> 3 Harare Epworth 125
#> 4 Harare Harare 969
#> 5 Harare Harare Rural 224
#> 6 Manicaland Buhera 556
#> 7 Manicaland Chimanimani 206
#> 8 Manicaland Chipinge 489
#> 9 Manicaland Chipinge Urban 22
#> 10 Manicaland Makoni 668We need to create a measure of size for districts. Here we’ll use total households per district.
# Add district-level households as MOS
zwe_frame <- zwe_eas |>
mutate(district_hh = sum(households),
.by = district)Now we can run a two-stage design: select 10 districts with PPS, then sample 5 EAs within each.
twostage_design <- sampling_design(title = "Two-stage cluster sampling") |>
add_stage(label = "Districts") |>
cluster_by(district) |>
draw(n = 10, method = "pps_brewer", mos = district_hh) |>
add_stage(label = "EAs") |>
draw(n = 5)
twostage_design
#> ── Sampling Design: Two-stage cluster sampling ─────────────────────────────────
#>
#> ℹ 2 stages
#>
#> ── Stage 1: Districts ──────────────────────────────────────────────────────────
#> • Cluster: district
#> • Draw: n = 10, method = pps_brewer, mos = district_hh
#>
#> ── Stage 2: EAs ────────────────────────────────────────────────────────────────
#> • Draw: n = 5, method = srsworLet’s apply it to the frame.
sample_eas <- execute(twostage_design,
zwe_frame, seed = 3)
sample_eas |>
summarise(n_districts = n_distinct(district),
n_eas = n())
#> # A tibble: 1 × 2
#> n_districts n_eas
#> <int> <int>
#> 1 10 50Stratified Multi-Stage Design
Combining stratification with multi-stage sampling is standard for national surveys. Here we stratify by province (which is constant within districts), then select districts and EAs within each stratum.
sample_eas2 <- sampling_design(title = "Stratified two-stage cluster sampling") |>
add_stage(label = "Districts") |>
stratify_by(province) |>
cluster_by(district) |>
draw(n = 2, method = "pps_brewer", mos = district_hh) |>
add_stage(label = "EAs") |>
draw(n = 3) |>
execute(zwe_frame, seed = 4)
#> Warning: Sample size capped to population in 1 stratum/strata: "Bulawayo".
#> ℹ Requested total: 20. Actual total: 19.
sample_eas2 |>
count(province, district)
#> # A tibble: 19 × 3
#> province district n
#> <fct> <fct> <int>
#> 1 Bulawayo Bulawayo 3
#> 2 Harare Harare 3
#> 3 Harare Harare Rural 3
#> 4 Manicaland Buhera 3
#> 5 Manicaland Makoni 3
#> 6 Mashonaland Central Guruve 3
#> 7 Mashonaland Central Mount Darwin 3
#> 8 Mashonaland East Goromonzi 3
#> 9 Mashonaland East Murehwa 3
#> 10 Mashonaland West Sanyati 3
#> 11 Mashonaland West Zvimba 3
#> 12 Masvingo Bikita 3
#> 13 Masvingo Masvingo 3
#> 14 Matabeleland North Binga 3
#> 15 Matabeleland North Bubi 3
#> 16 Matabeleland South Gwanda 3
#> 17 Matabeleland South Umzingwane 3
#> 18 Midlands Gokwe South 3
#> 19 Midlands Zvishavane 3Operational Multi-Stage Sampling
In practice, multi-stage surveys often execute stages at different times. After selecting PSUs, field teams may need to conduct listing or other operations before the second stage can proceed.
The stages argument lets you execute only specific
stages.
design <- sampling_design() |>
add_stage(label = "Districts") |>
stratify_by(province) |>
cluster_by(district) |>
draw(n = 2, method = "pps_brewer", mos = district_hh) |>
add_stage(label = "EAs") |>
draw(n = 4)
# Execute stage 1 only
selected_districts <- execute(design, zwe_frame, stages = 1, seed = 42)
#> Warning: Sample size capped to population in 1 stratum/strata: "Bulawayo".
#> ℹ Requested total: 20. Actual total: 19.
selected_districts |>
distinct(province, district)
#> # A tibble: 19 × 2
#> province district
#> <fct> <fct>
#> 1 Bulawayo Bulawayo
#> 2 Harare Harare
#> 3 Harare Harare Rural
#> 4 Manicaland Chipinge
#> 5 Manicaland Nyanga
#> 6 Mashonaland Central Mazowe
#> 7 Mashonaland Central Mount Darwin
#> 8 Mashonaland East Mudzi
#> 9 Mashonaland East Murehwa
#> 10 Mashonaland West Chinhoyi
#> 11 Mashonaland West Kariba
#> 12 Masvingo Chivi
#> 13 Masvingo Masvingo
#> 14 Matabeleland North Tsholotsho
#> 15 Matabeleland North Umguza
#> 16 Matabeleland South Bulilima
#> 17 Matabeleland South Gwanda
#> 18 Midlands Shurugwi Town
#> 19 Midlands ZvishavaneAfter fieldwork or preparation is complete, execute the remaining stages.
final_sample <- selected_districts |>
execute(zwe_frame, seed = 43)
final_sample |>
count(province, district)
#> # A tibble: 19 × 3
#> province district n
#> <fct> <fct> <int>
#> 1 Bulawayo Bulawayo 4
#> 2 Harare Harare 4
#> 3 Harare Harare Rural 4
#> 4 Manicaland Chipinge 4
#> 5 Manicaland Nyanga 4
#> 6 Mashonaland Central Mazowe 4
#> 7 Mashonaland Central Mount Darwin 4
#> 8 Mashonaland East Mudzi 4
#> 9 Mashonaland East Murehwa 4
#> 10 Mashonaland West Chinhoyi 4
#> 11 Mashonaland West Kariba 4
#> 12 Masvingo Chivi 4
#> 13 Masvingo Masvingo 4
#> 14 Matabeleland North Tsholotsho 4
#> 15 Matabeleland North Umguza 4
#> 16 Matabeleland South Bulilima 4
#> 17 Matabeleland South Gwanda 4
#> 18 Midlands Shurugwi Town 4
#> 19 Midlands Zvishavane 4Working with Results
Accessing the Design
The sampling design is attached to the result and can be retrieved
with get_design().
sample <- sampling_design() |>
stratify_by(region, alloc = "proportional") |>
draw(n = 200) |>
execute(bfa_eas, seed = 2024)
design <- get_design(sample)
design
#> ── Sampling Design ─────────────────────────────────────────────────────────────
#>
#> ℹ 1 stage
#>
#> ── Stage 1 ─────────────────────────────────────────────────────────────────────
#> • Strata: region (proportional)
#> • Draw: n = 200, method = srsworand accessing element of the design
design$stages[[1]]$strata$vars
#> [1] "region"Weight Diagnostics
The summary() method reports weight diagnostics
including Kish’s effective sample size and design effect due to unequal
weighting. For the full design effect (including clustering and
stratification), use survey::svymean() with
deff = TRUE after converting with
as_svydesign().
summary(sample)
#> ── Sample Summary ──────────────────────────────────────────────────────────────
#>
#> ℹ n = 200 | stages = 1/1 | seed = 2024
#>
#> ── Design: Stage 1 ─────────────────────────────────────────────────────────────
#> • Strata: region (proportional)
#> • Method: srswor
#>
#> ── Allocation: Stage 1 ─────────────────────────────────────────────────────────
#> region N_h n_h f_h
#> Boucle du Mouhoun 1483 20 0.0135
#> Cascades 667 9 0.0135
#> Centre 1556 21 0.0135
#> Centre-Est 1259 17 0.0135
#> Centre-Nord 1375 19 0.0138
#> Centre-Ouest 1287 17 0.0132
#> Centre-Sud 608 8 0.0132
#> Est 1590 21 0.0132
#> Hauts-Bassins 1483 20 0.0135
#> Nord 1211 16 0.0132
#> Plateau-Central 757 10 0.0132
#> Sahel 902 12 0.0133
#> Sud-Ouest 722 10 0.0139
#> ───── ─── ──────
#> Total 14900 200 0.0134
#>
#> ── Weights ─────────────────────────────────────────────────────────────────────
#> • Range: [72.2, 76]
#> • Mean: 74.5 · CV: 0.02
#> • DEFF: 1 · n_eff: 200Method Reference
Selection Methods
| Method | Sample Size | Replacement | Notes |
|---|---|---|---|
srswor |
Fixed | No | Default, general purpose |
srswr |
Fixed | Yes | Bootstrap, rare populations |
systematic |
Fixed | No | Ordered frames |
bernoulli |
Random | No | Simple field protocols |
pps_brewer |
Fixed | No | Recommended PPS method |
pps_systematic |
Fixed | No | Simple PPS |
pps_cps |
Fixed | No | Maximum entropy, exact joint inclusion probs |
pps_poisson |
Random | No | PPS with random size |
pps_sps |
Fixed | No | Sequential Poisson, supports PRN |
pps_pareto |
Fixed | No | Pareto sampling, supports PRN |
pps_multinomial |
Fixed | Yes | PPS with replacement |
pps_chromy |
Fixed | PMR.1 | Chromy’s sequential PPS |
balanced |
Fixed | No | Cube method, balances on aux variables |
Allocation Methods
| Method | Description | Requirements |
|---|---|---|
| (none) | n per stratum | |
equal |
Same n per stratum | |
proportional |
n ∝ stratum size | |
neyman |
Minimize variance |
variance data frame |
optimal |
Minimize variance/cost |
variance + cost data frames |
power |
Compromise: \(n_h \propto C_h X_h^q\) |
cv + importance (+ optional
power) |
For custom stratum-specific sizes or rates, pass a data frame
directly to n or frac in draw().
When alloc is specified, use n (not
frac).
Best Practices
- Always set a seed for reproducibility
- Use meaningful stage labels for documentation
-
Validate designs with
validate_frame()before execution -
Check weight distributions with
summary()on the sample
design <- sampling_design(title = "National Health Survey 2024") |>
add_stage(label = "Enumeration Areas") |>
stratify_by(region, urban_rural) |>
cluster_by(ea_id) |>
draw(n = 5, method = "pps_brewer", mos = households) |>
add_stage(label = "Households") |>
draw(n = 20)
print(design)
#> ── Sampling Design: National Health Survey 2024 ────────────────────────────────
#>
#> ℹ 2 stages
#>
#> ── Stage 1: Enumeration Areas ──────────────────────────────────────────────────
#> • Strata: region, urban_rural
#> • Cluster: ea_id
#> • Draw: n = 5, method = pps_brewer, mos = households
#>
#> ── Stage 2: Households ─────────────────────────────────────────────────────────
#> • Draw: n = 20, method = srsworFrame Validation
Use validate_frame() to check that a frame has all the
variables a design requires before executing.
# This passes
validate_frame(design, bfa_eas)
# This fails with a clear message
bad_frame <- data.frame(id = 1:100, value = rnorm(100))
validate_frame(design, bad_frame)
#> Error:
#> ! Frame validation failed:
#> ✖ Enumeration Areas: missing stratification variable: "region" and
#> "urban_rural"
#> ✖ Enumeration Areas: missing cluster variable: "ea_id"
#> ✖ Enumeration Areas: missing MOS variable: `households`