cluster_by() specifies the sampling units (PSUs/clusters) for cluster
or multi-stage sampling designs. Unlike stratify_by(), which defines
subgroups to sample within, cluster_by() defines units to sample
as a whole.
Arguments
- .data
A sampling_design object (piped from sampling_design(), stratify_by(), or add_stage()).
- ...
Clustering variable(s) specified as bare column names that identify the sampling units. In most cases this is a single variable (e.g., school_id, household_id).
Details
cluster_by() is purely structural – it defines what to sample, not how.
The selection method and sample size are specified in draw().
Cluster vs. Stratification
- Stratification (stratify_by()): Sample within each group; all groups are represented in the sample.
- Clustering (cluster_by()): Sample groups as units; only selected groups appear in the sample.
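The contrast can be seen without the package at all. The following is a minimal base R sketch on a made-up frame (the `frame` data, its column names, and the sample sizes are illustrative assumptions, and this is not how cluster_by() or stratify_by() are implemented):

```r
# Toy frame: 6 schools with 4 students each (hypothetical data).
set.seed(42)
frame <- data.frame(
  school_id  = rep(paste0("S", 1:6), each = 4),
  student_id = 1:24
)

# Stratification: draw 2 students within every school.
strat_rows <- unlist(lapply(
  split(seq_len(nrow(frame)), frame$school_id),
  function(idx) idx[sample.int(length(idx), 2)]
))
strat_sample <- frame[strat_rows, ]

# Clustering: draw 2 schools, then keep every student in them.
picked_schools <- sample(unique(frame$school_id), 2)
cluster_sample <- frame[frame$school_id %in% picked_schools, ]

length(unique(strat_sample$school_id))    # 6: every school is represented
length(unique(cluster_sample$school_id))  # 2: only the selected schools
nrow(cluster_sample)                      # 8: whole clusters are kept
```

Under stratification the groups fix where sampling happens; under clustering the groups are themselves the things being drawn.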
Multi-Stage Designs
In multi-stage designs, each stage typically has its own clustering variable:
- Stage 1: Select schools (cluster_by(school_id))
- Stage 2: Select classrooms within schools (cluster_by(classroom_id))
- Stage 3: Select students within classrooms (no clustering, sample individuals)
The nesting structure (classrooms within schools) is validated at execution time.
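To make the idea of nesting concrete, here is a small base R sketch of one way such a check could be written. The helper `is_nested()` and both data frames are hypothetical; the package's actual validation may work differently.

```r
# Hypothetical helper: TRUE when every `inner` value belongs to exactly
# one `outer` value (e.g. each classroom sits in a single school).
is_nested <- function(data, inner, outer) {
  pairs <- unique(data[, c(inner, outer)])
  !anyDuplicated(pairs[[inner]])
}

good <- data.frame(
  school_id    = c("S1", "S1", "S2", "S2"),
  classroom_id = c("S1-A", "S1-B", "S2-A", "S2-B")
)
bad <- data.frame(
  school_id    = c("S1", "S1", "S2", "S2"),
  classroom_id = c("A", "B", "A", "B")  # "A" and "B" are reused across schools
)

is_nested(good, "classroom_id", "school_id")  # TRUE
is_nested(bad,  "classroom_id", "school_id")  # FALSE
```

When lower-level identifiers are reused across higher-level units, as in `bad`, combining the two columns into a single identifier before building the design is one way to restore a clean nesting.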
Order of Operations
In a single stage, the typical order is:
1. stratify_by() (optional) - define strata
2. cluster_by() (optional) - define sampling units
3. draw() (required) - specify selection parameters
Both stratify_by() and cluster_by() are optional, but draw() is required.
See also
sampling_design() for creating designs,
stratify_by() for stratification,
draw() for specifying selection,
add_stage() for multi-stage designs
Examples
# Simple cluster sample: select 30 EAs
sampling_design() |>
  cluster_by(ea_id) |>
  draw(n = 30) |>
  execute(zwe_eas, seed = 123)
#> # A tbl_sample: 30 × 12
#> # Weights: 753.33 [753.33, 753.33]
#> ea_id province district urban_rural population households area_km2 .weight
#> * <chr> <fct> <fct> <fct> <int> <int> <dbl> <dbl>
#> 1 EA_01614 Harare Harare Urban 930 257 0.46 753.
#> 2 EA_01842 Harare Harare … Urban 1973 601 0.93 753.
#> 3 EA_02757 Manical… Chipinge Rural 574 130 32.6 753.
#> 4 EA_02888 Manical… Chipinge Rural 686 167 3.81 753.
#> 5 EA_02986 Manical… Chipinge Urban 536 147 1.05 753.
#> 6 EA_03371 Manical… Makoni Rural 653 146 15.8 753.
#> 7 EA_04761 Manical… Mutasa Rural 571 134 8.04 753.
#> 8 EA_06170 Mashona… Mazowe Urban 603 176 0.57 753.
#> 9 EA_06746 Mashona… Mount D… Rural 721 171 5.16 753.
#> 10 EA_09097 Mashona… Murehwa Rural 590 146 9.22 753.
#> # ℹ 20 more rows
#> # ℹ 4 more variables: .sample_id <int>, .stage <int>, .weight_1 <dbl>,
#> # .fpc_1 <int>
# Stratified cluster sample: 10 EAs per urban/rural
sampling_design() |>
  stratify_by(urban_rural) |>
  cluster_by(ea_id) |>
  draw(n = 10) |>
  execute(zwe_eas, seed = 1)
#> # A tbl_sample: 20 × 12
#> # Weights: 1130 [622, 1638]
#> ea_id province district urban_rural population households area_km2 .weight
#> * <chr> <fct> <fct> <fct> <int> <int> <dbl> <dbl>
#> 1 EA_00286 Bulawayo Bulawayo Urban 1293 376 0.46 622
#> 2 EA_01042 Harare Harare Urban 1405 384 1.36 622
#> 3 EA_01566 Harare Harare Urban 1346 375 0.43 622
#> 4 EA_03285 Manical… Makoni Urban 769 220 3.43 622
#> 5 EA_03508 Manical… Makoni Rural 741 185 6.5 1638
#> 6 EA_04139 Manical… Mutare Rural 641 149 6.08 1638
#> 7 EA_04232 Manical… Mutare Urban 94 27 0.45 622
#> 8 EA_10793 Mashona… Hurungwe Urban 714 197 3.39 622
#> 9 EA_12890 Mashona… Zvimba Urban 362 105 1.78 622
#> 10 EA_13183 Masvingo Bikita Rural 585 133 6.65 1638
#> 11 EA_13505 Masvingo Chiredzi Rural 618 146 14.4 1638
#> 12 EA_14060 Masvingo Chivi Urban 388 104 2.02 622
#> 13 EA_15915 Masvingo Zaka Urban 932 269 3.2 622
#> 14 EA_16688 Matabel… Hwange Rural 499 119 29.1 1638
#> 15 EA_17282 Matabel… Nkayi Rural 495 121 12.7 1638
#> 16 EA_17444 Matabel… Tsholot… Rural 594 135 15.2 1638
#> 17 EA_18072 Matabel… Beitbri… Urban 485 132 0.83 622
#> 18 EA_19803 Midlands Gokwe N… Rural 660 153 21.4 1638
#> 19 EA_21928 Midlands Mbereng… Rural 574 130 9.07 1638
#> 20 EA_22395 Midlands Shurugw… Rural 538 116 7.41 1638
#> # ℹ 4 more variables: .sample_id <int>, .stage <int>, .weight_1 <dbl>,
#> # .fpc_1 <int>
# PPS cluster sample using households as measure of size
sampling_design() |>
  cluster_by(ea_id) |>
  draw(n = 50, method = "pps_brewer", mos = households) |>
  execute(zwe_eas, seed = 2026)
#> # A tbl_sample: 50 × 13
#> # Weights: 409.36 [136.73, 836.89]
#> ea_id province district urban_rural population households area_km2 .weight
#> * <chr> <fct> <fct> <fct> <int> <int> <dbl> <dbl>
#> 1 EA_00014 Bulawayo Bulawayo Rural 1392 335 31.6 227.
#> 2 EA_00059 Bulawayo Bulawayo Urban 1157 317 1.47 240.
#> 3 EA_00064 Bulawayo Bulawayo Urban 1202 341 2.27 223.
#> 4 EA_00173 Bulawayo Bulawayo Urban 1323 386 0.88 197.
#> 5 EA_00185 Bulawayo Bulawayo Urban 1263 367 0.51 208.
#> 6 EA_00269 Bulawayo Bulawayo Urban 1402 392 0.42 194.
#> 7 EA_00376 Bulawayo Bulawayo Urban 1187 341 0.36 223.
#> 8 EA_00638 Harare Epworth Urban 1947 540 0.27 141.
#> 9 EA_00841 Harare Harare Urban 1090 334 0.6 228.
#> 10 EA_01410 Harare Harare Urban 927 274 0.15 278.
#> # ℹ 40 more rows
#> # ℹ 5 more variables: .sample_id <int>, .stage <int>, .weight_1 <dbl>,
#> # .fpc_1 <int>, .certainty_1 <lgl>
# Two-stage cluster sample
zwe_frame <- zwe_eas |>
  dplyr::mutate(district_hh = sum(households), .by = district)
sampling_design() |>
  add_stage(label = "Districts") |>
  cluster_by(district) |>
  draw(n = 20, method = "pps_brewer", mos = district_hh) |>
  add_stage(label = "EAs") |>
  draw(n = 10) |>
  execute(zwe_frame, seed = 1234)
#> # A tbl_sample: 200 × 16
#> # Weights: 113.07 [44.69, 153.93]
#> ea_id province district urban_rural population households area_km2
#> * <chr> <fct> <fct> <fct> <int> <int> <dbl>
#> 1 EA_00372 Bulawayo Bulawayo Urban 1515 441 0.34
#> 2 EA_00270 Bulawayo Bulawayo Rural 299 67 2.26
#> 3 EA_00382 Bulawayo Bulawayo Urban 1475 426 1.04
#> 4 EA_00184 Bulawayo Bulawayo Urban 1332 392 0.34
#> 5 EA_00062 Bulawayo Bulawayo Urban 1022 282 1.85
#> 6 EA_00004 Bulawayo Bulawayo Urban 1328 383 0.36
#> 7 EA_00388 Bulawayo Bulawayo Urban 1322 365 0.7
#> 8 EA_00149 Bulawayo Bulawayo Urban 1390 383 0.49
#> 9 EA_00040 Bulawayo Bulawayo Urban 1343 377 0.96
#> 10 EA_00212 Bulawayo Bulawayo Urban 1423 397 0.43
#> # ℹ 190 more rows
#> # ℹ 9 more variables: district_hh <int>, .weight <dbl>, .sample_id <int>,
#> # .stage <int>, .weight_2 <dbl>, .fpc_2 <int>, .weight_1 <dbl>, .fpc_1 <int>,
#> # .certainty_1 <lgl>
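As a rough cross-check on the first two examples: if clusters are selected with equal probability, each weight should equal the number of frame units divided by the number sampled. That is an assumption about the default selection method, and the frame counts below are back-calculated from the rounded printed weights rather than read from zwe_eas.

```r
# Simple cluster sample: 30 EAs, each with weight 753.33 (as printed).
n_simple  <- 30
w_simple  <- 753.33
N_implied <- w_simple * n_simple      # roughly 22,600 EAs in the frame

# Stratified cluster sample: 10 EAs per stratum, weights 622 and 1638.
n_stratum <- 10
N_urban   <- 622 * n_stratum          # implied number of urban EAs
N_rural   <- 1638 * n_stratum         # implied number of rural EAs

# Both designs should imply the same frame size, up to rounding.
all.equal(N_implied, N_urban + N_rural, tolerance = 1e-4)
```

The two designs agree on a frame of about 22,600 EAs, which suggests the printed weights are plain inverse selection probabilities in both cases.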