cluster_by() specifies the sampling units (PSUs/clusters) for cluster
or multi-stage sampling designs. Unlike stratify_by(), which defines
subgroups to sample within, cluster_by() defines units to sample
as a whole.
Arguments
- .data
A sampling_design object (piped from sampling_design(), stratify_by(), or add_stage()).
- ...
Clustering variable(s) specified as bare column names that identify the sampling units. In most cases this is a single variable (e.g., school_id, household_id).
Details
cluster_by() is purely structural – it defines what to sample, not how.
The selection method and sample size are specified in draw().
Clustering vs. Stratification
- Stratification (stratify_by()): Sample within each group; all groups are represented in the sample
- Clustering (cluster_by()): Sample groups as units; only selected groups appear in the sample
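The contrast can be seen on a toy frame. The following is a minimal base R sketch, not samplyr's implementation; the frame, its column names, and the sample sizes are hypothetical and chosen only for illustration.

```r
# Toy frame: 6 schools (groups) with 5 students each; all names are hypothetical.
set.seed(42)
frame <- data.frame(
  school_id  = rep(1:6, each = 5),
  student_id = 1:30
)

# Stratification: sample 2 students within every school.
# Every school contributes rows to the sample.
rows_by_school <- split(seq_len(nrow(frame)), frame$school_id)
strat_rows <- unlist(lapply(rows_by_school, function(idx) {
  idx[sample.int(length(idx), 2)]
}))
strat_sample <- frame[strat_rows, ]

# Clustering: sample 2 schools as whole units.
# Only the selected schools appear, each with all of its students.
picked_schools <- sample(unique(frame$school_id), 2)
clust_sample <- frame[frame$school_id %in% picked_schools, ]

length(unique(strat_sample$school_id))  # all 6 schools are represented
length(unique(clust_sample$school_id))  # only the 2 selected schools
```

In the stratified sample the number of groups is fixed and the rows per group are sampled; in the cluster sample the groups themselves are what is sampled.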
Multi-Stage Designs
In multi-stage designs, each stage typically has its own clustering variable:
- Stage 1: Select schools (cluster_by(school_id))
- Stage 2: Select classrooms within schools (cluster_by(classroom_id))
- Stage 3: Select students within classrooms (no clustering, sample individuals)
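The three stages above can be emulated by hand to show what each one selects. This is a rough base R sketch under assumed names and sizes (a hypothetical frame with one row per student); it illustrates the nesting only and is not how samplyr executes a design.

```r
set.seed(1)
# Hypothetical frame: 5 schools x 3 classrooms x 8 students, one row per student.
frame <- expand.grid(student = 1:8, classroom_id = 1:3, school_id = 1:5)

# Stage 1: select 2 schools as whole units.
schools <- sample(unique(frame$school_id), 2)
stage1 <- frame[frame$school_id %in% schools, ]

# Stage 2: within each selected school, select 2 classrooms.
stage2 <- do.call(rbind, lapply(split(stage1, stage1$school_id), function(s) {
  rooms <- sample(unique(s$classroom_id), 2)
  s[s$classroom_id %in% rooms, ]
}))

# Stage 3: within each selected classroom, select 4 individual students.
# Grouping uses school and classroom together, because classroom IDs
# repeat across schools in this frame.
groups <- split(stage2, list(stage2$school_id, stage2$classroom_id), drop = TRUE)
stage3 <- do.call(rbind, lapply(groups, function(g) g[sample.int(nrow(g), 4), ]))

nrow(stage3)  # 2 schools x 2 classrooms x 4 students
```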
Child-stage IDs do not need to be globally unique across the entire frame.
For example, classroom_id = 1 can appear in more than one school. At
execution time, samplyr resolves lower-stage clusters using the full
ancestry from earlier stages, so IDs can be unique within parent clusters
rather than globally unique.
In practice, this means the frame must represent a valid hierarchy through
the combination of parent-stage and current-stage IDs. If a lower-stage ID is
only meaningful within a parent, that is supported. If users want a stage's
cluster variable to be globally unique on its own, they should provide a
globally unique ID or include multiple columns in cluster_by().
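A small base R check makes the distinction concrete. The frame and the classroom_uid column below are hypothetical, and the pasted ID is just one possible way to build a globally unique identifier by hand.

```r
# Hypothetical frame: classroom_id restarts at 1 in every school.
frame <- data.frame(
  school_id    = c("A", "A", "B", "B"),
  classroom_id = c(1, 2, 1, 2)
)

# On its own, classroom_id has only 2 distinct values ...
length(unique(frame$classroom_id))

# ... but combined with its parent ID it identifies all 4 classrooms.
nrow(unique(frame[c("school_id", "classroom_id")]))

# One way to derive a globally unique ID column (hypothetical name).
frame$classroom_uid <- paste(frame$school_id, frame$classroom_id, sep = "_")
anyDuplicated(frame$classroom_uid)  # 0 means no duplicates
```

Running a check like this on a real frame before building the design is a cheap way to confirm that the parent and child columns together describe a valid hierarchy.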
Order of Operations
In a single stage, the typical order is:
1. stratify_by() (optional) - define strata
2. cluster_by() (optional) - define sampling units
3. draw() (required) - specify selection parameters
Both stratify_by() and cluster_by() are optional, but draw() is required.
See also
sampling_design() for creating designs,
stratify_by() for stratification,
draw() for specifying selection,
add_stage() for multi-stage designs
Examples
# Simple cluster sample: select 30 EAs
sampling_design() |>
  cluster_by(ea_id) |>
  draw(n = 30) |>
  execute(zwe_eas, seed = 123)
#> # A tbl_sample: 30 × 17
#> # Weights: 3575 [3575, 3575]
#> ea_id province district ward_pcode urban_rural population households
#> * <int> <fct> <fct> <chr> <fct> <int> <int>
#> 1 40006 Harare Epworth ZW192304 Urban 1384 372
#> 2 87920 Harare Harare ZW192137 Urban 780 212
#> 3 33850 Harare Harare … ZW192401 Urban 301 78
#> 4 18135 Manicaland Buhera ZW110122 Rural 98 21
#> 5 66047 Manicaland Buhera ZW110106 Rural 98 23
#> 6 60773 Manicaland Chipinge ZW110303 Urban 78 18
#> 7 12448 Manicaland Makoni ZW110431 Rural 123 25
#> 8 28763 Manicaland Makoni ZW110402 Rural 117 27
#> 9 22483 Mashonaland Cent… Mazowe ZW120426 Rural 86 22
#> 10 21237 Mashonaland Cent… Mount D… ZW120509 Rural 471 126
#> # ℹ 20 more rows
#> # ℹ 10 more variables: buildings <int>, women_15_49 <int>, men_15_49 <int>,
#> # children_under5 <int>, area_km2 <dbl>, .weight <dbl>, .sample_id <int>,
#> # .stage <int>, .weight_1 <dbl>, .fpc_1 <int>
# Stratified cluster sample: 10 EAs per urban/rural
sampling_design() |>
  stratify_by(urban_rural) |>
  cluster_by(ea_id) |>
  draw(n = 10) |>
  execute(zwe_eas, seed = 1)
#> # A tbl_sample: 20 × 17
#> # Weights: 5362.5 [2415.9, 8309.1]
#> ea_id province district ward_pcode urban_rural population households
#> * <int> <fct> <fct> <chr> <fct> <int> <int>
#> 1 85418 Harare Harare ZW192118 Urban 154 54
#> 2 86381 Harare Harare ZW192141 Urban 200 54
#> 3 18031 Manicaland Buhera ZW110120 Rural 119 27
#> 4 60099 Manicaland Chimani… ZW110216 Urban 149 38
#> 5 90121 Manicaland Mutasa ZW110601 Urban 420 110
#> 6 37769 Mashonaland Cen… Mbire ZW120809 Rural 214 53
#> 7 23994 Mashonaland Cen… Mount D… ZW120536 Urban 140 33
#> 8 20542 Mashonaland East Goromon… ZW130204 Urban 275 71
#> 9 32927 Mashonaland East Maronde… ZW130420 Urban 127 33
#> 10 85023 Mashonaland East Maronde… ZW132105 Urban 483 143
#> 11 38359 Mashonaland East Mutoko ZW130718 Rural 80 20
#> 12 13702 Mashonaland West Chegutu ZW140102 Rural 83 22
#> 13 78285 Mashonaland West Sanyati ZW142818 Urban 92 26
#> 14 33156 Mashonaland West Zvimba ZW140635 Urban 277 71
#> 15 29555 Masvingo Bikita ZW180128 Rural 88 19
#> 16 11817 Masvingo Gutu ZW180403 Rural 25 6
#> 17 653 Masvingo Mwenezi ZW180616 Rural 91 18
#> 18 77314 Matabeleland No… Binga ZW150125 Rural 64 15
#> 19 61839 Matabeleland No… Nkayi ZW150502 Rural 69 14
#> 20 103261 Matabeleland So… Gwanda ZW160412 Rural 26 6
#> # ℹ 10 more variables: buildings <int>, women_15_49 <int>, men_15_49 <int>,
#> # children_under5 <int>, area_km2 <dbl>, .weight <dbl>, .sample_id <int>,
#> # .stage <int>, .weight_1 <dbl>, .fpc_1 <int>
# PPS cluster sample using households as measure of size
sampling_design() |>
  cluster_by(ea_id) |>
  draw(n = 50, method = "pps_brewer", mos = households) |>
  execute(zwe_eas, seed = 2026)
#> # A tbl_sample: 50 × 18
#> # Weights: 2143.52 [180.37, 6935.86]
#> ea_id province district ward_pcode urban_rural population households
#> * <int> <fct> <fct> <chr> <fct> <int> <int>
#> 1 296 Bulawayo Bulawayo ZW102103 Urban 164 45
#> 2 2173 Bulawayo Bulawayo ZW102104 Urban 400 120
#> 3 2243 Bulawayo Bulawayo ZW102104 Urban 229 69
#> 4 22871 Bulawayo Bulawayo ZW102109 Urban 585 157
#> 5 23746 Bulawayo Bulawayo ZW102118 Urban 524 146
#> 6 46259 Bulawayo Bulawayo ZW102128 Urban 316 83
#> 7 47397 Bulawayo Bulawayo ZW102120 Urban 124 32
#> 8 85838 Harare Chitungwiza ZW192204 Urban 532 137
#> 9 88065 Harare Epworth ZW192303 Urban 527 139
#> 10 87300 Harare Harare ZW192143 Urban 427 109
#> # ℹ 40 more rows
#> # ℹ 11 more variables: buildings <int>, women_15_49 <int>, men_15_49 <int>,
#> # children_under5 <int>, area_km2 <dbl>, .weight <dbl>, .sample_id <int>,
#> # .stage <int>, .weight_1 <dbl>, .fpc_1 <int>, .certainty_1 <lgl>
# Two-stage cluster sample
zwe_frame <- zwe_eas |>
  dplyr::mutate(district_hh = sum(households), .by = district)
sampling_design() |>
  add_stage(label = "Districts") |>
  cluster_by(district) |>
  draw(n = 20, method = "pps_brewer", mos = district_hh) |>
  add_stage(label = "EAs") |>
  draw(n = 10) |>
  execute(zwe_frame, seed = 1234)
#> # A tbl_sample: 200 × 21
#> # Weights: 528.97 [135.84, 881.23]
#> ea_id province district ward_pcode urban_rural population households
#> * <int> <fct> <fct> <chr> <fct> <int> <int>
#> 1 1311 Bulawayo Bulawayo ZW102105 Urban 109 32
#> 2 48293 Bulawayo Bulawayo ZW102115 Urban 380 97
#> 3 47427 Bulawayo Bulawayo ZW102121 Urban 696 179
#> 4 46195 Bulawayo Bulawayo ZW102128 Urban 691 181
#> 5 46178 Bulawayo Bulawayo ZW102128 Urban 325 85
#> 6 22895 Bulawayo Bulawayo ZW102109 Urban 356 95
#> 7 2192 Bulawayo Bulawayo ZW102104 Urban 412 124
#> 8 8696 Bulawayo Bulawayo ZW102102 Urban 177 48
#> 9 20336 Bulawayo Bulawayo ZW102108 Urban 538 138
#> 10 2294 Bulawayo Bulawayo ZW102104 Urban 88 27
#> # ℹ 190 more rows
#> # ℹ 14 more variables: buildings <int>, women_15_49 <int>, men_15_49 <int>,
#> # children_under5 <int>, area_km2 <dbl>, district_hh <int>, .weight <dbl>,
#> # .sample_id <int>, .stage <int>, .weight_2 <dbl>, .fpc_2 <int>,
#> # .weight_1 <dbl>, .fpc_1 <int>, .certainty_1 <lgl>