
cluster_by() specifies the sampling units (PSUs/clusters) for cluster or multi-stage sampling designs. Unlike stratify_by(), which defines subgroups to sample within, cluster_by() defines units to sample as a whole.

Usage

cluster_by(.data, ...)

Arguments

.data

A sampling_design object (piped from sampling_design(), stratify_by(), or add_stage()).

...

Clustering variable(s) specified as bare column names that identify the sampling units. In most cases this is a single variable (e.g., school_id, household_id).

Value

A modified sampling_design object with clustering specified.

Details

cluster_by() is purely structural – it defines what to sample, not how. The selection method and sample size are specified in draw().

Clustering vs. Stratification

  • Stratification (stratify_by()): Sample within each group; all groups represented in the sample

  • Clustering (cluster_by()): Sample groups as units; only selected groups appear in sample
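The contrast above can be sketched in base R on a toy frame. This is only an illustration of the two ideas, not how samplyr implements them; the frame and its column names (school_id, student_id) are made up for the example.

```r
# Toy frame (hypothetical): 6 schools with 4 students each.
set.seed(42)
frame <- data.frame(
  school_id  = rep(1:6, each = 4),
  student_id = 1:24
)

# Stratification: draw 2 students within every school.
stratified <- do.call(rbind, lapply(split(frame, frame$school_id), function(g) {
  g[sample(nrow(g), 2), ]
}))

# Clustering: draw 2 schools and keep all of their students.
chosen_schools <- sample(unique(frame$school_id), 2)
clustered <- frame[frame$school_id %in% chosen_schools, ]

length(unique(stratified$school_id))  # 6: every school is represented
length(unique(clustered$school_id))   # 2: only the selected schools appear
```

Both samples are drawn from the same frame; the only difference is whether the grouping variable defines where to sample (strata) or what to sample (clusters).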

Multi-Stage Designs

In multi-stage designs, each stage typically has its own clustering variable:

  • Stage 1: Select schools (cluster_by(school_id))

  • Stage 2: Select classrooms within schools (cluster_by(classroom_id))

  • Stage 3: Select students within classrooms (no clustering, sample individuals)
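To make the nesting concrete, here is a minimal base R sketch of the three stages on a made-up frame. It assumes simple random selection at every stage and uses hypothetical column names; it is meant to show what each stage selects, not to reproduce samplyr's internals or its weights.

```r
# Hypothetical frame: 4 schools x 3 classrooms x 5 students, one row per student.
set.seed(1)
frame <- expand.grid(student = 1:5, classroom_id = 1:3, school_id = 1:4)

# Stage 1: select 2 schools.
schools <- sample(unique(frame$school_id), 2)
s1 <- frame[frame$school_id %in% schools, ]

# Stage 2: select 2 classrooms within each selected school.
s2 <- do.call(rbind, lapply(split(s1, s1$school_id), function(g) {
  g[g$classroom_id %in% sample(unique(g$classroom_id), 2), ]
}))

# Stage 3: select 3 students within each selected classroom
# (no clustering variable; individual rows are sampled).
groups <- split(s2, list(s2$school_id, s2$classroom_id), drop = TRUE)
s3 <- do.call(rbind, lapply(groups, function(g) g[sample(nrow(g), 3), ]))

nrow(s3)  # 12 = 2 schools x 2 classrooms x 3 students
```

Each stage only sees the units that survived the previous one, which is why the final stage needs no clustering variable of its own.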

Child-stage IDs do not need to be globally unique across the entire frame. For example, classroom_id = 1 can appear in more than one school. At execution time, samplyr resolves lower-stage clusters using the full ancestry from earlier stages, so IDs can be unique within parent clusters rather than globally unique.

In practice, this means the frame must represent a valid hierarchy through the combination of parent-stage and current-stage IDs. A lower-stage ID that is only meaningful within its parent is supported. Users who want a stage's cluster variable to be globally unique on its own should either provide a globally unique ID or include multiple columns in cluster_by().
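The following base R sketch illustrates the difference between an ID that is unique within its parent and one that is globally unique, and one way to build the latter. The frame and the classroom_uid column are hypothetical names chosen for this example.

```r
# Hypothetical frame: classroom_id restarts at 1 within each school.
frame <- data.frame(
  school_id    = c("A", "A", "A", "B", "B", "B"),
  classroom_id = c(1, 1, 2, 1, 2, 2)
)

# On its own, classroom_id takes only 2 distinct values...
length(unique(frame$classroom_id))  # 2

# ...while the (school_id, classroom_id) pair identifies 4 classrooms.
nrow(unique(frame[, c("school_id", "classroom_id")]))  # 4

# A globally unique ID built from the ancestry, usable as a single column.
frame$classroom_uid <- paste(frame$school_id, frame$classroom_id, sep = "_")
unique(frame$classroom_uid)  # "A_1" "A_2" "B_1" "B_2"
```

Either form describes the same hierarchy; a precomputed column such as classroom_uid is mainly a convenience when the ID has to stand alone outside the design.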

Order of Operations

In a single stage, the typical order is:

  1. stratify_by() (optional) - define strata

  2. cluster_by() (optional) - define sampling units

  3. draw() (required) - specify selection parameters

Both stratify_by() and cluster_by() are optional, but draw() is required.

See also

sampling_design() for creating designs, stratify_by() for stratification, draw() for specifying selection, add_stage() for multi-stage designs

Examples

# Simple cluster sample: select 30 EAs
sampling_design() |>
  cluster_by(ea_id) |>
  draw(n = 30) |>
  execute(zwe_eas, seed = 123)
#> # A tbl_sample: 30 × 17
#> # Weights:      3575 [3575, 3575]
#>    ea_id province          district ward_pcode urban_rural population households
#>  * <int> <fct>             <fct>    <chr>      <fct>            <int>      <int>
#>  1 40006 Harare            Epworth  ZW192304   Urban             1384        372
#>  2 87920 Harare            Harare   ZW192137   Urban              780        212
#>  3 33850 Harare            Harare … ZW192401   Urban              301         78
#>  4 18135 Manicaland        Buhera   ZW110122   Rural               98         21
#>  5 66047 Manicaland        Buhera   ZW110106   Rural               98         23
#>  6 60773 Manicaland        Chipinge ZW110303   Urban               78         18
#>  7 12448 Manicaland        Makoni   ZW110431   Rural              123         25
#>  8 28763 Manicaland        Makoni   ZW110402   Rural              117         27
#>  9 22483 Mashonaland Cent… Mazowe   ZW120426   Rural               86         22
#> 10 21237 Mashonaland Cent… Mount D… ZW120509   Rural              471        126
#> # ℹ 20 more rows
#> # ℹ 10 more variables: buildings <int>, women_15_49 <int>, men_15_49 <int>,
#> #   children_under5 <int>, area_km2 <dbl>, .weight <dbl>, .sample_id <int>,
#> #   .stage <int>, .weight_1 <dbl>, .fpc_1 <int>

# Stratified cluster sample: 10 EAs in each urban/rural stratum
sampling_design() |>
  stratify_by(urban_rural) |>
  cluster_by(ea_id) |>
  draw(n = 10) |>
  execute(zwe_eas, seed = 1)
#> # A tbl_sample: 20 × 17
#> # Weights:      5362.5 [2415.9, 8309.1]
#>     ea_id province         district ward_pcode urban_rural population households
#>  *  <int> <fct>            <fct>    <chr>      <fct>            <int>      <int>
#>  1  85418 Harare           Harare   ZW192118   Urban              154         54
#>  2  86381 Harare           Harare   ZW192141   Urban              200         54
#>  3  18031 Manicaland       Buhera   ZW110120   Rural              119         27
#>  4  60099 Manicaland       Chimani… ZW110216   Urban              149         38
#>  5  90121 Manicaland       Mutasa   ZW110601   Urban              420        110
#>  6  37769 Mashonaland Cen… Mbire    ZW120809   Rural              214         53
#>  7  23994 Mashonaland Cen… Mount D… ZW120536   Urban              140         33
#>  8  20542 Mashonaland East Goromon… ZW130204   Urban              275         71
#>  9  32927 Mashonaland East Maronde… ZW130420   Urban              127         33
#> 10  85023 Mashonaland East Maronde… ZW132105   Urban              483        143
#> 11  38359 Mashonaland East Mutoko   ZW130718   Rural               80         20
#> 12  13702 Mashonaland West Chegutu  ZW140102   Rural               83         22
#> 13  78285 Mashonaland West Sanyati  ZW142818   Urban               92         26
#> 14  33156 Mashonaland West Zvimba   ZW140635   Urban              277         71
#> 15  29555 Masvingo         Bikita   ZW180128   Rural               88         19
#> 16  11817 Masvingo         Gutu     ZW180403   Rural               25          6
#> 17    653 Masvingo         Mwenezi  ZW180616   Rural               91         18
#> 18  77314 Matabeleland No… Binga    ZW150125   Rural               64         15
#> 19  61839 Matabeleland No… Nkayi    ZW150502   Rural               69         14
#> 20 103261 Matabeleland So… Gwanda   ZW160412   Rural               26          6
#> # ℹ 10 more variables: buildings <int>, women_15_49 <int>, men_15_49 <int>,
#> #   children_under5 <int>, area_km2 <dbl>, .weight <dbl>, .sample_id <int>,
#> #   .stage <int>, .weight_1 <dbl>, .fpc_1 <int>

# PPS cluster sample using households as measure of size
sampling_design() |>
  cluster_by(ea_id) |>
  draw(n = 50, method = "pps_brewer", mos = households) |>
  execute(zwe_eas, seed = 2026)
#> # A tbl_sample: 50 × 18
#> # Weights:      2143.52 [180.37, 6935.86]
#>    ea_id province district    ward_pcode urban_rural population households
#>  * <int> <fct>    <fct>       <chr>      <fct>            <int>      <int>
#>  1   296 Bulawayo Bulawayo    ZW102103   Urban              164         45
#>  2  2173 Bulawayo Bulawayo    ZW102104   Urban              400        120
#>  3  2243 Bulawayo Bulawayo    ZW102104   Urban              229         69
#>  4 22871 Bulawayo Bulawayo    ZW102109   Urban              585        157
#>  5 23746 Bulawayo Bulawayo    ZW102118   Urban              524        146
#>  6 46259 Bulawayo Bulawayo    ZW102128   Urban              316         83
#>  7 47397 Bulawayo Bulawayo    ZW102120   Urban              124         32
#>  8 85838 Harare   Chitungwiza ZW192204   Urban              532        137
#>  9 88065 Harare   Epworth     ZW192303   Urban              527        139
#> 10 87300 Harare   Harare      ZW192143   Urban              427        109
#> # ℹ 40 more rows
#> # ℹ 11 more variables: buildings <int>, women_15_49 <int>, men_15_49 <int>,
#> #   children_under5 <int>, area_km2 <dbl>, .weight <dbl>, .sample_id <int>,
#> #   .stage <int>, .weight_1 <dbl>, .fpc_1 <int>, .certainty_1 <lgl>

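# Two-stage cluster sample: 20 districts by PPS, then 10 EAs within each
# district_hh (total households per district) is the stage-1 measure of size
zwe_frame <- zwe_eas |>
  dplyr::mutate(district_hh = sum(households), .by = district)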

sampling_design() |>
  add_stage(label = "Districts") |>
    cluster_by(district) |>
    draw(n = 20, method = "pps_brewer", mos = district_hh) |>
  add_stage(label = "EAs") |>
    draw(n = 10) |>
  execute(zwe_frame, seed = 1234)
#> # A tbl_sample: 200 × 21
#> # Weights:      528.97 [135.84, 881.23]
#>    ea_id province district ward_pcode urban_rural population households
#>  * <int> <fct>    <fct>    <chr>      <fct>            <int>      <int>
#>  1  1311 Bulawayo Bulawayo ZW102105   Urban              109         32
#>  2 48293 Bulawayo Bulawayo ZW102115   Urban              380         97
#>  3 47427 Bulawayo Bulawayo ZW102121   Urban              696        179
#>  4 46195 Bulawayo Bulawayo ZW102128   Urban              691        181
#>  5 46178 Bulawayo Bulawayo ZW102128   Urban              325         85
#>  6 22895 Bulawayo Bulawayo ZW102109   Urban              356         95
#>  7  2192 Bulawayo Bulawayo ZW102104   Urban              412        124
#>  8  8696 Bulawayo Bulawayo ZW102102   Urban              177         48
#>  9 20336 Bulawayo Bulawayo ZW102108   Urban              538        138
#> 10  2294 Bulawayo Bulawayo ZW102104   Urban               88         27
#> # ℹ 190 more rows
#> # ℹ 14 more variables: buildings <int>, women_15_49 <int>, men_15_49 <int>,
#> #   children_under5 <int>, area_km2 <dbl>, district_hh <int>, .weight <dbl>,
#> #   .sample_id <int>, .stage <int>, .weight_2 <dbl>, .fpc_2 <int>,
#> #   .weight_1 <dbl>, .fpc_1 <int>, .certainty_1 <lgl>