
cluster_by() specifies the sampling units (primary sampling units, or clusters) for cluster and multi-stage sampling designs. Unlike stratify_by(), which defines subgroups to sample within, cluster_by() defines groups that are selected as whole units.

Usage

cluster_by(.data, ...)

Arguments

.data

A sampling_design object (piped from sampling_design(), stratify_by(), or add_stage()).

...

Clustering variable(s) specified as bare column names that identify the sampling units. In most cases this is a single variable (e.g., school_id, household_id).

Value

A modified sampling_design object with clustering specified.

Details

cluster_by() is purely structural: it defines what to sample, not how. The selection method and sample size are specified in draw().

Clustering vs. Stratification

  • Stratification (stratify_by()): Sample within each group; every group is represented in the sample

  • Clustering (cluster_by()): Sample groups as units; only the selected groups appear in the sample
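
The contrast above can be illustrated with a small sketch in plain base R. It does not use this package's API; the toy frame, its column names, and the sample sizes are all hypothetical and chosen only to make the difference visible.

```r
set.seed(42)

# Toy frame (hypothetical): 6 schools with 4 students each
frame <- data.frame(
  school_id  = rep(paste0("S", 1:6), each = 4),
  student_id = 1:24
)

# Stratification: draw 2 students within every school
stratified <- do.call(rbind, lapply(split(frame, frame$school_id), function(g) {
  g[sample(nrow(g), 2), ]
}))

# Clustering: draw 2 schools and keep every row of the selected schools
selected_schools <- sample(unique(frame$school_id), 2)
clustered <- frame[frame$school_id %in% selected_schools, ]

length(unique(stratified$school_id))  # 6: every school is represented
length(unique(clustered$school_id))   # 2: only the selected schools appear
```

The stratified sample touches all six schools but only some students in each, while the cluster sample contains every student from just two schools.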

Multi-Stage Designs

In multi-stage designs, each stage typically has its own clustering variable:

  • Stage 1: Select schools (cluster_by(school_id))

  • Stage 2: Select classrooms within schools (cluster_by(classroom_id))

  • Stage 3: Select students within classrooms (no clustering, sample individuals)

The nesting structure (classrooms within schools) is validated at execution time.
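
As a rough illustration of what a nesting check involves, the base R sketch below flags classrooms that appear under more than one school. It is not the package's own validation code, and the toy frame is hypothetical; it may still be useful for inspecting a frame before executing a design.

```r
# Toy frame (hypothetical): classroom "C3" is wrongly listed under two schools
frame <- data.frame(
  school_id    = c("S1", "S1", "S2", "S2", "S1"),
  classroom_id = c("C1", "C2", "C3", "C4", "C3")
)

# Count the distinct schools each classroom appears in
schools_per_classroom <- tapply(
  frame$school_id, frame$classroom_id,
  function(x) length(unique(x))
)

# Classrooms that break the nesting (they appear in more than one school)
not_nested <- names(schools_per_classroom)[schools_per_classroom > 1]
not_nested  # "C3"
```

An empty result would indicate that every classroom belongs to exactly one school, which is the structure a multi-stage design expects.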

Order of Operations

In a single stage, the typical order is:

  1. stratify_by() (optional) - define strata

  2. cluster_by() (optional) - define sampling units

  3. draw() (required) - specify selection parameters

stratify_by() and cluster_by() are both optional; draw() is required.

See also

sampling_design() for creating designs, stratify_by() for stratification, draw() for specifying selection, and add_stage() for multi-stage designs.

Examples

# Simple cluster sample: select 30 EAs
sampling_design() |>
  cluster_by(ea_id) |>
  draw(n = 30) |>
  execute(zwe_eas, seed = 123)
#> # A tbl_sample: 30 × 12
#> # Weights:      753.33 [753.33, 753.33]
#>    ea_id    province district urban_rural population households area_km2 .weight
#>  * <chr>    <fct>    <fct>    <fct>            <int>      <int>    <dbl>   <dbl>
#>  1 EA_01614 Harare   Harare   Urban              930        257     0.46    753.
#>  2 EA_01842 Harare   Harare … Urban             1973        601     0.93    753.
#>  3 EA_02757 Manical… Chipinge Rural              574        130    32.6     753.
#>  4 EA_02888 Manical… Chipinge Rural              686        167     3.81    753.
#>  5 EA_02986 Manical… Chipinge Urban              536        147     1.05    753.
#>  6 EA_03371 Manical… Makoni   Rural              653        146    15.8     753.
#>  7 EA_04761 Manical… Mutasa   Rural              571        134     8.04    753.
#>  8 EA_06170 Mashona… Mazowe   Urban              603        176     0.57    753.
#>  9 EA_06746 Mashona… Mount D… Rural              721        171     5.16    753.
#> 10 EA_09097 Mashona… Murehwa  Rural              590        146     9.22    753.
#> # ℹ 20 more rows
#> # ℹ 4 more variables: .sample_id <int>, .stage <int>, .weight_1 <dbl>,
#> #   .fpc_1 <int>

# Stratified cluster sample: 10 EAs per urban/rural stratum
sampling_design() |>
  stratify_by(urban_rural) |>
  cluster_by(ea_id) |>
  draw(n = 10) |>
  execute(zwe_eas, seed = 1)
#> # A tbl_sample: 20 × 12
#> # Weights:      1130 [622, 1638]
#>    ea_id    province district urban_rural population households area_km2 .weight
#>  * <chr>    <fct>    <fct>    <fct>            <int>      <int>    <dbl>   <dbl>
#>  1 EA_00286 Bulawayo Bulawayo Urban             1293        376     0.46     622
#>  2 EA_01042 Harare   Harare   Urban             1405        384     1.36     622
#>  3 EA_01566 Harare   Harare   Urban             1346        375     0.43     622
#>  4 EA_03285 Manical… Makoni   Urban              769        220     3.43     622
#>  5 EA_03508 Manical… Makoni   Rural              741        185     6.5     1638
#>  6 EA_04139 Manical… Mutare   Rural              641        149     6.08    1638
#>  7 EA_04232 Manical… Mutare   Urban               94         27     0.45     622
#>  8 EA_10793 Mashona… Hurungwe Urban              714        197     3.39     622
#>  9 EA_12890 Mashona… Zvimba   Urban              362        105     1.78     622
#> 10 EA_13183 Masvingo Bikita   Rural              585        133     6.65    1638
#> 11 EA_13505 Masvingo Chiredzi Rural              618        146    14.4     1638
#> 12 EA_14060 Masvingo Chivi    Urban              388        104     2.02     622
#> 13 EA_15915 Masvingo Zaka     Urban              932        269     3.2      622
#> 14 EA_16688 Matabel… Hwange   Rural              499        119    29.1     1638
#> 15 EA_17282 Matabel… Nkayi    Rural              495        121    12.7     1638
#> 16 EA_17444 Matabel… Tsholot… Rural              594        135    15.2     1638
#> 17 EA_18072 Matabel… Beitbri… Urban              485        132     0.83     622
#> 18 EA_19803 Midlands Gokwe N… Rural              660        153    21.4     1638
#> 19 EA_21928 Midlands Mbereng… Rural              574        130     9.07    1638
#> 20 EA_22395 Midlands Shurugw… Rural              538        116     7.41    1638
#> # ℹ 4 more variables: .sample_id <int>, .stage <int>, .weight_1 <dbl>,
#> #   .fpc_1 <int>

# PPS cluster sample using households as measure of size
sampling_design() |>
  cluster_by(ea_id) |>
  draw(n = 50, method = "pps_brewer", mos = households) |>
  execute(zwe_eas, seed = 2026)
#> # A tbl_sample: 50 × 13
#> # Weights:      409.36 [136.73, 836.89]
#>    ea_id    province district urban_rural population households area_km2 .weight
#>  * <chr>    <fct>    <fct>    <fct>            <int>      <int>    <dbl>   <dbl>
#>  1 EA_00014 Bulawayo Bulawayo Rural             1392        335    31.6     227.
#>  2 EA_00059 Bulawayo Bulawayo Urban             1157        317     1.47    240.
#>  3 EA_00064 Bulawayo Bulawayo Urban             1202        341     2.27    223.
#>  4 EA_00173 Bulawayo Bulawayo Urban             1323        386     0.88    197.
#>  5 EA_00185 Bulawayo Bulawayo Urban             1263        367     0.51    208.
#>  6 EA_00269 Bulawayo Bulawayo Urban             1402        392     0.42    194.
#>  7 EA_00376 Bulawayo Bulawayo Urban             1187        341     0.36    223.
#>  8 EA_00638 Harare   Epworth  Urban             1947        540     0.27    141.
#>  9 EA_00841 Harare   Harare   Urban             1090        334     0.6     228.
#> 10 EA_01410 Harare   Harare   Urban              927        274     0.15    278.
#> # ℹ 40 more rows
#> # ℹ 5 more variables: .sample_id <int>, .stage <int>, .weight_1 <dbl>,
#> #   .fpc_1 <int>, .certainty_1 <lgl>

# Two-stage cluster sample
zwe_frame <- zwe_eas |>
  dplyr::mutate(district_hh = sum(households), .by = district)

sampling_design() |>
  add_stage(label = "Districts") |>
    cluster_by(district) |>
    draw(n = 20, method = "pps_brewer", mos = district_hh) |>
  add_stage(label = "EAs") |>
    draw(n = 10) |>
  execute(zwe_frame, seed = 1234)
#> # A tbl_sample: 200 × 16
#> # Weights:      113.07 [44.69, 153.93]
#>    ea_id    province district urban_rural population households area_km2
#>  * <chr>    <fct>    <fct>    <fct>            <int>      <int>    <dbl>
#>  1 EA_00372 Bulawayo Bulawayo Urban             1515        441     0.34
#>  2 EA_00270 Bulawayo Bulawayo Rural              299         67     2.26
#>  3 EA_00382 Bulawayo Bulawayo Urban             1475        426     1.04
#>  4 EA_00184 Bulawayo Bulawayo Urban             1332        392     0.34
#>  5 EA_00062 Bulawayo Bulawayo Urban             1022        282     1.85
#>  6 EA_00004 Bulawayo Bulawayo Urban             1328        383     0.36
#>  7 EA_00388 Bulawayo Bulawayo Urban             1322        365     0.7 
#>  8 EA_00149 Bulawayo Bulawayo Urban             1390        383     0.49
#>  9 EA_00040 Bulawayo Bulawayo Urban             1343        377     0.96
#> 10 EA_00212 Bulawayo Bulawayo Urban             1423        397     0.43
#> # ℹ 190 more rows
#> # ℹ 9 more variables: district_hh <int>, .weight <dbl>, .sample_id <int>,
#> #   .stage <int>, .weight_2 <dbl>, .fpc_2 <int>, .weight_1 <dbl>, .fpc_1 <int>,
#> #   .certainty_1 <lgl>
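
The weights printed for the first two examples can be cross-checked with a little arithmetic. The sketch below assumes that draw() without a method argument selects units with equal probability, so that each weight equals N / n. The frame sizes it derives are inferred from the printed weights, not read from zwe_eas.

```r
# Weights printed by the stratified example (10 EAs drawn per stratum)
n_per_stratum <- 10
stratum_weights <- c(Urban = 622, Rural = 1638)

# Under equal-probability selection, weight = N / n, so N = weight * n
implied_N <- stratum_weights * n_per_stratum
implied_total <- sum(implied_N)

# Weight implied for the unstratified example (30 EAs from the whole frame)
round(implied_total / 30, 2)  # 753.33, matching the first example's weight
```

Both examples therefore point to the same implied frame size, which is what one would expect if they draw from the same frame under equal-probability selection.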