cluster_by() specifies the sampling units (PSUs/clusters) for cluster or multi-stage sampling designs. Unlike stratify_by(), which defines subgroups to sample within, cluster_by() defines units to sample as a whole.

cluster_by(.data, ...)

Arguments

.data

A sampling_design object (piped from sampling_design(), stratify_by(), or stage()).

...

<tidy-select> Clustering variable(s) that identify the sampling units. In most cases this is a single variable (e.g., school_id, household_id).

Value

A modified sampling_design object with clustering specified.

Details

cluster_by() is purely structural: it defines what to sample, not how. The selection method and sample size are specified in draw().

Clustering vs. Stratification

  • Stratification (stratify_by()): Sample within each group, so every group is represented in the sample

  • Clustering (cluster_by()): Sample whole groups as units, so only the selected groups appear in the sample
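
The contrast can be illustrated with a minimal base R sketch. It does not use this package's functions or reproduce its selection methods; the toy frame and every object name in it (frame, stratified, clustered) are made up for illustration.

```r
# Toy frame (hypothetical): 6 schools, 2 levels, 3 students per school.
frame <- data.frame(
  school_id    = rep(paste0("S", 1:6), each = 3),
  school_level = rep(c("Primary", "Secondary"), each = 9),
  student      = 1:18
)

set.seed(1)

# Stratification: draw 2 students within every level, so each level appears.
rows_by_level <- split(seq_len(nrow(frame)), frame$school_level)
strat_rows <- unlist(
  lapply(rows_by_level, function(idx) idx[sample.int(length(idx), 2)]),
  use.names = FALSE
)
stratified <- frame[strat_rows, ]

# Clustering: draw 2 whole schools and keep every row of the chosen schools.
chosen_schools <- sample(unique(frame$school_id), 2)
clustered <- frame[frame$school_id %in% chosen_schools, ]

unique(stratified$school_level)  # both levels are present
unique(clustered$school_id)      # only the 2 selected schools are present
```

Whatever the seed, the stratified draw contains both levels, while the clustered draw contains all rows of exactly two schools and nothing from the other four.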

Multi-Stage Designs

In multi-stage designs, each stage typically has its own clustering variable:

  • Stage 1: Select schools (cluster_by(school_id))

  • Stage 2: Select classrooms within schools (cluster_by(classroom_id))

  • Stage 3: Select students within classrooms (no clustering, sample individuals)

The nesting structure (classrooms within schools) is validated at execution time.
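
What "nested" means here can be sketched in base R: every classroom should belong to exactly one school. The check below is only an illustration under that assumption. The helper is_nested() and the toy frame are hypothetical, and this page does not show how the package itself performs its validation.

```r
# Hypothetical frame: one row per student, with classroom and school identifiers.
frame <- data.frame(
  school_id    = c("S1", "S1", "S1", "S2", "S2", "S2"),
  classroom_id = c("C1", "C1", "C2", "C3", "C3", "C4"),
  student_id   = 1:6
)

# An inner unit is properly nested when it maps to exactly one outer unit.
is_nested <- function(data, inner, outer) {
  outer_per_inner <- tapply(
    data[[outer]], data[[inner]],
    function(x) length(unique(x))
  )
  all(outer_per_inner == 1)
}

is_nested(frame, inner = "classroom_id", outer = "school_id")
#> [1] TRUE

# Reusing classroom "C1" in a second school breaks the nesting.
broken <- frame
broken$classroom_id[4] <- "C1"
is_nested(broken, inner = "classroom_id", outer = "school_id")
#> [1] FALSE
```

If classroom identifiers are only unique within a school, one way to pass a check like this is to build a combined key (for example with paste(school_id, classroom_id)) before clustering on it.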

Order of Operations

In a single stage, the typical order is:

  1. stratify_by() (optional) - define strata

  2. cluster_by() (optional) - define sampling units

  3. draw() (required) - specify selection parameters

Both stratify_by() and cluster_by() are optional, but draw() is required.

See also

sampling_design() for creating designs, stratify_by() for stratification, draw() for specifying selection, and stage() for multi-stage designs.

Examples

# Simple cluster sample: select 30 schools
sampling_design() |>
  cluster_by(school_id) |>
  draw(n = 30) |>
  execute(tanzania_schools, seed = 123)
#> == tbl_sample ==
#> Weights: 81.13 - 81.13 (mean: 81.13 )
#> 
#> # A tibble: 30 × 14
#>    school_id  region       district school_level ownership enrollment n_teachers
#>  * <chr>      <fct>        <fct>    <fct>        <fct>          <dbl>      <dbl>
#>  1 TZ_05_0094 Arusha       Arusha … Primary      Private          348          9
#>  2 TZ_06_0010 Arusha       Meru     Primary      Governme…        358          8
#>  3 TZ_07_0068 Arusha       Monduli  Primary      Governme…        477         13
#>  4 TZ_01_0017 Dar es Sala… Ilala    Primary      Private          213          5
#>  5 TZ_01_0170 Dar es Sala… Ilala    Secondary    Governme…        106          3
#>  6 TZ_02_0023 Dar es Sala… Kinondo… Primary      Governme…        259          6
#>  7 TZ_03_0003 Dar es Sala… Temeke   Primary      Governme…        520         13
#>  8 TZ_08_0018 Dodoma       Dodoma … Secondary    Governme…        141          4
#>  9 TZ_08_0066 Dodoma       Dodoma … Primary      Governme…        276          6
#> 10 TZ_10_0056 Dodoma       Kondoa   Secondary    Governme…        194          5
#> # ℹ 20 more rows
#> # ℹ 7 more variables: has_electricity <lgl>, has_water <lgl>, .weight <dbl>,
#> #   .sample_id <int>, .stage <int>, .weight_1 <dbl>, .fpc_1 <int>

# Stratified cluster sample: 10 schools per education level
sampling_design() |>
  stratify_by(school_level) |>
  cluster_by(school_id) |>
  draw(n = 10) |>
  execute(tanzania_schools, seed = 1)
#> == tbl_sample ==
#> Weights: 65.6 - 177.8 (mean: 121.7 )
#> 
#> # A tibble: 20 × 14
#>    school_id  region       district school_level ownership enrollment n_teachers
#>  * <chr>      <fct>        <fct>    <fct>        <fct>          <dbl>      <dbl>
#>  1 TZ_05_0062 Arusha       Arusha … Secondary    Governme…        184          5
#>  2 TZ_05_0066 Arusha       Arusha … Primary      Governme…        596         15
#>  3 TZ_07_0063 Arusha       Monduli  Primary      Governme…        275          8
#>  4 TZ_01_0025 Dar es Sala… Ilala    Primary      Governme…        384         11
#>  5 TZ_01_0102 Dar es Sala… Ilala    Secondary    Governme…        605         14
#>  6 TZ_02_0097 Dar es Sala… Kinondo… Primary      Governme…        742         21
#>  7 TZ_08_0034 Dodoma       Dodoma … Primary      Governme…        302          8
#>  8 TZ_11_0031 Dodoma       Mpwapwa  Secondary    Governme…        864         21
#>  9 TZ_13_0058 Kilimanjaro  Moshi R… Secondary    Private          144          3
#> 10 TZ_12_0018 Kilimanjaro  Moshi U… Primary      Governme…        350          8
#> 11 TZ_15_0055 Kilimanjaro  Rombo    Primary      Governme…        490         10
#> 12 TZ_15_0078 Kilimanjaro  Rombo    Secondary    Governme…        131          3
#> 13 TZ_24_0063 Morogoro     Morogor… Primary      Governme…        713         19
#> 14 TZ_17_0057 Mwanza       Ilemela  Secondary    Governme…        215          5
#> 15 TZ_17_0076 Mwanza       Ilemela  Primary      Governme…        799         17
#> 16 TZ_18_0004 Mwanza       Magu     Secondary    Governme…        286          8
#> 17 TZ_19_0056 Mwanza       Sengere… Primary      Private          770         18
#> 18 TZ_21_0058 Tanga        Korogwe  Secondary    Governme…        290          8
#> 19 TZ_22_0024 Tanga        Lushoto  Secondary    Private           88          3
#> 20 TZ_22_0039 Tanga        Lushoto  Secondary    Governme…        253          5
#> # ℹ 7 more variables: has_electricity <lgl>, has_water <lgl>, .weight <dbl>,
#> #   .sample_id <int>, .stage <int>, .weight_1 <dbl>, .fpc_1 <int>

# PPS cluster sample using enrollment as measure of size
sampling_design() |>
  cluster_by(school_id) |>
  draw(n = 50, method = "pps_brewer", mos = enrollment) |>
  execute(tanzania_schools, seed = 2026)
#> == tbl_sample ==
#> Weights: 10.28 - 184.32 (mean: 52.59 )
#> 
#> # A tibble: 50 × 15
#>    school_id  region district       school_level ownership enrollment n_teachers
#>  * <chr>      <fct>  <fct>          <fct>        <fct>          <dbl>      <dbl>
#>  1 TZ_04_0004 Arusha Arusha City    Primary      Governme…        455         11
#>  2 TZ_04_0014 Arusha Arusha City    Primary      Governme…        438          9
#>  3 TZ_04_0015 Arusha Arusha City    Primary      Governme…        576         13
#>  4 TZ_04_0038 Arusha Arusha City    Primary      Governme…        686         15
#>  5 TZ_04_0041 Arusha Arusha City    Secondary    Governme…        280          6
#>  6 TZ_04_0061 Arusha Arusha City    Primary      Governme…        247          7
#>  7 TZ_04_0082 Arusha Arusha City    Primary      Governme…        706         20
#>  8 TZ_05_0046 Arusha Arusha Distri… Primary      Governme…        664         15
#>  9 TZ_05_0092 Arusha Arusha Distri… Primary      Governme…        477         13
#> 10 TZ_07_0044 Arusha Monduli        Primary      Governme…        503         14
#> # ℹ 40 more rows
#> # ℹ 8 more variables: has_electricity <lgl>, has_water <lgl>, .weight <dbl>,
#> #   .sample_id <int>, .stage <int>, .weight_1 <dbl>, .fpc_1 <int>,
#> #   .certainty_1 <lgl>

# Two-stage cluster sample
sampling_design() |>
  stage(label = "Schools") |>
    cluster_by(school_id) |>
    draw(n = 30, method = "pps_brewer", mos = enrollment) |>
  stage(label = "Students") |>
    draw(n = 15) |>
  execute(tanzania_schools, seed = 1234)
#> == tbl_sample ==
#> Weights: 26.24 - 281.39 (mean: 83.24 )
#> 
#> # A tibble: 30 × 17
#>    school_id  region       district school_level ownership enrollment n_teachers
#>  * <chr>      <fct>        <fct>    <fct>        <fct>          <dbl>      <dbl>
#>  1 TZ_01_0019 Dar es Sala… Ilala    Primary      Governme…        381         11
#>  2 TZ_01_0084 Dar es Sala… Ilala    Primary      Governme…        344          9
#>  3 TZ_01_0163 Dar es Sala… Ilala    Primary      Governme…        521         14
#>  4 TZ_02_0016 Dar es Sala… Kinondo… Secondary    Governme…        119          3
#>  5 TZ_02_0021 Dar es Sala… Kinondo… Secondary    Governme…        495         10
#>  6 TZ_02_0103 Dar es Sala… Kinondo… Primary      Governme…        636         18
#>  7 TZ_03_0017 Dar es Sala… Temeke   Primary      Governme…        164          4
#>  8 TZ_03_0022 Dar es Sala… Temeke   Primary      Governme…        408         11
#>  9 TZ_03_0040 Dar es Sala… Temeke   Primary      Governme…        801         20
#> 10 TZ_03_0067 Dar es Sala… Temeke   Secondary    Governme…        217          5
#> # ℹ 20 more rows
#> # ℹ 10 more variables: has_electricity <lgl>, has_water <lgl>, .weight <dbl>,
#> #   .sample_id <int>, .stage <int>, .weight_2 <dbl>, .fpc_2 <int>,
#> #   .weight_1 <dbl>, .fpc_1 <int>, .certainty_1 <lgl>