stratify_by() specifies the stratification variables and an optional allocation method for a sampling design. Stratification guarantees that every subgroup defined by the stratification variables is represented in the sample.

stratify_by(.data, ..., alloc = NULL, variance = NULL, cost = NULL)

Arguments

.data

A sampling_design object (piped from sampling_design() or stage()).

...

<tidy-select> Stratification variables. These should be categorical variables that define the strata.

alloc

Allocation method, given as NULL or a character string. One of:

  • NULL (default): No allocation; n in draw() is per stratum

  • "equal": Equal allocation across strata

  • "proportional": Proportional to stratum size

  • "neyman": Neyman optimal allocation (requires variance)

  • "optimal": Cost-variance optimal allocation (requires variance and cost)

variance

A data frame with stratum variances for Neyman or optimal allocation. Must contain columns for all stratification variables plus a var column with variance estimates.

cost

A data frame with stratum costs for optimal allocation. Must contain columns for all stratification variables plus a cost column with per-unit costs.

Value

A modified sampling_design object with stratification specified.

Details

Allocation Methods

When no alloc is specified, the n parameter in draw() is interpreted as the sample size per stratum. When an alloc method is specified, n becomes the total sample size to be distributed according to the allocation method.

Equal Allocation

Each stratum receives n/H units, where H is the number of strata.
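
As a rough illustration of the n/H rule, the base R sketch below uses hypothetical stratum sizes (the names A, B, C and the counts are assumptions, not taken from niger_eas). It also shows the design weights N_h / n_h that equal allocation would imply under simple random sampling within strata. How stratify_by() rounds fractional sizes is not covered here.

```r
# Hypothetical stratum population sizes (illustrative only).
N_h <- c(A = 50, B = 150, C = 300)
n   <- 100             # total sample size
H   <- length(N_h)     # number of strata

# Equal allocation: every stratum gets n / H, regardless of its size.
equal_alloc <- rep(n / H, H)
names(equal_alloc) <- names(N_h)

# Implied design weights (assuming simple random sampling within strata):
# large strata end up with much larger weights than small ones.
implied_weights <- N_h / equal_alloc

print(equal_alloc)
print(implied_weights)
```

Because the stratum sizes differ while the allocation does not, the implied weights vary widely across strata.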

Proportional Allocation

Each stratum receives n × (N_h/N) units, where N_h is the stratum population size and N is the total population size.
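
A minimal base R sketch of the proportional rule, again with hypothetical stratum sizes (assumed values, not package data). It illustrates a useful property: the sampling fraction n_h / N_h is the same in every stratum and equals n / N.

```r
# Hypothetical stratum population sizes (illustrative only).
N_h <- c(A = 50, B = 150, C = 300)
n   <- 100                        # total sample size
N   <- sum(N_h)                   # total population size

# Proportional allocation: n * (N_h / N).
prop_alloc <- n * N_h / N

# The sampling fraction is constant across strata (n / N).
sampling_fraction <- prop_alloc / N_h

print(prop_alloc)
print(sampling_fraction)
```

A constant sampling fraction is why proportional designs tend to produce nearly uniform weights; any remaining spread comes from rounding stratum sizes to whole units.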

Neyman Allocation

Minimizes variance for a fixed total sample size. Each stratum receives: n × (N_h × S_h) / Σ(N_h × S_h) where S_h is the stratum standard deviation (the square root of the stratum variance).
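
The formula can be sketched in base R as follows. The stratum sizes and variances are hypothetical, and the standard deviation is taken as the square root of the variance; this is an illustration of the arithmetic, not the package's internal code.

```r
# Hypothetical stratum sizes and variances (illustrative only).
N_h   <- c(A = 50, B = 150, C = 300)
var_h <- c(A = 4,  B = 25,  C = 1)
n     <- 100                      # total sample size

# Standard deviation per stratum.
S_h <- sqrt(var_h)

# Neyman allocation: n * (N_h * S_h) / sum(N_h * S_h).
w <- N_h * S_h
neyman_alloc <- n * w / sum(w)

print(round(neyman_alloc, 2))
```

Stratum B is mid-sized but highly variable, so it receives the largest share; stratum C is the largest but homogeneous, so it receives less than it would under proportional allocation.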

Optimal Allocation

Minimizes variance for fixed cost (or cost for fixed variance). Each stratum receives: n × (N_h × S_h / √C_h) / Σ(N_h × S_h / √C_h) where C_h is the per-unit cost in stratum h.
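
A short base R sketch of the cost-adjusted formula. The helper alloc_optimal() and all input values are hypothetical and exist only for illustration. The second call shows that with equal per-unit costs the result matches the Neyman formula.

```r
# Hypothetical helper implementing n * (N_h * S_h / sqrt(C_h)) / sum(...).
alloc_optimal <- function(n, N_h, S_h, C_h) {
  w <- N_h * S_h / sqrt(C_h)
  n * w / sum(w)
}

# Hypothetical inputs (illustrative only).
N_h <- c(A = 50, B = 150, C = 300)   # stratum sizes
S_h <- c(A = 2,  B = 5,   C = 1)     # stratum standard deviations
C_h <- c(A = 1,  B = 4,   C = 9)     # per-unit costs

optimal_alloc <- alloc_optimal(100, N_h, S_h, C_h)

# With identical costs the cost term cancels and the Neyman shares remain.
same_cost <- alloc_optimal(100, N_h, S_h, rep(1, length(N_h)))

print(round(optimal_alloc, 2))
print(round(same_cost, 2))
```

Costly strata are sampled less heavily: stratum C drops from the Neyman share because each unit there is nine times as expensive as in stratum A.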

Custom Allocation

For custom stratum-specific sample sizes or rates, pass a data frame directly to the n or frac argument in draw(). The data frame must contain columns for all stratification variables plus an n or frac column.
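
To make the expected shape concrete, the base R sketch below builds a rate table keyed by a single stratification variable and joins it to hypothetical stratum counts to show the sample sizes those rates would imply. The region names, counts, and rates are assumptions; the join only illustrates the key-matching idea and is not the package's implementation.

```r
# Custom sampling rates: one row per stratum, keyed by the
# stratification variable, plus a frac column.
custom_rates <- data.frame(
  region = c("North", "South"),
  frac   = c(0.10, 0.25)
)

# Hypothetical stratum population counts (illustrative only).
stratum_sizes <- data.frame(
  region = c("North", "South"),
  N_h    = c(200, 80)
)

# Match rates to strata on the stratification variable.
implied <- merge(stratum_sizes, custom_rates, by = "region")
implied$n_h <- implied$N_h * implied$frac

print(implied)
```

A table like custom_rates is what would be passed as draw(frac = custom_rates) after stratify_by(region); a table with an n column would go to the n argument instead.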

Data Frame Requirements

Auxiliary data frames (variance, cost) must contain:

  • All stratification variable columns (used as join keys)

  • The appropriate value column (var or cost)
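
One way to build such tables is sketched below in base R, assuming a hypothetical pilot data set with one stratification variable (region) and a study variable y. The data, the variable names other than var and cost, and the cost figures are invented for illustration.

```r
# Hypothetical pilot data: one stratification variable and a study variable.
pilot <- data.frame(
  region = c("North", "North", "North", "South", "South", "South"),
  y      = c(10, 12, 14, 20, 30, 40)
)

# Stratum variances: one row per stratum, keyed by region.
variance_df <- aggregate(y ~ region, data = pilot, FUN = var)
names(variance_df)[names(variance_df) == "y"] <- "var"

# Per-unit costs, keyed by the same stratification variable.
cost_df <- data.frame(
  region = c("North", "South"),
  cost   = c(1.0, 2.5)
)

# Sanity checks: key column and value column are present,
# and both tables cover the same strata.
stopifnot(all(c("region", "var") %in% names(variance_df)))
stopifnot(all(c("region", "cost") %in% names(cost_df)))
stopifnot(setequal(variance_df$region, cost_df$region))

print(variance_df)
print(cost_df)
```

With several stratification variables, each table would need one column per variable, for example aggregate(y ~ region + strata, ...) for the variance table.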

See also

sampling_design() for creating designs, draw() for specifying sample sizes, cluster_by() for cluster sampling

Examples

# Simple stratification: 20 EAs per region
sampling_design() |>
  stratify_by(region) |>
  draw(n = 20) |>
  execute(niger_eas, seed = 1234)
#> == tbl_sample ==
#> Weights: 2.55 - 15.25 (mean: 9.6 )
#> 
#> # A tibble: 160 × 11
#>    region ea_id       department strata hh_count pop_estimate .weight .sample_id
#>  * <fct>  <chr>       <fct>      <fct>     <dbl>        <dbl>   <dbl>      <int>
#>  1 Agadez Aga_03_0001 Bilma      Urban       161         1127    2.55          1
#>  2 Agadez Aga_02_0003 Arlit      Rural        61          488    2.55          2
#>  3 Agadez Aga_02_0009 Arlit      Urban       137          959    2.55          3
#>  4 Agadez Aga_03_0010 Bilma      Urban       259         1554    2.55          4
#>  5 Agadez Aga_04_0007 Tchirozér… Urban       121          847    2.55          5
#>  6 Agadez Aga_01_0009 Agadez     Rural        97          582    2.55          6
#>  7 Agadez Aga_01_0005 Agadez     Urban       112          896    2.55          7
#>  8 Agadez Aga_04_0001 Tchirozér… Rural        87          609    2.55          8
#>  9 Agadez Aga_04_0013 Tchirozér… Urban       190         1330    2.55          9
#> 10 Agadez Aga_01_0004 Agadez     Rural       146         1022    2.55         10
#> # ℹ 150 more rows
#> # ℹ 3 more variables: .stage <int>, .weight_1 <dbl>, .fpc_1 <int>

# Proportional allocation across regions
sampling_design() |>
  stratify_by(region, alloc = "proportional") |>
  draw(n = 200) |>
  execute(niger_eas, seed = 123)
#> == tbl_sample ==
#> Weights: 7.29 - 8.12 (mean: 7.68 )
#> 
#> # A tibble: 200 × 11
#>    region ea_id       department strata hh_count pop_estimate .weight .sample_id
#>  * <fct>  <chr>       <fct>      <fct>     <dbl>        <dbl>   <dbl>      <int>
#>  1 Agadez Aga_03_0004 Bilma      Rural        83          415    7.29          1
#>  2 Agadez Aga_02_0002 Arlit      Rural        63          315    7.29          2
#>  3 Agadez Aga_02_0001 Arlit      Urban       128          640    7.29          3
#>  4 Agadez Aga_01_0003 Agadez     Urban       124          868    7.29          4
#>  5 Agadez Aga_04_0005 Tchirozér… Urban        75          525    7.29          5
#>  6 Agadez Aga_04_0006 Tchirozér… Urban       181         1267    7.29          6
#>  7 Agadez Aga_03_0010 Bilma      Urban       259         1554    7.29          7
#>  8 Diffa  Dif_08_0014 Bosso      Rural        57          342    8.12          8
#>  9 Diffa  Dif_07_0006 N'Guigmi   Rural       109          654    8.12          9
#> 10 Diffa  Dif_05_0010 Diffa      Rural        54          378    8.12         10
#> # ℹ 190 more rows
#> # ℹ 3 more variables: .stage <int>, .weight_1 <dbl>, .fpc_1 <int>

# Neyman allocation using pre-computed variances
sampling_design() |>
  stratify_by(region, alloc = "neyman", variance = niger_eas_variance) |>
  draw(n = 200) |>
  execute(niger_eas, seed = 12)
#> == tbl_sample ==
#> Weights: 5.57 - 9.29 (mean: 7.68 )
#> 
#> # A tibble: 200 × 11
#>    region ea_id       department strata hh_count pop_estimate .weight .sample_id
#>  * <fct>  <chr>       <fct>      <fct>     <dbl>        <dbl>   <dbl>      <int>
#>  1 Agadez Aga_01_0002 Agadez     Urban       157          942    5.67          1
#>  2 Agadez Aga_02_0013 Arlit      Rural        46          368    5.67          2
#>  3 Agadez Aga_02_0003 Arlit      Rural        61          488    5.67          3
#>  4 Agadez Aga_02_0014 Arlit      Urban       142          710    5.67          4
#>  5 Agadez Aga_04_0009 Tchirozér… Rural       107          856    5.67          5
#>  6 Agadez Aga_01_0005 Agadez     Urban       112          896    5.67          6
#>  7 Agadez Aga_04_0010 Tchirozér… Rural        37          185    5.67          7
#>  8 Agadez Aga_03_0001 Bilma      Urban       161         1127    5.67          8
#>  9 Agadez Aga_03_0007 Bilma      Rural       178         1246    5.67          9
#> 10 Diffa  Dif_08_0008 Bosso      Urban       170          850    9.29         10
#> # ℹ 190 more rows
#> # ℹ 3 more variables: .stage <int>, .weight_1 <dbl>, .fpc_1 <int>

# Optimal allocation considering both variance and cost
sampling_design() |>
  stratify_by(region, alloc = "optimal",
              variance = niger_eas_variance,
              cost = niger_eas_cost) |>
  draw(n = 200) |>
  execute(niger_eas, seed = 1)
#> == tbl_sample ==
#> Weights: 6.24 - 16.25 (mean: 7.68 )
#> 
#> # A tibble: 200 × 11
#>    region ea_id       department strata hh_count pop_estimate .weight .sample_id
#>  * <fct>  <chr>       <fct>      <fct>     <dbl>        <dbl>   <dbl>      <int>
#>  1 Agadez Aga_01_0004 Agadez     Rural       146         1022   10.2           1
#>  2 Agadez Aga_04_0002 Tchirozér… Urban       352         2464   10.2           2
#>  3 Agadez Aga_01_0001 Agadez     Rural        59          413   10.2           3
#>  4 Agadez Aga_03_0007 Bilma      Rural       178         1246   10.2           4
#>  5 Agadez Aga_02_0010 Arlit      Urban        91          455   10.2           5
#>  6 Diffa  Dif_06_0010 Mainé-Sor… Urban       119          714   16.2           6
#>  7 Diffa  Dif_08_0014 Bosso      Rural        57          342   16.2           7
#>  8 Diffa  Dif_05_0003 Diffa      Rural       120          960   16.2           8
#>  9 Diffa  Dif_07_0011 N'Guigmi   Rural       108          648   16.2           9
#> 10 Dosso  Dos_11_0016 Dogondout… Rural       197         1379    8.09         10
#> # ℹ 190 more rows
#> # ℹ 3 more variables: .stage <int>, .weight_1 <dbl>, .fpc_1 <int>

# Custom sample sizes per stratum using a data frame
custom_sizes <- data.frame(
  region = c("Agadez", "Diffa", "Dosso", "Maradi",
             "Niamey", "Tahoua", "Tillabéri", "Zinder"),
  n = c(15, 20, 30, 35, 25, 30, 25, 20)
)
sampling_design() |>
  stratify_by(region) |>
  draw(n = custom_sizes) |>
  execute(niger_eas, seed = 2026)
#> == tbl_sample ==
#> Weights: 3.25 - 15.25 (mean: 7.68 )
#> 
#> # A tibble: 200 × 11
#>    region ea_id       department strata hh_count pop_estimate .weight .sample_id
#>  * <fct>  <chr>       <fct>      <fct>     <dbl>        <dbl>   <dbl>      <int>
#>  1 Agadez Aga_03_0002 Bilma      Urban       279         1395     3.4          1
#>  2 Agadez Aga_03_0006 Bilma      Rural        98          490     3.4          2
#>  3 Agadez Aga_04_0001 Tchirozér… Rural        87          609     3.4          3
#>  4 Agadez Aga_04_0008 Tchirozér… Rural        76          456     3.4          4
#>  5 Agadez Aga_04_0010 Tchirozér… Rural        37          185     3.4          5
#>  6 Agadez Aga_02_0014 Arlit      Urban       142          710     3.4          6
#>  7 Agadez Aga_04_0007 Tchirozér… Urban       121          847     3.4          7
#>  8 Agadez Aga_03_0009 Bilma      Rural        41          287     3.4          8
#>  9 Agadez Aga_01_0005 Agadez     Urban       112          896     3.4          9
#> 10 Agadez Aga_03_0004 Bilma      Rural        83          415     3.4         10
#> # ℹ 190 more rows
#> # ℹ 3 more variables: .stage <int>, .weight_1 <dbl>, .fpc_1 <int>

# Multiple stratification variables
sampling_design() |>
  stratify_by(region, strata, alloc = "proportional") |>
  draw(n = 300) |>
  execute(niger_eas, seed = 2025)
#> == tbl_sample ==
#> Weights: 4.33 - 6 (mean: 5.12 )
#> 
#> # A tibble: 300 × 11
#>    region strata ea_id       department hh_count pop_estimate .weight .sample_id
#>  * <fct>  <fct>  <chr>       <fct>         <dbl>        <dbl>   <dbl>      <int>
#>  1 Agadez Urban  Aga_03_0002 Bilma           279         1395     4.8          1
#>  2 Agadez Urban  Aga_03_0001 Bilma           161         1127     4.8          2
#>  3 Agadez Urban  Aga_01_0008 Agadez           54          432     4.8          3
#>  4 Agadez Urban  Aga_01_0002 Agadez          157          942     4.8          4
#>  5 Agadez Urban  Aga_02_0010 Arlit            91          455     4.8          5
#>  6 Agadez Rural  Aga_02_0008 Arlit            42          294     5.4          6
#>  7 Agadez Rural  Aga_03_0004 Bilma            83          415     5.4          7
#>  8 Agadez Rural  Aga_02_0007 Arlit           125          750     5.4          8
#>  9 Agadez Rural  Aga_01_0007 Agadez          166         1162     5.4          9
#> 10 Agadez Rural  Aga_04_0008 Tchirozér…       76          456     5.4         10
#> # ℹ 290 more rows
#> # ℹ 3 more variables: .stage <int>, .weight_1 <dbl>, .fpc_1 <int>