stratify_by() specifies stratification variables and optional allocation
methods for a sampling design. Stratification ensures representation from
all subgroups defined by the stratification variables.
stratify_by(.data, ..., alloc = NULL, variance = NULL, cost = NULL)
.data: A sampling_design object (piped from sampling_design() or
stage()).
...: <tidy-select> Stratification
variables. These should be categorical variables that define the strata.
alloc: Character string specifying the allocation method. One of:
NULL (default): No allocation; n in draw() is per stratum
"equal": Equal allocation across strata
"proportional": Proportional to stratum size
"neyman": Neyman optimal allocation (requires variance)
"optimal": Cost-variance optimal allocation (requires variance and cost)
variance: A data frame with stratum variances for Neyman or optimal
allocation. Must contain columns for all stratification variables plus
a var column with variance estimates.
cost: A data frame with stratum costs for optimal allocation.
Must contain columns for all stratification variables plus a cost
column with per-unit costs.
A modified sampling_design object with stratification specified.
When no alloc is specified, the n parameter in draw() is interpreted
as the sample size per stratum. When an alloc method is specified,
n becomes the total sample size to be distributed according to the
allocation method.
Proportional allocation: each stratum receives n × (N_h / N) units, where N_h is the stratum population size and N is the total population size.
Neyman allocation: minimizes variance for a fixed total sample size. Each stratum receives n × (N_h × S_h) / Σ(N_h × S_h) units, where S_h is the stratum standard deviation.
Optimal allocation: minimizes variance for a fixed cost (or cost for a fixed variance). Each stratum receives n × (N_h × S_h / √C_h) / Σ(N_h × S_h / √C_h) units, where C_h is the per-unit cost in stratum h.
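As a rough illustration of the three formulas above, the following base R sketch computes the allocations by hand. The stratum sizes, variances, and costs are made up, and the helper allocate() is hypothetical (it is not part of the package). S_h is taken as the square root of the var column. The results are left unrounded, because how stratify_by() rounds allocations to whole units is not stated here.

```r
# Made-up stratum inputs, for illustration only
strata <- data.frame(
  stratum = c("A", "B", "C"),
  N_h  = c(500, 300, 200),   # stratum population sizes
  var  = c(16, 100, 225),    # stratum variance estimates
  cost = c(1, 4, 9)          # per-unit costs
)
n <- 100                     # total sample size to distribute

# Hypothetical helper: split n in proportion to the weights w
allocate <- function(n, w) n * w / sum(w)

S_h <- sqrt(strata$var)      # standard deviation from variance

strata$proportional <- allocate(n, strata$N_h)
strata$neyman       <- allocate(n, strata$N_h * S_h)
strata$optimal      <- allocate(n, strata$N_h * S_h / sqrt(strata$cost))

print(strata)
```

In this toy case the Neyman allocation moves sample from the low-variance stratum A toward B and C, while the optimal allocation moves it back toward A, where units are cheapest.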
For custom stratum-specific sample sizes or rates, pass a data frame
directly to the n or frac argument in draw(). The data frame must
contain columns for all stratification variables plus an n or frac column.
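A minimal sketch of the frac variant, assuming a design stratified by region. The sampling fractions are invented for illustration, and the commented pipeline is an assumed usage that mirrors the documented data frame form for n; it needs the package and the niger_eas data to run.

```r
# Illustrative per-region sampling fractions (values are made up)
custom_rates <- data.frame(
  region = c("Agadez", "Diffa", "Dosso", "Maradi",
             "Niamey", "Tahoua", "Tillabéri", "Zinder"),
  frac = c(0.30, 0.10, 0.15, 0.10, 0.20, 0.10, 0.15, 0.10)
)

# Assumed usage, by analogy with draw(n = <data frame>):
# sampling_design() |>
#   stratify_by(region) |>
#   draw(frac = custom_rates) |>
#   execute(niger_eas, seed = 2026)
```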
Auxiliary data frames (variance, cost) must contain:
All stratification variable columns (used as join keys)
The appropriate value column (var or cost)
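The shape of these auxiliary tables can be sketched in base R as follows, assuming a design stratified by region. The variance and cost values are invented, only three regions are shown for brevity (in practice one row per stratum would presumably be needed), and missing_cols() is a hypothetical helper, not a package function.

```r
# Illustrative auxiliary tables keyed by the stratification variable
stratum_variance <- data.frame(
  region = c("Agadez", "Diffa", "Dosso"),
  var    = c(120.5, 80.2, 200.0)   # made-up variance estimates
)
stratum_cost <- data.frame(
  region = c("Agadez", "Diffa", "Dosso"),
  cost   = c(50, 35, 20)           # made-up per-unit costs
)

# Hypothetical helper: list required columns absent from an auxiliary table
missing_cols <- function(aux, strata_vars, value_col) {
  setdiff(c(strata_vars, value_col), names(aux))
}

# Complete for a design stratified by region
missing_cols(stratum_variance, "region", "var")

# A design stratified by region and strata would also need a strata column
missing_cols(stratum_cost, c("region", "strata"), "cost")
```

The first call returns an empty character vector; the second returns "strata", the join key the cost table lacks.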
sampling_design() for creating designs,
draw() for specifying sample sizes,
cluster_by() for cluster sampling
# Simple stratification: 20 EAs per region
sampling_design() |>
  stratify_by(region) |>
  draw(n = 20) |>
  execute(niger_eas, seed = 1234)
#> == tbl_sample ==
#> Weights: 2.55 - 15.25 (mean: 9.6 )
#>
#> # A tibble: 160 × 11
#> region ea_id department strata hh_count pop_estimate .weight .sample_id
#> * <fct> <chr> <fct> <fct> <dbl> <dbl> <dbl> <int>
#> 1 Agadez Aga_03_0001 Bilma Urban 161 1127 2.55 1
#> 2 Agadez Aga_02_0003 Arlit Rural 61 488 2.55 2
#> 3 Agadez Aga_02_0009 Arlit Urban 137 959 2.55 3
#> 4 Agadez Aga_03_0010 Bilma Urban 259 1554 2.55 4
#> 5 Agadez Aga_04_0007 Tchirozér… Urban 121 847 2.55 5
#> 6 Agadez Aga_01_0009 Agadez Rural 97 582 2.55 6
#> 7 Agadez Aga_01_0005 Agadez Urban 112 896 2.55 7
#> 8 Agadez Aga_04_0001 Tchirozér… Rural 87 609 2.55 8
#> 9 Agadez Aga_04_0013 Tchirozér… Urban 190 1330 2.55 9
#> 10 Agadez Aga_01_0004 Agadez Rural 146 1022 2.55 10
#> # ℹ 150 more rows
#> # ℹ 3 more variables: .stage <int>, .weight_1 <dbl>, .fpc_1 <int>
# Proportional allocation across regions
sampling_design() |>
  stratify_by(region, alloc = "proportional") |>
  draw(n = 200) |>
  execute(niger_eas, seed = 123)
#> == tbl_sample ==
#> Weights: 7.29 - 8.12 (mean: 7.68 )
#>
#> # A tibble: 200 × 11
#> region ea_id department strata hh_count pop_estimate .weight .sample_id
#> * <fct> <chr> <fct> <fct> <dbl> <dbl> <dbl> <int>
#> 1 Agadez Aga_03_0004 Bilma Rural 83 415 7.29 1
#> 2 Agadez Aga_02_0002 Arlit Rural 63 315 7.29 2
#> 3 Agadez Aga_02_0001 Arlit Urban 128 640 7.29 3
#> 4 Agadez Aga_01_0003 Agadez Urban 124 868 7.29 4
#> 5 Agadez Aga_04_0005 Tchirozér… Urban 75 525 7.29 5
#> 6 Agadez Aga_04_0006 Tchirozér… Urban 181 1267 7.29 6
#> 7 Agadez Aga_03_0010 Bilma Urban 259 1554 7.29 7
#> 8 Diffa Dif_08_0014 Bosso Rural 57 342 8.12 8
#> 9 Diffa Dif_07_0006 N'Guigmi Rural 109 654 8.12 9
#> 10 Diffa Dif_05_0010 Diffa Rural 54 378 8.12 10
#> # ℹ 190 more rows
#> # ℹ 3 more variables: .stage <int>, .weight_1 <dbl>, .fpc_1 <int>
# Neyman allocation using pre-computed variances
sampling_design() |>
  stratify_by(region, alloc = "neyman", variance = niger_eas_variance) |>
  draw(n = 200) |>
  execute(niger_eas, seed = 12)
#> == tbl_sample ==
#> Weights: 5.57 - 9.29 (mean: 7.68 )
#>
#> # A tibble: 200 × 11
#> region ea_id department strata hh_count pop_estimate .weight .sample_id
#> * <fct> <chr> <fct> <fct> <dbl> <dbl> <dbl> <int>
#> 1 Agadez Aga_01_0002 Agadez Urban 157 942 5.67 1
#> 2 Agadez Aga_02_0013 Arlit Rural 46 368 5.67 2
#> 3 Agadez Aga_02_0003 Arlit Rural 61 488 5.67 3
#> 4 Agadez Aga_02_0014 Arlit Urban 142 710 5.67 4
#> 5 Agadez Aga_04_0009 Tchirozér… Rural 107 856 5.67 5
#> 6 Agadez Aga_01_0005 Agadez Urban 112 896 5.67 6
#> 7 Agadez Aga_04_0010 Tchirozér… Rural 37 185 5.67 7
#> 8 Agadez Aga_03_0001 Bilma Urban 161 1127 5.67 8
#> 9 Agadez Aga_03_0007 Bilma Rural 178 1246 5.67 9
#> 10 Diffa Dif_08_0008 Bosso Urban 170 850 9.29 10
#> # ℹ 190 more rows
#> # ℹ 3 more variables: .stage <int>, .weight_1 <dbl>, .fpc_1 <int>
# Optimal allocation considering both variance and cost
sampling_design() |>
  stratify_by(region, alloc = "optimal",
              variance = niger_eas_variance,
              cost = niger_eas_cost) |>
  draw(n = 200) |>
  execute(niger_eas, seed = 1)
#> == tbl_sample ==
#> Weights: 6.24 - 16.25 (mean: 7.68 )
#>
#> # A tibble: 200 × 11
#> region ea_id department strata hh_count pop_estimate .weight .sample_id
#> * <fct> <chr> <fct> <fct> <dbl> <dbl> <dbl> <int>
#> 1 Agadez Aga_01_0004 Agadez Rural 146 1022 10.2 1
#> 2 Agadez Aga_04_0002 Tchirozér… Urban 352 2464 10.2 2
#> 3 Agadez Aga_01_0001 Agadez Rural 59 413 10.2 3
#> 4 Agadez Aga_03_0007 Bilma Rural 178 1246 10.2 4
#> 5 Agadez Aga_02_0010 Arlit Urban 91 455 10.2 5
#> 6 Diffa Dif_06_0010 Mainé-Sor… Urban 119 714 16.2 6
#> 7 Diffa Dif_08_0014 Bosso Rural 57 342 16.2 7
#> 8 Diffa Dif_05_0003 Diffa Rural 120 960 16.2 8
#> 9 Diffa Dif_07_0011 N'Guigmi Rural 108 648 16.2 9
#> 10 Dosso Dos_11_0016 Dogondout… Rural 197 1379 8.09 10
#> # ℹ 190 more rows
#> # ℹ 3 more variables: .stage <int>, .weight_1 <dbl>, .fpc_1 <int>
# Custom sample sizes per stratum using a data frame
custom_sizes <- data.frame(
  region = c("Agadez", "Diffa", "Dosso", "Maradi",
             "Niamey", "Tahoua", "Tillabéri", "Zinder"),
  n = c(15, 20, 30, 35, 25, 30, 25, 20)
)
sampling_design() |>
  stratify_by(region) |>
  draw(n = custom_sizes) |>
  execute(niger_eas, seed = 2026)
#> == tbl_sample ==
#> Weights: 3.25 - 15.25 (mean: 7.68 )
#>
#> # A tibble: 200 × 11
#> region ea_id department strata hh_count pop_estimate .weight .sample_id
#> * <fct> <chr> <fct> <fct> <dbl> <dbl> <dbl> <int>
#> 1 Agadez Aga_03_0002 Bilma Urban 279 1395 3.4 1
#> 2 Agadez Aga_03_0006 Bilma Rural 98 490 3.4 2
#> 3 Agadez Aga_04_0001 Tchirozér… Rural 87 609 3.4 3
#> 4 Agadez Aga_04_0008 Tchirozér… Rural 76 456 3.4 4
#> 5 Agadez Aga_04_0010 Tchirozér… Rural 37 185 3.4 5
#> 6 Agadez Aga_02_0014 Arlit Urban 142 710 3.4 6
#> 7 Agadez Aga_04_0007 Tchirozér… Urban 121 847 3.4 7
#> 8 Agadez Aga_03_0009 Bilma Rural 41 287 3.4 8
#> 9 Agadez Aga_01_0005 Agadez Urban 112 896 3.4 9
#> 10 Agadez Aga_03_0004 Bilma Rural 83 415 3.4 10
#> # ℹ 190 more rows
#> # ℹ 3 more variables: .stage <int>, .weight_1 <dbl>, .fpc_1 <int>
# Multiple stratification variables
sampling_design() |>
  stratify_by(region, strata, alloc = "proportional") |>
  draw(n = 300) |>
  execute(niger_eas, seed = 2025)
#> == tbl_sample ==
#> Weights: 4.33 - 6 (mean: 5.12 )
#>
#> # A tibble: 300 × 11
#> region strata ea_id department hh_count pop_estimate .weight .sample_id
#> * <fct> <fct> <chr> <fct> <dbl> <dbl> <dbl> <int>
#> 1 Agadez Urban Aga_03_0002 Bilma 279 1395 4.8 1
#> 2 Agadez Urban Aga_03_0001 Bilma 161 1127 4.8 2
#> 3 Agadez Urban Aga_01_0008 Agadez 54 432 4.8 3
#> 4 Agadez Urban Aga_01_0002 Agadez 157 942 4.8 4
#> 5 Agadez Urban Aga_02_0010 Arlit 91 455 4.8 5
#> 6 Agadez Rural Aga_02_0008 Arlit 42 294 5.4 6
#> 7 Agadez Rural Aga_03_0004 Bilma 83 415 5.4 7
#> 8 Agadez Rural Aga_02_0007 Arlit 125 750 5.4 8
#> 9 Agadez Rural Aga_01_0007 Agadez 166 1162 5.4 9
#> 10 Agadez Rural Aga_04_0008 Tchirozér… 76 456 5.4 10
#> # ℹ 290 more rows
#> # ℹ 3 more variables: .stage <int>, .weight_1 <dbl>, .fpc_1 <int>