stratify_by() specifies stratification variables and optional allocation
methods for a sampling design. Stratification ensures representation from
all subgroups defined by the stratification variables.
Usage
stratify_by(
  .data,
  ...,
  alloc = NULL,
  variance = NULL,
  cost = NULL,
  cv = NULL,
  importance = NULL,
  power = NULL
)
Arguments
- .data
A sampling_design object (piped from sampling_design() or add_stage()).
- ...
Stratification variables specified as bare column names.
- alloc
Character string specifying the allocation method. One of:
  - NULL (default): No allocation; n in draw() is per stratum
  - "equal": Equal allocation across strata
  - "proportional": Proportional to stratum size
  - "neyman": Neyman optimal allocation (requires variance)
  - "optimal": Cost-variance optimal allocation (requires variance and cost)
  - "power": Power allocation (requires cv and importance)
- variance
Stratum variances for Neyman or optimal allocation. Either a data frame with columns for all stratification variables plus a var column, or a named numeric vector (when using a single stratification variable) where names correspond to stratum levels. For named vectors, names must match the values in the stratification column (for example c(A = 1.2, B = 0.8)).
- cost
Stratum costs for optimal allocation. Either a data frame with columns for all stratification variables plus a cost column, or a named numeric vector (when using a single stratification variable) where names correspond to stratum levels. For named vectors, names must match the values in the stratification column.
- cv
Stratum coefficients of variation (\(C_h\)) for power allocation. Either a data frame with stratification columns plus a cv column, or a named numeric vector for a single stratification variable (names are stratum levels).
- importance
Stratum importance measure (\(X_h\)) for power allocation. Either a data frame with stratification columns plus an importance column, or a named numeric vector for a single stratification variable (names are stratum levels).
- power
Power exponent \(q\) for power allocation. Must satisfy \(0 \le q \le 1\). Defaults to 0.5.
Details
Allocation Methods
When no alloc is specified, the n parameter in draw() is interpreted
as the sample size per stratum. When an alloc method is specified,
n becomes the total sample size to be distributed according to the
allocation method.
Proportional Allocation
Each stratum receives \(n \times N_h / N\) units, where \(N_h\) is the stratum population size and \(N\) is the total population size.
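The proportional formula can be checked by hand in base R. This is a minimal sketch with made-up stratum sizes (the names N_h, n and n_h are illustrative, not objects from the package); it returns unrounded shares and makes no claim about how the package rounds them to integers.

```r
# Hypothetical stratum population sizes N_h (not taken from bfa_eas)
N_h <- c(North = 500, Centre = 300, South = 200)
n   <- 100  # total sample size to distribute

# Proportional allocation: n_h = n * N_h / N, with N = sum(N_h)
n_h <- n * N_h / sum(N_h)
n_h
# North 50, Centre 30, South 20
```

Because every stratum is sampled at the same rate n / N, the resulting design weights are (up to rounding) equal across strata.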
Neyman Allocation
Minimizes variance for a fixed total sample size. Each stratum receives \(n \times (N_h \times S_h) / \sum_h (N_h \times S_h)\) units, where \(S_h\) is the stratum standard deviation (the square root of the stratum variance).
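As a worked illustration of the Neyman formula, the sketch below uses base R and made-up inputs. It assumes the supplied values are variances, so the standard deviation is obtained with sqrt(); the object names are hypothetical and the result is left unrounded.

```r
# Hypothetical stratum sizes and variances (illustrative values only)
N_h   <- c(North = 500, Centre = 300, South = 200)
var_h <- c(North = 4, Centre = 16, South = 25)
n     <- 100  # total sample size

# Standard deviations from the variances: 2, 4, 5
S_h <- sqrt(var_h)

# Neyman allocation: n_h proportional to N_h * S_h
w   <- N_h * S_h          # 1000, 1200, 1000
n_h <- n * w / sum(w)
n_h
# North 31.25, Centre 37.5, South 31.25
```

Compared with proportional allocation (50, 30, 20 for the same sizes), the more variable Centre and South strata receive a larger share.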
Optimal Allocation
Minimizes variance for fixed cost (or cost for fixed variance). Each stratum receives \(n \times (N_h \times S_h / \sqrt{C_h}) / \sum_h (N_h \times S_h / \sqrt{C_h})\) units, where \(C_h\) is the per-unit cost in stratum \(h\).
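The cost-adjusted formula can be sketched the same way. The inputs below are invented for illustration (the costs are in arbitrary units), the variable names are hypothetical, and the shares are left unrounded.

```r
# Hypothetical stratum sizes, variances, and per-unit costs
N_h    <- c(North = 500, Centre = 300, South = 200)
var_h  <- c(North = 4, Centre = 16, South = 25)
cost_h <- c(North = 1, Centre = 4, South = 4)
n      <- 210  # total sample size

# Optimal allocation: n_h proportional to N_h * S_h / sqrt(C_h)
w   <- N_h * sqrt(var_h) / sqrt(cost_h)   # 1000, 600, 500
n_h <- n * w / sum(w)
n_h
# North 100, Centre 60, South 50
```

With equal costs in every stratum the sqrt(cost_h) term cancels and the result reduces to the Neyman allocation.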
Power Allocation
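Power allocation (Bankier, 1988) is a compromise allocation: \(n_h \propto C_h \times X_h^q\), where \(C_h\) is the stratum CV (here \(C_h\) denotes the coefficient of variation, not the per-unit cost used under Optimal Allocation), \(X_h\) is a stratum importance measure, and \(q \in [0, 1]\) is the power exponent.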
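To see how the exponent \(q\) shifts the allocation, here is a small base R sketch. power_alloc() is a hypothetical helper written for this illustration, not a function exported by the package, and the CV and importance values are invented.

```r
# Hypothetical helper: distribute n proportionally to cv * x^q
power_alloc <- function(n, cv, x, q = 0.5) {
  stopifnot(q >= 0, q <= 1)
  w <- cv * x^q
  n * w / sum(w)
}

# Illustrative stratum CVs and importance measures
cv_h <- c(A = 0.4, B = 0.2, C = 0.1)
X_h  <- c(A = 100, B = 400, C = 900)

# q = 0.5: weights are 0.4 * 10, 0.2 * 20, 0.1 * 30 = 4, 4, 3
power_alloc(110, cv_h, X_h, q = 0.5)
# A 40, B 40, C 30

# q = 0 ignores importance; q = 1 uses it at full strength
power_alloc(110, cv_h, X_h, q = 0)
power_alloc(110, cv_h, X_h, q = 1)
```

Smaller values of q pull the allocation toward the CVs alone, while values near 1 favour strata with large importance measures.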
Custom Allocation
For custom stratum-specific sample sizes or rates, pass a data frame
directly to the n or frac argument in draw(). The data frame must
contain columns for all stratification variables plus an n or frac column.
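A custom table for two stratification variables can be assembled from the frame itself. The sketch below uses base R on a tiny made-up frame; the objects frame and custom_frac and the sampling rates are illustrative assumptions.

```r
# Hypothetical sampling frame with two stratification variables
frame <- data.frame(
  region      = c("North", "North", "North", "South", "South", "South"),
  urban_rural = c("Urban", "Rural", "Rural", "Urban", "Urban", "Rural")
)

# One row per stratum actually present in the frame
custom_frac <- unique(frame[, c("region", "urban_rural")])

# Illustrative rates: 20% in urban strata, 10% in rural strata
custom_frac$frac <- ifelse(custom_frac$urban_rural == "Urban", 0.20, 0.10)
custom_frac
```

A table shaped like custom_frac has the stratification columns plus a frac column, which is the layout described above for the frac argument of draw() after stratify_by(region, urban_rural).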
Data Frame Requirements
Auxiliary data frames (variance, cost, cv, importance) must contain:
- All stratification variable columns (used as join keys)
- The appropriate value column (var, cost, cv, or importance)
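Before passing an auxiliary table, it can help to confirm that it covers every stratum. The following base R sketch shows one way to do that with merge(); the data are invented, and the check is a suggestion for user code, not a description of the package's internal validation.

```r
# Hypothetical strata present in the sampling frame
strata <- data.frame(region = c("North", "Centre", "South"))

# Auxiliary variance table: join key (region) plus the value column (var)
variance_df <- data.frame(
  region = c("North", "Centre", "South"),
  var    = c(4, 16, 25)
)

# Left join on the stratification column; unmatched strata get NA
joined <- merge(strata, variance_df, by = "region", all.x = TRUE)
missing_strata <- joined$region[is.na(joined$var)]
length(missing_strata) == 0  # TRUE when every stratum has a variance

# Named-vector form for a single stratification variable,
# converted to the equivalent two-column table
variance_vec <- c(North = 4, Centre = 16, South = 25)
from_vec <- data.frame(region = names(variance_vec),
                       var    = unname(variance_vec))
```

The same pattern applies to cost, cv, and importance tables by swapping the value column name.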
See also
sampling_design() for creating designs,
draw() for specifying sample sizes,
cluster_by() for cluster sampling
Examples
# Simple stratification: 20 EAs per region
sampling_design() |>
  stratify_by(region) |>
  draw(n = 20) |>
  execute(bfa_eas, seed = 1234)
#> # A tbl_sample: 260 × 17
#> # Weights: 57.31 [30.4, 79.5]
#> ea_id region province commune urban_rural population households area_km2
#> * <chr> <fct> <fct> <fct> <fct> <dbl> <int> <dbl>
#> 1 EA_05762 Boucle … Sourou Kassoum Rural 1633 221 23.3
#> 2 EA_11523 Boucle … Mouhoun Safane Rural 549 78 16.4
#> 3 EA_04508 Boucle … Nayala Gassan Rural 1900 232 9.41
#> 4 EA_10142 Boucle … Mouhoun Ouarko… Rural 1201 167 41.8
#> 5 EA_03597 Boucle … Kossi Djibas… Rural 1615 192 11.0
#> 6 EA_03184 Boucle … Mouhoun Dedoug… Rural 966 171 0.5
#> 7 EA_03738 Boucle … Kossi Dokui Rural 1314 210 22.0
#> 8 EA_03213 Boucle … Mouhoun Dedoug… Rural 821 146 0.25
#> 9 EA_12401 Boucle … Banwa Solenzo Rural 1062 127 29.3
#> 10 EA_03179 Boucle … Mouhoun Dedoug… Rural 1368 243 43.6
#> # ℹ 250 more rows
#> # ℹ 9 more variables: accessible <lgl>, dist_road_km <dbl>,
#> # food_insecurity_pct <dbl>, cost <dbl>, .weight <dbl>, .sample_id <int>,
#> # .stage <int>, .weight_1 <dbl>, .fpc_1 <int>
# Proportional allocation across regions
sampling_design() |>
  stratify_by(region, alloc = "proportional") |>
  draw(n = 200) |>
  execute(bfa_eas, seed = 123)
#> # A tbl_sample: 200 × 17
#> # Weights: 74.5 [72.2, 76]
#> ea_id region province commune urban_rural population households area_km2
#> * <chr> <fct> <fct> <fct> <fct> <dbl> <int> <dbl>
#> 1 EA_12416 Boucle … Banwa Solenzo Rural 1395 166 8.03
#> 2 EA_12873 Boucle … Banwa Tansila Rural 1164 148 39.1
#> 3 EA_11055 Boucle … Bale Pompoi Rural 1198 189 33.3
#> 4 EA_00767 Boucle … Kossi Barani Rural 1433 188 17.5
#> 5 EA_12078 Boucle … Bale Siby Rural 912 125 8.55
#> 6 EA_03217 Boucle … Mouhoun Dedoug… Rural 998 177 3.4
#> 7 EA_04525 Boucle … Nayala Gassan Rural 649 79 0.49
#> 8 EA_05964 Boucle … Sourou Kiemba… Rural 1075 171 12.7
#> 9 EA_14277 Boucle … Nayala Ye Rural 1323 233 10.9
#> 10 EA_14292 Boucle … Nayala Ye Rural 1030 182 18.1
#> # ℹ 190 more rows
#> # ℹ 9 more variables: accessible <lgl>, dist_road_km <dbl>,
#> # food_insecurity_pct <dbl>, cost <dbl>, .weight <dbl>, .sample_id <int>,
#> # .stage <int>, .weight_1 <dbl>, .fpc_1 <int>
# Neyman allocation using pre-computed variances
sampling_design() |>
  stratify_by(region, alloc = "neyman", variance = bfa_eas_variance) |>
  draw(n = 200) |>
  execute(bfa_eas, seed = 12)
#> # A tbl_sample: 200 × 17
#> # Weights: 74.5 [54.83, 121.6]
#> ea_id region province commune urban_rural population households area_km2
#> * <chr> <fct> <fct> <fct> <fct> <dbl> <int> <dbl>
#> 1 EA_12451 Boucle … Banwa Solenzo Rural 2360 282 27.0
#> 2 EA_12347 Boucle … Banwa Solenzo Rural 1929 230 30.2
#> 3 EA_11712 Boucle … Banwa Sanaba Rural 816 118 34.3
#> 4 EA_14295 Boucle … Nayala Ye Rural 1281 226 25.9
#> 5 EA_03138 Boucle … Mouhoun Dedoug… Rural 1461 259 18.2
#> 6 EA_11050 Boucle … Bale Pompoi Rural 917 144 26.8
#> 7 EA_13839 Boucle … Sourou Tougan Rural 1073 148 12.2
#> 8 EA_12930 Boucle … Mouhoun Tcheri… Rural 1243 149 1.15
#> 9 EA_02077 Boucle … Kossi Bombor… Rural 1036 136 7.59
#> 10 EA_04889 Boucle … Nayala Gossina Rural 1032 166 27.0
#> # ℹ 190 more rows
#> # ℹ 9 more variables: accessible <lgl>, dist_road_km <dbl>,
#> # food_insecurity_pct <dbl>, cost <dbl>, .weight <dbl>, .sample_id <int>,
#> # .stage <int>, .weight_1 <dbl>, .fpc_1 <int>
# Optimal allocation considering both variance and cost
sampling_design() |>
  stratify_by(region, alloc = "optimal",
              variance = bfa_eas_variance,
              cost = bfa_eas_cost) |>
  draw(n = 200) |>
  execute(bfa_eas, seed = 1)
#> # A tbl_sample: 200 × 17
#> # Weights: 74.5 [63.6, 107.25]
#> ea_id region province commune urban_rural population households area_km2
#> * <chr> <fct> <fct> <fct> <fct> <dbl> <int> <dbl>
#> 1 EA_10155 Boucle … Mouhoun Ouarko… Rural 1347 187 33.0
#> 2 EA_03955 Boucle … Kossi Doumba… Rural 1558 211 16.6
#> 3 EA_10325 Boucle … Bale Ouri Rural 767 92 17.6
#> 4 EA_03209 Boucle … Mouhoun Dedoug… Rural 446 79 18.1
#> 5 EA_12881 Boucle … Banwa Tansila Rural 1076 137 9.45
#> 6 EA_11613 Boucle … Banwa Sami Rural 912 118 43.2
#> 7 EA_06857 Boucle … Banwa Kouka Rural 1642 189 12.5
#> 8 EA_13730 Boucle … Nayala Toma Rural 973 101 0.88
#> 9 EA_05972 Boucle … Sourou Kiemba… Rural 1359 216 36.6
#> 10 EA_03571 Boucle … Kossi Djibas… Rural 725 86 1.01
#> # ℹ 190 more rows
#> # ℹ 9 more variables: accessible <lgl>, dist_road_km <dbl>,
#> # food_insecurity_pct <dbl>, cost <dbl>, .weight <dbl>, .sample_id <int>,
#> # .stage <int>, .weight_1 <dbl>, .fpc_1 <int>
# Power allocation (Bankier, 1988)
sampling_design() |>
  stratify_by(
    region,
    alloc = "power",
    cv = data.frame(
      region = levels(bfa_eas$region),
      cv = c(0.40, 0.35, 0.12, 0.20, 0.30, 0.18,
             0.15, 0.38, 0.22, 0.32, 0.17, 0.45, 0.25)
    ),
    importance = data.frame(
      region = levels(bfa_eas$region),
      importance = c(60, 40, 120, 70, 80, 65,
                     50, 55, 90, 75, 45, 35, 30)
    ),
    power = 0.5
  ) |>
  draw(n = 200) |>
  execute(bfa_eas, seed = 7)
#> # A tbl_sample: 200 × 17
#> # Weights: 74.5 [39.24, 155.6]
#> ea_id region province commune urban_rural population households area_km2
#> * <chr> <fct> <fct> <fct> <fct> <dbl> <int> <dbl>
#> 1 EA_05963 Boucle … Sourou Kiemba… Rural 1173 187 23.8
#> 2 EA_13801 Boucle … Sourou Tougan Rural 613 85 16.7
#> 3 EA_12886 Boucle … Banwa Tansila Rural 1277 163 22.7
#> 4 EA_07600 Boucle … Kossi Madouba Rural 1781 226 26.6
#> 5 EA_14037 Boucle … Bale Yaho Rural 1623 214 32.4
#> 6 EA_03723 Boucle … Kossi Dokui Rural 1584 253 2.02
#> 7 EA_13703 Boucle … Sourou Toeni Rural 1618 247 216.
#> 8 EA_10154 Boucle … Mouhoun Ouarko… Rural 1526 212 60.3
#> 9 EA_02133 Boucle … Mouhoun Bondok… Rural 1107 169 32.3
#> 10 EA_10378 Boucle … Bale Pa Rural 1128 159 2.43
#> # ℹ 190 more rows
#> # ℹ 9 more variables: accessible <lgl>, dist_road_km <dbl>,
#> # food_insecurity_pct <dbl>, cost <dbl>, .weight <dbl>, .sample_id <int>,
#> # .stage <int>, .weight_1 <dbl>, .fpc_1 <int>
# Custom sample sizes per stratum using a data frame
custom_sizes <- data.frame(
  region = levels(bfa_eas$region),
  n = c(20, 12, 25, 18, 22, 16, 14, 15, 20, 18, 12, 10, 8)
)
sampling_design() |>
  stratify_by(region) |>
  draw(n = custom_sizes) |>
  execute(bfa_eas, seed = 2026)
#> # A tbl_sample: 210 × 17
#> # Weights: 70.95 [43.43, 106]
#> ea_id region province commune urban_rural population households area_km2
#> * <chr> <fct> <fct> <fct> <fct> <dbl> <int> <dbl>
#> 1 EA_08652 Boucle … Kossi Nouna Rural 1393 181 17.6
#> 2 EA_10131 Boucle … Mouhoun Ouarko… Rural 1123 156 23.4
#> 3 EA_06881 Boucle … Banwa Kouka Rural 1229 142 9.3
#> 4 EA_04515 Boucle … Nayala Gassan Rural 1556 190 42.6
#> 5 EA_10374 Boucle … Bale Pa Rural 1150 162 82.3
#> 6 EA_13719 Boucle … Nayala Toma Rural 955 99 0.98
#> 7 EA_12390 Boucle … Banwa Solenzo Rural 2953 352 25.6
#> 8 EA_06874 Boucle … Banwa Kouka Rural 1723 198 17.7
#> 9 EA_02110 Boucle … Mouhoun Bondok… Rural 1413 216 16.8
#> 10 EA_11690 Boucle … Banwa Sanaba Rural 929 135 1.9
#> # ℹ 200 more rows
#> # ℹ 9 more variables: accessible <lgl>, dist_road_km <dbl>,
#> # food_insecurity_pct <dbl>, cost <dbl>, .weight <dbl>, .sample_id <int>,
#> # .stage <int>, .weight_1 <dbl>, .fpc_1 <int>
# Multiple stratification variables
sampling_design() |>
  stratify_by(region, urban_rural, alloc = "proportional") |>
  draw(n = 300) |>
  execute(bfa_eas, seed = 2025)
#> # A tbl_sample: 300 × 17
#> # Weights: 49.51 [41.33, 56.67]
#> ea_id region province commune urban_rural population households area_km2
#> * <chr> <fct> <fct> <fct> <fct> <dbl> <int> <dbl>
#> 1 EA_03197 Boucle … Mouhoun Dedoug… Rural 1296 230 14.3
#> 2 EA_12879 Boucle … Banwa Tansila Rural 1109 141 15.6
#> 3 EA_03220 Boucle … Mouhoun Dedoug… Rural 1605 285 1.62
#> 4 EA_06875 Boucle … Banwa Kouka Rural 635 73 0.75
#> 5 EA_12079 Boucle … Bale Siby Rural 1100 151 47.6
#> 6 EA_04708 Boucle … Sourou Gomboro Rural 988 131 49.1
#> 7 EA_12880 Boucle … Banwa Tansila Rural 1181 151 23.6
#> 8 EA_10119 Boucle … Mouhoun Ouarko… Rural 904 125 38.2
#> 9 EA_03179 Boucle … Mouhoun Dedoug… Rural 1368 243 43.6
#> 10 EA_13693 Boucle … Sourou Toeni Rural 1112 170 1.24
#> # ℹ 290 more rows
#> # ℹ 9 more variables: accessible <lgl>, dist_road_km <dbl>,
#> # food_insecurity_pct <dbl>, cost <dbl>, .weight <dbl>, .sample_id <int>,
#> # .stage <int>, .weight_1 <dbl>, .fpc_1 <int>