stratify_by() specifies stratification variables and optional allocation
methods for a sampling design. Stratification ensures representation from
all subgroups defined by the stratification variables.
Usage
stratify_by(
  .data,
  ...,
  alloc = NULL,
  variance = NULL,
  cost = NULL,
  cv = NULL,
  importance = NULL,
  power = NULL
)
Arguments
- .data
A sampling_design object (piped from sampling_design() or add_stage()).
- ...
Stratification variables specified as bare column names.
- alloc
Character string specifying the allocation method. One of:
NULL (default): No allocation; n in draw() is per stratum
"equal": Equal allocation across strata
"proportional": Proportional to stratum size
"neyman": Neyman optimal allocation (requires variance)
"optimal": Cost-variance optimal allocation (requires variance and cost)
"power": Power allocation (requires cv and importance)
- variance
Stratum variances for Neyman or optimal allocation. Either a data frame with columns for all stratification variables plus a var column, or a named numeric vector (when using a single stratification variable) where names correspond to stratum levels. For named vectors, names must match the values in the stratification column (for example c(A = 1.2, B = 0.8)).
- cost
Stratum costs for optimal allocation. Either a data frame with columns for all stratification variables plus a cost column, or a named numeric vector (when using a single stratification variable) where names correspond to stratum levels. For named vectors, names must match the values in the stratification column.
- cv
Stratum coefficients of variation (\(C_h\)) for power allocation. Either a data frame with stratification columns plus a cv column, or a named numeric vector for a single stratification variable (names are stratum levels).
- importance
Stratum importance measure (\(X_h\)) for power allocation. Either a data frame with stratification columns plus an importance column, or a named numeric vector for a single stratification variable (names are stratum levels).
- power
Power exponent \(q\) for power allocation. Must satisfy \(0 \le q \le 1\). Defaults to 0.5.
Details
Allocation Methods
When no alloc is specified, the n parameter in draw() is interpreted
as the sample size per stratum. When an alloc method is specified,
n becomes the total sample size to be distributed according to the
allocation method.
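The two readings of n can be checked with simple arithmetic. The sketch below uses base R only and makes no package calls. The 13 strata correspond to the 13 regions used in the examples; the variable names are illustrative.

```r
n_strata <- 13  # number of strata, e.g. the 13 regions in the examples

# No alloc: n is taken in every stratum, so the total grows with the strata
n_per_stratum <- 20
total_without_alloc <- n_strata * n_per_stratum
total_without_alloc
#> [1] 260

# alloc = "equal": n is the total and is shared evenly across strata
n_total <- 260
equal_share <- n_total / n_strata
equal_share
#> [1] 20
```

The first case matches the 260-row result of the simple stratification example in the Examples section.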
Proportional Allocation
Each stratum receives \(n \times N_h/N\) units, where \(N_h\) is the stratum population size and \(N\) is the total population size.
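A minimal base R sketch of this formula, using made-up stratum sizes (the names N_h and n_h are illustrative, not package objects). How the package rounds non-integer results is not covered here.

```r
# Hypothetical stratum population sizes (not taken from bfa_eas)
N_h <- c(A = 500, B = 300, C = 200)
n <- 100  # total sample size

# Proportional allocation: n_h = n * N_h / N
n_h <- n * N_h / sum(N_h)
n_h
#>  A  B  C 
#> 50 30 20 
```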
Neyman Allocation
Minimizes variance for fixed sample size. Each stratum receives: \(n \times (N_h \times S_h) / \sum(N_h \times S_h)\) where \(S_h\) is the stratum standard deviation.
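As a worked sketch in base R, with hypothetical stratum sizes and variances. It assumes \(S_h\) is obtained as the square root of the supplied variance; the names N_h, var_h, and S_h are illustrative.

```r
# Hypothetical inputs (not taken from bfa_eas)
N_h   <- c(A = 500, B = 300, C = 200)   # stratum population sizes
var_h <- c(A = 1, B = 4, C = 12.25)     # stratum variances
n     <- 90                             # total sample size

# Assumption: standard deviation is the square root of the variance
S_h <- sqrt(var_h)

# Neyman allocation: n_h = n * N_h * S_h / sum(N_h * S_h)
n_h <- n * (N_h * S_h) / sum(N_h * S_h)
n_h
#>  A  B  C 
#> 25 30 35 
```

Stratum C is the smallest but the most variable, so it receives the largest share.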
Optimal Allocation
Minimizes variance for fixed cost (or cost for fixed variance). Each stratum receives: \(n \times (N_h \times S_h / \sqrt{C_h}) / \sum(N_h \times S_h / \sqrt{C_h})\) where \(C_h\) is the per-unit cost in stratum \(h\). Note that \(C_h\) denotes cost here, whereas in power allocation it denotes the stratum CV.
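A base R sketch of the same calculation with hypothetical inputs. The names N_h, S_h, and cost_h are illustrative, and the values are chosen so the result is easy to verify by hand.

```r
# Hypothetical inputs (not taken from bfa_eas)
N_h    <- c(A = 500, B = 300, C = 200)  # stratum population sizes
S_h    <- c(A = 1, B = 2, C = 3.5)      # stratum standard deviations
cost_h <- c(A = 1, B = 4, C = 1)        # per-unit cost in each stratum
n      <- 90                            # total sample size

# Optimal allocation: weight each stratum by N_h * S_h / sqrt(cost)
w_h <- N_h * S_h / sqrt(cost_h)
n_h <- n * w_h / sum(w_h)
n_h
#>  A  B  C 
#> 30 18 42 
```

With equal costs this reduces to Neyman allocation; here the higher cost in stratum B pulls its share down from 30 to 18.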
Power Allocation
Power allocation (Bankier, 1988) is a compromise allocation: \(n_h \propto C_h \times X_h^q\), where \(C_h\) is stratum CV, \(X_h\) is a stratum importance measure, and \(q \in [0, 1]\).
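A base R sketch of the proportionality with hypothetical inputs. It assumes the shares are normalised so that they sum to the total n; the names cv_h and X_h are illustrative.

```r
# Hypothetical inputs (not taken from bfa_eas)
cv_h <- c(A = 0.2, B = 0.3, C = 0.4)   # stratum coefficients of variation
X_h  <- c(A = 100, B = 400, C = 900)   # stratum importance measures
q    <- 0.5                            # power exponent, 0 <= q <= 1
n    <- 100                            # total sample size

# Power allocation: n_h proportional to cv_h * X_h^q
score <- cv_h * X_h^q
# Assumption: shares are scaled so the allocation sums to n
n_h <- n * score / sum(score)
round(n_h, 2)
#>  A  B  C 
#> 10 30 60 
```

Setting q = 0 ignores the importance measure entirely, while q = 1 weights strata by it in full.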
Custom Allocation
For custom stratum-specific sample sizes or rates, pass a data frame
directly to the n or frac argument in draw(). The data frame must
contain columns for all stratification variables plus an n or frac column.
Data Frame Requirements
Auxiliary data frames (variance, cost, cv, importance) must contain:
All stratification variable columns (used as join keys)
The appropriate value column (var, cost, cv, or importance)
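One way to build such a data frame is to summarise a frame by stratum. The sketch below uses base R and a made-up frame; the frame, its y column, and the object names are hypothetical.

```r
# Hypothetical frame: one stratification variable and a numeric outcome
frame <- data.frame(
  region = c("North", "North", "North", "South", "South", "South"),
  y      = c(10, 12, 14, 20, 30, 40)
)

# One row per stratum: the stratification column plus a `var` column
variance_df <- aggregate(y ~ region, data = frame, FUN = var)
names(variance_df)[names(variance_df) == "y"] <- "var"
variance_df
#>   region var
#> 1  North   4
#> 2  South 100

# Equivalent named vector form for a single stratification variable
variance_vec <- setNames(variance_df$var, variance_df$region)
```

Either object could then be supplied as the variance argument, provided the region values match those in the data being sampled.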
See also
sampling_design() for creating designs,
draw() for specifying sample sizes,
cluster_by() for cluster sampling
Examples
# Simple stratification: 20 EAs per region
sampling_design() |>
  stratify_by(region) |>
  draw(n = 20) |>
  execute(bfa_eas, seed = 1234)
#> # A tbl_sample: 260 × 17
#> # Weights: 57.44 [30.75, 79.5]
#> ea_id region province commune urban_rural population households area_km2
#> * <chr> <fct> <fct> <fct> <fct> <dbl> <int> <dbl>
#> 1 EA_03266 Boucle … Sourou Di Rural 3642 475 2.62
#> 2 EA_11550 Boucle … Mouhoun Safane Rural 549 79 16.4
#> 3 EA_04535 Boucle … Nayala Gassan Rural 1900 210 9.41
#> 4 EA_10169 Boucle … Mouhoun Ouarko… Rural 1201 165 41.8
#> 5 EA_03624 Boucle … Kossi Djibas… Rural 1615 206 11.0
#> 6 EA_03184 Boucle … Mouhoun Dedoug… Rural 966 171 0.5
#> 7 EA_03765 Boucle … Kossi Dokui Rural 1314 163 22.0
#> 8 EA_03213 Boucle … Mouhoun Dedoug… Rural 821 146 0.25
#> 9 EA_12428 Boucle … Banwa Solenzo Rural 1062 143 29.3
#> 10 EA_03179 Boucle … Mouhoun Dedoug… Rural 1368 243 43.6
#> # ℹ 250 more rows
#> # ℹ 9 more variables: accessible <lgl>, dist_road_km <dbl>,
#> # food_insecurity_pct <dbl>, cost <dbl>, .weight <dbl>, .sample_id <int>,
#> # .stage <int>, .weight_1 <dbl>, .fpc_1 <int>
# Proportional allocation across regions
sampling_design() |>
  stratify_by(region, alloc = "proportional") |>
  draw(n = 200) |>
  execute(bfa_eas, seed = 123)
#> # A tbl_sample: 200 × 17
#> # Weights: 74.67 [72.2, 76.88]
#> ea_id region province commune urban_rural population households area_km2
#> * <chr> <fct> <fct> <fct> <fct> <dbl> <int> <dbl>
#> 1 EA_12443 Boucle … Banwa Solenzo Rural 1395 187 8.03
#> 2 EA_12900 Boucle … Banwa Tansila Rural 1164 140 39.1
#> 3 EA_11082 Boucle … Bale Pompoi Rural 1198 162 33.3
#> 4 EA_00767 Boucle … Kossi Barani Rural 1433 188 17.5
#> 5 EA_12105 Boucle … Bale Siby Rural 912 135 8.55
#> 6 EA_03217 Boucle … Mouhoun Dedoug… Rural 998 177 3.4
#> 7 EA_04552 Boucle … Nayala Gassan Rural 649 72 0.49
#> 8 EA_04732 Boucle … Sourou Gomboro Rural 1825 257 32.9
#> 9 EA_14304 Boucle … Nayala Ye Rural 1323 190 10.9
#> 10 EA_14319 Boucle … Nayala Ye Rural 1030 148 18.1
#> # ℹ 190 more rows
#> # ℹ 9 more variables: accessible <lgl>, dist_road_km <dbl>,
#> # food_insecurity_pct <dbl>, cost <dbl>, .weight <dbl>, .sample_id <int>,
#> # .stage <int>, .weight_1 <dbl>, .fpc_1 <int>
# Neyman allocation using pre-computed variances
sampling_design() |>
  stratify_by(region, alloc = "neyman", variance = bfa_eas_variance) |>
  draw(n = 200) |>
  execute(bfa_eas, seed = 12)
#> # A tbl_sample: 200 × 17
#> # Weights: 74.67 [52.65, 123]
#> ea_id region province commune urban_rural population households area_km2
#> * <chr> <fct> <fct> <fct> <fct> <dbl> <int> <dbl>
#> 1 EA_12478 Boucle … Banwa Solenzo Rural 2360 317 27.0
#> 2 EA_12374 Boucle … Banwa Solenzo Rural 1929 259 30.2
#> 3 EA_11739 Boucle … Banwa Sanaba Rural 816 104 34.3
#> 4 EA_14322 Boucle … Nayala Ye Rural 1281 184 25.9
#> 5 EA_03138 Boucle … Mouhoun Dedoug… Rural 1461 259 18.2
#> 6 EA_11077 Boucle … Bale Pompoi Rural 917 124 26.8
#> 7 EA_13839 Boucle … Sourou Tougan Rural 1934 265 24.2
#> 8 EA_12957 Boucle … Mouhoun Tcheri… Rural 1243 199 1.15
#> 9 EA_13862 Boucle … Sourou Tougan Rural 902 124 59.4
#> 10 EA_02077 Boucle … Kossi Bombor… Rural 1036 136 7.59
#> # ℹ 190 more rows
#> # ℹ 9 more variables: accessible <lgl>, dist_road_km <dbl>,
#> # food_insecurity_pct <dbl>, cost <dbl>, .weight <dbl>, .sample_id <int>,
#> # .stage <int>, .weight_1 <dbl>, .fpc_1 <int>
# Optimal allocation considering both variance and cost
sampling_design() |>
  stratify_by(region, alloc = "optimal",
              variance = bfa_eas_variance,
              cost = bfa_eas_cost) |>
  draw(n = 200) |>
  execute(bfa_eas, seed = 1)
#> # A tbl_sample: 200 × 17
#> # Weights: 74.67 [60.55, 107.25]
#> ea_id region province commune urban_rural population households area_km2
#> * <chr> <fct> <fct> <fct> <fct> <dbl> <int> <dbl>
#> 1 EA_10182 Boucle … Mouhoun Ouarko… Rural 1347 185 33.0
#> 2 EA_03982 Boucle … Kossi Doumba… Rural 1558 191 16.6
#> 3 EA_10352 Boucle … Bale Ouri Rural 767 98 17.6
#> 4 EA_03209 Boucle … Mouhoun Dedoug… Rural 446 79 18.1
#> 5 EA_12908 Boucle … Banwa Tansila Rural 1076 129 9.45
#> 6 EA_11640 Boucle … Banwa Sami Rural 912 137 43.2
#> 7 EA_06884 Boucle … Banwa Kouka Rural 1642 226 12.5
#> 8 EA_13757 Boucle … Nayala Toma Rural 973 141 0.88
#> 9 EA_05785 Boucle … Sourou Kassoum Rural 1315 195 1.5
#> 10 EA_03598 Boucle … Kossi Djibas… Rural 725 93 1.01
#> # ℹ 190 more rows
#> # ℹ 9 more variables: accessible <lgl>, dist_road_km <dbl>,
#> # food_insecurity_pct <dbl>, cost <dbl>, .weight <dbl>, .sample_id <int>,
#> # .stage <int>, .weight_1 <dbl>, .fpc_1 <int>
# Power allocation (Bankier, 1988)
sampling_design() |>
  stratify_by(
    region,
    alloc = "power",
    cv = data.frame(
      region = levels(bfa_eas$region),
      cv = c(0.40, 0.35, 0.12, 0.20, 0.30, 0.18,
             0.15, 0.38, 0.22, 0.32, 0.17, 0.45, 0.25)
    ),
    importance = data.frame(
      region = levels(bfa_eas$region),
      importance = c(60, 40, 120, 70, 80, 65,
                     50, 55, 90, 75, 45, 35, 30)
    ),
    power = 0.5
  ) |>
  draw(n = 200) |>
  execute(bfa_eas, seed = 7)
#> # A tbl_sample: 200 × 17
#> # Weights: 74.67 [39.24, 155.6]
#> ea_id region province commune urban_rural population households area_km2
#> * <chr> <fct> <fct> <fct> <fct> <dbl> <int> <dbl>
#> 1 EA_04731 Boucle … Sourou Gomboro Rural 809 114 17.4
#> 2 EA_13853 Boucle … Sourou Tougan Rural 1078 148 52.7
#> 3 EA_13726 Boucle … Sourou Toeni Rural 1545 160 38.2
#> 4 EA_12913 Boucle … Banwa Tansila Rural 1277 153 22.7
#> 5 EA_07627 Boucle … Kossi Madouba Rural 1781 220 26.6
#> 6 EA_14064 Boucle … Bale Yaho Rural 1623 242 32.4
#> 7 EA_03750 Boucle … Kossi Dokui Rural 1584 196 2.02
#> 8 EA_07243 Boucle … Sourou Lankoue Rural 1870 246 40.0
#> 9 EA_10181 Boucle … Mouhoun Ouarko… Rural 1526 209 60.3
#> 10 EA_02133 Boucle … Mouhoun Bondok… Rural 1107 169 32.3
#> # ℹ 190 more rows
#> # ℹ 9 more variables: accessible <lgl>, dist_road_km <dbl>,
#> # food_insecurity_pct <dbl>, cost <dbl>, .weight <dbl>, .sample_id <int>,
#> # .stage <int>, .weight_1 <dbl>, .fpc_1 <int>
# Custom sample sizes per stratum using a data frame
custom_sizes <- data.frame(
  region = levels(bfa_eas$region),
  n = c(20, 12, 25, 18, 22, 16, 14, 15, 20, 18, 12, 10, 8)
)
sampling_design() |>
  stratify_by(region) |>
  draw(n = custom_sizes) |>
  execute(bfa_eas, seed = 2026)
#> # A tbl_sample: 210 × 17
#> # Weights: 71.11 [43.93, 106]
#> ea_id region province commune urban_rural population households area_km2
#> * <chr> <fct> <fct> <fct> <fct> <dbl> <int> <dbl>
#> 1 EA_08679 Boucle … Kossi Nouna Rural 1393 206 17.6
#> 2 EA_10158 Boucle … Mouhoun Ouarko… Rural 1123 154 23.4
#> 3 EA_06908 Boucle … Banwa Kouka Rural 1229 169 9.3
#> 4 EA_04542 Boucle … Nayala Gassan Rural 1556 172 42.6
#> 5 EA_10401 Boucle … Bale Pa Rural 1150 165 82.3
#> 6 EA_13746 Boucle … Nayala Toma Rural 955 138 0.98
#> 7 EA_12417 Boucle … Banwa Solenzo Rural 2953 397 25.6
#> 8 EA_06901 Boucle … Banwa Kouka Rural 1723 237 17.7
#> 9 EA_02110 Boucle … Mouhoun Bondok… Rural 1413 216 16.8
#> 10 EA_11717 Boucle … Banwa Sanaba Rural 929 118 1.9
#> # ℹ 200 more rows
#> # ℹ 9 more variables: accessible <lgl>, dist_road_km <dbl>,
#> # food_insecurity_pct <dbl>, cost <dbl>, .weight <dbl>, .sample_id <int>,
#> # .stage <int>, .weight_1 <dbl>, .fpc_1 <int>
# Multiple stratification variables
sampling_design() |>
  stratify_by(region, urban_rural, alloc = "proportional") |>
  draw(n = 300) |>
  execute(bfa_eas, seed = 2025)
#> # A tbl_sample: 300 × 17
#> # Weights: 49.7 [29, 62]
#> ea_id region province commune urban_rural population households area_km2
#> * <chr> <fct> <fct> <fct> <fct> <dbl> <int> <dbl>
#> 1 EA_03197 Boucle … Mouhoun Dedoug… Rural 1296 230 14.3
#> 2 EA_12906 Boucle … Banwa Tansila Rural 1109 133 15.6
#> 3 EA_03220 Boucle … Mouhoun Dedoug… Rural 1605 285 1.62
#> 4 EA_06902 Boucle … Banwa Kouka Rural 635 87 0.75
#> 5 EA_12106 Boucle … Bale Siby Rural 1100 163 47.6
#> 6 EA_03257 Boucle … Sourou Di Rural 1192 155 56.3
#> 7 EA_12907 Boucle … Banwa Tansila Rural 1181 142 23.6
#> 8 EA_10146 Boucle … Mouhoun Ouarko… Rural 904 124 38.2
#> 9 EA_03179 Boucle … Mouhoun Dedoug… Rural 1368 243 43.6
#> 10 EA_07233 Boucle … Sourou Lankoue Rural 1411 186 19.0
#> # ℹ 290 more rows
#> # ℹ 9 more variables: accessible <lgl>, dist_road_km <dbl>,
#> # food_insecurity_pct <dbl>, cost <dbl>, .weight <dbl>, .sample_id <int>,
#> # .stage <int>, .weight_1 <dbl>, .fpc_1 <int>