stratify_by() specifies stratification variables and optional allocation
methods for a sampling design. Stratification ensures representation from
all subgroups defined by the stratification variables.
Usage
stratify_by(
  .data,
  ...,
  alloc = NULL,
  variance = NULL,
  cost = NULL,
  cv = NULL,
  importance = NULL,
  power = NULL
)
Arguments
- .data
A sampling_design object (piped from sampling_design() or add_stage()).
- ...
Stratification variables specified as bare column names.
- alloc
Character string specifying the allocation method. One of:
  NULL (default): No allocation; n in draw() is per stratum
  "equal": Equal allocation across strata
  "proportional": Proportional to stratum size
  "neyman": Neyman optimal allocation (requires variance)
  "optimal": Cost-variance optimal allocation (requires variance and cost)
  "power": Power allocation (requires cv and importance)
- variance
Stratum variances for Neyman or optimal allocation. Either a data frame with columns for all stratification variables plus a var column, or a named numeric vector (when using a single stratification variable) where names correspond to stratum levels. For named vectors, names must match the values in the stratification column (for example c(A = 1.2, B = 0.8)).
- cost
Stratum costs for optimal allocation. Either a data frame with columns for all stratification variables plus a cost column, or a named numeric vector (when using a single stratification variable) where names correspond to stratum levels. For named vectors, names must match the values in the stratification column.
- cv
Stratum coefficients of variation (\(C_h\)) for power allocation. Either a data frame with stratification columns plus a cv column, or a named numeric vector for a single stratification variable (names are stratum levels).
- importance
Stratum importance measure (\(X_h\)) for power allocation. Either a data frame with stratification columns plus an importance column, or a named numeric vector for a single stratification variable (names are stratum levels).
- power
Power exponent \(q\) for power allocation. Must satisfy \(0 \le q \le 1\). Defaults to 0.5.
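The two accepted shapes for variance, cost, cv, and importance carry the same information. As a minimal base-R sketch, assuming a single stratification column named region with levels A and B (both hypothetical), a named vector can be rewritten as the equivalent data frame:

```r
# Named-vector form: names are the stratum levels.
variance_vec <- c(A = 1.2, B = 0.8)

# Equivalent data-frame form: one key column per stratification variable
# (here the hypothetical `region`) plus the value column `var`.
variance_df <- data.frame(
  region = names(variance_vec),
  var = unname(variance_vec)
)

variance_df
```

The data-frame form is the only option once more than one stratification variable is used, since a named vector can carry just one set of level names.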
Details
Allocation Methods
When no alloc is specified, the n parameter in draw() is interpreted
as the sample size per stratum. When an alloc method is specified,
n becomes the total sample size to be distributed according to the
allocation method.
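The difference between the two readings of n can be sketched in base R. The stratum names below are made up, and the arithmetic only illustrates the rule described above; it does not call the package and makes no claim about how draw() rounds.

```r
# Hypothetical strata; the names are made up for illustration.
strata <- c("North", "South", "East")

# alloc = NULL: n is read as a per-stratum size, so the total
# grows with the number of strata.
n <- 20
n_h_per_stratum <- setNames(rep(n, length(strata)), strata)
total_without_alloc <- sum(n_h_per_stratum)   # 3 strata x 20 = 60

# alloc = "equal": n is read as a total and split evenly across strata.
n_total <- 60
n_h_equal <- setNames(rep(n_total / length(strata), length(strata)), strata)
total_with_alloc <- sum(n_h_equal)            # 60
```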
Proportional Allocation
Each stratum receives \(n \times N_h/N\) units, where \(N_h\) is the stratum population size and \(N\) is the total population size.
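As a worked illustration of the proportional formula, here is a base-R sketch with hypothetical stratum sizes. It computes the unrounded shares only; how the package rounds fractional sizes is not covered here.

```r
# Hypothetical stratum population sizes N_h (N = 1000).
N_h <- c(North = 500, South = 300, East = 200)
n <- 100

# n_h = n * N_h / N
n_h <- n * N_h / sum(N_h)
n_h
# North = 50, South = 30, East = 20
```

Every stratum ends up with the same sampling fraction, here 100 / 1000 = 10%.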
Neyman Allocation
Minimizes variance for a fixed total sample size. Each stratum receives \(n \times (N_h \times S_h) / \sum(N_h \times S_h)\) units, where \(S_h\) is the stratum standard deviation.
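The Neyman formula can be checked by hand with a small base-R sketch. The stratum sizes and variances are hypothetical. Because the variance argument holds variances while the formula uses standard deviations, the sketch takes a square root first; that step is an assumption about how the inputs relate, not a statement about the package internals.

```r
# Hypothetical stratum sizes and variances.
N_h <- c(North = 500, South = 300, East = 200)
var_h <- c(North = 4, South = 16, East = 25)
n <- 160

# The formula uses standard deviations, so convert the variances.
S_h <- sqrt(var_h)                 # 2, 4, 5

# n_h = n * (N_h * S_h) / sum(N_h * S_h)
weight_h <- N_h * S_h              # 1000, 1200, 1000
n_h <- n * weight_h / sum(weight_h)
n_h
# North = 50, South = 60, East = 50
```

South is smaller than North but receives more units because its values are more variable.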
Optimal Allocation
Minimizes variance for fixed cost (or cost for fixed variance). Each stratum receives \(n \times (N_h \times S_h / \sqrt{C_h}) / \sum(N_h \times S_h / \sqrt{C_h})\) units, where \(C_h\) is the per-unit cost in stratum \(h\).
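A base-R sketch of the cost-adjusted formula, using hypothetical sizes, variances, and per-unit costs. As with the Neyman case, the square root of the variance is taken on the assumption that the formula's \(S_h\) is the standard deviation of the supplied variances.

```r
# Hypothetical stratum sizes, variances, and per-unit costs.
N_h <- c(North = 500, South = 300, East = 200)
var_h <- c(North = 4, South = 16, East = 25)
cost_h <- c(North = 1, South = 4, East = 4)
n <- 210

S_h <- sqrt(var_h)                          # 2, 4, 5

# n_h = n * (N_h * S_h / sqrt(C_h)) / sum(N_h * S_h / sqrt(C_h))
weight_h <- N_h * S_h / sqrt(cost_h)        # 1000, 600, 500
n_h <- n * weight_h / sum(weight_h)
n_h
# North = 100, South = 60, East = 50
```

Quadrupling the per-unit cost halves a stratum's weight, so the cheaper North stratum absorbs a larger share than it would under Neyman allocation with the same variances.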
Power Allocation
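Power allocation (Bankier, 1988) is a compromise allocation: \(n_h \propto C_h \times X_h^q\), where \(C_h\) is the stratum coefficient of variation (not the per-unit cost denoted \(C_h\) under Optimal Allocation), \(X_h\) is a stratum importance measure, and \(q \in [0, 1]\) is the power exponent.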
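The proportionality above can be turned into stratum sizes by normalising the weights so they sum to n. The base-R sketch below uses hypothetical CVs and importance values; the normalisation step is the usual reading of a proportional-to rule and is stated here as an assumption.

```r
# Hypothetical stratum CVs and importance measures.
cv_h <- c(North = 0.4, South = 0.3, East = 0.2)
X_h <- c(North = 100, South = 400, East = 900)
q <- 0.5
n <- 160

# n_h proportional to C_h * X_h^q, scaled so the sizes sum to n.
weight_h <- cv_h * X_h^q           # roughly 4, 6, 6
n_h <- n * weight_h / sum(weight_h)
n_h
# approximately North = 40, South = 60, East = 60
```

At the extremes of the exponent, q = 0 drops the importance measure entirely (sizes proportional to the CVs), while q = 1 weights each stratum by its CV times its full importance.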
Custom Allocation
For custom stratum-specific sample sizes or rates, pass a data frame
directly to the n or frac argument in draw(). The data frame must
contain columns for all stratification variables plus an n or frac column.
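For a design with two stratification variables, a custom data frame needs one row per stratum combination. The base-R sketch below builds such a table with a frac column; the level names and rates are hypothetical, and the final call to draw() is shown only as a comment because it depends on the package and data.

```r
# One row per combination of the (hypothetical) stratum levels.
custom_frac <- expand.grid(
  region = c("North", "South"),
  urban_rural = c("Urban", "Rural"),
  stringsAsFactors = FALSE
)

# Hypothetical rates: sample urban strata more heavily than rural ones.
custom_frac$frac <- ifelse(custom_frac$urban_rural == "Urban", 0.10, 0.02)

custom_frac

# The table would then be passed as described above, for example:
# stratify_by(region, urban_rural) |> draw(frac = custom_frac)
```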
Data Frame Requirements
Auxiliary data frames (variance, cost, cv, importance) must contain:
  All stratification variable columns (used as join keys)
  The appropriate value column (var, cost, cv, or importance)
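One way to produce a conforming variance table is to aggregate a pilot or frame data set by the stratification column. This is a base-R sketch under assumed names: the data frame pilot, its outcome column y, and the region levels are all hypothetical.

```r
# Hypothetical pilot data: one outcome `y` measured in two regions.
pilot <- data.frame(
  region = rep(c("North", "South"), each = 4),
  y = c(1, 2, 3, 4, 10, 20, 30, 40)
)

# Per-stratum variance, keyed by the stratification column.
variance_df <- aggregate(y ~ region, data = pilot, FUN = var)

# Rename the value column to the expected name `var`.
names(variance_df)[names(variance_df) == "y"] <- "var"

# Sanity check: every stratum in the data has a row in the auxiliary table.
missing_strata <- setdiff(unique(pilot$region), variance_df$region)
length(missing_strata) == 0
```

Checking for missing strata before building the design is cheap and catches mismatched level names, which is the most likely cause of a failed join on the key columns.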
See also
sampling_design() for creating designs,
draw() for specifying sample sizes,
cluster_by() for cluster sampling
Examples
# Simple stratification: 20 EAs per region
sampling_design() |>
  stratify_by(region) |>
  draw(n = 20) |>
  execute(bfa_eas, seed = 1234)
#> # A tbl_sample: 260 × 17
#> # Weights: 171.42 [80.6, 275.25]
#> ea_id region province commune urban_rural population households area_km2
#> * <int> <fct> <fct> <fct> <fct> <dbl> <int> <dbl>
#> 1 9635 Boucle du … Banwa Sanaba Rural 357 45 0.77
#> 2 9908 Boucle du … Bale Siby Rural 113 17 8.92
#> 3 20945 Boucle du … Mouhoun Bondok… Rural 395 60 7.58
#> 4 34029 Boucle du … Banwa Sami Rural 363 55 8.85
#> 5 31936 Boucle du … Sourou Kiemba… Rural 683 89 0.6
#> 6 26038 Boucle du … Mouhoun Dedoug… Rural 351 62 8.67
#> 7 41793 Boucle du … Kossi Dokui Rural 184 23 8.27
#> 8 5643 Boucle du … Mouhoun Kona Rural 101 14 8.38
#> 9 21026 Boucle du … Mouhoun Bondok… Rural 64 10 7.92
#> 10 33091 Boucle du … Kossi Nouna Rural 28 4 8.55
#> # ℹ 250 more rows
#> # ℹ 9 more variables: accessible <lgl>, dist_road_km <dbl>,
#> # food_insecurity_pct <dbl>, cost <dbl>, .weight <dbl>, .sample_id <int>,
#> # .stage <int>, .weight_1 <dbl>, .fpc_1 <int>
# Proportional allocation across regions
sampling_design() |>
  stratify_by(region, alloc = "proportional") |>
  draw(n = 200) |>
  execute(bfa_eas, seed = 123)
#> # A tbl_sample: 200 × 17
#> # Weights: 222.85 [217.78, 237.43]
#> ea_id region province commune urban_rural population households area_km2
#> * <int> <fct> <fct> <fct> <fct> <dbl> <int> <dbl>
#> 1 33181 Boucle du … Kossi Nouna Rural 133 20 0.07
#> 2 33229 Boucle du … Kossi Nouna Rural 48 7 4.53
#> 3 21506 Boucle du … Kossi Doumba… Rural 485 59 4.82
#> 4 45264 Boucle du … Bale Pa Rural 825 119 0.66
#> 5 26201 Boucle du … Sourou Di Rural 95 12 21.1
#> 6 26077 Boucle du … Mouhoun Dedoug… Rural 176 31 8.96
#> 7 25510 Boucle du … Kossi Bombor… Rural 246 32 0.44
#> 8 23774 Boucle du … Banwa Solenzo Rural 158 21 2.9
#> 9 8332 Boucle du … Mouhoun Ouarko… Rural 55 8 5.68
#> 10 43713 Boucle du … Mouhoun Safane Rural 97 14 6.46
#> # ℹ 190 more rows
#> # ℹ 9 more variables: accessible <lgl>, dist_road_km <dbl>,
#> # food_insecurity_pct <dbl>, cost <dbl>, .weight <dbl>, .sample_id <int>,
#> # .stage <int>, .weight_1 <dbl>, .fpc_1 <int>
# Neyman allocation using pre-computed variances
sampling_design() |>
  stratify_by(region, alloc = "neyman", variance = bfa_eas_variance) |>
  draw(n = 200) |>
  execute(bfa_eas, seed = 12)
#> # A tbl_sample: 200 × 17
#> # Weights: 222.85 [83.71, 2419.5]
#> ea_id region province commune urban_rural population households area_km2
#> * <int> <fct> <fct> <fct> <fct> <dbl> <int> <dbl>
#> 1 31987 Boucle du … Sourou Kiemba… Rural 24 3 7.24
#> 2 4968 Boucle du … Sourou Kassoum Rural 303 45 4.13
#> 3 4958 Boucle du … Sourou Kassoum Rural 142 21 8.24
#> 4 23903 Boucle du … Banwa Solenzo Rural 94 13 13.5
#> 5 25996 Boucle du … Mouhoun Dedoug… Rural 538 95 8.63
#> 6 21501 Boucle du … Kossi Doumba… Rural 247 30 7.93
#> 7 43794 Boucle du … Mouhoun Safane Rural 87 12 8.96
#> 8 9724 Boucle du … Banwa Sanaba Rural 51 6 8.02
#> 9 10979 Boucle du … Banwa Tansila Rural 71 9 7.72
#> 10 8893 Boucle du … Bale Pompoi Rural 522 71 0.48
#> # ℹ 190 more rows
#> # ℹ 9 more variables: accessible <lgl>, dist_road_km <dbl>,
#> # food_insecurity_pct <dbl>, cost <dbl>, .weight <dbl>, .sample_id <int>,
#> # .stage <int>, .weight_1 <dbl>, .fpc_1 <int>
# Optimal allocation considering both variance and cost
sampling_design() |>
  stratify_by(region, alloc = "optimal",
              variance = bfa_eas_variance,
              cost = bfa_eas_cost) |>
  draw(n = 200) |>
  execute(bfa_eas, seed = 1)
#> # A tbl_sample: 200 × 17
#> # Weights: 222.85 [88.79, 1613]
#> ea_id region province commune urban_rural population households area_km2
#> * <int> <fct> <fct> <fct> <fct> <dbl> <int> <dbl>
#> 1 9648 Boucle du … Banwa Sanaba Rural 279 35 8.37
#> 2 11547 Boucle du … Sourou Toeni Rural 49 5 21.4
#> 3 41824 Boucle du … Kossi Dokui Rural 75 9 15.8
#> 4 11012 Boucle du … Banwa Tansila Rural 592 71 0.95
#> 5 32308 Boucle du … Sourou Lanfie… Rural 57 8 9.49
#> 6 7017 Boucle du … Kossi Madouba Rural 599 74 0.74
#> 7 36700 Boucle du … Bale Fara Rural 402 59 6.36
#> 8 11611 Boucle du … Nayala Yaba Rural 111 15 8.97
#> 9 8342 Boucle du … Mouhoun Ouarko… Rural 94 13 8.1
#> 10 11626 Boucle du … Nayala Yaba Rural 56 7 8.31
#> # ℹ 190 more rows
#> # ℹ 9 more variables: accessible <lgl>, dist_road_km <dbl>,
#> # food_insecurity_pct <dbl>, cost <dbl>, .weight <dbl>, .sample_id <int>,
#> # .stage <int>, .weight_1 <dbl>, .fpc_1 <int>
# Power allocation (Bankier, 1988)
sampling_design() |>
  stratify_by(
    region,
    alloc = "power",
    cv = data.frame(
      region = levels(bfa_eas$region),
      cv = c(0.40, 0.35, 0.12, 0.20, 0.30, 0.18,
             0.15, 0.38, 0.22, 0.32, 0.17, 0.45, 0.25)
    ),
    importance = data.frame(
      region = levels(bfa_eas$region),
      importance = c(60, 40, 120, 70, 80, 65,
                     50, 55, 90, 75, 45, 35, 30)
    ),
    power = 0.5
  ) |>
  draw(n = 200) |>
  execute(bfa_eas, seed = 7)
#> # A tbl_sample: 200 × 17
#> # Weights: 222.85 [139.52, 388.8]
#> ea_id region province commune urban_rural population households area_km2
#> * <int> <fct> <fct> <fct> <fct> <dbl> <int> <dbl>
#> 1 10970 Boucle du … Banwa Tansila Rural 488 59 10.7
#> 2 32313 Boucle du … Sourou Lanfie… Rural 45 6 8.78
#> 3 21006 Boucle du … Mouhoun Bondok… Rural 102 16 3.08
#> 4 20930 Boucle du … Mouhoun Bondok… Rural 792 121 9.76
#> 5 26158 Boucle du … Mouhoun Dedoug… Rural 214 38 5.38
#> 6 10975 Boucle du … Banwa Tansila Rural 141 17 16.0
#> 7 44747 Boucle du … Nayala Toma Rural 73 11 10.2
#> 8 9030 Boucle du … Bale Poura Urban 562 90 8.94
#> 9 4080 Boucle du … Sourou Gomboro Rural 163 23 0.2
#> 10 34042 Boucle du … Banwa Sami Rural 100 15 7.46
#> # ℹ 190 more rows
#> # ℹ 9 more variables: accessible <lgl>, dist_road_km <dbl>,
#> # food_insecurity_pct <dbl>, cost <dbl>, .weight <dbl>, .sample_id <int>,
#> # .stage <int>, .weight_1 <dbl>, .fpc_1 <int>
# Custom sample sizes per stratum using a data frame
custom_sizes <- data.frame(
  region = levels(bfa_eas$region),
  n = c(20, 12, 25, 18, 22, 16, 14, 15, 20, 18, 12, 10, 8)
)
sampling_design() |>
  stratify_by(region) |>
  draw(n = custom_sizes) |>
  execute(bfa_eas, seed = 2026)
#> # A tbl_sample: 210 × 17
#> # Weights: 212.24 [115.14, 414.4]
#> ea_id region province commune urban_rural population households area_km2
#> * <int> <fct> <fct> <fct> <fct> <dbl> <int> <dbl>
#> 1 34795 Boucle du … Sourou Tougan Rural 45 6 14.0
#> 2 44496 Boucle du … Mouhoun Tcheri… Rural 180 29 16.6
#> 3 9624 Boucle du … Banwa Sanaba Rural 177 23 6.56
#> 4 7012 Boucle du … Kossi Madouba Rural 387 48 6.18
#> 5 44420 Boucle du … Mouhoun Tcheri… Rural 63 10 22.3
#> 6 515 Boucle du … Kossi Barani Rural 651 85 1.11
#> 7 1144 Boucle du … Bale Boromo Rural 75 10 8.92
#> 8 8378 Boucle du … Bale Ouri Rural 693 89 9.26
#> 9 4909 Boucle du … Sourou Kassoum Rural 662 98 0.65
#> 10 21094 Boucle du … Kossi Bouras… Rural 25 3 5.1
#> # ℹ 200 more rows
#> # ℹ 9 more variables: accessible <lgl>, dist_road_km <dbl>,
#> # food_insecurity_pct <dbl>, cost <dbl>, .weight <dbl>, .sample_id <int>,
#> # .stage <int>, .weight_1 <dbl>, .fpc_1 <int>
# Multiple stratification variables
sampling_design() |>
  stratify_by(region, urban_rural, alloc = "proportional") |>
  draw(n = 300) |>
  execute(bfa_eas, seed = 2025)
#> # A tbl_sample: 300 × 17
#> # Weights: 148.35 [82, 176]
#> ea_id region province commune urban_rural population households area_km2
#> * <int> <fct> <fct> <fct> <fct> <dbl> <int> <dbl>
#> 1 34056 Boucle du … Banwa Sami Rural 136 20 7.25
#> 2 12794 Boucle du … Kossi Djibas… Rural 61 8 8.98
#> 3 36709 Boucle du … Bale Fara Rural 341 50 6.47
#> 4 11042 Boucle du … Banwa Tansila Rural 118 14 8.43
#> 5 1167 Boucle du … Bale Boromo Rural 120 16 181.
#> 6 23952 Boucle du … Banwa Solenzo Rural 106 14 7.32
#> 7 32327 Boucle du … Sourou Lanfie… Rural 108 15 8.21
#> 8 44753 Boucle du … Nayala Toma Rural 72 10 7.27
#> 9 12732 Boucle du … Kossi Djibas… Rural 772 99 8.62
#> 10 11071 Boucle du … Banwa Tansila Rural 26 3 7.28
#> # ℹ 290 more rows
#> # ℹ 9 more variables: accessible <lgl>, dist_road_km <dbl>,
#> # food_insecurity_pct <dbl>, cost <dbl>, .weight <dbl>, .sample_id <int>,
#> # .stage <int>, .weight_1 <dbl>, .fpc_1 <int>