Define Stratification

stratify_by() specifies the stratification variables and, optionally, an allocation method for a sampling design. Stratification ensures that every subgroup defined by the stratification variables is represented in the sample.

Usage

stratify_by(
  .data,
  ...,
  alloc = NULL,
  variance = NULL,
  cost = NULL,
  cv = NULL,
  importance = NULL,
  power = NULL
)

Arguments

.data

A sampling_design object (piped from sampling_design() or add_stage()).

...

Stratification variables specified as bare column names.

alloc

Character string specifying the allocation method. One of:

NULL (default): No allocation; n in draw() is per stratum
"equal": Equal allocation across strata
"proportional": Proportional to stratum size
"neyman": Neyman optimal allocation (requires variance)
"optimal": Cost-variance optimal allocation (requires variance and cost)
"power": Power allocation (requires cv and importance)

variance

Stratum variances for Neyman or optimal allocation. Either a data frame with columns for all stratification variables plus a var column, or, when using a single stratification variable, a named numeric vector whose names match the values in the stratification column (for example c(A = 1.2, B = 0.8)).

cost

Stratum costs for optimal allocation. Either a data frame with columns for all stratification variables plus a cost column, or, when using a single stratification variable, a named numeric vector whose names match the values in the stratification column.

cv

Stratum coefficients of variation (\(C_h\)) for power allocation. Either a data frame with stratification columns plus a cv column, or a named numeric vector for a single stratification variable (names are stratum levels).

importance

Stratum importance measure (\(X_h\)) for power allocation. Either a data frame with stratification columns plus an importance column, or a named numeric vector for a single stratification variable (names are stratum levels).

power

Power exponent \(q\) for power allocation. Must satisfy \(0 \le q \le 1\). When left as NULL, the value 0.5 is used.

Value

A modified sampling_design object with stratification specified.

Details

Allocation Methods

When no alloc is specified, the n parameter in draw() is interpreted as the sample size per stratum. When an alloc method is specified, n becomes the total sample size to be distributed according to the allocation method.

Equal Allocation

Each stratum receives \(n/H\) units, where \(H\) is the number of strata.

Proportional Allocation

Each stratum receives \(n \times N_h/N\) units, where \(N_h\) is the stratum population size and \(N\) is the total population size.
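As a rough illustration of the formula, the following base R sketch computes proportional shares by hand. The stratum names and population sizes are hypothetical and are not taken from bfa_eas. The result is left unrounded because the rounding rule applied by the package is not described here.

```r
# Hypothetical stratum population sizes N_h (illustrative only)
N_h <- c(A = 500, B = 300, C = 200)
n   <- 100  # total sample size to distribute

# Proportional allocation: n_h = n * N_h / N, with N = sum(N_h)
proportional_n_h <- n * N_h / sum(N_h)
print(proportional_n_h)
```

With these inputs the strata receive 50, 30 and 20 units, which sum to the requested total of 100.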

Neyman Allocation

Minimizes variance for fixed sample size. Each stratum receives: \(n \times (N_h \times S_h) / \sum(N_h \times S_h)\) where \(S_h\) is the stratum standard deviation, that is, the square root of the stratum variance.
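The Neyman formula can be checked by hand with a few lines of base R. This is a sketch under stated assumptions: the stratum sizes and variances are hypothetical, and it assumes that the standard deviation is obtained as the square root of the variance supplied through the variance argument.

```r
# Hypothetical inputs for three strata (illustrative, not from bfa_eas)
N_h   <- c(A = 500, B = 300, C = 200)  # stratum population sizes
var_h <- c(A = 4,   B = 16,  C = 25)   # stratum variances
n     <- 100                           # total sample size

# Assumption: S_h is the square root of the supplied variance
S_h <- sqrt(var_h)

# Neyman allocation: n_h = n * N_h * S_h / sum(N_h * S_h)
neyman_n_h <- n * (N_h * S_h) / sum(N_h * S_h)
print(neyman_n_h)
```

Here stratum B receives 37.5 units and strata A and C receive 31.25 each. Although B is smaller than A, its larger standard deviation earns it a larger share than under proportional allocation.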

Optimal Allocation

Minimizes variance for fixed cost (or cost for fixed variance). Each stratum receives: \(n \times (N_h \times S_h / \sqrt{C_h}) / \sum(N_h \times S_h / \sqrt{C_h})\) where \(C_h\) is the per-unit cost in stratum \(h\) (here \(C_h\) denotes cost, not the CV used in power allocation).
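A minimal base R sketch of the cost-variance formula follows. All inputs are hypothetical; in particular, the standard deviations and per-unit costs are made-up values chosen to keep the arithmetic easy to follow.

```r
# Hypothetical inputs for three strata (illustrative, not from bfa_eas)
N_h    <- c(A = 500, B = 300, C = 200)  # stratum population sizes
S_h    <- c(A = 2,   B = 4,   C = 5)    # stratum standard deviations
cost_h <- c(A = 1,   B = 4,   C = 4)    # per-unit cost in each stratum
n      <- 100                           # total sample size

# Optimal allocation weight: N_h * S_h / sqrt(C_h)
w_h <- N_h * S_h / sqrt(cost_h)

# Each stratum's share of n is its weight over the sum of weights
optimal_n_h <- n * w_h / sum(w_h)
print(round(optimal_n_h, 2))
```

The weights are 1000, 600 and 500, so the cheap stratum A receives the largest share even though its standard deviation is the smallest.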

Power Allocation

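Power allocation (Bankier, 1988) is a compromise allocation: \(n_h \propto C_h \times X_h^q\), where \(C_h\) is the stratum CV, \(X_h\) is a stratum importance measure, and \(q \in [0, 1]\) is the power exponent. With \(q = 0\) the allocation depends only on the CVs; with \(q = 1\) it is proportional to \(C_h \times X_h\).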
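The proportionality above can be turned into stratum sample sizes by normalizing the weights so that they sum to the total n. The base R sketch below does this with hypothetical CVs and importance values. It illustrates the formula only and is not the package's internal code.

```r
# Hypothetical inputs for three strata (illustrative, not from bfa_eas)
cv_h <- c(A = 0.4, B = 0.2, C = 0.1)   # stratum coefficients of variation C_h
X_h  <- c(A = 100, B = 400, C = 900)   # stratum importance measures X_h
q    <- 0.5                            # power exponent, must lie in [0, 1]
n    <- 110                            # total sample size

# Power allocation weight: C_h * X_h^q
w_h <- cv_h * X_h^q

# Normalize the weights so the stratum sizes sum to n
power_n_h <- n * w_h / sum(w_h)
print(power_n_h)
```

With these inputs the weights are 4, 4 and 3, giving 40, 40 and 30 units. Stratum C has the largest importance value but the smallest CV, so it does not dominate the allocation.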

Custom Allocation

For custom stratum-specific sample sizes or rates, pass a data frame directly to the n or frac argument in draw(). The data frame must contain columns for all stratification variables plus an n or frac column.
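The sketch below builds a data frame of stratum-specific sampling rates in the shape described above. The stratum levels, rates and stratum sizes are hypothetical. The pipeline is shown only as a comment because it needs the package and a sampling frame to run.

```r
# Hypothetical stratum levels and sampling rates (illustrative only)
custom_rates <- data.frame(
  region = c("A", "B", "C"),
  frac   = c(0.10, 0.05, 0.20)
)

# Hypothetical stratum sizes, used only to show what the rates imply
N_h <- c(A = 500, B = 300, C = 200)
implied_n_h <- N_h[as.character(custom_rates$region)] * custom_rates$frac
print(implied_n_h)

# With the package loaded, the data frame would be passed to draw() as:
# sampling_design() |>
#   stratify_by(region) |>
#   draw(frac = custom_rates)
```

These rates imply 50, 15 and 40 units before any rounding.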

Auxiliary Input Formats (variance, cost, cv, importance)

With one stratification variable, you may use a named vector (e.g., variance = c(A = 1.2, B = 0.8)).
With multiple stratification variables, you must use a data frame containing all stratification columns plus the value column.
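To make the two formats concrete, the base R sketch below writes the same stratum variances as a named vector and as a data frame, then builds a data frame for two stratification variables. The column names region and urban_rural follow the examples on this page; the stratum levels and values are hypothetical.

```r
# One stratification variable: named vector, names are stratum levels
variance_vec <- c(A = 1.2, B = 0.8)

# The same information as a data frame: key column plus a var column
variance_df <- data.frame(
  region = names(variance_vec),
  var    = unname(variance_vec)
)

# Two stratification variables: one row per combination of levels
variance_multi <- expand.grid(
  region      = c("A", "B"),
  urban_rural = c("Urban", "Rural"),
  stringsAsFactors = FALSE
)
variance_multi$var <- c(1.2, 0.8, 2.0, 1.5)  # hypothetical variances

print(variance_df)
print(variance_multi)
```

A named vector cannot identify a combination of levels, which is why the data frame form is required as soon as more than one stratification variable is used.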

Data Frame Requirements

Auxiliary data frames (variance, cost, cv, importance) must contain:

All stratification variable columns (used as join keys)
The appropriate value column (var, cost, cv, or importance)

Examples

# Simple stratification: 20 EAs per region
sampling_design() |>
  stratify_by(region) |>
  draw(n = 20) |>
  execute(bfa_eas, seed = 1234)
#> # A tbl_sample: 260 × 17
#> # Weights:      57.31 [30.4, 79.5]
#>    ea_id    region   province commune urban_rural population households area_km2
#>  * <chr>    <fct>    <fct>    <fct>   <fct>            <dbl>      <int>    <dbl>
#>  1 EA_05762 Boucle … Sourou   Kassoum Rural             1633        221    23.3 
#>  2 EA_11523 Boucle … Mouhoun  Safane  Rural              549         78    16.4 
#>  3 EA_04508 Boucle … Nayala   Gassan  Rural             1900        232     9.41
#>  4 EA_10142 Boucle … Mouhoun  Ouarko… Rural             1201        167    41.8 
#>  5 EA_03597 Boucle … Kossi    Djibas… Rural             1615        192    11.0 
#>  6 EA_03184 Boucle … Mouhoun  Dedoug… Rural              966        171     0.5 
#>  7 EA_03738 Boucle … Kossi    Dokui   Rural             1314        210    22.0 
#>  8 EA_03213 Boucle … Mouhoun  Dedoug… Rural              821        146     0.25
#>  9 EA_12401 Boucle … Banwa    Solenzo Rural             1062        127    29.3 
#> 10 EA_03179 Boucle … Mouhoun  Dedoug… Rural             1368        243    43.6 
#> # ℹ 250 more rows
#> # ℹ 9 more variables: accessible <lgl>, dist_road_km <dbl>,
#> #   food_insecurity_pct <dbl>, cost <dbl>, .weight <dbl>, .sample_id <int>,
#> #   .stage <int>, .weight_1 <dbl>, .fpc_1 <int>

# Proportional allocation across regions
sampling_design() |>
  stratify_by(region, alloc = "proportional") |>
  draw(n = 200) |>
  execute(bfa_eas, seed = 123)
#> # A tbl_sample: 200 × 17
#> # Weights:      74.5 [72.2, 76]
#>    ea_id    region   province commune urban_rural population households area_km2
#>  * <chr>    <fct>    <fct>    <fct>   <fct>            <dbl>      <int>    <dbl>
#>  1 EA_12416 Boucle … Banwa    Solenzo Rural             1395        166     8.03
#>  2 EA_12873 Boucle … Banwa    Tansila Rural             1164        148    39.1 
#>  3 EA_11055 Boucle … Bale     Pompoi  Rural             1198        189    33.3 
#>  4 EA_00767 Boucle … Kossi    Barani  Rural             1433        188    17.5 
#>  5 EA_12078 Boucle … Bale     Siby    Rural              912        125     8.55
#>  6 EA_03217 Boucle … Mouhoun  Dedoug… Rural              998        177     3.4 
#>  7 EA_04525 Boucle … Nayala   Gassan  Rural              649         79     0.49
#>  8 EA_05964 Boucle … Sourou   Kiemba… Rural             1075        171    12.7 
#>  9 EA_14277 Boucle … Nayala   Ye      Rural             1323        233    10.9 
#> 10 EA_14292 Boucle … Nayala   Ye      Rural             1030        182    18.1 
#> # ℹ 190 more rows
#> # ℹ 9 more variables: accessible <lgl>, dist_road_km <dbl>,
#> #   food_insecurity_pct <dbl>, cost <dbl>, .weight <dbl>, .sample_id <int>,
#> #   .stage <int>, .weight_1 <dbl>, .fpc_1 <int>

# Neyman allocation using pre-computed variances
sampling_design() |>
  stratify_by(region, alloc = "neyman", variance = bfa_eas_variance) |>
  draw(n = 200) |>
  execute(bfa_eas, seed = 12)
#> # A tbl_sample: 200 × 17
#> # Weights:      74.5 [54.83, 121.6]
#>    ea_id    region   province commune urban_rural population households area_km2
#>  * <chr>    <fct>    <fct>    <fct>   <fct>            <dbl>      <int>    <dbl>
#>  1 EA_12451 Boucle … Banwa    Solenzo Rural             2360        282    27.0 
#>  2 EA_12347 Boucle … Banwa    Solenzo Rural             1929        230    30.2 
#>  3 EA_11712 Boucle … Banwa    Sanaba  Rural              816        118    34.3 
#>  4 EA_14295 Boucle … Nayala   Ye      Rural             1281        226    25.9 
#>  5 EA_03138 Boucle … Mouhoun  Dedoug… Rural             1461        259    18.2 
#>  6 EA_11050 Boucle … Bale     Pompoi  Rural              917        144    26.8 
#>  7 EA_13839 Boucle … Sourou   Tougan  Rural             1073        148    12.2 
#>  8 EA_12930 Boucle … Mouhoun  Tcheri… Rural             1243        149     1.15
#>  9 EA_02077 Boucle … Kossi    Bombor… Rural             1036        136     7.59
#> 10 EA_04889 Boucle … Nayala   Gossina Rural             1032        166    27.0 
#> # ℹ 190 more rows
#> # ℹ 9 more variables: accessible <lgl>, dist_road_km <dbl>,
#> #   food_insecurity_pct <dbl>, cost <dbl>, .weight <dbl>, .sample_id <int>,
#> #   .stage <int>, .weight_1 <dbl>, .fpc_1 <int>

# Optimal allocation considering both variance and cost
sampling_design() |>
  stratify_by(region, alloc = "optimal",
              variance = bfa_eas_variance,
              cost = bfa_eas_cost) |>
  draw(n = 200) |>
  execute(bfa_eas, seed = 1)
#> # A tbl_sample: 200 × 17
#> # Weights:      74.5 [63.6, 107.25]
#>    ea_id    region   province commune urban_rural population households area_km2
#>  * <chr>    <fct>    <fct>    <fct>   <fct>            <dbl>      <int>    <dbl>
#>  1 EA_10155 Boucle … Mouhoun  Ouarko… Rural             1347        187    33.0 
#>  2 EA_03955 Boucle … Kossi    Doumba… Rural             1558        211    16.6 
#>  3 EA_10325 Boucle … Bale     Ouri    Rural              767         92    17.6 
#>  4 EA_03209 Boucle … Mouhoun  Dedoug… Rural              446         79    18.1 
#>  5 EA_12881 Boucle … Banwa    Tansila Rural             1076        137     9.45
#>  6 EA_11613 Boucle … Banwa    Sami    Rural              912        118    43.2 
#>  7 EA_06857 Boucle … Banwa    Kouka   Rural             1642        189    12.5 
#>  8 EA_13730 Boucle … Nayala   Toma    Rural              973        101     0.88
#>  9 EA_05972 Boucle … Sourou   Kiemba… Rural             1359        216    36.6 
#> 10 EA_03571 Boucle … Kossi    Djibas… Rural              725         86     1.01
#> # ℹ 190 more rows
#> # ℹ 9 more variables: accessible <lgl>, dist_road_km <dbl>,
#> #   food_insecurity_pct <dbl>, cost <dbl>, .weight <dbl>, .sample_id <int>,
#> #   .stage <int>, .weight_1 <dbl>, .fpc_1 <int>

# Power allocation (Bankier, 1988)
sampling_design() |>
  stratify_by(
    region,
    alloc = "power",
    cv = data.frame(
      region = levels(bfa_eas$region),
      cv = c(0.40, 0.35, 0.12, 0.20, 0.30, 0.18,
             0.15, 0.38, 0.22, 0.32, 0.17, 0.45, 0.25)
    ),
    importance = data.frame(
      region = levels(bfa_eas$region),
      importance = c(60, 40, 120, 70, 80, 65,
                     50, 55, 90, 75, 45, 35, 30)
    ),
    power = 0.5
  ) |>
  draw(n = 200) |>
  execute(bfa_eas, seed = 7)
#> # A tbl_sample: 200 × 17
#> # Weights:      74.5 [39.24, 155.6]
#>    ea_id    region   province commune urban_rural population households area_km2
#>  * <chr>    <fct>    <fct>    <fct>   <fct>            <dbl>      <int>    <dbl>
#>  1 EA_05963 Boucle … Sourou   Kiemba… Rural             1173        187    23.8 
#>  2 EA_13801 Boucle … Sourou   Tougan  Rural              613         85    16.7 
#>  3 EA_12886 Boucle … Banwa    Tansila Rural             1277        163    22.7 
#>  4 EA_07600 Boucle … Kossi    Madouba Rural             1781        226    26.6 
#>  5 EA_14037 Boucle … Bale     Yaho    Rural             1623        214    32.4 
#>  6 EA_03723 Boucle … Kossi    Dokui   Rural             1584        253     2.02
#>  7 EA_13703 Boucle … Sourou   Toeni   Rural             1618        247   216.  
#>  8 EA_10154 Boucle … Mouhoun  Ouarko… Rural             1526        212    60.3 
#>  9 EA_02133 Boucle … Mouhoun  Bondok… Rural             1107        169    32.3 
#> 10 EA_10378 Boucle … Bale     Pa      Rural             1128        159     2.43
#> # ℹ 190 more rows
#> # ℹ 9 more variables: accessible <lgl>, dist_road_km <dbl>,
#> #   food_insecurity_pct <dbl>, cost <dbl>, .weight <dbl>, .sample_id <int>,
#> #   .stage <int>, .weight_1 <dbl>, .fpc_1 <int>

# Custom sample sizes per stratum using a data frame
custom_sizes <- data.frame(
  region = levels(bfa_eas$region),
  n = c(20, 12, 25, 18, 22, 16, 14, 15, 20, 18, 12, 10, 8)
)
sampling_design() |>
  stratify_by(region) |>
  draw(n = custom_sizes) |>
  execute(bfa_eas, seed = 2026)
#> # A tbl_sample: 210 × 17
#> # Weights:      70.95 [43.43, 106]
#>    ea_id    region   province commune urban_rural population households area_km2
#>  * <chr>    <fct>    <fct>    <fct>   <fct>            <dbl>      <int>    <dbl>
#>  1 EA_08652 Boucle … Kossi    Nouna   Rural             1393        181    17.6 
#>  2 EA_10131 Boucle … Mouhoun  Ouarko… Rural             1123        156    23.4 
#>  3 EA_06881 Boucle … Banwa    Kouka   Rural             1229        142     9.3 
#>  4 EA_04515 Boucle … Nayala   Gassan  Rural             1556        190    42.6 
#>  5 EA_10374 Boucle … Bale     Pa      Rural             1150        162    82.3 
#>  6 EA_13719 Boucle … Nayala   Toma    Rural              955         99     0.98
#>  7 EA_12390 Boucle … Banwa    Solenzo Rural             2953        352    25.6 
#>  8 EA_06874 Boucle … Banwa    Kouka   Rural             1723        198    17.7 
#>  9 EA_02110 Boucle … Mouhoun  Bondok… Rural             1413        216    16.8 
#> 10 EA_11690 Boucle … Banwa    Sanaba  Rural              929        135     1.9 
#> # ℹ 200 more rows
#> # ℹ 9 more variables: accessible <lgl>, dist_road_km <dbl>,
#> #   food_insecurity_pct <dbl>, cost <dbl>, .weight <dbl>, .sample_id <int>,
#> #   .stage <int>, .weight_1 <dbl>, .fpc_1 <int>

# Multiple stratification variables
sampling_design() |>
  stratify_by(region, urban_rural, alloc = "proportional") |>
  draw(n = 300) |>
  execute(bfa_eas, seed = 2025)
#> # A tbl_sample: 300 × 17
#> # Weights:      49.51 [41.33, 56.67]
#>    ea_id    region   province commune urban_rural population households area_km2
#>  * <chr>    <fct>    <fct>    <fct>   <fct>            <dbl>      <int>    <dbl>
#>  1 EA_03197 Boucle … Mouhoun  Dedoug… Rural             1296        230    14.3 
#>  2 EA_12879 Boucle … Banwa    Tansila Rural             1109        141    15.6 
#>  3 EA_03220 Boucle … Mouhoun  Dedoug… Rural             1605        285     1.62
#>  4 EA_06875 Boucle … Banwa    Kouka   Rural              635         73     0.75
#>  5 EA_12079 Boucle … Bale     Siby    Rural             1100        151    47.6 
#>  6 EA_04708 Boucle … Sourou   Gomboro Rural              988        131    49.1 
#>  7 EA_12880 Boucle … Banwa    Tansila Rural             1181        151    23.6 
#>  8 EA_10119 Boucle … Mouhoun  Ouarko… Rural              904        125    38.2 
#>  9 EA_03179 Boucle … Mouhoun  Dedoug… Rural             1368        243    43.6 
#> 10 EA_13693 Boucle … Sourou   Toeni   Rural             1112        170     1.24
#> # ℹ 290 more rows
#> # ℹ 9 more variables: accessible <lgl>, dist_road_km <dbl>,
#> #   food_insecurity_pct <dbl>, cost <dbl>, .weight <dbl>, .sample_id <int>,
#> #   .stage <int>, .weight_1 <dbl>, .fpc_1 <int>