stratify_by() specifies stratification variables and optional allocation methods for a sampling design. Stratification ensures representation from all subgroups defined by the stratification variables.

Usage

stratify_by(
  .data,
  ...,
  alloc = NULL,
  variance = NULL,
  cost = NULL,
  cv = NULL,
  importance = NULL,
  power = NULL
)

Arguments

.data

A sampling_design object (piped from sampling_design() or add_stage()).

...

Stratification variables specified as bare column names.

alloc

Character string specifying the allocation method. One of:

  • NULL (default): No allocation; n in draw() is per stratum

  • "equal": Equal allocation across strata

  • "proportional": Proportional to stratum size

  • "neyman": Neyman optimal allocation (requires variance)

  • "optimal": Cost-variance optimal allocation (requires variance and cost)

  • "power": Power allocation (requires cv and importance)

variance

Stratum variances for Neyman or optimal allocation. Either a data frame with columns for all stratification variables plus a var column, or a named numeric vector (when using a single stratification variable) where names correspond to stratum levels. For named vectors, names must match the values in the stratification column (for example c(A = 1.2, B = 0.8)).

cost

Stratum costs for optimal allocation. Either a data frame with columns for all stratification variables plus a cost column, or a named numeric vector (when using a single stratification variable) where names correspond to stratum levels. For named vectors, names must match the values in the stratification column.

cv

Stratum coefficients of variation (\(C_h\)) for power allocation. Either a data frame with stratification columns plus a cv column, or a named numeric vector for a single stratification variable (names are stratum levels).

importance

Stratum importance measure (\(X_h\)) for power allocation. Either a data frame with stratification columns plus an importance column, or a named numeric vector for a single stratification variable (names are stratum levels).

power

Power exponent \(q\) for power allocation. Must satisfy \(0 \le q \le 1\). Defaults to 0.5.

Value

A modified sampling_design object with stratification specified.

Details

Allocation Methods

When no alloc is specified, the n parameter in draw() is interpreted as the sample size per stratum. When an alloc method is specified, n becomes the total sample size to be distributed according to the allocation method.

Equal Allocation

Each stratum receives \(n/H\) units, where \(H\) is the number of strata.
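The difference between a per-stratum n (no alloc) and a total n (with alloc) can be checked with plain arithmetic. The sketch below uses base R only and assumes 13 strata, which matches the 13 region levels implied by the examples on this page (20 per region gives 260 rows). How the package rounds when n/H is not a whole number is not stated here, so the sketch sticks to a case that divides evenly.

```r
H <- 13               # number of strata (regions in the examples on this page)

# alloc = NULL: n is the sample size in every stratum
n_per_stratum <- 20
H * n_per_stratum     # total sample size
#> [1] 260

# alloc = "equal": n is the total, split evenly across strata
n_total <- 260
n_total / H           # sample size in each stratum
#> [1] 20
```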

Proportional Allocation

Each stratum receives \(n \times N_h/N\) units, where \(N_h\) is the stratum population size and \(N\) is the total population size.
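As a rough illustration of the formula, the following base R sketch computes proportional shares by hand. The stratum labels and population sizes are invented for the example and are not taken from bfa_eas; the result is left unrounded because the package's rounding rule is not described here.

```r
# Hypothetical stratum population sizes N_h
N_h <- c(A = 500, B = 300, C = 200)
n <- 100                       # total sample size, as passed to draw()

# n_h = n * N_h / N, with N = sum of all N_h
n_h <- n * N_h / sum(N_h)
n_h
#>  A  B  C 
#> 50 30 20 
```

Every stratum ends up with the same sampling fraction (here 10%), which is what makes the allocation proportional.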

Neyman Allocation

Minimizes variance for fixed sample size. Each stratum receives \(n \times (N_h \times S_h) / \sum_h (N_h \times S_h)\) units, where \(S_h\) is the stratum standard deviation, that is, the square root of the stratum variance.
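A minimal base R sketch of the Neyman formula follows. The population sizes and variances are made up for illustration, and the step that takes the square root of the supplied variances is an assumption about how the variance input maps onto \(S_h\); allocations are left unrounded.

```r
# Hypothetical inputs: stratum sizes and stratum variances
N_h      <- c(A = 500, B = 300, C = 200)
variance <- c(A = 4,   B = 9,   C = 16)
n <- 270                        # total sample size

S_h <- sqrt(variance)           # standard deviations: 2, 3, 4
w_h <- N_h * S_h                # allocation weights N_h * S_h

n_h <- n * w_h / sum(w_h)
n_h
#>   A   B   C 
#> 100  90  80 
```

Compared with proportional allocation (135, 81, 54 for the same inputs), the more variable strata B and C receive a larger share of the sample.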

Optimal Allocation

Minimizes variance for fixed cost (or cost for fixed variance). Each stratum receives \(n \times (N_h \times S_h / \sqrt{C_h}) / \sum_h (N_h \times S_h / \sqrt{C_h})\) units, where \(C_h\) is the per-unit cost in stratum \(h\).
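The cost adjustment can be sketched in base R as follows. All inputs are invented for illustration, cost_h stands in for the per-unit cost \(C_h\), and the result is left unrounded since the package's rounding rule is not documented here.

```r
# Hypothetical inputs: stratum sizes, variances, and per-unit costs
N_h      <- c(A = 500, B = 300, C = 200)
variance <- c(A = 4,   B = 9,   C = 16)
cost_h   <- c(A = 1,   B = 4,   C = 16)
n <- 165                               # total sample size

S_h <- sqrt(variance)                  # standard deviations: 2, 3, 4
w_h <- N_h * S_h / sqrt(cost_h)        # weights: 1000, 450, 200

n_h <- n * w_h / sum(w_h)
n_h
#>   A   B   C 
#> 100  45  20 
```

Stratum C has the largest standard deviation but also the highest cost, so its share drops relative to what the Neyman weights alone would give.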

Power Allocation

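Power allocation (Bankier, 1988) is a compromise allocation: \(n_h \propto C_h \times X_h^q\), where \(C_h\) is the stratum CV, \(X_h\) is a stratum importance measure, and \(q \in [0, 1]\). Here \(C_h\) denotes the coefficient of variation supplied via cv, not the per-unit cost used in the optimal allocation formula.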
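To make the proportionality concrete, the base R sketch below normalizes \(C_h \times X_h^q\) so that the shares sum to a total n. The CVs and importance values are hypothetical, normalizing to n is an assumption based on how the other allocation methods are described, and no rounding is applied.

```r
# Hypothetical inputs: stratum CVs and importance measures
cv_h <- c(A = 0.4, B = 0.3, C = 0.1)
X_h  <- c(A = 25,  B = 100, C = 400)
q <- 0.5                         # power exponent, must lie in [0, 1]
n <- 70                          # total sample size

stopifnot(q >= 0, q <= 1)

w_h <- cv_h * X_h^q              # weights: 0.4*5, 0.3*10, 0.1*20
n_h <- n * w_h / sum(w_h)
n_h
#>  A  B  C 
#> 20 30 20 
```

With q = 0 the importance measure drops out and the shares depend on the CVs alone; with q = 1 the weights become \(C_h \times X_h\), so larger values of q shift the sample toward strata with a high importance measure.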

Custom Allocation

For custom stratum-specific sample sizes or rates, pass a data frame directly to the n or frac argument in draw(). The data frame must contain columns for all stratification variables plus an n or frac column.

Auxiliary Input Formats (variance, cost, cv, importance)

  • With one stratification variable, you may use a named vector (e.g., variance = c(A = 1.2, B = 0.8)).

  • With multiple stratification variables, you must use a data frame containing all stratification columns plus the value column.
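The two input shapes can be built in base R roughly as shown below. The stratum labels and values are invented, the column names region and urban_rural simply mirror the examples on this page, and supplying one row per stratum combination is an assumption rather than a documented requirement.

```r
# One stratification variable: named vector keyed by stratum level
variance_vec <- c(A = 1.2, B = 0.8)

# The same information as a data frame with a `var` value column
variance_df <- data.frame(
  region = names(variance_vec),
  var    = unname(variance_vec)
)

# Two stratification variables: a data frame with both key columns.
# expand.grid() varies the first column fastest, so the rows are
# (A, Urban), (B, Urban), (A, Rural), (B, Rural).
variance_2d <- expand.grid(
  region      = c("A", "B"),
  urban_rural = c("Urban", "Rural"),
  stringsAsFactors = FALSE
)
variance_2d$var <- c(1.2, 0.8, 2.0, 1.5)
```

The same pattern would apply to cost, cv, and importance, with the value column renamed accordingly.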

Data Frame Requirements

Auxiliary data frames (variance, cost, cv, importance) must contain:

  • All stratification variable columns (used as join keys)

  • The appropriate value column (var, cost, cv, or importance)

See also

sampling_design() for creating designs, draw() for specifying sample sizes, cluster_by() for cluster sampling

Examples

# Simple stratification: 20 EAs per region
sampling_design() |>
  stratify_by(region) |>
  draw(n = 20) |>
  execute(bfa_eas, seed = 1234)
#> # A tbl_sample: 260 × 17
#> # Weights:      57.44 [30.75, 79.5]
#>    ea_id    region   province commune urban_rural population households area_km2
#>  * <chr>    <fct>    <fct>    <fct>   <fct>            <dbl>      <int>    <dbl>
#>  1 EA_03266 Boucle … Sourou   Di      Rural             3642        475     2.62
#>  2 EA_11550 Boucle … Mouhoun  Safane  Rural              549         79    16.4 
#>  3 EA_04535 Boucle … Nayala   Gassan  Rural             1900        210     9.41
#>  4 EA_10169 Boucle … Mouhoun  Ouarko… Rural             1201        165    41.8 
#>  5 EA_03624 Boucle … Kossi    Djibas… Rural             1615        206    11.0 
#>  6 EA_03184 Boucle … Mouhoun  Dedoug… Rural              966        171     0.5 
#>  7 EA_03765 Boucle … Kossi    Dokui   Rural             1314        163    22.0 
#>  8 EA_03213 Boucle … Mouhoun  Dedoug… Rural              821        146     0.25
#>  9 EA_12428 Boucle … Banwa    Solenzo Rural             1062        143    29.3 
#> 10 EA_03179 Boucle … Mouhoun  Dedoug… Rural             1368        243    43.6 
#> # ℹ 250 more rows
#> # ℹ 9 more variables: accessible <lgl>, dist_road_km <dbl>,
#> #   food_insecurity_pct <dbl>, cost <dbl>, .weight <dbl>, .sample_id <int>,
#> #   .stage <int>, .weight_1 <dbl>, .fpc_1 <int>

# Proportional allocation across regions
sampling_design() |>
  stratify_by(region, alloc = "proportional") |>
  draw(n = 200) |>
  execute(bfa_eas, seed = 123)
#> # A tbl_sample: 200 × 17
#> # Weights:      74.67 [72.2, 76.88]
#>    ea_id    region   province commune urban_rural population households area_km2
#>  * <chr>    <fct>    <fct>    <fct>   <fct>            <dbl>      <int>    <dbl>
#>  1 EA_12443 Boucle … Banwa    Solenzo Rural             1395        187     8.03
#>  2 EA_12900 Boucle … Banwa    Tansila Rural             1164        140    39.1 
#>  3 EA_11082 Boucle … Bale     Pompoi  Rural             1198        162    33.3 
#>  4 EA_00767 Boucle … Kossi    Barani  Rural             1433        188    17.5 
#>  5 EA_12105 Boucle … Bale     Siby    Rural              912        135     8.55
#>  6 EA_03217 Boucle … Mouhoun  Dedoug… Rural              998        177     3.4 
#>  7 EA_04552 Boucle … Nayala   Gassan  Rural              649         72     0.49
#>  8 EA_04732 Boucle … Sourou   Gomboro Rural             1825        257    32.9 
#>  9 EA_14304 Boucle … Nayala   Ye      Rural             1323        190    10.9 
#> 10 EA_14319 Boucle … Nayala   Ye      Rural             1030        148    18.1 
#> # ℹ 190 more rows
#> # ℹ 9 more variables: accessible <lgl>, dist_road_km <dbl>,
#> #   food_insecurity_pct <dbl>, cost <dbl>, .weight <dbl>, .sample_id <int>,
#> #   .stage <int>, .weight_1 <dbl>, .fpc_1 <int>

# Neyman allocation using pre-computed variances
sampling_design() |>
  stratify_by(region, alloc = "neyman", variance = bfa_eas_variance) |>
  draw(n = 200) |>
  execute(bfa_eas, seed = 12)
#> # A tbl_sample: 200 × 17
#> # Weights:      74.67 [52.65, 123]
#>    ea_id    region   province commune urban_rural population households area_km2
#>  * <chr>    <fct>    <fct>    <fct>   <fct>            <dbl>      <int>    <dbl>
#>  1 EA_12478 Boucle … Banwa    Solenzo Rural             2360        317    27.0 
#>  2 EA_12374 Boucle … Banwa    Solenzo Rural             1929        259    30.2 
#>  3 EA_11739 Boucle … Banwa    Sanaba  Rural              816        104    34.3 
#>  4 EA_14322 Boucle … Nayala   Ye      Rural             1281        184    25.9 
#>  5 EA_03138 Boucle … Mouhoun  Dedoug… Rural             1461        259    18.2 
#>  6 EA_11077 Boucle … Bale     Pompoi  Rural              917        124    26.8 
#>  7 EA_13839 Boucle … Sourou   Tougan  Rural             1934        265    24.2 
#>  8 EA_12957 Boucle … Mouhoun  Tcheri… Rural             1243        199     1.15
#>  9 EA_13862 Boucle … Sourou   Tougan  Rural              902        124    59.4 
#> 10 EA_02077 Boucle … Kossi    Bombor… Rural             1036        136     7.59
#> # ℹ 190 more rows
#> # ℹ 9 more variables: accessible <lgl>, dist_road_km <dbl>,
#> #   food_insecurity_pct <dbl>, cost <dbl>, .weight <dbl>, .sample_id <int>,
#> #   .stage <int>, .weight_1 <dbl>, .fpc_1 <int>

# Optimal allocation considering both variance and cost
sampling_design() |>
  stratify_by(region, alloc = "optimal",
              variance = bfa_eas_variance,
              cost = bfa_eas_cost) |>
  draw(n = 200) |>
  execute(bfa_eas, seed = 1)
#> # A tbl_sample: 200 × 17
#> # Weights:      74.67 [60.55, 107.25]
#>    ea_id    region   province commune urban_rural population households area_km2
#>  * <chr>    <fct>    <fct>    <fct>   <fct>            <dbl>      <int>    <dbl>
#>  1 EA_10182 Boucle … Mouhoun  Ouarko… Rural             1347        185    33.0 
#>  2 EA_03982 Boucle … Kossi    Doumba… Rural             1558        191    16.6 
#>  3 EA_10352 Boucle … Bale     Ouri    Rural              767         98    17.6 
#>  4 EA_03209 Boucle … Mouhoun  Dedoug… Rural              446         79    18.1 
#>  5 EA_12908 Boucle … Banwa    Tansila Rural             1076        129     9.45
#>  6 EA_11640 Boucle … Banwa    Sami    Rural              912        137    43.2 
#>  7 EA_06884 Boucle … Banwa    Kouka   Rural             1642        226    12.5 
#>  8 EA_13757 Boucle … Nayala   Toma    Rural              973        141     0.88
#>  9 EA_05785 Boucle … Sourou   Kassoum Rural             1315        195     1.5 
#> 10 EA_03598 Boucle … Kossi    Djibas… Rural              725         93     1.01
#> # ℹ 190 more rows
#> # ℹ 9 more variables: accessible <lgl>, dist_road_km <dbl>,
#> #   food_insecurity_pct <dbl>, cost <dbl>, .weight <dbl>, .sample_id <int>,
#> #   .stage <int>, .weight_1 <dbl>, .fpc_1 <int>

# Power allocation (Bankier, 1988)
sampling_design() |>
  stratify_by(
    region,
    alloc = "power",
    cv = data.frame(
      region = levels(bfa_eas$region),
      cv = c(0.40, 0.35, 0.12, 0.20, 0.30, 0.18,
             0.15, 0.38, 0.22, 0.32, 0.17, 0.45, 0.25)
    ),
    importance = data.frame(
      region = levels(bfa_eas$region),
      importance = c(60, 40, 120, 70, 80, 65,
                     50, 55, 90, 75, 45, 35, 30)
    ),
    power = 0.5
  ) |>
  draw(n = 200) |>
  execute(bfa_eas, seed = 7)
#> # A tbl_sample: 200 × 17
#> # Weights:      74.67 [39.24, 155.6]
#>    ea_id    region   province commune urban_rural population households area_km2
#>  * <chr>    <fct>    <fct>    <fct>   <fct>            <dbl>      <int>    <dbl>
#>  1 EA_04731 Boucle … Sourou   Gomboro Rural              809        114    17.4 
#>  2 EA_13853 Boucle … Sourou   Tougan  Rural             1078        148    52.7 
#>  3 EA_13726 Boucle … Sourou   Toeni   Rural             1545        160    38.2 
#>  4 EA_12913 Boucle … Banwa    Tansila Rural             1277        153    22.7 
#>  5 EA_07627 Boucle … Kossi    Madouba Rural             1781        220    26.6 
#>  6 EA_14064 Boucle … Bale     Yaho    Rural             1623        242    32.4 
#>  7 EA_03750 Boucle … Kossi    Dokui   Rural             1584        196     2.02
#>  8 EA_07243 Boucle … Sourou   Lankoue Rural             1870        246    40.0 
#>  9 EA_10181 Boucle … Mouhoun  Ouarko… Rural             1526        209    60.3 
#> 10 EA_02133 Boucle … Mouhoun  Bondok… Rural             1107        169    32.3 
#> # ℹ 190 more rows
#> # ℹ 9 more variables: accessible <lgl>, dist_road_km <dbl>,
#> #   food_insecurity_pct <dbl>, cost <dbl>, .weight <dbl>, .sample_id <int>,
#> #   .stage <int>, .weight_1 <dbl>, .fpc_1 <int>

# Custom sample sizes per stratum using a data frame
custom_sizes <- data.frame(
  region = levels(bfa_eas$region),
  n = c(20, 12, 25, 18, 22, 16, 14, 15, 20, 18, 12, 10, 8)
)
sampling_design() |>
  stratify_by(region) |>
  draw(n = custom_sizes) |>
  execute(bfa_eas, seed = 2026)
#> # A tbl_sample: 210 × 17
#> # Weights:      71.11 [43.93, 106]
#>    ea_id    region   province commune urban_rural population households area_km2
#>  * <chr>    <fct>    <fct>    <fct>   <fct>            <dbl>      <int>    <dbl>
#>  1 EA_08679 Boucle … Kossi    Nouna   Rural             1393        206    17.6 
#>  2 EA_10158 Boucle … Mouhoun  Ouarko… Rural             1123        154    23.4 
#>  3 EA_06908 Boucle … Banwa    Kouka   Rural             1229        169     9.3 
#>  4 EA_04542 Boucle … Nayala   Gassan  Rural             1556        172    42.6 
#>  5 EA_10401 Boucle … Bale     Pa      Rural             1150        165    82.3 
#>  6 EA_13746 Boucle … Nayala   Toma    Rural              955        138     0.98
#>  7 EA_12417 Boucle … Banwa    Solenzo Rural             2953        397    25.6 
#>  8 EA_06901 Boucle … Banwa    Kouka   Rural             1723        237    17.7 
#>  9 EA_02110 Boucle … Mouhoun  Bondok… Rural             1413        216    16.8 
#> 10 EA_11717 Boucle … Banwa    Sanaba  Rural              929        118     1.9 
#> # ℹ 200 more rows
#> # ℹ 9 more variables: accessible <lgl>, dist_road_km <dbl>,
#> #   food_insecurity_pct <dbl>, cost <dbl>, .weight <dbl>, .sample_id <int>,
#> #   .stage <int>, .weight_1 <dbl>, .fpc_1 <int>

# Multiple stratification variables
sampling_design() |>
  stratify_by(region, urban_rural, alloc = "proportional") |>
  draw(n = 300) |>
  execute(bfa_eas, seed = 2025)
#> # A tbl_sample: 300 × 17
#> # Weights:      49.7 [29, 62]
#>    ea_id    region   province commune urban_rural population households area_km2
#>  * <chr>    <fct>    <fct>    <fct>   <fct>            <dbl>      <int>    <dbl>
#>  1 EA_03197 Boucle … Mouhoun  Dedoug… Rural             1296        230    14.3 
#>  2 EA_12906 Boucle … Banwa    Tansila Rural             1109        133    15.6 
#>  3 EA_03220 Boucle … Mouhoun  Dedoug… Rural             1605        285     1.62
#>  4 EA_06902 Boucle … Banwa    Kouka   Rural              635         87     0.75
#>  5 EA_12106 Boucle … Bale     Siby    Rural             1100        163    47.6 
#>  6 EA_03257 Boucle … Sourou   Di      Rural             1192        155    56.3 
#>  7 EA_12907 Boucle … Banwa    Tansila Rural             1181        142    23.6 
#>  8 EA_10146 Boucle … Mouhoun  Ouarko… Rural              904        124    38.2 
#>  9 EA_03179 Boucle … Mouhoun  Dedoug… Rural             1368        243    43.6 
#> 10 EA_07233 Boucle … Sourou   Lankoue Rural             1411        186    19.0 
#> # ℹ 290 more rows
#> # ℹ 9 more variables: accessible <lgl>, dist_road_km <dbl>,
#> #   food_insecurity_pct <dbl>, cost <dbl>, .weight <dbl>, .sample_id <int>,
#> #   .stage <int>, .weight_1 <dbl>, .fpc_1 <int>