stratify_by() specifies stratification variables and optional allocation methods for a sampling design. Stratification ensures representation from all subgroups defined by the stratification variables.

Usage

stratify_by(
  .data,
  ...,
  alloc = NULL,
  variance = NULL,
  cost = NULL,
  cv = NULL,
  importance = NULL,
  power = NULL
)

Arguments

.data

A sampling_design object (piped from sampling_design() or add_stage()).

...

Stratification variables specified as bare column names.

alloc

Character string specifying the allocation method. One of:

  • NULL (default): No allocation; n in draw() is per stratum

  • "equal": Equal allocation across strata

  • "proportional": Proportional to stratum size

  • "neyman": Neyman optimal allocation (requires variance)

  • "optimal": Cost-variance optimal allocation (requires variance and cost)

  • "power": Power allocation (requires cv and importance)

variance

Stratum variances for Neyman or optimal allocation. Either a data frame with columns for all stratification variables plus a var column, or, for a single stratification variable, a named numeric vector whose names match the values in the stratification column (for example, c(A = 1.2, B = 0.8)).

cost

Stratum costs for optimal allocation. Either a data frame with columns for all stratification variables plus a cost column, or, for a single stratification variable, a named numeric vector whose names match the values in the stratification column.

cv

Stratum coefficients of variation (\(C_h\)) for power allocation. Either a data frame with stratification columns plus a cv column, or a named numeric vector for a single stratification variable (names are stratum levels).

importance

Stratum importance measure (\(X_h\)) for power allocation. Either a data frame with stratification columns plus an importance column, or a named numeric vector for a single stratification variable (names are stratum levels).

power

Power exponent \(q\) for power allocation. Must satisfy \(0 \le q \le 1\). Defaults to 0.5.

Value

A modified sampling_design object with stratification specified.

Details

Allocation Methods

When no alloc is specified, the n parameter in draw() is interpreted as the sample size per stratum. When an alloc method is specified, n becomes the total sample size to be distributed according to the allocation method.

Equal Allocation

Each stratum receives \(n/H\) units, where \(H\) is the number of strata.
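
As a rough illustration of the two readings of n, the base R arithmetic below uses the 13 region strata that appear in the Examples section. It is a sketch of the formula only and does not call the package; how non-integer shares are rounded is not shown here.

```r
H <- 13  # number of strata (regions in the Examples section)

# alloc = NULL: n is per stratum, so the total sample is H * n
n_per_stratum <- 20
total_no_alloc <- H * n_per_stratum  # 260, the row count of the first example

# alloc = "equal": n is the total, so each stratum gets n / H
n_total <- 260
n_h_equal <- n_total / H             # 20

# A total that is not a multiple of H gives a fractional share
share_200 <- 200 / H                 # about 15.4 before any rounding
```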

Proportional Allocation

Each stratum receives \(n \times N_h/N\) units, where \(N_h\) is the stratum population size and \(N\) is the total population size.
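
A minimal worked example of the proportional formula in base R. The stratum sizes are hypothetical (not taken from bfa_eas) and the result is shown before any rounding the package may apply.

```r
# Hypothetical stratum population sizes N_h
N_h <- c(A = 5000, B = 3000, C = 2000)
n <- 200  # total sample size

# n_h = n * N_h / N
n_h <- n * N_h / sum(N_h)
n_h  # A = 100, B = 60, C = 40
```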

Neyman Allocation

Minimizes variance for a fixed total sample size. Each stratum receives \(n \times (N_h \times S_h) / \sum(N_h \times S_h)\) units, where \(S_h\) is the stratum standard deviation (the square root of the stratum variance).
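
The Neyman formula can be sketched in base R as follows. The stratum sizes and variances are hypothetical, and the standard deviation is taken as the square root of the variance; the values are unrounded shares, not necessarily what the package returns.

```r
# Hypothetical stratum sizes and variances
N_h   <- c(A = 5000, B = 3000, C = 2000)
var_h <- c(A = 4,    B = 9,    C = 25)
n <- 200  # total sample size

S_h <- sqrt(var_h)                 # stratum standard deviations: 2, 3, 5
w_h <- N_h * S_h                   # 10000, 9000, 10000
n_h <- n * w_h / sum(w_h)          # shares sum to n
round(n_h, 2)                      # A and C receive equal shares despite different sizes
```

Stratum C is the smallest but the most variable, so it receives as much of the sample as stratum A.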

Optimal Allocation

Minimizes variance for a fixed total cost (or cost for a fixed variance). Each stratum receives \(n \times (N_h \times S_h / \sqrt{C_h}) / \sum(N_h \times S_h / \sqrt{C_h})\) units, where \(S_h\) is the stratum standard deviation and \(C_h\) is the per-unit cost in stratum \(h\).
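
A base R sketch of the cost-variance formula, using hypothetical sizes, variances, and per-unit costs. It shows unrounded shares only and does not call the package.

```r
# Hypothetical stratum sizes, variances, and per-unit costs
N_h    <- c(A = 5000, B = 3000, C = 2000)
var_h  <- c(A = 4,    B = 9,    C = 25)
cost_h <- c(A = 1,    B = 4,    C = 4)
n <- 200  # total sample size

# Weight each stratum by N_h * S_h / sqrt(C_h)
w_h <- N_h * sqrt(var_h) / sqrt(cost_h)   # 10000, 4500, 5000
n_h <- n * w_h / sum(w_h)
round(n_h, 2)  # the cheap stratum A takes the largest share
```

Compared with the same inputs without costs, the strata that are four times as expensive per unit have their weights halved.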

Power Allocation

Power allocation (Bankier, 1988) is a compromise allocation: \(n_h \propto C_h \times X_h^q\), where \(C_h\) is the stratum CV (not the per-unit cost that \(C_h\) denotes under Optimal Allocation), \(X_h\) is a stratum importance measure, and \(q \in [0, 1]\).
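
To make the role of the exponent concrete, the base R sketch below applies the proportionality to hypothetical CVs and importance values and normalizes the weights so they sum to the total n. Results are unrounded and the inputs are illustrative only.

```r
# Hypothetical stratum CVs and importance measures
cv_h <- c(A = 0.4, B = 0.2, C = 0.3)
X_h  <- c(A = 100, B = 400, C = 900)
n <- 170  # total sample size

# Normalize weights C_h * X_h^q so the shares sum to n
power_alloc <- function(n, cv, x, q) {
  w <- cv * x^q
  n * w / sum(w)
}

n_h_half <- power_alloc(n, cv_h, X_h, q = 0.5)  # weights 4, 4, 9
n_h_zero <- power_alloc(n, cv_h, X_h, q = 0)    # importance drops out
n_h_half  # A = 40, B = 40, C = 90
```

With q = 0 the shares depend on the CVs alone; larger q shifts the sample toward strata with larger importance values.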

Custom Allocation

For custom stratum-specific sample sizes or rates, pass a data frame directly to the n or frac argument in draw(). The data frame must contain columns for all stratification variables plus an n or frac column.
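
For the frac form, a data frame of stratum-specific rates might look like the sketch below. The stratum labels and rates are hypothetical; the pipeline is left as a comment because it needs the package and a sampling frame.

```r
# Hypothetical sampling rates for three strata of a 'region' column
custom_rates <- data.frame(
  region = c("A", "B", "C"),
  frac   = c(0.10, 0.05, 0.20)
)

# Passed to draw() in place of a scalar (not run here):
# sampling_design() |>
#   stratify_by(region) |>
#   draw(frac = custom_rates)
```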

Auxiliary Input Formats (variance, cost, cv, importance)

  • With one stratification variable, you may use a named vector (e.g., variance = c(A = 1.2, B = 0.8)).

  • With multiple stratification variables, you must use a data frame containing all stratification columns plus the value column.
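
The two input shapes can be built in base R as sketched below. The stratum labels and values are hypothetical, and region and urban_rural are used as column names only because they appear in the Examples section.

```r
# One stratification variable: a named vector keyed by stratum level
variance_vec <- c(A = 1.2, B = 0.8)

# The same information as a data frame with a 'var' column
variance_df <- data.frame(
  region = names(variance_vec),
  var    = unname(variance_vec)
)

# Two stratification variables: one row per combination of levels
variance_two <- expand.grid(
  region      = c("A", "B"),
  urban_rural = c("Urban", "Rural"),
  stringsAsFactors = FALSE
)
variance_two$var <- c(1.5, 0.9, 1.1, 0.7)  # order follows the rows of expand.grid
```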

Data Frame Requirements

Auxiliary data frames (variance, cost, cv, importance) must contain:

  • All stratification variable columns (used as join keys)

  • The appropriate value column (var, cost, cv, or importance)
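
Before passing an auxiliary data frame, its columns can be checked with a small helper like the one sketched below. The function has_required_cols() is hypothetical and not part of the package; it only tests column names and does not verify that every stratum has a row.

```r
# Hypothetical helper: TRUE when 'aux' has every stratification column
# plus the named value column
has_required_cols <- function(aux, strata_vars, value_col) {
  all(c(strata_vars, value_col) %in% names(aux))
}

cost_df <- data.frame(
  region      = c("A", "B"),
  urban_rural = c("Urban", "Rural"),
  cost        = c(12, 30)
)

ok  <- has_required_cols(cost_df, c("region", "urban_rural"), "cost")
bad <- has_required_cols(cost_df[, c("region", "cost")],
                         c("region", "urban_rural"), "cost")
```

Here ok is TRUE, while bad is FALSE because the urban_rural join key is missing.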

See also

sampling_design() for creating designs, draw() for specifying sample sizes, cluster_by() for cluster sampling

Examples

# Simple stratification: 20 EAs per region
sampling_design() |>
  stratify_by(region) |>
  draw(n = 20) |>
  execute(bfa_eas, seed = 1234)
#> # A tbl_sample: 260 × 17
#> # Weights:      171.42 [80.6, 275.25]
#>    ea_id region      province commune urban_rural population households area_km2
#>  * <int> <fct>       <fct>    <fct>   <fct>            <dbl>      <int>    <dbl>
#>  1  9635 Boucle du … Banwa    Sanaba  Rural              357         45     0.77
#>  2  9908 Boucle du … Bale     Siby    Rural              113         17     8.92
#>  3 20945 Boucle du … Mouhoun  Bondok… Rural              395         60     7.58
#>  4 34029 Boucle du … Banwa    Sami    Rural              363         55     8.85
#>  5 31936 Boucle du … Sourou   Kiemba… Rural              683         89     0.6 
#>  6 26038 Boucle du … Mouhoun  Dedoug… Rural              351         62     8.67
#>  7 41793 Boucle du … Kossi    Dokui   Rural              184         23     8.27
#>  8  5643 Boucle du … Mouhoun  Kona    Rural              101         14     8.38
#>  9 21026 Boucle du … Mouhoun  Bondok… Rural               64         10     7.92
#> 10 33091 Boucle du … Kossi    Nouna   Rural               28          4     8.55
#> # ℹ 250 more rows
#> # ℹ 9 more variables: accessible <lgl>, dist_road_km <dbl>,
#> #   food_insecurity_pct <dbl>, cost <dbl>, .weight <dbl>, .sample_id <int>,
#> #   .stage <int>, .weight_1 <dbl>, .fpc_1 <int>

# Proportional allocation across regions
sampling_design() |>
  stratify_by(region, alloc = "proportional") |>
  draw(n = 200) |>
  execute(bfa_eas, seed = 123)
#> # A tbl_sample: 200 × 17
#> # Weights:      222.85 [217.78, 237.43]
#>    ea_id region      province commune urban_rural population households area_km2
#>  * <int> <fct>       <fct>    <fct>   <fct>            <dbl>      <int>    <dbl>
#>  1 33181 Boucle du … Kossi    Nouna   Rural              133         20     0.07
#>  2 33229 Boucle du … Kossi    Nouna   Rural               48          7     4.53
#>  3 21506 Boucle du … Kossi    Doumba… Rural              485         59     4.82
#>  4 45264 Boucle du … Bale     Pa      Rural              825        119     0.66
#>  5 26201 Boucle du … Sourou   Di      Rural               95         12    21.1 
#>  6 26077 Boucle du … Mouhoun  Dedoug… Rural              176         31     8.96
#>  7 25510 Boucle du … Kossi    Bombor… Rural              246         32     0.44
#>  8 23774 Boucle du … Banwa    Solenzo Rural              158         21     2.9 
#>  9  8332 Boucle du … Mouhoun  Ouarko… Rural               55          8     5.68
#> 10 43713 Boucle du … Mouhoun  Safane  Rural               97         14     6.46
#> # ℹ 190 more rows
#> # ℹ 9 more variables: accessible <lgl>, dist_road_km <dbl>,
#> #   food_insecurity_pct <dbl>, cost <dbl>, .weight <dbl>, .sample_id <int>,
#> #   .stage <int>, .weight_1 <dbl>, .fpc_1 <int>

# Neyman allocation using pre-computed variances
sampling_design() |>
  stratify_by(region, alloc = "neyman", variance = bfa_eas_variance) |>
  draw(n = 200) |>
  execute(bfa_eas, seed = 12)
#> # A tbl_sample: 200 × 17
#> # Weights:      222.85 [83.71, 2419.5]
#>    ea_id region      province commune urban_rural population households area_km2
#>  * <int> <fct>       <fct>    <fct>   <fct>            <dbl>      <int>    <dbl>
#>  1 31987 Boucle du … Sourou   Kiemba… Rural               24          3     7.24
#>  2  4968 Boucle du … Sourou   Kassoum Rural              303         45     4.13
#>  3  4958 Boucle du … Sourou   Kassoum Rural              142         21     8.24
#>  4 23903 Boucle du … Banwa    Solenzo Rural               94         13    13.5 
#>  5 25996 Boucle du … Mouhoun  Dedoug… Rural              538         95     8.63
#>  6 21501 Boucle du … Kossi    Doumba… Rural              247         30     7.93
#>  7 43794 Boucle du … Mouhoun  Safane  Rural               87         12     8.96
#>  8  9724 Boucle du … Banwa    Sanaba  Rural               51          6     8.02
#>  9 10979 Boucle du … Banwa    Tansila Rural               71          9     7.72
#> 10  8893 Boucle du … Bale     Pompoi  Rural              522         71     0.48
#> # ℹ 190 more rows
#> # ℹ 9 more variables: accessible <lgl>, dist_road_km <dbl>,
#> #   food_insecurity_pct <dbl>, cost <dbl>, .weight <dbl>, .sample_id <int>,
#> #   .stage <int>, .weight_1 <dbl>, .fpc_1 <int>

# Optimal allocation considering both variance and cost
sampling_design() |>
  stratify_by(region, alloc = "optimal",
              variance = bfa_eas_variance,
              cost = bfa_eas_cost) |>
  draw(n = 200) |>
  execute(bfa_eas, seed = 1)
#> # A tbl_sample: 200 × 17
#> # Weights:      222.85 [88.79, 1613]
#>    ea_id region      province commune urban_rural population households area_km2
#>  * <int> <fct>       <fct>    <fct>   <fct>            <dbl>      <int>    <dbl>
#>  1  9648 Boucle du … Banwa    Sanaba  Rural              279         35     8.37
#>  2 11547 Boucle du … Sourou   Toeni   Rural               49          5    21.4 
#>  3 41824 Boucle du … Kossi    Dokui   Rural               75          9    15.8 
#>  4 11012 Boucle du … Banwa    Tansila Rural              592         71     0.95
#>  5 32308 Boucle du … Sourou   Lanfie… Rural               57          8     9.49
#>  6  7017 Boucle du … Kossi    Madouba Rural              599         74     0.74
#>  7 36700 Boucle du … Bale     Fara    Rural              402         59     6.36
#>  8 11611 Boucle du … Nayala   Yaba    Rural              111         15     8.97
#>  9  8342 Boucle du … Mouhoun  Ouarko… Rural               94         13     8.1 
#> 10 11626 Boucle du … Nayala   Yaba    Rural               56          7     8.31
#> # ℹ 190 more rows
#> # ℹ 9 more variables: accessible <lgl>, dist_road_km <dbl>,
#> #   food_insecurity_pct <dbl>, cost <dbl>, .weight <dbl>, .sample_id <int>,
#> #   .stage <int>, .weight_1 <dbl>, .fpc_1 <int>

# Power allocation (Bankier, 1988)
sampling_design() |>
  stratify_by(
    region,
    alloc = "power",
    cv = data.frame(
      region = levels(bfa_eas$region),
      cv = c(0.40, 0.35, 0.12, 0.20, 0.30, 0.18,
             0.15, 0.38, 0.22, 0.32, 0.17, 0.45, 0.25)
    ),
    importance = data.frame(
      region = levels(bfa_eas$region),
      importance = c(60, 40, 120, 70, 80, 65,
                     50, 55, 90, 75, 45, 35, 30)
    ),
    power = 0.5
  ) |>
  draw(n = 200) |>
  execute(bfa_eas, seed = 7)
#> # A tbl_sample: 200 × 17
#> # Weights:      222.85 [139.52, 388.8]
#>    ea_id region      province commune urban_rural population households area_km2
#>  * <int> <fct>       <fct>    <fct>   <fct>            <dbl>      <int>    <dbl>
#>  1 10970 Boucle du … Banwa    Tansila Rural              488         59    10.7 
#>  2 32313 Boucle du … Sourou   Lanfie… Rural               45          6     8.78
#>  3 21006 Boucle du … Mouhoun  Bondok… Rural              102         16     3.08
#>  4 20930 Boucle du … Mouhoun  Bondok… Rural              792        121     9.76
#>  5 26158 Boucle du … Mouhoun  Dedoug… Rural              214         38     5.38
#>  6 10975 Boucle du … Banwa    Tansila Rural              141         17    16.0 
#>  7 44747 Boucle du … Nayala   Toma    Rural               73         11    10.2 
#>  8  9030 Boucle du … Bale     Poura   Urban              562         90     8.94
#>  9  4080 Boucle du … Sourou   Gomboro Rural              163         23     0.2 
#> 10 34042 Boucle du … Banwa    Sami    Rural              100         15     7.46
#> # ℹ 190 more rows
#> # ℹ 9 more variables: accessible <lgl>, dist_road_km <dbl>,
#> #   food_insecurity_pct <dbl>, cost <dbl>, .weight <dbl>, .sample_id <int>,
#> #   .stage <int>, .weight_1 <dbl>, .fpc_1 <int>

# Custom sample sizes per stratum using a data frame
custom_sizes <- data.frame(
  region = levels(bfa_eas$region),
  n = c(20, 12, 25, 18, 22, 16, 14, 15, 20, 18, 12, 10, 8)
)
sampling_design() |>
  stratify_by(region) |>
  draw(n = custom_sizes) |>
  execute(bfa_eas, seed = 2026)
#> # A tbl_sample: 210 × 17
#> # Weights:      212.24 [115.14, 414.4]
#>    ea_id region      province commune urban_rural population households area_km2
#>  * <int> <fct>       <fct>    <fct>   <fct>            <dbl>      <int>    <dbl>
#>  1 34795 Boucle du … Sourou   Tougan  Rural               45          6    14.0 
#>  2 44496 Boucle du … Mouhoun  Tcheri… Rural              180         29    16.6 
#>  3  9624 Boucle du … Banwa    Sanaba  Rural              177         23     6.56
#>  4  7012 Boucle du … Kossi    Madouba Rural              387         48     6.18
#>  5 44420 Boucle du … Mouhoun  Tcheri… Rural               63         10    22.3 
#>  6   515 Boucle du … Kossi    Barani  Rural              651         85     1.11
#>  7  1144 Boucle du … Bale     Boromo  Rural               75         10     8.92
#>  8  8378 Boucle du … Bale     Ouri    Rural              693         89     9.26
#>  9  4909 Boucle du … Sourou   Kassoum Rural              662         98     0.65
#> 10 21094 Boucle du … Kossi    Bouras… Rural               25          3     5.1 
#> # ℹ 200 more rows
#> # ℹ 9 more variables: accessible <lgl>, dist_road_km <dbl>,
#> #   food_insecurity_pct <dbl>, cost <dbl>, .weight <dbl>, .sample_id <int>,
#> #   .stage <int>, .weight_1 <dbl>, .fpc_1 <int>

# Multiple stratification variables
sampling_design() |>
  stratify_by(region, urban_rural, alloc = "proportional") |>
  draw(n = 300) |>
  execute(bfa_eas, seed = 2025)
#> # A tbl_sample: 300 × 17
#> # Weights:      148.35 [82, 176]
#>    ea_id region      province commune urban_rural population households area_km2
#>  * <int> <fct>       <fct>    <fct>   <fct>            <dbl>      <int>    <dbl>
#>  1 34056 Boucle du … Banwa    Sami    Rural              136         20     7.25
#>  2 12794 Boucle du … Kossi    Djibas… Rural               61          8     8.98
#>  3 36709 Boucle du … Bale     Fara    Rural              341         50     6.47
#>  4 11042 Boucle du … Banwa    Tansila Rural              118         14     8.43
#>  5  1167 Boucle du … Bale     Boromo  Rural              120         16   181.  
#>  6 23952 Boucle du … Banwa    Solenzo Rural              106         14     7.32
#>  7 32327 Boucle du … Sourou   Lanfie… Rural              108         15     8.21
#>  8 44753 Boucle du … Nayala   Toma    Rural               72         10     7.27
#>  9 12732 Boucle du … Kossi    Djibas… Rural              772         99     8.62
#> 10 11071 Boucle du … Banwa    Tansila Rural               26          3     7.28
#> # ℹ 290 more rows
#> # ℹ 9 more variables: accessible <lgl>, dist_road_km <dbl>,
#> #   food_insecurity_pct <dbl>, cost <dbl>, .weight <dbl>, .sample_id <int>,
#> #   .stage <int>, .weight_1 <dbl>, .fpc_1 <int>