stratify_by() specifies the stratification variables and, optionally, an allocation method for a sampling design. Stratification ensures that every subgroup defined by the stratification variables is represented in the sample.

stratify_by(.data, ..., alloc = NULL, variance = NULL, cost = NULL)

Arguments

.data

A sampling_design object (piped from sampling_design() or stage()).

...

<tidy-select> Stratification variables. These should be categorical variables that define the strata.

alloc

Character string specifying the allocation method, or NULL. One of:

  • NULL (default): No allocation; n in draw() is per stratum

  • "equal": Equal allocation across strata

  • "proportional": Proportional to stratum size

  • "neyman": Neyman optimal allocation (requires variance)

  • "optimal": Cost-variance optimal allocation (requires variance and cost)

variance

A data frame with stratum variances for Neyman or optimal allocation. Must contain columns for all stratification variables plus a var column with variance estimates.

cost

A data frame with stratum costs for optimal allocation. Must contain columns for all stratification variables plus a cost column with per-unit costs.

Value

A modified sampling_design object with stratification specified.

Details

Allocation Methods

When no alloc is specified, the n parameter in draw() is interpreted as the sample size per stratum. When an alloc method is specified, n becomes the total sample size to be distributed according to the allocation method.
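The two readings of n can be checked with plain arithmetic. The following base R sketch does not call the package; the four region labels are assumed for illustration, and the equal split stands in for any allocation method.

```r
# Assumed strata labels, for illustration only
strata <- c("North", "South", "East", "West")
n <- 100

# No alloc: n is read per stratum, so the total grows with the number of strata
per_stratum <- setNames(rep(n, length(strata)), strata)
total_without_alloc <- sum(per_stratum)

# With an alloc method (equal allocation shown): n is the total to distribute
equal_split <- setNames(rep(n / length(strata), length(strata)), strata)
total_with_alloc <- sum(equal_split)

c(without_alloc = total_without_alloc, with_alloc = total_with_alloc)
```

With four strata, the same n = 100 therefore yields 400 sampled units in the first case and 100 in the second.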

Equal Allocation

Each stratum receives n/H units, where H is the number of strata.

Proportional Allocation

Each stratum receives n × (N_h/N) units, where N_h is the stratum population size and N is the total population size.
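A minimal base R sketch of the proportional formula, using made-up stratum population sizes (N_h below is an assumption, not data from the package):

```r
# Assumed stratum population sizes, for illustration only
N_h <- c(North = 5000, South = 12000, East = 8000, West = 15000)
n <- 1000

# n_h = n * N_h / N, with N the total population size
n_h <- n * N_h / sum(N_h)
n_h
```

These sizes were chosen so that every n_h is a whole number. In general the formula gives fractional values that must be rounded; how stratify_by() rounds is not described here.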

Neyman Allocation

Minimizes variance for a fixed total sample size. Each stratum receives: n × (N_h × S_h) / Σ(N_h × S_h), where S_h is the stratum standard deviation (the square root of the stratum variance) and the sum runs over all strata.
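The Neyman formula can be sketched in base R as follows. neyman_alloc() is a hypothetical helper written for this illustration, not a function of the package, and the stratum sizes are assumed; the variances are the ones used in the Examples section.

```r
# Hypothetical helper, not part of the package: split a total n by N_h * S_h
neyman_alloc <- function(n, N_h, var_h) {
  S_h <- sqrt(var_h)   # standard deviation derived from the variance
  w <- N_h * S_h       # allocation weight per stratum
  n * w / sum(w)
}

N_h   <- c(North = 5000, South = 12000, East = 8000, West = 15000)  # assumed sizes
var_h <- c(North = 15.2, South = 22.1, East = 18.5, West = 20.3)    # stratum variances

n_h <- neyman_alloc(1000, N_h, var_h)
round(n_h, 1)
```

Strata with above-average variability (South here) receive more than their proportional share. When all variances are equal, the result coincides with proportional allocation.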

Optimal Allocation

Minimizes variance for a fixed total cost (or cost for a fixed variance). Each stratum receives: n × (N_h × S_h / √C_h) / Σ(N_h × S_h / √C_h), where C_h is the per-unit cost in stratum h and S_h is the stratum standard deviation.
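A base R sketch of the cost-variance formula. optimal_alloc() is a hypothetical helper for illustration only, and the stratum sizes and per-unit costs are assumed values.

```r
# Hypothetical helper, not part of the package: split n by N_h * S_h / sqrt(C_h)
optimal_alloc <- function(n, N_h, var_h, cost_h) {
  w <- N_h * sqrt(var_h) / sqrt(cost_h)   # weight shrinks as unit cost grows
  n * w / sum(w)
}

N_h    <- c(North = 5000, South = 12000, East = 8000, West = 15000)  # assumed sizes
var_h  <- c(North = 15.2, South = 22.1, East = 18.5, West = 20.3)    # stratum variances
cost_h <- c(North = 10, South = 10, East = 25, West = 40)            # assumed unit costs

n_opt <- optimal_alloc(1000, N_h, var_h, cost_h)

# With equal costs the cost term cancels and the result is the Neyman allocation
n_equal_cost <- optimal_alloc(1000, N_h, var_h, rep(1, 4))

round(rbind(optimal = n_opt, equal_cost = n_equal_cost), 1)
```

Compared with the equal-cost case, the most expensive stratum (West in this sketch) receives a smaller share of the sample.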

Custom Allocation

For custom stratum-specific sample sizes or rates, pass a data frame directly to the n or frac argument in draw(). The data frame must contain columns for all stratification variables plus an n or frac column.
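With more than one stratification variable, the custom data frame needs one row per combination of strata. One way to build it in base R is sketched below; the urban_rural levels and the sample sizes are assumptions for illustration, and the final pipeline is shown as a comment because it needs the package and a sampling frame.

```r
# One row per region x urban_rural combination (levels are assumed)
sizes_df <- expand.grid(
  region = c("North", "South", "East", "West"),
  urban_rural = c("Urban", "Rural"),
  stringsAsFactors = FALSE
)

# Assumed stratum-specific sample sizes: more units in urban strata
sizes_df$n <- ifelse(sizes_df$urban_rural == "Urban", 150, 100)

sizes_df

# Intended use, following the description above:
# sampling_design() |>
#   stratify_by(region, urban_rural) |>
#   draw(n = sizes_df)
```

The same layout applies to sampling rates: replace the n column with a frac column and pass the data frame to the frac argument instead.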

Data Frame Requirements

Auxiliary data frames (variance, cost) must contain:

  • All stratification variable columns (used as join keys)

  • The appropriate value column (var or cost)
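These requirements can be checked before building a design. The helper below is a hypothetical base R sketch, not part of the package; it assumes the frame is an ordinary data frame and verifies that the auxiliary data frame has the required columns and a value for every stratum found in the frame.

```r
# Hypothetical pre-flight check, not part of the package
check_aux <- function(frame, aux, keys, value) {
  # Required columns: all join keys plus the value column
  missing_cols <- setdiff(c(keys, value), names(aux))
  if (length(missing_cols) > 0) {
    stop("auxiliary data frame lacks column(s): ",
         paste(missing_cols, collapse = ", "))
  }
  # Every stratum present in the frame needs a matching, non-missing value
  frame_strata <- unique(frame[, keys, drop = FALSE])
  joined <- merge(frame_strata, aux[, c(keys, value), drop = FALSE],
                  by = keys, all.x = TRUE)
  unmatched <- joined[is.na(joined[[value]]), keys, drop = FALSE]
  if (nrow(unmatched) > 0) {
    stop("no ", value, " value for strata: ",
         paste(do.call(paste, c(unmatched, sep = "/")), collapse = ", "))
  }
  invisible(TRUE)
}

# Toy frame and variance table, for illustration only
frame <- data.frame(
  region = rep(c("North", "South", "East", "West"), times = c(3, 2, 4, 1))
)
var_df <- data.frame(
  region = c("North", "South", "East", "West"),
  var = c(15.2, 22.1, 18.5, 20.3)
)

check_aux(frame, var_df, keys = "region", value = "var")
```

Dropping a row from var_df, or renaming its var column, makes the check stop with an error naming the missing strata or columns.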

See also

sampling_design() for creating designs, draw() for specifying sample sizes, cluster_by() for cluster sampling

Examples

if (FALSE) { # \dontrun{
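# `frame` is assumed to be a data frame of population units that contains
# the stratification columns used below (region, urban_rural)

# Simple stratification (n per stratum)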
sampling_design() |>
  stratify_by(region) |>
  draw(n = 100) |>
  execute(frame, seed = 42)

# Proportional allocation
sampling_design() |>
  stratify_by(region, alloc = "proportional") |>
  draw(n = 1000) |>
  execute(frame, seed = 42)

# Neyman allocation
var_df <- data.frame(
  region = c("North", "South", "East", "West"),
  var = c(15.2, 22.1, 18.5, 20.3)
)
sampling_design() |>
  stratify_by(region, alloc = "neyman", variance = var_df) |>
  draw(n = 1000) |>
  execute(frame, seed = 42)

# Custom sizes (pass data frame to draw)
sizes_df <- data.frame(
  region = c("North", "South", "East", "West"),
  n = c(200, 350, 250, 200)
)
sampling_design() |>
  stratify_by(region) |>
  draw(n = sizes_df) |>
  execute(frame, seed = 42)

# Multiple stratification variables
sampling_design() |>
  stratify_by(region, urban_rural, alloc = "proportional") |>
  draw(n = 2000) |>
  execute(frame, seed = 42)
} # }