stratify_by() specifies the stratification variables and, optionally, an
allocation method for a sampling design. Stratification ensures that every
subgroup defined by the stratification variables is represented in the sample.
stratify_by(.data, ..., alloc = NULL, variance = NULL, cost = NULL)
A sampling_design object (piped from sampling_design() or
stage()).
<tidy-select> Stratification
variables. These should be categorical variables that define the strata.
Character string specifying the allocation method. One of:
NULL (default): No allocation; n in draw() is per stratum
"equal": Equal allocation across strata
"proportional": Proportional to stratum size
"neyman": Neyman optimal allocation (requires variance)
"optimal": Cost-variance optimal allocation (requires variance and cost)
A data frame with stratum variances for Neyman or optimal
allocation. Must contain columns for all stratification variables plus
a var column with variance estimates.
A data frame with stratum costs for optimal allocation.
Must contain columns for all stratification variables plus a cost
column with per-unit costs.
A modified sampling_design object with stratification specified.
When no alloc is specified, the n parameter in draw() is interpreted
as the sample size per stratum. When an alloc method is specified,
n becomes the total sample size to be distributed according to the
allocation method.
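The two readings of n can be illustrated with plain arithmetic. This is a minimal base R sketch, not the package's code: the four region names are hypothetical, and it assumes that "equal" allocation splits the total evenly across strata.

```r
# Hypothetical strata (not taken from any real frame)
strata <- c("North", "South", "East", "West")
H <- length(strata)

# No alloc: n = 100 is drawn from each stratum
per_stratum <- setNames(rep(100, H), strata)
total_without_alloc <- sum(per_stratum)  # 100 * 4 strata

# alloc = "equal" (assumed to mean an even split): n = 1000 is the total
equal_alloc <- setNames(rep(1000 / H, H), strata)
total_with_alloc <- sum(equal_alloc)
```

Without an allocation method the overall sample size grows with the number of strata; with one, it stays fixed at n.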
Proportional allocation: each stratum receives n × (N_h / N) units, where N_h is the stratum population size and N is the total population size.
Neyman allocation: minimizes variance for a fixed total sample size. Each stratum receives n × (N_h × S_h) / Σ(N_h × S_h) units, where S_h is the stratum standard deviation and the sum runs over all strata.
Optimal allocation: minimizes variance for a fixed total cost (or cost for a fixed variance). Each stratum receives n × (N_h × S_h / √C_h) / Σ(N_h × S_h / √C_h) units, where C_h is the per-unit cost in stratum h.
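The three formulas share one pattern: n is split in proportion to a per-stratum weight. The base R sketch below works through them by hand. It is not the package's implementation; the stratum sizes and costs are hypothetical, allocate() is an illustrative helper, and S_h is assumed to be the square root of the var column. The results are fractional, and how the package rounds them is not covered here.

```r
# Hypothetical stratum summaries (N_h and cost are made-up values)
strata <- data.frame(
  region = c("North", "South", "East", "West"),
  N_h    = c(4000, 3000, 2000, 1000),   # stratum population sizes
  var    = c(15.2, 22.1, 18.5, 20.3),   # stratum variance estimates
  cost   = c(1, 4, 1, 4)                # per-unit costs C_h
)
n <- 1000

# Assumption: S_h is the square root of the supplied variance
strata$S_h <- sqrt(strata$var)

# Illustrative helper: split n in proportion to non-negative weights
allocate <- function(n, weights) n * weights / sum(weights)

# Proportional: weights are N_h
strata$n_prop <- allocate(n, strata$N_h)

# Neyman: weights are N_h * S_h
strata$n_neyman <- allocate(n, strata$N_h * strata$S_h)

# Optimal: weights are N_h * S_h / sqrt(C_h)
strata$n_opt <- allocate(n, strata$N_h * strata$S_h / sqrt(strata$cost))

strata[, c("region", "n_prop", "n_neyman", "n_opt")]
```

When every stratum has the same per-unit cost, the √C_h terms cancel and the optimal allocation coincides with the Neyman allocation.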
For custom stratum-specific sample sizes or rates, pass a data frame
directly to the n or frac argument in draw(). The data frame must
contain columns for all stratification variables plus an n or frac column.
Auxiliary data frames (variance, cost) must contain:
All stratification variable columns (used as join keys)
The appropriate value column (var or cost)
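Because the stratification columns act as join keys, it can help to confirm that an auxiliary data frame covers every stratum in the frame before passing it in. The following base R sketch shows one way to do that check by hand; the frame, the variance values, and the variable names are hypothetical, and the validation that stratify_by() itself performs is not described here.

```r
# Hypothetical sampling frame with two stratification variables
frame <- data.frame(
  region      = c("North", "North", "South", "East", "West", "West"),
  urban_rural = c("urban", "rural", "urban", "urban", "urban", "rural")
)

# Hypothetical variance table; the West/rural stratum is deliberately missing
var_df <- data.frame(
  region      = c("North", "North", "South", "East", "West"),
  urban_rural = c("urban", "rural", "urban", "urban", "urban"),
  var         = c(15.2, 22.1, 18.5, 20.3, 17.0)
)

keys <- c("region", "urban_rural")

# One row per stratum that actually occurs in the frame
frame_strata <- unique(frame[keys])

# Left join on the keys; strata without a match get NA in the value column
joined <- merge(frame_strata, var_df, by = keys, all.x = TRUE)

# Strata for which no variance estimate was supplied
missing_strata <- joined[is.na(joined$var), keys, drop = FALSE]
missing_strata
```

The same check applies to a cost data frame by swapping var for cost.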
sampling_design() for creating designs,
draw() for specifying sample sizes,
cluster_by() for cluster sampling
if (FALSE) { # \dontrun{
# Simple stratification (n per stratum)
sampling_design() |>
  stratify_by(region) |>
  draw(n = 100) |>
  execute(frame, seed = 42)

# Proportional allocation
sampling_design() |>
  stratify_by(region, alloc = "proportional") |>
  draw(n = 1000) |>
  execute(frame, seed = 42)

# Neyman allocation
var_df <- data.frame(
  region = c("North", "South", "East", "West"),
  var = c(15.2, 22.1, 18.5, 20.3)
)
sampling_design() |>
  stratify_by(region, alloc = "neyman", variance = var_df) |>
  draw(n = 1000) |>
  execute(frame, seed = 42)

# Custom sizes (pass data frame to draw)
sizes_df <- data.frame(
  region = c("North", "South", "East", "West"),
  n = c(200, 350, 250, 200)
)
sampling_design() |>
  stratify_by(region) |>
  draw(n = sizes_df) |>
  execute(frame, seed = 42)

# Multiple stratification variables
sampling_design() |>
  stratify_by(region, urban_rural, alloc = "proportional") |>
  draw(n = 2000) |>
  execute(frame, seed = 42)
} # }