Constrained Stratified Allocation

Distribute a total sample size across strata defined by a single stratification variable, under a fixed total $n$, target CV, or budget. When the design uses multiple stratification variables (e.g. region and urbanicity), cross them into a single variable beforehand so that each row of frame represents one unique stratum.

Usage

n_alloc(frame, ...)

# Default S3 method
n_alloc(
  frame,
  ...,
  domains = NULL,
  n = NULL,
  cv = NULL,
  budget = NULL,
  alloc = c("neyman", "optimal", "proportional", "power"),
  unit_cost = NULL,
  alpha = 0.05,
  deff = 1,
  resp_rate = 1,
  min_n = NULL,
  power_q = 0.5,
  plan = NULL
)

# S3 method for class 'svyplan_prec'
n_alloc(frame, ..., n = NULL, cv = NULL, budget = NULL)

Arguments

frame

For the default method: a stratum-level data frame describing the population you want to sample. Each row represents one stratum, a subgroup of the population defined by a stratification variable such as region, age group, or urbanicity. The values in this frame typically come from a census, a population register, or a previous survey. Any stratum table with the columns below works, for example the pool summary of an executed samplyr sample (samplyr::frame_summary()), once the measure columns are added.

When a design stratifies by several variables at once (e.g. region $\times$ urbanicity), cross them into a single variable before calling n_alloc (e.g. with interaction()) so that each row maps to exactly one population cell.

Required columns:

N

Number of units (e.g. households, individuals) in each stratum. These are population counts, not sample sizes. Must be positive and finite.

sd or var

A measure of how spread out the variable of interest is within each stratum. Provide exactly one:

sd: the stratum standard deviation ($\sqrt{\text{variance}}$), or
var: the stratum variance.

Both must be non-negative and finite. When all strata have equal variability (or variability is unknown), a constant column (e.g. sd = 1) yields proportional-to-size allocation.

Optional columns:

stratum: A label identifying each stratum (e.g. "Urban", "Rural"). If omitted, row numbers are used. Must be unique, or unique within each domain when domains is set.
mean or p: The stratum population mean or proportion of the variable of interest. Required when solving for cv, because the coefficient of variation is defined relative to the mean. Use mean for continuous variables and p (in $[0, 1]$) for binary (yes/no) variables.
unit_cost: Per-unit interviewing cost in each stratum (positive, finite). Set higher values for strata that are more expensive to reach. Defaults to 1 everywhere (equal cost).
max_weight: Maximum allowed sampling weight $N_h / n_h$. Caps how under-represented a stratum can be. Use NA for strata without a cap.
take_all: Logical (or 0/1). If TRUE, every unit in the stratum is included (a census stratum). Useful for strata small enough to enumerate completely.

For svyplan_prec objects: a precision result from prec_alloc().

...

Additional arguments passed to methods. Unused arguments are rejected.

domains

Character vector of column names in frame to treat as domain identifiers, or NULL (default) for no domains. All names must exist in frame. Domains define sub-populations that each contain one or more strata. When cv is the target, precision is enforced within every domain (see Details).

n

Total sample size. Specify exactly one of n, cv, or budget.

cv

Target coefficient of variation (relative standard error). For example, cv = 0.05 means the standard error of the estimated population mean or total should be at most 5 percent of the estimate. Requires mean or p in frame. When domain columns are present, this target is enforced in each domain. Specify exactly one of n, cv, or budget.
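The arithmetic behind a CV target can be sketched in base R. This is a minimal illustration, assuming Neyman allocation, the usual stratified variance with an element-level finite population correction, and no constraints; it does not call n_alloc(), and the names W, share, A, and B are illustrative only.

```r
# Hypothetical stratum frame (same values as the first frame in Examples)
frame <- data.frame(
  N    = c(4000, 3000, 3000),
  sd   = c(10, 15, 8),
  mean = c(50, 60, 55)
)

W     <- frame$N / sum(frame$N)        # stratum weights N_h / N
ybar  <- sum(W * frame$mean)           # population mean
share <- frame$N * frame$sd            # Neyman: n_h proportional to N_h * S_h
share <- share / sum(share)

# For n_h = n * share_h the variance of the mean is A / n - B
A <- sum(W^2 * frame$sd^2 / share)
B <- sum(W^2 * frame$sd^2 / frame$N)   # finite population correction term

cv_target <- 0.03
n_total   <- A / ((cv_target * ybar)^2 + B)   # smallest n meeting the target
n_h       <- n_total * share

# CV achieved by the continuous allocation
se          <- sqrt(sum(W^2 * frame$sd^2 / n_h * (1 - n_h / frame$N)))
cv_achieved <- se / ybar

# Rounding every stratum up keeps the integer design at or below the target
n_int <- ceiling(n_h)
```

Under these assumptions n_total should agree with the continuous optimum that n_alloc(frame, cv = 0.03) reports in Examples, and sum(n_int) with its field design.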

budget

Total field budget, in the same units as unit_cost (or cost_psu and cost_ssu in cluster mode). Specify exactly one of n, cv, or budget.

alloc

Allocation rule: "neyman" (default), "optimal", "proportional", or "power".

unit_cost

Optional scalar or length-nrow(frame) vector of per-stratum unit costs, overriding frame$unit_cost.

alpha

Significance level used for the margin of error (moe); the confidence level is 1 - alpha. Default 0.05.

deff

Design effect multiplier (> 0). Default 1 (no inflation).

resp_rate

Expected response rate, in (0, 1]. Default 1.

min_n

Optional minimum sample size per stratum.

power_q

Bankier power parameter from 0 to 1, used when alloc = "power". Default 0.5.

plan

Optional svyplan() object providing design defaults.

Value

A svyplan_n object with type = "alloc" and a stratum-level allocation table in $detail (also available via as.data.frame()), with columns:

stratum, N, sd, unit_cost: Stratum identifiers and inputs carried over from the frame. In cluster mode unit_cost is the derived effective per-element cost cost_psu / psu_size + cost_ssu (1 when no stage costs were given), while sd stays the input stratum SD.
n: Allocated sample size (continuous).
n_int: Integer allocation. In n mode the requested total is preserved (bounded largest-remainder rounding); in cv mode each stratum is rounded up so the integer design meets the target; in budget mode units are added by variance reduction per unit cost so the integer design stays within budget. Always inside the integerized bounds (ceiling(.lower), floor(.upper)); an error is raised when no integer allocation can satisfy them.
weight: Design weight N / n.
n_eff: Effective sample size n * resp_rate / deff.
.lower, .upper: Bounds applied to the stratum (from min_n, max_weight, take_all, or N).
.binding: Whether the allocation sits on one of its bounds.
take_all, mean: Present when take-all strata or stratum means were supplied.
psu_size, n_psu: Cluster mode only: the continuous per-stratum take and implied number of PSUs (n / psu_size).
n_psu_int, psu_size_int: Cluster mode only: the whole-unit field design (whole PSUs and whole takes), chosen so that budget designs stay within budget and cv designs meet the target; n_int = n_psu_int * psu_size_int.
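How the bound and weight columns relate to the inputs can be sketched in base R. This is an illustration under stated assumptions, not the package's code: it assumes the lower bound is the larger of min_n and N / max_weight, that bounds are capped at N, and that a take-all stratum has lower = upper = N. The allocation n and the values of resp_rate and deff are hypothetical.

```r
N          <- c(4000, 3000, 3000)
max_weight <- c(25, 20, NA)          # NA means no cap
take_all   <- c(FALSE, FALSE, TRUE)
min_n      <- 40

# weight = N / n <= max_weight is equivalent to n >= N / max_weight
lower <- pmax(min_n, N / max_weight, na.rm = TRUE)
lower <- pmin(lower, N)              # assumed: a bound never exceeds N
upper <- N
lower[take_all] <- N[take_all]       # census stratum: lower = upper = N

n         <- c(210.3, 193.1, 3000)   # hypothetical continuous allocation
resp_rate <- 0.9                     # hypothetical
deff      <- 1.2                     # hypothetical

weight  <- N / n                     # design weight
n_eff   <- n * resp_rate / deff      # effective sample size
binding <- n <= lower | n >= upper   # allocation sits on a bound
```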

The result also carries an $operational list describing the integer field design: n (total), cost, se, moe, and cv, all recomputed from the integer allocation. Top-level n, se, moe, and cv describe the continuous optimum, while as.integer() returns the operational total.

Details

Building the frame

The frame is a data frame where each row is one stratum of your target population. It summarizes what you know about each subgroup before sampling. A typical workflow:

Identify strata from a census or register (e.g. provinces, urban/rural areas, age groups).
Look up N: the population count per stratum.
Estimate sd: the standard deviation of your key variable within each stratum (from a pilot survey, a previous census, or expert judgement). If unknown, set sd = 1 everywhere for proportional allocation.
Add mean or p if you want to solve for a target CV.

A minimal frame:

frame <- data.frame(
  stratum = c("Urban", "Rural"),
  N       = c(50000, 120000),
  sd      = c(12, 20)
)

When a design stratifies by several variables (e.g. region $\times$ urbanicity), cross them into one variable first:

frame$stratum <- interaction(frame$region, frame$urban, drop = TRUE)

This ensures that each row maps to exactly one population cell and that the allocation formulas apply to the correct per-stratum N and sd pairs.
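When unit-level data are available, the stratum frame can be derived from them. The sketch below uses base R only and a simulated register; the data frame units and its columns region, urban, and income are hypothetical.

```r
set.seed(1)

# Hypothetical unit-level register: one row per household
units <- data.frame(
  region = rep(c("North", "South"), each = 500),
  urban  = rep(c("Urban", "Rural"), times = 500),
  income = rnorm(1000, mean = 50, sd = 10)
)

# Cross the two stratification variables into a single stratum label
units$stratum <- interaction(units$region, units$urban, drop = TRUE)

# One row per stratum: population count, SD, and mean of the key variable
frame <- data.frame(
  stratum = levels(units$stratum),
  N       = as.vector(table(units$stratum)),
  sd      = as.vector(tapply(units$income, units$stratum, stats::sd)),
  mean    = as.vector(tapply(units$income, units$stratum, mean))
)
```

table() and tapply() both return results in factor-level order, so the columns line up with levels(units$stratum).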

Cluster designs within strata

Adding a delta_psu column to the frame turns the allocation into a stratified two-stage design (e.g. enumeration areas then households within each stratum). Under the cluster variance model the problem reduces to the element allocation above with the stratum SD inflated to sd * sqrt(k_psu * (1 + delta_psu * (psu_size - 1))) and, when stage costs are given, a per-element cost of cost_psu / psu_size + cost_ssu. All solve modes, allocation methods, and constraints work unchanged. The n, cv, and budget modes keep their meanings.

Cluster-mode columns:

delta_psu (required): within-PSU homogeneity per stratum, e.g. from varcomp() with strata.
k_psu (optional, default 1): variance ratio per stratum.
psu_size (optional): fixes the per-stratum take. Any NA entries are replaced by the cost-optimal take sqrt(cost_psu / cost_ssu * (1 - delta_psu) / delta_psu).
cost_psu, cost_ssu (together): per-PSU and per-element costs. These are required for budget mode or when psu_size is not fixed. They replace unit_cost, which is not allowed in this mode.

The finite population correction stays at the element level, an approximation consistent with n_cluster()'s variance model.
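The reduction to an element allocation can be followed step by step in base R. This is a sketch that applies the formulas quoted above to the cluster frame from Examples, assuming k_psu = 1, Neyman allocation, and a CV target of 0.05; it does not call n_alloc(), and the helper names (fc, sd_eff, A, B) are illustrative.

```r
# Hypothetical two-stage frame (same values as frame_cluster in Examples)
fc <- data.frame(
  N         = c(50000, 150000),
  sd        = c(0.45, 0.48),
  mean      = c(0.35, 0.25),
  delta_psu = c(0.03, 0.08),
  cost_psu  = c(300, 600),
  cost_ssu  = c(40, 60)
)
k_psu <- 1

# Cost-optimal take per PSU
psu_size <- sqrt(fc$cost_psu / fc$cost_ssu * (1 - fc$delta_psu) / fc$delta_psu)

# Inflated stratum SD and effective per-element cost
sd_eff    <- fc$sd * sqrt(k_psu * (1 + fc$delta_psu * (psu_size - 1)))
unit_cost <- fc$cost_psu / psu_size + fc$cost_ssu

# Element-level Neyman allocation on the inflated SDs
W     <- fc$N / sum(fc$N)
ybar  <- sum(W * fc$mean)
share <- fc$N * sd_eff / sum(fc$N * sd_eff)

A <- sum(W^2 * sd_eff^2 / share)
B <- sum(W^2 * sd_eff^2 / fc$N)      # element-level finite population correction

n_total <- A / ((0.05 * ybar)^2 + B)
n_h     <- n_total * share
n_psu   <- n_h / psu_size            # implied (continuous) number of PSUs
```

Under these assumptions n_total should be close to the continuous optimum shown for n_alloc(frame_cluster, cv = 0.05) in Examples.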

Because delta_psu already accounts for the clustering, leave deff at 1 unless it captures a different source of design effect (e.g. weighting loss). A clustering deff on top of delta_psu would double-count. The constraints min_n, max_weight, and take_all stay in element units. For fielding, use the whole-unit design in n_psu_int and psu_size_int (n_int = n_psu_int * psu_size_int). Its actual field cost and precision are reported in $operational and, in budget mode, never exceed the budget.

Domains vs. strata

Domains are sub-populations, each made up of one or more strata, identified by the columns named in domains. When cv is specified, the algorithm finds the minimum total $n$ such that the worst-case domain CV meets the target, so every domain achieves the required precision.

In n or budget mode, domains affect reporting only: per-domain precision metrics appear in $domains but the allocation itself treats all strata globally.
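The worst-case rule for cv mode can be sketched in base R: allocation shares are fixed globally, the total needed by each domain is computed separately, and the largest of these totals is used. This is an illustration assuming power allocation with power_q = 0.3 and no constraints; it does not call n_alloc(), and the names fd, share, and n_needed are illustrative.

```r
# Hypothetical frame (same values as frame_domains in Examples)
fd <- data.frame(
  province = c("North", "North", "South", "South"),
  N        = c(2000, 3000, 1800, 3200),
  sd       = c(12, 18, 10, 16),
  mean     = c(55, 48, 58, 50)
)

q         <- 0.3
cv_target <- 0.04

# Global power-allocation shares: n_h proportional to S_h * N_h^q
share <- fd$sd * fd$N^q
share <- share / sum(share)

# Total n each domain would need to reach the target on its own
n_needed <- sapply(split(seq_len(nrow(fd)), fd$province), function(i) {
  W    <- fd$N[i] / sum(fd$N[i])            # weights within the domain
  ybar <- sum(W * fd$mean[i])               # domain mean
  A    <- sum(W^2 * fd$sd[i]^2 / share[i])
  B    <- sum(W^2 * fd$sd[i]^2 / fd$N[i])   # finite population correction
  A / ((cv_target * ybar)^2 + B)
})

n_total <- max(n_needed)   # the hardest domain drives the total
```

The domain with the largest n_needed sits exactly on the target; the others end up with a CV below it.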

Allocation methods

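Allocation is controlled by the alloc parameter (same methods as strata_bound()). Below, $N_h$ is the stratum population size, $S_h$ the stratum standard deviation, and $c_h$ the per-unit cost: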

proportional: $n_h \propto N_h$
neyman: $n_h \propto N_h S_h$
optimal: $n_h \propto N_h S_h / \sqrt{c_h}$
power: Bankier (1988), $n_h \propto S_h N_h^{q}$, where $q$ is power_q
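The four rules differ only in the quantity the stratum sample sizes are proportional to. A minimal base-R sketch for a fixed total, ignoring rounding and constraints (the values of N, S, cost, and q are hypothetical):

```r
N    <- c(4000, 3000, 3000)   # stratum population sizes N_h
S    <- c(10, 15, 8)          # stratum standard deviations S_h
cost <- c(1, 1.5, 1)          # per-unit costs c_h
q    <- 0.5                   # power parameter
n    <- 600                   # fixed total sample size

# Unnormalised shares for each rule
shares <- list(
  proportional = N,
  neyman       = N * S,
  optimal      = N * S / sqrt(cost),
  power        = S * N^q
)

# One column per rule, one row per stratum; every column sums to n
alloc <- sapply(shares, function(s) n * s / sum(s))
```

With q = 1 the power rule coincides with Neyman allocation, and with q = 0 the shares depend on $S_h$ alone.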

Stratum allocations are rounded to integers using the ORIC method (Cont and Heidari, 2015). Constraints (min_n, max_weight, take_all) are enforced via recursive Neyman allocation (RNA, Wesolowski et al., 2021).

When budget is specified, the allocation is scaled up to the largest sample whose total cost, $\sum_h c_h n_h$, stays within budget.
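The budget arithmetic can be sketched in base R. This is an illustration only: it assumes a take-all stratum is costed first and the remaining budget is spread over the other strata by the optimal rule, with no other bound binding. It does not call n_alloc(), which also enforces min_n and max_weight.

```r
N        <- c(4000, 3000, 3000)
S        <- c(10, 15, 8)
cost     <- c(1, 1.5, 1)
take_all <- c(FALSE, FALSE, TRUE)   # third stratum is a census stratum
budget   <- 3500

n_h <- numeric(length(N))
n_h[take_all] <- N[take_all]        # census strata take every unit

# Budget left after paying for the census strata
remaining <- budget - sum(cost[take_all] * N[take_all])

# Optimal shares for the sampled strata: n_h proportional to N_h S_h / sqrt(c_h)
free  <- !take_all
share <- N[free] * S[free] / sqrt(cost[free])

# Scale the shares so the sampled strata spend exactly the remaining budget
n_h[free] <- remaining * share / sum(cost[free] * share)

total_cost <- sum(cost * n_h)
n_total    <- sum(n_h)
```

Under these assumptions n_total should be close to the continuous optimum shown for the budget example in Examples.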

References

Valliant, R., Dever, J. A., & Kreuter, F. (2018). Practical Tools for Designing and Weighting Survey Samples (2nd ed.). Springer. Chapter 5.

Bankier, M. D. (1988). Power allocations: determining sample sizes for subnational areas. The American Statistician, 42(3), 174–177.

Examples

frame <- data.frame(
  stratum = c("A", "B", "C"),
  N    = c(4000, 3000, 3000),
  sd   = c(10, 15, 8),
  mean = c(50, 60, 55),
  unit_cost = c(1, 1.5, 1)
)

n_alloc(frame, n = 600)
#> Stratum allocation (neyman, 3 strata)
#> field design: n = 600, cv = 0.0079, cost = 724
#> continuous optimum: n = 600, cv = 0.0079, se = 0.4305
n_alloc(frame, cv = 0.03)
#> Stratum allocation (neyman, 3 strata)
#> field design: n = 46, cv = 0.0294, cost = 56
#> continuous optimum: n = 44.23479, cv = 0.0300, se = 1.6350

frame_constraints <- transform(
  frame,
  max_weight = c(25, 20, NA),
  take_all = c(FALSE, FALSE, TRUE)
)

n_alloc(frame_constraints, budget = 3500, alloc = "optimal", min_n = 40)
#> Stratum allocation (optimal, 3 strata)
#> field design: n = 3403, cv = 0.0076, cost = 3500
#> continuous optimum: n = 3403.425, cv = 0.0076, se = 0.4125
#> (min_n = 40)

frame_domains <- data.frame(
  province = c("North", "North", "South", "South"),
  stratum = c("Urban", "Rural", "Urban", "Rural"),
  N    = c(2000, 3000, 1800, 3200),
  sd   = c(12, 18, 10, 16),
  mean = c(55, 48, 58, 50)
)

n_alloc(frame_domains, domains = "province",
       cv = 0.04, alloc = "power", power_q = 0.3)
#> Stratum allocation (power, 4 strata)
#> field design: n = 112, cv = 0.0270, cost = 112
#> continuous optimum: n = 110.7422, cv = 0.0272, se = 1.4076
#> Domains: 2
#> ---
#>  province .domain .n       .se      .moe     .cv    .cost
#>  North    5_North 59.23404 2.032000 3.982647 0.0400 59   
#>  South    5_South 51.50815 1.948447 3.818886 0.0368 52   

# Stratified two-stage design (EAs then households per stratum)
frame_cluster <- data.frame(
  stratum   = c("Urban", "Rural"),
  N         = c(50000, 150000),
  sd        = c(0.45, 0.48),
  mean      = c(0.35, 0.25),
  delta_psu = c(0.03, 0.08),
  cost_psu  = c(300, 600),
  cost_ssu  = c(40, 60)
)

n_alloc(frame_cluster, cv = 0.05)
#> Stratum allocation (neyman, two-stage, 2 strata)
#> field design: n = 2021, n_psu = 171, cv = 0.0498, cost = 206500
#> continuous optimum: n = 1979.882, cv = 0.0500, se = 0.0138