Optimal Multistage Cluster Allocation

Compute optimal per-stage sample sizes for a multistage cluster design, minimizing cost for a given precision or minimizing variance for a given budget.

Usage

n_cluster(stage_cost = NULL, ...)

# Default S3 method
n_cluster(
  stage_cost = NULL,
  ...,
  delta = NULL,
  rel_var = 1,
  k = 1,
  cv = NULL,
  budget = NULL,
  n_psu = NULL,
  psu_size = NULL,
  ssu_size = NULL,
  resp_rate = 1,
  fixed_cost = 0,
  plan = NULL
)

# S3 method for class 'svyplan_prec'
n_cluster(stage_cost, ..., cv = NULL, budget = NULL)

Arguments

stage_cost: For the default method: numeric vector of per-stage costs. Its length determines the number of stages (2 or 3). Named vectors are accepted with the stage names cost_psu, cost_ssu, cost_tsu (in 2-stage designs, cost_tsu is accepted as an alias for cost_ssu). For svyplan_prec objects: a precision result from prec_cluster().
...: Additional arguments passed to methods. Unused arguments are rejected.
delta: Numeric vector of homogeneity measures (length = stages - 1), or a svyplan_varcomp object. Delta quantifies how similar units within the same cluster are: 0 means no similarity (clusters are as variable as the whole population), 1 means perfect similarity (all units in a cluster are identical). Typical values in household surveys range from 0.01 to 0.10. Higher delta means more clusters are needed for the same precision. Use varcomp() to estimate delta from a previous survey or pilot data.
rel_var: Unit relvariance (default 1). For most applications, the default of 1 is appropriate. Non-unit values arise when working with variance components from varcomp() that separate the total variance into stage-specific pieces.
k: Ratio parameter(s): a scalar for 2-stage designs, a length-2 vector for 3-stage designs. Default 1. Controls how stage costs relate to stage variances in the optimization. The default is appropriate for most designs; non-unit values are rarely needed outside specialized cost-variance models.
cv: Target coefficient of variation (relative standard error). For example, cv = 0.05 means the standard error of the estimate should be at most 5 percent of the estimate itself. Specify exactly one of cv or budget.
budget: Total budget. Specify exactly one of cv or budget.
n_psu: Fixed number of PSUs (stage-1 sample size). NULL (default) means optimize. For 2-stage, at most one of n_psu or psu_size may be specified. For 3-stage, up to two of n_psu, psu_size, ssu_size may be fixed.
psu_size: Fixed cluster size (stage-2 sample size per PSU). NULL (default) means optimize. This is the typical MICS/DHS parameterization where the number of households per cluster is fixed.
ssu_size: Fixed SSU take size (stage-3 sample size per SSU). NULL (default) means optimize. Only valid for 3-stage designs.
resp_rate: Expected response rate, in (0, 1]. Default 1 (no adjustment). The stage-1 sample size is inflated by 1 / resp_rate.
fixed_cost: Fixed overhead cost (C0). Default 0. The total cost model becomes C = C0 + c1*n_psu + c2*n_psu*psu_size [+ c3*n_psu*psu_size*ssu_size]. In budget mode, only budget - fixed_cost is available for variable costs. In CV mode, fixed_cost is added to the variable cost.
plan: Optional svyplan() object providing design defaults (including stage_cost, delta, rel_var, k, resp_rate, fixed_cost).
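
The cost model described under fixed_cost can be checked by hand. The following is a minimal base-R sketch; total_cost() is a hypothetical helper written for illustration and is not part of the package. It reproduces the costs printed for the field designs in the Examples section.

```r
# Total cost of a 2- or 3-stage design under the documented cost model
#   C = C0 + c1*n_psu + c2*n_psu*psu_size [+ c3*n_psu*psu_size*ssu_size].
# total_cost() is a hypothetical helper, not a svyplan function.
total_cost <- function(stage_cost, n, fixed_cost = 0) {
  stopifnot(length(stage_cost) == length(n), length(n) %in% c(2, 3))
  # cumprod(n) gives the number of units reached at each stage:
  # n_psu, n_psu*psu_size, n_psu*psu_size*ssu_size
  fixed_cost + sum(stage_cost * cumprod(n))
}

total_cost(c(500, 50), c(80, 15))                     # 100000
total_cost(c(500, 50), c(76, 15), fixed_cost = 5000)  # 100000
total_cost(c(500, 100, 50), c(20, 6, 5))              # 52000
```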

Value

A svyplan_cluster object with components:

n: Named numeric vector of continuous per-stage sample sizes (e.g. c(n_psu = 84.1, psu_size = 13.8)), the mathematical optimum.
stages: Number of stages (2 or 3).
total_n: Continuous total sample size (prod(n)).
cv: Coefficient of variation of the continuous optimum.
cost: Cost of the continuous optimum.
operational: The whole-unit field design, found by a discrete search: a list with n (named integer stage sizes), total_n, cost, and cv, all recomputed from the integer design. In budget mode its cost never exceeds the budget; in CV mode it meets the target CV at the lowest cost among the designs searched. as.integer() returns operational$n, and as.double() returns the continuous n.
params: List of input parameters.
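
The discrete search behind operational is not specified on this page. As an illustration only, the sketch below shows one plausible integer search for a 2-stage design in budget mode: for each candidate cluster size it takes as many whole PSUs as the budget allows and keeps the candidate with the smallest CV. The helpers cv_2stage() and search_budget(), the candidate range 1:50, and the variance model with k = 1 and rel_var = 1 are all assumptions, not the package's implementation.

```r
# Hypothetical helper: CV of a 2-stage design under the assumed model
#   relvar = rel_var / (n_psu * psu_size) * (1 + delta * (psu_size - 1))
cv_2stage <- function(n_psu, psu_size, delta, rel_var = 1) {
  sqrt(rel_var * (1 + delta * (psu_size - 1)) / (n_psu * psu_size))
}

# Illustrative integer search in budget mode (not the svyplan algorithm).
search_budget <- function(stage_cost, delta, budget, sizes = 1:50) {
  # Largest whole number of PSUs affordable at each candidate cluster size
  n_psu <- floor(budget / (stage_cost[[1]] + stage_cost[[2]] * sizes))
  ok <- n_psu >= 1
  grid <- data.frame(n_psu = n_psu[ok], psu_size = sizes[ok])
  grid$cost <- grid$n_psu *
    (stage_cost[[1]] + stage_cost[[2]] * grid$psu_size)
  grid$cv <- cv_2stage(grid$n_psu, grid$psu_size, delta)
  # Keep the affordable design with the smallest CV
  grid[which.min(grid$cv), ]
}

best <- search_budget(c(500, 50), delta = 0.05, budget = 100000)
best
```

For these inputs the sketch selects n_psu = 80 and psu_size = 15 at a cost of 100000, which agrees with the field design printed in the first example; agreement on one case does not establish that the package searches the same way.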

Details

Getting started

A typical 2-stage household survey workflow:

Decide what you know. You need the cost per cluster visit (stage_cost[1], e.g. travel + logistics) and the cost per interview (stage_cost[2]), plus an estimate of within-cluster homogeneity (delta). Estimate delta from a pilot or previous survey with varcomp(), or use a plausible range (0.01–0.10 for most household indicators).
Choose a mode. If you have a target precision, set cv. If you have a fixed budget, set budget. Never set both.
Fix stages or let the optimizer decide. In MICS/DHS-style designs, the number of households per cluster is fixed by fieldwork logistics (e.g. psu_size = 20). The optimizer then solves for how many clusters (n_psu) to visit. If no stage is fixed, both are optimized jointly.
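
When delta is only known as a plausible range, it can help to see how sensitive the design is to it before committing. The sketch below uses base R and assumes the standard 2-stage closed form with k = 1 and rel_var = 1 (as in Valliant et al., Ch. 9); the costs and budget are taken from the Examples section, and the three delta values are illustrative.

```r
# Assumed 2-stage model with k = 1 and rel_var = 1:
#   optimal psu_size = sqrt(c1 / c2 * (1 - delta) / delta)
stage_cost <- c(500, 50)        # cost per cluster visit, cost per interview
budget <- 100000
delta <- c(0.01, 0.05, 0.10)    # plausible range for household indicators

psu_size <- sqrt(stage_cost[1] / stage_cost[2] * (1 - delta) / delta)
# Budget mode: number of PSUs the budget buys at each cluster size
n_psu <- budget / (stage_cost[1] + stage_cost[2] * psu_size)
# Resulting CV under the assumed variance model
cv <- sqrt((1 + delta * (psu_size - 1)) / (n_psu * psu_size))

sensitivity <- data.frame(delta, psu_size, n_psu, cv)
round(sensitivity, 4)
```

As delta rises across the range, the optimal cluster size shrinks, the number of clusters grows, and the achievable CV for the same budget gets worse.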

How it works

Stage count is determined by length(stage_cost):

2-stage (e.g. clusters then households): stage_cost has 2 elements, delta is a scalar.
3-stage (e.g. districts, clusters, households): stage_cost has 3 elements, delta is length 2.
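
To make the role of delta in each case concrete, the sketch below evaluates the CV of a given design for either stage count. design_cv() is a hypothetical helper, and the relvariance model with k = 1 is an assumption taken from the cited reference rather than from the package source; it does reproduce the CVs printed for the field designs in the Examples section.

```r
# Hypothetical helper (not part of svyplan): CV of a design under the
# assumed relvariance model with k = 1, where n = c(n_psu, psu_size[, ssu_size]).
#   2-stage: relvar = rel_var / (n1*n2)    * (1 + delta * (n2 - 1))
#   3-stage: relvar = rel_var / (n1*n2*n3) * (delta1*n2*n3 + 1 + delta2*(n3 - 1))
design_cv <- function(n, delta, rel_var = 1) {
  stages <- length(n)
  stopifnot(stages %in% c(2, 3), length(delta) == stages - 1)
  if (stages == 2) {
    relvar <- rel_var / prod(n) * (1 + delta * (n[2] - 1))
  } else {
    relvar <- rel_var / prod(n) *
      (delta[1] * n[2] * n[3] + 1 + delta[2] * (n[3] - 1))
  }
  sqrt(relvar)
}

design_cv(c(80, 15), delta = 0.05)              # about 0.0376
design_cv(c(20, 6, 5), delta = c(0.01, 0.05))   # 0.05
```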

Two solving modes:

CV mode: minimize total cost subject to achieving the target CV.
Budget mode: minimize the CV (maximize precision) within the available budget.
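
For a 2-stage design, both modes can be sketched with the textbook closed form, in which the optimal cluster size depends only on the cost ratio and delta, and the mode only determines the number of PSUs. opt_2stage() is a hypothetical base-R helper that assumes k = 1; it is not the package's code, although it reproduces the continuous optima printed in the Examples section.

```r
# Assumed closed form for a 2-stage design with k = 1 (hypothetical helper).
opt_2stage <- function(stage_cost, delta, cv = NULL, budget = NULL,
                       rel_var = 1, fixed_cost = 0) {
  stopifnot(xor(is.null(cv), is.null(budget)))  # exactly one mode
  c1 <- stage_cost[[1]]
  c2 <- stage_cost[[2]]
  # Optimal cluster size: the same in both modes
  psu_size <- sqrt(c1 / c2 * (1 - delta) / delta)
  deff_term <- 1 + delta * (psu_size - 1)
  if (is.null(cv)) {
    # Budget mode: spend the variable budget on as many PSUs as it buys
    n_psu <- (budget - fixed_cost) / (c1 + c2 * psu_size)
  } else {
    # CV mode: the n_psu at which the relvariance equals cv^2
    n_psu <- rel_var * deff_term / (cv^2 * psu_size)
  }
  list(
    n = c(n_psu = n_psu, psu_size = psu_size),
    cv = sqrt(rel_var * deff_term / (n_psu * psu_size)),
    cost = fixed_cost + n_psu * (c1 + c2 * psu_size)
  )
}

opt_2stage(c(500, 50), delta = 0.05, budget = 100000)
opt_2stage(c(500, 50), delta = 0.05, cv = 0.05)
```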

Fixing stage sizes

One or more stage sizes can be fixed: at most one for 2-stage designs and up to two for 3-stage designs. When a single free stage remains, it is derived from the budget or CV constraint; otherwise the remaining stages are optimized.
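
In budget mode, deriving the free stage amounts to solving the cost equation for the one unknown. The helpers in this sketch are hypothetical, use base R only, and take fixed_cost as 0 for brevity; the inputs are those of the fixed-stage calls in the Examples section.

```r
# Derive the free stage from the budget constraint
#   budget = c1*n_psu + c2*n_psu*psu_size [+ c3*n_psu*psu_size*ssu_size].
# Hypothetical helpers, not svyplan functions; fixed_cost assumed 0.

# 2-stage, psu_size fixed: solve for n_psu
n_psu_from_budget <- function(stage_cost, budget, psu_size) {
  budget / (stage_cost[[1]] + stage_cost[[2]] * psu_size)
}

# 2-stage, n_psu fixed: solve for psu_size
psu_size_from_budget <- function(stage_cost, budget, n_psu) {
  (budget / n_psu - stage_cost[[1]]) / stage_cost[[2]]
}

# 3-stage, n_psu and ssu_size fixed: solve for psu_size
psu_size_3stage <- function(stage_cost, budget, n_psu, ssu_size) {
  (budget - stage_cost[[1]] * n_psu) /
    (n_psu * (stage_cost[[2]] + stage_cost[[3]] * ssu_size))
}

n_psu_from_budget(c(500, 50), 100000, psu_size = 20)                # 66.66667
psu_size_from_budget(c(500, 50), 100000, n_psu = 40)                # 40
psu_size_3stage(c(500, 100, 50), 500000, n_psu = 50, ssu_size = 8)  # 19
```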

If delta is a svyplan_varcomp object, delta, rel_var, and k are extracted automatically.

Boundary delta values

Boundary and near-boundary homogeneity values are not supported by the analytical optimum used here. When delta is near 0, most variability is within PSUs, so the closed-form optimum collapses toward taking many units in very few PSUs. When delta is near 1, most variability is between PSUs, so the optimum collapses toward taking very few units in many PSUs. In both cases the analytical allocation becomes degenerate, so n_cluster() rejects values numerically too close to 0 or 1.
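
The degeneracy is easy to see numerically. The sketch below assumes the 2-stage closed form with k = 1 and a cost ratio c1/c2 of 10 (as in the examples); opt_psu_size() is a hypothetical helper, and the delta values are chosen only to illustrate the two limits.

```r
# Assumed 2-stage closed form with k = 1:
#   psu_size = sqrt(c1 / c2 * (1 - delta) / delta)
opt_psu_size <- function(delta, cost_ratio = 10) {
  sqrt(cost_ratio * (1 - delta) / delta)
}

delta <- c(1e-6, 1e-3, 0.05, 0.999, 1 - 1e-6)
data.frame(delta, psu_size = opt_psu_size(delta))
```

Near 0 the implied cluster size runs into the thousands, and near 1 it falls below one unit per PSU, so neither end yields a usable design.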

These functions assume sampling fractions are negligible at each stage (equivalent to sampling with replacement). No finite population correction is applied. This is standard for multistage planning when cluster populations are large relative to the sample.

References

Valliant, R., Dever, J. A., and Kreuter, F. (2018). Practical Tools for Designing and Weighting Survey Samples (2nd ed.). Springer. Ch. 9.

Examples

# 2-stage, budget mode
n_cluster(stage_cost = c(500, 50), delta = 0.05, budget = 100000)
#> Optimal 2-stage allocation
#> field design: n_psu = 80 | psu_size = 15 -> total n = 1200
#> cv = 0.0376, cost = 100000
#> continuous optimum: n_psu = 84.08997 | psu_size = 13.78405 (cv = 0.0376, cost = 100000)

# 2-stage, CV mode
n_cluster(stage_cost = c(500, 50), delta = 0.05, cv = 0.05)
#> Optimal 2-stage allocation
#> field design: n_psu = 52 | psu_size = 12 -> total n = 624
#> cv = 0.0498, cost = 57200
#> continuous optimum: n_psu = 47.5681 | psu_size = 13.78405 (cv = 0.0500, cost = 56568)

# 2-stage, fixed n_psu
n_cluster(stage_cost = c(500, 50), delta = 0.05, budget = 100000, n_psu = 40)
#> Optimal 2-stage allocation
#> field design: n_psu = 40 | psu_size = 40 -> total n = 1600
#> cv = 0.0429, cost = 100000
#> continuous optimum: n_psu = 40 | psu_size = 40 (cv = 0.0429, cost = 100000)

# 2-stage, fixed psu_size (MICS/DHS style: 20 households per cluster)
n_cluster(stage_cost = c(500, 50), delta = 0.05, budget = 100000, psu_size = 20)
#> Optimal 2-stage allocation
#> field design: n_psu = 66 | psu_size = 20 -> total n = 1320
#> cv = 0.0384, cost = 99000
#> continuous optimum: n_psu = 66.66667 | psu_size = 20 (cv = 0.0382, cost = 100000)

# 3-stage
n_cluster(stage_cost = c(500, 100, 50), delta = c(0.01, 0.05), cv = 0.05)
#> Optimal 3-stage allocation
#> field design: n_psu = 20 | psu_size = 6 | ssu_size = 5 -> total n = 600
#> cv = 0.0500, cost = 52000
#> continuous optimum: n_psu = 20.32883 | psu_size = 5 | ssu_size = 6.164414 (cv = 0.0500, cost = 51658)

# 3-stage, fixed n_psu + ssu_size (solve for psu_size)
n_cluster(
  stage_cost = c(500, 100, 50), delta = c(0.01, 0.05),
  budget = 500000, n_psu = 50, ssu_size = 8
)
#> Optimal 3-stage allocation
#> field design: n_psu = 50 | psu_size = 19 | ssu_size = 8 -> total n = 7600
#> cv = 0.0194, cost = 500000
#> continuous optimum: n_psu = 50 | psu_size = 19 | ssu_size = 8 (cv = 0.0194, cost = 500000)

# With fixed overhead cost
n_cluster(stage_cost = c(500, 50), delta = 0.05, budget = 100000, fixed_cost = 5000)
#> Optimal 2-stage allocation
#> field design: n_psu = 76 | psu_size = 15 -> total n = 1140
#> cv = 0.0386, cost = 100000 (fixed: 5000)
#> continuous optimum: n_psu = 79.88547 | psu_size = 13.78405 (cv = 0.0386, cost = 100000)