Compute optimal per-stage sample sizes for a multistage cluster design, minimizing cost for a given precision or minimizing variance for a given budget.

Usage

n_cluster(stage_cost, ...)

# Default S3 method
n_cluster(
  stage_cost = NULL,
  delta = NULL,
  rel_var = 1,
  k = 1,
  cv = NULL,
  budget = NULL,
  n_psu = NULL,
  psu_size = NULL,
  ssu_size = NULL,
  resp_rate = 1,
  fixed_cost = 0,
  plan = NULL,
  ...
)

# S3 method for class 'svyplan_prec'
n_cluster(stage_cost, cv = NULL, budget = NULL, ...)

Arguments

stage_cost

For the default method: numeric vector of per-stage costs. Length determines the number of stages (2 or 3). Named vectors are accepted with stage names cost_psu, cost_ssu, cost_tsu (cost_tsu aliases cost_ssu in 2-stage). For svyplan_prec objects: a precision result from prec_cluster().

...

Additional arguments passed to methods.

delta

Numeric vector of homogeneity measures (length = stages - 1), or a svyplan_varcomp object. Delta quantifies how similar units within the same cluster are: 0 means no similarity (clusters are as variable as the whole population), 1 means perfect similarity (all units in a cluster are identical). Typical values in household surveys range from 0.01 to 0.10. Higher delta means more clusters are needed for the same precision. Use varcomp() to estimate delta from a previous survey or pilot data.

rel_var

Unit relvariance (default 1). For most applications, the default of 1 is appropriate. Non-unit values arise when working with variance components from varcomp() that separate the total variance into stage-specific pieces.

k

Ratio parameter(s). Scalar for 2-stage, length-2 vector for 3-stage (default 1). Controls how stage costs relate to stage variances in the optimization. The default of 1 is appropriate for most designs; non-unit values are rarely needed outside specialized cost-variance models.

cv

Target coefficient of variation (relative standard error). For example, cv = 0.05 means the standard error of the estimate should be at most 5 percent of the estimate itself. Specify exactly one of cv or budget.

budget

Total budget. Specify exactly one of cv or budget.

n_psu

Fixed number of PSUs (stage-1 sample size). NULL (default) means optimize. For 2-stage, at most one of n_psu or psu_size may be specified. For 3-stage, up to two of n_psu, psu_size, ssu_size may be fixed.

psu_size

Fixed cluster size (stage-2 sample size per PSU). NULL (default) means optimize. This is the typical MICS/DHS parameterization where the number of households per cluster is fixed.

ssu_size

Fixed SSU take size (stage-3 sample size per SSU). NULL (default) means optimize. Only valid for 3-stage designs.

resp_rate

Expected response rate, in (0, 1]. Default 1 (no adjustment). The stage-1 sample size is inflated by 1 / resp_rate.

fixed_cost

Fixed overhead cost (C0). Default 0. The total cost model becomes C = C0 + c1*n_psu + c2*n_psu*psu_size [+ c3*n_psu*psu_size*ssu_size], where c1, c2, c3 are the elements of stage_cost. In budget mode, only budget - fixed_cost is available for variable costs; in CV mode, fixed_cost is added to the variable cost to give the reported total cost.

plan

Optional svyplan() object providing design defaults (including stage_cost, delta, rel_var, k, resp_rate, fixed_cost).

Value

A svyplan_cluster object with components:

n

Named numeric vector of continuous per-stage sample sizes (e.g. c(n_psu = 84.1, psu_size = 13.8)). Use ceiling() for operational (integer) values.

stages

Number of stages (2 or 3).

total_n

Continuous total sample size (prod(n)). Use as.integer() for the operational total (the product of the rounded-up stage sizes), or as.double() for this continuous value.

cv

Achieved coefficient of variation (based on continuous optimum).

cost

Total cost.

params

List of input parameters.
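
The difference between the continuous sizes in n and the operational total can be sketched in base R. This uses the illustrative values quoted under n rather than a real svyplan_cluster object, so the continuous product here is only approximate.

```r
# Continuous per-stage sizes, as in the `n` component described above
n <- c(n_psu = 84.1, psu_size = 13.8)

# Operational sizes: round each stage up to a whole number
n_oper <- ceiling(n)

# The operational total is the product of the rounded-up stages,
# so it is at least as large as the continuous product
total_continuous <- prod(n)
total_operational <- prod(n_oper)

print(n_oper)
print(c(continuous = total_continuous, operational = total_operational))
```

Rounding each stage up before multiplying is why the printed total n in the examples (1190) exceeds the unrounded total.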

Details

Getting started

A typical 2-stage household survey workflow:

  1. Decide what you know. You need the cost per cluster visit (stage_cost[1], e.g. travel + logistics) and the cost per interview (stage_cost[2]), plus an estimate of within-cluster homogeneity (delta). Estimate delta from a pilot or previous survey with varcomp(), or use a plausible range (0.01–0.10 for most household indicators).

  2. Choose a mode. If you have a target precision, set cv. If you have a fixed budget, set budget. Never set both.

  3. Fix stages or let the optimizer decide. In MICS/DHS-style designs, the number of households per cluster is fixed by fieldwork logistics (e.g. psu_size = 20); the optimizer then solves for how many clusters (n_psu) to visit. If no stage is fixed, both are optimized jointly.
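
When delta is only known as a plausible range, it can help to see how the required number of clusters moves across that range. The following is a minimal base R sketch for a MICS/DHS-style design with a fixed cluster size and a target CV. The helper n_psu_for_cv() is hypothetical, not part of svyplan, and it assumes the usual design-effect form 1 + delta * (psu_size - 1) rather than the package's internal code.

```r
# Hypothetical helper (not part of svyplan): clusters needed for a target CV
# when the number of households per cluster is fixed.
# Assumed model:
#   relvar = rel_var * k * (1 + delta * (psu_size - 1)) / (n_psu * psu_size)
n_psu_for_cv <- function(delta, psu_size, cv, rel_var = 1, k = 1) {
  deff <- 1 + delta * (psu_size - 1)
  rel_var * k * deff / (psu_size * cv^2)
}

# Low, middle, and high values from the plausible range for household indicators
deltas <- c(0.01, 0.05, 0.10)
clusters <- n_psu_for_cv(delta = deltas, psu_size = 20, cv = 0.05)

sens <- data.frame(delta = deltas, n_psu = round(clusters, 1))
print(sens)
```

Under these assumptions the required number of clusters more than doubles between the two ends of the range, which is a reason to estimate delta from data where possible.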

How it works

Stage count is determined by length(stage_cost):

  • 2-stage (e.g. clusters then households): stage_cost has 2 elements, delta is a scalar.

  • 3-stage (e.g. districts, clusters, households): stage_cost has 3 elements, delta is length 2.
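
To show where the three costs and two homogeneity measures enter in the 3-stage case, here is a base R sketch of a closed-form optimum in CV mode. The helper three_stage_cv() is hypothetical and not part of svyplan. The variance model in its comments is an assumption based on the 3-stage formulas in Valliant et al. (2018), Ch. 9, not a copy of the package's implementation.

```r
# Hypothetical helper (not part of svyplan): 3-stage optimum for a target CV.
# Assumed variance model, with m = n_psu, nbar = psu_size, qbar = ssu_size:
#   relvar = rel_var / (m * nbar * qbar) *
#            (k1 * d1 * nbar * qbar + k2 * (1 + d2 * (qbar - 1)))
three_stage_cv <- function(stage_cost, delta, cv, rel_var = 1, k = c(1, 1)) {
  stopifnot(length(stage_cost) == 3, length(delta) == 2, length(k) == 2)
  c1 <- stage_cost[[1]]; c2 <- stage_cost[[2]]; c3 <- stage_cost[[3]]
  d1 <- delta[[1]]; d2 <- delta[[2]]
  k1 <- k[[1]]; k2 <- k[[2]]

  # Stage-3 and stage-2 sizes depend only on cost ratios and homogeneity
  ssu_size <- sqrt((1 - d2) / d2 * c2 / c3)
  psu_size <- sqrt((1 - d2) / d1 * c1 / c3 * k2 / k1) / ssu_size

  # Stage-1 size is whatever is needed to reach the target CV
  per_psu <- psu_size * ssu_size
  n_psu <- rel_var * (k1 * d1 * per_psu + k2 * (1 + d2 * (ssu_size - 1))) /
    (per_psu * cv^2)

  list(
    n = c(n_psu = n_psu, psu_size = psu_size, ssu_size = ssu_size),
    total_n = n_psu * per_psu,
    cost = n_psu * (c1 + c2 * psu_size + c3 * per_psu)
  )
}

res3 <- three_stage_cv(c(500, 100, 50), delta = c(0.01, 0.05), cv = 0.05)
print(round(res3$n, 2))
print(round(res3$total_n, 2))
```

With these inputs the sketch gives a continuous total close to 626.58, in line with the unrounded total printed for the 3-stage example.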

Two solving modes:

  • CV mode: minimize total cost subject to achieving the target CV.

  • Budget mode: minimize the CV (maximize precision) within the available budget.
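
The two modes can be illustrated for a 2-stage design with a short base R sketch. The helper two_stage_opt() is hypothetical and not part of svyplan. It assumes the cost model given under fixed_cost and a relvariance of rel_var * k * (1 + delta * (psu_size - 1)) / (n_psu * psu_size), which reproduces the figures in the examples but should not be read as the package's implementation.

```r
# Hypothetical helper (not part of svyplan): 2-stage closed-form optimum.
# Assumed models:
#   cost   = fixed_cost + c1 * n_psu + c2 * n_psu * psu_size
#   relvar = rel_var * k * (1 + delta * (psu_size - 1)) / (n_psu * psu_size)
two_stage_opt <- function(stage_cost, delta, cv = NULL, budget = NULL,
                          rel_var = 1, k = 1, fixed_cost = 0) {
  # Exactly one of cv or budget selects the mode
  stopifnot(length(stage_cost) == 2, xor(is.null(cv), is.null(budget)))
  c1 <- stage_cost[[1]]; c2 <- stage_cost[[2]]

  # The optimal cluster size is the same in both modes
  psu_size <- sqrt(c1 / c2 * (1 - delta) / delta)
  deff <- 1 + delta * (psu_size - 1)
  cost_per_psu <- c1 + c2 * psu_size

  if (is.null(cv)) {
    # Budget mode: spend what is left after the fixed overhead
    n_psu <- (budget - fixed_cost) / cost_per_psu
  } else {
    # CV mode: number of PSUs that exactly reaches the target CV
    n_psu <- rel_var * k * deff / (psu_size * cv^2)
  }

  list(
    n = c(n_psu = n_psu, psu_size = psu_size),
    cv = sqrt(rel_var * k * deff / (n_psu * psu_size)),
    cost = fixed_cost + n_psu * cost_per_psu
  )
}

by_budget <- two_stage_opt(c(500, 50), delta = 0.05, budget = 100000)
by_cv <- two_stage_opt(c(500, 50), delta = 0.05, cv = 0.05)

print(round(by_budget$n, 1))
print(round(by_budget$cv, 4))
print(round(by_cv$cost))
```

In this sketch the mode changes only the number of PSUs. The cluster size depends on the cost ratio and delta alone.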

Fixing stage sizes

One or more stage sizes can be fixed, leaving the remaining stage(s) to be optimized or derived from the constraint. For 2-stage designs, at most one stage may be fixed. For 3-stage designs, up to two stages may be fixed; the remaining free stage is derived from the budget or CV constraint.
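
For a 2-stage design in budget mode, fixing one stage leaves the other fully determined by the cost constraint. A minimal base R sketch follows. The helper solve_free_stage() is hypothetical and not part of svyplan, and it assumes the cost model given under fixed_cost.

```r
# Hypothetical helper (not part of svyplan): with one stage fixed in a 2-stage
# budget-mode design, the other follows from the assumed cost model
#   budget = fixed_cost + c1 * n_psu + c2 * n_psu * psu_size
solve_free_stage <- function(stage_cost, budget, n_psu = NULL, psu_size = NULL,
                             fixed_cost = 0) {
  # Exactly one of the two stages must be fixed
  stopifnot(length(stage_cost) == 2, xor(is.null(n_psu), is.null(psu_size)))
  c1 <- stage_cost[[1]]; c2 <- stage_cost[[2]]
  avail <- budget - fixed_cost

  if (is.null(psu_size)) {
    # n_psu fixed: split the per-cluster budget into the visit and interviews
    psu_size <- (avail / n_psu - c1) / c2
  } else {
    # psu_size fixed: divide the budget by the cost of one full cluster
    n_psu <- avail / (c1 + c2 * psu_size)
  }
  c(n_psu = n_psu, psu_size = psu_size)
}

fixed_clusters <- solve_free_stage(c(500, 50), budget = 100000, n_psu = 40)
fixed_take <- solve_free_stage(c(500, 50), budget = 100000, psu_size = 20)

print(fixed_clusters)
print(round(fixed_take, 2))
```

No optimization is involved once a stage is fixed here. The free stage is just the value that exhausts the budget.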

If delta is supplied as a svyplan_varcomp object, the values of delta, rel_var, and k are extracted from it automatically.

Boundary delta values

Boundary and near-boundary homogeneity values are not supported by the analytical optimum used here. When delta is near 0, most variability is within PSUs, so the closed-form optimum collapses toward taking many units in very few PSUs. When delta is near 1, most variability is between PSUs, so the optimum collapses toward taking very few units in many PSUs. In both cases the analytical allocation becomes degenerate, so n_cluster() rejects values numerically too close to 0 or 1.
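
The degeneracy can be seen numerically. The base R sketch below evaluates an assumed 2-stage closed form for the optimal cluster size, sqrt(c1 / c2 * (1 - delta) / delta), over a grid of delta values. The function opt_psu_size() and the cost values are illustrative only, and the thresholds at which n_cluster() actually rejects delta are not shown here.

```r
# Assumed closed-form 2-stage optimum for the cluster size (not package code)
opt_psu_size <- function(delta, c1 = 500, c2 = 50) {
  sqrt(c1 / c2 * (1 - delta) / delta)
}

# Grid running from very close to 0 to very close to 1
deltas <- c(1e-6, 0.01, 0.05, 0.5, 0.99, 1 - 1e-6)
sizes <- opt_psu_size(deltas)

print(data.frame(delta = deltas, psu_size = signif(sizes, 4)))
```

Near 0 the implied cluster size runs into the thousands, and near 1 it falls below one unit per cluster, so neither end yields a usable allocation.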

These functions assume sampling fractions are negligible at each stage (equivalent to sampling with replacement). No finite population correction is applied. This is standard for multistage planning when cluster populations are large relative to the sample.

References

Valliant, R., Dever, J. A., and Kreuter, F. (2018). Practical Tools for Designing and Weighting Survey Samples (2nd ed.). Springer. Ch. 9.

See also

prec_cluster() for the inverse, varcomp() for estimating variance components.

Examples

# 2-stage, budget mode
n_cluster(stage_cost = c(500, 50), delta = 0.05, budget = 100000)
#> Optimal 2-stage allocation
#> n_psu = 85 | psu_size = 14 -> total n = 1190 (unrounded: 1159.1)
#> cv = 0.0376, cost = 100000

# 2-stage, CV mode
n_cluster(stage_cost = c(500, 50), delta = 0.05, cv = 0.05)
#> Optimal 2-stage allocation
#> n_psu = 48 | psu_size = 14 -> total n = 672 (unrounded: 655.681)
#> cv = 0.0500, cost = 56568

# 2-stage, fixed n_psu
n_cluster(stage_cost = c(500, 50), delta = 0.05, budget = 100000, n_psu = 40)
#> Optimal 2-stage allocation
#> n_psu = 40 | psu_size = 40 -> total n = 1600 (unrounded: 1600)
#> cv = 0.0429, cost = 100000

# 2-stage, fixed psu_size (MICS/DHS style: 20 households per cluster)
n_cluster(stage_cost = c(500, 50), delta = 0.05, budget = 100000, psu_size = 20)
#> Optimal 2-stage allocation
#> n_psu = 67 | psu_size = 20 -> total n = 1340 (unrounded: 1333.333)
#> cv = 0.0382, cost = 100000

# 3-stage
n_cluster(stage_cost = c(500, 100, 50), delta = c(0.01, 0.05), cv = 0.05)
#> Optimal 3-stage allocation
#> n_psu = 21 | psu_size = 5 | ssu_size = 7 -> total n = 735 (unrounded: 626.5766)
#> cv = 0.0500, cost = 51658

# 3-stage, fixed n_psu + ssu_size (solve for psu_size)
n_cluster(
  stage_cost = c(500, 100, 50), delta = c(0.01, 0.05),
  budget = 500000, n_psu = 50, ssu_size = 8
)
#> Optimal 3-stage allocation
#> n_psu = 50 | psu_size = 19 | ssu_size = 8 -> total n = 7600 (unrounded: 7600)
#> cv = 0.0194, cost = 500000

# With fixed overhead cost
n_cluster(stage_cost = c(500, 50), delta = 0.05, budget = 100000, fixed_cost = 5000)
#> Optimal 2-stage allocation
#> n_psu = 80 | psu_size = 14 -> total n = 1120 (unrounded: 1101.145)
#> cv = 0.0386, cost = 100000 (fixed: 5000)