Determine where to cut a continuous stratification variable to form optimal strata. Supports four methods: cumulative root frequency (Dalenius-Hodges), geometric progression, Lavallée-Hidiroglou iterative, and Kozak random search.

Usage

strata_bound(
  x,
  n_strata = 3L,
  n = NULL,
  cv = NULL,
  method = "lh",
  alloc = "neyman",
  power_q = 0.5,
  cost = NULL,
  certain = NULL,
  nclass = NULL,
  maxiter = 200L,
  n_restart = NULL
)

Arguments

x

Numeric vector: stratification variable values. Must not contain NA.

n_strata

Integer: number of strata (including take-all if certain is specified). Must be >= 2.

n

Target total sample size. Specify at most one of n or cv; methods "lh" and "kozak" require one of the two.

cv

Target coefficient of variation. Specify at most one of n or cv; methods "lh" and "kozak" require one of the two.

method

Stratification method: "cumrootf" (Dalenius-Hodges), "geo" (geometric), "lh" (Lavallée-Hidiroglou), or "kozak" (random search). Default "lh".

alloc

Allocation rule: "proportional", "neyman", "optimal", or "power" (Bankier compromise). Default "neyman". See Details.

power_q

Bankier power parameter, used only when alloc = "power". Numeric scalar in \([0, 1]\). At power_q = 1 the allocation equals Neyman; at power_q = 0 it yields near-equal subnational CVs. Default 0.5.

cost

Per-stratum unit costs, ordered from lowest to highest stratum. Scalar (equal costs) or vector of length n_strata. Default NULL (equal unit costs, \(c_h = 1\) for all strata), in which case "optimal" and "neyman" coincide.

certain

Take-all threshold. Units with x >= certain form a census stratum.

nclass

Number of histogram bins for "cumrootf". Default NULL (Freedman-Diaconis rule).

maxiter

Maximum iterations for "lh" and "kozak". Default 200.

n_restart

Random restarts for "kozak". Default NULL (= 10 * n_strata).

Value

A svyplan_strata object with components:

boundaries

Numeric vector of cutpoints (length n_strata - 1).

n_strata

Number of strata.

n

Total sample size.

cv

Achieved coefficient of variation.

strata

Data frame with per-stratum summaries.

method

Algorithm used.

alloc

Allocation method name (character).

params

List of additional parameters.

converged

Logical: whether the algorithm converged (iterative methods only).

Details

The four methods differ in approach:

  • cumrootf: Dalenius-Hodges (1959) cumulative root frequency rule. Non-iterative, does not require n or cv.

  • geo: Gunning-Horgan (2004) geometric progression. Non-iterative, requires x > 0.

  • lh: Lavallée-Hidiroglou (1988) iterative coordinate-wise optimization. Requires n or cv.

  • kozak: Kozak (2004) random search, which can escape local minima. Requires n or cv.
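The two non-iterative rules are simple enough to sketch in base R. The block below is an illustration only, not the package's implementation: cumrootf_sketch() and geo_sketch() are hypothetical helper names, the equal-width binning and the nearest-bin-edge choice are assumptions, and the results need not match strata_bound() output.

```r
# Hypothetical helpers for illustration; not the package's internals.

# Cumulative root frequency rule on equal-width bins (binning is an assumption).
cumrootf_sketch <- function(x, n_strata, nclass = 50L) {
  breaks <- seq(min(x), max(x), length.out = nclass + 1L)
  # Bin index of each value; all.inside keeps max(x) in the last bin
  bin <- findInterval(x, breaks, all.inside = TRUE)
  counts <- tabulate(bin, nbins = nclass)
  cum <- cumsum(sqrt(counts))
  # Equal steps on the cumulative sqrt(frequency) scale
  targets <- max(cum) * seq_len(n_strata - 1L) / n_strata
  idx <- vapply(targets, function(t) which.min(abs(cum - t)), integer(1))
  breaks[idx + 1L]  # upper edge of the bin closest to each target
}

# Geometric progression between min(x) and max(x); requires x > 0.
geo_sketch <- function(x, n_strata) {
  stopifnot(all(x > 0))
  a <- min(x)
  r <- (max(x) / a)^(1 / n_strata)  # common ratio of the progression
  a * r^seq_len(n_strata - 1L)
}

geo_sketch(c(1, 10, 100, 1000), n_strata = 3)  # approximately 10 and 100
```

Neither helper needs n or cv, which is why these rules suit a first exploratory pass.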

When n or cv is available, "lh" (the default) directly minimizes the objective and is fast even on large frames. Use "kozak" for highly skewed or multimodal data where coordinate-wise optimization may get trapped in local minima. The non-iterative methods ("cumrootf", "geo") are useful for quick exploratory stratification or when neither n nor cv is known yet.

Allocation is controlled by the alloc parameter. Four methods are available:

  • proportional: \(n_h \propto N_h\).

  • neyman: \(n_h \propto N_h S_h\). Minimizes the national CV when unit costs are equal.

  • optimal: \(n_h \propto N_h S_h / \sqrt{c_h}\). Accounts for differential unit costs.

  • power: Bankier (1988) compromise, \(n_h \propto S_h N_h^{q}\) with \(q\) = power_q. The parameter power_q controls the trade-off between national precision (power_q = 1, equivalent to Neyman) and near-equal subnational CVs (power_q = 0).
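The four rules above differ only in the weight each stratum receives, which a short base R sketch can make concrete. alloc_sketch() is a hypothetical helper, not the package's code. Capping n_h at N_h and re-allocating the remainder is an assumption, suggested by the example output where the top stratum is sampled completely.

```r
# Hypothetical helper for illustration; not the package's implementation.
alloc_sketch <- function(N_h, S_h, n,
                         alloc = c("neyman", "proportional", "optimal", "power"),
                         power_q = 0.5, cost = rep(1, length(N_h))) {
  alloc <- match.arg(alloc)
  # Allocation weights a_h: n_h is proportional to a_h
  a_h <- switch(alloc,
    proportional = N_h,
    neyman       = N_h * S_h,
    optimal      = N_h * S_h / sqrt(cost),
    power        = S_h * N_h^power_q
  )
  n_h <- numeric(length(N_h))
  capped <- rep(FALSE, length(N_h))
  repeat {
    free <- !capped
    # Share what is left of n among strata that are not yet take-all
    n_h[free] <- (n - sum(N_h[capped])) * a_h[free] / sum(a_h[free])
    over <- free & n_h > N_h
    if (!any(over)) break
    n_h[over] <- N_h[over]  # assumed cap: a stratum cannot exceed its size
    capped <- capped | over
  }
  n_h
}

# Per-stratum N_h and S_h taken from the LH example output on this page
N_h <- c(215, 137, 81, 67)
S_h <- c(82.7, 144.3, 339.1, 5239.6)
round(alloc_sketch(N_h, S_h, n = 100))
# 9 10 14 67
```

With equal costs the "optimal" and "neyman" weights are identical, so both rules return the same allocation.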

Stratum allocations n_h are rounded to integers using the ORIC method (Cont and Heidari, 2015), which preserves sum(n_h) = n while minimizing rounding distortion.
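To see why plain rounding is not enough, consider the sketch below. It uses largest-remainder rounding, a simpler stand-in chosen for illustration and explicitly not the ORIC method; round_preserve_sum() is a hypothetical helper name.

```r
# Largest-remainder rounding: a simple stand-in, NOT the ORIC method.
round_preserve_sum <- function(n_h) {
  base <- floor(n_h)
  short <- round(sum(n_h)) - sum(base)  # units still to hand out
  if (short > 0) {
    # Give one extra unit to the strata with the largest fractional parts
    top <- order(n_h - base, decreasing = TRUE)[seq_len(short)]
    base[top] <- base[top] + 1
  }
  base
}

n_h <- c(2.5, 2.5, 5)
sum(round(n_h))               # 9: independent rounding loses a unit
sum(round_preserve_sum(n_h))  # 10: the total is preserved
```

Any sum-preserving scheme has to decide which strata absorb the leftover units, and that choice is what a distortion-minimizing method optimizes.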

References

Dalenius, T. and Hodges, J. L. (1959). Minimum variance stratification. Journal of the American Statistical Association, 54(285), 88–101.

Lavallée, P. and Hidiroglou, M. (1988). On the stratification of skewed populations. Survey Methodology, 14(1), 33–43.

Kozak, M. (2004). Optimal stratification using random search method in agricultural surveys. Statistics in Transition, 6(5), 797–806.

Gunning, P. and Horgan, J. M. (2004). A new algorithm for the construction of stratum boundaries in skewed populations. Survey Methodology, 30(2), 159–166.

Wesolowski, J., Wieczorkowski, R. and Wojciak, W. (2021). Optimality of the recursive Neyman allocation. Journal of Survey Statistics and Methodology, 10(5), 1263–1275.

Bankier, M. D. (1988). Power allocations: determining sample sizes for subnational areas. The American Statistician, 42(3), 174–177.

Cont, R. and Heidari, M. (2015). Optimal rounding under integer constraints. arXiv preprint arXiv:1501.00014.

See also

predict.svyplan_strata to assign new data to strata.
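What assignment by cutpoints amounts to can be sketched in base R with findInterval(). This illustration does not call the package. It assumes left-closed strata, so that a value equal to a boundary joins the upper stratum, which mirrors the x >= certain rule for the take-all stratum. The boundaries are copied from the LH example output on this page.

```r
# Base R illustration only; the package's predict method may differ in detail.
boundaries <- c(295.0, 827.9, 2090.5)   # cutpoints from the LH example
new_x <- c(12, 295, 500, 1500, 40000)   # hypothetical new observations

# findInterval() counts how many cutpoints are <= each value;
# adding 1 turns that count into a stratum index from 1 to n_strata.
stratum <- findInterval(new_x, boundaries) + 1L
stratum
# 1 2 2 3 4
```

Values beyond the observed range still receive a stratum: anything below the first cutpoint falls in stratum 1 and anything at or above the last falls in the top stratum.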

Examples

set.seed(42)
x <- rlnorm(500, meanlog = 6, sdlog = 1.5)

# Dalenius-Hodges (non-iterative); n is supplied here for the allocation
strata_bound(x, n_strata = 4, method = "cumrootf", n = 100)
#> Strata boundaries (Dalenius-Hodges, 4 strata)
#> Boundaries: 400.0, 1400.0, 3400.0
#> n = 100, cv = 0.0209
#> Allocation: neyman
#> ---
#>  stratum lower       upper    N_h W_h   S_h    n_h
#>  1          4.528383   400.00 261 0.522 112.1  18 
#>  2        400.000000  1400.00 147 0.294 290.8  27 
#>  3       1400.000000  3400.00  59 0.118 595.5  22 
#>  4       3400.000000 34502.88  33 0.066 6512.6 33 

# LH (default, iterative)
strata_bound(x, n_strata = 4, n = 100)
#> Strata boundaries (Lavallee-Hidiroglou, 4 strata)
#> Boundaries: 295.0, 827.9, 2090.5
#> n = 100, cv = 0.0192
#> Allocation: neyman
#> Converged: yes
#> ---
#>  stratum lower       upper      N_h W_h   S_h    n_h
#>  1          4.528383   294.9703 215 0.430 82.7    9 
#>  2        294.970346   827.8757 137 0.274 144.3  10 
#>  3        827.875709  2090.5324  81 0.162 339.1  14 
#>  4       2090.532395 34502.8792  67 0.134 5239.6 67 

# Bankier power allocation (compromise between national and subnational CVs)
strata_bound(x, n_strata = 4, n = 100, alloc = "power", power_q = 0.5)
#> Strata boundaries (Lavallee-Hidiroglou, 4 strata)
#> Boundaries: 272.1, 743.5, 2053.3
#> n = 100, cv = 0.0196
#> Allocation: power (power_q = 0.50)
#> Converged: no
#> ---
#>  stratum lower       upper      N_h W_h   S_h    n_h
#>  1          4.528383   272.1118 203 0.406 76.7    6 
#>  2        272.111825   743.4612 138 0.276 125.1   8 
#>  3        743.461152  2053.3060  90 0.180 337.0  17 
#>  4       2053.305984 34502.8792  69 0.138 5190.9 69 

# With take-all stratum
strata_bound(x, n_strata = 3, n = 80, certain = quantile(x, 0.95))
#> Strata boundaries (Lavallee-Hidiroglou, 3 strata)
#> Boundaries: 862.3, 3706.8
#> n = 80, cv = 0.0390
#> Allocation: neyman
#> Converged: yes
#> ---
#>  stratum lower       upper      N_h W_h   S_h    n_h
#>  1          4.528383   862.2665 358 0.716 217.4  24 
#>  2        862.266526  3706.8201 117 0.234 850.2  31 
#>  3       3706.820126 34502.8792  25 0.050 6931.7 25