Determine where to cut a continuous stratification variable to form optimal strata. Supports four methods: cumulative root frequency (Dalenius-Hodges), geometric progression, Lavallée-Hidiroglou iterative, and Kozak random search.
Usage
strata_bound(
x,
n_strata = 3L,
n = NULL,
cv = NULL,
method = "lh",
alloc = "neyman",
power_q = 0.5,
cost = NULL,
certain = NULL,
nclass = NULL,
maxiter = 200L,
n_restart = NULL
)
Arguments
- x
Numeric vector: stratification variable values. Must not contain NA.
- n_strata
Integer: number of strata (including take-all if certain is specified). Must be >= 2.
- n
Target total sample size. Specify at most one of n or cv. One of the two is required for methods "lh" and "kozak".
- cv
Target coefficient of variation. Specify at most one of n or cv. One of the two is required for methods "lh" and "kozak".
- method
Stratification method: "cumrootf" (Dalenius-Hodges), "geo" (geometric), "lh" (Lavallée-Hidiroglou), or "kozak" (random search). Default "lh".
- alloc
Allocation rule: "proportional", "neyman", "optimal", or "power" (Bankier compromise). Default "neyman". See Details.
- power_q
Bankier power parameter, used only when alloc = "power". Numeric scalar in \([0, 1]\). At power_q = 1 the allocation equals Neyman; at power_q = 0 it yields near-equal subnational CVs. Default 0.5.
- cost
Per-stratum unit costs, ordered from lowest to highest stratum. Scalar (equal costs) or vector of length n_strata. Default NULL (equal unit costs, \(c_h = 1\) for all strata), in which case "optimal" and "neyman" coincide.
- certain
Take-all threshold. Units with x >= certain form a census stratum.
- nclass
Number of histogram bins for "cumrootf". Default NULL (Freedman-Diaconis rule).
- maxiter
Maximum iterations for "lh" and "kozak". Default 200.
- n_restart
Random restarts for "kozak". Default NULL (= 10 * n_strata).
Value
A svyplan_strata object with components:
- boundaries
Numeric vector of cutpoints (length n_strata - 1).
- n_strata
Number of strata.
- n
Total sample size.
- cv
Achieved coefficient of variation.
- strata
Data frame with per-stratum summaries.
- method
Algorithm used.
- alloc
Allocation method name (character).
- params
List of additional parameters.
- converged
Logical (for iterative methods).
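The cv component is described only as the achieved coefficient of variation. As a rough illustration, the sketch below assumes it refers to the CV of the stratified mean under simple random sampling without replacement within strata; that interpretation, the helper name strat_cv, and the toy inputs are assumptions, not taken from the package.

```r
# Hypothetical helper (not part of the package): CV of the stratified mean
# under simple random sampling without replacement within each stratum.
strat_cv <- function(N_h, S_h, n_h, ybar) {
  W_h <- N_h / sum(N_h)                       # stratum weights
  v_h <- W_h^2 * S_h^2 * (1 / n_h - 1 / N_h)  # per-stratum variance with fpc
  sqrt(sum(v_h)) / ybar
}

# Toy inputs, made up for illustration
N_h  <- c(200, 100, 20)
S_h  <- c(10, 40, 300)
ybar <- 150

cv_a <- strat_cv(N_h, S_h, n_h = c(20, 20, 10), ybar)
cv_b <- strat_cv(N_h, S_h, n_h = c(20, 20, 20), ybar)  # top stratum taken in full
```

Under this formula a stratum with n_h = N_h contributes no variance, which is why cv_b is smaller than cv_a here.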
Details
The four methods differ in approach:
cumrootf: Dalenius-Hodges (1959) cumulative root frequency rule. Non-iterative, does not require n or cv.
geo: Gunning-Horgan (2004) geometric progression. Non-iterative, requires x > 0.
lh: Lavallée-Hidiroglou (1988) iterative coordinate-wise optimization. Requires n or cv.
kozak: Kozak (2004) random search, escapes local minima. Requires n or cv.
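To make the cumulative root frequency idea concrete, here is a minimal base R sketch of the textbook rule: bin x, cumulate the square roots of the bin counts, and cut where the cumulative total crosses equal fractions. The helper cum_root_f, the fixed nclass = 50, and the toy data are assumptions for illustration; the package's own binning and tie handling may differ.

```r
# Hypothetical helper, not the package's code: cumulative sqrt(f) rule.
cum_root_f <- function(x, n_strata, nclass = 50L) {
  # Equal-width histogram classes spanning the range of x
  breaks <- seq(min(x), max(x), length.out = nclass + 1L)
  f <- tabulate(findInterval(x, breaks, rightmost.closed = TRUE), nbins = nclass)
  # Cumulate sqrt(f) and split its total into n_strata equal parts
  csf <- cumsum(sqrt(f))
  targets <- csf[nclass] * seq_len(n_strata - 1L) / n_strata
  # Cut at the upper edge of the class whose cumulative value is closest
  idx <- vapply(targets, function(t) which.min(abs(csf - t)), integer(1))
  breaks[idx + 1L]
}

# Deterministic right-skewed toy data (made up for illustration)
x_toy <- exp(seq(0, 5, length.out = 1000))
b <- cum_root_f(x_toy, n_strata = 4L)
```

Because boundaries can only fall on class edges, a coarse nclass on very skewed data may return duplicated cut points; a finer grid avoids that.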
When n or cv is available, "lh" (the default) directly
minimizes the objective and is fast even on large frames. Use "kozak"
for highly skewed or multimodal data where coordinate-wise optimization
may get trapped in local minima. The non-iterative methods ("cumrootf",
"geo") are useful for quick exploratory stratification or when neither
n nor cv is known yet.
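The geometric method is cheap enough to compute by hand. A minimal sketch, assuming the usual Gunning-Horgan form in which boundaries follow a geometric progression between min(x) and max(x); geo_bounds is a hypothetical helper and the input vector is made up.

```r
# Hypothetical helper, not the package's code: geometric progression boundaries.
geo_bounds <- function(x, n_strata) {
  stopifnot(all(x > 0))              # the ratio max/min needs positive values
  a <- min(x)
  r <- (max(x) / a)^(1 / n_strata)   # common ratio of the progression
  a * r^seq_len(n_strata - 1L)       # interior cut points only
}

# min = 1 and max = 10000 with 4 strata give r = 10,
# so the cut points are approximately 10, 100, 1000.
gb <- geo_bounds(c(1, 5, 20, 300, 10000), n_strata = 4L)
```

Only the minimum, the maximum, and the number of strata enter the calculation, which is why neither n nor cv is needed.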
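Allocation is controlled by the alloc parameter. Four rules are
available, where \(N_h\) is the stratum population size, \(S_h\) the
stratum standard deviation of x, and \(c_h\) the per-unit cost: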
proportional: \(n_h \propto N_h\).
neyman: \(n_h \propto N_h S_h\). Minimizes the national CV when unit costs are equal.
optimal: \(n_h \propto N_h S_h / \sqrt{c_h}\). Accounts for differential unit costs.
power: Bankier (1988) compromise, \(n_h \propto S_h N_h^{power\_q}\). The parameter power_q controls the trade-off between national precision (power_q = 1, equivalent to Neyman) and near-equal subnational CVs (power_q = 0).
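The four proportionality rules can be compared side by side in a few lines of base R. This is a sketch of the formulas as written above, not the package's implementation; alloc_shares and the toy stratum summaries are hypothetical.

```r
# Hypothetical helper: allocation shares implied by each rule.
alloc_shares <- function(N_h, S_h, rule,
                         cost = rep(1, length(N_h)), power_q = 0.5) {
  w <- switch(rule,
    proportional = N_h,
    neyman       = N_h * S_h,
    optimal      = N_h * S_h / sqrt(cost),
    power        = S_h * N_h^power_q,
    stop("unknown rule: ", rule)
  )
  w / sum(w)  # shares sum to 1; multiply by n for fractional n_h
}

# Toy stratum summaries, made up for illustration
N_h  <- c(500, 300, 150, 50)
S_h  <- c(5, 12, 30, 90)
cost <- c(1, 1, 4, 9)

shares <- sapply(
  c("proportional", "neyman", "optimal", "power"),
  function(r) alloc_shares(N_h, S_h, r, cost = cost)
)
```

These raw shares ignore the constraint n_h <= N_h, so a full allocation routine also has to cap strata that would otherwise be oversampled and redistribute the remainder.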
Stratum allocations n_h are rounded to integers using the ORIC method
(Cont and Heidari, 2015), which preserves sum(n_h) = n while minimizing
rounding distortion.
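The ORIC algorithm itself is not reproduced here. To illustrate only the constraint it satisfies (integer n_h that still sum to n), the sketch below swaps in the simpler largest-remainder rule; round_preserve_sum is a hypothetical helper, and its output need not match ORIC in general.

```r
# Hypothetical helper using largest-remainder rounding (NOT the ORIC method):
# round down, then hand the leftover units to the largest fractional parts.
round_preserve_sum <- function(n_frac) {
  n_total <- round(sum(n_frac))
  n_int   <- floor(n_frac)
  short   <- n_total - sum(n_int)    # units still to distribute
  if (short > 0) {
    bump <- order(n_frac - n_int, decreasing = TRUE)[seq_len(short)]
    n_int[bump] <- n_int[bump] + 1
  }
  n_int
}

# Toy fractional allocation, made up for illustration
n_h <- round_preserve_sum(c(9.02, 10.03, 13.95, 67))
```

Rounding each value independently can leave the total off by one or more units; any sum-preserving scheme avoids that by construction.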
References
Dalenius, T. and Hodges, J. L. (1959). Minimum variance stratification. Journal of the American Statistical Association, 54(285), 88–101.
Lavallée, P. and Hidiroglou, M. (1988). On the stratification of skewed populations. Survey Methodology, 14(1), 33–43.
Kozak, M. (2004). Optimal stratification using random search method in agricultural surveys. Statistics in Transition, 6(5), 797–806.
Gunning, P. and Horgan, J. M. (2004). A new algorithm for the construction of stratum boundaries in skewed populations. Survey Methodology, 30(2), 159–166.
Wesolowski, J., Wieczorkowski, R. and Wojciak, W. (2021). Optimality of the recursive Neyman allocation. Journal of Survey Statistics and Methodology, 10(5), 1263–1275.
Bankier, M. D. (1988). Power allocations: determining sample sizes for subnational areas. The American Statistician, 42(3), 174–177.
Cont, R. and Heidari, M. (2015). Optimal rounding under integer constraints. arXiv preprint arXiv:1501.00014.
See also
predict.svyplan_strata to assign new data to strata.
Examples
set.seed(42)
x <- rlnorm(500, meanlog = 6, sdlog = 1.5)
# Dalenius-Hodges (non-iterative)
strata_bound(x, n_strata = 4, method = "cumrootf", n = 100)
#> Strata boundaries (Dalenius-Hodges, 4 strata)
#> Boundaries: 400.0, 1400.0, 3400.0
#> n = 100, cv = 0.0209
#> Allocation: neyman
#> ---
#> stratum       lower    upper N_h   W_h    S_h n_h
#>       1    4.528383   400.00 261 0.522  112.1  18
#>       2  400.000000  1400.00 147 0.294  290.8  27
#>       3 1400.000000  3400.00  59 0.118  595.5  22
#>       4 3400.000000 34502.88  33 0.066 6512.6  33
# LH (default, iterative)
strata_bound(x, n_strata = 4, n = 100)
#> Strata boundaries (Lavallee-Hidiroglou, 4 strata)
#> Boundaries: 295.0, 827.9, 2090.5
#> n = 100, cv = 0.0192
#> Allocation: neyman
#> Converged: yes
#> ---
#> stratum       lower      upper N_h   W_h    S_h n_h
#>       1    4.528383   294.9703 215 0.430   82.7   9
#>       2  294.970346   827.8757 137 0.274  144.3  10
#>       3  827.875709  2090.5324  81 0.162  339.1  14
#>       4 2090.532395 34502.8792  67 0.134 5239.6  67
# Bankier power allocation (compromise between national and subnational CVs)
strata_bound(x, n_strata = 4, n = 100, alloc = "power", power_q = 0.5)
#> Strata boundaries (Lavallee-Hidiroglou, 4 strata)
#> Boundaries: 272.1, 743.5, 2053.3
#> n = 100, cv = 0.0196
#> Allocation: power (power_q = 0.50)
#> Converged: no
#> ---
#> stratum       lower      upper N_h   W_h    S_h n_h
#>       1    4.528383   272.1118 203 0.406   76.7   6
#>       2  272.111825   743.4612 138 0.276  125.1   8
#>       3  743.461152  2053.3060  90 0.180  337.0  17
#>       4 2053.305984 34502.8792  69 0.138 5190.9  69
# With take-all stratum
strata_bound(x, n_strata = 3, n = 80, certain = quantile(x, 0.95))
#> Strata boundaries (Lavallee-Hidiroglou, 3 strata)
#> Boundaries: 862.3, 3706.8
#> n = 80, cv = 0.0390
#> Allocation: neyman
#> Converged: yes
#> ---
#> stratum       lower      upper N_h   W_h    S_h n_h
#>       1    4.528383   862.2665 358 0.716  217.4  24
#>       2  862.266526  3706.8201 117 0.234  850.2  31
#>       3 3706.820126 34502.8792  25 0.050 6931.7  25