Unequal Probability Sampling Without Replacement

Draws a sample with unequal inclusion probabilities, without replacement.

Usage

unequal_prob_wor(
  pik,
  method = c("cps", "sampford", "brewer", "systematic", "poisson", "sps", "pareto"),
  nrep = 1L,
  prn = NULL,
  ...
)

Arguments

pik

A numeric vector of inclusion probabilities. For fixed-size methods, sum(pik) must equal an integer to floating-point accuracy: an exact fixed-size design cannot have a non-integer sum, so looser sums are rejected rather than silently rounded. The target sample size of these methods, sum(pik), must also be at least 1. Units with pik of exactly 0 are never selected and units with pik of exactly 1 are always selected; values in between, however close to the boundary, are sampled as given.
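The integer-sum requirement can be checked before calling. The following is a minimal base-R sketch: is_integer_sum() and its default tolerance are hypothetical illustrations, not the package's validator, and the tolerance the package actually applies may differ.

```r
# Hypothetical pre-check (an assumption, not part of the package):
# TRUE when sum(pik) is an integer of at least 1, up to a small tolerance.
is_integer_sum <- function(pik, tol = sqrt(.Machine$double.eps)) {
  s <- sum(pik)
  abs(s - round(s)) < tol && round(s) >= 1
}

# Size-proportional probabilities for a target sample size of 2
x   <- c(10, 20, 30, 40)
pik <- 2 * x / sum(x)          # 0.2 0.4 0.6 0.8

is_integer_sum(pik)            # sum is 2
#> [1] TRUE
is_integer_sum(pik * 0.99)     # sum is 1.98
#> [1] FALSE
```

Scaling a size measure this way can push large units above 1, so the resulting vector should also be checked against the [0, 1] range.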

method

The sampling method:

"cps": Conditional Poisson Sampling (maximum entropy; Chen et al., 1994). Fixed size, exact joint probabilities with all $\pi_{ij} > 0$. Calibration and the conditional probability table are O(Nn) (probability-domain Poisson-binomial recurrence); each draw is O(N) thereafter. Equal pik are drawn directly as SRS.
"sampford": Sampford's (1967) fixed-size PPS design. Gives the supplied first-order inclusion probabilities exactly and has exact joint inclusion probabilities. A native C kernel combines the fast rejection construction with Grafstrom's non-rejective conditional-Poisson fallback and draws the smaller of the sample and its complement. Typical O(N + n); O(Nn) in the fallback.
"brewer": Brewer's (1975) draw-by-draw method. Fixed size, approximate joint probabilities (high-entropy approximation; see joint_inclusion_prob()). O(Nn).
"systematic": Systematic PPS. Fixed size, exact joint probabilities, but some may be zero (pairs that never co-occur), which makes the Sen-Yates-Grundy (SYG) variance estimator inapplicable; see sampling_cov(). O(N).
"poisson": Poisson sampling. Random sample size (expected $n = \sum \pi_k$). Units are selected independently, so $\pi_{ij} = \pi_i \pi_j$ for $i \neq j$. Supports PRN. O(N).
"sps": Sequential Poisson Sampling (Ohlsson, 1998). Order sampling with key $\xi_k = u_k / \pi_k$; the $n$ smallest are selected. Fixed size, high-entropy. Supports PRN. Approximate joint probabilities. The true first-order inclusion probabilities are approximately equal to the supplied pik; see inclusion_prob(). Expected O(N). Tied keys (possible with duplicated prn values) are broken toward the smallest population index.
"pareto": Pareto sampling (Rosen, 1997). Order sampling with odds-ratio key $\xi_k = [u_k/(1-u_k)] / [\pi_k/(1-\pi_k)]$. Same properties as "sps". Expected O(N).
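The two order-sampling keys described for "sps" and "pareto" can be sketched in a few lines of base R. order_sample() is a hypothetical helper written for illustration only; it is not the package's kernel, and it assumes every pik lies strictly between 0 and 1 and that sum(pik) is an integer.

```r
# Hypothetical illustration of order sampling (not the package implementation).
order_sample <- function(pik, prn = runif(length(pik)),
                         type = c("sps", "pareto")) {
  type <- match.arg(type)
  n <- round(sum(pik))                               # fixed sample size
  key <- switch(type,
    sps    = prn / pik,                              # xi_k = u_k / pi_k
    pareto = (prn / (1 - prn)) / (pik / (1 - pik))   # odds-ratio key
  )
  # Keep the n smallest keys; tied keys go to the smallest population index.
  sort(order(key, seq_along(key))[seq_len(n)])
}

pik <- c(0.2, 0.4, 0.6, 0.8)
prn <- c(0.5, 0.1, 0.9, 0.3)
order_sample(pik, prn, type = "sps")   # keys: 2.5, 0.25, 1.5, 0.375
#> [1] 2 4
```

Because the selection depends only on the ranking of the keys, reusing the same prn vector across calls reproduces the same sample for the same pik.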

nrep

Number of replicate samples (default 1). When nrep > 1, $sample holds a matrix (fixed-size) or list (random-size) of all replicates. The design object and all generics remain usable.

prn

Optional vector of permanent random numbers (length N, values in the open interval (0, 1)) for sample coordination. Supported by methods "sps", "pareto", and "poisson". When NULL, random numbers are generated internally. Cannot be used with nrep > 1 (identical PRN would produce identical replicates). Use a loop with different PRN vectors for coordinated repeated sampling.
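To see why permanent random numbers coordinate samples, consider a base-R sketch of PRN-driven Poisson sampling on two occasions. poisson_prn() is a hypothetical helper, and the selection rule prn < pik is an assumption made for illustration; the package's own rule may differ in detail.

```r
# Hypothetical illustration: a unit is selected when its PRN falls below pik.
poisson_prn <- function(pik, prn) which(prn < pik)

set.seed(1)
N   <- 10
prn <- runif(N)            # one permanent random number per unit

pik_t1 <- rep(0.3, N)      # first occasion
pik_t2 <- rep(0.5, N)      # second occasion, larger probabilities

s1 <- poisson_prn(pik_t1, prn)
s2 <- poisson_prn(pik_t2, prn)

# With shared PRN and pik_t2 >= pik_t1, every unit in s1 is also in s2.
all(s1 %in% s2)
#> [1] TRUE
```

Drawing a fresh prn vector for the second occasion would instead make the two samples independent.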

...

Additional arguments passed to methods registered via register_method(). Built-in methods take no additional arguments; the former eps boundary-trimming argument was removed because it silently changed the design.

Value

An object of class c("unequal_prob", "wor", "sondage_sample"). When nrep = 1, $sample is an integer vector of selected unit indices. When nrep > 1, $sample is a matrix (n x nrep) for fixed-size methods, or a list of integer vectors of varying lengths for random-size methods ("poisson"). $n is an integer for fixed-size methods (realized size) and a double for "poisson" (expected size, sum(pik)); see sondage_sample.
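The n x nrep layout makes replicate-level summaries straightforward, for example empirical selection frequencies per unit. The sketch below is self-contained base R: sys_pps() is a hypothetical stand-in sampler (a plain systematic PPS draw) used only to build a matrix of the same shape, and is not the package's code.

```r
# Hypothetical stand-in sampler (systematic PPS), not the package implementation.
sys_pps <- function(pik) {
  # Shift the cumulative totals by one uniform start; unit k is selected
  # when an integer falls between its shifted lower and upper totals.
  a <- c(0, cumsum(pik)) - runif(1)
  which(diff(floor(a)) > 0)
}

set.seed(42)
pik  <- c(0.2, 0.4, 0.6, 0.8)
nrep <- 2000
samp <- replicate(nrep, sys_pps(pik))   # one column per replicate

dim(samp)
#> [1]    2 2000

# Share of replicates in which each unit appears; should be close to pik.
freq <- tabulate(samp, nbins = length(pik)) / nrep
round(freq, 2)
```

The same tabulation would apply to a matrix returned in $sample when nrep > 1; for random-size methods, where $sample is a list, unlist() can be applied first.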

Details

Near-certainty inclusion probabilities (CPS). The CPS fixed-point calibration converges geometrically for well-spread pik, but stalls at a non-zero residual when some pik are within a few decimal digits of 0 or 1 (e.g. 0.9999). When this happens, the function emits a "CPS calibration did not reach tolerance" warning that reports the achieved max_diff. The realized first-order inclusion probabilities then differ from the target by up to max_diff, typically 1e-5 or smaller for inputs in the 0.999 range, which is well within Monte Carlo error for most estimators. To avoid the warning, clip pik away from 0 and 1 before calling.

References

Chen, X. H., Dempster, A. P., & Liu, J. S. (1994). Weighted finite population sampling to maximize entropy. Biometrika, 81(3), 457-469.

Brewer, K.R.W. (1975). A simple procedure for sampling pi-ps wor. Australian Journal of Statistics, 17(3), 166-172.

Sampford, M.R. (1967). On sampling without replacement with unequal probabilities of selection. Biometrika, 54(3/4), 499-513.

Grafstrom, A. (2009). Non-rejective implementations of the Sampford sampling design. Journal of Statistical Planning and Inference, 139(6), 2111-2114.

Ohlsson, E. (1998). Sequential Poisson sampling. Journal of Official Statistics, 14(2), 149-162.

Rosen, B. (1997). On sampling with probability proportional to size. Journal of Statistical Planning and Inference, 62(2), 159-191.

Tillé, Y. (2006). Sampling Algorithms. Springer.

Examples

pik <- c(0.2, 0.4, 0.6, 0.8)

# Conditional Poisson Sampling
set.seed(123)
s <- unequal_prob_wor(pik, method = "cps")
s$sample
#> [1] 3 4

# Brewer's method
s <- unequal_prob_wor(pik, method = "brewer")
s$sample
#> [1] 3 4

# Sequential Poisson Sampling with PRN coordination
prn <- runif(4)
s <- unequal_prob_wor(pik, method = "sps", prn = prn)
s$sample
#> [1] 2 3

# Pareto sampling
s <- unequal_prob_wor(pik, method = "pareto", prn = prn)
s$sample
#> [1] 2 3

# \donttest{
# Batch mode for simulations
sim <- unequal_prob_wor(pik, method = "cps", nrep = 1000)
dim(sim$sample)  # 2 x 1000
#> [1]    2 1000
# }