
Overview

sondage provides fast implementations of survey sampling algorithms for finite populations. Every sampling function returns a design object that carries the selected sample alongside the design metadata (inclusion probabilities or expected hits, sample size, method). A set of generics then operates on these objects to compute quantities needed for variance estimation.

The five entry-point functions are organised along two dimensions (equal vs. unequal probabilities, without vs. with replacement), plus a dedicated dispatcher for balanced sampling:

                  Without replacement    With replacement
  Equal prob.     equal_prob_wor()       equal_prob_wr()
  Unequal prob.   unequal_prob_wor()     unequal_prob_wr()
  Balanced        balanced_wor()

Equal probability sampling

Without replacement

The simplest case: draw n units from a population of size N with equal probabilities.

library(sondage)
s <- equal_prob_wor(N = 50, n = 5)
s
#> Equal prob WOR [srs] (n=5, N=50): 4 39 1 34 23

The returned object stores the sampled unit indices and the inclusion probabilities:

s$sample
#> [1]  4 39  1 34 23
s$pik[1:10]  # all equal to n/N = 0.1
#>  [1] 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1

Alternative methods are available. Systematic sampling provides implicit stratification based on the ordering of units:

s_sys <- equal_prob_wor(N = 50, n = 5, method = "systematic")
s_sys
#> Equal prob WOR [systematic] (n=5, N=50): 7 17 27 37 47

Bernoulli sampling selects each unit independently with probability p = n/N, producing a random sample size. The parameter n is the expected sample size, not the realized one.

s_ber <- equal_prob_wor(N = 50, n = 5, method = "bernoulli")
s_ber
#> Equal prob WOR [bernoulli] (n=5, N=50): 10 13
length(s_ber$sample)  # may differ from 5
#> [1] 2
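The selection rule behind this is simple: each unit k = 1, …, N is included independently with probability p = n/N, so the realized sample size is Binomial(N, n/N). A minimal sketch of the rule (Python for illustration; not the package's implementation):

```python
import random

def bernoulli_sample(N, n, seed=None):
    """Select each unit in 1..N independently with probability p = n/N.
    The realized sample size is Binomial(N, n/N), so it varies around n."""
    rng = random.Random(seed)
    p = n / N
    return [k for k in range(1, N + 1) if rng.random() < p]

s = bernoulli_sample(N=50, n=5, seed=1)
# len(s) is random with mean 5; every selected index lies in 1..50
```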

With replacement

s_wr <- equal_prob_wr(N = 50, n = 5)
s_wr
#> Equal prob WR [srs] (n=5, N=50): 39 42 6 24 32

With-replacement designs track hits (how many times each unit was selected) instead of inclusion probabilities:

s_wr$hits  # length-N vector of selection counts
#>  [1] 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0
#> [39] 1 0 0 1 0 0 0 0 0 0 0 0

Unequal probability sampling

When units have different sizes (employees, revenue, area, …), probability-proportional-to-size (PPS) designs give larger units higher selection probabilities, which reduces the variance of Horvitz–Thompson estimators (Horvitz and Thompson 1952).

Computing design quantities

Start from a vector of positive size measures and compute inclusion probabilities (WOR) or expected hits (WR):

x <- c(10, 20, 5, 40, 25)
n <- 3

# Without replacement: inclusion probabilities (capped at 1)
pik <- inclusion_prob(x, n)
pik
#> [1] 0.3333333 0.6666667 0.1666667 1.0000000 0.8333333
sum(pik)  # equals n
#> [1] 3

# With replacement: expected hits (can exceed 1)
eh <- expected_hits(x, n)
eh
#> [1] 0.30 0.60 0.15 1.20 0.75
sum(eh)   # equals n
#> [1] 3
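The capping can be spelled out: start from \(\pi_k = n x_k / \sum_k x_k\), set any value \(\ge 1\) to exactly 1 (a take-all unit), and redistribute the remaining sample size proportionally over the uncapped units, repeating until nothing exceeds 1. A sketch of that computation (Python for illustration; the helper name is hypothetical and this is not sondage's actual implementation):

```python
def capped_inclusion_prob(x, n):
    """PPS inclusion probabilities capped at 1 (illustrative re-derivation).
    Units with n * x_k / total >= 1 become take-all (pi = 1); the leftover
    sample size is spread proportionally over the remaining units."""
    pik = [0.0] * len(x)
    free = list(range(len(x)))          # indices not yet capped at 1
    n_left = n
    while free:
        total = sum(x[k] for k in free)
        capped = [k for k in free if n_left * x[k] / total >= 1]
        if not capped:
            for k in free:
                pik[k] = n_left * x[k] / total
            return pik
        for k in capped:
            pik[k] = 1.0
        n_left -= len(capped)
        free = [k for k in free if k not in capped]
    return pik

capped_inclusion_prob([10, 20, 5, 40, 25], 3)
# unit 4 is take-all; the others get 2 * x_k / 60, matching the output above
```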

Without replacement

Pass the inclusion probability vector to unequal_prob_wor(). The default method is Conditional Poisson Sampling (CPS, Chen, Dempster, and Liu 1994), which has maximum entropy among fixed-size designs:

s <- unequal_prob_wor(pik, method = "cps")
s
#> Unequal prob WOR [cps] (n=3, N=5): 4 1 5
s$sample
#> [1] 4 1 5
inclusion_prob(s)  # extract pik from the design
#> [1] 0.3333333 0.6666667 0.1666667 1.0000000 0.8333333

Other fixed-size methods include "brewer" (draw-by-draw, order-invariant), "systematic" (O(N), implicit stratification), "sps" (Sequential Poisson sampling, Ohlsson 1998, supports PRN), and "pareto" (Pareto sampling, Rosén 1997, supports PRN). The "poisson" method produces a random sample size.
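The "systematic" method's O(N) cost follows from its construction: cumulate the \(\pi_k\), draw a single uniform start \(u\), and select each unit whose cumulative interval contains one of the thresholds \(u, u+1, \ldots, u+n-1\). A sketch of that textbook rule (Python for illustration; not the package's C implementation):

```python
import random

def systematic_pps(pik, seed=None):
    """Systematic PPS: one uniform start u, then thresholds u, u+1, ...,
    u+n-1 matched against the cumulated pik in a single O(N) pass."""
    n = round(sum(pik))
    u = random.Random(seed).random()
    thresholds = [u + j for j in range(n)]
    sample, cum, t = [], 0.0, 0
    for k, p in enumerate(pik, start=1):
        cum += p
        while t < n and thresholds[t] < cum:
            sample.append(k)   # unit k's interval [cum - p, cum) holds threshold t
            t += 1
    return sample
```

Because each \(\pi_k \le 1\), an interval of length \(\pi_k\) can contain at most one threshold, so the sample always has exactly n distinct units, and a take-all unit (\(\pi_k = 1\)) is always selected.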

With replacement

Pass the expected hits vector to unequal_prob_wr(). Chromy's (1979) method is the default:

s_wr <- unequal_prob_wr(eh, method = "chromy")
s_wr
#> Unequal prob WR [chromy] (n=3, N=5): 2 4 5
s_wr$hits  # realized selection counts
#> [1] 0 1 0 1 1

The "multinomial" method draws each selection independently.
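That independence makes the multinomial design easy to state: each of the \(n\) draws picks unit \(k\) with probability \(x_k / \sum x\), and the hits vector counts the repeats. A sketch (Python for illustration; not the package's code):

```python
import random

def multinomial_wr(x, n, seed=None):
    """n independent PPS draws with replacement; returns the hits vector."""
    rng = random.Random(seed)
    hits = [0] * len(x)
    for _ in range(n):
        k = rng.choices(range(len(x)), weights=x)[0]  # P(k) = x_k / sum(x)
        hits[k] += 1
    return hits

multinomial_wr([10, 20, 5, 40, 25], 3, seed=1)
# a length-5 hits vector summing to 3; repeats are possible
```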

Balanced sampling

When auxiliary information is available at the design stage, balanced sampling ensures that Horvitz–Thompson estimates of auxiliary totals are (approximately) exact: \(\sum_{k \in S} x_k / \pi_k \approx \sum_{k \in U} x_k\). This reduces variance for any study variable correlated with the balancing variables.

The balanced_wor() function implements the cube method (Deville and Tillé 2004):

pik <- inclusion_prob(c(10, 20, 5, 40, 25), n = 3)
x <- matrix(c(10, 20, 5, 40, 25))  # auxiliary variable to balance on
s_bal <- balanced_wor(pik, aux = x)
s_bal
#> Unequal prob WOR [cube] (n=3, N=5): 2 4 5

The cube method proceeds in two phases: a flight phase that moves probabilities toward 0 or 1 while maintaining all balancing constraints, and a landing phase that resolves the remaining undecided units by progressively relaxing constraints (starting from the last column of aux). The sample size constraint is always placed first and never relaxed.

Users should order auxiliary variables by importance (most important first) in the aux matrix.
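The balancing property can be verified by hand for this example. With pik = (1/3, 2/3, 1/6, 1, 5/6) and the sample {2, 4, 5} shown above, the HT estimate of the auxiliary total is 20/(2/3) + 40/1 + 25/(5/6) = 30 + 40 + 30 = 100, exactly the population total of x. A quick numeric check (Python, using only the values from this example):

```python
x = [10, 20, 5, 40, 25]
pik = [1/3, 2/3, 1/6, 1.0, 5/6]   # inclusion probabilities for n = 3
sample = [2, 4, 5]                # 1-based unit indices from the printout

# Horvitz-Thompson estimate of the auxiliary total over the balanced sample:
# 20/(2/3) + 40/1 + 25/(5/6) = 30 + 40 + 30 = 100 = sum(x)
ht_total = sum(x[k - 1] / pik[k - 1] for k in sample)
```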

Stratified balanced sampling

When strata are known, the stratified cube method (Chauvet 2009) preserves within-stratum sample sizes while balancing on auxiliary variables. Exact preservation requires sum(pik) within each stratum to be close to an integer; when this condition is not met, within-stratum and total sample sizes will vary around their targets and the design is treated as random-size (fixed_size = FALSE):

N <- 20
pik_strat <- rep(0.4, N)
x_strat <- matrix(as.double(1:N), ncol = 1)
strata <- rep(1:4, each = 5)

s_strat <- balanced_wor(pik_strat, aux = x_strat, strata = strata)
s_strat
#> Unequal prob WOR [cube] (n=8, N=20): 3 5 6 10 13 14 17 20

# Check per-stratum sample sizes
table(strata[s_strat$sample])
#> 
#> 1 2 3 4 
#> 2 2 2 2

Joint inclusion probabilities for the cube use the high-entropy approximation, since the cube produces a near-maximum-entropy design.

Design queries

Once you have a design object, generics compute the quantities needed for variance estimation.

Joint inclusion probabilities (WOR)

pik <- inclusion_prob(c(10, 20, 5, 40, 25), n = 3)
s <- unequal_prob_wor(pik, method = "cps")

pikl <- joint_inclusion_prob(s)
pikl
#>            [,1]       [,2]       [,3]      [,4]      [,5]
#> [1,] 0.33333333 0.10026570 0.01935766 0.3333333 0.2137089
#> [2,] 0.10026570 0.66666667 0.04704226 0.6666667 0.5193577
#> [3,] 0.01935766 0.04704226 0.16666667 0.1666667 0.1002667
#> [4,] 0.33333333 0.66666667 0.16666667 1.0000000 0.8333333
#> [5,] 0.21370893 0.51935766 0.10026667 0.8333333 0.8333333

The diagonal holds the first-order probabilities \(\pi_i\) and the off-diagonal entries are \(\pi_{ij} = P(i \in S \text{ and } j \in S)\).

Joint expected hits (WR)

eh <- expected_hits(c(10, 20, 5, 40, 25), n = 3)
s_wr <- unequal_prob_wr(eh, method = "chromy")

joint_expected_hits(s_wr)
#>        [,1]   [,2]   [,3]   [,4]   [,5]
#> [1,] 0.2997 0.1077 0.0350 0.3529 0.1038
#> [2,] 0.1077 0.5934 0.0115 0.6341 0.4335
#> [3,] 0.0350 0.0115 0.1520 0.1577 0.0998
#> [4,] 0.3529 0.6341 0.1577 1.6261 0.8553
#> [5,] 0.1038 0.4335 0.0998 0.8553 0.7462

Submatrices for large populations

By default, joint_inclusion_prob() and joint_expected_hits() return the full N x N matrix, which is prohibitive when N > 10,000 (~0.8 GB). In practice, variance estimators only need the submatrix for sampled units. The sampled_only = TRUE argument returns this n x n submatrix directly:

pik_large <- inclusion_prob(runif(500), n = 10)
s_large <- unequal_prob_wor(pik_large, method = "brewer")

# Full 500 x 500 matrix
pikl_full <- joint_inclusion_prob(s_large)
dim(pikl_full)
#> [1] 500 500

# Only the 10 x 10 submatrix for sampled units
pikl_sub <- joint_inclusion_prob(s_large, sampled_only = TRUE)
dim(pikl_sub)
#> [1] 10 10

All methods compute the submatrix directly without allocating the N x N matrix, so arbitrarily large populations are supported. The sampling_cov() generic also accepts sampled_only:

delta_sub <- sampling_cov(s_large, sampled_only = TRUE)
dim(delta_sub)
#> [1] 10 10

Sampling covariance

The sampling_cov() generic computes \(\Delta_{ij} = \pi_{ij} - \pi_i \pi_j\) for WOR designs (or the analogous expression for WR designs):

sampling_cov(s)
#>             [,1]        [,2]        [,3] [,4]        [,5]
#> [1,]  0.22222222 -0.12195653 -0.03619789    0 -0.06406885
#> [2,] -0.12195653  0.22222222 -0.06406885    0 -0.03619789
#> [3,] -0.03619789 -0.06406885  0.13888889    0 -0.03862222
#> [4,]  0.00000000  0.00000000  0.00000000    0  0.00000000
#> [5,] -0.06406885 -0.03619789 -0.03862222    0  0.13888889
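Each entry above is just the joint probability minus the product of marginals, and the row and column for unit 4 are identically zero because \(\pi_4 = 1\) forces \(\pi_{i4} = \pi_i\). A quick numeric check of both facts (Python, using the values printed earlier):

```python
pik = [1/3, 2/3, 1/6, 1.0, 5/6]

pi_12 = 0.10026570                  # joint probability of units 1 and 2, from above
delta_12 = pi_12 - pik[0] * pik[1]  # matches the printed -0.12195653

pi_14 = pik[0]                      # unit 4 is take-all, so pi_{i4} = pi_i
delta_14 = pi_14 - pik[0] * pik[3]  # exactly zero, hence the zero row/column
```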

With weighted = TRUE, it returns the Sen–Yates–Grundy check quantities \(1 - \pi_i \pi_j / \pi_{ij}\) (Sen 1953; Yates and Grundy 1953). For well-behaved WOR designs, these should be non-positive off-diagonal:

sampling_cov(s, weighted = TRUE)
#>            [,1]        [,2]       [,3] [,4]        [,5]
#> [1,]  0.6666667 -1.21633352 -1.8699516    0 -0.29979492
#> [2,] -1.2163335  0.33333333 -1.3619425    0 -0.06969743
#> [3,] -1.8699516 -1.36194247  0.8333333    0 -0.38519497
#> [4,]  0.0000000  0.00000000  0.0000000    0  0.00000000
#> [5,] -0.2997949 -0.06969743 -0.3851950    0  0.16666667

Batch sampling

All five dispatchers accept an nrep parameter for drawing multiple independent samples in one call, which is useful for simulation studies. The result is always a design object (the same class as with nrep = 1), but with $sample holding all replicates:

sim <- equal_prob_wor(N = 50, n = 5, nrep = 4)
sim  # design object
#> Equal prob WOR [srs] (n=5, N=50): 4 replicates
#>   rep 1: 47 50 20 49 42
dim(sim$sample)  # 5 x 4 matrix: each column is one sample
#> [1] 5 4

# Generics still work on batch objects
inclusion_prob(sim)
#>  [1] 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1
#> [20] 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1
#> [39] 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1

For fixed-size designs $sample is a matrix (n x nrep). For random-size designs (Bernoulli, Poisson) it is a list of integer vectors, and realized sample sizes are available via lengths(sim$sample).

Design properties and statistical guarantees

The table below summarises what sondage computes for each method. Understanding these properties is essential for choosing the right variance estimator.

Method            Type       Fixed n   Exact 1st order   Joint probs
SRS WOR           EP WOR     Yes       Yes               Exact
Systematic (EP)   EP WOR     Yes       Yes               Exact (zeros)
Bernoulli         EP WOR     No        Yes               Exact
SRS WR            EP WR      Yes       Yes               Exact
CPS               UP WOR     Yes       Yes               Exact
Brewer            UP WOR     Yes       Yes               HE approx
Systematic PPS    UP WOR     Yes       Yes               Exact (zeros)
Poisson           UP WOR     No        Yes               Exact
SPS               UP WOR     Yes       Approx*           HE approx
Pareto            UP WOR     Yes       Approx*           HE approx
Chromy            UP WR      Yes       Yes               Simulated
Multinomial       UP WR      Yes       Yes               Exact
Cube              Balanced   Yes       Yes               HE approx

HE approx = high-entropy approximation (Brewer and Donadio 2003). Approx* = target probabilities, exact asymptotically. Simulated = Monte Carlo via the nsim parameter.

Notes on specific methods

Equal-probability systematic sampling. In sondage, joint_inclusion_prob() returns exact joint inclusion probabilities computed via C code, using the same algorithm as systematic PPS with uniform \(\pi_k = n/N\). Some \(\pi_{ij}\) are structurally zero (pairs of units that can never co-occur in the same systematic sample), so the SYG estimator is generally inapplicable. Consider replication or successive-differences variance estimators for systematic designs.

SPS and Pareto (order sampling). Both are implemented via Rosén's (1997) order sampling framework. SPS uses the ranking key \(\xi_k = u_k / \pi_k\) (equivalent to Ohlsson's (1998) sequential threshold adjustment); Pareto uses \(\xi_k = [u_k/(1-u_k)] / [\pi_k/(1-\pi_k)]\). The \(n\) units with the smallest \(\xi_k\) are selected. First-order inclusion probabilities are approximately (not exactly) equal to the target \(\pi_k\) for finite populations; the discrepancy vanishes as \(N\) grows. Both support permanent random numbers (PRN) for sample coordination.
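The order-sampling selection step itself is short; a sketch of the Pareto variant under the definition above (Python for illustration only — reusing the same \(u_k\) across survey occasions is what PRN coordination amounts to):

```python
import random

def pareto_order_sample(pik, n, seed=None):
    """Pareto order sampling: rank units by xi_k = [u/(1-u)] / [pi/(1-pi)]
    and keep the n smallest keys. Take-all units (pi = 1) get key 0 and
    are therefore always selected."""
    rng = random.Random(seed)
    keys = []
    for k, p in enumerate(pik, start=1):
        if p >= 1.0:
            keys.append((0.0, k))                       # take-all unit
        else:
            u = rng.random()
            keys.append(((u / (1 - u)) / (p / (1 - p)), k))
    return sorted(k for _, k in sorted(keys)[:n])
```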

Systematic PPS. Joint inclusion probabilities are exact (computed via C code), but some \(\pi_{ij}\) are structurally zero: pairs of units that can never co-occur in the same sample. sampling_cov(s, weighted = TRUE) returns NA for those entries and issues a warning, because the SYG estimator is undefined for such pairs. Consider successive-differences, Hartley–Rao, or replication methods instead.

Chromy. Pairwise expectations \(E(n_i n_j)\) are estimated by Monte Carlo simulation in joint_expected_hits() (controlled by nsim, default 10,000).

Cube (balanced sampling). The cube method (Deville and Tillé 2004) produces near-maximum-entropy samples, so joint_inclusion_prob() uses the high-entropy approximation. The stratified variant (Chauvet 2009) preserves exact within-stratum sample sizes. Order auxiliary variables by importance (most important first) since the landing phase relaxes constraints from last to first.

Choosing a variance strategy

  • SYG with exact \(\pi_{ij}\): CPS and SRS WOR provide exact joint probabilities, so the Sen–Yates–Grundy variance estimator (Sen 1953; Yates and Grundy 1953) is unbiased.
  • SYG with approximate \(\pi_{ij}\): For Brewer, SPS, Pareto, and cube, joint probabilities use the high-entropy approximation (Brewer and Donadio 2003, eq. 18). This guarantees symmetry, non-negativity, and \(\pi_{ij} \le \min(\pi_i, \pi_j)\), but does not satisfy the fixed-size marginal identity \(\sum_{j \neq i}\pi_{ij} = (n-1)\pi_i\) exactly. The marginal defect is typically small but can be non-trivial for highly skewed \(\pi_k\) vectors when \(n\) is small (e.g., \(n = 2\)); a warning is issued when it exceeds 5% of \(n\). Use method = "cps" when exact second-order structure is required.
  • Independent designs (Poisson, Bernoulli): The covariance \(\Delta_{ij} = 0\) for \(i \neq j\), so the variance simplifies to \(\text{Var}(\hat{Y}_{HT}) = \sum_i (1 - \pi_i)\, y_i^2 / \pi_i\).
  • Systematic designs with \(\pi_{ij} = 0\): The SYG estimator is undefined. Use successive-differences, Hartley–Rao approximations, or replication methods.
  • With-replacement designs (Chromy, multinomial): Use the Hansen and Hurwitz (1943) estimator, or the generalised Horvitz–Thompson framework of Chromy (2009) which unifies WOR and WR variance estimation via sampling_cov().
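For the first two bullets, the estimator itself is short: \(\hat{V}_{SYG} = \tfrac{1}{2} \sum_{i \neq j \in S} \frac{\pi_i \pi_j - \pi_{ij}}{\pi_{ij}} \left(\frac{y_i}{\pi_i} - \frac{y_j}{\pi_j}\right)^2\). A sketch (Python for illustration), sanity-checked against SRS WOR, where exact joint probabilities are known and SYG reduces to the familiar \(N^2 (1 - n/N) s^2 / n\):

```python
def syg_variance(y, pik, pikl):
    """Sen-Yates-Grundy variance estimate from sample values y, first-order
    probabilities pik, and joint probabilities pikl (all sample-indexed)."""
    n = len(y)
    v = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                w = (pik[i] * pik[j] - pikl[i][j]) / pikl[i][j]
                v += 0.5 * w * (y[i] / pik[i] - y[j] / pik[j]) ** 2
    return v

# SRS WOR: pi_i = n/N, pi_ij = n(n-1) / (N(N-1))
N, n = 10, 4
y = [3.0, 7.0, 1.0, 5.0]
pik = [n / N] * n
pij = n * (n - 1) / (N * (N - 1))
pikl = [[n / N if i == j else pij for j in range(n)] for i in range(n)]
v = syg_variance(y, pik, pikl)
# v equals the textbook SRS estimate N^2 * (1 - n/N) * s^2 / n
```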

References

Brewer, K. R. W., and M. E. Donadio. 2003. “The High Entropy Variance of the Horvitz–Thompson Estimator.” Survey Methodology 29 (2): 189–96.
Chauvet, Guillaume. 2009. “Stratified Balanced Sampling.” Survey Methodology 35: 115–19.
Chen, Shaw-Hwa, Arthur P. Dempster, and Jun S. Liu. 1994. “Weighted Finite Population Sampling to Maximize Entropy.” Biometrika 81 (3): 457–69. https://doi.org/10.1093/biomet/81.3.457.
Chromy, James R. 1979. “Sequential Sample Selection Methods.” In Proceedings of the Survey Research Methods Section, American Statistical Association, 401–6.
———. 2009. “Some Generalizations of the Horvitz–Thompson Estimator.” In Proceedings of the Survey Research Methods Section, American Statistical Association.
Deville, Jean-Claude, and Yves Tillé. 2004. “Efficient Balanced Sampling: The Cube Method.” Biometrika 91 (4): 893–912. https://doi.org/10.1093/biomet/91.4.893.
Hansen, Morris H., and William N. Hurwitz. 1943. “On the Theory of Sampling from Finite Populations.” The Annals of Mathematical Statistics 14 (4): 333–62. https://doi.org/10.1214/aoms/1177731356.
Horvitz, D. G., and D. J. Thompson. 1952. “A Generalization of Sampling Without Replacement from a Finite Universe.” Journal of the American Statistical Association 47 (260): 663–85. https://doi.org/10.1080/01621459.1952.10483446.
Ohlsson, Esbjörn. 1998. “Sequential Poisson Sampling.” Journal of Official Statistics 14 (2): 149–62.
Rosén, Bengt. 1997. “On Sampling with Probability Proportional to Size.” Journal of Statistical Planning and Inference 62 (2): 159–91. https://doi.org/10.1016/S0378-3758(96)00186-3.
Sen, A. R. 1953. “On the Estimate of the Variance in Sampling with Varying Probabilities.” Journal of the Indian Society of Agricultural Statistics 5: 119–27.
Yates, F., and P. M. Grundy. 1953. “Selection Without Replacement from Within Strata with Probability Proportional to Size.” Journal of the Royal Statistical Society, Series B 15 (2): 253–61. https://doi.org/10.1111/j.2517-6161.1953.tb00140.x.