
Overview

sondage provides fast implementations of survey sampling algorithms for finite populations. Every sampling function returns a design object that carries the selected sample alongside the design metadata (inclusion probabilities or expected hits, sample size, method). A set of generics then operates on these objects to compute quantities needed for variance estimation.

The five entry-point functions are organised along two dimensions (equal vs. unequal probabilities, without vs. with replacement), plus a dedicated dispatcher for balanced sampling:

                  Without replacement    With replacement
  Equal prob.     equal_prob_wor()       equal_prob_wr()
  Unequal prob.   unequal_prob_wor()     unequal_prob_wr()
  Balanced        balanced_wor()

Equal probability sampling

Without replacement

The simplest case: draw n units from a population of size N with equal probabilities.

library(sondage)
s <- equal_prob_wor(N = 50, n = 5)
s
#> Equal prob WOR [srs] (n=5, N=50): 4 39 1 34 23

The returned object stores the sampled unit indices and the inclusion probabilities:

s$sample
#> [1]  4 39  1 34 23
s$pik[1:10]  # all equal to n/N = 0.1
#>  [1] 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1

Alternative methods are available. Systematic sampling provides implicit stratification based on the ordering of units:

s_sys <- equal_prob_wor(N = 50, n = 5, method = "systematic")
s_sys
#> Equal prob WOR [systematic] (n=5, N=50): 7 17 27 37 47

Bernoulli sampling selects each unit independently with probability p = n/N, producing a random sample size. The parameter n is the expected sample size, not the realized one.

s_ber <- equal_prob_wor(N = 50, n = 5, method = "bernoulli")
s_ber
#> Equal prob WOR [bernoulli] (n=5, N=50): 10 13
length(s_ber$sample)  # may differ from 5
#> [1] 2
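The selection rule behind this is simple: each unit k = 1, …, N is included independently with probability p = n/N, so the realized sample size is Binomial(N, n/N). A minimal sketch of the rule (Python for illustration; not the package's implementation):

```python
import random

def bernoulli_sample(N, n, seed=None):
    """Select each unit in 1..N independently with probability p = n/N.
    The realized sample size is Binomial(N, n/N), so it varies around n."""
    rng = random.Random(seed)
    p = n / N
    return [k for k in range(1, N + 1) if rng.random() < p]

s = bernoulli_sample(N=50, n=5, seed=1)
# len(s) is random with mean 5; every selected index lies in 1..50
```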

With replacement

s_wr <- equal_prob_wr(N = 50, n = 5)
s_wr
#> Equal prob WR [srs] (n=5, N=50): 39 42 6 24 32

With-replacement designs track hits (how many times each unit was selected) instead of inclusion probabilities:

s_wr$hits  # length-N vector of selection counts
#>  [1] 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0
#> [39] 1 0 0 1 0 0 0 0 0 0 0 0

Unequal probability sampling

When units have different sizes (employees, revenue, area, …), probability-proportional-to-size (PPS) designs give larger units higher selection probabilities, which reduces the variance of Horvitz–Thompson estimators (Horvitz and Thompson 1952).

Computing design quantities

Start from a vector of positive size measures and compute inclusion probabilities (WOR) or expected hits (WR):

x <- c(10, 20, 5, 40, 25)
n <- 3

# Without replacement: inclusion probabilities (capped at 1)
pik <- inclusion_prob(x, n)
pik
#> [1] 0.3333333 0.6666667 0.1666667 1.0000000 0.8333333
sum(pik)  # equals n
#> [1] 3

# With replacement: expected hits (can exceed 1)
eh <- expected_hits(x, n)
eh
#> [1] 0.30 0.60 0.15 1.20 0.75
sum(eh)   # equals n
#> [1] 3
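The capping can be spelled out: start from \(\pi_k = n x_k / \sum_k x_k\), set any value \(\ge 1\) to exactly 1 (a take-all unit), and redistribute the remaining sample size proportionally over the uncapped units, repeating until nothing exceeds 1. A sketch of that computation (Python for illustration; the helper name is hypothetical and this is not sondage's actual implementation):

```python
def capped_inclusion_prob(x, n):
    """PPS inclusion probabilities capped at 1 (illustrative re-derivation).
    Units with n * x_k / total >= 1 become take-all (pi = 1); the leftover
    sample size is spread proportionally over the remaining units."""
    pik = [0.0] * len(x)
    free = list(range(len(x)))          # indices not yet capped at 1
    n_left = n
    while free:
        total = sum(x[k] for k in free)
        capped = [k for k in free if n_left * x[k] / total >= 1]
        if not capped:
            for k in free:
                pik[k] = n_left * x[k] / total
            return pik
        for k in capped:
            pik[k] = 1.0
        n_left -= len(capped)
        free = [k for k in free if k not in capped]
    return pik

capped_inclusion_prob([10, 20, 5, 40, 25], 3)
# unit 4 is take-all; the others get 2 * x_k / 60, matching the output above
```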

Without replacement

Pass the inclusion probability vector to unequal_prob_wor(). The default method is Conditional Poisson Sampling (CPS, Chen, Dempster, and Liu 1994), which has maximum entropy among fixed-size designs:

s <- unequal_prob_wor(pik, method = "cps")
s
#> Unequal prob WOR [cps] (n=3, N=5): 4 1 5
s$sample
#> [1] 4 1 5
inclusion_prob(s)  # extract pik from the design
#> [1] 0.3333333 0.6666667 0.1666667 1.0000000 0.8333333

Other fixed-size methods include "brewer" (draw-by-draw, order-invariant), "systematic" (O(N), implicit stratification), "sps" (Sequential Poisson sampling, Ohlsson 1998, supports PRN), and "pareto" (Pareto sampling, Rosén 1997, supports PRN). The "poisson" method produces a random sample size.
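The "systematic" method's O(N) cost follows from its construction: cumulate the \(\pi_k\), draw a single uniform start \(u\), and select each unit whose cumulative interval contains one of the thresholds \(u, u+1, \ldots, u+n-1\). A sketch of that textbook rule (Python for illustration; not the package's C implementation):

```python
import random

def systematic_pps(pik, seed=None):
    """Systematic PPS: one uniform start u, then thresholds u, u+1, ...,
    u+n-1 matched against the cumulated pik in a single O(N) pass."""
    n = round(sum(pik))
    u = random.Random(seed).random()
    thresholds = [u + j for j in range(n)]
    sample, cum, t = [], 0.0, 0
    for k, p in enumerate(pik, start=1):
        cum += p
        while t < n and thresholds[t] < cum:
            sample.append(k)   # unit k's interval [cum - p, cum) holds threshold t
            t += 1
    return sample
```

Because each \(\pi_k \le 1\), an interval of length \(\pi_k\) can contain at most one threshold, so the sample always has exactly n distinct units, and a take-all unit (\(\pi_k = 1\)) is always selected.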

With replacement

Pass the expected hits vector to unequal_prob_wr(). Chromy's (1979) method is the default:

s_wr <- unequal_prob_wr(eh, method = "chromy")
s_wr
#> Unequal prob WR [chromy] (n=3, N=5): 2 4 5
s_wr$hits  # realized selection counts
#> [1] 0 1 0 1 1

The "multinomial" method draws each selection independently.
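That independence makes the multinomial design easy to state: each of the \(n\) draws picks unit \(k\) with probability \(x_k / \sum x\), and the hits vector counts the repeats. A sketch (Python for illustration; not the package's code):

```python
import random

def multinomial_wr(x, n, seed=None):
    """n independent PPS draws with replacement; returns the hits vector."""
    rng = random.Random(seed)
    hits = [0] * len(x)
    for _ in range(n):
        k = rng.choices(range(len(x)), weights=x)[0]  # P(k) = x_k / sum(x)
        hits[k] += 1
    return hits

multinomial_wr([10, 20, 5, 40, 25], 3, seed=1)
# a length-5 hits vector summing to 3; repeats are possible
```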

Balanced sampling

When auxiliary information is available at the design stage, balanced sampling ensures that Horvitz–Thompson estimates of auxiliary totals are (approximately) exact: \(\sum_{k \in S} x_k / \pi_k \approx \sum_{k \in U} x_k\). This reduces variance for any study variable correlated with the balancing variables.

The balanced_wor() function implements the cube method (Deville and Tillé 2004):

pik <- inclusion_prob(c(10, 20, 5, 40, 25), n = 3)
x <- matrix(c(10, 20, 5, 40, 25))  # auxiliary variable to balance on
s_bal <- balanced_wor(pik, aux = x)
s_bal
#> Unequal prob WOR [cube] (n=3, N=5): 2 4 5

The cube method proceeds in two phases: a flight phase that moves probabilities toward 0 or 1 while maintaining all balancing constraints, and a landing phase that resolves the remaining undecided units by progressively relaxing constraints (starting from the last column of aux). The sample size constraint is always placed first and never relaxed.

Users should order auxiliary variables by importance (most important first) in the aux matrix.
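The balancing property can be verified by hand for this example. With pik = (1/3, 2/3, 1/6, 1, 5/6) and the sample {2, 4, 5} shown above, the HT estimate of the auxiliary total is 20/(2/3) + 40/1 + 25/(5/6) = 30 + 40 + 30 = 100, exactly the population total of x. A quick numeric check (Python, using only the values from this example):

```python
x = [10, 20, 5, 40, 25]
pik = [1/3, 2/3, 1/6, 1.0, 5/6]   # inclusion probabilities for n = 3
sample = [2, 4, 5]                # 1-based unit indices from the printout

# Horvitz-Thompson estimate of the auxiliary total over the balanced sample:
# 20/(2/3) + 40/1 + 25/(5/6) = 30 + 40 + 30 = 100 = sum(x)
ht_total = sum(x[k - 1] / pik[k - 1] for k in sample)
```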

Stratified balanced sampling

When strata are known, the stratified cube method (Chauvet 2009) preserves within-stratum sample sizes while balancing on auxiliary variables. Exact preservation requires sum(pik) within each stratum to be close to an integer; when this condition is not met, within-stratum and total sample sizes will vary around their targets and the design is treated as random-size (fixed_size = FALSE):

N <- 20
pik_strat <- rep(0.4, N)
x_strat <- matrix(as.double(1:N), ncol = 1)
strata <- rep(1:4, each = 5)

s_strat <- balanced_wor(pik_strat, aux = x_strat, strata = strata)
s_strat
#> Unequal prob WOR [cube] (n=8, N=20): 3 5 6 10 13 14 17 20

# Check per-stratum sample sizes
table(strata[s_strat$sample])
#> 
#> 1 2 3 4 
#> 2 2 2 2

Joint inclusion probabilities for the cube use the high-entropy approximation, since the cube produces a near-maximum-entropy design.

Design queries

Once you have a design object, generics compute the quantities needed for variance estimation.

Joint inclusion probabilities (WOR)

pik <- inclusion_prob(c(10, 20, 5, 40, 25), n = 3)
s <- unequal_prob_wor(pik, method = "cps")

pikl <- joint_inclusion_prob(s)
pikl
#>            [,1]       [,2]       [,3]      [,4]      [,5]
#> [1,] 0.33333333 0.10026570 0.01935766 0.3333333 0.2137089
#> [2,] 0.10026570 0.66666667 0.04704226 0.6666667 0.5193577
#> [3,] 0.01935766 0.04704226 0.16666667 0.1666667 0.1002667
#> [4,] 0.33333333 0.66666667 0.16666667 1.0000000 0.8333333
#> [5,] 0.21370893 0.51935766 0.10026667 0.8333333 0.8333333

The diagonal holds the first-order probabilities \(\pi_i\) and the off-diagonal entries are \(\pi_{ij} = P(i \in S \text{ and } j \in S)\).

Joint expected hits (WR)

eh <- expected_hits(c(10, 20, 5, 40, 25), n = 3)
s_wr <- unequal_prob_wr(eh, method = "chromy")

joint_expected_hits(s_wr)
#>        [,1]   [,2]   [,3]   [,4]   [,5]
#> [1,] 0.2997 0.1077 0.0350 0.3529 0.1038
#> [2,] 0.1077 0.5934 0.0115 0.6341 0.4335
#> [3,] 0.0350 0.0115 0.1520 0.1577 0.0998
#> [4,] 0.3529 0.6341 0.1577 1.6261 0.8553
#> [5,] 0.1038 0.4335 0.0998 0.8553 0.7462

Submatrices for large populations

By default, joint_inclusion_prob() and joint_expected_hits() return the full N x N matrix, which is prohibitive when N > 10,000 (~0.8 GB). In practice, variance estimators only need the submatrix for sampled units. The sampled_only = TRUE argument returns this n x n submatrix directly:

pik_large <- inclusion_prob(runif(500), n = 10)
s_large <- unequal_prob_wor(pik_large, method = "brewer")

# Full 500 x 500 matrix
pikl_full <- joint_inclusion_prob(s_large)
dim(pikl_full)
#> [1] 500 500

# Only the 10 x 10 submatrix for sampled units
pikl_sub <- joint_inclusion_prob(s_large, sampled_only = TRUE)
dim(pikl_sub)
#> [1] 10 10

All methods compute the submatrix directly without allocating the N x N matrix, so arbitrarily large populations are supported. The sampling_cov() generic also accepts sampled_only:

delta_sub <- sampling_cov(s_large, sampled_only = TRUE)
dim(delta_sub)
#> [1] 10 10

Sampling covariance

The sampling_cov() generic computes \(\Delta_{ij} = \pi_{ij} - \pi_i \pi_j\) for WOR designs (or the analogous expression for WR designs):

sampling_cov(s)
#>             [,1]        [,2]        [,3] [,4]        [,5]
#> [1,]  0.22222222 -0.12195653 -0.03619789    0 -0.06406885
#> [2,] -0.12195653  0.22222222 -0.06406885    0 -0.03619789
#> [3,] -0.03619789 -0.06406885  0.13888889    0 -0.03862222
#> [4,]  0.00000000  0.00000000  0.00000000    0  0.00000000
#> [5,] -0.06406885 -0.03619789 -0.03862222    0  0.13888889
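Each entry above is just the joint probability minus the product of marginals, and the row and column for unit 4 are identically zero because \(\pi_4 = 1\) forces \(\pi_{i4} = \pi_i\). A quick numeric check of both facts (Python, using the values printed earlier):

```python
pik = [1/3, 2/3, 1/6, 1.0, 5/6]

pi_12 = 0.10026570                  # joint probability of units 1 and 2, from above
delta_12 = pi_12 - pik[0] * pik[1]  # matches the printed -0.12195653

pi_14 = pik[0]                      # unit 4 is take-all, so pi_{i4} = pi_i
delta_14 = pi_14 - pik[0] * pik[3]  # exactly zero, hence the zero row/column
```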

With weighted = TRUE, it returns the Sen–Yates–Grundy check quantities \(1 - \pi_i \pi_j / \pi_{ij}\) (Sen 1953; Yates and Grundy 1953). For well-behaved WOR designs, these should be non-positive off-diagonal:

sampling_cov(s, weighted = TRUE)
#>            [,1]        [,2]       [,3] [,4]        [,5]
#> [1,]  0.6666667 -1.21633352 -1.8699516    0 -0.29979492
#> [2,] -1.2163335  0.33333333 -1.3619425    0 -0.06969743
#> [3,] -1.8699516 -1.36194247  0.8333333    0 -0.38519497
#> [4,]  0.0000000  0.00000000  0.0000000    0  0.00000000
#> [5,] -0.2997949 -0.06969743 -0.3851950    0  0.16666667

Batch sampling

All five dispatchers accept an nrep parameter for drawing multiple independent samples in one call, which is useful for simulation studies. The result is always a design object (the same class as with nrep = 1), but with $sample holding all replicates:

sim <- equal_prob_wor(N = 50, n = 5, nrep = 4)
sim  # design object
#> Equal prob WOR [srs] (n=5, N=50): 4 replicates
#>   rep 1: 47 50 20 49 42
dim(sim$sample)  # 5 x 4 matrix: each column is one sample
#> [1] 5 4

# Generics still work on batch objects
inclusion_prob(sim)
#>  [1] 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1
#> [20] 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1
#> [39] 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1

For fixed-size designs $sample is a matrix (n x nrep). For random-size designs (Bernoulli, Poisson) it is a list of integer vectors, and realized sample sizes are available via lengths(sim$sample).

Design properties and statistical guarantees

The table below summarises what sondage computes for each method. Understanding these properties is essential for choosing the right variance estimator.

Method            Type       Fixed n   Exact 1st order   Joint probs
SRS WOR           EP WOR     Yes       Yes               Exact
Systematic (EP)   EP WOR     Yes       Yes               Exact (zeros)
Bernoulli         EP WOR     No        Yes               Exact
SRS WR            EP WR      Yes       Yes               Exact
CPS               UP WOR     Yes       Yes               Exact
Brewer            UP WOR     Yes       Yes               HE approx
Systematic PPS    UP WOR     Yes       Yes               Exact (zeros)
Poisson           UP WOR     No        Yes               Exact
SPS               UP WOR     Yes       Approx*           HE approx
Pareto            UP WOR     Yes       Approx*           HE approx
Chromy            UP WR      Yes       Yes               Simulated
Multinomial       UP WR      Yes       Yes               Exact
Cube              Balanced   Yes       Yes               HE approx

HE approx = high-entropy approximation (Brewer and Donadio 2003). Approx* = target probabilities, exact asymptotically. Simulated = Monte Carlo via the nsim parameter.

Notes on specific methods

Equal-probability systematic sampling. In sondage, joint_inclusion_prob() returns exact joint inclusion probabilities computed via C code, using the same algorithm as systematic PPS with uniform \(\pi_k = n/N\). Some \(\pi_{ij}\) are structurally zero (pairs of units that can never co-occur in the same systematic sample), so the SYG estimator is generally inapplicable. Consider replication or successive-differences variance estimators for systematic designs.

SPS and Pareto (order sampling). Both are implemented via Rosén's (1997) order sampling framework. SPS uses the ranking key \(\xi_k = u_k / \pi_k\) (equivalent to Ohlsson's (1998) sequential threshold adjustment); Pareto uses \(\xi_k = [u_k/(1-u_k)] / [\pi_k/(1-\pi_k)]\). The \(n\) units with the smallest \(\xi_k\) are selected. First-order inclusion probabilities are approximately (not exactly) equal to the target \(\pi_k\) for finite populations; the discrepancy vanishes as \(N\) grows. Both support permanent random numbers (PRN) for sample coordination.
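The order-sampling selection step itself is short; a sketch of the Pareto variant under the definition above (Python for illustration only — reusing the same \(u_k\) across survey occasions is what PRN coordination amounts to):

```python
import random

def pareto_order_sample(pik, n, seed=None):
    """Pareto order sampling: rank units by xi_k = [u/(1-u)] / [pi/(1-pi)]
    and keep the n smallest keys. Take-all units (pi = 1) get key 0 and
    are therefore always selected."""
    rng = random.Random(seed)
    keys = []
    for k, p in enumerate(pik, start=1):
        if p >= 1.0:
            keys.append((0.0, k))                       # take-all unit
        else:
            u = rng.random()
            keys.append(((u / (1 - u)) / (p / (1 - p)), k))
    return sorted(k for _, k in sorted(keys)[:n])
```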

Systematic PPS. Joint inclusion probabilities are exact (computed via C code), but some \(\pi_{ij}\) are structurally zero: pairs of units that can never co-occur in the same sample. sampling_cov(s, weighted = TRUE) returns NA for those entries and issues a warning, because the SYG estimator is undefined for such pairs. Consider successive-differences, Hartley–Rao, or replication methods instead.

Chromy. Pairwise expectations \(E(n_i n_j)\) are estimated by Monte Carlo simulation in joint_expected_hits() (controlled by nsim, default 10,000).

Cube (balanced sampling). The cube method (Deville and Tillé 2004) produces near-maximum-entropy samples, so joint_inclusion_prob() uses the high-entropy approximation. The stratified variant (Chauvet 2009) preserves exact within-stratum sample sizes. Order auxiliary variables by importance (most important first) since the landing phase relaxes constraints from last to first.

Choosing a variance strategy

  • SYG with exact \(\pi_{ij}\): CPS and SRS WOR provide exact joint probabilities, so the Sen–Yates–Grundy variance estimator (Sen 1953; Yates and Grundy 1953) is unbiased.
  • SYG with approximate \(\pi_{ij}\): For Brewer, SPS, Pareto, and cube, joint probabilities use the high-entropy approximation (Brewer and Donadio 2003, eq. 18). This guarantees symmetry, non-negativity, and \(\pi_{ij} \le \min(\pi_i, \pi_j)\), but does not satisfy the fixed-size marginal identity \(\sum_{j \neq i}\pi_{ij} = (n-1)\pi_i\) exactly. The marginal defect is typically small but can be non-trivial for highly skewed \(\pi_k\) vectors when \(n\) is small (e.g., \(n = 2\)); a warning is issued when it exceeds 5% of \(n\). Use method = "cps" when exact second-order structure is required.
  • Independent designs (Poisson, Bernoulli): The covariance \(\Delta_{ij} = 0\) for \(i \neq j\), so the variance simplifies to \(\text{Var}(\hat{Y}_{HT}) = \sum_i (1 - \pi_i)\, y_i^2 / \pi_i\).
  • Systematic designs with \(\pi_{ij} = 0\): The SYG estimator is undefined. Use successive-differences, Hartley–Rao approximations, or replication methods.
  • With-replacement designs (Chromy, multinomial): Use the Hansen and Hurwitz (1943) estimator, or the generalised Horvitz–Thompson framework of Chromy (2009) which unifies WOR and WR variance estimation via sampling_cov().
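For the first two bullets, the estimator itself is short: \(\hat{V}_{SYG} = \tfrac{1}{2} \sum_{i \neq j \in S} \frac{\pi_i \pi_j - \pi_{ij}}{\pi_{ij}} \left(\frac{y_i}{\pi_i} - \frac{y_j}{\pi_j}\right)^2\). A sketch (Python for illustration), sanity-checked against SRS WOR, where exact joint probabilities are known and SYG reduces to the familiar \(N^2 (1 - n/N) s^2 / n\):

```python
def syg_variance(y, pik, pikl):
    """Sen-Yates-Grundy variance estimate from sample values y, first-order
    probabilities pik, and joint probabilities pikl (all sample-indexed)."""
    n = len(y)
    v = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                w = (pik[i] * pik[j] - pikl[i][j]) / pikl[i][j]
                v += 0.5 * w * (y[i] / pik[i] - y[j] / pik[j]) ** 2
    return v

# SRS WOR: pi_i = n/N, pi_ij = n(n-1) / (N(N-1))
N, n = 10, 4
y = [3.0, 7.0, 1.0, 5.0]
pik = [n / N] * n
pij = n * (n - 1) / (N * (N - 1))
pikl = [[n / N if i == j else pij for j in range(n)] for i in range(n)]
v = syg_variance(y, pik, pikl)
# v equals the textbook SRS estimate N^2 * (1 - n/N) * s^2 / n
```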

References

Brewer, K. R. W., and M. E. Donadio. 2003. “The High Entropy Variance of the Horvitz–Thompson Estimator.” Survey Methodology 29 (2): 189–96.
Chauvet, Guillaume. 2009. “Stratified Balanced Sampling.” Survey Methodology 35: 115–19.
Chen, Shaw-Hwa, Arthur P. Dempster, and Jun S. Liu. 1994. “Weighted Finite Population Sampling to Maximize Entropy.” Biometrika 81 (3): 457–69. https://doi.org/10.1093/biomet/81.3.457.
Chromy, James R. 1979. “Sequential Sample Selection Methods.” In Proceedings of the Survey Research Methods Section, American Statistical Association, 401–6.
———. 2009. “Some Generalizations of the Horvitz–Thompson Estimator.” In Proceedings of the Survey Research Methods Section, American Statistical Association.
Deville, Jean-Claude, and Yves Tillé. 2004. “Efficient Balanced Sampling: The Cube Method.” Biometrika 91 (4): 893–912. https://doi.org/10.1093/biomet/91.4.893.
Hansen, Morris H., and William N. Hurwitz. 1943. “On the Theory of Sampling from Finite Populations.” The Annals of Mathematical Statistics 14 (4): 333–62. https://doi.org/10.1214/aoms/1177731356.
Horvitz, D. G., and D. J. Thompson. 1952. “A Generalization of Sampling Without Replacement from a Finite Universe.” Journal of the American Statistical Association 47 (260): 663–85. https://doi.org/10.1080/01621459.1952.10483446.
Ohlsson, Esbjörn. 1998. “Sequential Poisson Sampling.” Journal of Official Statistics 14 (2): 149–62.
Rosén, Bengt. 1997. “On Sampling with Probability Proportional to Size.” Journal of Statistical Planning and Inference 62 (2): 159–91. https://doi.org/10.1016/S0378-3758(96)00186-3.
Sen, A. R. 1953. “On the Estimate of the Variance in Sampling with Varying Probabilities.” Journal of the Indian Society of Agricultural Statistics 5: 119–27.
Yates, F., and P. M. Grundy. 1953. “Selection Without Replacement from Within Strata with Probability Proportional to Size.” Journal of the Royal Statistical Society, Series B 15 (2): 253–61. https://doi.org/10.1111/j.2517-6161.1953.tb00140.x.