Overview
sondage provides fast implementations of survey sampling
algorithms for finite populations. Every sampling function returns a
design object that carries the selected sample alongside the
design metadata (inclusion probabilities or expected hits, sample size,
method). A set of generics then operates on these objects to compute
quantities needed for variance estimation.
The five entry-point functions are organised by two dimensions, plus a dedicated dispatcher for balanced sampling:
| | Without replacement | With replacement |
|---|---|---|
| Equal prob. | equal_prob_wor() | equal_prob_wr() |
| Unequal prob. | unequal_prob_wor() | unequal_prob_wr() |
| Balanced | balanced_wor() | |
Equal probability sampling
Without replacement
The simplest case: draw n units from a population of
size N with equal probabilities.
library(sondage)
s <- equal_prob_wor(N = 50, n = 5)
s
#> Equal prob WOR [srs] (n=5, N=50): 4 39 1 34 23

The returned object stores the sampled unit indices and the inclusion probabilities:
s$sample
#> [1] 4 39 1 34 23
s$pik[1:10] # all equal to n/N = 0.1
#> [1] 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1

Alternative methods are available. Systematic sampling provides implicit stratification based on the ordering of units:
s_sys <- equal_prob_wor(N = 50, n = 5, method = "systematic")
s_sys
#> Equal prob WOR [systematic] (n=5, N=50): 7 17 27 37 47

Bernoulli sampling selects each unit independently with probability
p = n/N, producing a random sample size. The parameter
n is the expected sample size, not the realized sample
size.
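Conceptually, Bernoulli sampling is a sequence of independent coin flips; the following base-R sketch (illustrative, not sondage's actual code path) makes the random sample size visible:

```r
# Each of the N units enters the sample independently with
# probability p = n/N, so the realized size is Binomial(N, p).
N <- 50
n <- 5
set.seed(123)
s <- which(runif(N) < n / N)
length(s)  # random; equals n = 5 only in expectation
```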
s_ber <- equal_prob_wor(N = 50, n = 5, method = "bernoulli")
s_ber
#> Equal prob WOR [bernoulli] (n=5, N=50): 10 13
length(s_ber$sample) # may differ from 5
#> [1] 2

With replacement
s_wr <- equal_prob_wr(N = 50, n = 5)
s_wr
#> Equal prob WR [srs] (n=5, N=50): 39 42 6 24 32

With-replacement designs track hits (how many times each unit was selected) instead of inclusion probabilities:
s_wr$hits # length-N vector of selection counts
#> [1] 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0
#> [39] 1 0 0 1 0 0 0 0 0 0 0 0

Unequal probability sampling
When units have different sizes (employees, revenue, area, …), probability-proportional-to-size (PPS) designs give larger units higher selection probabilities, which reduces the variance of Horvitz–Thompson estimators (Horvitz and Thompson 1952).
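The Horvitz–Thompson total estimator that these designs feed is a one-liner; a minimal base-R sketch (ht_total is a hypothetical helper for illustration, not a sondage function):

```r
# Horvitz-Thompson estimator of a population total: weight each
# sampled value by the inverse of its inclusion probability.
ht_total <- function(y_s, pik_s) sum(y_s / pik_s)

y <- c(3, 7, 2, 9, 5)        # study variable for N = 5 units
s <- c(1, 3, 4)              # sampled unit indices (n = 3)
ht_total(y[s], rep(3/5, 3))  # under SRS this equals N * mean(y[s])
```

Giving larger units a higher π shrinks their weights y_k/π_k, which is what stabilises this estimator under PPS designs.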
Computing design quantities
Start from a vector of positive size measures and compute inclusion probabilities (WOR) or expected hits (WR):
x <- c(10, 20, 5, 40, 25)
n <- 3
# Without replacement: inclusion probabilities (capped at 1)
pik <- inclusion_prob(x, n)
pik
#> [1] 0.3333333 0.6666667 0.1666667 1.0000000 0.8333333
sum(pik) # equals n
#> [1] 3
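The capping can be sketched as an iterative procedure: any unit whose proportional allocation reaches 1 becomes a certainty, and the remaining sample size is re-spread over the other units (illustrative base R; sondage's internal algorithm may differ):

```r
inclusion_prob_sketch <- function(x, n) {
  pik <- numeric(length(x))
  certain <- rep(FALSE, length(x))
  repeat {
    rest <- which(!certain)
    p <- (n - sum(certain)) * x[rest] / sum(x[rest])
    if (!any(p >= 1)) {       # no new certainties: done
      pik[rest] <- p
      break
    }
    hit <- rest[p >= 1]       # cap these units at 1 and iterate
    certain[hit] <- TRUE
    pik[hit] <- 1
  }
  pik
}
inclusion_prob_sketch(c(10, 20, 5, 40, 25), 3)
#> [1] 0.3333333 0.6666667 0.1666667 1.0000000 0.8333333
```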
# With replacement: expected hits (can exceed 1)
eh <- expected_hits(x, n)
eh
#> [1] 0.30 0.60 0.15 1.20 0.75
sum(eh) # equals n
#> [1] 3

Without replacement
Pass the inclusion probability vector to
unequal_prob_wor(). The default method is Conditional
Poisson Sampling (CPS, Chen, Dempster, and Liu
1994), which has maximum entropy among fixed-size designs:
s <- unequal_prob_wor(pik, method = "cps")
s
#> Unequal prob WOR [cps] (n=3, N=5): 4 1 5
s$sample
#> [1] 4 1 5
inclusion_prob(s) # extract pik from the design
#> [1] 0.3333333 0.6666667 0.1666667 1.0000000 0.8333333

Other fixed-size methods include "brewer" (draw-by-draw,
order-invariant), "systematic" (O(N), implicit
stratification), "sps" (Sequential
Poisson sampling, Ohlsson 1998, supports PRN), and
"pareto" (Pareto sampling, Rosén
1997, supports PRN). The "poisson" method produces a
random sample size.
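Poisson sampling itself is simple enough to sketch directly in base R (illustrative only, not sondage's code path): each unit is an independent coin flip with its own probability, so certainties (π = 1) always appear and the sample size is random.

```r
pik <- c(1/3, 2/3, 1/6, 1, 5/6)  # the inclusion probabilities from above
set.seed(42)
s_pois <- which(runif(length(pik)) < pik)
s_pois   # unit 4 (pik = 1) is always selected; the size varies
```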
With replacement
Pass the expected hits vector to unequal_prob_wr().
The method of Chromy (1979) is the default:
s_wr <- unequal_prob_wr(eh, method = "chromy")
s_wr
#> Unequal prob WR [chromy] (n=3, N=5): 2 4 5
s_wr$hits # realized selection counts
#> [1] 0 1 0 1 1

The "multinomial" method draws each selection
independently.
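A multinomial WR draw maps directly onto base R's sample() with replace = TRUE (a sketch of the idea, not sondage's internals):

```r
x <- c(10, 20, 5, 40, 25)  # size measures
n <- 3
set.seed(42)
draws <- sample(length(x), n, replace = TRUE, prob = x)  # prob is normalised
hits  <- tabulate(draws, nbins = length(x))
sum(hits)  # always n: the number of draws is fixed
#> [1] 3
```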
Balanced sampling
When auxiliary information is available at the design stage, balanced sampling ensures that Horvitz–Thompson estimates of auxiliary totals are (approximately) exact: \(\sum_{k \in S} x_k / \pi_k \approx \sum_{k \in U} x_k\). This reduces variance for any study variable correlated with the balancing variables.
The balanced_wor() function implements the cube method
(Deville and Tillé 2004):
pik <- inclusion_prob(c(10, 20, 5, 40, 25), n = 3)
x <- matrix(c(10, 20, 5, 40, 25)) # auxiliary variable to balance on
s_bal <- balanced_wor(pik, aux = x)
s_bal
#> Unequal prob WOR [cube] (n=3, N=5): 2 4 5

The cube method proceeds in two phases: a flight phase that
moves probabilities toward 0 or 1 while maintaining all balancing
constraints, and a landing phase that resolves the remaining
undecided units by progressively relaxing constraints (starting from the
last column of aux). The sample size constraint is always
placed first and never relaxed.
Users should order auxiliary variables by importance (most important
first) in the aux matrix.
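The balancing identity can be checked by hand for the design above (values reproduced from the earlier chunks): unit 4 is a certainty and every other unit has x_k / π_k = 30, so any fixed-size-3 cube sample hits the auxiliary total exactly.

```r
x   <- c(10, 20, 5, 40, 25)
pik <- c(1/3, 2/3, 1/6, 1, 5/6)  # inclusion_prob(x, n = 3)
s   <- c(2, 4, 5)                # the cube sample shown above
sum(x[s] / pik[s])               # HT estimate of the auxiliary total
#> [1] 100
sum(x)                           # population total: balanced exactly
#> [1] 100
```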
Stratified balanced sampling
When strata are known, the stratified cube method (Chauvet 2009) preserves within-stratum sample
sizes while balancing on auxiliary variables. Exact preservation
requires sum(pik) within each stratum to be close to an
integer; when this condition is not met, within-stratum and total sample
sizes will vary around their targets and the design is treated as
random-size (fixed_size = FALSE):
N <- 20
pik_strat <- rep(0.4, N)
x_strat <- matrix(as.double(1:N), ncol = 1)
strata <- rep(1:4, each = 5)
s_strat <- balanced_wor(pik_strat, aux = x_strat, strata = strata)
s_strat
#> Unequal prob WOR [cube] (n=8, N=20): 3 5 6 10 13 14 17 20
# Check per-stratum sample sizes
table(strata[s_strat$sample])
#>
#> 1 2 3 4
#> 2 2 2 2

Joint inclusion probabilities for the cube use the high-entropy approximation, since the cube produces a near-maximum-entropy design.
Design queries
Once you have a design object, generics compute the quantities needed for variance estimation.
Joint inclusion probabilities (WOR)
pik <- inclusion_prob(c(10, 20, 5, 40, 25), n = 3)
s <- unequal_prob_wor(pik, method = "cps")
pikl <- joint_inclusion_prob(s)
pikl
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] 0.33333333 0.10026570 0.01935766 0.3333333 0.2137089
#> [2,] 0.10026570 0.66666667 0.04704226 0.6666667 0.5193577
#> [3,] 0.01935766 0.04704226 0.16666667 0.1666667 0.1002667
#> [4,] 0.33333333 0.66666667 0.16666667 1.0000000 0.8333333
#> [5,] 0.21370893 0.51935766 0.10026667 0.8333333 0.8333333

The diagonal holds the first-order probabilities \(\pi_i\) and the off-diagonal entries are \(\pi_{ij} = P(i \in S \text{ and } j \in S)\).
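These matrices are exactly what the Sen–Yates–Grundy variance estimator consumes. A base-R sketch of that estimator (syg_var is a hypothetical helper, not a sondage function), with a sanity check against the closed-form SRS variance:

```r
# SYG estimator of Var(HT total) for a fixed-size WOR design:
#   0.5 * sum over i != j in S of
#   (pi_i*pi_j - pi_ij)/pi_ij * (y_i/pi_i - y_j/pi_j)^2
syg_var <- function(y_s, pik_s, pikl_s) {
  w <- outer(pik_s, pik_s) / pikl_s - 1   # (pi_i pi_j - pi_ij) / pi_ij
  d <- outer(y_s / pik_s, y_s / pik_s, "-")
  diag(w) <- 0                            # exclude i == j terms
  0.5 * sum(w * d^2)
}

# SRS with N = 5, n = 3: pi_i = 0.6, pi_ij = 3*2/(5*4) = 0.3
y_s  <- c(3, 2, 9)
pikl <- matrix(0.3, 3, 3); diag(pikl) <- 0.6
syg_var(y_s, rep(0.6, 3), pikl)  # equals (1 - n/N) * N^2 * var(y_s) / n
#> [1] 47.77778
```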
Joint expected hits (WR)
eh <- expected_hits(c(10, 20, 5, 40, 25), n = 3)
s_wr <- unequal_prob_wr(eh, method = "chromy")
joint_expected_hits(s_wr)
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] 0.2997 0.1077 0.0350 0.3529 0.1038
#> [2,] 0.1077 0.5934 0.0115 0.6341 0.4335
#> [3,] 0.0350 0.0115 0.1520 0.1577 0.0998
#> [4,] 0.3529 0.6341 0.1577 1.6261 0.8553
#> [5,] 0.1038 0.4335 0.0998 0.8553 0.7462

Submatrices for large populations
By default, joint_inclusion_prob() and
joint_expected_hits() return the full N x N matrix, which
is prohibitive when N > 10,000 (~0.8 GB). In practice, variance
estimators only need the submatrix for sampled units. The
sampled_only = TRUE argument returns this n x n submatrix
directly:
pik_large <- inclusion_prob(runif(500), n = 10)
s_large <- unequal_prob_wor(pik_large, method = "brewer")
# Full 500 x 500 matrix
pikl_full <- joint_inclusion_prob(s_large)
dim(pikl_full)
#> [1] 500 500
# Only the 10 x 10 submatrix for sampled units
pikl_sub <- joint_inclusion_prob(s_large, sampled_only = TRUE)
dim(pikl_sub)
#> [1] 10 10

All methods compute the submatrix directly without allocating the N x
N matrix, so arbitrarily large populations are supported. The
sampling_cov() generic also accepts
sampled_only:
delta_sub <- sampling_cov(s_large, sampled_only = TRUE)
dim(delta_sub)
#> [1] 10 10

Sampling covariance
The sampling_cov() generic computes \(\Delta_{ij} = \pi_{ij} - \pi_i \pi_j\) for
WOR designs (or the analogous expression for WR designs):
sampling_cov(s)
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] 0.22222222 -0.12195653 -0.03619789 0 -0.06406885
#> [2,] -0.12195653 0.22222222 -0.06406885 0 -0.03619789
#> [3,] -0.03619789 -0.06406885 0.13888889 0 -0.03862222
#> [4,] 0.00000000 0.00000000 0.00000000 0 0.00000000
#> [5,] -0.06406885 -0.03619789 -0.03862222 0 0.13888889

With weighted = TRUE, it returns the Sen–Yates–Grundy
check quantities \(1 - \pi_i \pi_j /
\pi_{ij}\) (Sen 1953; Yates and Grundy
1953). For well-behaved WOR designs, these should be non-positive
off-diagonal:
sampling_cov(s, weighted = TRUE)
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] 0.6666667 -1.21633352 -1.8699516 0 -0.29979492
#> [2,] -1.2163335 0.33333333 -1.3619425 0 -0.06969743
#> [3,] -1.8699516 -1.36194247 0.8333333 0 -0.38519497
#> [4,] 0.0000000 0.00000000 0.0000000 0 0.00000000
#> [5,] -0.2997949 -0.06969743 -0.3851950 0 0.16666667

Batch sampling
All five dispatchers accept an nrep parameter for
drawing multiple independent samples in one call, useful for simulation
studies. The result is always a design object (the same class as
nrep = 1) but with $sample holding all
replicates:
sim <- equal_prob_wor(N = 50, n = 5, nrep = 4)
sim # design object
#> Equal prob WOR [srs] (n=5, N=50): 4 replicates
#> rep 1: 47 50 20 49 42
dim(sim$sample) # 5 x 4 matrix: each column is one sample
#> [1] 5 4
# Generics still work on batch objects
inclusion_prob(sim)
#> [1] 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1
#> [20] 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1
#> [39] 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1

For fixed-size designs $sample is a matrix
(n x nrep). For random-size designs (Bernoulli, Poisson) it
is a list of integer vectors, and realized sample sizes are available
via lengths(sim$sample).
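A typical simulation loop then just maps over the replicates. Sketched here with a plain base-R stand-in for the batch matrix, since the pattern is the same for any fixed-size design (replicate(4, sample(50, 5)) plays the role of sim$sample):

```r
set.seed(1)
sim_sample <- replicate(4, sample(50, 5))  # 5 x 4 stand-in for sim$sample
y <- runif(50)                             # a study variable
# one Horvitz-Thompson total per replicate (pik = n/N = 0.1):
est <- apply(sim_sample, 2, function(s) sum(y[s] / 0.1))
length(est)
#> [1] 4
```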
Design properties and statistical guarantees
The table below summarises what sondage computes for
each method. Understanding these properties is essential for choosing
the right variance estimator.
| Method | Type | Fixed n | Exact 1st order | Joint probs |
|---|---|---|---|---|
| SRS WOR | EP WOR | Yes | Yes | Exact |
| Systematic (EP) | EP WOR | Yes | Yes | Exact (zeros) |
| Bernoulli | EP WOR | No | Yes | Exact |
| SRS WR | EP WR | Yes | Yes | Exact |
| CPS | UP WOR | Yes | Yes | Exact |
| Brewer | UP WOR | Yes | Yes | HE approx |
| Systematic PPS | UP WOR | Yes | Yes | Exact (zeros) |
| Poisson | UP WOR | No | Yes | Exact |
| SPS | UP WOR | Yes | Approx* | HE approx |
| Pareto | UP WOR | Yes | Approx* | HE approx |
| Chromy | UP WR | Yes | Yes | Simulated |
| Multinomial | UP WR | Yes | Yes | Exact |
| Cube | Balanced | Yes | Yes | HE approx |
Notes on specific methods
Equal-probability systematic sampling. In
sondage, joint_inclusion_prob() returns exact
joint inclusion probabilities computed via C code, using the same
algorithm as systematic PPS with uniform \(\pi_k = n/N\). Some \(\pi_{ij}\) are structurally zero (pairs of
units that can never co-occur in the same systematic sample), so the SYG
estimator is generally inapplicable. Consider replication or
successive-differences variance estimators for systematic designs.
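One common successive-differences estimator for the HT total under equal-probability systematic sampling is \(\hat{V} = (1 - n/N)\, \frac{N^2}{n}\, \frac{1}{2(n-1)} \sum_{i=2}^{n} (y_i - y_{i-1})^2\), with the y values taken in sampled order. A base-R sketch (var_sd is a hypothetical helper, not part of sondage):

```r
# Successive-differences variance estimator for the HT total under
# equal-probability systematic sampling (y_s in sampled order):
var_sd <- function(y_s, N) {
  n <- length(y_s)
  (1 - n / N) * N^2 / n * sum(diff(y_s)^2) / (2 * (n - 1))
}
var_sd(c(1, 2, 3, 4, 5), N = 50)
#> [1] 225
```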
SPS and Pareto (order sampling). Both are implemented via Rosén (1997)’s order sampling framework. SPS uses ranking key \(\xi_k = u_k / \pi_k\) (equivalent to Ohlsson 1998’s sequential threshold adjustment); Pareto uses \(\xi_k = [u_k/(1-u_k)] / [\pi_k/(1-\pi_k)]\). The \(n\) units with the smallest \(\xi_k\) are selected. First-order inclusion probabilities are approximately (not exactly) equal to the target \(\pi_k\) for finite populations; the discrepancy vanishes as \(N\) grows. Both support permanent random numbers (PRN) for sample coordination.
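The ranking-key mechanics are easy to sketch in base R (illustrative only; sondage handles certainties, ties, and PRN bookkeeping internally):

```r
pik <- c(0.2, 0.4, 0.1, 0.8, 0.5)  # target probabilities, sum = n = 2
n <- 2
set.seed(7)
u <- runif(length(pik))            # (permanent) random numbers
key_sps    <- u / pik                              # SPS ranking key
key_pareto <- (u / (1 - u)) / (pik / (1 - pik))    # Pareto ranking key
s_sps    <- sort(order(key_sps)[1:n])    # n smallest keys are selected
s_pareto <- sort(order(key_pareto)[1:n])
```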
Systematic PPS. Joint inclusion probabilities are
exact (computed via C code), but some \(\pi_{ij}\) are structurally zero, e.g., for
pairs of units that can never co-occur.
sampling_cov(s, weighted = TRUE) returns NA
for those entries and issues a warning, because the SYG estimator is
undefined for such pairs. Consider successive-differences, Hartley–Rao,
or replication methods instead.
Chromy. Pairwise expectations \(E(n_i n_j)\) are estimated by Monte Carlo
simulation in joint_expected_hits() (controlled by
nsim, default 10 000).
Cube (balanced sampling). The cube method (Deville and Tillé 2004) produces
near-maximum-entropy samples, so joint_inclusion_prob()
uses the high-entropy approximation. The stratified variant (Chauvet 2009) preserves exact within-stratum
sample sizes. Order auxiliary variables by importance (most important
first) since the landing phase relaxes constraints from last to
first.
Choosing a variance strategy
- SYG with exact \(\pi_{ij}\): CPS and SRS WOR provide exact joint probabilities, so the Sen–Yates–Grundy variance estimator (Sen 1953; Yates and Grundy 1953) is unbiased.
- SYG with approximate \(\pi_{ij}\): For Brewer, SPS, Pareto, and cube, joint probabilities use the high-entropy approximation (Brewer and Donadio 2003, eq. 18). This guarantees symmetry, non-negativity, and \(\pi_{ij} \le \min(\pi_i, \pi_j)\), but does not satisfy the fixed-size marginal identity \(\sum_{j \neq i}\pi_{ij} = (n-1)\pi_i\) exactly. The marginal defect is typically small but can be non-trivial for highly skewed \(\pi_k\) vectors when \(n\) is small (e.g., \(n = 2\)); a warning is issued when it exceeds 5% of \(n\). Use method = "cps" when exact second-order structure is required.
- Independent designs (Poisson, Bernoulli): The covariance \(\Delta_{ij} = 0\) for \(i \neq j\), so the variance simplifies to \(\text{Var}(\hat{Y}_{HT}) = \sum_i (1 - \pi_i)\, y_i^2 / \pi_i\).
- Systematic designs with \(\pi_{ij} = 0\): The SYG estimator is undefined. Use successive-differences, Hartley–Rao approximations, or replication methods.
- With-replacement designs (Chromy, multinomial): Use the Hansen and Hurwitz (1943) estimator, or the generalised Horvitz–Thompson framework of Chromy (2009), which unifies WOR and WR variance estimation via sampling_cov().
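Given realized and expected hits, the Hansen–Hurwitz total collapses to \(\hat{Y}_{HH} = \sum_k n_k y_k / E(n_k)\). A base-R sketch (hh_total is a hypothetical helper, not a sondage function), reusing the values from the earlier Chromy example:

```r
# Hansen-Hurwitz: each draw contributes y/p, averaged over the n
# draws; with eh = n * p this collapses to sum(hits * y / eh).
hh_total <- function(y, hits, eh) sum(hits * y / eh)

y    <- c(10, 20, 5, 40, 25)            # study variable (= size measure here)
eh   <- c(0.30, 0.60, 0.15, 1.20, 0.75)
hits <- c(0, 1, 0, 1, 1)                # realized counts from above
hh_total(y, hits, eh)  # exactly sum(y) = 100, since y / eh is constant
#> [1] 100
```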