Overview
sondage ships with 12 built-in sampling methods, but
researchers and agencies often need algorithms that are not included.
The registration API lets you plug a custom unequal probability sampling
method into the existing dispatchers and generics without modifying the
package source.
After registering a method, it works everywhere a built-in method
works: unequal_prob_wor(),
joint_inclusion_prob(), sampling_cov(), batch
mode (nrep), etc.
The registration API
A single call to register_method() adds a new
method:
library(sondage)
register_method(
  name,                   # unique method name (character)
  type = "wor",           # "wor" or "wr"
  sample_fn,              # function(pik, n, prn, ...) -> integer indices
  joint_fn = NULL,        # function(pik, sample_idx, ...) -> matrix (optional)
  fixed_size = TRUE,      # does the method always return exactly n units?
  supports_prn = FALSE    # does the method support permanent random numbers?
)

Two contracts must be respected:

- sample_fn(pik, n = NULL, prn = NULL, ...) receives the inclusion probability vector and returns an integer vector of selected unit indices (1-based).
- joint_fn(pik, sample_idx = NULL, ...) (optional) returns an \(N \times N\) joint inclusion probability matrix when sample_idx is NULL, or a submatrix for the given indices when it is not.
Helper functions registered_methods(),
is_registered_method(), and
unregister_method() manage the registry.
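For example, a quick look at the registry (a sketch; the exact set of names returned depends on the installed version):

```r
library(sondage)

registered_methods()                 # all method names the dispatchers accept

if (!is_registered_method("sampford")) {
  message("sampford is not registered yet")
}

# A custom method can be removed again once it is no longer needed:
# unregister_method("sampford")
```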
Example 1: Sampford’s method with high-entropy joint probabilities
Sampford’s (1967) rejection procedure draws exact \(\pi\)ps samples. The first unit is drawn with probability proportional to \(\pi_k\); the remaining \(n - 1\) units are drawn independently with replacement, each with probability proportional to \(\pi_k / (1 - \pi_k)\). The draw is accepted only when all \(n\) units are distinct.
sampford_sample <- function(pik, n = NULL, prn = NULL, ...) {
  N <- length(pik)
  # The contract allows n = NULL; recover it from the probabilities
  if (is.null(n)) n <- as.integer(round(sum(pik)))
  if (n == 1L) {
    return(sample.int(N, 1L, prob = pik))
  }
  q <- pik / (1 - pik)
  repeat {
    first <- sample.int(N, 1L, prob = pik)
    rest <- sample.int(N, n - 1L, replace = TRUE, prob = q)
    s <- c(first, rest)
    if (anyDuplicated(s) == 0L) {
      return(sort(s))
    }
  }
}

The acceptance rate decreases as \(n / N\) grows, so this implementation is best suited for moderate sampling fractions.
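If you are unsure whether the rejection rate is acceptable for your \(\pi_k\), you can measure it empirically. The sketch below counts draws until acceptance using base R only; sampford_attempts is a hypothetical helper, not part of sondage:

```r
sampford_attempts <- function(pik, n) {
  N <- length(pik)
  q <- pik / (1 - pik)
  attempts <- 0L
  repeat {
    attempts <- attempts + 1L
    s <- c(sample.int(N, 1L, prob = pik),
           sample.int(N, n - 1L, replace = TRUE, prob = q))
    if (anyDuplicated(s) == 0L) return(attempts)
  }
}

pik <- 4 * (2:9) / sum(2:9)  # same probabilities as the example below
mean(replicate(1000, sampford_attempts(pik, n = 4)))
```

An average far above a handful of attempts per accepted sample suggests a non-rejective method would be a better fit.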
Joint probabilities via he_jip()
Sampford’s design is a high-entropy design, so the high-entropy (HE)
approximation of Brewer and Donadio (2003)
applies. Rather than reimplementing the formula, we use
he_jip(), which sondage exports for exactly
this purpose. It calls the same optimised C code used internally for
Brewer, SPS, Pareto, and cube methods, and it already matches the
joint_fn signature expected by
register_method().
For designs closer to conditional Poisson (rejective) sampling,
hajek_jip() provides a lighter alternative based on Hájek (1964).
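Both helpers take the full probability vector, so for a given design they are easy to compare directly (a sketch using the probabilities from the examples below):

```r
library(sondage)

pik <- inclusion_prob(c(2, 3, 4, 5, 6, 7, 8, 9), n = 4)

# Largest disagreement between the two approximations
max(abs(he_jip(pik) - hajek_jip(pik)))
```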
Registering and using Sampford’s method
library(sondage)
register_method(
  "sampford",
  type = "wor",
  sample_fn = sampford_sample,
  joint_fn = he_jip,
  fixed_size = TRUE
)

Once registered, the method is available through the standard dispatcher:
pik <- inclusion_prob(c(2, 3, 4, 5, 6, 7, 8, 9), n = 4)
pik
#> [1] 0.1818182 0.2727273 0.3636364 0.4545455 0.5454545 0.6363636 0.7272727
#> [8] 0.8181818
s <- unequal_prob_wor(pik, method = "sampford")
s
#> Unequal prob WOR [sampford] (n=4, N=8): 2 6 7 8

All generics work transparently:
# Joint inclusion probabilities
pikl <- joint_inclusion_prob(s)
round(pikl, 4)
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
#> [1,] 0.1818 0.0350 0.0480 0.0618 0.0765 0.0925 0.1098 0.1289
#> [2,] 0.0350 0.2727 0.0737 0.0948 0.1174 0.1417 0.1682 0.1972
#> [3,] 0.0480 0.0737 0.3636 0.1296 0.1604 0.1935 0.2294 0.2688
#> [4,] 0.0618 0.0948 0.1296 0.4545 0.2059 0.2482 0.2939 0.3440
#> [5,] 0.0765 0.1174 0.1604 0.2059 0.5455 0.3062 0.3624 0.4237
#> [6,] 0.0925 0.1417 0.1935 0.2482 0.3062 0.6364 0.4355 0.5087
#> [7,] 0.1098 0.1682 0.2294 0.2939 0.3624 0.4355 0.7273 0.5999
#> [8,] 0.1289 0.1972 0.2688 0.3440 0.4237 0.5087 0.5999 0.8182
# Sampling covariance (Sen-Yates-Grundy check quantities)
round(sampling_cov(s, weighted = TRUE), 4)
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
#> [1,] 0.8182 -0.4165 -0.3784 -0.3383 -0.2960 -0.2515 -0.2043 -0.1544
#> [2,] -0.4165 0.7273 -0.3457 -0.3075 -0.2671 -0.2245 -0.1793 -0.1314
#> [3,] -0.3784 -0.3457 0.6364 -0.2750 -0.2366 -0.1959 -0.1528 -0.1070
#> [4,] -0.3383 -0.3075 -0.2750 0.5455 -0.2042 -0.1656 -0.1246 -0.0810
#> [5,] -0.2960 -0.2671 -0.2366 -0.2042 0.4545 -0.1334 -0.0946 -0.0532
#> [6,] -0.2515 -0.2245 -0.1959 -0.1656 -0.1334 0.3636 -0.0626 -0.0236
#> [7,] -0.2043 -0.1793 -0.1528 -0.1246 -0.0946 -0.0626 0.2727 0.0081
#> [8,] -0.1544 -0.1314 -0.1070 -0.0810 -0.0532 -0.0236 0.0081 0.1818
# sampled_only for large populations
joint_inclusion_prob(s, sampled_only = TRUE)
#> [,1] [,2] [,3] [,4]
#> [1,] 0.2727273 0.1417381 0.1681894 0.1972218
#> [2,] 0.1417381 0.6363636 0.4355263 0.5086542
#> [3,] 0.1681894 0.4355263 0.7272727 0.5999281
#> [4,] 0.1972218 0.5086542 0.5999281 0.8181818

Verifying with simulation
We can check that the empirical inclusion frequencies match the target \(\pi_k\) and that the empirical pairwise frequencies match the HE approximation.
sim <- unequal_prob_wor(pik, method = "sampford", nrep = 5000)
dim(sim$sample) # n x nrep
#> [1] 4 5000
# First-order: empirical vs target
freq <- tabulate(sim$sample, nbins = length(pik)) / 5000
cbind(target = pik, empirical = freq)
#> target empirical
#> [1,] 0.1818182 0.1826
#> [2,] 0.2727273 0.2724
#> [3,] 0.3636364 0.3634
#> [4,] 0.4545455 0.4528
#> [5,] 0.5454545 0.5464
#> [6,] 0.6363636 0.6278
#> [7,] 0.7272727 0.7294
#> [8,] 0.8181818 0.8252
# Second-order: empirical vs HE approximation
N <- length(pik)
co_occur <- matrix(0, N, N)
for (j in seq_len(5000)) {
  idx <- sim$sample[, j]
  co_occur[idx, idx] <- co_occur[idx, idx] + 1
}
diag(co_occur) <- 0
empirical_jip <- co_occur / 5000
diag(empirical_jip) <- freq
he_pikl <- he_jip(pik)
# Compare a few off-diagonal pairs
ij <- cbind(i = c(1, 2, 3, 5), j = c(8, 7, 6, 8))
pairs <- data.frame(
  ij,
  HE = round(he_pikl[ij], 4),
  empirical = round(empirical_jip[ij], 4)
)
pairs
#> i j HE empirical
#> 1 1 8 0.1289 0.1382
#> 2 2 7 0.1682 0.1724
#> 3 3 6 0.1935 0.1834
#> 4 5 8 0.4237 0.4240

The HE approximation is close because Sampford’s design is itself a high-entropy design.
Example 2: Wrapping an external package (sampling::UPtille)
When an algorithm is already implemented in another package, the
wrapper is minimal. Here we wrap UPtille and
UPtillepi2 from the sampling package (Tillé 2006).
tille_sample <- function(pik, n = NULL, prn = NULL, ...) {
  which(as.logical(sampling::UPtille(pik)))
}

tille_joint <- function(pik, sample_idx = NULL, ...) {
  pikl <- sampling::UPtillepi2(pik)
  if (!is.null(sample_idx)) {
    pikl <- pikl[sample_idx, sample_idx, drop = FALSE]
  }
  pikl
}

register_method(
  "tille",
  type = "wor",
  sample_fn = tille_sample,
  joint_fn = tille_joint,
  fixed_size = TRUE
)
pik <- inclusion_prob(c(2, 3, 4, 5, 6, 7, 8, 9), n = 4)
s <- unequal_prob_wor(pik, method = "tille")
s
#> Unequal prob WOR [tille] (n=4, N=8): 3 5 6 7
# Exact joint inclusion probabilities from UPtillepi2
round(joint_inclusion_prob(s), 4)
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
#> [1,] 0.1818 0.0120 0.0361 0.0585 0.0791 0.0996 0.1199 0.1403
#> [2,] 0.0120 0.2727 0.0602 0.0877 0.1187 0.1494 0.1798 0.2104
#> [3,] 0.0361 0.0602 0.3636 0.1169 0.1582 0.1992 0.2397 0.2805
#> [4,] 0.0585 0.0877 0.1169 0.4545 0.2012 0.2490 0.2997 0.3506
#> [5,] 0.0791 0.1187 0.1582 0.2012 0.5455 0.2988 0.3596 0.4208
#> [6,] 0.0996 0.1494 0.1992 0.2490 0.2988 0.6364 0.4221 0.4909
#> [7,] 0.1199 0.1798 0.2397 0.2997 0.3596 0.4221 0.7273 0.5610
#> [8,] 0.1403 0.2104 0.2805 0.3506 0.4208 0.4909 0.5610 0.8182
# Full variance estimation chain
round(sampling_cov(s), 4)
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
#> [1,] 0.1488 -0.0375 -0.0300 -0.0242 -0.0201 -0.0161 -0.0124 -0.0085
#> [2,] -0.0375 0.1983 -0.0390 -0.0363 -0.0301 -0.0241 -0.0185 -0.0128
#> [3,] -0.0300 -0.0390 0.2314 -0.0484 -0.0401 -0.0322 -0.0247 -0.0170
#> [4,] -0.0242 -0.0363 -0.0484 0.2479 -0.0467 -0.0402 -0.0309 -0.0213
#> [5,] -0.0201 -0.0301 -0.0401 -0.0467 0.2479 -0.0483 -0.0371 -0.0255
#> [6,] -0.0161 -0.0241 -0.0322 -0.0402 -0.0483 0.2314 -0.0407 -0.0298
#> [7,] -0.0124 -0.0185 -0.0247 -0.0309 -0.0371 -0.0407 0.1983 -0.0340
#> [8,] -0.0085 -0.0128 -0.0170 -0.0213 -0.0255 -0.0298 -0.0340 0.1488

Writing a joint_fn
A joint_fn is optional but enables
joint_inclusion_prob(), joint_expected_hits(),
and sampling_cov(). Five common strategies:

- Exact formula. When the design has a known closed form (e.g., UPtillepi2 above).
- he_jip(). The high-entropy approximation (Brewer and Donadio 2003). Best for maximum-entropy and high-entropy designs (Sampford, Tillé, and most \(\pi\)ps procedures). Uses optimised C code internally. Pass it directly: joint_fn = he_jip.
- hajek_jip(). The Hájek (1964) approximation based on conditional Poisson (rejective) sampling theory. A simpler formula and slightly cheaper, but generally a bit less accurate than he_jip(). Best when the design is obtained by conditioning independent Poisson trials on the sample size. Pass it directly: joint_fn = hajek_jip.
- Other approximations. Other methods exist in the literature and can be implemented as custom joint_fn functions. Tillé (1996) reviews several alternatives, including approximations based on Poisson, rejective, and successive sampling theory, each with different accuracy–computation trade-offs. Any function that accepts (pik, sample_idx = NULL, ...) and returns a symmetric matrix is a valid joint_fn.
- Monte Carlo estimation. Resample \(B\) times and estimate \(\hat{\pi}_{ij} = B^{-1} \sum_{b=1}^B I(i \in S_b)\, I(j \in S_b)\). Slow but universal:
mc_joint <- function(pik, sample_idx = NULL, ..., B = 5000) {
  N <- length(pik)
  n <- as.integer(round(sum(pik)))
  co <- matrix(0, N, N)
  for (b in seq_len(B)) {
    s <- my_sampler(pik, n = n)  # the method's sample_fn
    # off-diagonal cells count pairwise co-occurrences;
    # diagonal cells count single inclusions from the same draws
    co[s, s] <- co[s, s] + 1
  }
  pikl <- co / B
  if (!is.null(sample_idx)) {
    pikl <- pikl[sample_idx, sample_idx, drop = FALSE]
  }
  pikl
}

The sample_idx argument enables the
sampled_only = TRUE path in
joint_inclusion_prob(). When sample_idx is
non-NULL, you may either (a) compute the full matrix and subset as
above, or (b) skip rows/columns not in sample_idx for
efficiency.
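As an illustration of option (b), the Hájek (1964) approximation \(\pi_{ij} \approx \pi_i \pi_j \bigl[1 - (1 - \pi_i)(1 - \pi_j)/d\bigr]\) with \(d = \sum_k \pi_k (1 - \pi_k)\) depends on pik only through the scalar \(d\), so the submatrix can be built without ever forming the full \(N \times N\) matrix. A sketch (hajek_sub_jip is illustrative, not part of sondage):

```r
hajek_sub_jip <- function(pik, sample_idx = NULL, ...) {
  d <- sum(pik * (1 - pik))  # global summary of the full pik vector
  p <- if (is.null(sample_idx)) pik else pik[sample_idx]
  pikl <- outer(p, p) * (1 - outer(1 - p, 1 - p) / d)
  diag(pikl) <- p            # diagonal holds the first-order probabilities
  pikl
}
```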
Session persistence
The registry lives in the package namespace and resets when
sondage is reloaded. To make a registration persistent
across sessions, place the register_method() call in your
.Rprofile or in a project-level setup script.
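Because .Rprofile usually runs before sondage is attached, the call is best wrapped in an onLoad hook via base R's setHook(); the sketch below assumes sampford_sample is defined (or sourced) in the same file:

```r
# In .Rprofile or a project-level setup script
setHook(packageEvent("sondage", "onLoad"), function(...) {
  sondage::register_method(
    "sampford",
    type = "wor",
    sample_fn = sampford_sample,
    joint_fn = sondage::he_jip,
    fixed_size = TRUE
  )
})
```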