Overview

sondage ships with 12 built-in sampling methods, but researchers and agencies often need algorithms that are not included. The registration API lets you plug a custom unequal probability sampling method into the existing dispatchers and generics without modifying the package source.

After registering a method, it works everywhere a built-in method works: unequal_prob_wor(), joint_inclusion_prob(), sampling_cov(), batch mode (nrep), etc.

The registration API

A single call to register_method() adds a new method:

library(sondage)
register_method(
  name,                   # unique method name (character)
  type = "wor",           # "wor" or "wr"
  sample_fn,              # function(pik, n, prn, ...) -> integer indices
  joint_fn = NULL,        # function(pik, sample_idx, ...) -> matrix (optional)
  fixed_size = TRUE,      # does the method always return exactly n units?
  supports_prn = FALSE    # does the method support permanent random numbers?
)

Two contracts must be respected:

  • sample_fn(pik, n = NULL, prn = NULL, ...) receives the inclusion probability vector and returns an integer vector of selected unit indices (1-based).
  • joint_fn(pik, sample_idx = NULL, ...) (optional) returns an \(N \times N\) joint inclusion probability matrix when sample_idx is NULL, or a submatrix for the given indices when it is not.
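A minimal sample_fn that satisfies this contract can be written in a few lines of base R. The sketch below implements systematic \(\pi\)ps sampling; the function name is illustrative and not part of sondage:

```r
# Illustrative sample_fn: systematic pps sampling in base R.
# Satisfies the register_method() contract: takes (pik, n, prn, ...)
# and returns 1-based indices of the selected units.
sys_pps_sample <- function(pik, n = NULL, prn = NULL, ...) {
  if (is.null(n)) n <- as.integer(round(sum(pik)))
  u <- if (is.null(prn)) runif(1) else prn[1]  # random (or permanent) start in (0, 1)
  points <- u + seq_len(n) - 1                 # n equally spaced points in (0, n)
  findInterval(points, cumsum(pik)) + 1L       # unit whose cumulative interval holds each point
}

sys_pps_sample(rep(0.5, 8), prn = 0.25)
#> [1] 1 3 5 7
```

Because the start point u fully determines the sample, this sampler also supports permanent random numbers via prn, so it could be registered with supports_prn = TRUE.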

Helper functions registered_methods(), is_registered_method(), and unregister_method() manage the registry.

Example 1: Sampford’s method with high-entropy joint probabilities

Sampford’s (1967) rejection procedure draws exact \(\pi\)ps samples. The first unit is drawn with probability proportional to \(\pi_k\); the remaining \(n - 1\) units are drawn independently with replacement, each with probability proportional to \(\pi_k / (1 - \pi_k)\). The draw is accepted only when all \(n\) units are distinct.

sampford_sample <- function(pik, n = NULL, prn = NULL, ...) {
  N <- length(pik)
  if (is.null(n)) n <- as.integer(round(sum(pik)))
  stopifnot(all(pik > 0), all(pik < 1))  # q below is finite only for pik < 1
  if (n == 1L) {
    return(sample.int(N, 1L, prob = pik))
  }
  q <- pik / (1 - pik)
  repeat {
    first <- sample.int(N, 1L, prob = pik)
    rest  <- sample.int(N, n - 1L, replace = TRUE, prob = q)
    s <- c(first, rest)
    if (anyDuplicated(s) == 0L)
      return(sort(s))
  }
}

The acceptance rate decreases as \(n / N\) grows, so this implementation is best suited for moderate sampling fractions.
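To gauge that cost for a given design, the rejection loop can be instrumented to count attempts. A self-contained sketch (duplicating the loop body above; the helper name is illustrative, and pik is rebuilt in base R to match the example below):

```r
# Count rejection-loop iterations until Sampford's procedure accepts a draw.
sampford_attempts <- function(pik, n) {
  N <- length(pik)
  q <- pik / (1 - pik)
  attempts <- 0L
  repeat {
    attempts <- attempts + 1L
    s <- c(sample.int(N, 1L, prob = pik),
           sample.int(N, n - 1L, replace = TRUE, prob = q))
    if (anyDuplicated(s) == 0L) return(attempts)
  }
}

set.seed(1)
pik <- 4 * (2:9) / sum(2:9)  # same probabilities as the example below
mean(replicate(1000, sampford_attempts(pik, 4)))  # mean draws per accepted sample
```

A mean well above 1 signals that a sequential method may be cheaper for that design.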

Joint probabilities via he_jip()

Sampford’s design is a high-entropy design, so the high-entropy (HE) approximation of Brewer and Donadio (2003) applies. Rather than reimplementing the formula, we use he_jip(), which sondage exports for exactly this purpose. It calls the same optimised C code used internally for Brewer, SPS, Pareto, and cube methods, and it already matches the joint_fn signature expected by register_method().

For designs closer to conditional Poisson (rejective) sampling, hajek_jip() provides a lighter alternative based on Hájek (1964).
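The Hájek approximation is simple enough to sketch in base R for reference: \(\pi_{kl} \approx \pi_k \pi_l \{1 - (1 - \pi_k)(1 - \pi_l)/d\}\) with \(d = \sum_k \pi_k (1 - \pi_k)\). The sketch below is purely illustrative (the real hajek_jip() uses compiled code and should be preferred):

```r
# Base-R sketch of the Hajek (1964) joint-probability approximation,
# matching the (pik, sample_idx, ...) joint_fn signature.
hajek_jip_sketch <- function(pik, sample_idx = NULL, ...) {
  d <- sum(pik * (1 - pik))
  pikl <- outer(pik, pik) * (1 - outer(1 - pik, 1 - pik) / d)
  diag(pikl) <- pik  # diagonal holds first-order probabilities
  if (!is.null(sample_idx)) {
    pikl <- pikl[sample_idx, sample_idx, drop = FALSE]
  }
  pikl
}
```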

Registering and using Sampford’s method

library(sondage)

register_method(
  "sampford",
  type       = "wor",
  sample_fn  = sampford_sample,
  joint_fn   = he_jip,
  fixed_size = TRUE
)

Once registered, the method is available through the standard dispatcher:

pik <- inclusion_prob(c(2, 3, 4, 5, 6, 7, 8, 9), n = 4)
pik
#> [1] 0.1818182 0.2727273 0.3636364 0.4545455 0.5454545 0.6363636 0.7272727
#> [8] 0.8181818

s <- unequal_prob_wor(pik, method = "sampford")
s
#> Unequal prob WOR [sampford] (n=4, N=8): 2 6 7 8

All generics work transparently:

# Joint inclusion probabilities
pikl <- joint_inclusion_prob(s)
round(pikl, 4)
#>        [,1]   [,2]   [,3]   [,4]   [,5]   [,6]   [,7]   [,8]
#> [1,] 0.1818 0.0350 0.0480 0.0618 0.0765 0.0925 0.1098 0.1289
#> [2,] 0.0350 0.2727 0.0737 0.0948 0.1174 0.1417 0.1682 0.1972
#> [3,] 0.0480 0.0737 0.3636 0.1296 0.1604 0.1935 0.2294 0.2688
#> [4,] 0.0618 0.0948 0.1296 0.4545 0.2059 0.2482 0.2939 0.3440
#> [5,] 0.0765 0.1174 0.1604 0.2059 0.5455 0.3062 0.3624 0.4237
#> [6,] 0.0925 0.1417 0.1935 0.2482 0.3062 0.6364 0.4355 0.5087
#> [7,] 0.1098 0.1682 0.2294 0.2939 0.3624 0.4355 0.7273 0.5999
#> [8,] 0.1289 0.1972 0.2688 0.3440 0.4237 0.5087 0.5999 0.8182

# Sampling covariance (Sen-Yates-Grundy check quantities)
round(sampling_cov(s, weighted = TRUE), 4)
#>         [,1]    [,2]    [,3]    [,4]    [,5]    [,6]    [,7]    [,8]
#> [1,]  0.8182 -0.4165 -0.3784 -0.3383 -0.2960 -0.2515 -0.2043 -0.1544
#> [2,] -0.4165  0.7273 -0.3457 -0.3075 -0.2671 -0.2245 -0.1793 -0.1314
#> [3,] -0.3784 -0.3457  0.6364 -0.2750 -0.2366 -0.1959 -0.1528 -0.1070
#> [4,] -0.3383 -0.3075 -0.2750  0.5455 -0.2042 -0.1656 -0.1246 -0.0810
#> [5,] -0.2960 -0.2671 -0.2366 -0.2042  0.4545 -0.1334 -0.0946 -0.0532
#> [6,] -0.2515 -0.2245 -0.1959 -0.1656 -0.1334  0.3636 -0.0626 -0.0236
#> [7,] -0.2043 -0.1793 -0.1528 -0.1246 -0.0946 -0.0626  0.2727  0.0081
#> [8,] -0.1544 -0.1314 -0.1070 -0.0810 -0.0532 -0.0236  0.0081  0.1818
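These check quantities feed the Sen-Yates-Grundy variance estimator. As an independent cross-check, that estimator can be computed directly from pik and pikl; a hypothetical standalone sketch (not a sondage function):

```r
# Sen-Yates-Grundy estimate of the variance of the Horvitz-Thompson
# total, computed directly from first- and second-order probabilities
# for a fixed-size design. s holds the sampled unit indices.
syg_var <- function(y, pik, pikl, s) {
  v <- 0
  for (i in s) for (j in s) if (i < j) {
    check <- (pik[i] * pik[j] - pikl[i, j]) / pikl[i, j]
    v <- v + check * (y[i] / pik[i] - y[j] / pik[j])^2
  }
  v
}
```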

# sampled_only for large populations
joint_inclusion_prob(s, sampled_only = TRUE)
#>           [,1]      [,2]      [,3]      [,4]
#> [1,] 0.2727273 0.1417381 0.1681894 0.1972218
#> [2,] 0.1417381 0.6363636 0.4355263 0.5086542
#> [3,] 0.1681894 0.4355263 0.7272727 0.5999281
#> [4,] 0.1972218 0.5086542 0.5999281 0.8181818

Verifying with simulation

We can check that the empirical inclusion frequencies match the target \(\pi_k\) and that the empirical pairwise frequencies match the HE approximation.

sim <- unequal_prob_wor(pik, method = "sampford", nrep = 5000)
dim(sim$sample) # n x nrep
#> [1]    4 5000

# First-order: empirical vs target
freq <- tabulate(sim$sample, nbins = length(pik)) / 5000
cbind(target = pik, empirical = freq)
#>         target empirical
#> [1,] 0.1818182    0.1826
#> [2,] 0.2727273    0.2724
#> [3,] 0.3636364    0.3634
#> [4,] 0.4545455    0.4528
#> [5,] 0.5454545    0.5464
#> [6,] 0.6363636    0.6278
#> [7,] 0.7272727    0.7294
#> [8,] 0.8181818    0.8252
# Second-order: empirical vs HE approximation
N <- length(pik)
co_occur <- matrix(0, N, N)
for (j in seq_len(5000)) {
  idx <- sim$sample[, j]
  co_occur[idx, idx] <- co_occur[idx, idx] + 1
}
diag(co_occur) <- 0
empirical_jip <- co_occur / 5000
diag(empirical_jip) <- freq

he_pikl <- he_jip(pik)

# Compare a few off-diagonal pairs
ii <- c(1, 2, 3, 5)
jj <- c(8, 7, 6, 8)
pairs <- data.frame(
  i         = ii,
  j         = jj,
  HE        = round(he_pikl[cbind(ii, jj)], 4),
  empirical = round(empirical_jip[cbind(ii, jj)], 4)
)
pairs
#>   i j     HE empirical
#> 1 1 8 0.1289    0.1382
#> 2 2 7 0.1682    0.1724
#> 3 3 6 0.1935    0.1834
#> 4 5 8 0.4237    0.4240

The HE approximation is close because Sampford’s design is itself a high-entropy design.

Example 2: Wrapping an external package (sampling::UPtille)

When an algorithm is already implemented in another package, the wrapper is minimal. Here we wrap UPtille and UPtillepi2 from the sampling package (Tillé 2006).

tille_sample <- function(pik, n = NULL, prn = NULL, ...) {
  which(as.logical(sampling::UPtille(pik)))
}

tille_joint <- function(pik, sample_idx = NULL, ...) {
  pikl <- sampling::UPtillepi2(pik)
  if (!is.null(sample_idx)) {
    pikl <- pikl[sample_idx, sample_idx, drop = FALSE]
  }
  pikl
}

register_method(
  "tille",
  type       = "wor",
  sample_fn  = tille_sample,
  joint_fn   = tille_joint,
  fixed_size = TRUE
)

pik <- inclusion_prob(c(2, 3, 4, 5, 6, 7, 8, 9), n = 4)
s <- unequal_prob_wor(pik, method = "tille")
s
#> Unequal prob WOR [tille] (n=4, N=8): 3 5 6 7

# Exact joint inclusion probabilities from UPtillepi2
round(joint_inclusion_prob(s), 4)
#>        [,1]   [,2]   [,3]   [,4]   [,5]   [,6]   [,7]   [,8]
#> [1,] 0.1818 0.0120 0.0361 0.0585 0.0791 0.0996 0.1199 0.1403
#> [2,] 0.0120 0.2727 0.0602 0.0877 0.1187 0.1494 0.1798 0.2104
#> [3,] 0.0361 0.0602 0.3636 0.1169 0.1582 0.1992 0.2397 0.2805
#> [4,] 0.0585 0.0877 0.1169 0.4545 0.2012 0.2490 0.2997 0.3506
#> [5,] 0.0791 0.1187 0.1582 0.2012 0.5455 0.2988 0.3596 0.4208
#> [6,] 0.0996 0.1494 0.1992 0.2490 0.2988 0.6364 0.4221 0.4909
#> [7,] 0.1199 0.1798 0.2397 0.2997 0.3596 0.4221 0.7273 0.5610
#> [8,] 0.1403 0.2104 0.2805 0.3506 0.4208 0.4909 0.5610 0.8182

# Full variance estimation chain
round(sampling_cov(s), 4)
#>         [,1]    [,2]    [,3]    [,4]    [,5]    [,6]    [,7]    [,8]
#> [1,]  0.1488 -0.0375 -0.0300 -0.0242 -0.0201 -0.0161 -0.0124 -0.0085
#> [2,] -0.0375  0.1983 -0.0390 -0.0363 -0.0301 -0.0241 -0.0185 -0.0128
#> [3,] -0.0300 -0.0390  0.2314 -0.0484 -0.0401 -0.0322 -0.0247 -0.0170
#> [4,] -0.0242 -0.0363 -0.0484  0.2479 -0.0467 -0.0402 -0.0309 -0.0213
#> [5,] -0.0201 -0.0301 -0.0401 -0.0467  0.2479 -0.0483 -0.0371 -0.0255
#> [6,] -0.0161 -0.0241 -0.0322 -0.0402 -0.0483  0.2314 -0.0407 -0.0298
#> [7,] -0.0124 -0.0185 -0.0247 -0.0309 -0.0371 -0.0407  0.1983 -0.0340
#> [8,] -0.0085 -0.0128 -0.0170 -0.0213 -0.0255 -0.0298 -0.0340  0.1488

Writing a joint_fn

A joint_fn is optional but enables joint_inclusion_prob(), joint_expected_hits(), and sampling_cov(). Five common strategies:

  1. Exact formula. When the design has a known closed form (e.g., UPtillepi2 above).

  2. he_jip(). The high-entropy approximation (Brewer and Donadio 2003). Best for maximum-entropy and high-entropy designs (Sampford, Tillé, and most \(\pi\)ps procedures). Uses optimised C code internally. Pass it directly: joint_fn = he_jip.

  3. hajek_jip(). The Hájek (1964) approximation based on conditional Poisson (rejective) sampling theory. Simpler formula and slightly cheaper, but generally a bit less accurate than he_jip(). Best when the design is obtained by conditioning independent Poisson trials on the sample size. Pass it directly: joint_fn = hajek_jip.

  4. Other approximations. Other methods exist in the literature and can be implemented as custom joint_fn functions. Tillé (1996) reviews several alternatives, including approximations based on Poisson, rejective, and successive sampling theory, each with different accuracy–computation trade-offs. Any function that accepts (pik, sample_idx = NULL, ...) and returns a symmetric matrix is a valid joint_fn.

  5. Monte Carlo estimation. Resample \(B\) times and estimate \(\hat{\pi}_{ij} = B^{-1} \sum_{b=1}^B I(i \in S_b)\, I(j \in S_b)\). Slow but universal:

mc_joint <- function(pik, sample_idx = NULL, ..., B = 5000) {
  N <- length(pik)
  n <- as.integer(round(sum(pik)))
  co <- matrix(0, N, N)
  for (b in seq_len(B)) {
    s <- my_sampler(pik, n = n)
    co[s, s] <- co[s, s] + 1  # diagonal accumulates marginal hit counts
  }
  pikl <- co / B              # so diag(pikl) estimates first-order probabilities
  if (!is.null(sample_idx)) {
    pikl <- pikl[sample_idx, sample_idx, drop = FALSE]
  }
  pikl
}

The sample_idx argument enables the sampled_only = TRUE path in joint_inclusion_prob(). When sample_idx is non-NULL, you may either (a) compute the full matrix and subset as above, or (b) skip rows/columns not in sample_idx for efficiency.
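For formula-based approximations, option (b) is straightforward: evaluate the formula only over sample_idx. A sketch using the Hájek formula (the function name is illustrative), where only the constant \(d\) needs the full pik vector:

```r
# Hajek-style joint_fn that computes only the requested submatrix
# instead of building the full N x N matrix and subsetting it.
hajek_jip_sub <- function(pik, sample_idx = NULL, ...) {
  d <- sum(pik * (1 - pik))  # population-level constant: uses all of pik
  p <- if (is.null(sample_idx)) pik else pik[sample_idx]
  pikl <- outer(p, p) * (1 - outer(1 - p, 1 - p) / d)
  diag(pikl) <- p            # diagonal holds first-order probabilities
  pikl
}
```

For an n-unit sample from a large population this touches \(n^2\) cells rather than \(N^2\), which is exactly what the sampled_only = TRUE path is meant to exploit.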

Session persistence

The registry lives in the package namespace and resets when sondage is reloaded. To make a registration persistent across sessions, place the register_method() call in your .Rprofile or in a project-level setup script.
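Under the hood, such a registry is typically just a package-local environment; a minimal illustrative sketch (sondage's actual internals may differ, and these helper names are hypothetical):

```r
# Minimal sketch of an in-memory method registry (illustrative only).
.registry <- new.env(parent = emptyenv())

register      <- function(name, spec) assign(name, spec, envir = .registry)
is_registered <- function(name) exists(name, envir = .registry, inherits = FALSE)
registered    <- function() ls(.registry)
unregister    <- function(name) rm(list = name, envir = .registry)
```

Because the environment lives only in memory, its contents vanish with the R session, which is why register_method() calls belong in .Rprofile or a setup script.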

References

Brewer, K. R. W., and M. E. Donadio. 2003. “The High Entropy Variance of the Horvitz–Thompson Estimator.” Survey Methodology 29 (2): 189–96.
Hájek, Jaroslav. 1964. “Asymptotic Theory of Rejective Sampling with Varying Probabilities from a Finite Population.” The Annals of Mathematical Statistics 35 (4): 1491–1523. https://doi.org/10.1214/aoms/1177700375.
Sampford, M. R. 1967. “On Sampling Without Replacement with Unequal Probabilities of Selection.” Biometrika 54 (3–4): 499–513. https://doi.org/10.1093/biomet/54.3-4.499.
Tillé, Yves. 1996. “Some Remarks on Unequal Probability Sampling Designs Without Replacement.” Annales d’Économie et de Statistique, no. 44: 177–89. https://doi.org/10.2307/20076043.
———. 2006. Sampling Algorithms. Springer Series in Statistics. Springer. https://doi.org/10.1007/0-387-34240-0.