Sampling Coordination with Permanent Random Numbers
Source:vignettes/sampling-coordination.Rmd
sampling-coordination.RmdWhy Coordinate Samples?
Repeated surveys (business panels, labour force surveys, agricultural censuses) draw fresh samples at regular intervals from the same population. Similarly, multiple survey programmes often sample from the same register simultaneously. Two natural questions arise:
- Positive coordination: How can we maximise overlap between samples to reduce respondent burden and improve change estimates?
- Negative coordination: How can we minimise overlap to spread the burden and maximise cumulative coverage?
These questions apply both temporally (across waves of the same survey) and cross-sectionally (across different surveys on the same frame, known as collocated sampling).
The permanent random number (PRN) technique, introduced by Brewer, Early, and Joyce (1972) and developed by Ohlsson (1995) and Cotton and Hesse (1992), answers both questions with a single mechanism: attach a stable \(U(0,1)\) value to each frame unit and update it between waves.
samplyr supports PRN-based coordination for four
selection methods: bernoulli, pps_poisson,
pps_sps, and pps_pareto.
Setup
We use ken_enterprises, a synthetic business register
with 6,823 enterprises.
library(samplyr)
library(sondage)
library(dplyr)
data(ken_enterprises)
glimpse(ken_enterprises)
#> Rows: 6,823
#> Columns: 9
#> $ enterprise_id <chr> "KEN_00001", "KEN_00002", "KEN_00003", "KEN_00004", "…
#> $ county <fct> Kiambu, Kiambu, Kiambu, Kiambu, Kiambu, Kiambu, Kiamb…
#> $ region <fct> Kiambu, Kiambu, Kiambu, Kiambu, Kiambu, Kiambu, Kiamb…
#> $ sector <fct> Chemicals & Plastics, Chemicals & Plastics, Chemicals…
#> $ size_class <fct> Small, Medium, Medium, Medium, Large, Large, Small, S…
#> $ employees <int> 13, 54, 37, 24, 107, 136, 10, 14, 43, 30, 70, 42, 84,…
#> $ revenue_millions <dbl> 25.1, 57.8, 51.7, 40.5, 73.7, 507.9, 25.5, 16.8, 39.9…
#> $ year_established <int> 1993, 2005, 2002, 2014, 1994, 2007, 2005, 1997, 1991,…
#> $ exporter <lgl> TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, …Each enterprise has a revenue_millions column that
serves as our measure of size for PPS sampling. We target \(n = 200\) enterprises per wave.
n_wave <- 200Assigning permanent random numbers
Every frame unit gets a \(U(0,1)\) value drawn once and stored permanently in the register. These values are reused or updated across waves.
Computing inclusion probabilities
Coordination requires knowing each unit’s first-order inclusion
probability \(\pi_k\). For PPS without
replacement, sondage::inclusion_prob() computes these from
the measure of size and sample size. This must be done on the
full frame, not just the sample.
frame <- frame |>
mutate(pik = inclusion_prob(revenue_millions, n_wave))
round(quantile(frame$pik), 4)
#> 0% 25% 50% 75% 100%
#> 0.0003 0.0029 0.0067 0.0179 1.0000Positive Coordination: Maximum Overlap
To maximise overlap between waves, reuse the same PRN values. With sequential Poisson sampling (SPS), two waves drawn from the same frame with identical PRNs produce identical samples.
wave1 <- sampling_design() |>
draw(n = n_wave, method = "pps_sps", mos = revenue_millions, prn = u) |>
execute(frame, seed = 42)
wave2 <- sampling_design() |>
draw(n = n_wave, method = "pps_sps", mos = revenue_millions, prn = u) |>
execute(frame, seed = 123)
overlap_positive <- length(intersect(wave1$enterprise_id,
wave2$enterprise_id))
overlap_positive
#> [1] 200Perfect overlap. This is useful when the goal is to track the same units over time, for example measuring quarterly changes in revenue within a fixed panel.
Negative Coordination: Minimum Overlap
To minimise overlap, update the PRN values between waves using Tillé (2006, sec. 8.2) formula:
\[u_k^{(t+1)} = \bigl(u_k^{(t)} - \pi_k\bigr) \bmod 1\]
Selected units (those with small \(u_k\)) get pushed to the high end of \((0,1)\), making them unlikely to be selected again. Non-selected units shift downward, increasing their chance of selection.
# Wave 1
wave1 <- sampling_design() |>
draw(n = n_wave, method = "pps_sps", mos = revenue_millions, prn = u) |>
execute(frame, seed = 1)
# Update PRNs on the full frame
frame <- frame |>
mutate(u = (u - pik) %% 1)
# Wave 2
wave2 <- sampling_design() |>
draw(n = n_wave, method = "pps_sps", mos = revenue_millions, prn = u) |>
execute(frame, seed = 1)
overlap_negative <- length(intersect(wave1$enterprise_id, wave2$enterprise_id))
overlap_negative
#> [1] 43Why isn’t the overlap zero?
Not all overlap can be eliminated. Units with large \(\pi_k\) (big enterprises) are selected in almost every wave because their inclusion probability is close to or equal to 1. The theoretical minimum overlap, the Anticipated Lower Bound (ALB), is:
\[\text{ALB} = \sum_{k=1}^{N} \max(0,\; 2\pi_k - 1)\]
Only units with \(\pi_k > 0.5\) contribute to forced overlap.
alb <- sum(pmax(0, 2 * frame$pik - 1))
expected_independent <- sum(frame$pik^2)
tibble(x = c("ALB (theoretical minimum)",
"Observed overlap",
"Expected if independent",
"AUB (theoretical max)"),
y = c(round(alb), overlap_negative,
round(expected_independent), n_wave))
#> # A tibble: 4 × 2
#> x y
#> <chr> <dbl>
#> 1 ALB (theoretical minimum) 40
#> 2 Observed overlap 43
#> 3 Expected if independent 66
#> 4 AUB (theoretical max) 200The negative coordination brings the overlap close to the ALB, well below what independent sampling would produce.
Comparison: Three Strategies
# Independent sampling: fresh PRNs for wave 2
set.seed(1)
frame_ind <- ken_enterprises |>
mutate(u1 = runif(n()),
u2 = runif(n()))
wave1_ind <- sampling_design() |>
draw(n = n_wave, method = "pps_sps", mos = revenue_millions, prn = u1) |>
execute(frame_ind, seed = 1)
wave2_ind <- sampling_design() |>
draw(n = n_wave, method = "pps_sps", mos = revenue_millions, prn = u2) |>
execute(frame_ind, seed = 1)
overlap_ind <- length(intersect(wave1_ind$enterprise_id,
wave2_ind$enterprise_id))
tibble(strategy = c("Positive (same PRN)", "Independent (fresh PRN)", "Negative (Tillé update)"),
overlap = c(overlap_positive, overlap_ind, overlap_negative),
bound = c(n_wave, round(expected_independent), round(alb))) |>
mutate(pct = paste0(round(100 * overlap / n_wave), "%"))
#> # A tibble: 3 × 4
#> strategy overlap bound pct
#> <chr> <int> <dbl> <chr>
#> 1 Positive (same PRN) 200 200 100%
#> 2 Independent (fresh PRN) 70 66 35%
#> 3 Negative (Tillé update) 43 40 22%Multi-Wave Rotation
In practice, surveys run for many waves. The PRN update is applied after each wave, rotating units through the sample.
set.seed(2026)
frame <- ken_enterprises |>
mutate(u = runif(n()),
pik = inclusion_prob(revenue_millions, n_wave))
n_waves <- 5
waves <- vector("list", n_waves)
for (t in seq_len(n_waves)) {
waves[[t]] <- sampling_design() |>
draw(n = n_wave, method = "pps_sps", mos = revenue_millions, prn = u) |>
execute(frame, seed = 1)
# Update PRNs for next wave
frame <- frame |>
mutate(u = (u - pik) %% 1)
}Pairwise overlap matrix
ids <- lapply(waves, `[[`, "enterprise_id")
overlap_mat <- outer(ids, ids, Vectorize(\(a, b) length(intersect(a, b))))
dimnames(overlap_mat) <- rep(list(paste("Wave", seq_len(n_waves))), 2)
overlap_mat
#> Wave 1 Wave 2 Wave 3 Wave 4 Wave 5
#> Wave 1 200 38 65 60 61
#> Wave 2 38 200 45 61 52
#> Wave 3 65 45 200 49 62
#> Wave 4 60 61 49 200 55
#> Wave 5 61 52 62 55 200Consecutive waves have low overlap (near the ALB). Non-consecutive waves show varying overlap as the rotation cycle progresses.
Cumulative coverage
After \(T\) waves, how many distinct enterprises have been surveyed?
cumulative <- integer(n_waves)
seen <- character(0)
for (t in seq_len(n_waves)) {
seen <- union(seen, waves[[t]]$enterprise_id)
cumulative[t] <- length(seen)
}
tibble(wave = seq_len(n_waves),
new_units = c(n_wave, diff(cumulative)),
cumulative = cumulative,
pct_frame = paste0(round(100 * cumulative / nrow(ken_enterprises), 1), "%"))
#> # A tibble: 5 × 4
#> wave new_units cumulative pct_frame
#> <int> <dbl> <int> <chr>
#> 1 1 200 200 2.9%
#> 2 2 162 362 5.3%
#> 3 3 122 484 7.1%
#> 4 4 112 596 8.7%
#> 5 5 103 699 10.2%Negative coordination maximises the number of distinct units covered, which is valuable for building comprehensive registers or spreading respondent burden.
Handling Births and Deaths
Real business registers change between waves. New enterprises (births) enter the frame; closed enterprises (deaths) leave it.
- Deaths: Simply remove them from the frame. Their PRN values are discarded. No special treatment needed.
- Births: Assign fresh \(U(0,1)\) values. These new units enter the coordination cycle at a random position.
# Simulate: remove 200 enterprises (deaths), add 500 new ones (births)
set.seed(123)
n_births <- 500
gen_data <- function(x, n) {
sample(levels(x), n, replace = TRUE) |>
factor(levels = levels(x))
}
# Deaths: remove a random subset
deaths <- sample(frame$enterprise_id, 200)
frame_t2 <- frame |>
filter(!enterprise_id %in% deaths)
# Births: new enterprises with fresh PRNs
# Resample revenue from the existing frame to get a realistic size distribution
births <- tibble(enterprise_id = paste0("NEW_", sprintf("%05d", seq_len(n_births))),
county = gen_data(ken_enterprises$county, n_births),
region = gen_data(ken_enterprises$region, n_births),
sector = gen_data(ken_enterprises$sector, n_births),
size_class = gen_data(ken_enterprises$size_class, n_births),
employees = sample(ken_enterprises$employees, n_births, replace = TRUE),
revenue_millions = sample(ken_enterprises$revenue_millions, n_births, replace = TRUE),
year_established = sample(ken_enterprises$year_established, n_births, replace = TRUE),
exporter = sample(ken_enterprises$exporter, n_births, replace = TRUE),
u = runif(n_births))
frame_t2 <- bind_rows(frame_t2, births)
# Recompute inclusion probabilities on the new frame
frame_t2 <- frame_t2 |>
mutate(pik = inclusion_prob(revenue_millions, n_wave))
# Draw wave from updated frame
wave_t2 <- sampling_design() |>
draw(n = n_wave, method = "pps_sps", mos = revenue_millions, prn = u) |>
execute(frame_t2, seed = 1)
# How many births were selected?
n_births_selected <- sum(wave_t2$enterprise_id %in% births$enterprise_id)
n_births_selected
#> [1] 13
n_births_selected/n_births
#> [1] 0.026The coordination mechanism automatically handles frame changes. Surviving units retain their updated PRN values and births start with fresh PRN.
Using Pareto Sampling
The pps_pareto method also supports PRN-based
coordination. In theory, Pareto sampling has maximum entropy among
fixed-size PPS designs that use PRNs, making it particularly well-suited
for coordination.
set.seed(12345)
frame <- ken_enterprises |>
mutate(u = runif(n()),
pik = inclusion_prob(revenue_millions, n_wave))
wave1_par <- sampling_design() |>
draw(n = n_wave, method = "pps_pareto", mos = revenue_millions, prn = u) |>
execute(frame, seed = 1)
frame <- frame |> mutate(u = (u - pik) %% 1)
wave2_par <- sampling_design() |>
draw(n = n_wave, method = "pps_pareto", mos = revenue_millions, prn = u) |>
execute(frame, seed = 1)
overlap_par <- length(intersect(wave1_par$enterprise_id,
wave2_par$enterprise_id))
overlap_par
#> [1] 39
alb <- round(sum(pmax(0, 2 * frame$pik - 1)))
alb
#> [1] 40Stratified Coordination
When the design is stratified, inclusion probabilities must be computed within each stratum. The PRN update is then applied globally. The per-stratum \(\pi_k\) values ensure correct rotation within each group.
set.seed(1980)
frame <- ken_enterprises |>
mutate(u = runif(n()))
# Compute within-stratum pik
frame <- frame |>
mutate(pik = inclusion_prob(revenue_millions, 50),
.by = size_class)
max_pik <- max(frame$pik)
max_pik
#> [1] 0.6451258
alb <- round(sum(pmax(0, 2 * frame$pik - 1)))
alb
#> [1] 1
wave1_str <- sampling_design() |>
stratify_by(size_class) |>
draw(n = 50, method = "pps_pareto", mos = revenue_millions, prn = u) |>
execute(frame, seed = 1)
frame <- frame |> mutate(u = (u - pik) %% 1)
wave2_str <- sampling_design() |>
stratify_by(size_class) |>
draw(n = 50, method = "pps_pareto", mos = revenue_millions, prn = u) |>
execute(frame, seed = 1)
overlap_str <- length(intersect(wave1_str$enterprise_id,
wave2_str$enterprise_id))
overlap_str
#> [1] 1With smaller within-stratum \(\pi_k\) values (all below 0.5), the ALB is zero and perfect negative coordination is achievable.
Collocated Sampling: Coordinating Across Surveys
The examples above coordinate samples from the same survey across time (waves). A related problem is coordinating different surveys drawn from the same frame at the same time point. This is called collocated sampling (Ernst 1986; Ohlsson 1995) and is common in business survey programmes where a single register serves multiple survey operations.
The mechanism is identical: all surveys share the same PRN column.
- Positive coordination (maximise overlap): both surveys use the same PRN. Useful when surveys share common modules or when joint analysis is planned.
- Negative coordination (minimise overlap): one survey uses shifted PRNs (\(u' = (u + 0.5) \bmod 1\)). Useful when the goal is to spread respondent burden across the register.
Example: Turnover survey and employment survey
Suppose a statistical agency runs two surveys on the same business
register: a revenue survey (\(n =
200\), PPS on revenue_millions) and an employment
survey (\(n = 150\), PPS on
employees). The surveys have different sample sizes,
different measures of size, and different stratification, but they share
the same frame and the same PRN column.
set.seed(123)
frame <- ken_enterprises |>
mutate(u = runif(n()))
# Survey A: revenue survey (PPS on revenue_millions, stratified by sector)
survey_a <- sampling_design() |>
draw(n = 200, method = "pps_sps", mos = revenue_millions, prn = u) |>
execute(frame, seed = 1)
# Survey B: employment survey (PPS on employees), positive coordination
survey_b <- sampling_design() |>
draw(n = 150, method = "pps_sps", mos = employees, prn = u) |>
execute(frame, seed = 1)
overlap_pos <- length(intersect(survey_a$enterprise_id,
survey_b$enterprise_id))
overlap_pos
#> [1] 132For negative coordination, shift the PRN by 0.5:
# Survey B with negative coordination
frame_neg <- frame |>
mutate(u_neg = (u + 0.5) %% 1)
survey_b_neg <- sampling_design() |>
draw(n = 150, method = "pps_sps", mos = employees, prn = u_neg) |>
execute(frame_neg, seed = 1)
overlap_neg <- length(intersect(survey_a$enterprise_id,
survey_b_neg$enterprise_id))
overlap_neg
#> [1] 18Compare with independent sampling (fresh PRNs):
Overlap bounds for different surveys
When two surveys have different inclusion probabilities \(\pi_k^A\) and \(\pi_k^B\), the bounds generalise naturally:
pik_a <- inclusion_prob(frame$revenue_millions, 200)
pik_b <- inclusion_prob(frame$employees, 150)
alb_ab <- sum(pmax(0, pik_a + pik_b - 1))
aub_ab <- sum(pmin(pik_a, pik_b))
exp_ind <- sum(pik_a * pik_b)
tibble(strategy = c("Positive (same PRN)", "Independent (fresh PRN)", "Negative (shifted PRN)"),
overlap = c(overlap_pos, overlap_ind, overlap_neg),
bound = c(round(aub_ab), round(exp_ind), round(alb_ab)))
#> # A tibble: 3 × 3
#> strategy overlap bound
#> <chr> <int> <dbl>
#> 1 Positive (same PRN) 132 133
#> 2 Independent (fresh PRN) 33 42
#> 3 Negative (shifted PRN) 18 18Because revenue_millions and employees are
correlated (large firms have both high revenue and many employees), the
overlap under positive coordination is substantial. Negative
coordination via PRN shift effectively spreads the burden.
Scaling to more than two surveys
For \(K\) surveys, shift each survey’s PRN by \(j/K\) for \(j = 0, \ldots, K-1\):
# Negative coordination across 4 surveys
n_surveys <- 4
for (j in seq_len(n_surveys)) {
frame_j <- frame |>
mutate(u_j = (u + (j - 1) / n_surveys) %% 1)
samples[[j]] <- sampling_design() |>
draw(n = n_j, method = "pps_sps", mos = mos_j, prn = u_j) |>
execute(frame_j, seed = 1)
}This simple shift strategy works well when the surveys have similar \(\pi_k\). For more complex programmes (varying burden weights, priority orderings, or dozens of surveys), NSOs use dedicated coordination systems such as the Rivière method (Rivière 2001), the Netherlands EDS (De Ree 1983; Van Huis, De Ree, and Grotendorst 1994), or the Swiss method (Graf and Qualité 2014). These methods track cumulative burden per unit and permute PRNs within microstrata to achieve near-optimal coordination across the full survey programme. See Tillé (2020, secs. 8.2.7–8.2.9) for a detailed treatment.
Summary
The PRN coordination workflow in samplyr is:
-
Assign PRNs once:
mutate(u = runif(n())) -
Compute \(\pi_k\)
on the full frame:
mutate(pik = inclusion_prob(mos, n)) -
Draw the sample:
sampling_design() |> draw(method = "pps_sps", prn = u) |> execute(frame) -
Update or shift PRNs:
- Temporal negative coordination (waves):
mutate(u = (u - pik) %% 1) - Temporal positive coordination: keep
uunchanged - Cross-survey negative coordination: shift by
(u + 0.5) %% 1 - Cross-survey positive coordination: reuse same
u
- Temporal negative coordination (waves):
| Goal | PRN update | Expected overlap |
|---|---|---|
| Maximum overlap | None (reuse) | \(\sum \min(\pi_k^A, \pi_k^B)\) |
| Independent | Fresh \(U(0,1)\) | \(\sum \pi_k^A \cdot \pi_k^B\) |
| Minimum overlap | Shift or Tillé update | \(\sum \max(0, \pi_k^A + \pi_k^B - 1)\) |