Sampling Coordination with Permanent Random Numbers • samplyr

Why Coordinate Samples?

Repeated surveys (business panels, labor force surveys, agricultural censuses) draw fresh samples at regular intervals from the same population. Similarly, multiple survey programs often sample from the same register simultaneously. Two natural questions arise:

Positive coordination: How can we maximize overlap between samples to reduce respondent burden and improve change estimates?
Negative coordination: How can we minimize overlap to spread the burden and maximize cumulative coverage?

These questions apply both temporally (across waves of the same survey) and cross-sectionally (across different surveys on the same frame, known as collocated sampling).

The permanent random number (PRN) technique, introduced by Brewer, Early, and Joyce (1972) and developed by Ohlsson (1995) and Cotton and Hesse (1992), answers both questions with a single mechanism: attach a stable \(U(0,1)\) value to each frame unit and update it between waves.

samplyr supports PRN-based coordination for four selection methods: bernoulli, pps_poisson, pps_sps, and pps_pareto.

Setup

We use ken_enterprises, a synthetic business register with 17,004 enterprises.

library(samplyr)
library(sondage)
library(dplyr)

data(ken_enterprises)
glimpse(ken_enterprises)
#> Rows: 17,004
#> Columns: 9
#> $ enterprise_id    <chr> "KEN_00001", "KEN_00002", "KEN_00003", "KEN_00004", "…
#> $ county           <fct> Kiambu, Kiambu, Kiambu, Kirinyaga, Murang'a, Murang'a…
#> $ region           <fct> Central, Central, Central, Central, Central, Central,…
#> $ sector           <fct> Chemicals & Chemical Products, Chemicals & Chemical P…
#> $ size_class       <fct> Medium, Small, Small, Small, Medium, Small, Small, La…
#> $ employees        <int> 43, 15, 6, 17, 44, 13, 9, 179, 42, 10, 8, 11, 2000, 6…
#> $ revenue_millions <dbl> 108.0, 45.7, 4.4, 40.0, 67.3, 25.6, 32.9, 265.5, 55.2…
#> $ year_established <int> 1994, 2014, 2008, 1992, 2016, 1982, 2002, 1982, 2005,…
#> $ exporter         <lgl> FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE,…

Each enterprise has a revenue_millions column that serves as our measure of size for PPS sampling. We target \(n = 200\) enterprises per wave.

n_wave <- 200

Assigning permanent random numbers

Every frame unit gets a \(U(0,1)\) value drawn once and stored permanently in the register. These values are reused or updated across waves.

set.seed(1)

frame <- ken_enterprises |>
  mutate(u = runif(n()))

Computing inclusion probabilities

Coordination requires knowing each unit’s first-order inclusion probability \(\pi_k\). For PPS without replacement, sondage::inclusion_prob() computes these from the measure of size and sample size. This must be done on the full frame, not just the sample.

frame <- frame |>
  mutate(pik = inclusion_prob(revenue_millions, n_wave))

round(quantile(frame$pik), 4)
#>     0%    25%    50%    75%   100% 
#> 0.0001 0.0013 0.0024 0.0062 1.0000

Positive Coordination: Maximum Overlap

To maximize overlap between waves, reuse the same PRN values. With sequential Poisson sampling (SPS), two waves drawn from the same frame with identical PRNs produce identical samples.

wave1 <- sampling_design() |>
  draw(n = n_wave, method = "pps_sps", mos = revenue_millions, prn = u) |>
  execute(frame, seed = 42)

wave2 <- sampling_design() |>
  draw(n = n_wave, method = "pps_sps", mos = revenue_millions, prn = u) |>
  execute(frame, seed = 123)

overlap_positive <- length(intersect(wave1$enterprise_id,
                                     wave2$enterprise_id))
overlap_positive
#> [1] 200

Perfect overlap. This is useful when the goal is to track the same units over time, for example measuring quarterly changes in revenue within a fixed panel.

Negative Coordination: Minimum Overlap

To minimize overlap, update the PRN values between waves using Tillé (2006, sec. 8.2) formula:

\[u_k^{(t+1)} = \bigl(u_k^{(t)} - \pi_k\bigr) \bmod 1\]

Selected units (those with small \(u_k\)) get pushed to the high end of \((0,1)\), making them unlikely to be selected again. Non-selected units shift downward, increasing their chance of selection.

# Wave 1
wave1 <- sampling_design() |>
  draw(n = n_wave, method = "pps_sps", mos = revenue_millions, prn = u) |>
  execute(frame, seed = 1)

# Update PRNs on the full frame
frame <- frame |>
  mutate(u = (u - pik) %% 1)

# Wave 2
wave2 <- sampling_design() |>
  draw(n = n_wave, method = "pps_sps", mos = revenue_millions, prn = u) |>
  execute(frame, seed = 1)

overlap_negative <- length(intersect(wave1$enterprise_id, wave2$enterprise_id))
overlap_negative
#> [1] 15

Why isn’t the overlap zero?

Not all overlap can be eliminated. Units with large \(\pi_k\) (big enterprises) are selected in almost every wave because their inclusion probability is close to or equal to 1. The Anticipated Lower Bound (ALB) for expected overlap is:

\[\text{ALB} = \sum_{k=1}^{N} \max(0,\; 2\pi_k - 1)\]

Only units with \(\pi_k > 0.5\) contribute to forced overlap.

alb <- sum(pmax(0, 2 * frame$pik - 1))
expected_independent <- sum(frame$pik^2)

tibble(x = c("ALB (expected lower bound)",
             "Observed overlap",
             "Expected if independent",
             "AUB (expected upper bound)"),
       y = c(round(alb), overlap_negative,
             round(expected_independent), n_wave))
#> # A tibble: 4 × 2
#>   x                              y
#>   <chr>                      <dbl>
#> 1 ALB (expected lower bound)    18
#> 2 Observed overlap              15
#> 3 Expected if independent       44
#> 4 AUB (expected upper bound)   200

The negative coordination brings the realized overlap near the anticipated lower-bound benchmark, well below what independent sampling would produce. The ALB applies to expected overlap across repeated coordinated draws; one realized overlap can fall on either side of its rounded value.

Comparison: Three Strategies

# Independent sampling: fresh PRNs for wave 2
set.seed(1)
frame_ind <- ken_enterprises |>
  mutate(u1 = runif(n()),
         u2 = runif(n()))

wave1_ind <- sampling_design() |>
  draw(n = n_wave, method = "pps_sps", mos = revenue_millions, prn = u1) |>
  execute(frame_ind, seed = 1)

wave2_ind <- sampling_design() |>
  draw(n = n_wave, method = "pps_sps", mos = revenue_millions, prn = u2) |>
  execute(frame_ind, seed = 1)

overlap_ind <- length(intersect(wave1_ind$enterprise_id,
                                wave2_ind$enterprise_id))

tibble(strategy = c("Positive (same PRN)", "Independent (fresh PRN)", "Negative (Tille update)"),
       overlap = c(overlap_positive, overlap_ind, overlap_negative),
       bound = c(n_wave, round(expected_independent), round(alb))) |>
  mutate(pct = paste0(round(100 * overlap / n_wave), "%"))
#> # A tibble: 3 × 4
#>   strategy                overlap bound pct  
#>   <chr>                     <int> <dbl> <chr>
#> 1 Positive (same PRN)         200   200 100% 
#> 2 Independent (fresh PRN)      46    44 23%  
#> 3 Negative (Tille update)      15    18 8%

Multi-Wave Rotation

In practice, surveys run for many waves. The PRN update is applied after each wave, rotating units through the sample.

set.seed(2026)

frame <- ken_enterprises |>
  mutate(u = runif(n()),
         pik = inclusion_prob(revenue_millions, n_wave))

n_waves <- 5
waves <- vector("list", n_waves)

for (t in seq_len(n_waves)) {
  waves[[t]] <- sampling_design() |>
    draw(n = n_wave, method = "pps_sps", mos = revenue_millions, prn = u) |>
    execute(frame, seed = 1)

  # Update PRNs for next wave
  frame <- frame |>
    mutate(u = (u - pik) %% 1)
}

Pairwise overlap matrix

ids <- lapply(waves, `[[`, "enterprise_id")

overlap_mat <- outer(ids, ids, Vectorize(\(a, b) length(intersect(a, b))))
dimnames(overlap_mat) <- rep(list(paste("Wave", seq_len(n_waves))), 2)

overlap_mat
#>        Wave 1 Wave 2 Wave 3 Wave 4 Wave 5
#> Wave 1    200     22     35     39     43
#> Wave 2     22    200     11     41     43
#> Wave 3     35     11    200     15     34
#> Wave 4     39     41     15    200     22
#> Wave 5     43     43     34     22    200

Consecutive waves have low overlap around the anticipated lower-bound benchmark. Non-consecutive waves show varying overlap as the rotation cycle progresses.

Cumulative coverage

After \(T\) waves, how many distinct enterprises have been surveyed?

cumulative <- integer(n_waves)
seen <- character(0)

for (t in seq_len(n_waves)) {
  seen <- union(seen, waves[[t]]$enterprise_id)
  cumulative[t] <- length(seen)
}

tibble(wave = seq_len(n_waves),
       new_units = c(n_wave, diff(cumulative)),
       cumulative = cumulative,
       pct_frame = paste0(round(100 * cumulative / nrow(ken_enterprises), 1), "%"))
#> # A tibble: 5 × 4
#>    wave new_units cumulative pct_frame
#>   <int>     <dbl>      <int> <chr>    
#> 1     1       200        200 1.2%     
#> 2     2       178        378 2.2%     
#> 3     3       164        542 3.2%     
#> 4     4       138        680 4%       
#> 5     5       126        806 4.7%

Negative coordination aims to maximize the number of distinct units covered, which is valuable for building comprehensive registers or spreading respondent burden.

Handling Births and Deaths

Real business registers change between waves. New enterprises (births) enter the frame; closed enterprises (deaths) leave it.

Deaths: Simply remove them from the frame. Their PRN values are discarded. No special treatment needed.
Births: Assign fresh \(U(0,1)\) values. These new units enter the coordination cycle at a random position.

# Simulate: remove 200 enterprises (deaths), add 500 new ones (births)
set.seed(123)
n_births <- 500

# Deaths: remove a random subset
deaths <- sample(frame$enterprise_id, 200)
frame_t2 <- frame |>
  filter(!enterprise_id %in% deaths)

# Births: new enterprises with fresh PRNs
# Sample complete donor records so county-region pairs and business attributes
# remain internally coherent.
births <- ken_enterprises |>
  slice_sample(n = n_births, replace = TRUE) |>
  mutate(
    enterprise_id = paste0("NEW_", sprintf("%05d", seq_len(n_births))),
    u = runif(n_births)
  )

frame_t2 <- bind_rows(frame_t2, births)

# Recompute inclusion probabilities on the new frame
frame_t2 <- frame_t2 |>
  mutate(pik = inclusion_prob(revenue_millions, n_wave))

# Draw wave from updated frame
wave_t2 <- sampling_design() |>
  draw(n = n_wave, method = "pps_sps", mos = revenue_millions, prn = u) |>
  execute(frame_t2, seed = 1)

# How many births were selected?
n_births_selected <- sum(wave_t2$enterprise_id %in% births$enterprise_id)
n_births_selected
#> [1] 9
n_births_selected / n_births
#> [1] 0.018

The coordination mechanism automatically handles frame changes. Surviving units retain their updated PRN values and births start with fresh PRN.

Using Pareto Sampling

The pps_pareto method also supports PRN-based coordination. In theory, Pareto sampling has maximum entropy among fixed-size PPS designs that use PRNs, making it particularly well-suited for coordination.

set.seed(12345)

frame <- ken_enterprises |>
  mutate(u = runif(n()),
         pik = inclusion_prob(revenue_millions, n_wave))

wave1_par <- sampling_design() |>
  draw(n = n_wave, method = "pps_pareto", mos = revenue_millions, prn = u) |>
  execute(frame, seed = 1)

frame <- frame |> mutate(u = (u - pik) %% 1)

wave2_par <- sampling_design() |>
  draw(n = n_wave, method = "pps_pareto", mos = revenue_millions, prn = u) |>
  execute(frame, seed = 1)

overlap_par <- length(intersect(wave1_par$enterprise_id,
                                wave2_par$enterprise_id))

overlap_par
#> [1] 18
alb <- round(sum(pmax(0, 2 * frame$pik - 1)))
alb
#> [1] 18

Stratified Coordination

When the design is stratified, inclusion probabilities must be computed within each stratum. The PRN update is then applied globally. The per-stratum \(\pi_k\) values ensure correct rotation within each group.

set.seed(1980)

frame <- ken_enterprises |>
  mutate(u = runif(n()))

# Compute within-stratum pik
frame <- frame |>
  mutate(pik = inclusion_prob(revenue_millions, 50),
         .by = size_class)

max_pik <- max(frame$pik)
max_pik
#> [1] 0.4993961
alb <- round(sum(pmax(0, 2 * frame$pik - 1)))
alb
#> [1] 0

wave1_str <- sampling_design() |>
  stratify_by(size_class) |>
  draw(n = 50, method = "pps_pareto", mos = revenue_millions, prn = u) |>
  execute(frame, seed = 1)

frame <- frame |> mutate(u = (u - pik) %% 1)

wave2_str <- sampling_design() |>
  stratify_by(size_class) |>
  draw(n = 50, method = "pps_pareto", mos = revenue_millions, prn = u) |>
  execute(frame, seed = 1)

overlap_str <- length(intersect(wave1_str$enterprise_id,
                                wave2_str$enterprise_id))
overlap_str
#> [1] 9

With smaller within-stratum \(\pi_k\) values (all below 0.5), the ALB is zero: no overlap is forced by certainty or near-certainty units. This is an expected lower bound, however, not a guarantee that one randomized pair of samples will have zero realized overlap.

Collocated Sampling: Coordinating Across Surveys

The examples above coordinate samples from the same survey across time (waves). A related problem is coordinating different surveys drawn from the same frame at the same time point. This is called collocated sampling (Ernst 1986; Ohlsson 1995) and is common in business survey programs where a single register serves multiple survey operations.

The mechanism is identical: all surveys share the same PRN column.

Positive coordination (maximize overlap): both surveys use the same PRN. Useful when surveys share common modules or when joint analysis is planned.
Negative coordination (minimize overlap): one survey uses shifted PRNs (\(u' = (u + 0.5) \bmod 1\)). Useful when the goal is to spread respondent burden across the register.

Example: Turnover survey and employment survey

Suppose a statistical agency runs two surveys on the same business register: a revenue survey (\(n = 200\), PPS on revenue_millions) and an employment survey (\(n = 150\), PPS on employees). The surveys have different sample sizes, different measures of size, and different stratification, but they share the same frame and the same PRN column.

set.seed(123)

frame <- ken_enterprises |>
  mutate(u = runif(n()))

# Survey A: revenue survey (PPS on revenue_millions)
survey_a <- sampling_design() |>
  draw(n = 200, method = "pps_sps", mos = revenue_millions, prn = u) |>
  execute(frame, seed = 1)

# Survey B: employment survey (PPS on employees), positive coordination
survey_b <- sampling_design() |>
  draw(n = 150, method = "pps_sps", mos = employees, prn = u) |>
  execute(frame, seed = 1)

overlap_pos <- length(intersect(survey_a$enterprise_id,
                                survey_b$enterprise_id))
overlap_pos
#> [1] 128

For negative coordination, shift the PRN by 0.5:

# Survey B with negative coordination
frame_neg <- frame |>
  mutate(u_neg = (u + 0.5) %% 1)

survey_b_neg <- sampling_design() |>
  draw(n = 150, method = "pps_sps", mos = employees, prn = u_neg) |>
  execute(frame_neg, seed = 1)

overlap_neg <- length(intersect(survey_a$enterprise_id,
                                survey_b_neg$enterprise_id))
overlap_neg
#> [1] 7

Compare with independent sampling (fresh PRNs):

frame_ind <- frame |>
  mutate(u_fresh = runif(n()))

survey_b_ind <- sampling_design() |>
  draw(n = 150, method = "pps_sps", mos = employees, prn = u_fresh) |>
  execute(frame_ind, seed = 1)

overlap_ind <- length(intersect(survey_a$enterprise_id,
                                survey_b_ind$enterprise_id))
overlap_ind
#> [1] 30

Overlap bounds for different surveys

When two surveys have different inclusion probabilities \(\pi_k^A\) and \(\pi_k^B\), the bounds generalize naturally:

pik_a <- inclusion_prob(frame$revenue_millions, 200)
pik_b <- inclusion_prob(frame$employees, 150)

alb_ab <- sum(pmax(0, pik_a + pik_b - 1))
aub_ab <- sum(pmin(pik_a, pik_b))
exp_ind <- sum(pik_a * pik_b)

tibble(strategy = c("Positive (same PRN)", "Independent (fresh PRN)", "Negative (shifted PRN)"),
       overlap = c(overlap_pos, overlap_ind, overlap_neg),
       bound = c(round(aub_ab), round(exp_ind), round(alb_ab)))
#> # A tibble: 3 × 3
#>   strategy                overlap bound
#>   <chr>                     <int> <dbl>
#> 1 Positive (same PRN)         128   130
#> 2 Independent (fresh PRN)      30    25
#> 3 Negative (shifted PRN)        7     3

Because revenue_millions and employees are correlated (large firms have both high revenue and many employees), the overlap under positive coordination is substantial. Negative coordination via PRN shift effectively spreads the burden.

Scaling to more than two surveys

For \(K\) surveys, shift each survey’s PRN by \(j/K\) for \(j = 0, \ldots, K-1\):

# Negative coordination across 4 surveys
n_surveys <- 4
for (j in seq_len(n_surveys)) {
  frame_j <- frame |>
    mutate(u_j = (u + (j - 1) / n_surveys) %% 1)

  samples[[j]] <- sampling_design() |>
    draw(n = n_j, method = "pps_sps", mos = mos_j, prn = u_j) |>
    execute(frame_j, seed = 1)
}

This simple shift strategy works well when the surveys have similar \(\pi_k\). For more complex programs (varying burden weights, priority orderings, or dozens of surveys), NSOs use dedicated coordination systems such as the Rivière method (Rivière 2001), the Netherlands EDS (De Ree 1983; Van Huis, De Ree, and Grotendorst 1994), or the Swiss method (Graf and Qualité 2014). These methods track cumulative burden per unit and permute PRNs within microstrata to achieve near-optimal coordination across the full survey program. See Tillé (2020, secs. 8.2.7–8.2.9) for a detailed treatment.

Summary

The PRN coordination workflow in samplyr is:

Assign PRNs once: mutate(u = runif(n()))
Compute \(\pi_k\) on the full frame: mutate(pik = inclusion_prob(mos, n))
Draw the sample: sampling_design() |> draw(method = "pps_sps", prn = u) |> execute(frame)
Update or shift PRNs:
- Temporal negative coordination (waves): mutate(u = (u - pik) %% 1)
- Temporal positive coordination: keep u unchanged
- Cross-survey negative coordination: shift by (u + 0.5) %% 1
- Cross-survey positive coordination: reuse same u

Goal	PRN update	Expected overlap
Maximum overlap	None (reuse)	\(\sum \min(\pi_k^A, \pi_k^B)\)
Independent	Fresh \(U(0,1)\)	\(\sum \pi_k^A \cdot \pi_k^B\)
Minimum overlap	Shift or Tillé update	\(\sum \max(0, \pi_k^A + \pi_k^B - 1)\)

Reference

Brewer, K. R. W., L. J. Early, and S. F. Joyce. 1972. “Selecting Several Samples from a Single Population.” Australian Journal of Statistics 14 (3): 231–39.

Cotton, F., and C. Hesse. 1992. “Tirages Coordonnés d’échantillons.” In Actes Des Journées de méthodologie Statistique. INSEE.

De Ree, S. J. M. 1983. “The EDS Sampling System for Business Surveys at the Netherlands Central Bureau of Statistics.” In Proceedings of the Section on Survey Research Methods, 197–202. ASA.

Ernst, Lawrence R. 1986. “Maximizing the Overlap Between Surveys When Information Is Incomplete.” European Journal of Operational Research 27: 192–200.

Graf, Éric, and Lionel Qualité. 2014. “Sondage Dans Des Registres de Population Et de ménages En Suisse : Coordination d’enquêtes, Pondération Et Imputation.” Journal de La Société Française de Statistique 155 (4): 95–133.

Ohlsson, Esbjorn. 1995. “Coordination of Samples Using Permanent Random Numbers.” In Business Survey Methods, edited by Brenda G. Cox et al. Wiley.

Rivière, Pascal. 2001. “Coordination d’échantillons Par La méthode Des Microstrates.” In Recueil Du Symposium 2001 de Statistique Canada. Statistics Canada.

Tillé, Yves. 2006. Sampling Algorithms. Springer.

———. 2020. Sampling and Estimation from Finite Populations. Wiley.

Van Huis, L. T., S. J. M. De Ree, and A. Grotendorst. 1994. “A Business Survey Redesign at Statistics Netherlands.” International Statistical Review 62 (3): 357–76.