Skip to contents

Why Coordinate Samples?

Repeated surveys (business panels, labour force surveys, agricultural censuses) draw fresh samples at regular intervals from the same population. Similarly, multiple survey programmes often sample from the same register simultaneously. Two natural questions arise:

  • Positive coordination: How can we maximise overlap between samples to reduce respondent burden and improve change estimates?
  • Negative coordination: How can we minimise overlap to spread the burden and maximise cumulative coverage?

These questions apply both temporally (across waves of the same survey) and cross-sectionally (across different surveys on the same frame, known as collocated sampling).

The permanent random number (PRN) technique, introduced by Brewer, Early, and Joyce (1972) and developed by Ohlsson (1995) and Cotton and Hesse (1992), answers both questions with a single mechanism: attach a stable \(U(0,1)\) value to each frame unit and update it between waves.

samplyr supports PRN-based coordination for four selection methods: bernoulli, pps_poisson, pps_sps, and pps_pareto.

Setup

We use ken_enterprises, a synthetic business register with 6,823 enterprises.

library(samplyr)
library(sondage)
library(dplyr)

data(ken_enterprises)
glimpse(ken_enterprises)
#> Rows: 6,823
#> Columns: 9
#> $ enterprise_id    <chr> "KEN_00001", "KEN_00002", "KEN_00003", "KEN_00004", "…
#> $ county           <fct> Kiambu, Kiambu, Kiambu, Kiambu, Kiambu, Kiambu, Kiamb…
#> $ region           <fct> Kiambu, Kiambu, Kiambu, Kiambu, Kiambu, Kiambu, Kiamb…
#> $ sector           <fct> Chemicals & Plastics, Chemicals & Plastics, Chemicals…
#> $ size_class       <fct> Small, Medium, Medium, Medium, Large, Large, Small, S…
#> $ employees        <int> 13, 54, 37, 24, 107, 136, 10, 14, 43, 30, 70, 42, 84,
#> $ revenue_millions <dbl> 25.1, 57.8, 51.7, 40.5, 73.7, 507.9, 25.5, 16.8, 39.9…
#> $ year_established <int> 1993, 2005, 2002, 2014, 1994, 2007, 2005, 1997, 1991,
#> $ exporter         <lgl> TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, 

Each enterprise has a revenue_millions column that serves as our measure of size for PPS sampling. We target \(n = 200\) enterprises per wave.

n_wave <- 200

Assigning permanent random numbers

Every frame unit gets a \(U(0,1)\) value drawn once and stored permanently in the register. These values are reused or updated across waves.

set.seed(1)

frame <- ken_enterprises |>
  mutate(u = runif(n()))

Computing inclusion probabilities

Coordination requires knowing each unit’s first-order inclusion probability \(\pi_k\). For PPS without replacement, sondage::inclusion_prob() computes these from the measure of size and sample size. This must be done on the full frame, not just the sample.

frame <- frame |>
  mutate(pik = inclusion_prob(revenue_millions, n_wave))

round(quantile(frame$pik), 4)
#>     0%    25%    50%    75%   100% 
#> 0.0003 0.0029 0.0067 0.0179 1.0000

Positive Coordination: Maximum Overlap

To maximise overlap between waves, reuse the same PRN values. With sequential Poisson sampling (SPS), two waves drawn from the same frame with identical PRNs produce identical samples.

wave1 <- sampling_design() |>
  draw(n = n_wave, method = "pps_sps", mos = revenue_millions, prn = u) |>
  execute(frame, seed = 42)

wave2 <- sampling_design() |>
  draw(n = n_wave, method = "pps_sps", mos = revenue_millions, prn = u) |>
  execute(frame, seed = 123)

overlap_positive <- length(intersect(wave1$enterprise_id,
                                     wave2$enterprise_id))
overlap_positive
#> [1] 200

Perfect overlap. This is useful when the goal is to track the same units over time, for example measuring quarterly changes in revenue within a fixed panel.

Negative Coordination: Minimum Overlap

To minimise overlap, update the PRN values between waves using Tillé (2006, sec. 8.2) formula:

\[u_k^{(t+1)} = \bigl(u_k^{(t)} - \pi_k\bigr) \bmod 1\]

Selected units (those with small \(u_k\)) get pushed to the high end of \((0,1)\), making them unlikely to be selected again. Non-selected units shift downward, increasing their chance of selection.

# Wave 1
wave1 <- sampling_design() |>
  draw(n = n_wave, method = "pps_sps", mos = revenue_millions, prn = u) |>
  execute(frame, seed = 1)

# Update PRNs on the full frame
frame <- frame |>
  mutate(u = (u - pik) %% 1)

# Wave 2
wave2 <- sampling_design() |>
  draw(n = n_wave, method = "pps_sps", mos = revenue_millions, prn = u) |>
  execute(frame, seed = 1)

overlap_negative <- length(intersect(wave1$enterprise_id, wave2$enterprise_id))
overlap_negative
#> [1] 43

Why isn’t the overlap zero?

Not all overlap can be eliminated. Units with large \(\pi_k\) (big enterprises) are selected in almost every wave because their inclusion probability is close to or equal to 1. The theoretical minimum overlap, the Anticipated Lower Bound (ALB), is:

\[\text{ALB} = \sum_{k=1}^{N} \max(0,\; 2\pi_k - 1)\]

Only units with \(\pi_k > 0.5\) contribute to forced overlap.

alb <- sum(pmax(0, 2 * frame$pik - 1))
expected_independent <- sum(frame$pik^2)

tibble(x = c("ALB (theoretical minimum)",
             "Observed overlap",
             "Expected if independent",
             "AUB (theoretical max)"),
       y = c(round(alb), overlap_negative,
             round(expected_independent), n_wave))
#> # A tibble: 4 × 2
#>   x                             y
#>   <chr>                     <dbl>
#> 1 ALB (theoretical minimum)    40
#> 2 Observed overlap             43
#> 3 Expected if independent      66
#> 4 AUB (theoretical max)       200

The negative coordination brings the overlap close to the ALB, well below what independent sampling would produce.

Comparison: Three Strategies

# Independent sampling: fresh PRNs for wave 2
set.seed(1)
frame_ind <- ken_enterprises |>
  mutate(u1 = runif(n()),
         u2 = runif(n()))

wave1_ind <- sampling_design() |>
  draw(n = n_wave, method = "pps_sps", mos = revenue_millions, prn = u1) |>
  execute(frame_ind, seed = 1)

wave2_ind <- sampling_design() |>
  draw(n = n_wave, method = "pps_sps", mos = revenue_millions, prn = u2) |>
  execute(frame_ind, seed = 1)

overlap_ind <- length(intersect(wave1_ind$enterprise_id,
                                wave2_ind$enterprise_id))

tibble(strategy = c("Positive (same PRN)", "Independent (fresh PRN)", "Negative (Tillé update)"),
       overlap = c(overlap_positive, overlap_ind, overlap_negative),
       bound = c(n_wave, round(expected_independent), round(alb))) |>
  mutate(pct = paste0(round(100 * overlap / n_wave), "%"))
#> # A tibble: 3 × 4
#>   strategy                overlap bound pct  
#>   <chr>                     <int> <dbl> <chr>
#> 1 Positive (same PRN)         200   200 100% 
#> 2 Independent (fresh PRN)      70    66 35%  
#> 3 Negative (Tillé update)      43    40 22%

Multi-Wave Rotation

In practice, surveys run for many waves. The PRN update is applied after each wave, rotating units through the sample.

set.seed(2026)

frame <- ken_enterprises |>
  mutate(u = runif(n()),
         pik = inclusion_prob(revenue_millions, n_wave))

n_waves <- 5
waves <- vector("list", n_waves)

for (t in seq_len(n_waves)) {
  waves[[t]] <- sampling_design() |>
    draw(n = n_wave, method = "pps_sps", mos = revenue_millions, prn = u) |>
    execute(frame, seed = 1)

  # Update PRNs for next wave
  frame <- frame |>
    mutate(u = (u - pik) %% 1)
}

Pairwise overlap matrix

ids <- lapply(waves, `[[`, "enterprise_id")

overlap_mat <- outer(ids, ids, Vectorize(\(a, b) length(intersect(a, b))))
dimnames(overlap_mat) <- rep(list(paste("Wave", seq_len(n_waves))), 2)

overlap_mat
#>        Wave 1 Wave 2 Wave 3 Wave 4 Wave 5
#> Wave 1    200     38     65     60     61
#> Wave 2     38    200     45     61     52
#> Wave 3     65     45    200     49     62
#> Wave 4     60     61     49    200     55
#> Wave 5     61     52     62     55    200

Consecutive waves have low overlap (near the ALB). Non-consecutive waves show varying overlap as the rotation cycle progresses.

Cumulative coverage

After \(T\) waves, how many distinct enterprises have been surveyed?

cumulative <- integer(n_waves)
seen <- character(0)

for (t in seq_len(n_waves)) {
  seen <- union(seen, waves[[t]]$enterprise_id)
  cumulative[t] <- length(seen)
}

tibble(wave = seq_len(n_waves),
       new_units = c(n_wave, diff(cumulative)),
       cumulative = cumulative,
       pct_frame = paste0(round(100 * cumulative / nrow(ken_enterprises), 1), "%"))
#> # A tibble: 5 × 4
#>    wave new_units cumulative pct_frame
#>   <int>     <dbl>      <int> <chr>    
#> 1     1       200        200 2.9%     
#> 2     2       162        362 5.3%     
#> 3     3       122        484 7.1%     
#> 4     4       112        596 8.7%     
#> 5     5       103        699 10.2%

Negative coordination maximises the number of distinct units covered, which is valuable for building comprehensive registers or spreading respondent burden.

Handling Births and Deaths

Real business registers change between waves. New enterprises (births) enter the frame; closed enterprises (deaths) leave it.

  • Deaths: Simply remove them from the frame. Their PRN values are discarded. No special treatment needed.
  • Births: Assign fresh \(U(0,1)\) values. These new units enter the coordination cycle at a random position.
# Simulate: remove 200 enterprises (deaths), add 500 new ones (births)
set.seed(123)
n_births <- 500

gen_data <- function(x, n) {
  sample(levels(x), n, replace = TRUE) |>
    factor(levels = levels(x))
}

# Deaths: remove a random subset
deaths <- sample(frame$enterprise_id, 200)
frame_t2 <- frame |>
  filter(!enterprise_id %in% deaths)

# Births: new enterprises with fresh PRNs
# Resample revenue from the existing frame to get a realistic size distribution
births <- tibble(enterprise_id = paste0("NEW_", sprintf("%05d", seq_len(n_births))),
                 county = gen_data(ken_enterprises$county, n_births),
                 region = gen_data(ken_enterprises$region, n_births),
                 sector = gen_data(ken_enterprises$sector, n_births),
                 size_class = gen_data(ken_enterprises$size_class, n_births),
                 employees = sample(ken_enterprises$employees, n_births, replace = TRUE),
                 revenue_millions = sample(ken_enterprises$revenue_millions, n_births, replace = TRUE),
                 year_established = sample(ken_enterprises$year_established, n_births, replace = TRUE),
                 exporter = sample(ken_enterprises$exporter, n_births, replace = TRUE),
                 u = runif(n_births))

frame_t2 <- bind_rows(frame_t2, births)

# Recompute inclusion probabilities on the new frame
frame_t2 <- frame_t2 |>
  mutate(pik = inclusion_prob(revenue_millions, n_wave))

# Draw wave from updated frame
wave_t2 <- sampling_design() |>
  draw(n = n_wave, method = "pps_sps", mos = revenue_millions, prn = u) |>
  execute(frame_t2, seed = 1)

# How many births were selected?
n_births_selected <- sum(wave_t2$enterprise_id %in% births$enterprise_id)
n_births_selected
#> [1] 13
n_births_selected/n_births
#> [1] 0.026

The coordination mechanism automatically handles frame changes. Surviving units retain their updated PRN values and births start with fresh PRN.

Using Pareto Sampling

The pps_pareto method also supports PRN-based coordination. In theory, Pareto sampling has maximum entropy among fixed-size PPS designs that use PRNs, making it particularly well-suited for coordination.

set.seed(12345)

frame <- ken_enterprises |>
  mutate(u = runif(n()),
         pik = inclusion_prob(revenue_millions, n_wave))

wave1_par <- sampling_design() |>
  draw(n = n_wave, method = "pps_pareto", mos = revenue_millions, prn = u) |>
  execute(frame, seed = 1)

frame <- frame |> mutate(u = (u - pik) %% 1)

wave2_par <- sampling_design() |>
  draw(n = n_wave, method = "pps_pareto", mos = revenue_millions, prn = u) |>
  execute(frame, seed = 1)

overlap_par <- length(intersect(wave1_par$enterprise_id,
                                wave2_par$enterprise_id))

overlap_par
#> [1] 39
alb <- round(sum(pmax(0, 2 * frame$pik - 1)))
alb
#> [1] 40

Stratified Coordination

When the design is stratified, inclusion probabilities must be computed within each stratum. The PRN update is then applied globally. The per-stratum \(\pi_k\) values ensure correct rotation within each group.

set.seed(1980)

frame <- ken_enterprises |>
  mutate(u = runif(n()))

# Compute within-stratum pik
frame <- frame |>
  mutate(pik = inclusion_prob(revenue_millions, 50),
         .by = size_class)

max_pik <- max(frame$pik)
max_pik
#> [1] 0.6451258
alb <- round(sum(pmax(0, 2 * frame$pik - 1)))
alb
#> [1] 1

wave1_str <- sampling_design() |>
  stratify_by(size_class) |>
  draw(n = 50, method = "pps_pareto", mos = revenue_millions, prn = u) |>
  execute(frame, seed = 1)

frame <- frame |> mutate(u = (u - pik) %% 1)

wave2_str <- sampling_design() |>
  stratify_by(size_class) |>
  draw(n = 50, method = "pps_pareto", mos = revenue_millions, prn = u) |>
  execute(frame, seed = 1)

overlap_str <- length(intersect(wave1_str$enterprise_id,
                                wave2_str$enterprise_id))
overlap_str
#> [1] 1

With smaller within-stratum \(\pi_k\) values (all below 0.5), the ALB is zero and perfect negative coordination is achievable.

Collocated Sampling: Coordinating Across Surveys

The examples above coordinate samples from the same survey across time (waves). A related problem is coordinating different surveys drawn from the same frame at the same time point. This is called collocated sampling (Ernst 1986; Ohlsson 1995) and is common in business survey programmes where a single register serves multiple survey operations.

The mechanism is identical: all surveys share the same PRN column.

  • Positive coordination (maximise overlap): both surveys use the same PRN. Useful when surveys share common modules or when joint analysis is planned.
  • Negative coordination (minimise overlap): one survey uses shifted PRNs (\(u' = (u + 0.5) \bmod 1\)). Useful when the goal is to spread respondent burden across the register.

Example: Turnover survey and employment survey

Suppose a statistical agency runs two surveys on the same business register: a revenue survey (\(n = 200\), PPS on revenue_millions) and an employment survey (\(n = 150\), PPS on employees). The surveys have different sample sizes, different measures of size, and different stratification, but they share the same frame and the same PRN column.

set.seed(123)

frame <- ken_enterprises |>
  mutate(u = runif(n()))

# Survey A: revenue survey (PPS on revenue_millions, stratified by sector)
survey_a <- sampling_design() |>
  draw(n = 200, method = "pps_sps", mos = revenue_millions, prn = u) |>
  execute(frame, seed = 1)

# Survey B: employment survey (PPS on employees), positive coordination
survey_b <- sampling_design() |>
  draw(n = 150, method = "pps_sps", mos = employees, prn = u) |>
  execute(frame, seed = 1)

overlap_pos <- length(intersect(survey_a$enterprise_id,
                                survey_b$enterprise_id))
overlap_pos
#> [1] 132

For negative coordination, shift the PRN by 0.5:

# Survey B with negative coordination
frame_neg <- frame |>
  mutate(u_neg = (u + 0.5) %% 1)

survey_b_neg <- sampling_design() |>
  draw(n = 150, method = "pps_sps", mos = employees, prn = u_neg) |>
  execute(frame_neg, seed = 1)

overlap_neg <- length(intersect(survey_a$enterprise_id,
                                survey_b_neg$enterprise_id))
overlap_neg
#> [1] 18

Compare with independent sampling (fresh PRNs):

frame_ind <- frame |>
  mutate(u_fresh = runif(n()))

survey_b_ind <- sampling_design() |>
  draw(n = 150, method = "pps_sps", mos = employees, prn = u_fresh) |>
  execute(frame_ind, seed = 1)

overlap_ind <- length(intersect(survey_a$enterprise_id,
                                survey_b_ind$enterprise_id))
overlap_ind
#> [1] 33

Overlap bounds for different surveys

When two surveys have different inclusion probabilities \(\pi_k^A\) and \(\pi_k^B\), the bounds generalise naturally:

pik_a <- inclusion_prob(frame$revenue_millions, 200)
pik_b <- inclusion_prob(frame$employees, 150)

alb_ab <- sum(pmax(0, pik_a + pik_b - 1))
aub_ab <- sum(pmin(pik_a, pik_b))
exp_ind <- sum(pik_a * pik_b)

tibble(strategy = c("Positive (same PRN)", "Independent (fresh PRN)", "Negative (shifted PRN)"),
       overlap = c(overlap_pos, overlap_ind, overlap_neg),
       bound = c(round(aub_ab), round(exp_ind), round(alb_ab)))
#> # A tibble: 3 × 3
#>   strategy                overlap bound
#>   <chr>                     <int> <dbl>
#> 1 Positive (same PRN)         132   133
#> 2 Independent (fresh PRN)      33    42
#> 3 Negative (shifted PRN)       18    18

Because revenue_millions and employees are correlated (large firms have both high revenue and many employees), the overlap under positive coordination is substantial. Negative coordination via PRN shift effectively spreads the burden.

Scaling to more than two surveys

For \(K\) surveys, shift each survey’s PRN by \(j/K\) for \(j = 0, \ldots, K-1\):

# Negative coordination across 4 surveys
n_surveys <- 4
for (j in seq_len(n_surveys)) {
  frame_j <- frame |>
    mutate(u_j = (u + (j - 1) / n_surveys) %% 1)

  samples[[j]] <- sampling_design() |>
    draw(n = n_j, method = "pps_sps", mos = mos_j, prn = u_j) |>
    execute(frame_j, seed = 1)
}

This simple shift strategy works well when the surveys have similar \(\pi_k\). For more complex programmes (varying burden weights, priority orderings, or dozens of surveys), NSOs use dedicated coordination systems such as the Rivière method (Rivière 2001), the Netherlands EDS (De Ree 1983; Van Huis, De Ree, and Grotendorst 1994), or the Swiss method (Graf and Qualité 2014). These methods track cumulative burden per unit and permute PRNs within microstrata to achieve near-optimal coordination across the full survey programme. See Tillé (2020, secs. 8.2.7–8.2.9) for a detailed treatment.

Summary

The PRN coordination workflow in samplyr is:

  1. Assign PRNs once: mutate(u = runif(n()))
  2. Compute \(\pi_k\) on the full frame: mutate(pik = inclusion_prob(mos, n))
  3. Draw the sample: sampling_design() |> draw(method = "pps_sps", prn = u) |> execute(frame)
  4. Update or shift PRNs:
    • Temporal negative coordination (waves): mutate(u = (u - pik) %% 1)
    • Temporal positive coordination: keep u unchanged
    • Cross-survey negative coordination: shift by (u + 0.5) %% 1
    • Cross-survey positive coordination: reuse same u
Goal PRN update Expected overlap
Maximum overlap None (reuse) \(\sum \min(\pi_k^A, \pi_k^B)\)
Independent Fresh \(U(0,1)\) \(\sum \pi_k^A \cdot \pi_k^B\)
Minimum overlap Shift or Tillé update \(\sum \max(0, \pi_k^A + \pi_k^B - 1)\)

Reference

Brewer, K. R. W., L. J. Early, and S. F. Joyce. 1972. “Selecting Several Samples from a Single Population.” Australian Journal of Statistics 14 (3): 231–39.
Cotton, F., and C. Hesse. 1992. “Tirages Coordonnés d’échantillons.” In Actes Des Journées de méthodologie Statistique. INSEE.
De Ree, S. J. M. 1983. “The EDS Sampling System for Business Surveys at the Netherlands Central Bureau of Statistics.” In Proceedings of the Section on Survey Research Methods, 197–202. ASA.
Ernst, Lawrence R. 1986. “Maximizing the Overlap Between Surveys When Information Is Incomplete.” European Journal of Operational Research 27: 192–200.
Graf, Éric, and Lionel Qualité. 2014. “Sondage Dans Des Registres de Population Et de ménages En Suisse : Coordination d’enquêtes, Pondération Et Imputation.” Journal de La Société Française de Statistique 155 (4): 95–133.
Ohlsson, Esbjorn. 1995. “Coordination of Samples Using Permanent Random Numbers.” In Business Survey Methods, edited by Brenda G. Cox et al. Wiley.
Rivière, Pascal. 2001. “Coordination d’échantillons Par La méthode Des Microstrates.” In Recueil Du Symposium 2001 de Statistique Canada. Statistics Canada.
Tillé, Yves. 2006. Sampling Algorithms. Springer.
———. 2020. Sampling and Estimation from Finite Populations. Wiley.
Van Huis, L. T., S. J. M. De Ree, and A. Grotendorst. 1994. “A Business Survey Redesign at Statistics Netherlands.” International Statistical Review 62 (3): 357–76.