Skip to contents

Reconstructs the second-order design quantities for PPS stages by replaying the design specification against the original sampling frame. For without-replacement (WOR) stages, this produces the joint inclusion probabilities \(\pi_{kl}\). For with-replacement (WR) and PMR stages, this produces the joint expected hits \(E(n_k \cdot n_l)\).

Usage

joint_expectation(x, frame, stage = NULL)

Arguments

x

A tbl_sample object produced by execute().

frame

The data frame originally passed to execute(). Must contain the same columns used during sampling (strata variables, cluster variables, measure of size).

stage

An integer vector of stage numbers to compute, or NULL (default) to compute all PPS stages. Non-PPS stages produce NULL entries in the returned list.

Value

A named list of length equal to the number of executed stages. Each element is either:

  • For PPS WOR stages: a square matrix of joint inclusion probabilities \(\pi_{kl}\), usable with survey::ppsmat() for exact variance estimation.

  • For PPS WR/PMR stages (pps_multinomial, pps_chromy): a square matrix of joint expected hits \(E(n_k \cdot n_l)\).

  • NULL for non-PPS stages (SRS, systematic) or stages not requested via the stage argument. Dimensions match the number of unique sampled units (or sampled clusters for clustered stages). Rows and columns are ordered to match the sample.

Details

For each PPS stage, the function:

  1. Reconstructs the full-population first-order quantities from the frame using the stage's method and measure of size

  2. Dispatches to the appropriate sondage joint probability or joint expected hits function

  3. Extracts the submatrix corresponding to sampled units

For stratified stages, the target sample size per stratum (n_h) is reconstructed by replaying the same allocation logic used during execute() (proportional, Neyman, optimal, etc.) against the frame. This ensures first-order quantities match what was computed at sampling time, regardless of allocation method.

For stratified or conditional (within-cluster) stages, joint quantities are computed independently within each group and assembled into a block-diagonal matrix.

Exact vs. approximate computation

The accuracy of the returned matrix depends on the sampling method. Some algorithms yield closed-form joint probabilities; others require approximation or simulation.

WOR methods (\(\pi_{kl}\))

samplyr methodsondage functionQuality
pps_cpsjoint_inclusion_prob()Exact (Aires' formula via C)
pps_systematicjoint_inclusion_prob()Exact (combinatorial enumeration)
pps_poissonjoint_inclusion_prob()Exact (\(\pi_{kl} = \pi_k \pi_l\), independent draws)
pps_brewerjoint_inclusion_prob()Approximate\(^*\) (high-entropy / Hajek-Brewer-Donadio)
pps_spsjoint_inclusion_prob()Approximate (high-entropy / Hajek-Brewer-Donadio)
pps_paretojoint_inclusion_prob()Approximate (high-entropy / Hajek-Brewer-Donadio)
balancedjoint_inclusion_prob()Approximate (high-entropy / Hajek-Brewer-Donadio)

\(^*\) Exact recursive formulas for Brewer's joint inclusion probabilities exist (Brewer 2002, ch. 9) but are \(O(N^3)\), making them impractical for frames of more than a few hundred units. The high-entropy approximation is \(O(N^2)\) and sufficiently accurate for variance estimation in practice. The same trade-off applies to SPS and Pareto, whose exact joint probabilities would require enumerating the combinatorial sample space.

The high-entropy approximation assumes the design is close to the maximum-entropy design with the same marginal \(\pi_i\) (Hajek 1964; Brewer and Donadio 2003). This is a good approximation for most PPS designs and is the same quantity that underlies the Berger (2004) variance estimator used by survey::svydesign(pps = "brewer"). For CPS (conditional Poisson / maximum entropy), the joint probabilities are exact by definition.

WR/PMR methods (\(E(n_k \cdot n_l)\))

samplyr methodsondage functionQuality
pps_multinomialjoint_expected_hits()Exact (analytic: \(n(n-1) p_k p_l + n p_k \mathbf{1}_{k=l}\))
pps_chromyjoint_expected_hits()Approximate (Monte Carlo simulation, 10 000 replicates)

For pps_chromy, the sequential dependence structure does not admit a closed-form expression for \(E(n_k \cdot n_l)\). sondage uses Monte Carlo simulation (default 10 000 replicates) to estimate the pairwise expectations. Increasing nsim (passed through ...) reduces Monte Carlo error at the cost of computation time.

Limitations

  • Requires the original sampling frame. The frame must be unchanged from what was passed to execute().

  • Units in the frame must be uniquely identifiable within each stratum/cluster group by their column values.

  • For WOR designs with certainty selections (\(\pi_i = 1\)), the joint matrix is decomposed: certainty units are separated from the stochastic part, the joint probabilities for non-certainty units are computed from the reduced \(\pi\) vector, and the full matrix is reassembled with \(\pi_{ij} = 1\) for certainty pairs and \(\pi_{ij} = \pi_j\) for certainty x non-certainty pairs.

See also

as_svydesign() for the default export using Brewer's approximation, survey::ppsmat() for wrapping joint matrices

Examples

sample <- sampling_design() |>
  add_stage() |>
    stratify_by(region) |>
    cluster_by(ea_id) |>
    draw(n = 5, method = "pps_brewer", mos = households) |>
  add_stage() |>
    draw(n = 12) |>
  execute(bfa_eas, seed = 2025)

# Compute joint probabilities for stage 1
jip <- joint_expectation(sample, bfa_eas, stage = 1)

# Use with survey package for exact variance (WOR stages)
svy <- as_svydesign(sample, pps = survey::ppsmat(jip[[1]]))

# Compute all PPS stages at once
jip_all <- joint_expectation(sample, bfa_eas)