Compute pairwise joint expectations from a sample

Reconstructs the second-order design quantities for PPS stages. For without-replacement (WOR) stages, this produces the joint inclusion probabilities \(\pi_{kl}\). For with-replacement (WR) and probability-minimum-replacement (PMR) stages, this produces the joint expected hits \(E(n_k \cdot n_l)\).

Usage

joint_expectation(x, frame = NULL, stage = NULL, nsim = 10000L)

Arguments

x: A tbl_sample object produced by execute().
frame: The data frame originally passed to execute(). Must contain the same columns used during sampling (strata variables, cluster variables, measure of size). When NULL (the default), the computation uses the frame digest recorded on the sample instead.
stage: An integer vector of stage numbers to compute, or NULL (default) to compute all PPS stages. Non-PPS stages produce NULL entries in the returned list.
nsim: Positive integer number of simulations used for Chromy's pairwise expected hits (default 10000). Also forwarded to registered WR joint_fns that explicitly declare an nsim formal. Ignored by analytic methods.

Value

A named list of length equal to the number of executed stages. Each element is one of:

For PPS WOR stages: a square matrix of joint inclusion probabilities \(\pi_{kl}\), usable with survey::ppsmat() for exact variance estimation.
For PPS WR/PMR stages (pps_multinomial, pps_chromy): a square matrix of joint expected hits \(E(n_k \cdot n_l)\).
NULL for non-PPS stages (SRS, systematic) or stages not requested via the stage argument.

Rows and columns represent stage-specific sampled units in first-appearance order. At a WR stage, repeated hits of the same population unit appear once, so dimensions match the number of distinct sampled units (or clusters). At a later stage below a WR parent, each parent draw occurrence defines a separate conditional block, so the same child population identity can appear in more than one block.

Details

Without frame, the computation runs off the frame digest recorded at execution. The digest holds each pool's exact resolved chance vector, which is all the joint computation needs, so a sample that traveled without its (possibly confidential) frame still yields exact joint expectations. This requires an exact chance representation: cluster stages always have one, element stages with constant chances have one, and element stages with varying chances keep one only under execute(frame_digest = "full"). A summarized representation is rejected rather than approximated. With frame, the quantities are replayed against the supplied frame, which must be unchanged since execution (validate_frame() reports drift).

For each PPS stage, the function:

Reconstructs the full-population first-order quantities from the frame using the stage's method and measure of size (or takes them from the frame digest when frame is NULL)
Dispatches to the appropriate sondage joint probability or joint expected hits function
Extracts the submatrix corresponding to sampled units
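
The three steps above can be sketched in base R for the simplest case, Poisson PPS, where the joint probabilities are plain products. This is only an illustration under stated assumptions: the frame, the sampled ids, and the expected sample size are made up, and the real function dispatches to sondage rather than computing anything this way.

```r
# Hypothetical frame: 6 units with a measure of size (all values made up).
frame <- data.frame(id = 1:6, mos = c(10, 20, 30, 40, 50, 50))
n <- 3  # expected sample size (assumed)

# Step 1: first-order inclusion probabilities, pi_k = n * mos_k / sum(mos).
# No unit reaches pi_k >= 1 here, so no certainty handling is needed.
pik <- n * frame$mos / sum(frame$mos)

# Step 2: full-population joint matrix. Under Poisson sampling the draws are
# independent, so pi_kl = pi_k * pi_l off the diagonal and pi_kk = pi_k.
joint_full <- outer(pik, pik)
diag(joint_full) <- pik

# Step 3: keep only the rows and columns of the sampled units, in sample order.
sampled_ids <- c(5, 2, 6)  # hypothetical sample
idx <- match(sampled_ids, frame$id)
joint_sample <- joint_full[idx, idx]
dimnames(joint_sample) <- list(sampled_ids, sampled_ids)
joint_sample
```

The diagonal of the extracted matrix holds the first-order probabilities of the sampled units, which is a quick sanity check on any returned WOR matrix.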

For stratified stages, the target sample size per stratum (n_h) is reconstructed by replaying the same allocation logic used during execute() (proportional, Neyman, optimal, etc.) against the frame. This ensures first-order quantities match what was computed at sampling time, regardless of allocation method.

For stratified or conditional (within-cluster) stages, joint quantities are computed independently within each group. Blocks follow their first appearance in the sample, as do units within a block. Cross-block entries are products of the corresponding marginal chances.
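
The block structure described above can be sketched as follows. The example assumes two hypothetical strata sampled by equal-probability WOR selection, chosen only because their joint probabilities are easy to check by hand; `srswor_joint()` is an illustrative helper, not part of samplyr or sondage.

```r
# Within-block joint inclusion probabilities for simple random sampling
# without replacement: pi_k = n / N and pi_kl = n (n - 1) / (N (N - 1)).
srswor_joint <- function(N, n) {
  m <- matrix(n * (n - 1) / (N * (N - 1)), N, N)
  diag(m) <- n / N
  m
}

# Two hypothetical blocks (strata), each computed independently.
blocks <- list(a = srswor_joint(3, 2), b = srswor_joint(5, 2))

# Marginal probabilities and block membership, in block order.
pik <- unlist(lapply(blocks, diag), use.names = FALSE)
block_id <- rep(names(blocks), times = vapply(blocks, nrow, integer(1)))

# Start from the product of marginals, which is the cross-block value
# because blocks are sampled independently of each other.
joint <- outer(pik, pik)

# Overwrite each diagonal block with its own within-block joint matrix.
for (b in names(blocks)) {
  in_b <- block_id == b
  joint[in_b, in_b] <- blocks[[b]]
}
joint
```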

A stage below a WR parent is conditional on each parent draw occurrence, not only on the parent's population identity. Repeated hits of one parent therefore produce separate independent child blocks. Pair the returned matrix with stage-specific identities in this order; do not pair it blindly with every sample row when descendants duplicate a selected unit.

Exact vs. approximate computation

The accuracy of the returned matrix depends on the sampling method. Some algorithms yield closed-form joint probabilities; others require approximation or simulation.

WOR methods (\(\pi_{kl}\))

samplyr method	sondage function	Quality
`pps_cps`	`joint_inclusion_prob()`	Exact (Aires' formula via C)
`pps_sampford`	`joint_inclusion_prob()`	Exact (Sampford design)
`pps_systematic`	`joint_inclusion_prob()`	Exact (circular-interval overlap)
`pps_poisson`	`joint_inclusion_prob()`	Exact (\(\pi_{kl} = \pi_k \pi_l\), independent draws)
`pps_brewer`	`joint_inclusion_prob()`	Approximate\(^*\) (high-entropy / Hajek-Brewer-Donadio)
`pps_sps`	`joint_inclusion_prob()`	Approximate (high-entropy / Hajek-Brewer-Donadio)
`pps_pareto`	`joint_inclusion_prob()`	Approximate (high-entropy / Hajek-Brewer-Donadio)
`cube`	`joint_inclusion_prob()`	Approximate when unconstrained (high-entropy / Hajek-Brewer-Donadio)
`lpm2`	unavailable	Spatial spreading is not represented
`scps`	unavailable	Spatial spreading is not represented

\(^*\) Exact recursive formulas for Brewer's joint inclusion probabilities exist (Brewer 2002, ch. 9) but are \(O(N^3)\), making them impractical for frames of more than a few hundred units. The high-entropy approximation is \(O(N^2)\) and sufficiently accurate for variance estimation in practice. The same trade-off applies to SPS and Pareto, whose exact joint probabilities would require enumerating the combinatorial sample space.

The high-entropy approximation assumes the design is close to the maximum-entropy design with the same marginal \(\pi_i\) (Hajek 1964; Brewer and Donadio 2003). This is a good approximation for most PPS designs and is the same quantity that underlies the Berger (2004) variance estimator used by survey::svydesign(pps = "brewer"). For CPS (conditional Poisson / maximum entropy), the joint probabilities are exact by definition.
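
For orientation, one classical form of the high-entropy approximation is Hajek's (1964), which needs only the first-order probabilities over the population \(U\). It is shown here as an illustration of the idea; whether sondage uses exactly this variant or the Brewer-Donadio refinement is not stated in this page.

\[
\pi_{kl} \approx \pi_k \pi_l \left[ 1 - \frac{(1 - \pi_k)(1 - \pi_l)}{d} \right],
\qquad d = \sum_{i \in U} \pi_i (1 - \pi_i), \qquad k \neq l .
\]

The correction term shrinks as \(d\) grows, so for large frames the approximate joint probabilities approach the independence product \(\pi_k \pi_l\) from below.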

Bounded cube, LPM2, and SCPS designs are rejected because count constraints and spatial spreading alter pairwise selection behavior beyond the available approximation. Use as_svrepdesign() with type = "subbootstrap" for a generic bootstrap approximation instead.

WR/PMR methods (\(E(n_k \cdot n_l)\))

samplyr method	sondage function	Quality
`pps_multinomial`	`joint_expected_hits()`	Exact (analytic: \(n(n-1) p_k p_l + n p_k \mathbf{1}_{k=l}\))
`pps_chromy`	`joint_expected_hits()`	Approximate (Monte Carlo simulation, nsim replicates; default 10 000)

For pps_chromy, the sequential dependence structure does not admit a closed-form expression for \(E(n_k \cdot n_l)\). sondage uses Monte Carlo simulation (default 10 000 replicates) to estimate the pairwise expectations. Increasing nsim reduces Monte Carlo error at the cost of computation time.
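
The effect of nsim can be illustrated on the multinomial design, where the exact answer is available for comparison. The sketch below is base R only and does not call sondage; it simulates multinomial draws (not Chromy's algorithm), and the probabilities and number of draws are made up.

```r
set.seed(1)
p <- c(0.1, 0.2, 0.3, 0.4)  # hypothetical draw probabilities
n <- 5                      # number of draws

# Exact: E(n_k n_l) = n (n - 1) p_k p_l + n p_k 1{k = l}
exact <- n * (n - 1) * outer(p, p) + diag(n * p)

# Monte Carlo estimate: average the outer product of simulated hit counts.
mc_joint <- function(nsim) {
  hits <- rmultinom(nsim, size = n, prob = p)  # one column per replicate
  tcrossprod(hits) / nsim
}

# Maximum absolute error for a small and a large number of replicates.
max(abs(mc_joint(100) - exact))
max(abs(mc_joint(10000) - exact))
```

Monte Carlo error falls roughly with the square root of the number of replicates, so a hundredfold increase in nsim buys about one extra decimal digit of accuracy.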

Limitations

The frame-free path requires a digest with exact chances: the default summary digest suffices for cluster stages and constant-chance element stages; element stages with varying chances need execute(frame_digest = "full"). Otherwise pass the frame.
When frame is supplied it must be unchanged from what was passed to execute(), and units in it must be uniquely identifiable within each stratum/cluster group by their column values.
For WOR designs with certainty selections (\(\pi_i = 1\)), the joint matrix is decomposed: certainty units are separated from the stochastic part, the joint probabilities for non-certainty units are computed from the reduced \(\pi\) vector, and the full matrix is reassembled with \(\pi_{ij} = 1\) when both units are certainties and \(\pi_{ij} = \pi_j\) when \(i\) is a certainty unit and \(j\) is not.
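
The certainty decomposition in the last item can be sketched as below. The probabilities are hypothetical, and the joint matrix for the non-certainty units uses an independence product purely as a stand-in for whatever the design-specific method would return.

```r
# Hypothetical first-order probabilities; units 2 and 4 are certainties.
pik <- c(0.3, 1, 0.5, 1, 0.2)
certain <- pik >= 1

# Stand-in for the design-specific joint matrix of the non-certainty units
# (an independence product, assumed here only for illustration).
pik_red <- pik[!certain]
joint_red <- outer(pik_red, pik_red)
diag(joint_red) <- pik_red

# Reassemble. outer(pik, pik) already yields 1 for certainty pairs and the
# non-certainty unit's own probability for mixed pairs, so only the
# stochastic block needs to be replaced.
joint <- outer(pik, pik)
joint[!certain, !certain] <- joint_red
joint
```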

Examples

sample <- sampling_design() |>
  add_stage() |>
    stratify_by(region) |>
    cluster_by(ea_id) |>
    draw(n = 5, method = "pps_brewer", mos = households) |>
  add_stage() |>
    draw(n = 12) |>
  execute(bfa_eas, seed = 2025)

# Compute joint probabilities for stage 1
jip <- joint_expectation(sample, bfa_eas, stage = 1)

# Use with survey package for exact variance (WOR stages)
svy <- as_svydesign(sample, pps = survey::ppsmat(jip[[1]]))
#> Warning: Exact PPS variance (`pps`) is single-stage in survey.
#> ℹ Exporting the stage-1 design only; later-stage sampling variance is not
#>   represented.
#> ℹ Omit `pps` for exact multi-stage linearization with Brewer's approximation at
#>   the PPS stage.

# Compute all PPS stages at once
jip_all <- joint_expectation(sample, bfa_eas)