Compute pairwise joint expectations from a sample and its frame
Source:R/joint.R
joint_expectation.RdReconstructs the second-order design quantities for PPS stages by replaying the design specification against the original sampling frame. For without-replacement (WOR) stages, this produces the joint inclusion probabilities \(\pi_{kl}\). For with-replacement (WR) and PMR stages, this produces the joint expected hits \(E(n_k \cdot n_l)\).
Arguments
- x
A
tbl_sampleobject produced byexecute().- frame
The data frame originally passed to
execute(). Must contain the same columns used during sampling (strata variables, cluster variables, measure of size).- stage
An integer vector of stage numbers to compute, or
NULL(default) to compute all PPS stages. Non-PPS stages produceNULLentries in the returned list.
Value
A named list of length equal to the number of executed stages. Each element is either:
For PPS WOR stages: a square matrix of joint inclusion probabilities \(\pi_{kl}\), usable with
survey::ppsmat()for exact variance estimation.For PPS WR/PMR stages (
pps_multinomial,pps_chromy): a square matrix of joint expected hits \(E(n_k \cdot n_l)\).NULLfor non-PPS stages (SRS, systematic) or stages not requested via thestageargument. Dimensions match the number of unique sampled units (or sampled clusters for clustered stages). Rows and columns are ordered to match the sample.
Details
For each PPS stage, the function:
Reconstructs the full-population first-order quantities from the frame using the stage's method and measure of size
Dispatches to the appropriate sondage joint probability or joint expected hits function
Extracts the submatrix corresponding to sampled units
For stratified stages, the target sample size per stratum (n_h) is
reconstructed by replaying the same allocation logic used during
execute() (proportional, Neyman, optimal, etc.) against the
frame. This ensures first-order quantities match what was computed
at sampling time, regardless of allocation method.
For stratified or conditional (within-cluster) stages, joint quantities are computed independently within each group and assembled into a block-diagonal matrix.
Exact vs. approximate computation
The accuracy of the returned matrix depends on the sampling method. Some algorithms yield closed-form joint probabilities; others require approximation or simulation.
WOR methods (\(\pi_{kl}\))
| samplyr method | sondage function | Quality |
pps_cps | joint_inclusion_prob() | Exact (Aires' formula via C) |
pps_systematic | joint_inclusion_prob() | Exact (combinatorial enumeration) |
pps_poisson | joint_inclusion_prob() | Exact (\(\pi_{kl} = \pi_k \pi_l\), independent draws) |
pps_brewer | joint_inclusion_prob() | Approximate\(^*\) (high-entropy / Hajek-Brewer-Donadio) |
pps_sps | joint_inclusion_prob() | Approximate (high-entropy / Hajek-Brewer-Donadio) |
pps_pareto | joint_inclusion_prob() | Approximate (high-entropy / Hajek-Brewer-Donadio) |
balanced | joint_inclusion_prob() | Approximate (high-entropy / Hajek-Brewer-Donadio) |
\(^*\) Exact recursive formulas for Brewer's joint inclusion probabilities exist (Brewer 2002, ch. 9) but are \(O(N^3)\), making them impractical for frames of more than a few hundred units. The high-entropy approximation is \(O(N^2)\) and sufficiently accurate for variance estimation in practice. The same trade-off applies to SPS and Pareto, whose exact joint probabilities would require enumerating the combinatorial sample space.
The high-entropy approximation assumes the design is close to the
maximum-entropy design with the same marginal \(\pi_i\)
(Hajek 1964; Brewer and Donadio 2003). This is a good approximation
for most PPS designs and is the same quantity that underlies the
Berger (2004) variance estimator used by survey::svydesign(pps = "brewer"). For CPS (conditional Poisson / maximum entropy), the
joint probabilities are exact by definition.
WR/PMR methods (\(E(n_k \cdot n_l)\))
| samplyr method | sondage function | Quality |
pps_multinomial | joint_expected_hits() | Exact (analytic: \(n(n-1) p_k p_l + n p_k \mathbf{1}_{k=l}\)) |
pps_chromy | joint_expected_hits() | Approximate (Monte Carlo simulation, 10 000 replicates) |
For pps_chromy, the sequential dependence structure does not admit
a closed-form expression for \(E(n_k \cdot n_l)\).
sondage uses Monte Carlo simulation (default 10 000 replicates) to
estimate the pairwise expectations. Increasing nsim (passed
through ...) reduces Monte Carlo error at the cost of computation
time.
Limitations
Requires the original sampling frame. The frame must be unchanged from what was passed to
execute().Units in the frame must be uniquely identifiable within each stratum/cluster group by their column values.
For WOR designs with certainty selections (\(\pi_i = 1\)), the joint matrix is decomposed: certainty units are separated from the stochastic part, the joint probabilities for non-certainty units are computed from the reduced \(\pi\) vector, and the full matrix is reassembled with \(\pi_{ij} = 1\) for certainty pairs and \(\pi_{ij} = \pi_j\) for certainty x non-certainty pairs.
See also
as_svydesign() for the default export using Brewer's
approximation, survey::ppsmat() for wrapping joint matrices
Examples
sample <- sampling_design() |>
add_stage() |>
stratify_by(region) |>
cluster_by(ea_id) |>
draw(n = 5, method = "pps_brewer", mos = households) |>
add_stage() |>
draw(n = 12) |>
execute(bfa_eas, seed = 2025)
# Compute joint probabilities for stage 1
jip <- joint_expectation(sample, bfa_eas, stage = 1)
# Use with survey package for exact variance (WOR stages)
svy <- as_svydesign(sample, pps = survey::ppsmat(jip[[1]]))
# Compute all PPS stages at once
jip_all <- joint_expectation(sample, bfa_eas)