Convert a tbl_sample to a survey design object

Creates a survey::svydesign() object from a tbl_sample, using the sampling design metadata (strata, clusters, weights, and finite population corrections) captured during execute().

Usage

as_svydesign(x, ...)

# S3 method for class 'tbl_sample'
as_svydesign(x, ..., nest = TRUE, method = NULL)

Arguments

x: A tbl_sample object produced by execute().
...: Additional arguments passed to survey::svydesign(). In particular, you can pass pps = survey::ppsmat(joint_matrix) to supply exact joint inclusion probabilities instead of the default Brewer approximation (see Details).
nest: If TRUE, relabel cluster ids to enforce nesting within strata. Passed to survey::svydesign(). Default is TRUE, which is appropriate for most complex survey designs.
method: For two-phase samples, the variance method passed to survey::twophase(). One of "full", "approx", or "simple". This argument is only accepted for two-phase samples.

Value

A survey.design2 object for single-phase and multistage samples, or a twophase/twophase2 object for two-phase samples.

Details

The conversion maps samplyr's design specification to the arguments expected by survey::svydesign():

Cluster ids (ids): one formula term per executed stage. Clustered stages use the cluster_by() variable; when a stage clusters by several variables, their combination (which execution treats as a single cluster id) is collapsed into one synthesized interaction column, because survey::svydesign() reads each formula term as a separate sampling stage. A final unclustered stage (elements sampled within the previous stage's clusters) gets a synthesized row-identity column so that its sampling variance is represented. For WR/PMR stages, the .draw_k column is used as the sampling unit identifier instead (each draw is treated as an independent unit for Hansen–Hurwitz variance estimation).
Strata (strata): one term per stage, aligned with ids. A stage stratified by several variables exports their cross-classification as a single synthesized interaction column (survey silently ignores extra variables within a stage's term). Trailing unstratified stages are omitted; unstratified stages before a stratified stage get a constant placeholder column.
Weights (weights): the .weight column – the compound weight across all stages (i.e., the product of per-stage weights, $w = \prod_k w_k = \prod_k 1/\pi_k$). This is the inverse of the overall inclusion probability and is the correct weight for design-based point estimation ($\hat{Y} = \sum_{i \in S} w_i y_i$).
FPC (fpc): one term per stage, aligned with ids. Because survey::svydesign() requires every FPC term on the same scale, two encodings are used:
- Count scale (designs without unequal-probability WOR stages): .fpc_k (the stratum population count $N_h$) is passed for equal-probability WOR stages, and a synthetic Inf column (no correction, Hansen–Hurwitz variance) is passed for WR/PMR stages and for random-size Poisson stages after the first.
- Fraction scale (multi-stage designs with a PPS WOR, balanced, or custom WOR stage): every WOR stage passes its per-unit stage sampling fraction $1 / w_k = \pi_k$; WR/PMR and later Poisson stages pass 0 (no correction). A single-stage PPS WOR design passes $\pi_i$ directly, which survey interprets as inclusion probabilities.
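
The two FPC scales can be illustrated directly in survey, which accepts either population counts or sampling fractions in fpc. The sketch below does not use samplyr output: it assumes the api data bundled with survey, and the columns n_h and frac are helper names introduced here for illustration only.

```r
library(survey)
data(api)  # provides apistrat, a stratified sample of schools

# Stratum sample sizes n_h; the `fpc` column holds the population counts N_h.
n_h <- table(apistrat$stype)
apistrat$n_h  <- as.numeric(n_h[as.character(apistrat$stype)])
apistrat$frac <- apistrat$n_h / apistrat$fpc   # sampling fraction n_h / N_h

# Same design, FPC given once as counts and once as fractions.
d_count <- svydesign(ids = ~1, strata = ~stype, weights = ~pw,
                     fpc = ~fpc, data = apistrat)
d_frac  <- svydesign(ids = ~1, strata = ~stype, weights = ~pw,
                     fpc = ~frac, data = apistrat)

se_count <- as.numeric(SE(svymean(~api00, d_count)))
se_frac  <- as.numeric(SE(svymean(~api00, d_frac)))
all.equal(se_count, se_frac)
```

Both encodings describe the same correction, which is why a design can be exported on either scale as long as every term uses the same one.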

Multi-stage designs

Every executed sampling stage is represented in the exported design: one ids term, one fpc term, and (when stratified) one strata term per stage, so survey::svydesign() performs exact multi-stage linearization (Sarndal et al. 1992, ch. 4.3). In particular, a design whose first stage is a census of PSUs correctly attributes all variance to the later stages.

Operational execution does not change this classification. For example, stage1 <- execute(design, psu_frame, stages = 1) followed by sample <- execute(stage1, listing_frame) remains one multistage design. The partial tbl_sample stores the same design plus the realized PSU selection; the final sample records all executed stages and as_svydesign() calls survey::svydesign(), not survey::twophase().

A two-phase sample has a different provenance: a new phase-2 sampling_design is executed with the phase-1 tbl_sample as its frame, for example phase2 <- execute(design2, phase1). That execution records a previous-phase link, and as_svydesign() calls survey::twophase().

One shape cannot be represented: an unclustered element-sampling stage followed by further stages is not nested cluster sampling (the later selections are conditional on the realized element sample, i.e. phase sampling), and as_svydesign() raises an error. Express such designs as two-phase samples instead: execute the element stage under its first-phase design, then execute a new second-phase design with that sample as its frame. This exports via survey::twophase().

Concretely, for a two-stage stratified-cluster design with a final element stage, the exported call is equivalent to:


survey::svydesign(
  ids     = ~ ea_id + .id_2,       # stage-1 clusters, stage-2 elements
  strata  = ~ region,              # stage-1 strata
  weights = ~ .weight,             # product of per-stage weights
  fpc     = ~ .fpc_pi_1 + .fpc_f_2,  # per-stage sampling fractions
  data    = sample,
  nest    = TRUE
)

Modified samples and domain analysis

The conversion requires a sample whose rows still match the executed design. A tbl_sample is marked as modified, and as_svydesign() raises an error, when its row set was changed after execute() (rows removed by dplyr::filter() or [, rows added, or rows duplicated by a join) or when its internal design columns (.weight, .weight_k, .fpc_k, ...) were overwritten, dropped, or renamed.

The check is authoritative, not just mark-based: the sample is verified against an integrity record (row count and a hash of the weights, design metadata, and strata/cluster columns) stored at execution. Modifications made through routes the dplyr hooks cannot see (base assignment, rbind(), vctrs operations, third-party verbs) are therefore also caught, while an overwrite that leaves every value identical passes.

Physically dropping out-of-domain rows before conversion is not equivalent to domain estimation: the point estimate agrees, but the variance is understated because the domain sample size is random under the design.

For subpopulation estimates, convert the full sample first and then subset the design, which applies the proper domain estimator:


svy <- as_svydesign(sample)
survey::svymean(~y, subset(svy, domain))
# or with srvyr:
as_survey_design(sample) |> filter(domain) |> summarise(...)

Row reordering, one-to-one joins, and adding ordinary data columns do not mark the sample. Extracting one complete replicate from a replicated execution (filter(.replicate == r)) is verified against the execution metadata and remains supported.
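
The difference between dropping rows and subsetting the design can be seen with plain survey objects. This is a sketch under assumptions: it uses the api data bundled with survey rather than a tbl_sample, and the domain (sch.wide == "Yes") is an arbitrary choice made for illustration.

```r
library(survey)
data(api)  # provides apistrat

full <- svydesign(ids = ~1, strata = ~stype, weights = ~pw,
                  fpc = ~fpc, data = apistrat)

# Proper domain estimator: subset the design object, not the data.
dom_ok <- svymean(~api00, subset(full, sch.wide == "Yes"))

# Not equivalent: drop rows first, then declare a design on what is left.
kept    <- apistrat[apistrat$sch.wide == "Yes", ]
dropped <- svydesign(ids = ~1, strata = ~stype, weights = ~pw,
                     fpc = ~fpc, data = kept)
dom_bad <- svymean(~api00, dropped)

c(domain = coef(dom_ok), dropped = coef(dom_bad))  # same point estimate
c(domain = SE(dom_ok),   dropped = SE(dom_bad))    # standard errors differ
```

Only the first form keeps the out-of-domain rows in the variance computation, which is what accounts for the random domain sample size.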

Equal-probability systematic sampling

systematic stages are exported with the SRSWOR variance estimator, the standard approximation for systematic sampling. Depending on the frame ordering (see the control argument of draw()), the true variance can be smaller (favorable ordering) or larger (periodic ordering) than this estimate.
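
The effect of frame ordering can be checked by enumerating every possible random start on a small artificial frame. The base R sketch below is illustrative only: the frame size, the sample size, and the two y patterns are made-up values, and the helper functions by_start() and compare() are not part of samplyr.

```r
N <- 1000; n <- 50
k <- N / n                      # sampling interval (20)

# Sample mean and SRSWOR variance estimate for each possible random start.
by_start <- function(y) {
  t(vapply(seq_len(k), function(r) {
    s <- y[seq(r, N, by = k)]
    c(mean = mean(s), v_srs = (1 - n / N) * var(s) / n)
  }, c(mean = 0, v_srs = 0)))
}

compare <- function(y) {
  res <- by_start(y)
  c(true_var          = mean((res[, "mean"] - mean(y))^2),  # exact design variance
    mean_srs_estimate = mean(res[, "v_srs"]))               # average reported variance
}

trend    <- compare(seq_len(N) / N)                 # frame sorted by y (favorable)
periodic <- compare(sin(2 * pi * seq_len(N) / k))   # period equal to the interval
rbind(trend, periodic)
```

On the sorted frame the SRSWOR estimate overstates the true variance; on the periodic frame every sample is internally constant, so the estimate is essentially zero while the true variance is not.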

Variance estimation for PPS designs

For fixed-size PPS without-replacement stages (pps_brewer, pps_systematic, pps_cps, pps_sampford, pps_sps, pps_pareto), variance is estimated by default using Brewer's approximation (pps = "brewer" in survey's terminology), which approximates the joint inclusion probabilities from the marginal inclusion probabilities. Here Brewer names the variance estimator, not the selection algorithm: Sampford selection, for example, receives this default treatment. This is the approximation described by Berger (2004) and works well for most PPS designs regardless of the sampling algorithm used.

For supported methods, you can instead compute joint inclusion probabilities using joint_expectation() and pass them via pps = survey::ppsmat(joint_matrix). The matrix is exact for CPS, Sampford, systematic PPS, and Poisson selection; generalized Brewer, SPS, Pareto, and unconstrained cube use the documented high-entropy approximation.
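
What the two options mean on the survey side can be sketched without samplyr. The example below assumes the election data bundled with survey, which provides a PPS sample (election_pps) and its joint inclusion probability matrix (election_jointprob); it declares the same design once with Brewer's approximation and once with the explicit matrix.

```r
library(survey)
data(election)  # provides election_pps and election_jointprob

# `p` holds the marginal inclusion probabilities of the sampled units.
d_brewer <- svydesign(ids = ~1, fpc = ~p, data = election_pps,
                      pps = "brewer")
d_joint  <- svydesign(ids = ~1, fpc = ~p, data = election_pps,
                      pps = ppsmat(election_jointprob))

tot_brewer <- svytotal(~Bush + Kerry + Nader, d_brewer)
tot_joint  <- svytotal(~Bush + Kerry + Nader, d_joint)

# Point estimates are identical; only the standard errors change.
cbind(estimate = coef(tot_joint),
      se_brewer = SE(tot_brewer),
      se_joint  = SE(tot_joint))
```

The choice of pps affects variance estimation only, so the weights and point estimates are the same in both designs.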

Spatial and constrained balanced methods

Bounded cube, LPM2, and SCPS alter pairwise selection behavior beyond the available linearization approximation. as_svydesign() therefore refuses these designs, and joint_expectation() does not provide a matrix for them. Use as_svrepdesign(type = "subbootstrap") or "mrbbootstrap" for a generic PPS bootstrap approximation. These replicates do not recreate the count constraints or spatial algorithm and are not an exact, design-specific variance estimator.

Random-size Poisson methods

Methods bernoulli and pps_poisson select units independently with known marginal inclusion probabilities, so the realized sample size is random. The standard SRSWOR variance estimator is not appropriate, and Brewer's approximation (designed for fixed-size PPS) understates the variance. Instead, these methods are exported with pps = survey::poisson_sampling(pi), which produces the Horvitz-Thompson Poisson variance estimator $\hat V = \sum_{i \in S} (1 - \pi_i) / \pi_i^2 \cdot y_i^2$ described in Sarndal, Swensson and Wretman (1992), section 2.8.

How the Poisson export is applied depends on the shape of the design:

Single-stage designs (no cluster_by(), or cluster_by() with one row per sampled cluster) are exported with poisson_sampling() and produce the exact Horvitz-Thompson Poisson variance.
Multi-stage designs with a random-size Poisson method at stage k > 1 omit the finite-population correction at the Poisson stage (the same handling used for with-replacement methods). The Poisson stage is treated as sampled with replacement, which is mildly conservative.
Multi-stage designs with a random-size Poisson method at stage 1 are not supported by survey::svydesign(), which rejects multi-stage designs when the pps argument is set. Such designs raise an error suggesting as_svrepdesign(type = "subbootstrap").
Single-stage designs that use cluster_by() with multiple rows per sampled cluster (for example a household listing within sampled EAs) raise an error. survey::poisson_sampling() treats rows as independent and does not honor within-cluster correlation. Use as_svrepdesign(type = "subbootstrap") for these designs.
Custom methods registered with fixed_size = FALSE (sondage::register_method()) are also random-size, but samplyr cannot verify that their selections are independent across units, which the Poisson estimator requires. The method author can settle this at registration: a method registered with variance_family = "poisson" asserts independent selections and is exported through poisson_sampling() exactly like the built-ins above. Undeclared methods raise an error; if you know the method is Poisson-type, pass the probabilities explicitly: as_svydesign(x, pps = survey::poisson_sampling(1 / x$.weight)), or use as_svrepdesign(type = "subbootstrap").
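
The Horvitz-Thompson Poisson variance estimator quoted above can be verified by brute force on a tiny population. The base R sketch below enumerates all possible Poisson samples and compares the expectation of $\hat V$ with the true variance of the Horvitz-Thompson total; the values of y and p_incl are hypothetical.

```r
y      <- c(3, 5, 8, 12)          # hypothetical population values
p_incl <- c(0.2, 0.4, 0.5, 0.8)   # hypothetical inclusion probabilities

# All 2^4 inclusion patterns and their probabilities under independent selection.
samples <- as.matrix(expand.grid(rep(list(0:1), length(y))))
prob <- apply(samples, 1, function(s) prod(ifelse(s == 1, p_incl, 1 - p_incl)))

# Horvitz-Thompson total and Poisson variance estimate for each sample.
ht    <- drop(samples %*% (y / p_incl))
v_hat <- drop(samples %*% ((1 - p_incl) / p_incl^2 * y^2))

true_var       <- sum(prob * (ht - sum(y))^2)   # exact design variance of the HT total
expected_v_hat <- sum(prob * v_hat)             # design expectation of the estimator
all.equal(true_var, expected_v_hat)
```

The agreement holds only because selections are independent across units, which is the property a custom method must guarantee before it is treated as Poisson-type.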

Declared variance families for custom methods

sondage::register_method() accepts a variance_family declaration ("srs", "pps_brewer", "poisson", "wr", "unsupported"). When present it overrides the classification samplyr would otherwise infer from the method's type and fixed_size: "srs" receives the equal-probability treatment (count-scale FPC), "pps_brewer" the fixed-size PPS treatment (Brewer approximation), "poisson" exact Poisson linearization, and "wr" the with-replacement treatment. A method declared "unsupported" cannot be linearized at all: as_svydesign() refuses with an error and as_svrepdesign(type = "subbootstrap") remains the escape hatch.

Chromy's sequential PPS method (PMR)

pps_chromy is classified as a Probability Minimum Replacement (PMR) method – neither with-replacement nor without-replacement. Each unit receives exactly $\lfloor E(n_i) \rfloor$ or $\lfloor E(n_i) \rfloor + 1$ hits, where $E(n_i) = n \cdot \textrm{mos}_i / \sum \textrm{mos}$. When all expected hit counts are below 1, this reduces to WOR; otherwise large units receive multiple hits.
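
As a small worked example of the expected hit counts, the base R sketch below uses hypothetical measures of size and a hypothetical n. It computes only $E(n_i)$ and the range of possible hits; it does not implement Chromy's selection algorithm.

```r
mos <- c(50, 30, 10, 5, 5)    # hypothetical measures of size
n   <- 4                      # total number of hits to allocate

e_n <- n * mos / sum(mos)     # expected hits E(n_i)

hits <- data.frame(
  mos      = mos,
  expected = e_n,
  min_hits = floor(e_n),      # each unit receives at least this many hits
  max_hits = ceiling(e_n)     # and at most this many (equal to min when E(n_i) is an integer)
)
hits

all(e_n < 1)                  # TRUE would mean the design reduces to WOR
```

Here the two largest units have expected hit counts of 1 or more, so the design is not without-replacement and the largest unit is always hit twice.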

For variance estimation, Chromy (2009) recommends the Hansen-Hurwitz (with-replacement) approximation rather than exact pairwise expectations, which he found "quite variable." Chauvet (2019) confirmed this in simulation. Accordingly, as_svydesign() treats pps_chromy stages like with-replacement stages (no FPC, no pps argument).

Note that survey::ppsmat() is not valid for the general PMR case. The survey package reads $\pi_i$ from the diagonal of the joint matrix, but for PMR the diagonal contains $E(n_i^2)$, which differs from $E(n_i)$ when units receive multiple hits. The generalized Sen-Yates-Grundy variance requires $E(n_i) E(n_j) - E(n_i n_j)$ as the pairwise weight (Chromy 2009, eq. 5), not $E(n_i^2) E(n_j^2) - E(n_i n_j)$.
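
The gap between $E(n_i^2)$ and $E(n_i)$ follows from the two-point hit distribution described above: a unit receives $\lfloor E(n_i) \rfloor$ hits or one more, with probabilities fixed by the mean. A minimal base R sketch, using hypothetical expected hit counts:

```r
e_n  <- c(2, 1.2, 0.4)        # hypothetical expected hit counts E(n_i)
lo   <- floor(e_n)
p_hi <- e_n - lo              # probability of receiving lo + 1 hits

# Second moment of the two-point distribution on {lo, lo + 1}.
e_n2 <- lo^2 * (1 - p_hi) + (lo + 1)^2 * p_hi

rbind(E_n = e_n, E_n2 = e_n2)
```

The two moments coincide only for the unit with an expected hit count below 1, which is the case where the diagonal of the joint matrix can be read as an inclusion probability.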

Certainty stratum (take-all units)

For PPS without-replacement stages that use certainty selection (certainty_size or certainty_prop), units with inclusion probability $\pi_i = 1$ are placed in a separate take-all stratum. This follows the standard practice from Cochran (1977, ch. 11) and Sarndal et al. (1992, ch. 3.5): the take-all stratum contributes zero variance (it is a census) and does not inflate the degrees of freedom for the probability stratum.

For stages using with-replacement methods (srswr, pps_multinomial), the finite population correction is omitted and the .draw_k column (sequential draw index) is used as the sampling unit identifier for Hansen-Hurwitz variance estimation.
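
The Hansen-Hurwitz treatment of draws can be written out in a few lines of base R. This is a sketch with hypothetical data, not the code path used by survey: each draw contributes one expanded value $y_i / p_i$, and a unit drawn twice contributes twice.

```r
y   <- c(120, 80, 45, 30, 25)   # hypothetical unit totals
mos <- c(50, 30, 10, 5, 5)      # hypothetical measures of size
p   <- mos / sum(mos)           # single-draw selection probabilities

draws <- c(1, 2, 1, 4)          # realized draws; unit 1 was selected twice

z     <- y[draws] / p[draws]    # one expanded value per draw
t_hat <- mean(z)                # Hansen-Hurwitz estimate of the total
v_hat <- var(z) / length(z)     # variance estimate, no FPC

c(total = t_hat, se = sqrt(v_hat))
```

Indexing by draw rather than by unit is what makes the repeated selection of unit 1 count as two independent observations.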

The survey package is required but not imported – it must be installed to use this function.

References

Berger, Y.G. (2004). A Simple Variance Estimator for Unequal Probability Sampling Without Replacement. Journal of Applied Statistics, 31, 305-315.

Brewer, K.R.W. (2002). Combined Survey Sampling Inference (Weighing Basu's Elephants). Chapter 9.

Chauvet, G. (2019). Properties of Chromy's sampling procedure. arXiv:1912.10896.

Chromy, J.R. (2009). Some Generalizations of the Horvitz-Thompson Estimator. JSM Proceedings, Survey Research Methods Section.

Cochran, W.G. (1977). Sampling Techniques. 3rd edition. Wiley.

Sarndal, C.-E., Swensson, B. and Wretman, J. (1992). Model Assisted Survey Sampling. Springer.

Examples

# Stratified sample -> survey design
sample <- sampling_design() |>
  stratify_by(region, alloc = "proportional") |>
  draw(n = 300) |>
  execute(bfa_eas, seed = 42)

svy <- as_svydesign(sample)
survey::svymean(~households, svy)
#>              mean     SE
#> households 71.668 3.6922

# Two-stage cluster sample with PPS first stage
sample <- sampling_design() |>
  add_stage() |>
    stratify_by(region) |>
    cluster_by(ea_id) |>
    draw(n = 5, method = "pps_brewer", mos = households) |>
  add_stage() |>
    draw(n = 12) |>
  execute(bfa_eas, seed = 2025)

# Default: Brewer variance approximation
svy <- as_svydesign(sample)

# Explicit joint probabilities computed from the frame
# (for pps_brewer the matrix is a high-entropy approximation, see Details)
jip <- joint_expectation(sample, bfa_eas, stage = 1)
svy_exact <- as_svydesign(sample, pps = survey::ppsmat(jip[[1]]))
#> Warning: Exact PPS variance (`pps`) is single-stage in survey.
#> ℹ Exporting the stage-1 design only; later-stage sampling variance is not
#>   represented.
#> ℹ Omit `pps` for exact multi-stage linearization with Brewer's approximation at
#>   the PPS stage.