Skip to contents

Overview

In a typical survey, sampling is one step in a longer process:

  1. Design and execute the sample using samplyr to select units from a sampling frame.
  2. Collect data in the field on the selected units.
  3. Merge the collected data back into the sample, preserving the design metadata (strata, clusters, weights, FPC).
  4. Analyze using the survey or srvyr package for design-based inference.

samplyr handles step 1. For steps 3 and 4, it provides bridge functions that convert a tbl_sample into survey design objects. The tbl_sample carries all the information that survey::svydesign() needs (strata, cluster ids, weights, finite population corrections), so the conversion is automatic.

Because tbl_sample is a tibble subclass, standard dplyr operations like left_join() preserve the sampling metadata. This makes step 3 straightforward: join your collected data by unit identifier and the result is still a valid tbl_sample.

This vignette covers sample diagnostics (before fieldwork), the merge step, and integration with survey and srvyr for estimation.

Sample Diagnostics

After executing a design, summary() gives a concise overview: the design specification, execution details, stratum allocation table, and weight diagnostics. This is what you check before going to the field to verify the sample looks right.

library(samplyr)
library(dplyr)
library(survey)

data(bfa_eas)

sample <- sampling_design() |>
  stratify_by(region, alloc = "proportional") |>
  draw(n = 300) |>
  execute(bfa_eas, seed = 1)

summary(sample)
#> ── Sample Summary ──────────────────────────────────────────────────────────────
#> 
#> ℹ n = 300 | stages = 1/1 | seed = 1
#> 
#> ── Design: Stage 1 ─────────────────────────────────────────────────────────────
#> • Strata: region (proportional)
#> • Method: srswor
#> 
#> ── Allocation: Stage 1 ─────────────────────────────────────────────────────────
#>   region             N_h    n_h  f_h   
#>   Boucle du Mouhoun  1483   30   0.0202
#>   Cascades           667    14   0.0210
#>   Centre             1556   31   0.0199
#>   Centre-Est         1259   25   0.0199
#>   Centre-Nord        1375   28   0.0204
#>   Centre-Ouest       1287   26   0.0202
#>   Centre-Sud         608    12   0.0197
#>   Est                1590   32   0.0201
#>   Hauts-Bassins      1483   30   0.0202
#>   Nord               1211   24   0.0198
#>   Plateau-Central    757    15   0.0198
#>   Sahel              902    18   0.0200
#>   Sud-Ouest          722    15   0.0208
#>                      ─────  ───  ──────
#>   Total              14900  300  0.0201
#> 
#> ── Weights ─────────────────────────────────────────────────────────────────────
#> • Range: [47.64, 50.67]
#> • Mean:  49.67 · CV: 0.02
#> • DEFF:  1 · n_eff: 300

The allocation table shows the population size (\(N_h\)), sample size (\(n_h\)), and sampling fraction (\(f_h\)) per stratum. The weight diagnostics report the Kish design effect and effective sample size.

Merging Collected Data

After fieldwork, you have collected data (survey responses, measurements) for the selected units. Join it into the tbl_sample by a shared identifier. The tbl_sample class and its design metadata are preserved through the join.

# Simulate collected data for the selected EAs
set.seed(1)
collected <- tibble(
  ea_id = sample$ea_id,
  avg_expenditure = rlnorm(nrow(sample), meanlog = 10, sdlog = 0.8),
  food_insecure = rbinom(nrow(sample), 1, sample$food_insecurity_pct / 100)
)

# Merge into the sample
sample_with_data <- sample |>
  left_join(collected, by = "ea_id")

# Still a tbl_sample with all design metadata intact
class(sample_with_data)
#> [1] "tbl_sample" "tbl_df"     "tbl"        "data.frame"

The merged object can now be exported to survey for design-based analysis of the collected variables.

Exporting to the survey Package

as_svydesign() converts a tbl_sample into a survey::svydesign() object. It automatically maps strata, clusters, weights, and finite population corrections from the sampling metadata.

Stratified design

svy <- as_svydesign(sample_with_data)
svy
#> Stratified Independent Sampling design
#> svydesign(ids = ~1, strata = ~region, weights = ~.weight, fpc = ~.fpc_1, 
#>     data = data, nest = TRUE)

# Estimate collected variables
svymean(~avg_expenditure, svy)
#>                  mean   SE
#> avg_expenditure 30466 1554
svymean(~food_insecure, svy)
#>                 mean     SE
#> food_insecure 0.1896 0.0218

# Frame variables are still available for validation
svytotal(~population, svy)
#>               total     SE
#> population 20503608 373599

Cluster design with PPS

For PPS without-replacement designs, as_svydesign() uses Brewer’s variance approximation by default.

data(bfa_eas)

cluster_sample <- sampling_design() |>
  cluster_by(ea_id) |>
  draw(n = 50, method = "pps_brewer", mos = households) |>
  execute(bfa_eas, seed = 2)

svy_cluster <- as_svydesign(cluster_sample)
svy_cluster
#> Independent Sampling design
#> svydesign(ids = ~ea_id, strata = NULL, weights = ~.weight, fpc = ~.fpc_pi_1, 
#>     data = data, nest = TRUE)
svymean(~households, svy_cluster)
#>              mean     SE
#> households 221.53 12.771

Multi-stage design

For multi-stage designs, each stage’s cluster variable maps to a level of the id formula. Strata are exported from the first stage, consistent with the standard with-replacement variance approximation (Cochran 1977, ch. 11).

data(zwe_eas)

zwe_frame <- zwe_eas |>
  mutate(district_hh = sum(households), .by = district)

ms_sample <- sampling_design() |>
  add_stage(label = "Districts") |>
    stratify_by(province) |>
    cluster_by(district) |>
    draw(n = 2, method = "pps_brewer", mos = district_hh) |>
  add_stage(label = "EAs") |>
    draw(n = 3) |>
  execute(zwe_frame, seed = 3)
#> Warning: Sample size capped to population in 1 stratum/strata: "Bulawayo".
#>  Requested total: 20. Actual total: 19.

svy_ms <- as_svydesign(ms_sample)
svy_ms
#> Stratified 1 - level Cluster Sampling design
#> With (19) clusters.
#> svydesign(ids = ~district, strata = ~province, weights = ~.weight, 
#>     fpc = ~.fpc_pi_1, data = data, nest = TRUE)
svymean(~households, svy_ms)
#>              mean     SE
#> households 170.98 5.6658

Replicate-Weight Export

For replicate-based variance estimation (jackknife/BRR/bootstrap), use as_svrepdesign(). This converts a tbl_sample to svydesign, then to svyrep.design using survey::as.svrepdesign().

rep_svy <- as_svrepdesign(sample, type = "auto")
svymean(~households, rep_svy)
#>              mean    SE
#> households 207.54 4.289

You can also export directly to a srvyr replicate design:

library(srvyr)
#> 
#> Attaching package: 'srvyr'
#> The following object is masked from 'package:stats':
#> 
#>     filter

rep_tbl <- as_survey_rep(sample, type = "auto")
rep_tbl |>
  summarise(mean_hh = survey_mean(households, vartype = "se"))
#> # A tibble: 1 × 2
#>   mean_hh mean_hh_se
#>     <dbl>      <dbl>
#> 1    208.       4.29

Current scope: replicate export supports single-phase non-PPS designs. Two-phase and PPS designs should use as_svydesign() (linearization).

Exact Variance with Joint Expectations

The Brewer approximation works well in practice, but if you need to use second-order quantities directly, joint_expectation() computes them for all PPS stages. For WOR stages, this produces joint inclusion probabilities (pi_kl). For WR/PMR stages (pps_multinomial, pps_chromy), it produces joint expected hits E(n_k * n_l). See ?joint_expectation to see which methods provide exact second-order moments, which approximations are provided, and for the full method-by-method breakdown.

It replays the design specification against the original frame, computing quantities per stratum and assembling a block-diagonal matrix.

pps_sample <- sampling_design() |>
  add_stage() |>
    stratify_by(region) |>
    cluster_by(ea_id) |>
    draw(n = 5, method = "pps_cps", mos = households) |>
  add_stage() |>
    draw(n = 12) |>
  execute(bfa_eas, seed = 2025)

# Compute joint probabilities for stage 1 (exact for CPS)
jip <- joint_expectation(pps_sample, bfa_eas, stage = 1)

# Use with survey for exact variance (WOR stages)
svy_exact <- as_svydesign(pps_sample, pps = ppsmat(jip[[1]]))
svymean(~households, svy_exact)
#>             mean     SE
#> households 214.3 8.5475

svy_brewer <- as_svydesign(pps_sample)
svymean(~households, svy_brewer)
#>             mean     SE
#> households 214.3 8.5474

The supported methods :

samplyr method joint_expectation() returns Precision
pps_brewer \(\pi_{kl}\) Approximate
pps_systematic \(\pi_{kl}\) Exact
pps_cps \(\pi_{kl}\) Exact
pps_poisson \(\pi_{kl}\) Exact
pps_sps \(\pi_{kl}\) Approximate
pps_pareto \(\pi_{kl}\) Approximate
balanced \(\pi_{kl}\) Approximate
pps_multinomial \(E(n_k * n_l)\) Exact
pps_chromy \(E(n_k * n_l)\) Approximate

End-to-End Workflow

A complete workflow from design to estimates:

# 1. Define the design
design <- sampling_design(title = "Burkina Faso EA Survey") |>
  stratify_by(region, alloc = "proportional") |>
  draw(n = 300)

# 2. Validate the frame
validate_frame(design, bfa_eas)

# 3. Draw the sample
ea_sample <- execute(design, bfa_eas, seed = 2026)

# 4. Inspect before fieldwork
summary(ea_sample)
#> ── Sample Summary: Burkina Faso EA Survey ──────────────────────────────────────
#> 
#> ℹ n = 300 | stages = 1/1 | seed = 2026
#> 
#> ── Design: Stage 1 ─────────────────────────────────────────────────────────────
#> • Strata: region (proportional)
#> • Method: srswor
#> 
#> ── Allocation: Stage 1 ─────────────────────────────────────────────────────────
#>   region             N_h    n_h  f_h   
#>   Boucle du Mouhoun  1483   30   0.0202
#>   Cascades           667    14   0.0210
#>   Centre             1556   31   0.0199
#>   Centre-Est         1259   25   0.0199
#>   Centre-Nord        1375   28   0.0204
#>   Centre-Ouest       1287   26   0.0202
#>   Centre-Sud         608    12   0.0197
#>   Est                1590   32   0.0201
#>   Hauts-Bassins      1483   30   0.0202
#>   Nord               1211   24   0.0198
#>   Plateau-Central    757    15   0.0198
#>   Sahel              902    18   0.0200
#>   Sud-Ouest          722    15   0.0208
#>                      ─────  ───  ──────
#>   Total              14900  300  0.0201
#> 
#> ── Weights ─────────────────────────────────────────────────────────────────────
#> • Range: [47.64, 50.67]
#> • Mean:  49.67 · CV: 0.02
#> • DEFF:  1 · n_eff: 300

# 5. Collect data (simulated here)
set.seed(99)
field_data <- tibble(
  ea_id = ea_sample$ea_id,
  avg_expenditure = rlnorm(nrow(ea_sample), meanlog = 10, sdlog = 0.8)
)

# 6. Merge collected data
ea_sample <- ea_sample |>
  left_join(field_data, by = "ea_id")

# 7. Export and analyze
svy <- as_svydesign(ea_sample)
svymean(~avg_expenditure, svy)
#>                  mean     SE
#> avg_expenditure 28645 1436.7
confint(svymean(~avg_expenditure, svy))
#>                    2.5 %   97.5 %
#> avg_expenditure 25829.12 31460.92

Reference

Cochran, William G. 1977. Sampling Techniques. 3rd ed. Wiley.