Overview
In a typical survey, sampling is one step in a longer process:
-
Design and execute the sample using
samplyrto select units from a sampling frame. - Collect data in the field on the selected units.
- Merge the collected data back into the sample, preserving the design metadata (strata, clusters, weights, FPC).
-
Analyze using the
surveyorsrvyrpackage for design-based inference.
samplyr handles step 1. For steps 3 and 4, it provides
bridge functions that convert a tbl_sample into
survey design objects. The tbl_sample carries
all the information that survey::svydesign() needs (strata,
cluster ids, weights, finite population corrections), so the conversion
is automatic.
Because tbl_sample is a tibble subclass, standard dplyr
operations like left_join() preserve the sampling metadata.
This makes step 3 straightforward: join your collected data by unit
identifier and the result is still a valid tbl_sample.
This vignette covers sample diagnostics (before fieldwork), the merge
step, and integration with survey and srvyr
for estimation.
Sample Diagnostics
After executing a design, summary() gives a concise
overview: the design specification, execution details, stratum
allocation table, and weight diagnostics. This is what you check before
going to the field to verify the sample looks right.
library(samplyr)
library(dplyr)
library(survey)
data(bfa_eas)
sample <- sampling_design() |>
stratify_by(region, alloc = "proportional") |>
draw(n = 300) |>
execute(bfa_eas, seed = 1)
summary(sample)
#> ── Sample Summary ──────────────────────────────────────────────────────────────
#>
#> ℹ n = 300 | stages = 1/1 | seed = 1
#>
#> ── Design: Stage 1 ─────────────────────────────────────────────────────────────
#> • Strata: region (proportional)
#> • Method: srswor
#>
#> ── Allocation: Stage 1 ─────────────────────────────────────────────────────────
#> region N_h n_h f_h
#> Boucle du Mouhoun 1483 30 0.0202
#> Cascades 667 14 0.0210
#> Centre 1556 31 0.0199
#> Centre-Est 1259 25 0.0199
#> Centre-Nord 1375 28 0.0204
#> Centre-Ouest 1287 26 0.0202
#> Centre-Sud 608 12 0.0197
#> Est 1590 32 0.0201
#> Hauts-Bassins 1483 30 0.0202
#> Nord 1211 24 0.0198
#> Plateau-Central 757 15 0.0198
#> Sahel 902 18 0.0200
#> Sud-Ouest 722 15 0.0208
#> ───── ─── ──────
#> Total 14900 300 0.0201
#>
#> ── Weights ─────────────────────────────────────────────────────────────────────
#> • Range: [47.64, 50.67]
#> • Mean: 49.67 · CV: 0.02
#> • DEFF: 1 · n_eff: 300The allocation table shows the population size (\(N_h\)), sample size (\(n_h\)), and sampling fraction (\(f_h\)) per stratum. The weight diagnostics report the Kish design effect and effective sample size.
Merging Collected Data
After fieldwork, you have collected data (survey responses,
measurements) for the selected units. Join it into the
tbl_sample by a shared identifier. The
tbl_sample class and its design metadata are preserved
through the join.
# Simulate collected data for the selected EAs
set.seed(1)
collected <- tibble(
ea_id = sample$ea_id,
avg_expenditure = rlnorm(nrow(sample), meanlog = 10, sdlog = 0.8),
food_insecure = rbinom(nrow(sample), 1, sample$food_insecurity_pct / 100)
)
# Merge into the sample
sample_with_data <- sample |>
left_join(collected, by = "ea_id")
# Still a tbl_sample with all design metadata intact
class(sample_with_data)
#> [1] "tbl_sample" "tbl_df" "tbl" "data.frame"The merged object can now be exported to survey for
design-based analysis of the collected variables.
Exporting to the survey Package
as_svydesign() converts a tbl_sample into a
survey::svydesign() object. It automatically maps strata,
clusters, weights, and finite population corrections from the sampling
metadata.
Stratified design
svy <- as_svydesign(sample_with_data)
svy
#> Stratified Independent Sampling design
#> svydesign(ids = ~1, strata = ~region, weights = ~.weight, fpc = ~.fpc_1,
#> data = data, nest = TRUE)
# Estimate collected variables
svymean(~avg_expenditure, svy)
#> mean SE
#> avg_expenditure 30466 1554
svymean(~food_insecure, svy)
#> mean SE
#> food_insecure 0.1896 0.0218
# Frame variables are still available for validation
svytotal(~population, svy)
#> total SE
#> population 20503608 373599Cluster design with PPS
For PPS without-replacement designs, as_svydesign() uses
Brewer’s variance approximation by default.
data(bfa_eas)
cluster_sample <- sampling_design() |>
cluster_by(ea_id) |>
draw(n = 50, method = "pps_brewer", mos = households) |>
execute(bfa_eas, seed = 2)
svy_cluster <- as_svydesign(cluster_sample)
svy_cluster
#> Independent Sampling design
#> svydesign(ids = ~ea_id, strata = NULL, weights = ~.weight, fpc = ~.fpc_pi_1,
#> data = data, nest = TRUE)
svymean(~households, svy_cluster)
#> mean SE
#> households 221.53 12.771Multi-stage design
For multi-stage designs, each stage’s cluster variable maps to a
level of the id formula. Strata are exported from the first
stage, consistent with the standard with-replacement variance
approximation (Cochran
1977, ch. 11).
data(zwe_eas)
zwe_frame <- zwe_eas |>
mutate(district_hh = sum(households), .by = district)
ms_sample <- sampling_design() |>
add_stage(label = "Districts") |>
stratify_by(province) |>
cluster_by(district) |>
draw(n = 2, method = "pps_brewer", mos = district_hh) |>
add_stage(label = "EAs") |>
draw(n = 3) |>
execute(zwe_frame, seed = 3)
#> Warning: Sample size capped to population in 1 stratum/strata: "Bulawayo".
#> ℹ Requested total: 20. Actual total: 19.
svy_ms <- as_svydesign(ms_sample)
svy_ms
#> Stratified 1 - level Cluster Sampling design
#> With (19) clusters.
#> svydesign(ids = ~district, strata = ~province, weights = ~.weight,
#> fpc = ~.fpc_pi_1, data = data, nest = TRUE)
svymean(~households, svy_ms)
#> mean SE
#> households 170.98 5.6658Replicate-Weight Export
For replicate-based variance estimation (jackknife/BRR/bootstrap),
use as_svrepdesign(). This converts a
tbl_sample to svydesign, then to
svyrep.design using
survey::as.svrepdesign().
rep_svy <- as_svrepdesign(sample, type = "auto")
svymean(~households, rep_svy)
#> mean SE
#> households 207.54 4.289You can also export directly to a srvyr replicate
design:
library(srvyr)
#>
#> Attaching package: 'srvyr'
#> The following object is masked from 'package:stats':
#>
#> filter
rep_tbl <- as_survey_rep(sample, type = "auto")
rep_tbl |>
summarise(mean_hh = survey_mean(households, vartype = "se"))
#> # A tibble: 1 × 2
#> mean_hh mean_hh_se
#> <dbl> <dbl>
#> 1 208. 4.29Current scope: replicate export supports single-phase non-PPS
designs. Two-phase and PPS designs should use
as_svydesign() (linearization).
Exact Variance with Joint Expectations
The Brewer approximation works well in practice, but if you need to
use second-order quantities directly, joint_expectation()
computes them for all PPS stages. For WOR stages, this produces joint
inclusion probabilities (pi_kl). For WR/PMR stages
(pps_multinomial, pps_chromy), it produces
joint expected hits E(n_k * n_l). See ?joint_expectation to
see which methods provide exact second-order moments, which
approximations are provided, and for the full method-by-method
breakdown.
It replays the design specification against the original frame, computing quantities per stratum and assembling a block-diagonal matrix.
pps_sample <- sampling_design() |>
add_stage() |>
stratify_by(region) |>
cluster_by(ea_id) |>
draw(n = 5, method = "pps_cps", mos = households) |>
add_stage() |>
draw(n = 12) |>
execute(bfa_eas, seed = 2025)
# Compute joint probabilities for stage 1 (exact for CPS)
jip <- joint_expectation(pps_sample, bfa_eas, stage = 1)
# Use with survey for exact variance (WOR stages)
svy_exact <- as_svydesign(pps_sample, pps = ppsmat(jip[[1]]))
svymean(~households, svy_exact)
#> mean SE
#> households 214.3 8.5475
svy_brewer <- as_svydesign(pps_sample)
svymean(~households, svy_brewer)
#> mean SE
#> households 214.3 8.5474The supported methods :
| samplyr method |
joint_expectation() returns |
Precision |
|---|---|---|
pps_brewer |
\(\pi_{kl}\) | Approximate |
pps_systematic |
\(\pi_{kl}\) | Exact |
pps_cps |
\(\pi_{kl}\) | Exact |
pps_poisson |
\(\pi_{kl}\) | Exact |
pps_sps |
\(\pi_{kl}\) | Approximate |
pps_pareto |
\(\pi_{kl}\) | Approximate |
balanced |
\(\pi_{kl}\) | Approximate |
pps_multinomial |
\(E(n_k * n_l)\) | Exact |
pps_chromy |
\(E(n_k * n_l)\) | Approximate |
End-to-End Workflow
A complete workflow from design to estimates:
# 1. Define the design
design <- sampling_design(title = "Burkina Faso EA Survey") |>
stratify_by(region, alloc = "proportional") |>
draw(n = 300)
# 2. Validate the frame
validate_frame(design, bfa_eas)
# 3. Draw the sample
ea_sample <- execute(design, bfa_eas, seed = 2026)
# 4. Inspect before fieldwork
summary(ea_sample)
#> ── Sample Summary: Burkina Faso EA Survey ──────────────────────────────────────
#>
#> ℹ n = 300 | stages = 1/1 | seed = 2026
#>
#> ── Design: Stage 1 ─────────────────────────────────────────────────────────────
#> • Strata: region (proportional)
#> • Method: srswor
#>
#> ── Allocation: Stage 1 ─────────────────────────────────────────────────────────
#> region N_h n_h f_h
#> Boucle du Mouhoun 1483 30 0.0202
#> Cascades 667 14 0.0210
#> Centre 1556 31 0.0199
#> Centre-Est 1259 25 0.0199
#> Centre-Nord 1375 28 0.0204
#> Centre-Ouest 1287 26 0.0202
#> Centre-Sud 608 12 0.0197
#> Est 1590 32 0.0201
#> Hauts-Bassins 1483 30 0.0202
#> Nord 1211 24 0.0198
#> Plateau-Central 757 15 0.0198
#> Sahel 902 18 0.0200
#> Sud-Ouest 722 15 0.0208
#> ───── ─── ──────
#> Total 14900 300 0.0201
#>
#> ── Weights ─────────────────────────────────────────────────────────────────────
#> • Range: [47.64, 50.67]
#> • Mean: 49.67 · CV: 0.02
#> • DEFF: 1 · n_eff: 300
# 5. Collect data (simulated here)
set.seed(99)
field_data <- tibble(
ea_id = ea_sample$ea_id,
avg_expenditure = rlnorm(nrow(ea_sample), meanlog = 10, sdlog = 0.8)
)
# 6. Merge collected data
ea_sample <- ea_sample |>
left_join(field_data, by = "ea_id")
# 7. Export and analyze
svy <- as_svydesign(ea_sample)
svymean(~avg_expenditure, svy)
#> mean SE
#> avg_expenditure 28645 1436.7
confint(svymean(~avg_expenditure, svy))
#> 2.5 % 97.5 %
#> avg_expenditure 25829.12 31460.92