Overview
In a typical survey, sampling is one step in a longer process:
-
Design and execute the sample using
samplyrto select units from a sampling frame. - Collect data in the field on the selected units.
- Merge the collected data back into the sample, preserving the design metadata (strata, clusters, weights, FPC).
-
Analyze using the
surveyorsrvyrpackage for design-based inference.
samplyr handles step 1. For steps 3 and 4, it provides
bridge functions that convert a tbl_sample into
survey design objects. The tbl_sample carries
all the information that survey::svydesign() needs (strata,
cluster ids, weights, finite population corrections), so the conversion
is automatic.
Because tbl_sample is a tibble subclass, standard dplyr
operations like left_join() preserve the sampling metadata.
This makes step 3 straightforward: join your collected data by unit
identifier and the result is still a valid tbl_sample.
This vignette covers sample diagnostics (before fieldwork), the merge
step, and integration with survey and srvyr
for estimation.
Sample Diagnostics
After executing a design, summary() gives a concise
overview: the design specification, execution details, stratum
allocation table, and weight diagnostics. This is what you check before
going to the field to verify the sample looks right.
library(samplyr)
library(dplyr)
library(survey)
data(bfa_eas)
sample <- sampling_design() |>
stratify_by(region, alloc = "proportional") |>
draw(n = 300) |>
execute(bfa_eas, seed = 1)
summary(sample)
#> ── Sample Summary ──────────────────────────────────────────────────────────────
#>
#> ℹ n = 300 | stages = 1/1 | seed = 1
#>
#> ── Design: Stage 1 ─────────────────────────────────────────────────────────────
#> • Strata: region (proportional)
#> • Method: srswor
#>
#> ── Allocation: Stage 1 ─────────────────────────────────────────────────────────
#> region N_h n_h f_h
#> Boucle du Mouhoun 5009 34 0.0068
#> Cascades 2508 17 0.0068
#> Centre 3888 26 0.0067
#> Centre-Est 2941 20 0.0068
#> Centre-Nord 3402 23 0.0068
#> Centre-Ouest 3723 25 0.0067
#> Centre-Sud 1612 11 0.0068
#> Est 5505 37 0.0067
#> Hauts-Bassins 4839 32 0.0066
#> Nord 2930 20 0.0068
#> Plateau-Central 1662 11 0.0066
#> Sahel 4144 28 0.0068
#> Sud-Ouest 2407 16 0.0066
#> ───── ─── ──────
#> Total 44570 300 0.0067
#>
#> ── Weights ─────────────────────────────────────────────────────────────────────
#> • Range: [146.5, 151.22]
#> • Mean: 148.57 · CV: 0.01
#> • DEFF: 1 · n_eff: 300The allocation table shows the population size (\(N_h\)), sample size (\(n_h\)), and sampling fraction (\(f_h\)) per stratum. The weight diagnostics report the Kish design effect and effective sample size.
Merging Collected Data
After fieldwork, you have collected data (survey responses,
measurements) for the selected units. Join it into the
tbl_sample by a shared identifier. The
tbl_sample class and its design metadata are preserved
through the join.
# Simulate collected data for the selected EAs
set.seed(1)
collected <- tibble(
ea_id = sample$ea_id,
avg_expenditure = rlnorm(nrow(sample), meanlog = 10, sdlog = 0.8),
food_insecure = rbinom(nrow(sample), 1, sample$food_insecurity_pct / 100)
)
# Merge into the sample
sample_with_data <- sample |>
left_join(collected, by = "ea_id")
# Still a tbl_sample with all design metadata intact
class(sample_with_data)
#> [1] "tbl_sample" "tbl_df" "tbl" "data.frame"The merged object can now be exported to survey for
design-based analysis of the collected variables.
Exporting to the survey Package
as_svydesign() converts a tbl_sample into a
survey::svydesign() object. It automatically maps strata,
clusters, weights, and finite population corrections from the sampling
metadata.
Stratified design
svy <- as_svydesign(sample_with_data)
svy
#> Stratified Independent Sampling design
#> svydesign(ids = ~1, strata = ~region, weights = ~.weight, fpc = ~.fpc_1,
#> data = data, nest = TRUE)
# Estimate collected variables
svymean(~avg_expenditure, svy)
#> mean SE
#> avg_expenditure 30465 1557
svymean(~food_insecure, svy)
#> mean SE
#> food_insecure 0.10286 0.0164
# Frame variables are still available for validation
svytotal(~population, svy)
#> total SE
#> population 20024627 788500Cluster design with PPS
For PPS without-replacement designs, as_svydesign() uses
Brewer’s variance approximation by default.
data(bfa_eas)
cluster_sample <- sampling_design() |>
cluster_by(ea_id) |>
draw(n = 50, method = "pps_brewer", mos = households) |>
execute(bfa_eas, seed = 2)
svy_cluster <- as_svydesign(cluster_sample)
svy_cluster
#> Independent Sampling design
#> svydesign(ids = ~ea_id, strata = NULL, weights = ~.weight, fpc = ~.fpc_pi_1,
#> data = data, nest = TRUE)
svymean(~households, svy_cluster)
#> mean SE
#> households 62.255 12.673Multi-stage design
For multi-stage designs, each stage’s cluster variable maps to a
level of the id formula. Strata are exported from the first
stage, consistent with the standard with-replacement variance
approximation (Cochran
1977, ch. 11).
data(zwe_eas)
zwe_frame <- zwe_eas |>
mutate(district_hh = sum(households), .by = district)
ms_sample <- sampling_design() |>
add_stage(label = "Districts") |>
stratify_by(province) |>
cluster_by(district) |>
draw(n = 2, method = "pps_brewer", mos = district_hh) |>
add_stage(label = "EAs") |>
draw(n = 3) |>
execute(zwe_frame, seed = 3)
#> Warning: Sample size capped to population in 1 stratum/strata: "Bulawayo".
#> ℹ Requested total: 20. Actual total: 19.
svy_ms <- as_svydesign(ms_sample)
svy_ms
#> Stratified 1 - level Cluster Sampling design
#> With (19) clusters.
#> svydesign(ids = ~district, strata = ~province, weights = ~.weight,
#> fpc = ~.fpc_pi_1, data = data, nest = TRUE)
svymean(~households, svy_ms)
#> mean SE
#> households 33.599 2.7796Replicate-Weight Export
For replicate-based variance estimation (jackknife/BRR/bootstrap),
use as_svrepdesign(). This converts a
tbl_sample to svydesign, then to
svyrep.design using
survey::as.svrepdesign().
rep_svy <- as_svrepdesign(sample, type = "auto")
svymean(~households, rep_svy)
#> mean SE
#> households 65.321 2.6708You can also export directly to a srvyr replicate
design:
library(srvyr)
#>
#> Attaching package: 'srvyr'
#> The following object is masked from 'package:stats':
#>
#> filter
rep_tbl <- as_survey_rep(sample, type = "auto")
rep_tbl |>
summarise(mean_hh = survey_mean(households, vartype = "se"))
#> # A tibble: 1 × 2
#> mean_hh mean_hh_se
#> <dbl> <dbl>
#> 1 65.3 2.67Replicate export supports single-phase designs. For PPS designs, use
type = "subbootstrap" or
type = "mrbbootstrap". Two-phase designs should use
as_svydesign() (linearization).
Exact Variance with Joint Expectations
The Brewer approximation works well in practice, but if you need to
use second-order quantities directly, joint_expectation()
computes them for all PPS stages. For WOR stages, this produces joint
inclusion probabilities (pi_kl). For WR/PMR stages
(pps_multinomial, pps_chromy), it produces
joint expected hits E(n_k * n_l). See ?joint_expectation to
see which methods provide exact second-order moments, which
approximations are provided, and for the full method-by-method
breakdown.
It replays the design specification against the original frame, computing quantities per stratum and assembling a block-diagonal matrix.
pps_sample <- sampling_design() |>
add_stage() |>
stratify_by(region) |>
cluster_by(ea_id) |>
draw(n = 5, method = "pps_cps", mos = households) |>
add_stage() |>
draw(n = 12) |>
execute(bfa_eas, seed = 2025)
# Compute joint probabilities for stage 1 (exact for CPS)
jip <- joint_expectation(pps_sample, bfa_eas, stage = 1)
# Use with survey for exact variance (WOR stages)
svy_exact <- as_svydesign(pps_sample, pps = ppsmat(jip[[1]]))
svymean(~households, svy_exact)
#> mean SE
#> households 90.623 10.17
svy_brewer <- as_svydesign(pps_sample)
svymean(~households, svy_brewer)
#> mean SE
#> households 90.623 10.162The supported methods :
| samplyr method |
joint_expectation() returns |
Precision |
|---|---|---|
pps_brewer |
\(\pi_{kl}\) | Approximate |
pps_systematic |
\(\pi_{kl}\) | Exact |
pps_cps |
\(\pi_{kl}\) | Exact |
pps_poisson |
\(\pi_{kl}\) | Exact |
pps_sps |
\(\pi_{kl}\) | Approximate |
pps_pareto |
\(\pi_{kl}\) | Approximate |
balanced |
\(\pi_{kl}\) | Approximate |
pps_multinomial |
\(E(n_k * n_l)\) | Exact |
pps_chromy |
\(E(n_k * n_l)\) | Approximate |
End-to-End Workflow
A complete workflow from design to estimates:
# 1. Define the design
design <- sampling_design(title = "Burkina Faso EA Survey") |>
stratify_by(region, alloc = "proportional") |>
draw(n = 300)
# 2. Validate the frame
validate_frame(design, bfa_eas)
# 3. Draw the sample
ea_sample <- execute(design, bfa_eas, seed = 2026)
# 4. Inspect before fieldwork
summary(ea_sample)
#> ── Sample Summary: Burkina Faso EA Survey ──────────────────────────────────────
#>
#> ℹ n = 300 | stages = 1/1 | seed = 2026
#>
#> ── Design: Stage 1 ─────────────────────────────────────────────────────────────
#> • Strata: region (proportional)
#> • Method: srswor
#>
#> ── Allocation: Stage 1 ─────────────────────────────────────────────────────────
#> region N_h n_h f_h
#> Boucle du Mouhoun 5009 34 0.0068
#> Cascades 2508 17 0.0068
#> Centre 3888 26 0.0067
#> Centre-Est 2941 20 0.0068
#> Centre-Nord 3402 23 0.0068
#> Centre-Ouest 3723 25 0.0067
#> Centre-Sud 1612 11 0.0068
#> Est 5505 37 0.0067
#> Hauts-Bassins 4839 32 0.0066
#> Nord 2930 20 0.0068
#> Plateau-Central 1662 11 0.0066
#> Sahel 4144 28 0.0068
#> Sud-Ouest 2407 16 0.0066
#> ───── ─── ──────
#> Total 44570 300 0.0067
#>
#> ── Weights ─────────────────────────────────────────────────────────────────────
#> • Range: [146.5, 151.22]
#> • Mean: 148.57 · CV: 0.01
#> • DEFF: 1 · n_eff: 300
# 5. Collect data (simulated here)
set.seed(99)
field_data <- tibble(
ea_id = ea_sample$ea_id,
avg_expenditure = rlnorm(nrow(ea_sample), meanlog = 10, sdlog = 0.8)
)
# 6. Merge collected data
ea_sample <- ea_sample |>
left_join(field_data, by = "ea_id")
# 7. Export and analyze
svy <- as_svydesign(ea_sample)
svymean(~avg_expenditure, svy)
#> mean SE
#> avg_expenditure 28639 1442.2
confint(svymean(~avg_expenditure, svy))
#> 2.5 % 97.5 %
#> avg_expenditure 25812.54 31465.83