Survey Analysis with samplyr

Overview

In a typical survey, sampling is one step in a longer process:

1. Design and execute the sample using samplyr to select units from a sampling frame.
2. Collect data in the field on the selected units.
3. Merge the collected data back into the sample, preserving the design metadata (strata, clusters, weights, FPC).
4. Analyze using the survey or srvyr package for design-based inference.

samplyr handles step 1. For step 3, it keeps the design metadata intact through the merge, and for step 4, it provides bridge functions that convert a tbl_sample into survey design objects. The tbl_sample carries all the information that survey::svydesign() needs (strata, cluster ids, weights, finite population corrections), so the conversion is automatic.

Because tbl_sample is a tibble subclass, standard dplyr operations like left_join() preserve the sampling metadata. This makes step 3 straightforward: join your collected data by unit identifier and the result is still a valid tbl_sample.

This vignette covers sample diagnostics (before fieldwork), the merge step, and integration with survey and srvyr for estimation.

Sample Diagnostics

After executing a design, summary() gives a concise overview: one design line and one realization line per stage, then weight diagnostics. This is what you check before going to the field to verify the sample looks right. Per-pool allocation tables are in frame_summary(detail = "pool").

library(samplyr)
library(dplyr)
library(survey)

data(bfa_eas)

sample <- sampling_design() |>
  stratify_by(region, alloc = "proportional") |>
  draw(n = 300) |>
  execute(bfa_eas, seed = 1)

summary(sample)
#> ── Sample Summary ──────────────────────────────────────────────────────────────
#> 
#> ℹ n = 300 of 44,570 | stages = 1/1 | seed = 1
#> 
#> ── Stage 1 ─────────────────────────────────────────────────────────────────────
#> • srswor, by region (proportional)
#> • 13 strata: N_h 1,612-5,505, n_h 11-37, f_h 0.0066-0.0068
#> 
#> ── Weights ─────────────────────────────────────────────────────────────────────
#> • Mean 148.57 [146.5, 151.22] | CV 0.01 | DEFF 1 | n_eff 300

The realization line uses compact sampling notation. Its \(N_h\), \(n_h\), and \(f_h\) labels correspond to the N, n_realized, and take_rate columns in the pool table:

allocation <- frame_summary(sample, detail = "pool")

allocation |>
  select(region, replicate, N, n_target, n_expected, n_realized, take_rate)
#> # A tibble: 13 × 7
#>    region            replicate     N n_target n_expected n_realized take_rate
#>    <fct>                 <int> <dbl>    <dbl>      <dbl>      <dbl>     <dbl>
#>  1 Boucle du Mouhoun         1  5009       34         34         34   0.00679
#>  2 Cascades                  1  2508       17         17         17   0.00678
#>  3 Centre                    1  3888       26         26         26   0.00669
#>  4 Centre-Est                1  2941       20         20         20   0.00680
#>  5 Centre-Nord               1  3402       23         23         23   0.00676
#>  6 Centre-Ouest              1  3723       25         25         25   0.00672
#>  7 Centre-Sud                1  1612       11         11         11   0.00682
#>  8 Est                       1  5505       37         37         37   0.00672
#>  9 Hauts-Bassins             1  4839       32         32         32   0.00661
#> 10 Nord                      1  2930       20         20         20   0.00683
#> 11 Plateau-Central           1  1662       11         11         11   0.00662
#> 12 Sahel                     1  4144       28         28         28   0.00676
#> 13 Sud-Ouest                 1  2407       16         16         16   0.00665

n_target is the nominal allocation requested by the design, n_expected is the sum of the final resolved selection chances, and n_realized is what the execution actually selected. They coincide here because this is a feasible fixed-size design. For Bernoulli or Poisson sampling, n_target records the requested expectation, which can differ from both the resolved expectation and the random realized count. take_rate is n_realized / N; it is not a unit-level inclusion probability for an unequal-probability design.
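The n_target column above is consistent with proportional allocation rounded by the largest-remainder rule. The following is a minimal base R sketch under that assumption; the vignette does not state which rounding rule samplyr applies, and `proportional_alloc()` is a hypothetical helper, not a samplyr function.

```r
# Stratum sizes N_h copied from the pool table above, in table order.
N_h <- c(5009, 2508, 3888, 2941, 3402, 3723, 1612,
         5505, 4839, 2930, 1662, 4144, 2407)

# Hypothetical helper: proportional allocation with largest-remainder rounding.
proportional_alloc <- function(N_h, n) {
  quota <- n * N_h / sum(N_h)      # exact proportional share per stratum
  n_h <- floor(quota)              # start from the integer part
  shortfall <- n - sum(n_h)        # units still to be assigned
  if (shortfall > 0) {
    # Give one extra unit to the strata with the largest fractional parts.
    bump <- order(quota - n_h, decreasing = TRUE)[seq_len(shortfall)]
    n_h[bump] <- n_h[bump] + 1
  }
  n_h
}

n_h <- proportional_alloc(N_h, n = 300)
take_rate <- n_h / N_h   # stratum-level take rate, as in the pool table
weight <- N_h / n_h      # equal-probability weight within each stratum
range(weight)
```

Under this assumption the helper reproduces the 13 n_target values in the table, and the implied weights N_h / n_h span 146.5 to 151.22, the same range that summary() reports for the weights.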

Every pool table has a replicate column (1 for an ordinary execution). With execute(reps = ...), it contains one row per pool and replicate, so the realized variation remains numeric and can be grouped, plotted, or filtered without unpacking list columns. The weight diagnostics report the Kish design effect and effective sample size.
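The Kish quantities depend only on the weights. A short sketch, assuming the standard Kish formulas \(\text{deff} = n \sum w_k^2 / (\sum w_k)^2\) and \(n_\text{eff} = n / \text{deff}\); the vignette does not show samplyr's internal computation, so treat this as a cross-check rather than the package's code.

```r
# N and n_realized per stratum, copied from the pool table above.
N_h <- c(5009, 2508, 3888, 2941, 3402, 3723, 1612,
         5505, 4839, 2930, 1662, 4144, 2407)
n_h <- c(34, 17, 26, 20, 23, 25, 11, 37, 32, 20, 11, 28, 16)

# One weight per sampled unit: N_h / n_h repeated n_h times.
w <- rep(N_h / n_h, times = n_h)

# Kish design effect from unequal weighting and effective sample size.
kish_deff <- length(w) * sum(w^2) / sum(w)^2
kish_neff <- sum(w)^2 / sum(w^2)

# deff = 1 + CV^2, where CV uses the divide-by-n standard deviation.
cv_w <- sqrt(kish_deff - 1)

round(c(deff = kish_deff, n_eff = kish_neff, cv = cv_w), 2)
```

Because proportional allocation makes the weights nearly equal, the design effect is only marginally above 1 and the effective sample size rounds to the nominal 300, in line with the Weights line of summary().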

Merging Collected Data

After fieldwork, you have collected data (survey responses, measurements) for the selected units. Join it into the tbl_sample by a shared identifier. The tbl_sample class and its design metadata are preserved through the join.

# Simulate household-budget outcomes collected after selecting the EAs.
# These outcomes are not variables on the sampling frame.
set.seed(1)
log_consumption <- rnorm(
  nrow(sample),
  mean = 11 + 0.25 * (sample$urban_rural == "Urban") -
    0.12 * (sample$remoteness == "High"),
  sd = 0.65
)
collected <- tibble(
  ea_id = sample$ea_id,
  consumption_per_capita = round(exp(log_consumption)),
  below_consumption_threshold = as.integer(log_consumption < 11)
)

# Merge into the sample
sample_with_data <- sample |>
  left_join(collected, by = "ea_id")

# Still a tbl_sample with all design metadata intact
class(sample_with_data)
#> [1] "tbl_sample" "tbl_df"     "tbl"        "data.frame"

The merged object can now be exported to survey for design-based analysis of the collected variables.
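Since duplicated rows invalidate the sample for export, it can help to confirm that the collected data has exactly one row per sampled unit before joining. A minimal base R sketch; `merge_key_report()` is a hypothetical helper, not part of samplyr or dplyr, and the identifiers below are toy values standing in for sample$ea_id and collected$ea_id.

```r
# Hypothetical helper: compare identifiers in the sample and the collected data.
merge_key_report <- function(sample_ids, collected_ids) {
  list(
    # ids that occur more than once in the collected data (would duplicate rows)
    duplicated = unique(collected_ids[duplicated(collected_ids)]),
    # sampled units with no collected record (outcomes would be NA after the join)
    missing = setdiff(sample_ids, collected_ids),
    # collected records that match no sampled unit (dropped by left_join())
    extra = setdiff(collected_ids, sample_ids)
  )
}

report <- merge_key_report(
  sample_ids = c("ea01", "ea02", "ea03"),
  collected_ids = c("ea01", "ea02", "ea02", "ea04")
)
report
```

Three empty vectors mean the join is one-to-one. In dplyr 1.1.0 and later, left_join() also accepts relationship = "one-to-one", which raises an error instead of silently duplicating rows.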

Exporting to the survey Package

as_svydesign() converts a tbl_sample into a survey::svydesign() object. It automatically maps strata, clusters, weights, and finite population corrections from the sampling metadata.

Stratified design

svy <- as_svydesign(sample_with_data)
svy
#> Stratified Independent Sampling design
#> svydesign(ids = ~1, strata = ~region, weights = ~.weight, fpc = ~.fpc_1, 
#>     data = data, nest = TRUE)

# Estimate collected variables
svymean(~consumption_per_capita, svy)
#>                         mean     SE
#> consumption_per_capita 75808 3112.2
svymean(~below_consumption_threshold, svy)
#>                                mean     SE
#> below_consumption_threshold 0.50316 0.0288

# Frame variables are still available for validation
svytotal(~population, svy)
#>               total     SE
#> population 20024627 788500
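For a stratified simple random sample without replacement such as this one, the standard error reported by svytotal() corresponds to the textbook stratified estimator with a finite population correction. As a reference sketch, using the \(N_h\), \(n_h\), and \(f_h = n_h / N_h\) notation from summary() and writing \(s_h\) for the sampled units and \(s_{y,h}^2\) for the sample variance of \(y\) in stratum \(h\) (symbols introduced here for illustration):

\[
\hat{t}_y = \sum_{h} \frac{N_h}{n_h} \sum_{k \in s_h} y_k,
\qquad
\hat{V}(\hat{t}_y) = \sum_{h} N_h^2 \,(1 - f_h)\, \frac{s_{y,h}^2}{n_h}.
\]

The factor \((1 - f_h)\) is the finite population correction supplied through the fpc argument. With take rates near 0.0067 it lowers the variance by less than 1 percent in this example.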

Cluster design with PPS

For fixed-size PPS without-replacement designs (pps_brewer, pps_systematic, pps_cps, pps_sampford, pps_sps, and pps_pareto), as_svydesign() uses Brewer’s variance approximation by default. Here “Brewer” names the variance approximation used by survey, not the selection algorithm: a Sampford sample, for example, still receives this default unless you supply its exact joint inclusion probabilities.

data(bfa_eas)

cluster_sample <- sampling_design() |>
  cluster_by(ea_id) |>
  draw(n = 50, method = "pps_brewer", mos = households) |>
  execute(bfa_eas, seed = 2)

svy_cluster <- as_svydesign(cluster_sample)
svy_cluster
#> Independent Sampling design
#> svydesign(ids = ~ea_id, strata = NULL, weights = ~.weight, fpc = ~.fpc_pi_1, 
#>     data = data, nest = TRUE)
svymean(~households, svy_cluster)
#>              mean     SE
#> households 78.682 9.3999

Multi-stage design

For multi-stage designs, every executed stage maps to one term of the id formula, with a matching finite population correction. A final stage without cluster_by() gets a synthesized element identifier. Strata are exported per stage, so the result is exact multi-stage linearization (Särndal, Swensson, and Wretman 1992, ch. 4.3).

data(zwe_eas)

zwe_frame <- zwe_eas |>
  mutate(district_hh = sum(households), .by = district)

ms_sample <- sampling_design() |>
  add_stage(label = "Districts") |>
    stratify_by(province) |>
    cluster_by(district) |>
    draw(n = 2, method = "pps_brewer", mos = district_hh) |>
  add_stage(label = "EAs") |>
    draw(n = 3) |>
  execute(zwe_frame, seed = 3)
#> Warning: Sample size capped to population in 1 stratum/strata: "Bulawayo".
#> ℹ Requested total: 20. Actual total: 19.

svy_ms <- as_svydesign(ms_sample)
svy_ms
#> Stratified 2 - level Cluster Sampling design
#> With (19, 57) clusters.
#> svydesign(ids = ~district + .id_2, strata = ~province, weights = ~.weight, 
#>     fpc = ~.fpc_pi_1 + .fpc_f_2, data = data, nest = TRUE)
svymean(~households, svy_ms)
#>              mean     SE
#> households 33.599 3.2636

Domain (Subpopulation) Analysis

To estimate for a subpopulation, convert the full sample first and subset the survey design. Do not filter the tbl_sample rows before converting.

urban_svy <- subset(svy, urban_rural == "Urban")
svymean(~consumption_per_capita, urban_svy)
#>                         mean    SE
#> consumption_per_capita 98994 10526

The distinction matters for variance. Dropping out-of-domain rows before conversion produces the same point estimate but a wrong standard error, typically too small, because the number of sampled units falling in the domain is itself random and that variation is part of the design variance (Särndal, Swensson, and Wretman 1992, ch. 5.8). The subset() method that survey provides for design objects keeps the full design structure and applies the correct domain estimator.

For this reason a filtered tbl_sample is marked as modified and as_svydesign() refuses it:

urban_only <- sample_with_data |>
  filter(urban_rural == "Urban")

as_svydesign(urban_only)
#> Error in `as_svydesign()`:
#> ! `as_svydesign()` requires a sample that still matches its executed
#>   design.
#> ✖ Rows were removed, added, or duplicated after `execute()`.
#> ℹ For domain (subpopulation) analysis, convert the full sample first, then
#>   subset the design: `subset(as_svydesign(full_sample), condition)`, or with
#>   srvyr: `as_survey_design(full_sample) |> filter(condition)`.
#> ℹ To subsample an executed sample, run a second phase: `sampling_design() |>
#>   draw(...) |> execute(full_sample)`.

The same rule applies to any operation that removes, adds, or duplicates rows, or that drops, renames, or overwrites internal design columns such as .weight or .fpc_1 (a select() that keeps only .weight would silently lose the finite population correction). One-to-one joins (as in the merge above), row reordering, and column reordering are unaffected. Extracting a single complete replicate with filter(.replicate == r) from a replicated execution remains supported.

With srvyr, the same domain analysis reads naturally as a pipeline: convert first, then filter the design object.

library(srvyr)
#> 
#> Attaching package: 'srvyr'
#> The following object is masked from 'package:stats':
#> 
#>     filter

sample_with_data |>
  as_survey_design() |>
  filter(urban_rural == "Urban") |>
  summarise(mean_consumption = survey_mean(consumption_per_capita))
#> # A tibble: 1 × 2
#>   mean_consumption mean_consumption_se
#>              <dbl>               <dbl>
#> 1           98994.              10526.
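The domain estimator applied by subset() and filter() can be sketched with a domain indicator. Writing \(w_k\) for the design weight, \(s\) for the full sample, and \(z_k = 1\) when unit \(k\) belongs to the domain and \(0\) otherwise (notation introduced here, not taken from samplyr), the domain mean is a ratio of two estimated totals, with linearized variable \(u_k\):

\[
\hat{\bar{y}}_d = \frac{\sum_{k \in s} w_k z_k y_k}{\sum_{k \in s} w_k z_k},
\qquad
u_k = \frac{z_k \,(y_k - \hat{\bar{y}}_d)}{\sum_{l \in s} w_l z_l}.
\]

The linearized variance is the design variance of the estimated total of \(u_k\) over the whole sample. Out-of-domain units contribute \(u_k = 0\) rather than disappearing, so stratum and cluster sample sizes stay at their designed values. Filtering rows first would instead treat the domain sample size in each stratum as fixed.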

Replicate-Weight Export

For replicate-based variance estimation (jackknife, BRR, or bootstrap), use as_svrepdesign(). It first converts the tbl_sample to a svydesign object and then to a svyrep.design with survey::as.svrepdesign().

rep_svy <- as_svrepdesign(sample, type = "auto")
svymean(~households, rep_svy)
#>              mean     SE
#> households 65.744 2.7501

You can also export directly to a srvyr replicate design:

library(srvyr)

rep_tbl <- as_survey_rep(sample, type = "auto")
rep_tbl |>
  summarise(mean_hh = survey_mean(households, vartype = "se"))
#> # A tibble: 1 × 2
#>   mean_hh mean_hh_se
#>     <dbl>      <dbl>
#> 1    65.7       2.75

Replicate export supports only single-phase designs. For PPS designs, use type = "subbootstrap" or type = "mrbbootstrap". For two-phase designs, use as_svydesign() (linearization) instead.

LPM2 and SCPS require a different qualification. Their spatial spreading changes pairwise selection behavior, and samplyr does not currently provide a design-specific linearization estimator or joint inclusion probabilities for them. Consequently, as_svydesign() refuses these designs. as_svrepdesign(type = "subbootstrap") provides a generic PPS bootstrap approximation, but it does not reproduce the spatial selection algorithm inside each replicate and should not be described as exact.

spatial_frame <- slice_head(bfa_eas, n = 500)

spatial_sample <- sampling_design() |>
  draw(
    n = 40,
    method = "lpm2",
    spread = c(longitude, latitude)
  ) |>
  execute(spatial_frame, seed = 7)

# Use method = "scps" in draw() above for spatially correlated Poisson sampling.
spatial_rep <- as_svrepdesign(
  spatial_sample,
  type = "subbootstrap",
  replicates = 20
)
svymean(~households, spatial_rep)
#>            mean     SE
#> households   77 10.053

Variance with Joint Expectations

The Brewer approximation works well in practice, but joint_expectation() can reconstruct second-order quantities directly for supported methods. For without-replacement (WOR) stages, it returns joint inclusion probabilities \(\pi_{kl}\). For with-replacement (WR) and probability-minimum-replacement (PMR) stages (pps_multinomial, pps_chromy), it returns joint expected hits \(E(n_k n_l)\). Depending on the selection method, these quantities are exact or approximate.

It replays the design specification against the original frame, computing quantities per stratum or conditional parent occurrence. Blocks and units within blocks follow first appearance in the sample. Cross-block entries are products of marginal chances. Below a WR parent, every parent draw occurrence defines a separate independent child block, even when several draws hit the same population cluster.

sampford_sample <- sampling_design() |>
  stratify_by(region) |>
  cluster_by(ea_id) |>
  draw(n = 5, method = "pps_sampford", mos = households) |>
  execute(bfa_eas, seed = 2025)

# Sampford joint inclusion probabilities are exact.
jip <- joint_expectation(sampford_sample, bfa_eas, stage = 1)

# Use the exact matrix for single-stage PPS variance estimation.
svy_exact <- as_svydesign(
  sampford_sample,
  pps = ppsmat(jip[[1]])
)
svymean(~households, svy_exact)
#>              mean     SE
#> households 76.339 7.6495

# The default export instead uses the generic Brewer approximation.
svy_brewer <- as_svydesign(sampford_sample)
svymean(~households, svy_brewer)
#>              mean     SE
#> households 76.339 7.6458
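To see how a joint inclusion probability matrix enters the variance, here is a minimal base R sketch of the Horvitz-Thompson variance estimator for an estimated total. It is illustrative only: `ht_variance()` is a hypothetical helper, the input is a toy simple random sample rather than output from joint_expectation(), and the matrix is assumed to hold \(\pi_{kl}\) for the sampled units with \(\pi_k\) on the diagonal.

```r
# Hypothetical helper: Horvitz-Thompson variance estimate of an estimated total.
# `y` holds the sampled values; `pi_kl` is the joint inclusion probability
# matrix of the sampled units, with first-order probabilities on the diagonal.
ht_variance <- function(y, pi_kl) {
  pi_k <- diag(pi_kl)
  y_exp <- y / pi_k                               # expanded values y_k / pi_k
  delta <- (pi_kl - outer(pi_k, pi_k)) / pi_kl    # (pi_kl - pi_k pi_l) / pi_kl
  sum(delta * outer(y_exp, y_exp))
}

# Toy check: simple random sampling of n = 4 units from N = 10, where
# pi_k = n / N and pi_kl = n (n - 1) / (N (N - 1)) for k != l.
N <- 10
n <- 4
y <- c(3, 5, 8, 10)
pi_kl <- matrix(n * (n - 1) / (N * (N - 1)), nrow = n, ncol = n)
diag(pi_kl) <- n / N

v_ht <- ht_variance(y, pi_kl)
v_srs <- N^2 * (1 - n / N) * var(y) / n   # textbook SRS variance of a total
all.equal(v_ht, v_srs)
```

For simple random sampling the general formula collapses to the textbook expression, which is why the two values agree. With unequal probabilities the off-diagonal terms differ from pair to pair, and that is the information an exact matrix supplies and an approximation has to stand in for.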

The method-specific availability is:

samplyr method	`joint_expectation()` result	Precision or status
`pps_brewer`	\(\pi_{kl}\)	Approximate
`pps_sampford`	\(\pi_{kl}\)	Exact
`pps_systematic`	\(\pi_{kl}\)	Exact
`pps_cps`	\(\pi_{kl}\)	Exact
`pps_poisson`	\(\pi_{kl}\)	Exact
`pps_sps`	\(\pi_{kl}\)	Approximate
`pps_pareto`	\(\pi_{kl}\)	Approximate
`pps_multinomial`	\(E(n_k n_l)\)	Exact
`pps_chromy`	\(E(n_k n_l)\)	Approximate
`cube`	\(\pi_{kl}\)	Approximate when unconstrained
`lpm2`	Unavailable	Generic bootstrap only
`scps`	Unavailable	Generic bootstrap only

For bounded cube designs, joint probabilities are also unavailable because the count constraints alter pairwise selection behavior. For LPM2, SCPS, and bounded cube, use the generic bootstrap route described above. For a multi-stage PPS design, supplying pps = ppsmat(...) represents only the stage covered by that matrix; omit pps to retain samplyr’s full multi-stage linearization with the default stage-level approximation.

End-to-End Workflow

A complete workflow from design to estimates:

# 1. Define the design
design <- sampling_design(title = "Burkina Faso EA Survey") |>
  stratify_by(region, alloc = "proportional") |>
  draw(n = 300)

# 2. Validate the frame
validate_frame(design, bfa_eas)

# 3. Draw the sample
ea_sample <- execute(design, bfa_eas, seed = 2026)

# 4. Inspect before fieldwork
summary(ea_sample)
#> ── Sample Summary: Burkina Faso EA Survey ──────────────────────────────────────
#> 
#> ℹ n = 300 of 44,570 | stages = 1/1 | seed = 2026
#> 
#> ── Stage 1 ─────────────────────────────────────────────────────────────────────
#> • srswor, by region (proportional)
#> • 13 strata: N_h 1,612-5,505, n_h 11-37, f_h 0.0066-0.0068
#> 
#> ── Weights ─────────────────────────────────────────────────────────────────────
#> • Mean 148.57 [146.5, 151.22] | CV 0.01 | DEFF 1 | n_eff 300

# 5. Collect data (simulated here)
set.seed(99)
field_data <- tibble(
  ea_id = ea_sample$ea_id,
  consumption_per_capita = round(
    rlnorm(nrow(ea_sample), meanlog = 11, sdlog = 0.65)
  )
)

# 6. Merge collected data
ea_sample <- ea_sample |>
  left_join(field_data, by = "ea_id")

# 7. Export and analyze
svy <- as_svydesign(ea_sample)
svymean(~consumption_per_capita, svy)
#>                         mean     SE
#> consumption_per_capita 70677 2811.7
confint(svymean(~consumption_per_capita, svy))
#>                           2.5 %  97.5 %
#> consumption_per_capita 65166.31 76187.8
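The interval printed by confint() is the usual normal-approximation interval, estimate ± z × SE. A quick check in base R, using the rounded values printed above, so the bounds agree with the output only to within rounding:

```r
est <- 70677        # mean printed by svymean()
se <- 2811.7        # its standard error
z <- qnorm(0.975)   # two-sided 95% normal quantile

ci <- est + c(-1, 1) * z * se
ci
```

Both bounds land within one unit of the confint() output. The small gap comes from the mean and standard error being rounded in the printed svymean() result.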

Reference

Särndal, Carl-Erik, Bengt Swensson, and Jan Wretman. 1992. Model Assisted Survey Sampling. Springer.