A synthetic school survey frame inspired by education census and survey data. It uses real Tanzanian regions and districts, but all data values are fictional.
tanzania_schools
A tibble with approximately 2,500 rows and 9 columns:
school_id: Character. Unique school identifier
region: Factor. Region name (7 regions)
district: Factor. District name
school_level: Factor. Primary or Secondary
ownership: Factor. Government or Private
enrollment: Integer. Total student enrollment (measure of size)
n_teachers: Integer. Number of teachers
has_electricity: Logical. Whether school has electricity
has_water: Logical. Whether school has water supply
This dataset is designed for demonstrating:
Education surveys
Two-stage sampling (schools then students)
PPS sampling using enrollment
Stratification by school level and ownership
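As a rough illustration of what "PPS sampling using enrollment" means, the base R sketch below computes first-order inclusion probabilities proportional to size. It reuses the six enrollment values printed by head(tanzania_schools) in the examples; the sample size of 3 is an arbitrary choice for illustration, and the sketch does not reproduce any package's selection algorithm.

```r
# Enrollment of the first six schools, as printed by head(tanzania_schools)
enrollment <- c(364, 303, 179, 455, 520, 145)
n <- 3  # illustrative number of schools to select (an assumption)

# First-order inclusion probability under PPS: pi_i = n * x_i / sum(x)
pik <- n * enrollment / sum(enrollment)

# The probabilities sum to the sample size and none exceeds 1 here
stopifnot(isTRUE(all.equal(sum(pik), n)), all(pik <= 1))

# The design weight of a selected school is the inverse of its probability
weights <- 1 / pik
round(data.frame(enrollment, pik, weights), 3)
```

If n * x_i / sum(x) exceeded 1 for a very large school, that school would normally be taken with certainty and the remaining probabilities recomputed; the sketch above does not handle that case.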
The dataset reflects typical characteristics of East African education systems: primary schools outnumber secondary schools, and infrastructure varies by urban or rural location.
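To make the stratification use case concrete, here is a minimal base R sketch of proportional allocation across the four school_level by ownership strata. The stratum counts are copied from the cross-tabulation shown in the examples; the total sample size of 100 and the largest-remainder rounding rule are assumptions chosen for illustration.

```r
# Stratum sizes copied from the school_level x ownership cross-tabulation
N_h <- c(
  Primary_Government   = 1433,
  Primary_Private      = 345,
  Secondary_Government = 514,
  Secondary_Private    = 142
)
n_total <- 100  # hypothetical overall sample size

# Proportional allocation before rounding: n_h = n * N_h / N
exact <- n_total * N_h / sum(N_h)

# Largest-remainder rounding so the allocations sum exactly to n_total
n_h <- floor(exact)
shortfall <- n_total - sum(n_h)
top_up <- order(exact - n_h, decreasing = TRUE)[seq_len(shortfall)]
n_h[top_up] <- n_h[top_up] + 1

n_h
```

Because primary government schools dominate the frame, they also receive most of the sample under this rule; a design that needs precise estimates for small strata such as private secondary schools would typically oversample them instead.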
This is a synthetic dataset. Regions and districts are real but all data values are fictional.
# Explore the data
head(tanzania_schools)
#> # A tibble: 6 × 9
#>   school_id  region district    school_level ownership  enrollment n_teachers
#>   <chr>      <fct>  <fct>       <fct>        <fct>           <dbl>      <dbl>
#> 1 TZ_04_0001 Arusha Arusha City Primary      Government        364         10
#> 2 TZ_04_0002 Arusha Arusha City Primary      Government        303          6
#> 3 TZ_04_0003 Arusha Arusha City Primary      Government        179          5
#> 4 TZ_04_0004 Arusha Arusha City Primary      Government        455         11
#> 5 TZ_04_0005 Arusha Arusha City Primary      Private           520         10
#> 6 TZ_04_0006 Arusha Arusha City Primary      Private           145          3
#> # ℹ 2 more variables: has_electricity <lgl>, has_water <lgl>
table(tanzania_schools$school_level, tanzania_schools$ownership)
#>
#>             Government Private
#>   Primary         1433     345
#>   Secondary        514     142
# Two-stage cluster sample: schools then students
sampling_design() |>
  stage(label = "Schools") |>
  stratify_by(school_level) |>
  cluster_by(school_id) |>
  draw(n = 25, method = "pps_brewer", mos = enrollment) |>
  stage(label = "Students") |>
  draw(n = 20) |>
  execute(tanzania_schools, seed = 42)
#> == tbl_sample ==
#> Weights: 7.47 - 214.92 (mean: 46.98 )
#>
#> # A tibble: 50 × 17
#>    school_id  region       district school_level ownership enrollment n_teachers
#>  * <chr>      <fct>        <fct>    <fct>        <fct>          <dbl>      <dbl>
#>  1 TZ_02_0007 Dar es Sala… Kinondo… Secondary    Private          273          7
#>  2 TZ_02_0055 Dar es Sala… Kinondo… Primary      Private          926         25
#>  3 TZ_03_0007 Dar es Sala… Temeke   Primary      Governme…        428         12
#>  4 TZ_04_0016 Arusha       Arusha … Secondary    Governme…        398          9
#>  5 TZ_04_0017 Arusha       Arusha … Secondary    Governme…        505         13
#>  6 TZ_05_0001 Arusha       Arusha … Secondary    Governme…        239          5
#>  7 TZ_05_0083 Arusha       Arusha … Primary      Governme…        439         12
#>  8 TZ_06_0069 Arusha       Meru     Primary      Governme…        546         12
#>  9 TZ_07_0028 Arusha       Monduli  Primary      Governme…        592         14
#> 10 TZ_07_0030 Arusha       Monduli  Primary      Governme…       2040         53
#> # ℹ 40 more rows
#> # ℹ 10 more variables: has_electricity <lgl>, has_water <lgl>, .weight <dbl>,
#> # .sample_id <int>, .stage <int>, .weight_2 <dbl>, .fpc_2 <int>,
#> # .weight_1 <dbl>, .fpc_1 <int>, .certainty_1 <lgl>
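The two-stage design above pairs PPS selection of schools with a fixed number of students per school. The base R sketch below shows a general property of that pairing on a toy frame: the enrollment term cancels, so every student ends up with the same overall weight. The five enrollment values and the sample sizes are illustrative assumptions, and the sketch is not a description of how the `.weight` columns in the output above are computed.

```r
# Hypothetical first-stage frame: five schools and their enrollment
enrollment <- c(179, 303, 364, 455, 520)
n_schools  <- 2   # schools drawn with PPS at stage 1 (assumption)
m_students <- 20  # students drawn by simple random sampling per school

# Stage 1: PPS inclusion probability of each school
p1 <- n_schools * enrollment / sum(enrollment)

# Stage 2: probability that a given student is drawn within a selected school
p2 <- m_students / enrollment

# Overall student weight is the product of the two stage weights
w <- (1 / p1) * (1 / p2)

# Enrollment cancels: w = sum(enrollment) / (n_schools * m_students) for all
expected <- sum(enrollment) / (n_schools * m_students)
stopifnot(isTRUE(all.equal(w, rep(expected, length(w)))))
unique(round(w, 6))
#> [1] 45.525
```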