A synthetic sampling frame of schools, modeled on education census and survey data. Region and district names are real Tanzanian administrative units; all data values are fictional.

tanzania_schools

Format

A tibble with approximately 2,500 rows and 9 columns:

school_id

Character. Unique school identifier

region

Factor. Region name (7 regions)

district

Factor. District name

school_level

Factor. Primary or Secondary

ownership

Factor. Government or Private

enrollment

Numeric, whole-number values (printed as <dbl>). Total student enrollment, used as the measure of size

n_teachers

Numeric, whole-number values (printed as <dbl>). Number of teachers

has_electricity

Logical. Whether school has electricity

has_water

Logical. Whether school has water supply

Details

This dataset is designed for demonstrating:

  • Education survey designs

  • Two-stage sampling (schools then students)

  • Probability-proportional-to-size (PPS) sampling with enrollment as the measure of size

  • Stratification by school level and ownership
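The two-stage and PPS ideas listed above can be illustrated with a small base R sketch. It does not use this package's functions and is not a description of how the package computes weights. The five-school frame, `n_schools`, and `n_students` are hypothetical values chosen for illustration; the sketch assumes schools are drawn with probability proportional to enrollment and that a fixed number of students is then drawn by simple random sampling within each school.

```r
# Toy frame of five hypothetical schools (illustrative values,
# not taken from tanzania_schools)
frame <- data.frame(
  school_id  = c("S1", "S2", "S3", "S4", "S5"),
  enrollment = c(200, 300, 500, 400, 600)
)
n_schools  <- 2   # hypothetical number of schools drawn at stage 1
n_students <- 20  # hypothetical number of students drawn per school at stage 2

# Stage 1: inclusion probability proportional to enrollment
frame$pi_1 <- n_schools * frame$enrollment / sum(frame$enrollment)

# A value above 1 would mean the school must be taken with certainty
stopifnot(all(frame$pi_1 <= 1))

# Stage 2: simple random sample of students within a selected school
frame$pi_2 <- n_students / frame$enrollment

# Overall design weight of a sampled student: inverse of the
# product of the two stage-level inclusion probabilities
frame$weight <- 1 / (frame$pi_1 * frame$pi_2)

frame
```

Under these assumptions the enrollment terms cancel, so every student carries the same weight (total enrollment divided by `n_schools * n_students`). That self-weighting property is one common reason for pairing PPS at the first stage with a fixed take at the second.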

The dataset mirrors typical characteristics of East African education systems: primary schools outnumber secondary schools, and infrastructure (electricity and water supply) varies by urban or rural location.
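Because the strata differ so much in size, the way a sample is allocated across them matters. The base R sketch below shows proportional allocation, using the stratum counts printed by `table()` in the Examples. The total sample size `n_total` is a hypothetical choice, and proportional allocation is only one of several possible rules, not necessarily the one this package applies.

```r
# Stratum sizes copied from the table() output in the Examples
N_h <- c(
  Primary_Government   = 1433,
  Primary_Private      = 345,
  Secondary_Government = 514,
  Secondary_Private    = 142
)

n_total <- 100  # hypothetical overall number of schools to sample

# Proportional allocation: each stratum's share of the sample
# matches its share of the frame
n_h <- round(n_total * N_h / sum(N_h))

# Rounding does not always preserve the total, so check it
stopifnot(sum(n_h) == n_total)

n_h
```

With these counts the allocation is 59, 14, 21, and 6 schools. The small Secondary/Private stratum receives only a handful of schools, which is why designs that need stratum-level estimates often allocate equally or oversample small strata instead.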

Note

This is a synthetic dataset. Regions and districts are real but all data values are fictional.

Examples

# Explore the data
head(tanzania_schools)
#> # A tibble: 6 × 9
#>   school_id  region district    school_level ownership  enrollment n_teachers
#>   <chr>      <fct>  <fct>       <fct>        <fct>           <dbl>      <dbl>
#> 1 TZ_04_0001 Arusha Arusha City Primary      Government        364         10
#> 2 TZ_04_0002 Arusha Arusha City Primary      Government        303          6
#> 3 TZ_04_0003 Arusha Arusha City Primary      Government        179          5
#> 4 TZ_04_0004 Arusha Arusha City Primary      Government        455         11
#> 5 TZ_04_0005 Arusha Arusha City Primary      Private           520         10
#> 6 TZ_04_0006 Arusha Arusha City Primary      Private           145          3
#> # ℹ 2 more variables: has_electricity <lgl>, has_water <lgl>
table(tanzania_schools$school_level, tanzania_schools$ownership)
#>            
#>             Government Private
#>   Primary         1433     345
#>   Secondary        514     142

# Two-stage cluster sample: schools then students
sampling_design() |>
  stage(label = "Schools") |>
    stratify_by(school_level) |>
    cluster_by(school_id) |>
    draw(n = 25, method = "pps_brewer", mos = enrollment) |>
  stage(label = "Students") |>
    draw(n = 20) |>
  execute(tanzania_schools, seed = 42)
#> == tbl_sample ==
#> Weights: 7.47 - 214.92 (mean: 46.98 )
#> 
#> # A tibble: 50 × 17
#>    school_id  region       district school_level ownership enrollment n_teachers
#>  * <chr>      <fct>        <fct>    <fct>        <fct>          <dbl>      <dbl>
#>  1 TZ_02_0007 Dar es Sala… Kinondo… Secondary    Private          273          7
#>  2 TZ_02_0055 Dar es Sala… Kinondo… Primary      Private          926         25
#>  3 TZ_03_0007 Dar es Sala… Temeke   Primary      Governme…        428         12
#>  4 TZ_04_0016 Arusha       Arusha … Secondary    Governme…        398          9
#>  5 TZ_04_0017 Arusha       Arusha … Secondary    Governme…        505         13
#>  6 TZ_05_0001 Arusha       Arusha … Secondary    Governme…        239          5
#>  7 TZ_05_0083 Arusha       Arusha … Primary      Governme…        439         12
#>  8 TZ_06_0069 Arusha       Meru     Primary      Governme…        546         12
#>  9 TZ_07_0028 Arusha       Monduli  Primary      Governme…        592         14
#> 10 TZ_07_0030 Arusha       Monduli  Primary      Governme…       2040         53
#> # ℹ 40 more rows
#> # ℹ 10 more variables: has_electricity <lgl>, has_water <lgl>, .weight <dbl>,
#> #   .sample_id <int>, .stage <int>, .weight_2 <dbl>, .fpc_2 <int>,
#> #   .weight_1 <dbl>, .fpc_1 <int>, .certainty_1 <lgl>