We want to show in this document how to pull data from HDX into R using the HDX Python API. There are many ways to approach this task, but the most elegant one is based on the R package reticulate, which
allows you to easily wrap any Python code through a nicely written API.
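To see how this works in practice, here is a tiny illustration using Python's built-in os module (nothing to do with HDX yet): import() exposes a Python module as an R object, and $ plays the role of Python's . operator.
library(reticulate)
os <- import("os")                 # Python: import os
os$path$join("data", "acled.xlsx") # Python: os.path.join("data", "acled.xlsx")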
Here is a small example of how we can use reticulate
to get data into R. The first step is to load the R packages we will need for this analysis:
- reticulate to wrap the HDX Python library
- tidyverse, readxl and sf to read and manipulate XLSX sheets and simple features (spatial vector data)
library(reticulate)
library(tidyverse)
library(readxl)
library(sf)
I assume that you have R and all the packages above installed; you also need to install the HDX Python library.
pip install hdx-python-api
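Alternatively, if you prefer to stay in R and your version of reticulate provides the py_install() helper, the same installation can be done from the R console (an optional convenience, equivalent to the pip call above):
reticulate::py_install("hdx-python-api")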
And we can run the first example from the tutorial on the GitHub page of the API:
from hdx.hdx_configuration import Configuration
from hdx.data.dataset import Dataset
Configuration.create(hdx_site='prod', hdx_read_only=True)
dataset = Dataset.read_from_hdx('acled-conflict-data-for-africa-realtime-2016')
print(dataset.get_dataset_date())
#> 2016-12-03
The equivalent code in R using reticulate
is the following:
hdx <- import("hdx")
hdx$hdx_configuration$Configuration$create(hdx_site = "prod", hdx_read_only = TRUE)
#> [1] "https://data.humdata.org/"
dataset <- hdx$data$dataset$Dataset$read_from_hdx('acled-conflict-data-for-africa-realtime-2016')
print(dataset$get_dataset_date())
#> [1] "2016-12-03"
We can save some keystrokes by writing R functions that perform the same operations directly. We will create functions to connect to an HDX server, and to search for and read a dataset.
hdx_setup <- function(hdx_site = "prod", hdx_read_only = TRUE) {
conf <- hdx$hdx_configuration$Configuration
conf$delete()
conf$create(hdx_site = hdx_site, hdx_read_only = hdx_read_only)
}
read_hdx <- function(x) {
dataset <- hdx$data$dataset$Dataset
dataset$read_from_hdx(x)
}
search_hdx <- function(x, rows = 10L) {
dataset <- hdx$data$dataset$Dataset
dataset$search_in_hdx(x, rows = rows)
}
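Following the same pattern, we could also sketch a wrapper around the download step (a sketch only; the rest of this post downloads resources directly). It assumes, as the examples below do, that download() returns a list whose second element is the path of the downloaded file:
download_hdx <- function(dataset, index = 1L) {
  resource <- dataset$get_resources()[[index]]
  dwnld <- resource$download()
  dwnld[[2]] ## path of the downloaded file
}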
Let’s test these functions using the ACLED data from 2016.
hdx_setup("prod", hdx_read_only = TRUE)
#> [1] "https://data.humdata.org/"
acled_ds <- read_hdx('acled-conflict-data-for-africa-realtime-2016')
We have read a dataset, but each dataset can have one or many resources. In order to read the file into R, we first need to check the resources and their file types. For this example, we choose the first resource, which is an XLSX file. The readxl
package, and particularly its read_excel
function, can be used to open it in R.
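A quick way to survey all the resources at once is to map over them and ask each one for its file type (map_chr comes from purrr, which is loaded with the tidyverse):
## list the file type of every resource in the dataset
map_chr(acled_ds$get_resources(), function(r) r$get_file_type())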
resources <- acled_ds$get_resources()
resources[[1]]$get_file_type()
#> [1] "XLSX"
dwnld <- resources[[1]]$download()
acled <- read_excel(dwnld[[2]])
glimpse(acled)
#> Observations: 14,827
#> Variables: 25
#> $ GWNO <dbl> 615, 615, 615, 615, 615, 615, 615, 615, 615, ...
#> $ EVENT_ID_CNTY <chr> "8701RTA", "8702RTA", "14606RTA", "14607RTA",...
#> $ EVENT_ID_NO_CNTY <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
#> $ EVENT_DATE <dttm> 2011-01-31, 2011-02-09, 2015-07-05, 2015-07-...
#> $ YEAR <dbl> 2011, 2011, 2015, 2015, 2015, 2015, 2015, 201...
#> $ TIME_PRECISION <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
#> $ EVENT_TYPE <chr> "Remote violence", "Remote violence", "Remote...
#> $ ACTOR1 <chr> "Military Forces of Algeria (1999-)", "Uniden...
#> $ ALLY_ACTOR_1 <chr> NA, NA, NA, NA, NA, NA, NA, NA, "CNES: Nation...
#> $ INTER1 <dbl> 1, 3, 3, 2, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
#> $ ACTOR2 <chr> "Unidentified Armed Group (Algeria)", "Civili...
#> $ ALLY_ACTOR_2 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
#> $ INTER2 <dbl> 3, 7, 7, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
#> $ INTERACTION <dbl> 13, 37, 37, 12, 13, 60, 60, 60, 60, 60, 60, 6...
#> $ COUNTRY <chr> "Algeria", "Algeria", "Algeria", "Algeria", "...
#> $ ADMIN1 <chr> "Tebessa", "Djelfa", "Sidi Bel Abbes", "Ain D...
#> $ ADMIN2 <chr> "Bir El Ater", "Messaad", "Sidi Bel Abbes", "...
#> $ ADMIN3 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
#> $ LOCATION <chr> "Bir El-Ater", "Messaad", "Sidi Bel Abbes", "...
#> $ LATITUDE <dbl> 34.74488, 34.15429, 35.19390, 36.01064, 36.38...
#> $ LONGITUDE <dbl> 8.060240, 3.503093, -0.641400, 2.366121, 3.90...
#> $ GEO_PRECISION <dbl> 2, 1, 1, 1, 2, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, ...
#> $ SOURCE <chr> "Tout sur L'Algerie", "Xinhua General News Se...
#> $ NOTES <chr> "Two soldiers were seriously wounded by the e...
#> $ FATALITIES <dbl> 0, 5, 0, 16, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
Once the data is in R, we can play with it, summarize it and visualize it:
acled %>%
group_by(COUNTRY) %>%
summarise(tot_fatalities = sum(FATALITIES, na.rm = TRUE)) %>%
ggplot(aes(fct_reorder(COUNTRY, tot_fatalities), tot_fatalities)) +
geom_point() +
labs(x = "") +
coord_flip()
In HDX, there is a lot of really nice spatial data. Let's see how to get some of it into R and manipulate it using the sf
R package. Let's use Mali's official administrative boundaries. The first step is to search for the dataset and its resources:
res <- search_hdx("Mali Administrative Boundaries", rows = 10L)
mli_adm_ds <- res[[1]]
resources <- mli_adm_ds$get_resources()
map_chr(resources, function(r) r$data$description)
#> [1] "Admin Levels 0 - 4 for Mali COD "
#> [2] "Admin Levels 0 - 4 for Mali COD "
#> [3] "This dataset is a WKT file (embedded in CSV) of COD administrative boundaries for Mali"
#> [4] "Admin Levels 0 for Mali COD - polygon shapefile "
#> [5] "Admin Levels 1 for Mali COD - polygon shapefile "
#> [6] "Admin Levels 2 for Mali COD - polygon shapefile "
#> [7] "Admin Levels 3 for Mali COD - polygon shapefile "
#> [8] "Admin Levels 4 for Mali COD - polygon shapefile "
#> [9] "This map service contains OCHA Common Operational Datasets for Mali: Administrative Boundaries and Populated Places. The service is available as ESRI Map, ESRI Feature, WMS, and KML Services. "
#> [10] "This map service contains OCHA Common Operational Datasets for Mali: Administrative Boundaries and Populated Places. The service is available as ESRI Map, ESRI Feature, WMS, and KML Services. "
#> [11] "This map service is intended as a labelling layer for PCODES from OCHA's Common Operational Datasets for Mali. It also provides access to Administrative Boundaries geometry. The service is available as ESRI Map, WMS, WFS and KML Services. The map service pcode labels are intended to be used in conjunction with the basemap services located at http://gistmaps.itos.uga.edu/arcgis/rest/services/COD_External/MLI_FR/MapServer \n"
#> [12] "This map service is intended as a labelling layer for PCODES from OCHA's Common Operational Datasets for Mali. It also provides access to Administrative Boundaries geometry. The service is available as ESRI Map, WMS, WFS and KML Services. The map service pcode labels are intended to be used in conjunction with the basemap services located at http://gistmaps.itos.uga.edu/arcgis/rest/services/COD_External/MLI_FR/MapServer"
From this output, the 4th resource is the one we are looking for: the country boundaries as a zipped shapefile. sf
relies on GDAL to read and manipulate spatial data, and GDAL (through OGR) can read zipped shapefiles directly using the /vsizip
virtual file system handler.
dwnld <- resources[[4]]$download()
file_to_read <- file.path("/vsizip", dwnld[[2]])
st_layers(file_to_read)
#> Driver: ESRI Shapefile
#> Available layers:
#> layer_name geometry_type features fields
#> 1 mli_admbnda_adm0_gov Polygon 1 11
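Note that /vsizip requires GDAL virtual file system support; if your GDAL build lacks it, a simple fallback (not used in the rest of this post) is to unzip the archive and read the extracted .shp file directly:
## fallback if /vsizip is not available
tmp_dir <- tempfile()
unzip(dwnld[[2]], exdir = tmp_dir)
shp_file <- list.files(tmp_dir, pattern = "\\.shp$", full.names = TRUE)
read_sf(shp_file[1])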
Once the layer is checked, we can read our file using the read_sf
function from the sf
package.
mli <- read_sf(file_to_read)
mli
#> Simple feature collection with 1 feature and 11 fields
#> geometry type: POLYGON
#> dimension: XY
#> bbox: xmin: -12.23924 ymin: 10.14137 xmax: 4.24467 ymax: 24.99951
#> epsg (SRID): 4326
#> proj4string: +proj=longlat +datum=WGS84 +no_defs
#> # A tibble: 1 x 12
#> OBJECTID_1 admin0Name admin0Pcod admin0RefN admin0AltN admin0Al_1
#> <dbl> <chr> <chr> <chr> <chr> <chr>
#> 1 1 Mali ML Mali <NA> <NA>
#> # ... with 6 more variables: date <date>, validOn <date>, validTo <date>,
#> # Shape_Leng <dbl>, Shape_Area <dbl>, geometry <simple_feature>
We can even plot it using ggplot2.
ggplot(mli) +
geom_sf()
We will do the same drill but this time with the Mali road network dataset in HDX.
res <- search_hdx("Mali roads wfp", rows = 10L)
mli_roads_ds <- res[[1]]
resources <- mli_roads_ds$get_resources()
resources[[1]]$get_file_type() ### Zipped shapefile
#> [1] "zipped shapefile"
dwnld <- resources[[1]]$download() ## download the file (dwnld[[2]] holds its path)
file_to_read <- file.path("/vsizip", dwnld[[2]])
st_layers(file_to_read) ### check the layers
#> Driver: ESRI Shapefile
#> Available layers:
#> layer_name geometry_type features fields
#> 1 RouteMali_Tronçons Line String 1852 44
mli_roads <- read_sf(file_to_read)
mli_roads
#> Simple feature collection with 1852 features and 44 fields
#> geometry type: MULTILINESTRING
#> dimension: XY
#> bbox: xmin: -12.2424 ymin: 10.20723 xmax: 3.8902 ymax: 25.00003
#> epsg (SRID): 4326
#> proj4string: +proj=longlat +datum=WGS84 +no_defs
#> # A tibble: 1,852 x 45
#> OBJECTID_1 FID_1 OBJECTID SOURCEID EXS NOTES ONME RTENME
#> <dbl> <dbl> <dbl> <chr> <dbl> <chr> <chr> <chr>
#> 1 1 0 0 srce201105130009 0 <NA> <NA> <NA>
#> 2 2 0 0 srce201105130009 0 <NA> <NA> <NA>
#> 3 3 0 0 srce201105130009 0 <NA> <NA> <NA>
#> 4 4 0 0 srce201105130009 0 <NA> <NA> <NA>
#> 5 5 0 0 srce201105130009 0 <NA> <NA> <NA>
#> 6 6 0 0 srce201105130009 0 <NA> <NA> <NA>
#> 7 7 0 0 srce201105130009 0 <NA> <NA> <NA>
#> 8 8 0 0 srce201105130009 0 <NA> <NA> <NA>
#> 9 9 0 0 srce201105130009 0 <NA> <NA> <NA>
#> 10 10 0 0 srce201105130009 0 <NA> <NA> <NA>
#> # ... with 1,842 more rows, and 37 more variables: NTLCLASS <chr>,
#> # FCLASS <dbl>, CRGWAY <dbl>, NUMLANES <dbl>, LNEWIDTHM <dbl>,
#> # RDWIDTHM <dbl>, AXLELOADMT <dbl>, TOTLOADMT <dbl>, SRFTPE <dbl>,
#> # SRFCOND <dbl>, SRFPREP <dbl>, ISSEASONAL <dbl>, CURNTPRAC <dbl>,
#> # GDWTHRPRAC <dbl>, BDWTHRPRAC <dbl>, SPEEDLIMIT <dbl>,
#> # CURNTSPEED <dbl>, GNRALSPEED <dbl>, ISUNDRCSTR <dbl>,
#> # CSTWRKETC <date>, GRADDEG <dbl>, SEC <dbl>, HASSHOULDR <dbl>,
#> # HASSIDEWLK <dbl>, DRIVSIDE <dbl>, ISELEVATED <dbl>, HASMEDIAN <dbl>,
#> # OPSTATUS <dbl>, SHAPE_LENG <dbl>, ADM0_CODE <dbl>, ADM0_NAME <chr>,
#> # CONTINENT <chr>, REGION <chr>, ROADID <dbl>, Shape_Le_1 <dbl>,
#> # Distance <dbl>, geometry <simple_feature>
ggplot(mli) +
geom_sf() +
geom_sf(data = mli_roads, colour = "steelblue")
Finally, in a few lines of code, we can also overlay the number of fatalities from the ACLED database on top of the road network.
acled %>%
filter(COUNTRY == "Mali") %>%
select(year = YEAR, lat = LATITUDE, long = LONGITUDE, fatal = FATALITIES) %>%
ggplot() +
geom_sf(data = mli) +
geom_sf(data = mli_roads) +
geom_point(aes(long, lat, size = fatal))
We have shown how the HDX Python API can be used from R to directly access the 6,000 datasets on HDX and start some exploratory analysis. The next step will be to develop an R package that allows the R community to easily access HDX datasets and also push data to HDX from R.