Introduction to R

class: center, middle, inverse, title-slide

# Introduction to R
## and the concept of “tidy data”
### Ahmadou Dicko
### 22 March 2018

---

## What we will cover together

- What is R?
- Getting started with R
- The notion of tidy data, put into practice

---
class: center, middle, inverse
## What is R?

.center[![R logo](./images/Rlogo.png)]

---
## What is R?

- R is an interpreted programming language
- Some key dates:
  - Early 1990s: Ross Ihaka and Robert Gentleman begin developing `R`
  - 1995: The project becomes open source (GPL)
  - 2000: Version 1.0 of `R` is released
  - 2017: `R` 3.4 is released in April, and there are roughly 11,000 packages (add-ons)

---
## What is R?

- `R` is free and open source
- `R` has a large community around it
- `R` has good tools for data manipulation
- `R` graphics are of very high quality

---
class: center, middle, inverse
## Tidy data

---
## The notion of tidy data
  
> “Happy families are all alike; every unhappy family is unhappy in its own way.” – Leo Tolstoy
  
> “Tidy datasets are all alike, but every messy dataset is messy in its own way.” – Hadley Wickham

---
## The notion of tidy data

1. Each variable has its own column
2. Each observation has its own row
3. Each value has its own cell

.center[![tidy data](./images/tidy-data.png)]
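
As a small illustration of the three rules, here is a sketch that builds a tidy table by hand, using the country, year, cases and population values shown in the first "Tidy or not?" table of this deck. The object name `tb_cases` and the derived `rate` column (cases per 10,000 people) are assumptions made for this example only.

```r
library(tibble)

# One row per observation (a country in a given year),
# one column per variable, one value per cell.
tb_cases <- tibble(
  country    = c("Afghanistan", "Afghanistan", "Brazil", "Brazil", "China", "China"),
  year       = c(1999, 2000, 1999, 2000, 1999, 2000),
  cases      = c(745, 2666, 37737, 80488, 212258, 213766),
  population = c(19987071, 20595360, 172006362, 174504898, 1272915272, 1280428583)
)

# Because every variable is a column, a derived variable is one vectorised expression.
# `rate` is a hypothetical column added for illustration.
tb_cases$rate <- tb_cases$cases / tb_cases$population * 10000

tb_cases
```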

---
## Tidy or not?
<div id="htmlwidget-3a7f2142392d7a46a195" style="width:100%;height:auto;" class="datatables html-widget"></div>
<script type="application/json" data-for="htmlwidget-3a7f2142392d7a46a195">{"x":{"filter":"none","data":[["1","2","3","4","5","6"],["Afghanistan","Afghanistan","Brazil","Brazil","China","China"],[1999,2000,1999,2000,1999,2000],[745,2666,37737,80488,212258,213766],[19987071,20595360,172006362,174504898,1272915272,1280428583]],"container":"<table class=\"display\">\n  <thead>\n    <tr>\n      <th> <\/th>\n      <th>country<\/th>\n      <th>year<\/th>\n      <th>cases<\/th>\n      <th>population<\/th>\n    <\/tr>\n  <\/thead>\n<\/table>","options":{"dom":"t","columnDefs":[{"className":"dt-right","targets":[2,3,4]},{"orderable":false,"targets":0}],"order":[],"autoWidth":false,"orderClasses":false}},"evals":[],"jsHooks":[]}</script>

---
## Tidy or not?
.center[![tidy data](./images/tidy-data.png)]

---
## Tidy or not?
<div id="htmlwidget-6e61f3007c3e1b066679" style="width:100%;height:auto;" class="datatables html-widget"></div>
<script type="application/json" data-for="htmlwidget-6e61f3007c3e1b066679">{"x":{"filter":"none","data":[["1","2","3","4","5","6","7","8","9","10","11","12"],["Afghanistan","Afghanistan","Afghanistan","Afghanistan","Brazil","Brazil","Brazil","Brazil","China","China","China","China"],[1999,1999,2000,2000,1999,1999,2000,2000,1999,1999,2000,2000],["cases","population","cases","population","cases","population","cases","population","cases","population","cases","population"],[745,19987071,2666,20595360,37737,172006362,80488,174504898,212258,1272915272,213766,1280428583]],"container":"<table class=\"display\">\n  <thead>\n    <tr>\n      <th> <\/th>\n      <th>country<\/th>\n      <th>year<\/th>\n      <th>type<\/th>\n      <th>count<\/th>\n    <\/tr>\n  <\/thead>\n<\/table>","options":{"dom":"t","columnDefs":[{"className":"dt-right","targets":[2,4]},{"orderable":false,"targets":0}],"order":[],"autoWidth":false,"orderClasses":false}},"evals":[],"jsHooks":[]}</script>

---
## Tidy or not?

.center[![tidy data](./images/tidy-5.png)]

---
## Tidy or not?
<div id="htmlwidget-806185ade0d0af05986e" style="width:100%;height:auto;" class="datatables html-widget"></div>
<script type="application/json" data-for="htmlwidget-806185ade0d0af05986e">{"x":{"filter":"none","data":[["1","2","3","4","5","6"],["Afghanistan","Afghanistan","Brazil","Brazil","China","China"],[1999,2000,1999,2000,1999,2000],["745/19987071","2666/20595360","37737/172006362","80488/174504898","212258/1272915272","213766/1280428583"]],"container":"<table class=\"display\">\n  <thead>\n    <tr>\n      <th> <\/th>\n      <th>country<\/th>\n      <th>year<\/th>\n      <th>rate<\/th>\n    <\/tr>\n  <\/thead>\n<\/table>","options":{"dom":"t","columnDefs":[{"className":"dt-right","targets":2},{"orderable":false,"targets":0}],"order":[],"autoWidth":false,"orderClasses":false}},"evals":[],"jsHooks":[]}</script>

---
## Tidy or not?

.center[![tidy data](./images/tidy-6.png)]

---
## The notion of tidy data

Tidy data make it easier to:

- Manipulate the data (pivoting, etc.)
- Combine them with other data (joins)
- Visualise the data
- Export the data (e.g. to a Postgres database)
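
To make the "combine with other data" point concrete, here is a minimal sketch of a join followed by a grouped summary. The `cases` values come from the tables in this deck; the `regions` lookup table and the names `cases_by_continent` and `total_cases` are invented for this example.

```r
library(dplyr)
library(tibble)

# Tidy table: one row per country for 1999 (values from the deck's tables).
cases <- tibble(
  country = c("Afghanistan", "Brazil", "China"),
  year    = 1999,
  cases   = c(745, 37737, 212258)
)

# Hypothetical lookup table, created only for this illustration.
regions <- tibble(
  country   = c("Afghanistan", "Brazil", "China"),
  continent = c("Asia", "Americas", "Asia")
)

# Join on the shared key column, then aggregate per continent.
cases_by_continent <- cases %>%
  left_join(regions, by = "country") %>%
  group_by(continent) %>%
  summarise(total_cases = sum(cases))

cases_by_continent
```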

---
class: center, middle, inverse
## The tidyverse

---
## The tidyverse

**The packages**

- `tidyr` (making data tidy)
- `readr` (reading tabular data)
- `tibble` (data structure)
- `purrr` (functional programming)
- `dplyr` (data manipulation)
- `ggplot2` (graphics)

.center[![tidy logos](./images/tidyverse.jpg)]

---
## Installing them all

You can get all the tidyverse packages by running this command:

```r
install.packages("tidyverse") # download and install the packages
```

```r
library(tidyverse) # load the core tidyverse packages
```
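
Re-running `install.packages()` every time a script is executed is usually unnecessary. One possible pattern, sketched here, is to check first which packages are missing. The helper `missing_pkgs()` and the variable `to_install` are hypothetical names, and the `interactive()` guard is a design choice for this example so that nothing is downloaded during a non-interactive run.

```r
# Hypothetical helper: return the packages in `pkgs` that are not installed yet.
missing_pkgs <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_pkgs(c("tidyverse", "readxl"))

# Only install what is missing, and only from an interactive session.
if (length(to_install) > 0 && interactive()) {
  install.packages(to_install)
}
```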

---
## An analyst's workflow

.center[![Data science exploration general workflow](./images/data-science.png)]

---
## Reading data
.center[![wrangle](./images/data-science-wrangle.png)]

---
## Opening an Excel file

```r
library(readxl)
```

.center[![readxl logos](./images/readxl.png)]

---
## Opening an Excel file

```r
raw_tb <- read_excel(path = "data/tabular/tidy_data.xlsx", sheet = 2)
raw_tb # print the imported tibble
```

```
## # A tibble: 12 x 4
##    country      year key              value
##    <chr>       <dbl> <chr>            <dbl>
##  1 Afghanistan 1999. cases             745.
##  2 Afghanistan 1999. population   19987071.
##  3 Afghanistan 2000. cases            2666.
##  4 Afghanistan 2000. population   20595360.
##  5 Brazil      1999. cases           37737.
##  6 Brazil      1999. population  172006362.
##  7 Brazil      2000. cases           80488.
##  8 Brazil      2000. population  174504898.
##  9 China       1999. cases          212258.
## 10 China       1999. population 1272915272.
## 11 China       2000. cases          213766.
## 12 China       2000. population 1280428583.
```

---
## Making data tidy

```r
library(tidyr)
```

.center[![tidyr logos](./images/tidyr.png)]

---
## `tidyr::spread`

.center[![spread data](./images/tidy-8.png)]

---
## `tidyr::spread`

```r
# one new column per distinct value of `key`, filled from `value`
spread(raw_tb, key, value)
```

```
## # A tibble: 6 x 4
##   country      year   cases  population
##   <chr>       <dbl>   <dbl>       <dbl>
## 1 Afghanistan 1999.    745.   19987071.
## 2 Afghanistan 2000.   2666.   20595360.
## 3 Brazil      1999.  37737.  172006362.
## 4 Brazil      2000.  80488.  174504898.
## 5 China       1999. 212258. 1272915272.
## 6 China       2000. 213766. 1280428583.
```
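
In tidyr 1.0.0 and later, `spread()` is superseded by `pivot_wider()`. The sketch below shows what an equivalent call might look like on a small hand-built subset of the data above; `long_tb` and `wide_tb` are names chosen for this example and the data are not read from the Excel file.

```r
library(tidyr)
library(tibble)

# Hand-built subset of the long table shown above.
long_tb <- tibble(
  country = rep(c("Afghanistan", "Brazil"), each = 2),
  year    = 1999,
  key     = rep(c("cases", "population"), times = 2),
  value   = c(745, 19987071, 37737, 172006362)
)

# names_from plays the role of spread()'s `key`, values_from the role of `value`.
wide_tb <- pivot_wider(long_tb, names_from = key, values_from = value)

wide_tb
```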

---
## Opening an Excel file

```r
raw_tb <- read_excel(path = "data/tabular/tidy_data.xlsx", sheet = 4)
raw_tb # print the imported tibble
```

```
## # A tibble: 3 x 3
##   country      `1999`  `2000`
##   <chr>         <dbl>   <dbl>
## 1 Afghanistan    745.   2666.
## 2 Brazil       37737.  80488.
## 3 China       212258. 213766.
```

---
## `tidyr::gather`

.center[![gather data](./images/tidy-9.png)]

---
## `tidyr::gather`

```r
gather(raw_tb, key = "year", value = "cases", 2:3) # gather columns 2 and 3 (`1999` and `2000`)
```

```
## # A tibble: 6 x 3
##   country     year    cases
##   <chr>       <chr>   <dbl>
## 1 Afghanistan 1999     745.
## 2 Brazil      1999   37737.
## 3 China       1999  212258.
## 4 Afghanistan 2000    2666.
## 5 Brazil      2000   80488.
## 6 China       2000  213766.
```
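
Note that `gather()` leaves `year` as a character column. As a sketch of an alternative, `pivot_longer()` (the successor of `gather()` in tidyr 1.0.0 and later) can convert it on the fly with `names_transform`, which assumes tidyr 1.1.0 or later. The table is rebuilt by hand here, and `wide_tb` and `long_tb` are names chosen for this example. Unlike `gather()`, the result is ordered row by row (Afghanistan 1999, Afghanistan 2000, and so on).

```r
library(tidyr)
library(tibble)

# Hand-built copy of the wide table shown above.
wide_tb <- tibble(
  country = c("Afghanistan", "Brazil", "China"),
  `1999`  = c(745, 37737, 212258),
  `2000`  = c(2666, 80488, 213766)
)

long_tb <- pivot_longer(
  wide_tb,
  cols            = c(`1999`, `2000`),       # columns to stack
  names_to        = "year",                  # old column names go here
  values_to       = "cases",                 # cell values go here
  names_transform = list(year = as.integer)  # "1999" -> 1999L
)

long_tb
```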

---
## Opening a CSV file

```r
library(readr)
```

.center[![readr logos](./images/readr.png)]

---
## Opening a CSV file

```r
raw_weather <- read_csv(file = "data/tabular/weather_tmax.csv", na = "NA")
raw_weather # print the imported tibble
```

```
## # A tibble: 11 x 10
##    id           year month    d1    d2    d3    d4    d5    d6    d7
##    <chr>       <int> <int> <int> <int> <int> <int> <int> <int> <int>
##  1 MX000017004  2010     1    NA    NA    NA    NA    NA    NA    NA
##  2 MX000017004  2010     2    NA   273   241    NA    NA    NA    NA
##  3 MX000017004  2010     3    NA    NA    NA    NA   321    NA    NA
##  4 MX000017004  2010     4    NA    NA    NA    NA    NA    NA    NA
##  5 MX000017004  2010     5    NA    NA    NA    NA    NA    NA    NA
##  6 MX000017004  2010     6    NA    NA    NA    NA    NA    NA    NA
##  7 MX000017004  2010     7    NA    NA   286    NA    NA    NA    NA
##  8 MX000017004  2010     8    NA    NA    NA    NA   296    NA    NA
##  9 MX000017004  2010    10    NA    NA    NA    NA   270    NA   281
## 10 MX000017004  2010    11    NA   313    NA   272   263    NA    NA
## 11 MX000017004  2010    12   299    NA    NA    NA    NA   278    NA
```

---
## Making data tidy

```r
# stack the day columns d1..d31 into (day, tmax) pairs, dropping missing values
gather(raw_weather, key = "day", value = "tmax", d1:d31, na.rm = TRUE)
```

```
## # A tibble: 33 x 5
##    id           year month day   tmax 
##  * <chr>       <int> <int> <chr> <chr>
##  1 MX000017004  2010    12 d1    299  
##  2 MX000017004  2010     2 d2    273  
##  3 MX000017004  2010    11 d2    313  
##  4 MX000017004  2010     2 d3    241  
##  5 MX000017004  2010     7 d3    286  
##  6 MX000017004  2010    11 d4    272  
##  7 MX000017004  2010     3 d5    321  
##  8 MX000017004  2010     8 d5    296  
##  9 MX000017004  2010    10 d5    270  
## 10 MX000017004  2010    11 d5    263  
## # ... with 23 more rows
```
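
The following slides use an object called `clean_weather`, whose construction is not shown in this deck. The sketch below is one plausible way to obtain such a table, assuming it is the gathered data with `day` converted to an integer and `tmax` to a number. The stand-in `raw_weather` is built by hand from two of the printed rows and only covers `d1` to `d3`.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Small stand-in for raw_weather (values copied from the printed rows above).
raw_weather <- tibble(
  id    = "MX000017004",
  year  = 2010L,
  month = c(2L, 12L),
  d1    = c(NA, 299L),
  d2    = c(273L, NA),
  d3    = c(241L, NA)
)

# Assumed cleaning steps: gather, then turn "d2" into 2 and make tmax numeric.
clean_weather <- raw_weather %>%
  gather(key = "day", value = "tmax", d1:d3, na.rm = TRUE) %>%
  mutate(
    day  = as.integer(sub("^d", "", day)),
    tmax = as.numeric(tmax)
  ) %>%
  arrange(year, month, day)

clean_weather
```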

---
class: center, middle, inverse
## What else the tidyverse offers

---
class: center, middle, inverse
## Data manipulation with dplyr

---
## Data manipulation with dplyr

```r
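# clean_weather: the gathered weather data with numeric day and tmax
# (its construction is not shown in this deck)
clean_weather %>%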
  group_by(month) %>%
  summarise(tmax_moy = mean(tmax), tmax = max(tmax))
```

```
## # A tibble: 11 x 3
##    month tmax_moy  tmax
##    <int>    <dbl> <dbl>
##  1     1     278.  278.
##  2     2     278.  299.
##  3     3     326.  345.
##  4     4     363.  363.
##  5     5     332.  332.
##  6     6     290.  301.
##  7     7     292.  299.
##  8     8     283.  298.
##  9    10     289.  312.
## 10    11     281.  313.
## 11    12     288.  299.
```

---
class: center, middle, inverse
## Plotting with ggplot2

---
## Plotting with ggplot2

```r
clean_weather %>%
  unite(date, year, month, day, sep = "-") %>%
  mutate(date = as.Date(date)) %>%
  ggplot(aes(date, tmax)) +
    geom_line()
```

![](index_files/figure-html/unnamed-chunk-19-1.png)

---
class: center, middle, inverse
## Interacting with a DBMS

---
## Interacting with a DBMS

```r
library(DBI)    # generic database interface
library(dbplyr) # dplyr backend that translates verbs to SQL
conn <- dbConnect(RSQLite::SQLite(), "data/db/taille.db") # open the SQLite file
```

```r
dbListTables(conn)
```

```
## [1] "taille"
```

```r
dbListFields(conn, "taille")
```

```
## [1] "nom"    "taille"
```

---
## Interacting with a DBMS

```r
dbGetQuery(conn, "SELECT * FROM taille")
```

```
##     nom taille
## 1   Ali    170
## 2 Modou    185
## 3 Marie    165
```

```r
dbGetQuery(conn, "SELECT nom from taille WHERE taille > 165")
```

```
##     nom
## 1   Ali
## 2 Modou
```
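
When the threshold comes from a variable rather than being typed into the SQL string, a parameterised query avoids pasting values into the statement. The sketch below uses the `params` argument of `dbGetQuery()` with a `?` placeholder. It runs against an in-memory SQLite database filled with the three rows shown above, and `min_taille` and `res` are hypothetical names.

```r
library(DBI)

# In-memory SQLite database standing in for data/db/taille.db.
conn <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(conn, "taille", data.frame(
  nom    = c("Ali", "Modou", "Marie"),
  taille = c(170L, 185L, 165L)
))

# Hypothetical threshold supplied by the user.
min_taille <- 165

# The `?` placeholder is bound to the value in `params`.
res <- dbGetQuery(
  conn,
  "SELECT nom FROM taille WHERE taille > ?",
  params = list(min_taille)
)

dbDisconnect(conn)
res
```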

---
## Interacting with a DBMS

```r
df <- tbl(conn, "taille")
df
```

```
## # Source:   table<taille> [?? x 2]
## # Database: sqlite 3.19.3 [/builds/dickoa/intro-tidy/data/db/taille.db]
##   nom   taille
##   <chr>  <int>
## 1 Ali      170
## 2 Modou    185
## 3 Marie    165
```

---
## Interacting with a DBMS

```r
df %>%
  filter(taille > 165) %>%
  select(nom)
```

```
## # Source:   lazy query [?? x 1]
## # Database: sqlite 3.19.3 [/builds/dickoa/intro-tidy/data/db/taille.db]
##   nom  
##   <chr>
## 1 Ali  
## 2 Modou
```
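
The "lazy query" header means that nothing has been fetched yet: dplyr verbs on a database table are translated to SQL and only executed when results are needed. As a sketch, `show_query()` prints the generated SQL and `collect()` brings the rows into R. The in-memory database and the names `lazy_q` and `res` are assumptions made for this example.

```r
library(DBI)
library(dplyr)
library(dbplyr)

# In-memory SQLite database standing in for data/db/taille.db.
conn <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(conn, "taille", data.frame(
  nom    = c("Ali", "Modou", "Marie"),
  taille = c(170L, 185L, 165L)
))

# Build the query lazily; no rows are retrieved at this point.
lazy_q <- tbl(conn, "taille") %>%
  filter(taille > 165) %>%
  select(nom)

show_query(lazy_q)      # print the SQL that dbplyr generated
res <- collect(lazy_q)  # execute the query and return a local tibble

dbDisconnect(conn)
res
```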

---
## Interacting with a DBMS

```r
df %>%
  filter(taille > 165) %>%
  select(nom) %>%
  collect()
```

```
## # A tibble: 2 x 1
##   nom  
##   <chr>
## 1 Ali  
## 2 Modou
```

---
class: center, middle, inverse
## Spatial data with R