Introduction to R

class: center, middle, inverse, title-slide

# Introduction to R
## and the concept of “tidy data”
### Nafissatou Pouye and Ahmadou Dicko
### 13 July 2017

---

## What we will cover together

- What is R?
- Getting started with R
- The notion of tidy data, and putting it into practice

---
class: center, middle, inverse
## What is R?

.center[![R logo](./images/Rlogo.png)]

---
## What is R?

- R is an interpreted programming language
- Key dates:
  - Early 1990s: Ross Ihaka and Robert Gentleman start developing `R`
  - 1995: the project becomes open source
  - 2000: version 1.0 of `R` is released
  - 2017: `R` 3.4 is released in April, and there are around 11,000 packages (add-ons)

---
## What is R?

- `R` is free and open source
- `R` has a large community around it
- `R` has good tools for data manipulation
- `R` graphics are of very high quality

---
class: center, middle, inverse
## Getting started with R

---
## Getting started with R

The `>` symbol is called the prompt; it indicates that R is waiting for a command.
Once the command has been typed after the prompt, press "Enter" to run it. The result is displayed right below.

---
## Getting started with R
The R console can be used as a calculator.
Run:

```r
3 * pi
```

```
## [1] 9.424778
```

```r
2 * 15
```

```
## [1] 30
```

```r
(3 * 15) + 38 / 2
```

```
## [1] 64
```
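
---
## Getting started with R

As a small sketch of what the last expression does, the same calculation can be split into named steps. The variable names `a`, `b` and `total` are only illustrative.

```r
# * and / are evaluated before +, so the parentheses in
# (3 * 15) + 38 / 2 are optional
a <- 3 * 15      # first product
b <- 38 / 2      # division
total <- a + b   # same value as (3 * 15) + 38 / 2
total

# Moving the parentheses changes the order of evaluation
other <- (3 * 15 + 38) / 2
other
```

Storing intermediate results with `<-` makes each step easy to inspect at the prompt.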

---
class: center, middle, inverse
## Tidy data

---
## The notion of tidy data
  
> “Happy families are all alike; every unhappy family is unhappy in its own way.” – Leo Tolstoy
  
> “Tidy datasets are all alike, but every messy dataset is messy in its own way.” – Hadley Wickham

---
## The notion of tidy data

1. Each variable has its own column
2. Each observation has its own row
3. Each value has its own cell

.center[![tidy data](./images/tidy-data.png)]
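
---
## The notion of tidy data

A minimal sketch of a table that follows the three rules, built by hand with `tibble`. The values are those of the example table used in this deck; the object name `cases_tb` and the derived column `cases_per_10k` are assumptions made for illustration.

```r
library(tibble)

# One row per (country, year) observation, one column per variable
cases_tb <- tibble(
  country    = c("Afghanistan", "Afghanistan", "Brazil", "Brazil", "China", "China"),
  year       = c(1999, 2000, 1999, 2000, 1999, 2000),
  cases      = c(745, 2666, 37737, 80488, 212258, 213766),
  population = c(19987071, 20595360, 172006362, 174504898, 1272915272, 1280428583)
)

# Because each variable is a column, a derived variable is a one-line computation
cases_tb$cases_per_10k <- cases_tb$cases / cases_tb$population * 10000
cases_tb
```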

---
## Tidy or not?
<div id="htmlwidget-ccd56a830ecfbc9c262f" style="width:100%;height:auto;" class="datatables html-widget"></div>
<script type="application/json" data-for="htmlwidget-ccd56a830ecfbc9c262f">{"x":{"filter":"none","data":[["1","2","3","4","5","6"],["Afghanistan","Afghanistan","Brazil","Brazil","China","China"],[1999,2000,1999,2000,1999,2000],[745,2666,37737,80488,212258,213766],[19987071,20595360,172006362,174504898,1272915272,1280428583]],"container":"<table class=\"display\">\n  <thead>\n    <tr>\n      <th> <\/th>\n      <th>country<\/th>\n      <th>year<\/th>\n      <th>cases<\/th>\n      <th>population<\/th>\n    <\/tr>\n  <\/thead>\n<\/table>","options":{"dom":"t","columnDefs":[{"className":"dt-right","targets":[2,3,4]},{"orderable":false,"targets":0}],"order":[],"autoWidth":false,"orderClasses":false}},"evals":[],"jsHooks":[]}</script>

---
## Tidy or not?
.center[![tidy data](./images/tidy-data.png)]

---
## Tidy or not?
<div id="htmlwidget-fe24b3870f9ec00604e6" style="width:100%;height:auto;" class="datatables html-widget"></div>
<script type="application/json" data-for="htmlwidget-fe24b3870f9ec00604e6">{"x":{"filter":"none","data":[["1","2","3","4","5","6","7","8","9","10","11","12"],["Afghanistan","Afghanistan","Afghanistan","Afghanistan","Brazil","Brazil","Brazil","Brazil","China","China","China","China"],[1999,1999,2000,2000,1999,1999,2000,2000,1999,1999,2000,2000],["cases","population","cases","population","cases","population","cases","population","cases","population","cases","population"],[745,19987071,2666,20595360,37737,172006362,80488,174504898,212258,1272915272,213766,1280428583]],"container":"<table class=\"display\">\n  <thead>\n    <tr>\n      <th> <\/th>\n      <th>country<\/th>\n      <th>year<\/th>\n      <th>type<\/th>\n      <th>count<\/th>\n    <\/tr>\n  <\/thead>\n<\/table>","options":{"dom":"t","columnDefs":[{"className":"dt-right","targets":[2,4]},{"orderable":false,"targets":0}],"order":[],"autoWidth":false,"orderClasses":false}},"evals":[],"jsHooks":[]}</script>

---
## Tidy or not?

.center[![tidy data](./images/tidy-5.png)]

---
## Tidy or not?
<div id="htmlwidget-bcb206e7bd6fd539d257" style="width:100%;height:auto;" class="datatables html-widget"></div>
<script type="application/json" data-for="htmlwidget-bcb206e7bd6fd539d257">{"x":{"filter":"none","data":[["1","2","3","4","5","6"],["Afghanistan","Afghanistan","Brazil","Brazil","China","China"],[1999,2000,1999,2000,1999,2000],["745/19987071","2666/20595360","37737/172006362","80488/174504898","212258/1272915272","213766/1280428583"]],"container":"<table class=\"display\">\n  <thead>\n    <tr>\n      <th> <\/th>\n      <th>country<\/th>\n      <th>year<\/th>\n      <th>rate<\/th>\n    <\/tr>\n  <\/thead>\n<\/table>","options":{"dom":"t","columnDefs":[{"className":"dt-right","targets":2},{"orderable":false,"targets":0}],"order":[],"autoWidth":false,"orderClasses":false}},"evals":[],"jsHooks":[]}</script>

---
## Tidy or not?

.center[![tidy data](./images/tidy-6.png)]
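
---
## Tidy or not?

The `rate` column shown earlier packs two values into one cell. One possible way to fix this, sketched here with `tidyr::separate()`, is to split it into two columns. The object names `rate_tb` and `tidy_tb` are assumptions; the data are retyped from the table.

```r
library(tibble)
library(tidyr)

rate_tb <- tibble(
  country = c("Afghanistan", "Afghanistan", "Brazil", "Brazil", "China", "China"),
  year    = c(1999, 2000, 1999, 2000, 1999, 2000),
  rate    = c("745/19987071", "2666/20595360", "37737/172006362",
              "80488/174504898", "212258/1272915272", "213766/1280428583")
)

# Split `rate` at the "/" into two columns;
# convert = TRUE turns the resulting strings into numbers
tidy_tb <- separate(rate_tb, rate, into = c("cases", "population"),
                    sep = "/", convert = TRUE)
tidy_tb
```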

---
## The notion of tidy data

Tidy data make it easy to:

- Manipulate the data (pivot, etc.)
- Combine them with other data (joins)
- Visualize the data
- Export the data (e.g. to a Postgres database)
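
---
## The notion of tidy data

To illustrate the "combine with other data" point, here is a hedged sketch of a join with `dplyr::left_join()`. The lookup table `continent_tb` is hypothetical and was made up for this example; it is not part of the workshop data.

```r
library(tibble)
library(dplyr)

cases_tb <- tibble(
  country = c("Afghanistan", "Brazil", "China"),
  year    = c(1999, 1999, 1999),
  cases   = c(745, 37737, 212258)
)

# Hypothetical lookup table, invented for this illustration
continent_tb <- tibble(
  country   = c("Afghanistan", "Brazil", "China"),
  continent = c("Asia", "Americas", "Asia")
)

# Both tables are tidy and share a `country` column, so the join is one call
joined_tb <- left_join(cases_tb, continent_tb, by = "country")
joined_tb
```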

---
class: center, middle, inverse
## The tidyverse

---
## The tidyverse

**The packages**

- `tidyr` (making data tidy)
- `readr` (reading tabular data)
- `tibble` (data structure)
- `purrr` (functional programming)
- `dplyr` (data manipulation)
- `ggplot2` (graphics)

.center[![tidy logos](./images/tidyverse.jpg)]

---
## How to install them all

You can get all the tidyverse packages by running this command:

```r
install.packages("tidyverse") # download and install the packages
```

```r
library(tidyverse) # load the tidyverse
```
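
---
## How to install them all

Since `install.packages()` downloads the packages again on every call, a script that is run repeatedly may want to install only when needed. The helper `install_if_missing()` below is a hypothetical convenience function, not part of R or the tidyverse.

```r
# Hypothetical helper: install a package only if it is not already available
install_if_missing <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Report (invisibly) whether the package can now be loaded
  invisible(requireNamespace(pkg, quietly = TRUE))
}

install_if_missing("tidyverse")
library(tidyverse)
```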

---
## An analyst's workflow

.center[![Data science exploration general workflow](./images/data-science.png)]

---
## Reading the data
.center[![wrangle](./images/data-science-wrangle.png)]

---
## Opening an Excel file

```r
library(readxl)
```

.center[![readxl logos](./images/readxl.png)]

---
## Opening an Excel file

```r
raw_tb <- read_excel(path = "data/tidy_data.xlsx", sheet = 2)  # read the second sheet
raw_tb  # print the tibble
```

```
## # A tibble: 12 x 4
##        country  year        key      value
##          <chr> <dbl>      <chr>      <dbl>
##  1 Afghanistan  1999      cases        745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000      cases       2666
##  4 Afghanistan  2000 population   20595360
##  5      Brazil  1999      cases      37737
##  6      Brazil  1999 population  172006362
##  7      Brazil  2000      cases      80488
##  8      Brazil  2000 population  174504898
##  9       China  1999      cases     212258
## 10       China  1999 population 1272915272
## 11       China  2000      cases     213766
## 12       China  2000 population 1280428583
```
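
---
## Opening an Excel file

When the sheet layout of a workbook is unknown, it can help to list the sheets before reading one. The sketch below uses a demo workbook bundled with `readxl` (located with `readxl_example()`), not the workshop file `data/tidy_data.xlsx`; the object names are illustrative.

```r
library(readxl)

# Path to a demo workbook shipped with readxl (not the workshop data)
path <- readxl_example("datasets.xlsx")

# List the sheet names first ...
sheets <- excel_sheets(path)
sheets

# ... then read a sheet by position or by name
by_position <- read_excel(path, sheet = 2)
by_name     <- read_excel(path, sheet = sheets[2])
by_position
```

Referring to a sheet by name is usually more robust than by position if the workbook may be reordered.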

---
## Making the data tidy

```r
library(tidyr)
```

.center[![tidyr logos](./images/tidyr.png)]

---
## `tidyr::spread`

.center[![spread data](./images/tidy-8.png)]

---
## `tidyr::spread`

```r
spread(raw_tb, key, value)
```

```
## # A tibble: 6 x 4
##       country  year  cases population
## *       <chr> <dbl>  <dbl>      <dbl>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3      Brazil  1999  37737  172006362
## 4      Brazil  2000  80488  174504898
## 5       China  1999 212258 1272915272
## 6       China  2000 213766 1280428583
```
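
---
## `tidyr::spread`

A self-contained sketch of the same reshaping on a four-row extract of the table, so it can be run without the Excel file. It also shows `pivot_wider()`, which tidyr 1.0.0 and later provide as the newer equivalent of `spread()`. The object names are assumptions.

```r
library(tibble)
library(tidyr)

long_tb <- tibble(
  country = rep(c("Afghanistan", "Brazil"), each = 2),
  year    = 1999,
  key     = rep(c("cases", "population"), times = 2),
  value   = c(745, 19987071, 37737, 172006362)
)

# spread(): the values of `key` become column names, filled from `value`
wide_old <- spread(long_tb, key, value)

# pivot_wider(): same result, with explicit argument names
wide_new <- pivot_wider(long_tb, names_from = key, values_from = value)
wide_new
```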

---
## Opening an Excel file

```r
raw_tb <- read_excel(path = "data/tidy_data.xlsx", sheet = 4)  # read the fourth sheet
raw_tb  # print the tibble
```

```
## # A tibble: 3 x 3
##       country `1999` `2000`
##         <chr>  <dbl>  <dbl>
## 1 Afghanistan    745   2666
## 2      Brazil  37737  80488
## 3       China 212258 213766
```

---
## `tidyr::gather`

.center[![gather data](./images/tidy-9.png)]

---
## `tidyr::gather`

```r
gather(raw_tb, key = "year", value = "cases", 2:3) # gather columns 2 and 3 (`1999` and `2000`)
```

```
## # A tibble: 6 x 3
##       country  year  cases
##         <chr> <chr>  <dbl>
## 1 Afghanistan  1999    745
## 2      Brazil  1999  37737
## 3       China  1999 212258
## 4 Afghanistan  2000   2666
## 5      Brazil  2000  80488
## 6       China  2000 213766
```
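
---
## `tidyr::gather`

In the output above, `year` is a character column because it comes from column names. A possible refinement, sketched on a hand-typed copy of the table, is to pass `convert = TRUE`. `pivot_longer()` is the newer equivalent in tidyr 1.0.0 and later. The object names are assumptions.

```r
library(tibble)
library(tidyr)

wide_tb <- tibble(
  country = c("Afghanistan", "Brazil", "China"),
  `1999`  = c(745, 37737, 212258),
  `2000`  = c(2666, 80488, 213766)
)

# convert = TRUE turns the gathered names "1999" and "2000" into integers
long_old <- gather(wide_tb, key = "year", value = "cases",
                   `1999`:`2000`, convert = TRUE)

# pivot_longer() keeps `year` as character unless told otherwise
long_new <- pivot_longer(wide_tb, cols = c(`1999`, `2000`),
                         names_to = "year", values_to = "cases")
long_old
```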

---
## Opening a CSV file

```r
library(readr)
```

.center[![readr logos](./images/readr.png)]

---
## Opening a CSV file

```r
raw_weather <- read_csv(file = "data/weather_tmax.csv", na = "NA")  # "NA" marks missing values
raw_weather  # print the tibble
```

```
## # A tibble: 11 x 10
##             id  year month    d1    d2    d3    d4    d5    d6    d7
##          <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int>
##  1 MX000017004  2010     1    NA    NA    NA    NA    NA    NA    NA
##  2 MX000017004  2010     2    NA   273   241    NA    NA    NA    NA
##  3 MX000017004  2010     3    NA    NA    NA    NA   321    NA    NA
##  4 MX000017004  2010     4    NA    NA    NA    NA    NA    NA    NA
##  5 MX000017004  2010     5    NA    NA    NA    NA    NA    NA    NA
##  6 MX000017004  2010     6    NA    NA    NA    NA    NA    NA    NA
##  7 MX000017004  2010     7    NA    NA   286    NA    NA    NA    NA
##  8 MX000017004  2010     8    NA    NA    NA    NA   296    NA    NA
##  9 MX000017004  2010    10    NA    NA    NA    NA   270    NA   281
## 10 MX000017004  2010    11    NA   313    NA   272   263    NA    NA
## 11 MX000017004  2010    12   299    NA    NA    NA    NA   278    NA
```
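
---
## Opening a CSV file

A self-contained sketch of the same call, using a temporary file that holds a three-row, two-day extract of the table above. Declaring `col_types` is an optional addition, shown here as one way to avoid relying on readr's type guessing; `weather_demo` is an illustrative name.

```r
library(readr)

# Write a small extract of the weather table to a temporary CSV file
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "id,year,month,d1,d2",
  "MX000017004,2010,2,NA,273",
  "MX000017004,2010,11,NA,313",
  "MX000017004,2010,12,299,NA"
), tmp)

# Read it back: `id` as text, every other column as integer
weather_demo <- read_csv(
  tmp,
  na = "NA",
  col_types = cols(id = col_character(), .default = col_integer())
)
weather_demo
```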

---
## Making the data tidy

```r
gather(raw_weather, key = "day", value = "tmax", d1:d31, na.rm = TRUE)
```

```
## # A tibble: 33 x 5
##             id  year month   day  tmax
##  *       <chr> <int> <int> <chr> <chr>
##  1 MX000017004  2010    12    d1   299
##  2 MX000017004  2010     2    d2   273
##  3 MX000017004  2010    11    d2   313
##  4 MX000017004  2010     2    d3   241
##  5 MX000017004  2010     7    d3   286
##  6 MX000017004  2010    11    d4   272
##  7 MX000017004  2010     3    d5   321
##  8 MX000017004  2010     8    d5   296
##  9 MX000017004  2010    10    d5   270
## 10 MX000017004  2010    11    d5   263
## # ... with 23 more rows
```
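
---
## Making the data tidy

The later slides use an object called `clean_weather`, whose construction is not shown in this deck. The sketch below is one plausible way to build such a table, under the assumption that `day` should be an integer and `tmax` a number. It runs on a small hand-typed extract; `raw_demo` and `clean_demo` are illustrative names.

```r
library(tibble)
library(tidyr)
library(dplyr)
library(readr)

raw_demo <- tibble(
  id    = "MX000017004",
  year  = 2010L,
  month = c(2L, 7L, 12L),
  d1    = c(NA, NA, 299L),
  d2    = c(273L, NA, NA),
  d3    = c(241L, 286L, NA)
)

clean_demo <- raw_demo %>%
  # one row per (month, day); drop the days without a measurement
  gather(key = "day", value = "tmax", d1:d3, na.rm = TRUE) %>%
  # "d3" -> 3, and make sure tmax is numeric
  mutate(day  = as.integer(parse_number(day)),
         tmax = as.numeric(tmax))
clean_demo
```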

---
class: center, middle, inverse
## What else the tidyverse offers

---
class: center, middle, inverse
## Data manipulation with dplyr

---
## Data manipulation with dplyr

```r
# Mean and maximum of the daily maximum temperature, per month
clean_weather %>%
  group_by(month) %>%
  summarise(tmax_moy = mean(tmax), tmax = max(tmax))
```

```
## # A tibble: 11 x 3
##    month tmax_moy  tmax
##    <int>    <dbl> <dbl>
##  1     1 278.0000   278
##  2     2 277.5000   299
##  3     3 325.6667   345
##  4     4 363.0000   363
##  5     5 332.0000   332
##  6     6 290.5000   301
##  7     7 292.5000   299
##  8     8 282.7143   298
##  9    10 289.0000   312
## 10    11 281.2000   313
## 11    12 288.5000   299
```
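
---
## Data manipulation with dplyr

A self-contained sketch of the same grouped summary on five values taken from the weather table. The column names `n_obs` and `tmax_max` are assumptions added for illustration.

```r
library(tibble)
library(dplyr)

weather_demo <- tibble(
  month = c(2L, 2L, 11L, 11L, 11L),
  tmax  = c(273, 241, 313, 272, 263)
)

monthly <- weather_demo %>%
  group_by(month) %>%
  summarise(
    n_obs    = n(),         # number of measurements in the month
    tmax_moy = mean(tmax),  # monthly mean
    tmax_max = max(tmax)    # monthly maximum
  )
monthly
```

Inside `summarise()`, later expressions can see the columns created by earlier ones, so giving the maximum its own name (`tmax_max`) avoids accidentally reusing an already summarised `tmax`.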

---
class: center, middle, inverse
## Plotting with ggplot2

---
## Plotting with ggplot2

```r
clean_weather %>%
  unite(date, year, month, day, sep = "-") %>%  # build a "year-month-day" string
  mutate(date = as.Date(date)) %>%              # convert it to a Date
  ggplot(aes(date, tmax)) +
    geom_line()
```

![](index_files/figure-html/unnamed-chunk-22-1.png)
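
---
## Plotting with ggplot2

A self-contained sketch of the same pipeline on four hand-typed measurements, assuming `day` is stored as an integer. `geom_point()` and `labs()` are optional additions that make sparse data easier to read; `clean_demo` and `p` are illustrative names.

```r
library(tibble)
library(tidyr)
library(dplyr)
library(ggplot2)

clean_demo <- tibble(
  year  = 2010L,
  month = c(2L, 2L, 7L, 12L),
  day   = c(2L, 3L, 3L, 1L),
  tmax  = c(273, 241, 286, 299)
)

p <- clean_demo %>%
  unite(date, year, month, day, sep = "-") %>%  # e.g. "2010-2-3"
  mutate(date = as.Date(date)) %>%              # parsed as year-month-day
  ggplot(aes(date, tmax)) +
    geom_line() +
    geom_point() +
    labs(x = "Date", y = "tmax")
p
```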

---
class: center, middle, inverse
## Questions?
### #dataskills
[humdata.org](www.data.humdata.org)