#blogR - Data wrangling with {dplyr}

What is this post about?

What is the {dplyr} package?
Presenting arrange() and filter().

History

Hadley Wickham, one of the minds behind the tidyverse, argues in his tidy data paper that around 80% of data analysis is spent on the process of cleaning and preparing the data.

Given all this work and repetition of tasks, some of the tidyverse R packages were designed to maintain good and reproducible ways to handle different datasets.

The {dplyr} package, according with my ~~humble~~ opinion, is the flagship of tidyverse when the objective is fixing messy data.

I decided then to present some functions of the main tidyverse packages. We are starting with the {dplyr} functions arrange() and filter().

Hands on

Firts of all we need to load the {dplyr} package:

library(dplyr)

If you do not have it yet, just install using:

install.packages("dplyr")

We will use the Palmer Archipelago (Antarctica) penguin dataset, wich you can load and add into an R object like this:

# install.packages("palmerpenguins")
penguins <- palmerpenguins::penguins

glimpse(penguins)

Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

You can read more about this dataset at Allison Horst Github page.

Figure 1: Meet the Palmer penguins. Artwork by @allison_horst.

`arrange()`

The first {dplyr} function is arrange().

One thing that I like in the {dplyr} functions, is that they represent the actions of what they pretend to do. For instance, arrange() will literally arrange the dataset based in one or more columns, at the ascending or descending order (default is ascending).

In the example below, we are sorting the dataset based on the name of the species:

penguins |> 
    arrange(species)

# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>

If you want to know more about the pipe (|>), check my previous post (in brazilian portuguese).

Now, let’s arrange the dataset based on the island by descending order, and species by ascending order:

penguins |> 
    arrange(desc(island), species)

# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>

The two variables above were alphabetically sorted, now we can try the same using the body_mass_g numeric variable:

penguins |> 
    arrange(body_mass_g)

# A tibble: 344 × 8
   species   island   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>     <fct>             <dbl>         <dbl>             <int>       <int>
 1 Chinstrap Dream              46.9          16.6               192        2700
 2 Adelie    Biscoe             36.5          16.6               181        2850
 3 Adelie    Biscoe             36.4          17.1               184        2850
 4 Adelie    Biscoe             34.5          18.1               187        2900
 5 Adelie    Dream              33.1          16.1               178        2900
 6 Adelie    Torgers…           38.6          17                 188        2900
 7 Chinstrap Dream              43.2          16.6               187        2900
 8 Adelie    Biscoe             37.9          18.6               193        2925
 9 Adelie    Dream              37.5          18.9               179        2975
10 Adelie    Dream              37            16.9               185        3000
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>

The lightest measured penguin weighed 2,700 g, it was from the Chinstrap species from the Dream Island.

What about the heaviest?

penguins |> 
    arrange(desc(body_mass_g))

# A tibble: 344 × 8
   species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
 1 Gentoo  Biscoe           49.2          15.2               221        6300
 2 Gentoo  Biscoe           59.6          17                 230        6050
 3 Gentoo  Biscoe           51.1          16.3               220        6000
 4 Gentoo  Biscoe           48.8          16.2               222        6000
 5 Gentoo  Biscoe           45.2          16.4               223        5950
 6 Gentoo  Biscoe           49.8          15.9               229        5950
 7 Gentoo  Biscoe           48.4          14.6               213        5850
 8 Gentoo  Biscoe           49.3          15.7               217        5850
 9 Gentoo  Biscoe           55.1          16                 230        5850
10 Gentoo  Biscoe           49.5          16.2               229        5800
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>

The heaviest was a Gentoo, from Biscoe Island, and weighed 6,300 g.

`filter()`

The filter() function is one of the more important at the {dplyr} package.

According to its name, filter() will subset the observations (rows) of the dataset based on a given rule.

To retain the values, the proposed rules must produce a TRUE value for all conditions.

To do such conditions, some relational operators are necessary, here is a list of them:

Relational operators in R

Operator Description

< Less than

> Greater than

<= Less than or equal to

>= Greater than or equal to

== Equal to

!= Not equal to

Relational operators in R
Operator	Description
<	Less than
>	Greater than
<=	Less than or equal to
>=	Greater than or equal to
==	Equal to
!=	Not equal to

Let’s start using only one condition.

First, we want to pick only the penguins that weight less than 3,000 g (or 3 kg):

penguins |> 
    filter(body_mass_g < 3000)

# A tibble: 9 × 8
  species   island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>     <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie    Dream               37.5          18.9               179        2975
2 Adelie    Biscoe              34.5          18.1               187        2900
3 Adelie    Biscoe              36.5          16.6               181        2850
4 Adelie    Biscoe              36.4          17.1               184        2850
5 Adelie    Dream               33.1          16.1               178        2900
6 Adelie    Biscoe              37.9          18.6               193        2925
7 Adelie    Torgersen           38.6          17                 188        2900
8 Chinstrap Dream               43.2          16.6               187        2900
9 Chinstrap Dream               46.9          16.6               192        2700
# ℹ 2 more variables: sex <fct>, year <int>

The result is a filtered dataset containing only 9 penguins.

Now we want to filter only the female in the penguins data:

penguins |> 
    filter(sex == "female")

# A tibble: 165 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.5          17.4               186        3800
 2 Adelie  Torgersen           40.3          18                 195        3250
 3 Adelie  Torgersen           36.7          19.3               193        3450
 4 Adelie  Torgersen           38.9          17.8               181        3625
 5 Adelie  Torgersen           41.1          17.6               182        3200
 6 Adelie  Torgersen           36.6          17.8               185        3700
 7 Adelie  Torgersen           38.7          19                 195        3450
 8 Adelie  Torgersen           34.4          18.4               184        3325
 9 Adelie  Biscoe              37.8          18.3               174        3400
10 Adelie  Biscoe              35.9          19.2               189        3800
# ℹ 155 more rows
# ℹ 2 more variables: sex <fct>, year <int>

There are 165 female penguins on this dataset (the total is 344).

We can filter the dataset based in more than one condition. We want to subset the data by the male, Adelie species, that were collected at the Dream Island:

penguins |> 
    filter(sex == "male",
           species == "Adelie",
           island == "Dream") |> 
    knitr::kable() # only to show the whole data

species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex	year
Adelie	Dream	37.2	18.1	178	3900	male	2007
Adelie	Dream	40.9	18.9	184	3900	male	2007
Adelie	Dream	39.2	21.1	196	4150	male	2007
Adelie	Dream	38.8	20.0	190	3950	male	2007
Adelie	Dream	39.8	19.1	184	4650	male	2007
Adelie	Dream	40.8	18.4	195	3900	male	2007
Adelie	Dream	44.1	19.7	196	4400	male	2007
Adelie	Dream	39.6	18.8	190	4600	male	2007
Adelie	Dream	41.1	19.0	182	3425	male	2007
Adelie	Dream	42.3	21.2	191	4150	male	2007
Adelie	Dream	41.3	20.3	194	3550	male	2008
Adelie	Dream	36.3	19.5	190	3800	male	2008
Adelie	Dream	38.3	19.2	189	3950	male	2008
Adelie	Dream	41.1	18.1	205	4300	male	2008
Adelie	Dream	39.6	18.1	186	4450	male	2008
Adelie	Dream	40.8	18.9	208	4300	male	2008
Adelie	Dream	40.3	18.5	196	4350	male	2008
Adelie	Dream	43.2	18.5	192	4100	male	2008
Adelie	Dream	37.5	18.5	199	4475	male	2009
Adelie	Dream	41.1	17.5	190	3900	male	2009
Adelie	Dream	40.2	20.1	200	3975	male	2009
Adelie	Dream	39.7	17.9	193	4250	male	2009
Adelie	Dream	40.6	17.2	187	3475	male	2009
Adelie	Dream	40.7	17.0	190	3725	male	2009
Adelie	Dream	39.0	18.7	185	3650	male	2009
Adelie	Dream	39.2	18.6	190	4250	male	2009
Adelie	Dream	37.8	18.1	193	3750	male	2009
Adelie	Dream	41.5	18.5	201	4000	male	2009

If we need to subset more than one information in a single variable, we can use the %in% operator. For example, we want the penguins that have flipper_length_mm smaller than 180, that were collected at the years of 2007 and 2009:

penguins |> 
    filter(flipper_length_mm < 180,
           year %in% c(2007, 2009)) |> # the `c()` stands for "concatenate" 
    knitr::kable()

species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex	year
Adelie	Biscoe	37.8	18.3	174	3400	female	2007
Adelie	Biscoe	37.9	18.6	172	3150	female	2007
Adelie	Dream	39.5	16.7	178	3250	female	2007
Adelie	Dream	37.2	18.1	178	3900	male	2007
Adelie	Dream	37.5	18.9	179	2975	NA	2007
Adelie	Torgersen	40.2	17.0	176	3450	female	2009
Chinstrap	Dream	46.1	18.2	178	3250	female	2007

So, 7 out of 344 penguins were collected at 2007 or 2009 and had flippers lengths smaller than 180 mm.

Finishing

There is a lot more to do, and a lot of specific details at the filter() function, but the main idea of this post is just to do a glimpse of what it does.

In this post, we explored:

{dplyr} package
- arrange() function;
- filter() function.

I hope you have enjoyed.

See you soon.

Cheers,