Data wrangling with {dplyr}

Learning {dplyr} functions - arrange() and filter()

data wrangling
dplyr
en
R
tidying
Author

Ricardo

Published

October 24, 2022

What is this post about?

  • What is the {dplyr} package?
  • Presenting arrange() and filter().

History

Hadley Wickham, one of the minds behind the tidyverse, argues in his tidy data paper that around 80% of data analysis is spent on the process of cleaning and preparing the data.

Given all this work and repetition of tasks, some of the tidyverse R packages were designed to maintain good and reproducible ways to handle different datasets.

The {dplyr} package, according with my humble opinion, is the flagship of tidyverse when the objective is fixing messy data.

I decided then to present some functions of the main tidyverse packages. We are starting with the {dplyr} functions arrange() and filter().

Hands on

Firts of all we need to load the {dplyr} package:

library(dplyr)

If you do not have it yet, just install using:

install.packages("dplyr")

We will use the Palmer Archipelago (Antarctica) penguin dataset, wich you can load and add into an R object like this:

# install.packages("palmerpenguins")
penguins <- palmerpenguins::penguins

glimpse(penguins)
Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

You can read more about this dataset at Allison Horst Github page.

Figure 1: Meet the Palmer penguins. Artwork by @allison_horst.

arrange()

The first {dplyr} function is arrange().

One thing that I like in the {dplyr} functions, is that they represent the actions of what they pretend to do. For instance, arrange() will literally arrange the dataset based in one or more columns, at the ascending or descending order (default is ascending).

In the example below, we are sorting the dataset based on the name of the species:

penguins |> 
    arrange(species)
# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>

If you want to know more about the pipe (|>), check my previous post (in brazilian portuguese).

Now, let’s arrange the dataset based on the island by descending order, and species by ascending order:

penguins |> 
    arrange(desc(island), species)
# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>

The two variables above were alphabetically sorted, now we can try the same using the body_mass_g numeric variable:

penguins |> 
    arrange(body_mass_g)
# A tibble: 344 × 8
   species   island   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>     <fct>             <dbl>         <dbl>             <int>       <int>
 1 Chinstrap Dream              46.9          16.6               192        2700
 2 Adelie    Biscoe             36.5          16.6               181        2850
 3 Adelie    Biscoe             36.4          17.1               184        2850
 4 Adelie    Biscoe             34.5          18.1               187        2900
 5 Adelie    Dream              33.1          16.1               178        2900
 6 Adelie    Torgers…           38.6          17                 188        2900
 7 Chinstrap Dream              43.2          16.6               187        2900
 8 Adelie    Biscoe             37.9          18.6               193        2925
 9 Adelie    Dream              37.5          18.9               179        2975
10 Adelie    Dream              37            16.9               185        3000
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>

The lightest measured penguin weighed 2,700 g, it was from the Chinstrap species from the Dream Island.

What about the heaviest?

penguins |> 
    arrange(desc(body_mass_g))
# A tibble: 344 × 8
   species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
 1 Gentoo  Biscoe           49.2          15.2               221        6300
 2 Gentoo  Biscoe           59.6          17                 230        6050
 3 Gentoo  Biscoe           51.1          16.3               220        6000
 4 Gentoo  Biscoe           48.8          16.2               222        6000
 5 Gentoo  Biscoe           45.2          16.4               223        5950
 6 Gentoo  Biscoe           49.8          15.9               229        5950
 7 Gentoo  Biscoe           48.4          14.6               213        5850
 8 Gentoo  Biscoe           49.3          15.7               217        5850
 9 Gentoo  Biscoe           55.1          16                 230        5850
10 Gentoo  Biscoe           49.5          16.2               229        5800
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>

The heaviest was a Gentoo, from Biscoe Island, and weighed 6,300 g.

filter()

The filter() function is one of the more important at the {dplyr} package.

According to its name, filter() will subset the observations (rows) of the dataset based on a given rule.

To retain the values, the proposed rules must produce a TRUE value for all conditions.

To do such conditions, some relational operators are necessary, here is a list of them:

Relational operators in R
Operator Description
< Less than
> Greater than
<= Less than or equal to
>= Greater than or equal to
== Equal to
!= Not equal to

Let’s start using only one condition.

First, we want to pick only the penguins that weight less than 3,000 g (or 3 kg):

penguins |> 
    filter(body_mass_g < 3000)
# A tibble: 9 × 8
  species   island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>     <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie    Dream               37.5          18.9               179        2975
2 Adelie    Biscoe              34.5          18.1               187        2900
3 Adelie    Biscoe              36.5          16.6               181        2850
4 Adelie    Biscoe              36.4          17.1               184        2850
5 Adelie    Dream               33.1          16.1               178        2900
6 Adelie    Biscoe              37.9          18.6               193        2925
7 Adelie    Torgersen           38.6          17                 188        2900
8 Chinstrap Dream               43.2          16.6               187        2900
9 Chinstrap Dream               46.9          16.6               192        2700
# ℹ 2 more variables: sex <fct>, year <int>

The result is a filtered dataset containing only 9 penguins.

Now we want to filter only the female in the penguins data:

penguins |> 
    filter(sex == "female")
# A tibble: 165 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.5          17.4               186        3800
 2 Adelie  Torgersen           40.3          18                 195        3250
 3 Adelie  Torgersen           36.7          19.3               193        3450
 4 Adelie  Torgersen           38.9          17.8               181        3625
 5 Adelie  Torgersen           41.1          17.6               182        3200
 6 Adelie  Torgersen           36.6          17.8               185        3700
 7 Adelie  Torgersen           38.7          19                 195        3450
 8 Adelie  Torgersen           34.4          18.4               184        3325
 9 Adelie  Biscoe              37.8          18.3               174        3400
10 Adelie  Biscoe              35.9          19.2               189        3800
# ℹ 155 more rows
# ℹ 2 more variables: sex <fct>, year <int>

There are 165 female penguins on this dataset (the total is 344).

We can filter the dataset based in more than one condition. We want to subset the data by the male, Adelie species, that were collected at the Dream Island:

penguins |> 
    filter(sex == "male",
           species == "Adelie",
           island == "Dream") |> 
    knitr::kable() # only to show the whole data
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
Adelie Dream 37.2 18.1 178 3900 male 2007
Adelie Dream 40.9 18.9 184 3900 male 2007
Adelie Dream 39.2 21.1 196 4150 male 2007
Adelie Dream 38.8 20.0 190 3950 male 2007
Adelie Dream 39.8 19.1 184 4650 male 2007
Adelie Dream 40.8 18.4 195 3900 male 2007
Adelie Dream 44.1 19.7 196 4400 male 2007
Adelie Dream 39.6 18.8 190 4600 male 2007
Adelie Dream 41.1 19.0 182 3425 male 2007
Adelie Dream 42.3 21.2 191 4150 male 2007
Adelie Dream 41.3 20.3 194 3550 male 2008
Adelie Dream 36.3 19.5 190 3800 male 2008
Adelie Dream 38.3 19.2 189 3950 male 2008
Adelie Dream 41.1 18.1 205 4300 male 2008
Adelie Dream 39.6 18.1 186 4450 male 2008
Adelie Dream 40.8 18.9 208 4300 male 2008
Adelie Dream 40.3 18.5 196 4350 male 2008
Adelie Dream 43.2 18.5 192 4100 male 2008
Adelie Dream 37.5 18.5 199 4475 male 2009
Adelie Dream 41.1 17.5 190 3900 male 2009
Adelie Dream 40.2 20.1 200 3975 male 2009
Adelie Dream 39.7 17.9 193 4250 male 2009
Adelie Dream 40.6 17.2 187 3475 male 2009
Adelie Dream 40.7 17.0 190 3725 male 2009
Adelie Dream 39.0 18.7 185 3650 male 2009
Adelie Dream 39.2 18.6 190 4250 male 2009
Adelie Dream 37.8 18.1 193 3750 male 2009
Adelie Dream 41.5 18.5 201 4000 male 2009

If we need to subset more than one information in a single variable, we can use the %in% operator. For example, we want the penguins that have flipper_length_mm smaller than 180, that were collected at the years of 2007 and 2009:

penguins |> 
    filter(flipper_length_mm < 180,
           year %in% c(2007, 2009)) |> # the `c()` stands for "concatenate" 
    knitr::kable()
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
Adelie Biscoe 37.8 18.3 174 3400 female 2007
Adelie Biscoe 37.9 18.6 172 3150 female 2007
Adelie Dream 39.5 16.7 178 3250 female 2007
Adelie Dream 37.2 18.1 178 3900 male 2007
Adelie Dream 37.5 18.9 179 2975 NA 2007
Adelie Torgersen 40.2 17.0 176 3450 female 2009
Chinstrap Dream 46.1 18.2 178 3250 female 2007

So, 7 out of 344 penguins were collected at 2007 or 2009 and had flippers lengths smaller than 180 mm.

Finishing

There is a lot more to do, and a lot of specific details at the filter() function, but the main idea of this post is just to do a glimpse of what it does.

In this post, we explored:

  • {dplyr} package
    • arrange() function;
    • filter() function.

I hope you have enjoyed.

See you soon.

Cheers,