library(dplyr)
What is this post about?
- What is the
{dplyr}
package? - Presenting
arrange()
andfilter()
.
History
Hadley Wickham, one of the minds behind the tidyverse, argues in his tidy data paper that around 80% of data analysis is spent on the process of cleaning and preparing the data.
Given all this work and repetition of tasks, some of the tidyverse R packages were designed to maintain good and reproducible ways to handle different datasets.
The {dplyr}
package, according with my humble opinion, is the flagship of tidyverse when the objective is fixing messy data.
I decided then to present some functions of the main tidyverse packages. We are starting with the {dplyr}
functions arrange()
and filter()
.
Hands on
Firts of all we need to load the {dplyr}
package:
If you do not have it yet, just install using:
install.packages("dplyr")
We will use the Palmer Archipelago (Antarctica) penguin dataset, wich you can load and add into an R object like this:
# install.packages("palmerpenguins")
<- palmerpenguins::penguins
penguins
glimpse(penguins)
Rows: 344
Columns: 8
$ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex <fct> male, female, female, NA, female, male, female, male…
$ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
You can read more about this dataset at Allison Horst Github page.
arrange()
The first {dplyr}
function is arrange()
.
One thing that I like in the {dplyr}
functions, is that they represent the actions of what they pretend to do. For instance, arrange()
will literally arrange the dataset based in one or more columns, at the ascending or descending order (default is ascending).
In the example below, we are sorting the dataset based on the name of the species:
|>
penguins arrange(species)
# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
If you want to know more about the pipe (|>), check my previous post (in brazilian portuguese).
Now, let’s arrange the dataset based on the island by descending order, and species by ascending order:
|>
penguins arrange(desc(island), species)
# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
The two variables above were alphabetically sorted, now we can try the same using the body_mass_g numeric variable:
|>
penguins arrange(body_mass_g)
# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Chinstrap Dream 46.9 16.6 192 2700
2 Adelie Biscoe 36.5 16.6 181 2850
3 Adelie Biscoe 36.4 17.1 184 2850
4 Adelie Biscoe 34.5 18.1 187 2900
5 Adelie Dream 33.1 16.1 178 2900
6 Adelie Torgers… 38.6 17 188 2900
7 Chinstrap Dream 43.2 16.6 187 2900
8 Adelie Biscoe 37.9 18.6 193 2925
9 Adelie Dream 37.5 18.9 179 2975
10 Adelie Dream 37 16.9 185 3000
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
The lightest measured penguin weighed 2,700 g, it was from the Chinstrap species from the Dream Island.
What about the heaviest?
|>
penguins arrange(desc(body_mass_g))
# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Gentoo Biscoe 49.2 15.2 221 6300
2 Gentoo Biscoe 59.6 17 230 6050
3 Gentoo Biscoe 51.1 16.3 220 6000
4 Gentoo Biscoe 48.8 16.2 222 6000
5 Gentoo Biscoe 45.2 16.4 223 5950
6 Gentoo Biscoe 49.8 15.9 229 5950
7 Gentoo Biscoe 48.4 14.6 213 5850
8 Gentoo Biscoe 49.3 15.7 217 5850
9 Gentoo Biscoe 55.1 16 230 5850
10 Gentoo Biscoe 49.5 16.2 229 5800
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
The heaviest was a Gentoo, from Biscoe Island, and weighed 6,300 g.
filter()
The filter()
function is one of the more important at the {dplyr}
package.
According to its name, filter()
will subset the observations (rows) of the dataset based on a given rule.
To retain the values, the proposed rules must produce a TRUE value for all conditions.
To do such conditions, some relational operators are necessary, here is a list of them:
Relational operators in R Operator Description < Less than > Greater than <= Less than or equal to >= Greater than or equal to == Equal to != Not equal to
Let’s start using only one condition.
First, we want to pick only the penguins that weight less than 3,000 g (or 3 kg):
|>
penguins filter(body_mass_g < 3000)
# A tibble: 9 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Dream 37.5 18.9 179 2975
2 Adelie Biscoe 34.5 18.1 187 2900
3 Adelie Biscoe 36.5 16.6 181 2850
4 Adelie Biscoe 36.4 17.1 184 2850
5 Adelie Dream 33.1 16.1 178 2900
6 Adelie Biscoe 37.9 18.6 193 2925
7 Adelie Torgersen 38.6 17 188 2900
8 Chinstrap Dream 43.2 16.6 187 2900
9 Chinstrap Dream 46.9 16.6 192 2700
# ℹ 2 more variables: sex <fct>, year <int>
The result is a filtered dataset containing only 9 penguins.
Now we want to filter only the female in the penguins
data:
|>
penguins filter(sex == "female")
# A tibble: 165 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.5 17.4 186 3800
2 Adelie Torgersen 40.3 18 195 3250
3 Adelie Torgersen 36.7 19.3 193 3450
4 Adelie Torgersen 38.9 17.8 181 3625
5 Adelie Torgersen 41.1 17.6 182 3200
6 Adelie Torgersen 36.6 17.8 185 3700
7 Adelie Torgersen 38.7 19 195 3450
8 Adelie Torgersen 34.4 18.4 184 3325
9 Adelie Biscoe 37.8 18.3 174 3400
10 Adelie Biscoe 35.9 19.2 189 3800
# ℹ 155 more rows
# ℹ 2 more variables: sex <fct>, year <int>
There are 165 female penguins on this dataset (the total is 344).
We can filter the dataset based in more than one condition. We want to subset the data by the male, Adelie species, that were collected at the Dream Island:
|>
penguins filter(sex == "male",
== "Adelie",
species == "Dream") |>
island ::kable() # only to show the whole data knitr
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
---|---|---|---|---|---|---|---|
Adelie | Dream | 37.2 | 18.1 | 178 | 3900 | male | 2007 |
Adelie | Dream | 40.9 | 18.9 | 184 | 3900 | male | 2007 |
Adelie | Dream | 39.2 | 21.1 | 196 | 4150 | male | 2007 |
Adelie | Dream | 38.8 | 20.0 | 190 | 3950 | male | 2007 |
Adelie | Dream | 39.8 | 19.1 | 184 | 4650 | male | 2007 |
Adelie | Dream | 40.8 | 18.4 | 195 | 3900 | male | 2007 |
Adelie | Dream | 44.1 | 19.7 | 196 | 4400 | male | 2007 |
Adelie | Dream | 39.6 | 18.8 | 190 | 4600 | male | 2007 |
Adelie | Dream | 41.1 | 19.0 | 182 | 3425 | male | 2007 |
Adelie | Dream | 42.3 | 21.2 | 191 | 4150 | male | 2007 |
Adelie | Dream | 41.3 | 20.3 | 194 | 3550 | male | 2008 |
Adelie | Dream | 36.3 | 19.5 | 190 | 3800 | male | 2008 |
Adelie | Dream | 38.3 | 19.2 | 189 | 3950 | male | 2008 |
Adelie | Dream | 41.1 | 18.1 | 205 | 4300 | male | 2008 |
Adelie | Dream | 39.6 | 18.1 | 186 | 4450 | male | 2008 |
Adelie | Dream | 40.8 | 18.9 | 208 | 4300 | male | 2008 |
Adelie | Dream | 40.3 | 18.5 | 196 | 4350 | male | 2008 |
Adelie | Dream | 43.2 | 18.5 | 192 | 4100 | male | 2008 |
Adelie | Dream | 37.5 | 18.5 | 199 | 4475 | male | 2009 |
Adelie | Dream | 41.1 | 17.5 | 190 | 3900 | male | 2009 |
Adelie | Dream | 40.2 | 20.1 | 200 | 3975 | male | 2009 |
Adelie | Dream | 39.7 | 17.9 | 193 | 4250 | male | 2009 |
Adelie | Dream | 40.6 | 17.2 | 187 | 3475 | male | 2009 |
Adelie | Dream | 40.7 | 17.0 | 190 | 3725 | male | 2009 |
Adelie | Dream | 39.0 | 18.7 | 185 | 3650 | male | 2009 |
Adelie | Dream | 39.2 | 18.6 | 190 | 4250 | male | 2009 |
Adelie | Dream | 37.8 | 18.1 | 193 | 3750 | male | 2009 |
Adelie | Dream | 41.5 | 18.5 | 201 | 4000 | male | 2009 |
If we need to subset more than one information in a single variable, we can use the %in%
operator. For example, we want the penguins that have flipper_length_mm smaller than 180, that were collected at the years of 2007 and 2009:
|>
penguins filter(flipper_length_mm < 180,
%in% c(2007, 2009)) |> # the `c()` stands for "concatenate"
year ::kable() knitr
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
---|---|---|---|---|---|---|---|
Adelie | Biscoe | 37.8 | 18.3 | 174 | 3400 | female | 2007 |
Adelie | Biscoe | 37.9 | 18.6 | 172 | 3150 | female | 2007 |
Adelie | Dream | 39.5 | 16.7 | 178 | 3250 | female | 2007 |
Adelie | Dream | 37.2 | 18.1 | 178 | 3900 | male | 2007 |
Adelie | Dream | 37.5 | 18.9 | 179 | 2975 | NA | 2007 |
Adelie | Torgersen | 40.2 | 17.0 | 176 | 3450 | female | 2009 |
Chinstrap | Dream | 46.1 | 18.2 | 178 | 3250 | female | 2007 |
So, 7 out of 344 penguins were collected at 2007 or 2009 and had flippers lengths smaller than 180 mm.
Finishing
There is a lot more to do, and a lot of specific details at the filter()
function, but the main idea of this post is just to do a glimpse of what it does.
In this post, we explored:
{dplyr}
packagearrange()
function;filter()
function.
I hope you have enjoyed.
See you soon.
Cheers,