Filtering and Selecting Data

In the following coding examples, you will see three important functions, a pipe, select(), and filter(). First, a pipe is written as a %>% or |>. Pipes are a powerful tool that help us clearly expressing a sequence of functions. Pipes tell R that you want to use a designated set for a chain of functions that build off of each other. A way to think about is in math terms.

\[ F(G(H(X))), X=Data\]

In this series of functions, the first step is you solve \(H(X)\), then using that solution you solve \(G()\), then after you solve \(F()\). The final output is the solution of those chain of events with the information of \(X\). This is how pipes work! Pipes use a set of information, typically a data frame, and then apply it to a series of functions that build off each other.

In the following example, I use the select() and filter() to simplify the data. The select() allows me to select certain variables from the data and eliminate the rest. filter() does something similar but instead of variables, it allows me to simplify the data based on values based on a set of programmed logical operators. In my first example, I select for the circuit, year, constructor, surname of the driver, and the number of points awarded during that race. I use the unique() to remove duplicates, and then I filter for the race information for the constructor McLaren during the 2023 season.

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(RandomData)


race_stats |>
  select(circuit, year, constructor, surname, points) |>
  # remove duplicates
  unique() |>
  filter(constructor == "McLaren" & year == 2023)

# A tibble: 40 × 5
   circuit                         year constructor surname points
   <chr>                          <dbl> <chr>       <chr>    <dbl>
 1 Bahrain International Circuit   2023 McLaren     Norris       0
 2 Jeddah Corniche Circuit         2023 McLaren     Norris       0
 3 Jeddah Corniche Circuit         2023 McLaren     Piastri      0
 4 Albert Park Grand Prix Circuit  2023 McLaren     Norris       8
 5 Albert Park Grand Prix Circuit  2023 McLaren     Piastri      4
 6 Baku City Circuit               2023 McLaren     Norris       2
 7 Baku City Circuit               2023 McLaren     Piastri      0
 8 Miami International Autodrome   2023 McLaren     Norris       0
 9 Miami International Autodrome   2023 McLaren     Piastri      0
10 Circuit de Monaco               2023 McLaren     Norris       2
# ℹ 30 more rows

What if we wanted two different constructors? Then we want to use the logical operator for | and also differentiate it from the next instruction for filter by using a , or (). In the example bellow, I filter for the race information for the constructor Mercades and Red Bull during the 2021 season.

race_stats |>
  select(circuit, year, constructor, surname, points) |>
  # remove duplicates
    unique() |>
    filter(constructor == "Mercades" |  constructor == "Red Bull", year == 2021) |>
    print()

# A tibble: 40 × 5
   circuit                             year constructor surname    points
   <chr>                              <dbl> <chr>       <chr>       <dbl>
 1 Losail International Circuit        2021 Red Bull    Pérez          12
 2 Losail International Circuit        2021 Red Bull    Verstappen     19
 3 Bahrain International Circuit       2021 Red Bull    Pérez          10
 4 Bahrain International Circuit       2021 Red Bull    Verstappen     18
 5 Autodromo Enzo e Dino Ferrari       2021 Red Bull    Pérez           0
 6 Autodromo Enzo e Dino Ferrari       2021 Red Bull    Verstappen     25
 7 Autódromo Internacional do Algarve  2021 Red Bull    Pérez          12
 8 Autódromo Internacional do Algarve  2021 Red Bull    Verstappen     18
 9 Circuit de Barcelona-Catalunya      2021 Red Bull    Pérez          10
10 Circuit de Barcelona-Catalunya      2021 Red Bull    Verstappen     19
# ℹ 30 more rows

However, if we wanted to use the modified data, we could not because it is not saved. If we want to use this data we need to save it as a new object!

season_2021 <- race_stats |>
  select(circuit, year, constructor, surname, points) |>
  # remove duplicates
    unique() |>
    filter(constructor == "Mercades" |  constructor == "Red Bull", year == 2021) |>
    print()

# A tibble: 40 × 5
   circuit                             year constructor surname    points
   <chr>                              <dbl> <chr>       <chr>       <dbl>
 1 Losail International Circuit        2021 Red Bull    Pérez          12
 2 Losail International Circuit        2021 Red Bull    Verstappen     19
 3 Bahrain International Circuit       2021 Red Bull    Pérez          10
 4 Bahrain International Circuit       2021 Red Bull    Verstappen     18
 5 Autodromo Enzo e Dino Ferrari       2021 Red Bull    Pérez           0
 6 Autodromo Enzo e Dino Ferrari       2021 Red Bull    Verstappen     25
 7 Autódromo Internacional do Algarve  2021 Red Bull    Pérez          12
 8 Autódromo Internacional do Algarve  2021 Red Bull    Verstappen     18
 9 Circuit de Barcelona-Catalunya      2021 Red Bull    Pérez          10
10 Circuit de Barcelona-Catalunya      2021 Red Bull    Verstappen     19
# ℹ 30 more rows

Now, we can use this information!