Subsetting

Previous code

# install.packages("tidyverse")
# install.packages("here")

library(tidyverse)
library(here)

athletes <- readRDS(file = here::here("raw_data", "athletes.rds"))

Subsetting means extracting smaller sets of data from a bigger data set. For example, we can extract specific rows from a data frame, or specific values from a vector. Let’s take a look at how that is done in R:

Data frames

In R we use square brackets [,] to extract specific rows and columns.

Rows

In front of the , we write the rows we want to extract:

# Extract the first and the fourth row
athletes[c(1, 4), ]

  NOC     ID              Name Sex Age Height Weight        Team       Games
1 AFG 132181       Najam Yahya   M  NA     NA     NA Afghanistan 1956 Summer
4 AFG    502 Ahmad Shah Abouwi   M  NA     NA     NA Afghanistan 1956 Summer
  Year Season      City  Sport               Event Medal      Region
...

Columns

Behind it the columns:

# Extract the second and the fourth column:
athletes[, c(2, 4)]

          ID Sex
1     132181   M
2      87371   M
3      44977   M
...

# Extract the columns by name:
athletes[, c("Year", "Sport")]

      Year                     Sport
1     1956                    Hockey
2     1948                    Hockey
3     1980                 Wrestling
...

# Or only the column Year (and turn it into a vector right away):
athletes$Year

    [1] 1956 1948 1980 1956 1964 1960 1936 1956 1972 1956 1960 1948 1980 1948
   [15] 1960 1936 1960 1968 1948 1972 1956 1980 1956 2016 1968 1948 1980 1936
   [29] 1988 1948 1956 1988 1956 1972 1960 1980 1972 2004 1980 1960 1972 1980
   [43] 1956 1964 1948 2008 1996 1980 1968 1960 1972 1972 1948 1936 2004 1936
...

Tip

Always use column names instead of position if possible. This way, your code will still work if the column position changes.

Rows & Columns

And of course we can combine both calls:

athletes[c(1, 4), c(2, 4)]

      ID Sex
1 132181   M
4    502   M

athletes[c(1, 4), c("Year", "Sport")]

  Year  Sport
1 1956 Hockey
4 1956 Hockey

We can also use Boolean values (every row/column must get a value here, so we extract the first 100 rows by repeating TRUE 100 times, and than add FALSE for the remaining rows):

str(athletes[c(rep(TRUE, 100), rep(FALSE, 271016)), ])

'data.frame':   100 obs. of  16 variables:
 $ NOC   : chr  "AFG" "AFG" "AFG" "AFG" ...
 $ ID    : int  132181 87371 44977 502 109153 29626 1076 121376 80210 87374 ...
 $ Name  : chr  "Najam Yahya" "Ahmad Jahan Nuristani" "Mohammad Halilula" "Ahmad Shah Abouwi" ...
 $ Sex   : chr  "M" "M" "M" "M" ...
 $ Age   : int  NA NA 28 NA 24 28 28 NA NA NA ...
 $ Height: int  NA NA 163 NA NA 168 NA NA NA NA ...
 $ Weight: num  NA NA 57 NA 74 73 NA NA 57 NA ...
 $ Team  : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
 $ Games : chr  "1956 Summer" "1948 Summer" "1980 Summer" "1956 Summer" ...
 $ Year  : int  1956 1948 1980 1956 1964 1960 1936 1956 1972 1956 ...
 $ Season: chr  "Summer" "Summer" "Summer" "Summer" ...
 $ City  : chr  "Melbourne" "London" "Moskva" "Melbourne" ...
 $ Sport : chr  "Hockey" "Hockey" "Wrestling" "Hockey" ...
 $ Event : chr  "Hockey Men's Hockey" "Hockey Men's Hockey" "Wrestling Men's Bantamweight, Freestyle" "Hockey Men's Hockey" ...
 $ Medal : chr  NA NA NA NA ...
 $ Region: chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...

Tip

Instead of writing 271016 we should actually use the current row number, in case that changes as well:

str(athletes[c(rep(TRUE, 100), rep(FALSE, nrow(athletes) - 100)), ])

'data.frame':   100 obs. of  16 variables:
 $ NOC   : chr  "AFG" "AFG" "AFG" "AFG" ...
 $ ID    : int  132181 87371 44977 502 109153 29626 1076 121376 80210 87374 ...
 $ Name  : chr  "Najam Yahya" "Ahmad Jahan Nuristani" "Mohammad Halilula" "Ahmad Shah Abouwi" ...
 $ Sex   : chr  "M" "M" "M" "M" ...
 $ Age   : int  NA NA 28 NA 24 28 28 NA NA NA ...
 $ Height: int  NA NA 163 NA NA 168 NA NA NA NA ...
 $ Weight: num  NA NA 57 NA 74 73 NA NA 57 NA ...
 $ Team  : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
 $ Games : chr  "1956 Summer" "1948 Summer" "1980 Summer" "1956 Summer" ...
 $ Year  : int  1956 1948 1980 1956 1964 1960 1936 1956 1972 1956 ...
 $ Season: chr  "Summer" "Summer" "Summer" "Summer" ...
 $ City  : chr  "Melbourne" "London" "Moskva" "Melbourne" ...
 $ Sport : chr  "Hockey" "Hockey" "Wrestling" "Hockey" ...
 $ Event : chr  "Hockey Men's Hockey" "Hockey Men's Hockey" "Wrestling Men's Bantamweight, Freestyle" "Hockey Men's Hockey" ...
 $ Medal : chr  NA NA NA NA ...
 $ Region: chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...

Conditional filtering

Now the stuff we looked at in logical operators comes in handy! We can filter rows which match some condition. For example, we might want to look at all athletes from Germany:

athletes[athletes$Team == "Germany", ]

       NOC     ID
107246 GER   7385
107247 GER 114424
107248 GER 112937
107249 GER 107870
107250 GER   9399
107252 GER   9398
107253 GER  47318
107254 GER  96348
107255 GER 127340
...

Unfold if you want to take a closer look at what’s happening here

Take a close look at the comparison before the ,:

athletes$Team == "Germany"

    [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
   [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
   [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
   [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
...

athletes$Team is a vector, so comparing its values to a specified value yields a logical vector with the respective TRUE and FALSE values. We can insert this logical vector in front of the , to extract all rows corresponding to that condition.

If we want to extract multiple nationalities at once, we need the %in% operator:

athletes[athletes$Team %in% c("Kenya", "Norway"), ]

       NOC     ID                                                          Name
156375 KEN  60617                                               Nixon Kiprotich
156376 KEN  85669                                             Benjamin Ngaruiya
156377 KEN  60620                                    Wilson Arap Chuma Kiprugut
156378 KEN  87843                                                Maisiba Obwoge
156379 KEN  20521                                             Charles Cheruiyot
156380 KEN  20524                               Rose Jelagat Cheruiyot (-Kirui)
156381 KEN  60610                                          Asbel Kipruto Kiprop
156382 KEN  90229                                             Janet Owino Awour
156383 KEN  89053                                             Ahmed Rajab Omari
...

By the way, if we want to save our extracted data frame, we can assign it a new name (otherwise it will only get printed into the console, but we can’t go on working with it):

athletes_team <- athletes[athletes$Team %in% c("Kenya", "Norway"), ]
head(athletes_team)

       NOC    ID                            Name Sex Age Height Weight  Team
156375 KEN 60617                 Nixon Kiprotich   M  29    185     68 Kenya
156376 KEN 85669               Benjamin Ngaruiya   M  24     NA     NA Kenya
156377 KEN 60620      Wilson Arap Chuma Kiprugut   M  NA    178     71 Kenya
156378 KEN 87843                  Maisiba Obwoge   M  28    175     99 Kenya
156379 KEN 20521               Charles Cheruiyot   M  19    165     54 Kenya
156380 KEN 20524 Rose Jelagat Cheruiyot (-Kirui)   F  24    154     48 Kenya
             Games Year Season        City     Sport
156375 1992 Summer 1992 Summer   Barcelona Athletics
156376 1992 Summer 1992 Summer   Barcelona    Boxing
...

We can also combine multiple logical vectors using & (“and”) and | (“or”). For example, we might want to look at all german athletes before the year 2000:

athletes_2 <- athletes[athletes$Team == "Germany" & athletes$Year < 2000, ]
head(athletes_2)

       NOC     ID                   Name Sex Age Height Weight    Team
107246 GER   7385     Dirk Peter Balster   M  26    195     90 Germany
107247 GER 114424 Kathleen Stark (-Kern)   F  16    166     51 Germany
107250 GER   9399     Petra Behle-Schaaf   F  23    177     67 Germany
107252 GER   9398 Jochen Friedrich Behle   M  37    183     73 Germany
107253 GER  47318          Martin Heinze   M  21    172     73 Germany
107254 GER  96348        Ramona Portwich   F  25    175     70 Germany
             Games Year Season        City                Sport
107246 1992 Summer 1992 Summer   Barcelona               Rowing
107247 1992 Summer 1992 Summer   Barcelona           Gymnastics
107250 1992 Winter 1992 Winter Albertville             Biathlon
107252 1998 Winter 1998 Winter      Nagano Cross Country Skiing
107253 1960 Summer 1960 Summer        Roma            Wrestling
107254 1992 Summer 1992 Summer   Barcelona             Canoeing
                                            Event Medal  Region
107246                 Rowing Men's Coxless Fours  <NA> Germany
107247             Gymnastics Women's Uneven Bars  <NA> Germany
107250     Biathlon Women's 7.5 kilometres Sprint  <NA> Germany
107252   Cross Country Skiing Men's 10 kilometres  <NA> Germany
107253    Wrestling Men's Welterweight, Freestyle  <NA> Germany
107254 Canoeing Women's Kayak Doubles, 500 metres  Gold Germany

Or at all judo athletes weighting over 100 or under 50 kg:

athletes_3 <- athletes[(athletes$Sport == "Judo") & (athletes$Weight > 100 | athletes$Weight < 50), ]
head(athletes_3)

      NOC    ID                Name  Sex Age Height Weight    Team       Games
NA   <NA>    NA                <NA> <NA>  NA     NA     NA    <NA>        <NA>
NA.1 <NA>    NA                <NA> <NA>  NA     NA     NA    <NA>        <NA>
NA.2 <NA>    NA                <NA> <NA>  NA     NA     NA    <NA>        <NA>
NA.3 <NA>    NA                <NA> <NA>  NA     NA     NA    <NA>        <NA>
471   ALG 13895 Mohamed Bouaichaoui    M  25    178    120 Algeria 2004 Summer
NA.4 <NA>    NA                <NA> <NA>  NA     NA     NA    <NA>        <NA>
     Year Season   City Sport                  Event Medal  Region
NA     NA   <NA>   <NA>  <NA>                   <NA>  <NA>    <NA>
NA.1   NA   <NA>   <NA>  <NA>                   <NA>  <NA>    <NA>
NA.2   NA   <NA>   <NA>  <NA>                   <NA>  <NA>    <NA>
NA.3   NA   <NA>   <NA>  <NA>                   <NA>  <NA>    <NA>
471  2004 Summer Athina  Judo Judo Men's Heavyweight  <NA> Algeria
NA.4   NA   <NA>   <NA>  <NA>                   <NA>  <NA>    <NA>

Hmm, that looks a bit weird. Some rows only contain NA values. That’s because there are missing values in the Weight column. We will look at that closer in the missings chapter and ignore that problem for now.

In the long run, always having to specify the name of the data frame for each column or row with condition can become a bit annoying and clutters the code. Also this code leaves all rows with missing values…
Instead, we can use the filter() function from the tidyverse:

Rows: Tidyverse

library(tidyverse)

athletes %>%
  filter(Sport == "Judo", (Weight > 100 | Weight < 50))

    NOC     ID                                 Name Sex Age Height Weight
1   ALG  13895                  Mohamed Bouaichaoui   M  25    178  120.0
2   ALG  82643                        Meriem Moussa   F  20    150   48.0
3   ALG  80035                      Boualem Miloudi   M  23    192  106.0
...

Note how we can just write our conditions without connecting them with & (filter() does that automatically for us). Also, we don’t have to put the column names into "", because filter() knows that this are column names of the athletes data frame, which makes coding a bit more pleasant. And finally, missing rows are automatically removed, which makes sense in many cases!

Columns: Tidyverse

For extracting columns, we need select():

athletes %>%
  select(Year, Sport)

      Year                     Sport
1     1956                    Hockey
2     1948                    Hockey
3     1980                 Wrestling
...

Vectors

Let’s take a quick look at how to extract elements from a vector, which shouldn’t be a problem after already dealing with data frames. It’s pretty straight forward: we just put the position of the element we want to extract behind the vector in square brackets (without a ,, as we only have a one dimensional object). Let’s quickly define a vector for illustration:

vec_sport <- athletes$Sport # remember: `$` returns a vector

And look at the second element:

vec_sport[2]

[1] "Hockey"

Of course we can also do that for multiple elements:

vec_sport[c(2, 3, 4)]

[1] "Hockey"    "Wrestling" "Hockey"

## Or, less to write:
vec_sport[2:4]

[1] "Hockey"    "Wrestling" "Hockey"

Another way would be to provide a logical vector, which defines for each position if we want to extract the element or not (like we already did for data frames):

vec_sport[c(rep(TRUE, 100), rep(FALSE, 65))]

    [1] "Hockey"                    "Hockey"                   
    [3] "Wrestling"                 "Hockey"                   
    [5] "Wrestling"                 "Wrestling"                
    [7] "Hockey"                    "Hockey"                   
...

Lists

When subsetting lists we have two options:

# Define an example list:
show_list <- list(
  "TV-Show" = c("Friends", "How I Met Your Mother"),
  "dat" = data.frame(
    "name" = c("Monica", "Ted"),
    "age" = c(24, 27)
  )
)

We can extract a list element. This is done by single square brackets:

str(show_list[2])

List of 1
 $ dat:'data.frame':    2 obs. of  2 variables:
  ..$ name: chr [1:2] "Monica" "Ted"
  ..$ age : num [1:2] 24 27

Note how the result is still a list? It’s like taking out a drawer from a closet, but keeping the content inside this drawer.

We can extract the element that is stored inside the list element. This is done by double square brackets:

str(show_list[[2]])

'data.frame':   2 obs. of  2 variables:
 $ name: chr  "Monica" "Ted"
 $ age : num  24 27

Here the result is the data frame that was saved inside the list. It’s like taking the content out of the drawer.