Missing values

Previous code

# install.packages("tidyverse")
# install.packages("here")

library(tidyverse)
library(here)

athletes <- readRDS(file = here::here("raw_data", "athletes.rds"))

What are `NA's`?

Remember the weird NA rows we have encountered when subsetting by condition? We were able to steer around that by using filter(), but let’s take a closer look at that now.

Missing values are denoted in R by NA (or NaN in some cases). They nullify a calculation or comparison pretty strongly - if one missing value is found somewhere along the line, the result will also be NA (if not specified otherwise):

c(4, NA) > 3

[1] TRUE   NA

That’s why we got some NA rows when trying to extract specific rows by weight: these rows had an NA in the weight column, and R returned rows with NA's as a result.

How to deal with them?

Find `NA's`

To check if values are NA, we can use is.na():

is.na(athletes$Weight)

    [1]  TRUE  TRUE FALSE  TRUE FALSE FALSE  TRUE  TRUE FALSE  TRUE FALSE  TRUE
   [13] FALSE  TRUE FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE
   [25] FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE
   [37] FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE  TRUE FALSE
...

Some TRUEs, so there are missing values here. Let’s count them (Summing a logical vector counts the number of TRUE values.):

sum(is.na(athletes$Weight))

[1] 62785

We seem to have 62785 missings in this column.

Filtering rows with `NA's`

There are multiple different ways to deal with missings. For our comparison problem, we can add the new condition that all rows that get selected shouldn’t have an NA in the Weight column:

athletes[(athletes$Sport == "Judo") & (athletes$Weight > 100 | athletes$Weight < 50) & !is.na(athletes$Weight), ]

       NOC     ID                                 Name Sex Age Height Weight
471    ALG  13895                  Mohamed Bouaichaoui   M  25    178  120.0
673    ALG  82643                        Meriem Moussa   F  20    150   48.0
702    ALG  80035                      Boualem Miloudi   M  23    192  106.0
...

Unfold if you want to take a closer look at what’s happening here

Like always when filtering specific rows, we define a logical vector, which has a TRUE for all rows that have a missing on ID and a FALSE for all others (by using the ! operator, which inverts the boolean values - otherwise we would extract all rows with missing values in the Weight column):

!is.na(athletes$Weight)

    [1] FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE
   [13]  TRUE FALSE  TRUE FALSE  TRUE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE
   [25]  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE
   [37]  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE FALSE  TRUE
...

We also assign the new name athletes_na to the resulting data frame, so we don’t overwrite the original one.

Removing `NA's`

Removing NA's from a data frame works pretty similar:

athletes_na <- athletes[!is.na(athletes$Weight), ]
head(athletes_na)

   NOC     ID                 Name Sex Age Height Weight        Team
3  AFG  44977    Mohammad Halilula   M  28    163     57 Afghanistan
5  AFG 109153   Shakar Khan Shakar   M  24     NA     74 Afghanistan
6  AFG  29626 Sultan Mohammad Dost   M  28    168     73 Afghanistan
9  AFG  80210             Alam Mir   M  NA     NA     57 Afghanistan
11 AFG 116125 Nizam-ud-din Subhani   M  34    168    111 Afghanistan
13 AFG 133692    Khojawahid Zahedi   M  20    178     74 Afghanistan
         Games Year Season   City     Sport
3  1980 Summer 1980 Summer Moskva Wrestling
5  1964 Summer 1964 Summer  Tokyo Wrestling
6  1960 Summer 1960 Summer   Roma Wrestling
9  1972 Summer 1972 Summer Munich Wrestling
11 1960 Summer 1960 Summer   Roma Wrestling
13 1980 Summer 1980 Summer Moskva Wrestling
                                       Event Medal      Region
3    Wrestling Men's Bantamweight, Freestyle  <NA> Afghanistan
5    Wrestling Men's Welterweight, Freestyle  <NA> Afghanistan
6    Wrestling Men's Welterweight, Freestyle  <NA> Afghanistan
9  Wrestling Men's Bantamweight, Greco-Roman  <NA> Afghanistan
11    Wrestling Men's Heavyweight, Freestyle  <NA> Afghanistan
13   Wrestling Men's Welterweight, Freestyle  <NA> Afghanistan

Or, using the tidyverse:

library(tidyverse)

athletes_na <- athletes %>%
  drop_na(Weight)
head(athletes_na)

  NOC     ID                 Name Sex Age Height Weight        Team       Games
1 AFG  44977    Mohammad Halilula   M  28    163     57 Afghanistan 1980 Summer
2 AFG 109153   Shakar Khan Shakar   M  24     NA     74 Afghanistan 1964 Summer
3 AFG  29626 Sultan Mohammad Dost   M  28    168     73 Afghanistan 1960 Summer
4 AFG  80210             Alam Mir   M  NA     NA     57 Afghanistan 1972 Summer
5 AFG 116125 Nizam-ud-din Subhani   M  34    168    111 Afghanistan 1960 Summer
6 AFG 133692    Khojawahid Zahedi   M  20    178     74 Afghanistan 1980 Summer
  Year Season   City     Sport                                     Event Medal
1 1980 Summer Moskva Wrestling   Wrestling Men's Bantamweight, Freestyle  <NA>
2 1964 Summer  Tokyo Wrestling   Wrestling Men's Welterweight, Freestyle  <NA>
3 1960 Summer   Roma Wrestling   Wrestling Men's Welterweight, Freestyle  <NA>
4 1972 Summer Munich Wrestling Wrestling Men's Bantamweight, Greco-Roman  <NA>
5 1960 Summer   Roma Wrestling    Wrestling Men's Heavyweight, Freestyle  <NA>
6 1980 Summer Moskva Wrestling   Wrestling Men's Welterweight, Freestyle  <NA>
       Region
1 Afghanistan
2 Afghanistan
3 Afghanistan
4 Afghanistan
5 Afghanistan
6 Afghanistan

Both code versions will remove all rows containing NA's in the weight column.

As already stated, it is not always necessary to remove NA's manually from the data set. In other cases it might be feasible to ignore them, and many functions can deal with missing values by themselves.

Footnotes

Image by Pierre Bamin on Unsplash.↩︎

What are NA's?