# install.packages("tidyverse")
# install.packages("here")
library(tidyverse)
library(here)
<- readRDS(file = here::here("raw_data", "athletes.rds")) athletes
Missing values
What are NA's
?
Remember the weird NA
rows we have encountered when subsetting by condition? We were able to steer around that by using filter()
, but let’s take a closer look at that now.
Missing values are denoted in R by NA
(or NaN
in some cases). They nullify a calculation or comparison pretty strongly - if one missing value is found somewhere along the line, the result will also be NA
(if not specified otherwise):
c(4, NA) > 3
[1] TRUE NA
That’s why we got some NA
rows when trying to extract specific rows by weight: these rows had an NA
in the weight column, and R returned rows with NA's
as a result.
How to deal with them?
Find NA's
To check if values are NA
, we can use is.na()
:
is.na(athletes$Weight)
[1] TRUE TRUE FALSE TRUE FALSE FALSE TRUE TRUE FALSE TRUE FALSE TRUE
[13] FALSE TRUE FALSE TRUE FALSE FALSE TRUE FALSE TRUE FALSE TRUE FALSE
[25] FALSE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE
[37] FALSE TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE TRUE FALSE
...
Some TRUEs
, so there are missing values here. Let’s count them (Summing a logical vector counts the number of TRUE
values.):
sum(is.na(athletes$Weight))
[1] 62785
We seem to have 62785 missings in this column.
Filtering rows with NA's
There are multiple different ways to deal with missings. For our comparison problem, we can add the new condition that all rows that get selected shouldn’t have an NA
in the Weight
column:
$Sport == "Judo") & (athletes$Weight > 100 | athletes$Weight < 50) & !is.na(athletes$Weight), ] athletes[(athletes
NOC ID Name Sex Age Height Weight
471 ALG 13895 Mohamed Bouaichaoui M 25 178 120.0
673 ALG 82643 Meriem Moussa F 20 150 48.0
702 ALG 80035 Boualem Miloudi M 23 192 106.0
...
Like always when filtering specific rows, we define a logical vector, which has a TRUE
for all rows that have a missing on ID
and a FALSE
for all others (by using the !
operator, which inverts the boolean values - otherwise we would extract all rows with missing values in the Weight
column):
!is.na(athletes$Weight)
[1] FALSE FALSE TRUE FALSE TRUE TRUE FALSE FALSE TRUE FALSE TRUE FALSE
[13] TRUE FALSE TRUE FALSE TRUE TRUE FALSE TRUE FALSE TRUE FALSE TRUE
[25] TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
[37] TRUE FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE TRUE FALSE TRUE
...
We also assign the new name athletes_na
to the resulting data frame, so we don’t overwrite the original one.
Removing NA's
Removing NA's
from a data frame works pretty similar:
<- athletes[!is.na(athletes$Weight), ]
athletes_na head(athletes_na)
NOC ID Name Sex Age Height Weight Team
3 AFG 44977 Mohammad Halilula M 28 163 57 Afghanistan
5 AFG 109153 Shakar Khan Shakar M 24 NA 74 Afghanistan
6 AFG 29626 Sultan Mohammad Dost M 28 168 73 Afghanistan
9 AFG 80210 Alam Mir M NA NA 57 Afghanistan
11 AFG 116125 Nizam-ud-din Subhani M 34 168 111 Afghanistan
13 AFG 133692 Khojawahid Zahedi M 20 178 74 Afghanistan
Games Year Season City Sport
3 1980 Summer 1980 Summer Moskva Wrestling
5 1964 Summer 1964 Summer Tokyo Wrestling
6 1960 Summer 1960 Summer Roma Wrestling
9 1972 Summer 1972 Summer Munich Wrestling
11 1960 Summer 1960 Summer Roma Wrestling
13 1980 Summer 1980 Summer Moskva Wrestling
Event Medal Region
3 Wrestling Men's Bantamweight, Freestyle <NA> Afghanistan
5 Wrestling Men's Welterweight, Freestyle <NA> Afghanistan
6 Wrestling Men's Welterweight, Freestyle <NA> Afghanistan
9 Wrestling Men's Bantamweight, Greco-Roman <NA> Afghanistan
11 Wrestling Men's Heavyweight, Freestyle <NA> Afghanistan
13 Wrestling Men's Welterweight, Freestyle <NA> Afghanistan
Or, using the tidyverse
:
library(tidyverse)
<- athletes %>%
athletes_na drop_na(Weight)
head(athletes_na)
NOC ID Name Sex Age Height Weight Team Games
1 AFG 44977 Mohammad Halilula M 28 163 57 Afghanistan 1980 Summer
2 AFG 109153 Shakar Khan Shakar M 24 NA 74 Afghanistan 1964 Summer
3 AFG 29626 Sultan Mohammad Dost M 28 168 73 Afghanistan 1960 Summer
4 AFG 80210 Alam Mir M NA NA 57 Afghanistan 1972 Summer
5 AFG 116125 Nizam-ud-din Subhani M 34 168 111 Afghanistan 1960 Summer
6 AFG 133692 Khojawahid Zahedi M 20 178 74 Afghanistan 1980 Summer
Year Season City Sport Event Medal
1 1980 Summer Moskva Wrestling Wrestling Men's Bantamweight, Freestyle <NA>
2 1964 Summer Tokyo Wrestling Wrestling Men's Welterweight, Freestyle <NA>
3 1960 Summer Roma Wrestling Wrestling Men's Welterweight, Freestyle <NA>
4 1972 Summer Munich Wrestling Wrestling Men's Bantamweight, Greco-Roman <NA>
5 1960 Summer Roma Wrestling Wrestling Men's Heavyweight, Freestyle <NA>
6 1980 Summer Moskva Wrestling Wrestling Men's Welterweight, Freestyle <NA>
Region
1 Afghanistan
2 Afghanistan
3 Afghanistan
4 Afghanistan
5 Afghanistan
6 Afghanistan
Both code versions will remove all rows containing NA's
in the weight column
.
As already stated, it is not always necessary to remove NA's
manually from the data set. In other cases it might be feasible to ignore them, and many functions can deal with missing values by themselves.
Footnotes
Image by Pierre Bamin on Unsplash.↩︎