Getting an overview

# install.packages("tidyverse")
# install.packages("here")

library(tidyverse)
library(here)

athletes <- readRDS(file = here::here("raw_data", "athletes.rds"))

Before doing anything with your data, it is always a good idea to get an overview. Our goal is to answer questions along the lines of:

  1. Which variables does our data have?
  2. How many rows/columns does our data frame have? If we have a list, how long is it and what is stored in it?
  3. What types do our variables have (are they numeric, character …)? Do we have to transform them before we can work with them?
  4. Do we have any missing values?
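Each of these questions maps onto a short base R call. The following is a minimal sketch, not the only way to do it. It uses a small made-up data frame called toy (a hypothetical stand-in, not part of the athletes data) so that it runs without the data file; the same calls should work on athletes as well.

```r
# Hypothetical toy data standing in for `athletes` (all values are made up)
toy <- data.frame(
  Name   = c("A", "B", "C"),
  Age    = c(28L, NA, 24L),
  Weight = c(57, NA, 74),
  stringsAsFactors = FALSE
)

# 1. Which variables does the data have?
var_names <- names(toy)

# 2. How many rows and columns? dim() returns c(rows, columns)
dims <- dim(toy)

# 3. What type does each variable have?
var_types <- sapply(toy, class)

# 4. How many missing values are there per column?
na_counts <- colSums(is.na(toy))

print(var_names)
print(dims)
print(var_types)
print(na_counts)
```

The tools introduced next give most of this information in a single call, but the individual functions are handy when only one of the answers is needed.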

To answer these questions, we have different tools at our disposal:

View()

View() will open the data set Excel-style in a new window:

View(athletes)

In this window we can sort and filter, which makes it a pretty useful tool.

str()

This one is actually my favorite: for bigger data sets it is often more practical to look only at the structure rather than the whole data set. The output looks a bit different from what we are used to, though:

str(athletes)
'data.frame':   270767 obs. of  16 variables:
 $ NOC   : chr  "AFG" "AFG" "AFG" "AFG" ...
 $ ID    : int  132181 87371 44977 502 109153 29626 1076 121376 80210 87374 ...
 $ Name  : chr  "Najam Yahya" "Ahmad Jahan Nuristani" "Mohammad Halilula" "Ahmad Shah Abouwi" ...
 $ Sex   : chr  "M" "M" "M" "M" ...
 $ Age   : int  NA NA 28 NA 24 28 28 NA NA NA ...
 $ Height: int  NA NA 163 NA NA 168 NA NA NA NA ...
 $ Weight: num  NA NA 57 NA 74 73 NA NA 57 NA ...
 $ Team  : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
 $ Games : chr  "1956 Summer" "1948 Summer" "1980 Summer" "1956 Summer" ...
 $ Year  : int  1956 1948 1980 1956 1964 1960 1936 1956 1972 1956 ...
 $ Season: chr  "Summer" "Summer" "Summer" "Summer" ...
 $ City  : chr  "Melbourne" "London" "Moskva" "Melbourne" ...
 $ Sport : chr  "Hockey" "Hockey" "Wrestling" "Hockey" ...
 $ Event : chr  "Hockey Men's Hockey" "Hockey Men's Hockey" "Wrestling Men's Bantamweight, Freestyle" "Hockey Men's Hockey" ...
 $ Medal : chr  NA NA NA NA ...
 $ Region: chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...

Here, the column names are printed on the left side, followed by the type of the column and then the first few values of each column. We can also see at the top that this object is a data frame with 270767 rows and 16 columns.
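str() also works on lists, which helps with the question of how long a list is and what is stored in it. The sketch below uses a made-up nested list called results (hypothetical, for illustration only) and the max.level argument of str() to limit how deep the structure is printed.

```r
# Hypothetical nested list (made up for illustration)
results <- list(
  meta   = list(source = "example", year = 2016L),
  scores = c(9.5, 8.7, 9.1),
  names  = c("A", "B", "C")
)

# Number of top-level elements and their names
n_elements    <- length(results)
element_names <- names(results)
print(n_elements)
print(element_names)

# Show only the top level; nested lists are summarised, not expanded
str(results, max.level = 1)

# Show the full structure, including the contents of `meta`
str(results)
```

Limiting the depth is mostly useful for large, deeply nested lists, where the full output of str() can get very long.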

summary()

Finally, to get a more thorough overview of our variables, we can use summary():

summary(athletes)
     NOC                  ID             Name               Sex           
 Length:270767      Min.   :     1   Length:270767      Length:270767     
 Class :character   1st Qu.: 34630   Class :character   Class :character  
 Mode  :character   Median : 68187   Mode  :character   Mode  :character  
                    Mean   : 68229                                        
                    3rd Qu.:102066                                        
                    Max.   :135571                                        
                                                                          
      Age            Height          Weight           Team          
 Min.   :10.00   Min.   :127.0   Min.   : 25.00   Length:270767     
 1st Qu.:21.00   1st Qu.:168.0   1st Qu.: 60.00   Class :character  
 Median :24.00   Median :175.0   Median : 70.00   Mode  :character  
 Mean   :25.56   Mean   :175.3   Mean   : 70.71                     
 3rd Qu.:28.00   3rd Qu.:183.0   3rd Qu.: 79.00                     
 Max.   :97.00   Max.   :226.0   Max.   :214.00                     
 NA's   :9462    NA's   :60083   NA's   :62785                      
    Games                Year         Season              City          
 Length:270767      Min.   :1896   Length:270767      Length:270767     
 Class :character   1st Qu.:1960   Class :character   Class :character  
 Mode  :character   Median :1988   Mode  :character   Mode  :character  
                    Mean   :1978                                        
                    3rd Qu.:2002                                        
                    Max.   :2016                                        
                                                                        
    Sport              Event              Medal              Region         
 Length:270767      Length:270767      Length:270767      Length:270767     
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            

For numeric columns we get the minimum and maximum, the median and mean, as well as the first and third quartiles. If a column contains missing values (NAs), their number is printed at the bottom (e.g., look at the Age column). For character columns, only the length, class and mode are shown. We will look at how to deal with missing values soon, but first we have to talk about subsetting data.
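To see where the numbers in a numeric summary come from, the individual pieces can be computed by hand. This is a small sketch on a made-up vector called age (hypothetical values, not taken from the athletes data); note that mean() and quantile() need na.rm = TRUE as soon as missing values are present.

```r
# Hypothetical ages (made up), including one missing value
age <- c(10, 21, 24, 28, NA)

# summary() ignores NAs for the statistics and reports their count separately
age_summary <- summary(age)
print(age_summary)

# The same pieces computed by hand; na.rm = TRUE removes the NAs first
age_quartiles <- quantile(age, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)
age_mean      <- mean(age, na.rm = TRUE)
n_missing     <- sum(is.na(age))

print(age_quartiles)
print(age_mean)
print(n_missing)
```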