Merging data

# install.packages("tidyverse")
# install.packages("here")

library(tidyverse)
library(here)

athletes <- readRDS(file = here::here("raw_data", "athletes.rds"))

Data set

In the end, we want to plot the number of gold medals the countries have won on a world map. To do that, we need a data set containing coordinates of the different countries. Luckily, ggplot2 (part of the tidyverse) provides a fitting data set. Let’s download it and load it into R:

world_coordinates <- readRDS(file = here::here("raw_data", "world_coordinates.rds"))

Before merging

Right now, we have multiple rows for each country in both data sets. This will not be merged easily, so we have to reduce our athletes data first. We need to calculate how many gold medals each country has won in total. Let’s do that quickly, using some tidyverse functions. It is not especially important you understand and know everything that happens here, but we need it for the next chapters, so here it goes:

medal_counts <- athletes %>%
  filter(Medal == "Gold") %>%
  group_by(Region) %>%
  count(Medal) 

medal_counts
# A tibble: 99 × 3
# Groups:   Region [99]
   Region     Medal     n
   <chr>      <chr> <int>
 1 Algeria    Gold      5
 2 Argentina  Gold     91
 3 Armenia    Gold      2
 4 Australia  Gold    368
 5 Austria    Gold    108
 6 Azerbaijan Gold      7
 7 Bahamas    Gold     14
 8 Bahrain    Gold      1
 9 Belarus    Gold     24
10 Belgium    Gold     98
# ℹ 89 more rows

What happens here? We extract all rows containing gold medals, group them by region, so our next operation is performed region wise, and not for the whole data set. Then we count how many gold medals each region got.

Merging

To merge two data frames that include information that belongs together, we need a common column, on which we can combine them. In our case, this is the column containing the country. They are both named region, but one with an upper case R. This doesn’t pose a problem, as we can define which columns should be taken from which data frame for merging. Let’s take a quick look before merging to check if there are any countries named differently in both data sets (this simply combines commands we have already looked at in the Basic operations chapter:

medal_counts$Region[!(medal_counts$Region %in% world_coordinates$region)]
[1] "Individual Olympic Athletes"

Looks like all of the countries in our medal_countries data frame can also be found in our world_coordinates frame. Only athletes without a country will be lost when merging, but that’s ok for now, as we are interested in the country specific gold medal counts. So let’s merge:

medal_countries <- merge(
  x = medal_counts,
  y = world_coordinates,
  by.x = "Region",
  by.y = "region",
  all.x = FALSE,
  all.y = TRUE
)

head(medal_countries)
       Region Medal  n     long      lat group order subregion
1 Afghanistan  <NA> NA 74.89131 37.23164     2    12      <NA>
2 Afghanistan  <NA> NA 74.84023 37.22505     2    13      <NA>
3 Afghanistan  <NA> NA 74.76738 37.24917     2    14      <NA>
4 Afghanistan  <NA> NA 74.73896 37.28564     2    15      <NA>
5 Afghanistan  <NA> NA 74.72666 37.29072     2    16      <NA>
6 Afghanistan  <NA> NA 74.66895 37.26670     2    17      <NA>

Note that we also used the all.x and all.y arguments. In this example, we want to take all rows from the second data set, but only those from the first data set, that have a match in the second data set. This is necessary, because we want to plot all countries later on, but only those we have coordinates for, because they won’t show up on the map otherwise.

We can also use the tidyverse for this operation. In order to do that, we first have to rename our region column, as the column names need to be the same over both data sets that are merged. left_join() means that we will merge onto the first data set (world_coordinates in the code below), like we have done using the all.x and all.y arguments in the merge() function.

medal_countries <- world_coordinates %>%
  rename("Region" = region) %>%
  left_join(medal_counts)
Joining with `by = join_by(Region)`
head(medal_countries)
       long      lat group order Region subregion Medal  n
1 -69.89912 12.45200     1     1  Aruba      <NA>  <NA> NA
2 -69.89571 12.42300     1     2  Aruba      <NA>  <NA> NA
3 -69.94219 12.43853     1     3  Aruba      <NA>  <NA> NA
4 -70.00415 12.50049     1     4  Aruba      <NA>  <NA> NA
5 -70.06612 12.54697     1     5  Aruba      <NA>  <NA> NA
6 -70.05088 12.59707     1     6  Aruba      <NA>  <NA> NA

Great! Now the information that belongs together is stored together.