Reshaping

# install.packages("tidyverse")
# install.packages("here")

library(tidyverse)
library(here)

In this chapter, we will look at a simpler data set that makes it a bit easier to explain reshaping between the wide and long data format, as our athletes data set is relatively complex.

Let’s define our own data set:

inhabitants_wide <- data.frame(
  country = c("China", "India", "USA"),
  inhabitants_2021 = c(1425893465, 1407563842, NA),
  inhabitants_2022 = c(1425857720, 1420939232, 338903174)
)

Wide Format

In the Wide Format, we have only one row per unit of analysis. In this example, each country has it’s own row:

head(inhabitants_wide)
  country inhabitants_2021 inhabitants_2022
1   China       1425893465       1425857720
2   India       1407563842       1420939232
3     USA               NA        338903174

However, the variable inhabitants is stretched over multiple rows. Depending on the use case it can make sense to reshape the data, so the inhabitants are all put into one column:

Long Format

This is what happens in the Long Data Format, where each unit of analysis is spread over multiple rows.

head(inhabitants_long)
# A tibble: 6 × 3
  country year             inhabitants
  <chr>   <chr>                  <dbl>
1 China   inhabitants_2022  1425857720
2 China   inhabitants_2021  1425893465
3 India   inhabitants_2022  1420939232
4 India   inhabitants_2021  1407563842
5 USA     inhabitants_2022   338903174
6 USA     inhabitants_2021          NA

Reshaping from Wide to Long Format

To get from Wide Format to Long Format we can use the pivot_longer() function from the tidyverse:

inhabitants_long <- inhabitants_wide %>%
  pivot_longer(
    ## Select the columns we want to reshape:
    cols = c("inhabitants_2022", "inhabitants_2021"),
    ## Define a new column where the column names will go to:
    names_to = "year",
    ## Define a new column where the values will go to:
    values_to = "inhabitants"
  )

head(inhabitants_long)
# A tibble: 6 × 3
  country year             inhabitants
  <chr>   <chr>                  <dbl>
1 China   inhabitants_2022  1425857720
2 China   inhabitants_2021  1425893465
3 India   inhabitants_2022  1420939232
4 India   inhabitants_2021  1407563842
5 USA     inhabitants_2022   338903174
6 USA     inhabitants_2021          NA

Reshaping from Long to Wide Format

In other cases, it might happen that multiple variables are put into the same column, together with an identifier column:

inhabitants_long_2
  country         variable      value
1   China             area    9597000
2   China inhabitants_2022 1425857720
3   India             area    3287000
4   India inhabitants_2022 1420939232
5     USA             area    9834000
6     USA inhabitants_2022  338903174

In that case it can make sense to spread the the distinct variables into two columns:

inhabitants_wide_2 <- inhabitants_long_2 %>%
  pivot_wider(
    id_cols = "country",
    names_from = "variable",
    values_from = "value"
  )

inhabitants_wide_2
# A tibble: 3 × 3
  country    area inhabitants_2022
  <chr>     <dbl>            <dbl>
1 China   9597000       1425857720
2 India   3287000       1420939232
3 USA     9834000        338903174