recoding – Didier Ruedin

Today I spent some time recoding categorical data: recategorizing countries into regions, recoding levels of education into larger groups, or “anonymizing” data by replacing names with pseudonyms. As always, there are many ways to do this in R, but I realized that hardly anyone seems to recommend merge() as a tool for recoding.

Let’s start with generating data:

set.seed(925217)
df = data.frame(age=floor(runif(50)*50), 
height=floor(runif(50)*70)+130,
preference=sample(c("A","B","C","D","E","F","G","H","I","Z","a"), 50, replace=TRUE))
df

The set.seed() command makes sure we all have the same example on the screen; it sets the pseudo-random number generator to a fixed sequence.

We then create a new dataframe called df, with three variables: age, height, and preference. For age, we generate 50 random numbers between 0 and 1 with runif(50), multiply them by 50, and then round them down using floor(). This gives whole numbers between 0 and 49. For height, we do the same thing with 70 and add 130 to get something that remotely resembles height (in cm). For preference, I sample 50 values from a vector of 11 letters, with replacement. Typing df, we can look at the dataframe (I only include the first few lines here):

> df
   age height preference
 1  45    147          Z
 2   0    175          E
 3   2    186          B
 4   2    162          Z
 5  30    186          G
 6   4    184          I
 7   0    135          F
 8  22    151          D
 9  32    160          a
10  26    178          H
11  43    182          C

OK, maybe I should use package Charlatan to create more realistic data…
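To convince ourselves that the generated values stay in the ranges described above, here is a minimal sketch that builds the two numeric variables on their own and looks at their ranges. The standalone vectors age and height are only for illustration; they are not part of df.

```r
set.seed(925217)

# runif() draws values strictly between 0 and 1, so floor() of the
# scaled value can never reach the upper bound.
age    <- floor(runif(50) * 50)        # whole numbers from 0 to 49
height <- floor(runif(50) * 70) + 130  # whole numbers from 130 to 199

range(age)
range(height)
```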

Now, let’s assume we want to recategorize the preference variable. In these examples, I classify the letters according to whether they are vowels (“v”) or consonants (“c”).

This being R, we have many ways to recode, including square brackets, or mutate() and recode() in the tidyverse. With square brackets, we can recode one value at a time:

df$new = NA
df$new[df$preference == "a"] = "v"
df$new[df$preference == "A"] = "v"
df$new[df$preference == "B"] = "c"

etc. I think this code is transparent, but it is a lot to type. If the categories are longer strings, the code also becomes harder to read, especially with many different categories to recode, like classifying all the countries of the world.

df$new = NA
df$new[df$preference %in% c("a", "A", "E", "I")] = "v"
df$new[df$preference %in% c("B", "C", "D", "F", "G","H", "Z")] = "c"

Using %in% we get shorter code, which can be easier to read. With longer strings and many categories to classify, however, it can be difficult to spot whether any category was missed.
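One way to spot a missed category is to compare the values in the data with the values we listed. This is only a sketch on made-up data: the vector preference stands in for df$preference, and I added a “Q” that neither group covers.

```r
# Made-up values for illustration; "Q" is deliberately left unclassified.
preference <- c("A", "B", "a", "Z", "E", "Q")

vowels     <- c("a", "A", "E", "I")
consonants <- c("B", "C", "D", "F", "G", "H", "Z")

new <- rep(NA_character_, length(preference))
new[preference %in% vowels]     <- "v"
new[preference %in% consonants] <- "c"

# Values that appear in the data but in neither group
missed <- setdiff(unique(preference), c(vowels, consonants))
missed
```

Anything listed in missed still has NA in the recoded variable.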

In the tidyverse, we might do:

library(dplyr)  # provides mutate() and recode(); the native pipe |> needs R 4.1 or later
df = df |> mutate(newvar = recode(preference, "a"="v", "A"="v", "B"="c", "C"="c", "D"="c", "E"="v", "F"="c", "G"="c", "H"="c", "I"="v", "Z"="c"))

We can spread the recoding over many lines to make the code more readable, especially if we imagine longer strings:

df = df |> mutate(newvar = recode(preference, "a"="v", 
"A"="v", 
"B"="c",

etc.

Let’s see what we can do with merge(). We’ll need a dataframe with the original values and the new categories.

pr = sort(unique(df$preference))
cv = c("v","v","c","c","c","v","c","c","c","v","c")
re = data.frame(preference=pr, vowel=cv)
re

I divide the task into steps to have code that is more transparent to me. First, I use unique() to get all the values of the original variable. I add a sort() around it to get the categories in a predictable order. It may not be necessary, but since we’re assigning the new categories manually, I prefer this safeguard (e.g. if the underlying data are updated with additional cases, or re-ordered). Second, I code manually, replacing each value in the output from pr (“a” “A” “B” “C” “D” “E” “F” “G” “H” “I” “Z”) with the corresponding category: “v”, “v”, “c” etc. It’s the same work we do with the more conventional approaches, but I get an easy way to verify the recoding before applying it to the whole data.
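A caveat on the sort order: sort() follows the collation rules of the current locale, so on some systems the uppercase letters come before “a”, and a vector like cv that was typed to match one order would then be misaligned. A possible alternative, sketched here under the assumption that we are willing to type the original values as well, is to write each value next to its new category in a named vector (map is a name I made up) and build the lookup table from that:

```r
# Each original value sits next to its new category, so the pairing
# does not depend on how sort() orders upper- and lowercase letters.
map <- c(a = "v", A = "v", B = "c", C = "c", D = "c", E = "v",
         F = "c", G = "c", H = "c", I = "v", Z = "c")

re <- data.frame(preference = names(map), vowel = unname(map))
re
```

The price is more typing, which is what the sort(unique()) step was meant to save, so this is a trade-off rather than a replacement.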

The third step is a new dataframe, where preference is the same variable name as in the dataframe df. This is important, because by default merge() matches rows on the columns that share a name in both dataframes. In this example, I chose “vowel” as the name of the recoded variable.

Now comes the step where merge() can be a real benefit: checking the recoding before applying it to the entire dataset. I simply type the name of the new dataframe (re in this case) and can check the original value and the assignment.

   preference  vowel
1           a      v
2           A      v
3           B      c
4           C      c
5           D      c
6           E      v
7           F      c
8           G      c
9           H      c
10          I      v
11          Z      c

This comparison works well even when the categories are longer strings and the conventional code becomes more difficult to check. We also see immediately whether we have recategorized all values: if the number of new categories does not match the number of original values, creating the dataframe fails with an error. I find this easier than comparing columns in the full dataset (i.e. it is a systematic rather than a cursory check), or using cross-tables, which become unwieldy with many categories.
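If the lookup table was typed by hand, or the data were updated after it was built, two quick checks can complement the visual inspection. This is a sketch on made-up data; the names df and re mirror the ones in this post, and “Q” is a value I added that has no entry in the lookup table.

```r
# Made-up example data; "Q" has no entry in the lookup table.
df <- data.frame(preference = c("a", "B", "E", "Q"))
re <- data.frame(preference = c("a", "B", "E"),
                 vowel      = c("v", "c", "v"))

# Values in the data without an entry in the lookup table
unmatched <- setdiff(df$preference, re$preference)
unmatched

# Number of original values listed more than once in the lookup table;
# duplicates would multiply rows when merging.
n_dup <- sum(duplicated(re$preference))
n_dup
```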

Then we simply merge:

df = merge(df, re)
df

  preference age height vowel
1          a  42    133     v
2          a  41    170     v
3          a   0    198     v

etc.
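Two things to keep in mind with merge(): by default it drops rows that have no match in the lookup table, and it returns the rows sorted by the merge variable, with that variable as the first column. The sketch below uses made-up data and a hypothetical id column to show how all.x = TRUE keeps unmatched rows and how the original order can be restored.

```r
# Made-up data; id is a hypothetical row identifier, "Q" has no match.
df <- data.frame(id = 1:4, preference = c("B", "a", "Q", "E"))
re <- data.frame(preference = c("a", "B", "E"),
                 vowel      = c("v", "c", "v"))

inner <- merge(df, re)                                   # default: the "Q" row is dropped
left  <- merge(df, re, by = "preference", all.x = TRUE)  # "Q" is kept, vowel is NA

nrow(inner)
nrow(left)

# Restore the original row order using the identifier
left <- left[order(left$id), ]
left
```

Naming the by column explicitly also guards against merge() matching on other columns that happen to share a name.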

This approach also works to “anonymize” a name, like a coder name, by replacing it with a unique ID (a pseudonym):

pr = sort(unique(df$preference))
an = order(pr)
ra = data.frame(preference=pr, anon=an)
ra

  preference anon
1          a    1
2          A    2
3          B    3
4          C    4

In this case, there really isn’t much to check, and we can simply merge:

df = merge(df, ra)
df

  preference age height vowel anon
1          a  42    133     v    1
2          a  41    170     v    1
3          a   0    198     v    1
4          A  11    179     v    2
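Because pr is already sorted, order(pr) simply returns 1, 2, 3 and so on, so the IDs follow the sort order of the original names. If that ordering should not be visible in the pseudonyms, one option is to shuffle the IDs. This is a sketch, not part of the approach above; the short pr vector and the seed are arbitrary choices for illustration.

```r
set.seed(1)  # arbitrary seed, only so the sketch is repeatable

pr <- c("a", "A", "B", "C")   # stand-in for sort(unique(df$preference))
an <- sample(seq_along(pr))   # the numbers 1 to length(pr) in random order
ra <- data.frame(preference = pr, anon = an)
ra
```

Each name still gets exactly one ID, so the merge works the same way as before.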

There we go, merge() can be used to recode/recategorize in R in a flexible way. In the end, use whatever is transparent to you!
