How to add text labels to a scatter plot in R?

Adding text labels to a scatter plot in R is easy. The basic function is text(), and here’s a reproducible example how you can use it to create these plots:

Adding text to a scatter plot in R

For the example, I’m creating random data. Since the data are random, your plots will look different. In this fictitious example, I look at the relationship between a policy indicator and performance. It is conventional to put the outcome variable on the Y axis and the predictor on the X axis, but in this example there’s no relationship to reality anyway… The reason I chose min and max values for the random variables here is that I jotted down this code as an explanation for a replication. In this example, we have 25 observations, for 25 units I call “cantons”. The third line here creates a string of characters “A” to “Y”, these are the labels!

policy = runif(25, min=0.4, max=0.7)
perfor = runif(25, min=500, max=570)
canton = sapply(65:89, function(x) rawToChar(as.raw(x)))

For the scatter plot on the left, we use plot(). Then we add the trend line with abline() and lm(). To add the labels, we have text(), the first argument gives the X value of each point, the second argument the Y value (so R knows where to place the text) and the third argument is the corresponding label. The argument pos=1 is there to tell R to draw the label underneath the point; with pos=2 (etc.) we can change that position.

plot(policy ~ perfor, bty="n", ylab="Policy Indicator", xlab="Performance", main="Policy and Performance")
abline(lm(policy ~ perfor), col="red")
text(perfor, policy, canton, pos=1)

The scatter plot on the right is similar, but here we actually plot the labels instead of the dots. There are two differences in the code: First, we add type="n" to create the scatter plot without actually drawing any circles (an empty plot if you will). Second, when we add the text in the third line of the code, we do not have pos=1, because we want to place the labels exactly where the points are.

plot(policy ~ perfor, bty="n", type="n", ylab="Policy Indicator", xlab="Performance", main="Policy and Performance")
abline(lm(policy ~ perfor), col="red")
text(perfor, policy, canton)

Calculating VIF by hand

A widespread measure of multicollinearity is the VIF (short for variance inflation factor). Multicollinearity describes the situation when the predictor variables in a multiple regression model are highly correlated, which is usually not desirable (assuming you haven’t gone Bayesian yet).

In R, the VIF can easily be calculated with a function in library car. It’s actually not difficult to do it by hand — which incidentally helps understand what we measure with the VIF, or why there is no different VIF for logistic regression models, or why the VIF is better than looking at bivariate correlations between predictors.

We start with some random data to run the multiple regression model. Here we create one outcome (y) and three predictor variables (x, z, a), full of random numbers. That’ll do for a demonstration.

x = runif(50) 
y = runif(50)
z = runif(50)
a = runif(50)

Here’s a simple OLS model:


m = lm(y ~ x + z + a)

If you have library car installed, you can easily calculate the VIF:

library(car)
vif(m)

To do it by hand, though, we run a linear regression model (OLS) for each of the predictors. Here’s the code for predictor x. One of the predictors becomes the outcome variable (here x), and the other predictors remain predictors. The variable used as the outcome previously (y) does not appear here.

mx = lm(x ~ z + a)

The VIF is simply: 1/(1-R²) of this model. In R, we can run the following:

1/(1-summary(mx)$r.squared)

Updated PLZ – Cantons Tools for R

Thanks to Eva Van Belle pointing out issues with Appenzell postcodes, I’m happy to announce an update to the postcode to cantons conversion script for R. It’s essentially a database with Swiss postcodes (PLZ) and what canton they are in. For 16 postcodes only a probabilistic assignment is possible, and this is handled by siding with the (typically much) larger municipality.

Convert Swiss postcodes to cantons: https://gist.github.com/druedin/6690720

Two simple helper functions to go with: https://gist.github.com/druedin/8758265

Custom Tables of Descriptive Statistics in R

Here’s how we can quite easily and flexibly create tables of descriptive statistics in R. Of course, we can simply use summary(variable_name), but this is not what you’d include in a manuscript — so not what you want when compiling a document in knitr/Rmarkdown.

First, we identify the variables we want to summarize. Often our database includes many more variables:

vars <- c("variable_1", "variable_2", "variable_3")

Note that these are the variable names in quotes. Second, we use lapply() to calculate whatever summary statistic we want. This is where flexibility kicks in: have you ever tried to include an interpolated median in such a table, just as easy as the mean in R. Here’s an example with the mean, minimum, maximum, and median:

v_mean <- lapply(dataset[vars], mean, na.rm=TRUE)
v_min <- lapply(dataset[vars], min, na.rm=TRUE)
v_max <- lapply(dataset[vars], max, na.rm=TRUE)
v_med <- lapply(dataset[vars], median, na.rm=TRUE)

Too many digits? We can use round() to get rid of them. There’s actually an argument ‘digits’ in the kable() command we’ll use in a minute that in principle allows rounding at the very end, but unfortunately it often fails on me. Rounding:

v_mean <- round(as.numeric(v_mean), 2)

Now we only need to bring the different summary statistics together:

v_tab <- cbind(mean=v_mean, min=v_min, max=v_max, median=v_med)

And add useful variable labels:

rownames(v_tab) <- c("Variable 1", "A description of variable 2", "Variable 3")

and we use kable() to generate a decent table:

kable(v_tab)

If this looks complicated, bear in mind that with no additional work you can change the order of the variables and include any summary statistics. That’s table A1 in the appendix sorted.

Set Encoding When Importing CSV Data in R Under Windows

This is another simple thing I keep on looking up: how to set the encoding when importing CSV data in R under Windows. I need this when my data file is in UTF-8 (pretty standard these days), but I’m using R under Windows; or when I have a Windows-encoded file when using R elsewhere. The default encoding in Windows is not UTF-8, and R uses the default encoding — well, by default. Typically this is not an issue unless my data file contains accented characters in strings, which can lead to garbled text when the wrong encoding is set/assumed.

The solution is quite simple: add encoding="" to the read.csv() command, like this:

x <- read.csv("datafile.csv", encoding="Windows-1252")

or like this:

x <- read.csv("datafile.csv", encoding="UTF-8")