How to add text labels to a scatter plot in R?

Adding text labels to a scatter plot in R is easy. The basic function is text(), and here’s a reproducible example of how to use it to create these plots:

[Figure: Adding text to a scatter plot in R. Left: points with labels underneath. Right: labels in place of the points.]

For the example, I’m creating random data, so your plots will look different. In this fictitious example, I look at the relationship between a policy indicator and performance. It is conventional to put the outcome variable on the Y axis and the predictor on the X axis, but in this example there’s no relationship to reality anyway… I chose min and max values for the random variables because I jotted down this code as an explanation for a replication. We have 25 observations, for 25 units I call “cantons”. The third line of the code below creates a character vector with the letters “A” to “Y”: these are the labels!

policy = runif(25, min=0.4, max=0.7)
perfor = runif(25, min=500, max=570)
canton = sapply(65:89, function(x) rawToChar(as.raw(x)))
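As a side note, base R ships the built-in constant LETTERS, so the same labels can be had without going through raw bytes. A minimal sketch (canton_alt is just an illustrative name, not part of the original code):

```r
# The original construction: ASCII codes 65 to 89 converted to characters
canton <- sapply(65:89, function(x) rawToChar(as.raw(x)))

# Alternative using the built-in LETTERS constant (first 25 letters, "A" to "Y")
canton_alt <- LETTERS[1:25]

# Both approaches should give the same character vector
print(identical(canton, canton_alt))
```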

For the scatter plot on the left, we use plot(). Then we add the trend line with abline() and lm(). To add the labels, we use text(): the first argument gives the X value of each point, the second argument the Y value (so R knows where to place the text), and the third argument is the corresponding label. The argument pos=1 tells R to draw the label underneath the point; pos=2, pos=3, and pos=4 place it to the left of, above, and to the right of the point.

plot(policy ~ perfor, bty="n", ylab="Policy Indicator", xlab="Performance", main="Policy and Performance")
abline(lm(policy ~ perfor), col="red")
text(perfor, policy, canton, pos=1)
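To see what the different values of pos do, here is a small sketch that draws the same scatter plot four times, once for each position. The seed, the 2x2 layout, the smaller text size (cex=0.8), and the name pos_names are my additions for illustration.

```r
# Same kind of random data as above; the seed only makes the sketch reproducible
set.seed(42)
policy <- runif(25, min = 0.4, max = 0.7)
perfor <- runif(25, min = 500, max = 570)
canton <- LETTERS[1:25]

# pos: 1 = below, 2 = left of, 3 = above, 4 = right of the point
pos_names <- c("below", "left", "above", "right")

old_par <- par(mfrow = c(2, 2))  # four panels on one page
for (p in 1:4) {
  plot(policy ~ perfor, bty = "n", ylab = "Policy Indicator", xlab = "Performance",
       main = paste0("pos=", p, " (", pos_names[p], ")"))
  text(perfor, policy, canton, pos = p, cex = 0.8)
}
par(old_par)  # restore the previous graphics settings
```

If the labels sit too close to the points, the offset argument of text() controls the distance when pos is used.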

The scatter plot on the right is similar, but here we actually plot the labels instead of the dots. There are two differences in the code: First, we add type="n" to create the scatter plot without actually drawing any circles (an empty plot if you will). Second, when we add the text in the third line of the code, we do not have pos=1, because we want to place the labels exactly where the points are.

plot(policy ~ perfor, bty="n", type="n", ylab="Policy Indicator", xlab="Performance", main="Policy and Performance")
abline(lm(policy ~ perfor), col="red")
text(perfor, policy, canton)

Calculating VIF by hand

A widespread measure of multicollinearity is the VIF (short for variance inflation factor). Multicollinearity describes the situation when the predictor variables in a multiple regression model are highly correlated, which is usually not desirable (assuming you haven’t gone Bayesian yet).

In R, the VIF can easily be calculated with the vif() function in the car package. It’s actually not difficult to do it by hand, which incidentally helps understand what we measure with the VIF, or why there is no different VIF for logistic regression models, or why the VIF is better than looking at bivariate correlations between predictors.
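To illustrate the last point, here is a constructed sketch (not real data) in which one predictor is almost exactly the sum of three others. It uses the 1/(1-R²) definition that the walk-through below arrives at. All variable names and the noise level are assumptions chosen for the demonstration.

```r
set.seed(2021)
n <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
# x4 is nearly a linear combination of the other three predictors
x4 <- x1 + x2 + x3 + rnorm(n, sd = 0.1)

# Bivariate correlations: none of them looks alarming on its own
r <- cor(cbind(x1, x2, x3, x4))
print(round(r, 2))
max_pairwise <- max(abs(r[upper.tri(r)]))

# The VIF of x4 uses all other predictors at once and flags the problem
vif_x4 <- 1 / (1 - summary(lm(x4 ~ x1 + x2 + x3))$r.squared)
print(vif_x4)
```

The pairwise correlations stay moderate because each single predictor explains only part of x4, whereas the three together explain nearly all of it.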

We start with some random data to run the multiple regression model. Here we create one outcome (y) and three predictor variables (x, z, a), full of random numbers. That’ll do for a demonstration.

x = runif(50)
y = runif(50)
z = runif(50)
a = runif(50)

Here’s a simple OLS model:


m = lm(y ~ x + z + a)

If you have the car package installed, you can easily calculate the VIF:

library(car)
vif(m)

To do it by hand, though, we run a linear regression model (OLS) for each of the predictors. Here’s the code for predictor x. One of the predictors becomes the outcome variable (here x), and the other predictors remain predictors. The variable used as the outcome previously (y) does not appear here.

mx = lm(x ~ z + a)

The VIF is simply 1/(1-R²), where R² is taken from this model. In R, we can run the following:

1/(1-summary(mx)$r.squared)
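The same step can be repeated for every predictor in a loop. The following is a sketch under my own naming (d, vif_by_hand, vif_check are illustrative); it puts the predictors in a data frame and builds each auxiliary regression with reformulate(). As a cross-check that needs no extra package, it uses the fact that the VIFs equal the diagonal of the inverse of the predictors’ correlation matrix.

```r
set.seed(123)
# Predictors only: the outcome y plays no role in the VIF
d <- data.frame(x = runif(50), z = runif(50), a = runif(50))

# For each predictor, regress it on all the others and apply 1/(1-R^2)
vif_by_hand <- sapply(names(d), function(v) {
  others <- setdiff(names(d), v)
  aux <- lm(reformulate(others, response = v), data = d)
  1 / (1 - summary(aux)$r.squared)
})
print(vif_by_hand)

# Cross-check: diagonal of the inverse correlation matrix of the predictors
vif_check <- diag(solve(cor(d)))
print(all.equal(unname(vif_by_hand), unname(vif_check)))
```

Because y never enters the calculation, the result is the same whatever model is fitted to the outcome afterwards.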

Getting started with Bayesian in R


There really is no excuse any more: getting started with Bayesian regression analysis in R is really simple.

Step 1: install rstanarm from CRAN

Step 2: replace lm() with stan_glm() in your code

Sure, you’ll probably want to learn about priors, and invest a little in understanding diagnostics such as those provided by ShinyStan. But rstanarm is really designed to work well out of the box (i.e. with your existing code).
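As a sketch of what step 2 can look like, here is an OLS model next to its rstanarm counterpart, using random data of the kind from the VIF example. The object names (d, m_ols, m_bayes) and the seed are my choices for illustration; the Bayesian part only runs if rstanarm is installed, and refresh=0 merely silences the sampling progress output.

```r
# install.packages("rstanarm")  # Step 1, only needed once

set.seed(1)
d <- data.frame(y = runif(50), x = runif(50), z = runif(50), a = runif(50))

# The familiar OLS model
m_ols <- lm(y ~ x + z + a, data = d)
print(coef(m_ols))

# Step 2: same formula and data, with stan_glm() instead of lm()
if (requireNamespace("rstanarm", quietly = TRUE)) {
  m_bayes <- rstanarm::stan_glm(y ~ x + z + a, data = d, seed = 1, refresh = 0)
  print(summary(m_bayes))
}
```

This relies on the default priors; with pure random data the coefficients should hover around zero in both models.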

What I really appreciate is that it has useful warnings and error messages, and extensive documentation. Sometimes the documentation shows that quantitative analysis has something to do with mathematics, but even those who skip the Greek letters and formulae will get enough guidance. You’ll get nudges to use your own priors rather than rely on the default priors, but in my experience the default priors work reasonably well for most simple applications. You’ll also get suggestions right on your screen about what to do when there are, say, divergent transitions.

Once you can handle rstanarm, you’ll find it easy to upgrade to brms, where you can still use your trusted syntax for regression models in base R.

Hiring: Postdoctoral Researcher

We have an open position for a

Post-Doctoral Researcher (30 months, 80% FTE)

The start date will be 1 September 2021 or as agreed. The successful applicant is expected to contribute to a research project on the long-term impact of refugee shocks on the labour market, health, reproductive behaviour, well-being, and attitudinal outcomes of the resident population (quasi-experimental setup).

Requirements: You have completed a doctorate in one of the social sciences (preferably economics, sociology, or political science). Excellent knowledge of quantitative methods is required (preferably Stata or R). The project uses register data, as well as data from the Labour Force Survey, the Swiss Health Survey, post-election surveys, and results from selected referendums and popular initiatives. You are open to collaborating in an inter-disciplinary team. Experience in the analysis of register data, matching datasets, experimental methods and a keen interest in immigration, health, or labour market outcomes are an asset. Excellent written and oral command of English is required; knowledge of French or German is an asset.

You will be attached to the Swiss Forum for Migration and Population Studies at the University of Neuchâtel (http://www.unine.ch/sfm/) and will join a team in economics, sociology, and demography. An affiliation with the national centre of excellence NCCR on the move (https://nccr-onthemove.ch/) is possible and will open up exchange with other postdocs and researchers across the country.

Benefits: The salary is in accordance with the university guidelines (https://www.unine.ch/srh/post-doctorant-e-s-fns). There is a budget for conference participation, and we will support you in developing your own research agenda.

Employer: The position is based at the University of Neuchâtel. The University of Neuchâtel is an equal opportunities employer. Qualified women and candidates with a migration history are encouraged to apply.

Submitting application: Applications (letter of intent, CV, names of two referees, a relevant research paper as a writing sample) should be submitted as a single PDF to didier.ruedin@unine.ch (also for queries). The position is open until filled; for full consideration, apply by 15 June 2021.

Updated PLZ – Cantons Tools for R

Thanks to Eva Van Belle for pointing out issues with Appenzell postcodes, I’m happy to announce an update to the postcode to cantons conversion script for R. It’s essentially a database with Swiss postcodes (PLZ) and what canton they are in. For 16 postcodes only a probabilistic assignment is possible, and this is handled by siding with the (typically much) larger municipality.

Convert Swiss postcodes to cantons: https://gist.github.com/druedin/6690720

Two simple helper functions to go with: https://gist.github.com/druedin/8758265
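To give an idea of how such a lookup works in principle, here is a toy sketch with four postcodes. It is not the code from the gists above: the table plz_table and the function plz_to_canton() are hypothetical names made up for this illustration, and the real script covers all Swiss postcodes.

```r
# Hypothetical mini lookup table (Geneva, Neuchatel, Bern, Zurich)
plz_table <- data.frame(
  plz = c(1200L, 2000L, 3000L, 8001L),
  canton = c("GE", "NE", "BE", "ZH"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the canton for each postcode, NA if unknown
plz_to_canton <- function(plz, table = plz_table) {
  table$canton[match(as.integer(plz), table$plz)]
}

print(plz_to_canton(c(8001, 2000, 9999)))
```

Postcodes missing from the table come back as NA, which makes unmatched entries easy to spot.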