How to add text labels to a scatter plot in R?

Adding text labels to a scatter plot in R is easy. The basic function is text(), and here’s a reproducible example how you can use it to create these plots:

Adding text to a scatter plot in R

For the example, I’m creating random data. Since the data are random, your plots will look different. In this fictitious example, I look at the relationship between a policy indicator and performance. It is conventional to put the outcome variable on the Y axis and the predictor on the X axis, but in this example there’s no relationship to reality anyway… The reason I chose min and max values for the random variables here is that I jotted down this code as an explanation for a replication. In this example, we have 25 observations, for 25 units I call “cantons”. The third line here creates a string of characters “A” to “Y”, these are the labels!

policy = runif(25, min=0.4, max=0.7)
perfor = runif(25, min=500, max=570)
canton = sapply(65:89, function(x) rawToChar(as.raw(x)))

For the scatter plot on the left, we use plot(). Then we add the trend line with abline() and lm(). To add the labels, we have text(), the first argument gives the X value of each point, the second argument the Y value (so R knows where to place the text) and the third argument is the corresponding label. The argument pos=1 is there to tell R to draw the label underneath the point; with pos=2 (etc.) we can change that position.

plot(policy ~ perfor, bty="n", ylab="Policy Indicator", xlab="Performance", main="Policy and Performance")
abline(lm(policy ~ perfor), col="red")
text(perfor, policy, canton, pos=1)

The scatter plot on the right is similar, but here we actually plot the labels instead of the dots. There are two differences in the code: First, we add type="n" to create the scatter plot without actually drawing any circles (an empty plot if you will). Second, when we add the text in the third line of the code, we do not have pos=1, because we want to place the labels exactly where the points are.

plot(policy ~ perfor, bty="n", type="n", ylab="Policy Indicator", xlab="Performance", main="Policy and Performance")
abline(lm(policy ~ perfor), col="red")
text(perfor, policy, canton)

Replication as learning

As part of the course on applied statistics I’m teaching, my students have to try to replicate a published paper (or, typically, part of the analysis). It’s an excellent opportunity to apply what they have learned in the course, and probably the best way to teach researcher degrees of freedom and how much we should trust the results of a single study. It’s also an excellent reminder to do better than much of the published research in providing proper details of the analysis we undertake. Common problems include not describing the selection of cases (where not everyone remains in the sample), opaque recoding of variables, and variables that are not described. An interesting case is the difference between what the authors wanted to do (e.g. restrict the sample to voters) and what they apparently did (e.g. forge to do so). One day, I hope this exercise will become obsolete: the day my students can readily download replication code…

Image: CC-by-nd Tina Sherwood Imaging

Comment on Reproducibility

There’s a ‘technical’ comment on a recent paper that has stirred quite a debate: the reproducibility of psychological science is purportedly quite low. This paper argues that when the results of the original study are corrected for error, power, and bias, there is not much left to conclude that there is a reproducibility crisis. As always in Science, short and to the point. And there’s a response to the comment, too.

Gilbert, Daniel T., Gary King, Stephen Pettigrew, and Timothy D. Wilson. 2016. ‘Comment on “Estimating the Reproducibility of Psychological Science”’. Science 351 (6277): 1037–1037. doi:10.1126/science.aad7243.

MIPEX as a Measure of Citizenship Models: Small Update

I have just added an additional document to the replication material for MIPEX as a Measure of Citizenship Models. The paper in the SSQ uses MIPEX data up to 2010, but the MIPEX releases 2012+ use a slightly different question order because a few questions were added and removed. (It’s this updated version we’ve used for the time series of MIPEX/immigration policy in Switzerland 1848 to 2015.) With this, replicating my MIPEX-based measure of citizenship models was no longer straightforward with the more recent MIPEX releases. There’s one important point to consider, though: with the additional questions in the latest MIPEX data, it probably makes sense to include one or two additional (relevant) questions rather than slavishly following the items used in the SSQ paper.

Ruedin, Didier. 2015. “Increasing validity by recombining existing indices: MIPEX as a measure of citizenship models.” Social Science Quarterly 96(2): 629-638. doi:10.1111/ssqu.12162

Ruedin, Didier, Camilla Alberti, and Gianni D’Amato. 2015. “Immigration and integration policy in Switzerland, 1848 to 2014”, Swiss Political Science Review 21(1): 5-22. doi:10.1111/spsr.12144

Why Knitr Beats Sweave

No doubt Sweave is one of the pieces that makes R great. Sweave combines the benefits of R with those of LaTeX to enable reproducible research. Knitr is a more recent contribution by Yihui Xie, packing in the goodness of Sweave alongside cacheSweave, pgfSweave, RweaveHTML, HighlightWeaveLatex etc. It requires separate installation, interestingly also when using Rstudio.

As much as I like Sweave, I argue that often knitr is the better choice, despite there being no equivalent to R CMD Sweave --pdf. First of all, knitr uses Rmarkdown, a set of intuitive human-readable code to do the formatting. While LaTeX is by no means as complicated as its reputation seems to suggest, Rmarkdown is actually easy. By human-readable I mean that anyone who has never even heard of Rmarkdown can understand what is happening to some extent.

Sweave is great for producing PDF, but that’s one of the biggest drawbacks of LaTeX in the social sciences: while the PDF may look good, they are not the format we need when collaborating with Word-only colleagues, and with rare exceptions when submitting a manuscript to journals. Knitr works very well with Pandoc, so creating a Word document or an ODF is just as easy as creating a PDF. The other day I had to submit a supplementary file as a *.doc file, even though it’ll end up as a PDF on Dataverse or so. With knitr this didn’t take long.

What’s the catch then? Rmarkdown comes with a restricted set of commands, and there is no way to create custom commands. This isn’t a problem, though. For instance, if you create a PDF with knitr, you can include standard LaTeX code, like \newpage. More importantly, with a restricted set of commands, I find myself tinkering much less than what I do in LaTeX. In other words, with Rmarkdown and knitr, I do more of that purported benefit of LaTeX, namely concentrating on the contents rather than worrying about what it’ll end up looking. A more radical step would probably be writing in plain text and then finish it off in Word (or LibreOffice), because we seem to end up there anyway — at least at the submission stage.

There are two aspects where the restrictions of Rmarkdown are noticeable: references (roughly on par with Endnote, not with BibTeX), and complex tables. When it comes to complex tables, we should probably be thinking about graphs anyway. In this context, however, being human-readable highlights another advantage of knitr: if the document fails to compile, it’s much easier to debug (and here Sweave beats odfWeave by miles).

What neither approach resolves, however, is collaborating with the Word-only crowd who need the “track changes” feature.