Guess the Correlation!

Here’s a fantastic way to kill a few minutes and better understand correlation coefficients: http://guessthecorrelation.com/

You’re shown a simple scatter plot and enter the correlation coefficient you guess to be associated with it. If you’re close enough, you get coins; if you’re too far off, you lose a heart. There’s even a two-player mode. Basic gaming stuff, but you also build an intuition of what those correlation coefficients we’re throwing around all the time actually mean.
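If you'd rather practise offline, something similar can be sketched in a few lines of R. This is only my rough imitation, not how the game generates its plots: simulate_pair() is a hypothetical helper, and the sample size and the four target correlations are arbitrary choices.

```r
# Hypothetical helper: simulate n points whose population correlation is rho
simulate_pair <- function(n, rho) {
  x <- rnorm(n)
  # Mix x with independent noise so that cor(x, y) equals rho in expectation
  y <- rho * x + sqrt(1 - rho^2) * rnorm(n)
  data.frame(x = x, y = y)
}

set.seed(42)
rhos <- c(0.1, 0.4, 0.7, 0.95)   # arbitrary target correlations

par(mfrow = c(2, 2))
for (rho in rhos) {
  d <- simulate_pair(100, rho)
  # The title shows the sample correlation, which varies around the target
  plot(d$x, d$y, xlab = "x", ylab = "y",
       main = paste("sample r =", round(cor(d$x, d$y), 2)))
}
par(mfrow = c(1, 1))
```

Hide the titles (main = "") and you have a home-made guessing round.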

There’s more, though. The game also serves (another) serious purpose: Omar Wagih is collecting the data to analyse how we mortals perceive correlations in scatter plots.

Visualize correlations in R

There are few cases where a graphic is not better than a table of numbers at helping us understand our quantitative results. A simple yet common table we find ourselves staring at is the table of correlation coefficients: how strongly do different variables correlate with one another? We scan such tables for numbers close to +1 and close to -1, but there’s a better way: visualize!

The R package corrplot offers a ready-made solution:

library(corrplot)
dat=matrix(c(0.11128257, -0.38968561, 0.11765272, -0.07089879, -0.19715366, -0.48083950, 0.54760745, -0.49410370, -0.42443391), nrow=3)
corrplot(dat)

Here we load the corrplot package and create some data so that we have something to plot; normally this would be the correlations among a selection of variables. Then we simply call corrplot() and we’re done.
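With real data, the matrix would typically come from cor(). Here is a minimal sketch using the built-in mtcars data; the four variables are an arbitrary pick for illustration, and the plot is only drawn if corrplot happens to be installed.

```r
# Pick a few numeric variables from the built-in mtcars data (illustrative choice)
vars <- mtcars[, c("mpg", "hp", "wt", "qsec")]

# cor() returns a symmetric matrix with 1s on the diagonal
cmat <- cor(vars)
print(round(cmat, 2))

# Draw the plot only if the corrplot package is available
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(cmat)
}
```

Unlike the made-up numbers above, this matrix is symmetric, so the upper and lower triangles of the plot mirror each other.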

There are many ways to tweak the plots, but in all versions we get a quicker and better overview of the variables that correlate than staring at a large table.

Here are some variants of the above:

par(mfrow=c(2,2))
corrplot(dat, method = "shade")
corrplot(dat, diag=FALSE)
corrplot(dat, method = "square")
corrplot(dat, method = "number")

Correlation Graphics in R

Correlations are some of the basics in quantitative analysis, and they are well suited for graphical examination. Using plots we can see whether it is justified to assume a linear relationship between the variables, for example. Scatter plots are our friends here, and with two variables it is as simple as calling plot() in R:

plot(var1, var2)

If we have more than two variables, it can be useful to plot a scatter plot matrix: multiple scatter plots in one go. The pairs() command is built in, but in my view not the most useful one out there. Here we use cbind() to combine a few variables, and specify that we don’t want to see the same scatter plots (rotated) in the upper panel.

pairs(cbind(var1, var2, var3, var4) , upper.panel=NULL)
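Instead of leaving the upper panel empty, we can also use it to print the correlation coefficients. What follows is a sketch along the lines of the examples in the pairs() help page: the four variables are simulated purely for illustration, and panel_cor() is a made-up name.

```r
# Simulated stand-ins for var1 to var4 (assumptions, only for illustration)
set.seed(1)
var1 <- rnorm(100)
var2 <- var1 + rnorm(100)          # positively related to var1
var3 <- -0.5 * var1 + rnorm(100)   # negatively related to var1
var4 <- rnorm(100)                 # unrelated noise

# Panel function that prints the correlation coefficient instead of points
panel_cor <- function(x, y, ...) {
  old <- par(usr = c(0, 1, 0, 1))  # use unit coordinates inside the panel
  on.exit(par(old))                # restore the previous coordinates afterwards
  text(0.5, 0.5, format(cor(x, y), digits = 2))
}

# Scatter plots in the lower panel, coefficients in the upper panel
pairs(cbind(var1, var2, var3, var4), upper.panel = panel_cor)
```

This way each pair of variables shows up once as a scatter plot and once as a number.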

A more flexible method is provided by scatterplotMatrix() in the car package. If this is not flexible enough, we can always split the plot and draw whatever we need, but that’s not for today.

library(car)
scatterplotMatrix(cbind(var1, var2, var3, var4))

If we have many more variables, it’s necessary to draw multiple plots to be able to see what is going on. However, sometimes after having checked that the associations are more or less linear, we’re simply interested in the strength and direction of the correlations for many combinations of variables. I guess the classic approach is staring at a large table of correlation coefficients, but as is often the case, graphics can make your life easier, in this case library(corrplot):

library(corrplot)
# corrplot() expects a correlation matrix, so compute one with cor() first
corrplot(cor(object_with_many_variables), method="circle", type="lower", diag=FALSE)

This is certainly more pleasant than staring at a table…

For all these commands, R offers plenty of ways to tweak the output.

Here’s how to get larger correlation coefficients

Here’s a link to a blog post by Andrew Gelman that deserves to be read more widely. It reports on a paper in a field remote from what I’m doing, but the issue is about correlation coefficients, a staple in much of what we do. Apparently the authors of the paper must have thought that a correlation coefficient of 0.02 is not enough to get published, and resorted to binning the data. Binning data in itself is not a bad thing; it can be quite useful for graphs, for instance. However, they then calculated and reported the correlation of the binned data. Not so miraculously, the correlation coefficient increases: binning averages out the unexplained variance.
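The effect is easy to reproduce with simulated data. This is a sketch under my own assumptions, not a reconstruction of the paper's analysis: the sample size, the 20 equally sized bins and the use of bin means are all arbitrary choices.

```r
set.seed(123)
n <- 100000
x <- rnorm(n)
y <- 0.02 * x + rnorm(n)   # population correlation is roughly 0.02

# Correlation of the raw data
r_raw <- cor(x, y)

# Split x into 20 equally sized bins (by quantiles of x)
bins <- cut(x, breaks = quantile(x, probs = seq(0, 1, length.out = 21)),
            include.lowest = TRUE)

# Average x and y within each bin; this removes most of the noise in y
x_binned <- as.numeric(tapply(x, bins, mean))
y_binned <- as.numeric(tapply(y, bins, mean))

# Correlation of the 20 bin means
r_binned <- cor(x_binned, y_binned)

cat("raw r:   ", round(r_raw, 3), "\n")
cat("binned r:", round(r_binned, 3), "\n")
```

The relationship between x and y is exactly the same in both calculations; only the noise has been averaged away, so the binned coefficient says little about how well x predicts y for an individual observation.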

Maybe I really should change the framing of my statistics course to focus on how to lie and cheat with statistics: I guess the students will learn just as much about good statistics this way.