Correlation Graphics in R

Correlations are some of the basics in quantitative analysis, and they are well suited for graphical examination. Using plots we can see whether it is justified to assume a linear relationship between the variables, for example. Scatter plots are our friends here, and with two variables it is as simple as calling plot() in R:

plot(var1, var2)
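To try this without your own data, here is a minimal sketch with simulated values. The numbers and the strength of the relationship are assumptions made up for illustration; var1 and var2 stand in for whatever you want to compare. The smoothed curve is one way to judge whether a straight line is a reasonable description.

```r
# Simulated data; var1 and var2 are placeholders for your own variables
set.seed(1)
var1 <- rnorm(200)
var2 <- 0.5 * var1 + rnorm(200)

plot(var1, var2)
# Smoothed curve: if it is roughly straight, a linear relationship looks plausible
lines(lowess(var1, var2), col = "red")
# Least-squares line for comparison
abline(lm(var2 ~ var1), lty = 2)
```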

If we have more than two variables, it can be useful to plot a scatter plot matrix: multiple scatter plots in one go. The pairs() command is built in, but in my view not the most useful one out there. Here we use cbind() to combine a few variables, and specify that we don’t want to see the same scatter plots (rotated) in the upper panel.

pairs(cbind(var1, var2, var3, var4), upper.panel=NULL)

A more flexible method is provided by scatterplotMatrix() in the car package. If this is not flexible enough, we can always split the plot and draw whatever we need, but that’s not for today.

library(car)
scatterplotMatrix(cbind(var1, var2, var3, var4))

If we have many more variables, it’s necessary to draw multiple plots to be able to see what is going on. However, sometimes after having checked that the associations are more or less linear, we’re simply interested in the strength and direction of the correlations for many combinations of variables. I guess the classic approach is staring at a large table of correlation coefficients, but as is often the case, graphics can make your life easier, in this case library(corrplot):

library(corrplot)
# corrplot() expects a correlation matrix, so compute it with cor() first
corrplot(cor(object_with_many_variables), method="circle", type="lower", diag=FALSE)

This is certainly more pleasant than staring at a table…
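For a version that runs as is, here is a sketch using the mtcars data that ships with R. The choice of data set and columns is mine, purely for illustration. It computes the correlation matrix with cor(), prints the table for comparison, and only draws the plot if corrplot happens to be installed.

```r
# mtcars ships with R; it stands in for object_with_many_variables
vars <- mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")]

# Correlation matrix: symmetric, with ones on the diagonal
cor_matrix <- cor(vars)

# The classic table, rounded for readability
print(round(cor_matrix, 2))

# Draw the plot only if the corrplot package is installed
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(cor_matrix, method = "circle", type = "lower", diag = FALSE)
}
```

If your data contain missing values, cor() has a use argument (for example use = "pairwise.complete.obs") that is worth a look.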

For all these commands, R offers plenty of ways to tweak the output.
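As one example of such tweaking, pairs() accepts custom panel functions. The sketch below is adapted from the examples in the pairs() help page: panel.cor is a small helper defined here (it is not part of base R), and mtcars is again just a stand-in data set. It puts the correlation coefficients in the upper panel instead of leaving it empty.

```r
# Helper that prints the correlation coefficient in the middle of a panel
panel.cor <- function(x, y, ...) {
  usr <- par("usr")
  on.exit(par(usr = usr))        # restore the coordinate system afterwards
  par(usr = c(0, 1, 0, 1))       # use simple 0-1 coordinates for the text
  text(0.5, 0.5, format(cor(x, y), digits = 2))
}

# Scatter plots with a smoother below the diagonal, coefficients above it
pairs(mtcars[, c("mpg", "disp", "hp", "wt")],
      lower.panel = panel.smooth,
      upper.panel = panel.cor)
```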

Here’s how to get larger correlation coefficients

Here’s a link to a blog post by Andrew Gelman that deserves to be read more widely. It reports on a paper in a field remote from what I’m doing, but the issue is about correlation coefficients, a staple in much of what we do. Apparently the authors of the paper must have thought that a correlation coefficient of 0.02 is not enough to get published, and resorted to binning the data. Binning data in itself is not a bad thing; it can be quite useful for graphs, for instance. However, they then calculated and reported the correlation of the binned data. Not so miraculously, the correlation coefficient increases: averaging within bins removes most of the unexplained variance.
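The effect is easy to reproduce. What follows is a small simulation sketch, not the analysis from the paper: the sample size, the slope of 0.02 and the choice of ten equal-sized bins are all my assumptions. It compares the correlation in the raw data with the correlation of the bin means.

```r
set.seed(42)
n <- 100000
x <- rnorm(n)
y <- 0.02 * x + rnorm(n)   # weak relationship: true correlation of about 0.02

# Correlation on the raw data
r_raw <- cor(x, y)

# Bin x into 10 equal-sized groups (deciles)
breaks <- quantile(x, probs = seq(0, 1, by = 0.1))
bins <- cut(x, breaks = breaks, include.lowest = TRUE)

# Average x and y within each bin; this averages away the noise in y
x_binned <- as.vector(tapply(x, bins, mean))
y_binned <- as.vector(tapply(y, bins, mean))

# Correlation of the 10 bin means
r_binned <- cor(x_binned, y_binned)

print(c(raw = r_raw, binned = r_binned))
```

The relationship between x and y has not changed at all; only the noise has been averaged out, so the binned coefficient says very little about how well x predicts y for an individual observation.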

Maybe I really should change the framing of my statistics course to focus on how to lie and cheat with statistics: I guess the students will learn just as much about good statistics this way.