Correlations Graphics in R

Correlations are some of the basics in quantitative analysis, and they are well suited for graphical examination. Using plots we can see whether it is justified to assume a linear relationship between the variables, for example. Scatter plots are our friends here, and with two variables it is as simple as calling plot() in R:

plot(var1, var2)

If we have more than two variables, it can be useful to plot a scatter plot matrix: multiple scatter plots in one go. The pairs() command is built in, but in my view not the most useful one out there. Here we use cbind() to combine a few variables, and specify that we don’t want to see the same scatter plots (rotated) in the upper panel.

pairs(cbind(var1, var2, var3, var4) , upper.panel=NULL)

A more flexible method is provided in library(car) with the scatterplotMatrix(). If this is not flexible enough, we can always split the plot and draw whatever we need, but that’s not for today.

library(car)scatterplotMatrix(cbind(var1, var2, var3, var4))

If we have many more variables, it’s necessary to draw multiple plots to be able to see what is going on. However, sometimes after having checked that the associations are more or less linear, we’re simply interested in the strength and direction of the correlations for many combinations of variables. I guess the classic approach is staring at a large table of correlation coefficients, but as is often the case, graphics can make your life easier, in this case library(corrplot):

library(corrplot)
corrplot(object_with_many_variables, method="circle", type="lower", diag=FALSE)

This is certainly more pleasant than staring at a table…

For all these commands, R offers plenty of ways to tweak the output.

Wasn’t integrity one of Edward Tufte’s principles?

Edward Tufte has a couple of principles for good graphs, among which integrity is probably the most important one. Here’s a list from an older article in the Washington Post that does not deserve to be forgotten:

https://www.washingtonpost.com/news/wonk/wp/2013/05/24/these-31-charts-will-destroy-your-faith-in-humanity/

While the article is about news spin, the examples are useful to think about integrity. Yes, indeed “More 77-year-olds are dying than ever before”, but what about the bigger picture?

The Magic of (Kernel Density) Plots

Today I was looking at some data I gathered on how different groups react to a certain stimulus. A classic case for the aggregate function in R:

aggregate(reaction_var, by=list(group=group_var), median, na.rm=TRUE)

I looked at the mean, median, and interpolated medians, but it was hard to make out whether there were real differences between the groups. That’s the moment I do what I tell my students to do all the time: graph, plot, … (and wonder why this time I thought I wouldn’t have to plot everything anyway)

Here’s the magic of the kernel densities that helped me see what’s going on.

plot(density(reaction_var, na.rm=TRUE, bw=8), main="", lty=2, ylim=c(0, 0.032), xlim=c(0,100), bty="n")
lines(density(reaction_var[group_var==1], na.rm=TRUE, bw=8), col="blue")
lines(density(reaction_var[group_var==0], na.rm=TRUE, bw=8), col="red")
abline(v=50, lty=3)

Here I only look at one particular stimulus, and first plot the kernel density for everyone (no square brackets). I chose a dashed line so that the aggregate is less dominant in the plot (lty=2), after all I’m interested in the group differences (if there are any). I also set the ylim in a second round, because the kernel densities for the red group would otherwise be cut off. I also set the xlim, because the range of my variable is only 0 to 100. Because of the bandwidth, kernel density plots never quite end at logical end points. I also set the bandwidth of the kernel density to 8 so that it is exactly the same across the groups. The last argument (bty) gets rid of the box R puts around the plot by default.

I then add the kernel densities for the two groups of interest (square brackets to identity the group of question) with a particular colour. Finally I added the median value for reference.

kernel.magic

Well, what is going on? All three lines (combined, and each group separately) have roughly the same median value. The mean is lower for the blue group, but the interpolated median values are almost exactly the same as the median values. Difference or no real difference? I know that the old “textbook” rule that the difference between the median and mean indicates skew often fails, so definitely a case for plotting. And we can see that the central tendency is not telling us much in this case, it’s mostly about the tail.

von Hippel, P. “Mean, Median, and Skew: Correcting a Textbook Rule.” Journal of Statistics Education 13, no. 2 (2005).

Getting Rid of Page Numbers in split.screen

This is something that nags me from time to time, particularly because it’s a documented feature I tend to forget. When using the split.screen() function of R to combine plots (rather than par(mfrow)) in Sweave or Knitr, the screen numbers are output in the main text, alongside the figure. Usually we want the figure only. All we need is adding results='hide'. I find this slightly counter-intuitive, because I do want the result (in form of a figure). That said, I don’t have a suggestion what alternative would feel more intuitive. The same issue also happens when using library(rgdal) for a local shapefile, and other similar situations.

Plotting Gradients in R

fadeThere might be an easier way to do this, but here’s one way to plot gradients in R. It draws on colorRampPalette() and a for() loop, and isn’t very fast on underpowered machines — but it works. Using colorRampPalette() we can create the necessary gradients. Here’s the code I cobbled together:

fade <- function(M=1, S=1, Y=1, H=0.05, K=50, dk = "black") {
# M = midpoint; S = spread; Y = position; H = height; K = steps in gradient; dk = dark colour
colfunc <- colorRampPalette(c(dk, "white")) # creates a function to produce the gradients
D <- S/K # delta; how wide does one rectangle have to be?
collist <- colfunc(K) # create K colours
for(i in 0:(K-1)) { # draw rectangles; K-1 because I start with 0 (this makes it easier in the line just below)
rect(M+(D*i), Y-H, M+D+(D*i), Y+H, col=collist[i+1], border=NA) # drawing a narrow rectangle, no borders drawn; to right
rect(M-(D*i), Y-H, M-D-(D*i), Y+H, col=collist[i+1], border=NA) # to left
}
}

Before applying the above code, we need an empty plot:

plot(1, ylim=c(0.5,8), xlim=c(1,8), type="n", axes=F, xlab="", ylab="")

Important are the ylim and xlim arguments, and the type="n" to plot nothing. I usually prefer drawing my own axes — axis(1), axis(2) as this allows easy customization…

Why not highlight the midpoint? We can this as follows:

text(M, Y, "|", col="white", font=4) # font = 4 for bold face

In many cases, the package denstrip may offer an easier solution, albeit one where we have less control.

P.S. The code produces just one of the lines in the plot included at the top.