Correlation Graphics in R

Correlations are one of the basics of quantitative analysis, and they are well suited to graphical examination. Plots let us check, for example, whether it is justified to assume a linear relationship between the variables. Scatter plots are our friends here, and with two variables it is as simple as calling plot() in R:

plot(var1, var2)

If we have more than two variables, it can be useful to plot a scatter plot matrix: multiple scatter plots in one go. The pairs() command is built in, but in my view not the most useful one out there. Here we use cbind() to combine a few variables, and specify that we don’t want to see the same scatter plots (rotated) in the upper panel.

pairs(cbind(var1, var2, var3, var4), upper.panel=NULL)

A more flexible alternative is scatterplotMatrix() from library(car). If this is not flexible enough, we can always split the plot and draw whatever we need, but that’s not for today.

library(car)
scatterplotMatrix(cbind(var1, var2, var3, var4))

If we have many more variables, it’s necessary to draw multiple plots to be able to see what is going on. However, sometimes, after having checked that the associations are more or less linear, we’re simply interested in the strength and direction of the correlations for many combinations of variables. I guess the classic approach is staring at a large table of correlation coefficients, but as is often the case, graphics can make your life easier, in this case library(corrplot). Since corrplot() visualizes a correlation matrix, we wrap the data in cor():

library(corrplot)
corrplot(cor(object_with_many_variables, use="complete.obs"), method="circle", type="lower", diag=FALSE)

This is certainly more pleasant than staring at a table…

For all these commands, R offers plenty of ways to tweak the output.
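
To give a flavour, here is a minimal sketch of such tweaks; the argument values are purely illustrative, not a recommendation:

pairs(cbind(var1, var2, var3, var4), upper.panel=NULL, pch=20, col="grey40") # smaller, grey points
corrplot(cor(object_with_many_variables, use="complete.obs"), method="circle", type="lower", diag=FALSE, tl.col="black", tl.srt=45) # black, rotated variable labels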

The Magic of (Kernel Density) Plots

Today I was looking at some data I gathered on how different groups react to a certain stimulus. A classic case for the aggregate function in R:

aggregate(reaction_var, by=list(group=group_var), median, na.rm=TRUE)

I looked at the mean, median, and interpolated medians, but it was hard to make out whether there were real differences between the groups. That’s the moment I did what I tell my students to do all the time: graph it, plot it (and wondered why I had thought that this time I could get away without plotting everything).
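
For the record, the same aggregate() call works for the mean by simply swapping in the function (the interpolated medians came from a separate helper not shown here):

aggregate(reaction_var, by=list(group=group_var), mean, na.rm=TRUE)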

Here’s the magic of the kernel densities that helped me see what’s going on.

plot(density(reaction_var, na.rm=TRUE, bw=8), main="", lty=2, ylim=c(0, 0.032), xlim=c(0,100), bty="n")
lines(density(reaction_var[group_var==1], na.rm=TRUE, bw=8), col="blue")
lines(density(reaction_var[group_var==0], na.rm=TRUE, bw=8), col="red")
abline(v=50, lty=3)

Here I only look at one particular stimulus, and first plot the kernel density for everyone (no square brackets). I chose a dashed line (lty=2) so that the aggregate is less dominant in the plot; after all, I’m interested in the group differences (if there are any). I set the ylim in a second round, because the kernel densities for the red group would otherwise be cut off, and the xlim because the range of my variable is only 0 to 100: because of the bandwidth, kernel density plots never quite end at logical end points. I fixed the bandwidth at 8 so that it is exactly the same across the groups, and the last argument (bty="n") gets rid of the box R puts around the plot by default.

I then add the kernel densities for the two groups of interest (the square brackets identify the group in question), each with its own colour. Finally I add the median value for reference.
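
If the plot goes to an audience, a small legend keeps the colours straight. A minimal sketch; the group labels here are placeholders:

legend("topright", legend=c("everyone", "group 1", "group 0"), col=c("black", "blue", "red"), lty=c(2, 1, 1), bty="n")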

[Figure: kernel densities of the reaction variable for the full sample (dashed) and the two groups (blue, red), with the median marked]

Well, what is going on? All three lines (combined, and each group separately) have roughly the same median value. The mean is lower for the blue group, but the interpolated median values are almost exactly the same as the median values. Difference or no real difference? I know that the old “textbook” rule that the difference between the median and mean indicates skew often fails (von Hippel 2005), so this is definitely a case for plotting. And we can see that the central tendency is not telling us much in this case: it’s mostly about the tail.

von Hippel, P. “Mean, Median, and Skew: Correcting a Textbook Rule.” Journal of Statistics Education 13, no. 2 (2005).

Predicted Probabilities in R

I was recently asked how to calculate predicted probabilities in R. Of course we could do this by hand from the coefficients, but it’s usually quicker to let R do the work. The command we need is predict(); here’s how to use it.

First we need to run a regression model. In this example, I predict whether a person voted in the previous election (binary dependent variable) with variables on education, income, and age. I use logistic regression:

m <- glm(voted ~ edu + income + age, family="binomial", data=voting)

Now comes the not so obvious part: we need to specify the cases we are interested in. We’re using with() and data.frame() to do so. In this case, I’m interested in the predicted probabilities of two people, both with average (mean) education and income, but one aged 20 and the other aged 60. In this step, we need to cover all the independent variables.

newdata <- with(voting, data.frame(age = c(20, 60), edu=mean(edu, na.rm=TRUE), income=mean(income, na.rm=TRUE)))

Finally we can get the predictions:

predict(m, newdata, type="response")

That’s our model m and newdata we’ve just specified. type="response" calculates the predicted probabilities. We get

        1         2 
0.3551121 0.6362611 

So 36% for the person aged 20, and 64% for the person aged 60.
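
For intuition, the first number can be reproduced by hand from the coefficients. A minimal sketch, assuming the model above and that edu and income are numeric variables:

b <- coef(m)
eta <- b["(Intercept)"] + b["edu"]*mean(voting$edu, na.rm=TRUE) + b["income"]*mean(voting$income, na.rm=TRUE) + b["age"]*20 # linear predictor for the 20-year-old
plogis(eta) # inverse logit; should match the first predicted probability (up to rounding)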

Often, however, a picture will be more useful. The logic is the same: we use the same model, and ask R to predict for every age from 18 to 90 (I guess you don’t want to do this by hand).

newdata2 <- with(voting, data.frame(age = 18:90, edu=mean(edu, na.rm=TRUE), income=mean(income, na.rm=TRUE)))

We once again use predict(), but this time also ask for standard errors.

preds <- predict(m, newdata2, type="response", se.fit=TRUE)

For the plot, I want the predicted probabilities +/- 1.96 standard errors (that’s the 95% confidence interval; use qnorm(0.975) if 1.96 is not precise enough). I extract and calculate the values on separate lines to make the code easier to follow.

predf <- preds$fit # predicted
lower <- preds$fit - (1.96*preds$se.fit) # lower bounds
upper <- preds$fit + (1.96*preds$se.fit) # upper bounds
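
One caveat: the bounds above are calculated on the probability scale and can stray outside [0, 1] near the extremes. If that matters, a common variant (a sketch, using the same model m and newdata2) computes the interval on the link scale and transforms it afterwards; it overwrites predf, lower, and upper, so the plotting code below works unchanged:

preds_link <- predict(m, newdata2, type="link", se.fit=TRUE)
predf <- plogis(preds_link$fit) # predicted probabilities
lower <- plogis(preds_link$fit - 1.96*preds_link$se.fit) # lower bounds
upper <- plogis(preds_link$fit + 1.96*preds_link$se.fit) # upper bounds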

Finally the plot: It’s a simple line plot of the predicted probabilities plotted against the age (18 to 90). I choose not to show the borders of the plot, and then use lines() twice to add the lower and upper bounds.

plot(18:90, predf, type="l", ylab="Predicted Probability to Vote", xlab="Age", bty="n")
lines(18:90, lower, lty=2)
lines(18:90, upper, lty=2)

[Figure: predicted probability to vote across ages 18 to 90, with dashed lines for the 95% confidence bounds]

Barplots across variables in R

Here’s a good example of how useful sapply() can be. I have some data from Qualtrics, and each response is coded in its own variable. Let’s say there is a question on what kind of organization respondents work in, with 10 response categories. Qualtrics produces 10 variables, each with 1 if the box was ticked, and empty otherwise (structure shown just below). With the default CSV import, these blank cells are turned into NA. Here’s a simple way to produce a barplot in this case (in R, of course).

[Figure: the Qualtrics data structure, one variable per response option (Q1_1 to Q1_10), with cells containing 1 or left blank]

types <- sapply(1:10, function(i) sum(get(paste("Q1_", i, sep="")), na.rm=TRUE))
barplot(types)

Let’s take this step by step. To count frequencies, we simply use sum(), with the argument na.rm=TRUE because the variables only contain 1 and NA. get() is used to find the variable specified by a string; the string is created with paste(). In this case, the variable names are Q1_1, Q1_2, Q1_3, … Q1_9, Q1_10. By using paste(), we combine the “Q1_” part with the counter variable i, with no separation (sep="").
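
For example, with the counter at 3, the two building blocks do the following:

paste("Q1_", 3, sep="") # returns the string "Q1_3"
sum(get("Q1_3"), na.rm=TRUE) # fetches the variable Q1_3 and counts the ticked boxes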

The whole thing is then wrapped up in sapply(), with the counter variable i defined to take values from 1 to 10; the function(i) part is there so that the counter variable is applied to the sum. So sapply() takes each value of the counter variable, and applies it to the function we specified, which calculates the sum for one variable Q1_i at a time.

Now I can simply draw the barplot, and add the names.arg argument to specify the labels, as sketched below.
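
A minimal sketch; the labels here are placeholders for the actual organization types:

barplot(types, names.arg=paste("Type", 1:10), las=2) # las=2 turns the labels so they fit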

(For the catchy image at the top of this post I also specified colours, barplot(types, col=rainbow(10)), although the colours carry no meaning here: so-called chart junk.)

Plotting Gradients in R

There might be an easier way to do this, but here’s one way to plot gradients in R. It draws on colorRampPalette() and a for() loop, and isn’t very fast on underpowered machines, but it works. Using colorRampPalette() we can create the necessary gradients. Here’s the code I cobbled together:

fade <- function(M=1, S=1, Y=1, H=0.05, K=50, dk="black") {
  # M = midpoint; S = spread; Y = position; H = height; K = steps in gradient; dk = dark colour
  colfunc <- colorRampPalette(c(dk, "white")) # creates a function to produce the gradients
  D <- S/K # delta; how wide does one rectangle have to be?
  collist <- colfunc(K) # create K colours
  for(i in 0:(K-1)) { # draw rectangles; K-1 because i starts at 0 (this makes the next lines easier)
    rect(M+(D*i), Y-H, M+D+(D*i), Y+H, col=collist[i+1], border=NA) # narrow rectangle, no border; to the right
    rect(M-(D*i), Y-H, M-D-(D*i), Y+H, col=collist[i+1], border=NA) # to the left
  }
}

Before applying the above code, we need an empty plot:

plot(1, ylim=c(0.5,8), xlim=c(1,8), type="n", axes=FALSE, xlab="", ylab="")

The ylim and xlim arguments are important, as is type="n" to plot nothing. I usually prefer drawing my own axes with axis(1) and axis(2), as this allows easy customization.
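
With the empty plot in place we can call the function; the values here are purely illustrative:

fade(M=4, S=3, Y=2, H=0.1, K=100) # gradient centred at x=4, fading out 3 units to each side, at height y=2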

Why not highlight the midpoint? We can do this as follows (M and Y being the same midpoint and position values passed to fade()):

text(M, Y, "|", col="white", font=4) # font = 4 for bold italic

In many cases, the package denstrip may offer an easier solution, albeit one where we have less control.

P.S. The code produces just one of the lines in the plot included at the top.