The Magic of (Kernel Density) Plots

Today I was looking at some data I gathered on how different groups react to a certain stimulus. A classic case for the aggregate function in R:

aggregate(reaction_var, by=list(group=group_var), median, na.rm=TRUE)

I looked at the mean, median, and interpolated medians, but it was hard to make out whether there were real differences between the groups. That’s the moment I do what I tell my students to do all the time: graph, plot… (and wonder why I thought that this time I could get away without plotting everything).

Here’s the magic of the kernel densities that helped me see what’s going on.

plot(density(reaction_var, na.rm=TRUE, bw=8), main="", lty=2, ylim=c(0, 0.032), xlim=c(0,100), bty="n")
lines(density(reaction_var[group_var==1], na.rm=TRUE, bw=8), col="blue")
lines(density(reaction_var[group_var==0], na.rm=TRUE, bw=8), col="red")
abline(v=50, lty=3)

Here I only look at one particular stimulus, and first plot the kernel density for everyone (no square brackets). I chose a dashed line (lty=2) so that the aggregate is less dominant in the plot; after all, I’m interested in the group differences (if there are any). I set the ylim in a second round, because the kernel density for the red group would otherwise be cut off. I also set the xlim, because the range of my variable is only 0 to 100: because of the bandwidth, kernel density plots never quite end at logical end points. I set the bandwidth of the kernel density to 8 so that it is exactly the same across the groups. The last argument (bty="n") gets rid of the box R puts around the plot by default.

I then add the kernel densities for the two groups of interest (square brackets to identify the group in question), each in a particular colour. Finally, I add the median value for reference.

[Figure: kernel densities for the full sample (dashed) and the two groups (blue and red)]

Well, what is going on? All three lines (combined, and each group separately) have roughly the same median value. The mean is lower for the blue group, but the interpolated median values are almost exactly the same as the median values. Difference or no real difference? I know that the old “textbook” rule that the difference between the median and mean indicates skew often fails, so definitely a case for plotting. And we can see that the central tendency is not telling us much in this case, it’s mostly about the tail.
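To illustrate how central tendency can hide what happens in the tail, here is a small simulated sketch (made-up data, not the data discussed above): two groups whose medians are similar, yet whose densities look quite different.

```r
set.seed(42)
blue <- rnorm(500, mean = 50, sd = 10)       # roughly symmetric around 50
red  <- c(rnorm(400, mean = 50, sd = 5),     # same centre...
          runif(100, min = 60, max = 100))   # ...plus a heavy right tail
median(blue); median(red)  # medians are similar; the difference sits in the tail
plot(density(blue, bw = 8), col = "blue", main = "", bty = "n")
lines(density(red, bw = 8), col = "red")
```

The summary statistics barely register the second group’s tail, but the two density curves diverge immediately.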

von Hippel, Paul T. 2005. “Mean, Median, and Skew: Correcting a Textbook Rule.” Journal of Statistics Education 13 (2).

Getting Rid of Screen Numbers in split.screen

This is something that nags me from time to time, particularly because it’s a documented feature I tend to forget. When using the split.screen() function in R to combine plots (rather than par(mfrow)) in Sweave or Knitr, the screen numbers are output in the main text, alongside the figure. Usually we want the figure only. All we need to do is add the chunk option results='hide'. I find this slightly counter-intuitive, because I do want the result (in the form of a figure); that said, I don’t have a suggestion for an alternative that would feel more intuitive. The same issue also occurs when using library(rgdal) to load a local shapefile, and in other similar situations.
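For illustration, a minimal knitr (.Rnw) chunk might look like this; the chunk label and the plots are placeholders:

```r
<<split-example, results='hide'>>=
split.screen(c(1, 2))   # returns the screen numbers -- the output we want to hide
screen(1)
plot(density(rnorm(100)))
screen(2)
plot(density(rexp(100)))
close.screen(all = TRUE)
@
```

Without results='hide', the vector of screen numbers returned by split.screen() and close.screen() lands in the compiled document next to the figure.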

Plotting Gradients in R

[Figure: gradient lines produced with fade()]

There might be an easier way to do this, but here’s one way to plot gradients in R. It draws on colorRampPalette() and a for() loop, and isn’t very fast on underpowered machines, but it works. Using colorRampPalette() we can create the necessary gradients. Here’s the code I cobbled together:

fade <- function(M=1, S=1, Y=1, H=0.05, K=50, dk="black") {
  # M = midpoint; S = spread; Y = vertical position; H = height; K = steps in gradient; dk = dark colour
  colfunc <- colorRampPalette(c(dk, "white")) # creates a function to produce the gradient colours
  D <- S/K                                    # delta: how wide each rectangle has to be
  collist <- colfunc(K)                       # create K colours
  for (i in 0:(K-1)) { # K-1 because the index starts at 0 (this simplifies the lines just below)
    rect(M+(D*i), Y-H, M+D+(D*i), Y+H, col=collist[i+1], border=NA) # narrow rectangle, no border; to the right
    rect(M-(D*i), Y-H, M-D-(D*i), Y+H, col=collist[i+1], border=NA) # to the left
  }
}

Before applying the above code, we need an empty plot:

plot(1, ylim=c(0.5,8), xlim=c(1,8), type="n", axes=FALSE, xlab="", ylab="")

The ylim and xlim arguments are important, as is type="n", which plots nothing. I usually prefer drawing my own axes with axis(1) and axis(2), as this allows easy customization.
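For instance, a sketch of such custom axes (the tick positions are arbitrary):

```r
axis(1, at = 1:8)                     # x-axis with ticks at 1..8
axis(2, at = c(2, 4, 6, 8), las = 1)  # y-axis; las = 1 turns the labels horizontal
```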

Why not highlight the midpoint? We can do this as follows (using the same M and Y values as in the call to fade()):

text(M, Y, "|", col="white", font=4) # font = 4 for bold italic
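Putting the pieces together, a minimal usage sketch (all the values here are arbitrary):

```r
plot(1, ylim = c(0.5, 8), xlim = c(1, 8), type = "n", axes = FALSE, xlab = "", ylab = "")
fade(M = 4, S = 2, Y = 4, H = 0.2, K = 50, dk = "darkblue")  # one gradient line
text(4, 4, "|", col = "white", font = 4)                     # midpoint marker
```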

In many cases, the package denstrip may offer an easier solution, albeit one where we have less control.

P.S. The code produces just one of the lines in the plot included at the top.

Using Shading in Coefficient Plots

Having recently discovered the R package denstrip, I was struck by how intuitive it is to use shading to express uncertainty. If you need convincing, check out the paper by Jennifer Lundquist and Ken-Hou Lin linked below.

[Figure: coefficient plot using shaded density strips]

I wondered whether this would also work well for generic coefficient plots to examine regression results. The proof is in the pudding, so here’s some quickly thrown together code (also on Gist).

library(denstrip)
fade <- function(x, labels=names(coef(x)), expo=FALSE, xlab="", ylab="", bty="n", ...) {
  # argument: a fitted regression model; additional arguments are passed on to plot() and text()
  coe <- summary(x)$coefficients[, 1] # extract coefficients
  cse <- summary(x)$coefficients[, 2] # standard errors
  len <- length(coe)                  # number of coefficients (incl. the intercept; the loop below skips it)
  if (expo) {                         # exponential form
    coe <- exp(coe)
    cse <- exp(cse)
  }
  ran <- c(min(coe) - 3*max(cse), max(coe) + 3*max(cse)) # range of values
  plot(0, ylim=c(1.5, len+0.5), xlim=ran, xlab=xlab, ylab=ylab, bty=bty, type="n", ..., axes=FALSE) # empty plot
  # title, xlab, ylab, etc. can be used here
  for (i in 2:len) {
    dens <- rnorm(mean=coe[i], sd=cse[i], n=1000)
    denstrip(dens, at=i)                         # passing arguments here conflicts with plot()
    text(ran[1], i, labels[i], adj=c(0, 0), ...) # adj to left-align
    # cex, col can be used here
  }
  axis(1)
}

Well, it does work (see figure above). If we add an abline(v=0, lty=3), we can easily highlight the zero line.
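For reference, a minimal usage sketch with simulated data (the variable names are made up for illustration):

```r
set.seed(1)
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
d$y <- 1 + 0.5 * d$x1 - 0.3 * d$x2 + rnorm(100)
m <- lm(y ~ x1 + x2, data = d)
fade(m)                 # shaded coefficient plot
abline(v = 0, lty = 3)  # highlight the zero line
```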

[Figure: coefficient plot produced by the arm package]

However, intuitive as it is, I’m not really convinced. Here’s the coefficient plot the arm package produces. While it doesn’t use shading, lines of different thickness indicate the standard errors, nicely de-emphasizing the end-points. There’s more emphasis on the point estimate (i.e. the coefficient itself: the square), while my code hides this. I did try adding a white line at the point estimate (third figure, below), but this doesn’t resolve my biggest worry: the amount of ink used to convey the message. I’m just not convinced that the shading adds that much to the relatively simple coefficient plots in the arm package.

[Figure: modified coefficient plot with a white line at the point estimate]

(There’s a width argument that could be added to the line calling denstrip, such as width=0.2, but it hasn’t convinced me enough.)

Jackson, Christopher H. 2008. “Displaying Uncertainty with Shading.” The American Statistician 62 (4): 340–47. doi:10.1198/000313008X370843.

Lundquist, Jennifer H., and Ken-Hou Lin. 2015. “Is Love (Color) Blind? The Economy of Race among Gay and Straight Daters.” Social Forces, March, sov008. doi:10.1093/sf/sov008.

Tables to Figures, well…

There are people far more eloquent than me trying to convince researchers to use figures rather than tables in scientific publications. The only (real) reservation I have come across so far is that figures alone may make meta-analyses difficult. Turns out there is one more…

[Figure: coefficient plot from the submitted paper]

I have recently received the following comment on a submitted paper:

“the graphical representation of the analysis does not offer enough (statistical) insights such as to evaluate the quality of the analysis done, nor to assess the validity of the conclusions drawn from it.”

To be fair to the reviewer, the other feedback I got was very constructive. I just wanted to use the opportunity to highlight that there is much more to do in terms of spreading the word about coefficient plots (above: the kind of figure I used in the paper). The odd thing is that I even included tables in the appendix; in this day of online supplementary material, there is no reason not to. Unfortunately, it seems that the reviewer overlooked them…