Today I was looking at some data I gathered on how different groups react to a certain stimulus. A classic case for the aggregate function in R:
aggregate(reaction_var, by=list(group=group_var), median, na.rm=TRUE)
I looked at the mean, median, and interpolated medians, but it was hard to make out whether there were real differences between the groups. That’s the moment I do what I tell my students to do all the time: graph, plot, … (and wonder why this time I thought I wouldn’t have to plot everything anyway)
Here’s the magic of the kernel densities that helped me see what’s going on.
plot(density(reaction_var, na.rm=TRUE, bw=8), main="", lty=2, ylim=c(0, 0.032), xlim=c(0,100), bty="n")
lines(density(reaction_var[group_var==1], na.rm=TRUE, bw=8), col="blue")
lines(density(reaction_var[group_var==0], na.rm=TRUE, bw=8), col="red")
Here I only look at one particular stimulus, and first plot the kernel density for everyone (no square brackets). I chose a dashed line so that the aggregate is less dominant in the plot (lty=2), after all I’m interested in the group differences (if there are any). I also set the ylim in a second round, because the kernel densities for the red group would otherwise be cut off. I also set the xlim, because the range of my variable is only 0 to 100. Because of the bandwidth, kernel density plots never quite end at logical end points. I also set the bandwidth of the kernel density to 8 so that it is exactly the same across the groups. The last argument (bty) gets rid of the box R puts around the plot by default.
I then add the kernel densities for the two groups of interest (square brackets to identity the group of question) with a particular colour. Finally I added the median value for reference.
Well, what is going on? All three lines (combined, and each group separately) have roughly the same median value. The mean is lower for the blue group, but the interpolated median values are almost exactly the same as the median values. Difference or no real difference? I know that the old “textbook” rule that the difference between the median and mean indicates skew often fails, so definitely a case for plotting. And we can see that the central tendency is not telling us much in this case, it’s mostly about the tail.
von Hippel, P. “Mean, Median, and Skew: Correcting a Textbook Rule.” Journal of Statistics Education 13, no. 2 (2005).