Plot and boxplot in R without borders

R can be pretty counter-intuitive at times, usually for historical reasons. Here’s one I’ve forgotten several times: drawing a plot and a boxplot without the border/box around it. The default (top row) is the plot with a border/box around it.

For the plot() function, we need the argument bty=”n”, that’s the box type “n” (for none, I guess). For the boxplot, we need frame=FALSE, which behind the scenes sets the same graphical parameter. Oddly enough, frame=FALSE is not mentioned in the help on the boxplot() function.

The precise code for these plots: a bunch of values in myvalues, some random values in xrandom to spread out the points, and then this:

par(mfrow=c(2,2)) # 2 x 2 plots
plot(myvalues ~ xrandom, xlim=c(0.5, 1.5), xlab="", ylab="")
boxplot(myvalues, xlim=c(0.5, 1.5), xlab="", ylab="")
plot(myvalues ~ xrandom, xlim=c(0.5, 1.5), xlab="", ylab="", bty="n")
boxplot(myvalues, xlim=c(0.5, 1.5), xlab="", ylab="", frame=FALSE)

Plotting a normal distribution in R

Apparently there are some unnecessarily complicated tutorials out there how to draw a normal distribution (or other probability distributions) in R. No, there is no need for a loop; in fact, a single line of code is enough:

curve(dnorm(x, 0, 1), from=-4, to=4)

That’s a normal probability distribution with mean 0 and a standard deviation 1, plotted from -4 to +4.

So it’s also easy to draw normal distributions with different means or different standard deviations. Here’s one without a box around:

curve(dnorm(x, 2, 1), from=1, to=7, bty="n", xlab="")

How about a beta distribution? It’s not more difficult (assuming you know your shape parameters):

curve(dbeta(x, 10, 2), from=0, to=1, bty="n", xlab="")

So from now on, if you need to visualize your priors to get a feel whether they constitute reasonable distributions, remember it’s just one line in R.

How to add text labels to a scatter plot in R?

Adding text labels to a scatter plot in R is easy. The basic function is text(), and here’s a reproducible example how you can use it to create these plots:

Adding text to a scatter plot in R

For the example, I’m creating random data. Since the data are random, your plots will look different. In this fictitious example, I look at the relationship between a policy indicator and performance. It is conventional to put the outcome variable on the Y axis and the predictor on the X axis, but in this example there’s no relationship to reality anyway… The reason I chose min and max values for the random variables here is that I jotted down this code as an explanation for a replication. In this example, we have 25 observations, for 25 units I call “cantons”. The third line here creates a string of characters “A” to “Y”, these are the labels!

policy = runif(25, min=0.4, max=0.7)
perfor = runif(25, min=500, max=570)
canton = sapply(65:89, function(x) rawToChar(as.raw(x)))

For the scatter plot on the left, we use plot(). Then we add the trend line with abline() and lm(). To add the labels, we have text(), the first argument gives the X value of each point, the second argument the Y value (so R knows where to place the text) and the third argument is the corresponding label. The argument pos=1 is there to tell R to draw the label underneath the point; with pos=2 (etc.) we can change that position.

plot(policy ~ perfor, bty="n", ylab="Policy Indicator", xlab="Performance", main="Policy and Performance")
abline(lm(policy ~ perfor), col="red")
text(perfor, policy, canton, pos=1)

The scatter plot on the right is similar, but here we actually plot the labels instead of the dots. There are two differences in the code: First, we add type="n" to create the scatter plot without actually drawing any circles (an empty plot if you will). Second, when we add the text in the third line of the code, we do not have pos=1, because we want to place the labels exactly where the points are.

plot(policy ~ perfor, bty="n", type="n", ylab="Policy Indicator", xlab="Performance", main="Policy and Performance")
abline(lm(policy ~ perfor), col="red")
text(perfor, policy, canton)

Correlations Graphics in R

Correlations are some of the basics in quantitative analysis, and they are well suited for graphical examination. Using plots we can see whether it is justified to assume a linear relationship between the variables, for example. Scatter plots are our friends here, and with two variables it is as simple as calling plot() in R:

plot(var1, var2)

If we have more than two variables, it can be useful to plot a scatter plot matrix: multiple scatter plots in one go. The pairs() command is built in, but in my view not the most useful one out there. Here we use cbind() to combine a few variables, and specify that we don’t want to see the same scatter plots (rotated) in the upper panel.

pairs(cbind(var1, var2, var3, var4) , upper.panel=NULL)

A more flexible method is provided in library(car) with the scatterplotMatrix(). If this is not flexible enough, we can always split the plot and draw whatever we need, but that’s not for today.

library(car)scatterplotMatrix(cbind(var1, var2, var3, var4))

If we have many more variables, it’s necessary to draw multiple plots to be able to see what is going on. However, sometimes after having checked that the associations are more or less linear, we’re simply interested in the strength and direction of the correlations for many combinations of variables. I guess the classic approach is staring at a large table of correlation coefficients, but as is often the case, graphics can make your life easier, in this case library(corrplot):

library(corrplot)
corrplot(object_with_many_variables, method="circle", type="lower", diag=FALSE)

This is certainly more pleasant than staring at a table…

For all these commands, R offers plenty of ways to tweak the output.

The Magic of (Kernel Density) Plots

Today I was looking at some data I gathered on how different groups react to a certain stimulus. A classic case for the aggregate function in R:

aggregate(reaction_var, by=list(group=group_var), median, na.rm=TRUE)

I looked at the mean, median, and interpolated medians, but it was hard to make out whether there were real differences between the groups. That’s the moment I do what I tell my students to do all the time: graph, plot, … (and wonder why this time I thought I wouldn’t have to plot everything anyway)

Here’s the magic of the kernel densities that helped me see what’s going on.

plot(density(reaction_var, na.rm=TRUE, bw=8), main="", lty=2, ylim=c(0, 0.032), xlim=c(0,100), bty="n")
lines(density(reaction_var[group_var==1], na.rm=TRUE, bw=8), col="blue")
lines(density(reaction_var[group_var==0], na.rm=TRUE, bw=8), col="red")
abline(v=50, lty=3)

Here I only look at one particular stimulus, and first plot the kernel density for everyone (no square brackets). I chose a dashed line so that the aggregate is less dominant in the plot (lty=2), after all I’m interested in the group differences (if there are any). I also set the ylim in a second round, because the kernel densities for the red group would otherwise be cut off. I also set the xlim, because the range of my variable is only 0 to 100. Because of the bandwidth, kernel density plots never quite end at logical end points. I also set the bandwidth of the kernel density to 8 so that it is exactly the same across the groups. The last argument (bty) gets rid of the box R puts around the plot by default.

I then add the kernel densities for the two groups of interest (square brackets to identity the group of question) with a particular colour. Finally I added the median value for reference.

kernel.magic

Well, what is going on? All three lines (combined, and each group separately) have roughly the same median value. The mean is lower for the blue group, but the interpolated median values are almost exactly the same as the median values. Difference or no real difference? I know that the old “textbook” rule that the difference between the median and mean indicates skew often fails, so definitely a case for plotting. And we can see that the central tendency is not telling us much in this case, it’s mostly about the tail.

von Hippel, P. “Mean, Median, and Skew: Correcting a Textbook Rule.” Journal of Statistics Education 13, no. 2 (2005).