Barplots across variables in R

barplot_rainbow

Here’s a good example of how useful sapply can be. I have some data from Qualtrics, and each response is coded in its own variable. Let’s say there is a question on what kind of organization respondents work in, with 10 response categories. Qualtrics produces 10 variables, each with 1 if the box was ticked, and empty otherwise (structure shown just below). With the default CSV import, these blank cells are turned into NA. Here’s a simple way to produce a barplot in this case (in R, of course).

qualtrics_format

types = sapply(1:10, function(i) sum(get(paste("Q1_",i,sep="")), na.rm=TRUE))
barplot(types)

Let’s take this step by step. To count frequencies, we simply use sum(), with the argument na.rm=TRUE because the variables only contain 1 and NA. get() is used to find the variable specified by a string, which assumes the variables are visible by name (for example because the data frame has been attached); the string is created with paste(). In this case, the variable names are Q1_1, Q1_2, Q1_3, … Q1_9, Q1_10. By using paste(), we combine the “Q1_” part with the counter variable i, with no separation (sep="").

The whole thing is then wrapped up in sapply(), with the counter variable i taking the values 1 to 10. The function(i) part defines an anonymous function of i, so that sapply() can call it once for each value. In other words, sapply() passes each value of the counter to the function we specified, which calculates the sum for one variable Q1_i at a time, and collects the ten sums in a vector.
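As a self-contained sketch of the same idea, the block below simulates data in the format described above. The data frame dat, its size, and the tick probability are made up for illustration; only the column names Q1_1 to Q1_10 follow the post. Instead of get(), it looks the columns up in the data frame by name, which avoids relying on attached data.

```r
# Simulated stand-in for the Qualtrics export (hypothetical data):
# 10 tick-box variables, 1 if ticked, NA otherwise.
set.seed(1)
n <- 50
dat <- as.data.frame(
  replicate(10, ifelse(runif(n) < 0.3, 1, NA_real_))
)
names(dat) <- paste("Q1_", 1:10, sep = "")

# Same logic as above, but indexing the data frame by column name
types <- sapply(1:10, function(i) sum(dat[[paste("Q1_", i, sep = "")]], na.rm = TRUE))

# Equivalent shortcut: column-wise sums, ignoring NA
types2 <- colSums(dat, na.rm = TRUE)

barplot(types)
```

Both versions give the same counts; colSums() is shorter when all the variables sit next to each other in one data frame, while the sapply() version also works when they have to be picked out by name.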

Now I can simply do a barplot, and add the names.arg argument to specify the labels.
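A minimal sketch of the labelling step. The counts and the category labels below are invented for illustration (the post does not list the actual response categories); las = 2 turns the labels perpendicular to the axis so that longer names fit.

```r
# Hypothetical counts, standing in for the `types` vector computed above
types <- c(12, 7, 3, 9, 5, 1, 4, 8, 2, 6)

# Hypothetical category labels, one per variable Q1_1 ... Q1_10
org_labels <- c("Public", "Private", "NGO", "University", "Media",
                "Union", "Party", "Think tank", "Church", "Other")

# barplot() returns the x positions of the bar midpoints,
# which is handy for adding text above the bars later
mids <- barplot(types, names.arg = org_labels, las = 2, cex.names = 0.8)
```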

(Here I specified the colours: barplot(types, col=rainbow(10)) to have a catchy image at the top of this post, albeit one where colours have no meaning: so-called chart-junk).

Plotting Gradients in R

fade

There might be an easier way to do this, but here’s one way to plot gradients in R. It uses colorRampPalette() to create the colours of the gradient and a for() loop to draw them as a row of narrow rectangles. It isn’t very fast on underpowered machines, but it works. Here’s the code I cobbled together:

fade <- function(M=1, S=1, Y=1, H=0.05, K=50, dk="black") {
  # M = midpoint; S = spread; Y = position; H = height; K = steps in gradient; dk = dark colour
  colfunc <- colorRampPalette(c(dk, "white")) # creates a function to produce the gradients
  D <- S/K # delta; how wide does one rectangle have to be?
  collist <- colfunc(K) # create K colours
  for(i in 0:(K-1)) { # draw rectangles; K-1 because I start with 0 (this makes it easier in the line just below)
    rect(M+(D*i), Y-H, M+D+(D*i), Y+H, col=collist[i+1], border=NA) # drawing a narrow rectangle, no borders drawn; to right
    rect(M-(D*i), Y-H, M-D-(D*i), Y+H, col=collist[i+1], border=NA) # to left
  }
}

Before applying the above code, we need an empty plot:

plot(1, ylim=c(0.5,8), xlim=c(1,8), type="n", axes=FALSE, xlab="", ylab="")

The ylim and xlim arguments are important, as is type="n", which sets up the plot region without drawing anything. I usually prefer drawing my own axes with axis(1) and axis(2), as this allows easy customization…

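Why not highlight the midpoint? We can do this as follows, using the same M and Y values that were passed to fade() (or by adding the line inside the function):

text(M, Y, "|", col="white", font=4) # font = 4 for bold italic; font = 2 gives plain bold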
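Putting the pieces together, here is a sketch of the whole workflow. fade_vec() is a hypothetical variant of fade(), not the original function: it relies on rect() accepting vectors of coordinates and colours, so all rectangles on one side are drawn in a single call instead of a loop. The midpoints, spreads, and colours used in the calls are arbitrary.

```r
# Hypothetical vectorised variant of fade(); same arguments, no for() loop
fade_vec <- function(M = 1, S = 1, Y = 1, H = 0.05, K = 50, dk = "black") {
  cols <- colorRampPalette(c(dk, "white"))(K)  # K colours from dark to white
  D <- S / K                                   # width of one rectangle
  i <- 0:(K - 1)                               # offsets of the rectangles
  rect(M + D * i, Y - H, M + D * (i + 1), Y + H, col = cols, border = NA)  # to the right
  rect(M - D * (i + 1), Y - H, M - D * i, Y + H, col = cols, border = NA)  # to the left
  text(M, Y, "|", col = "white", font = 2)     # mark the midpoint in bold
  invisible(cols)                              # return the colours without printing them
}

# Empty plot with hand-drawn axes
plot(1, ylim = c(0.5, 8), xlim = c(1, 8), type = "n", axes = FALSE, xlab = "", ylab = "")
axis(1)
axis(2)

# One gradient line per call
fade_vec(M = 4, S = 2, Y = 2)
shades <- fade_vec(M = 5, S = 1.5, Y = 4, dk = "darkblue")
```

Drawing all rectangles in one call should be noticeably quicker than the loop for large K, although I have not timed it.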

In many cases, the package denstrip may offer an easier solution, albeit one where we have less control.

P.S. The code produces just one of the lines in the plot included at the top.

Plotting Connected Lines with Missing Values

When we plot data with missing values, R does not connect the points across the gaps. This is probably the correct behaviour, but what if we really want to gloss over missing data points?

plot(variable.name[country=="UK"], type="b") gives me something like the following. I used type="b", since type="l" only draws segments between adjacent non-missing points and can therefore give an empty plot, which is generally not very useful.

miss_plain

What if we simply leave out the missing values? plot(na.omit(variable.name[country=="UK"]), type="b") kind of works, but we lose the correct spacing on the x-axis:

miss_omit

So what we can do is the following. In a first step we identify for which points we have data. Next we plot, but only these. In contrast to the above method, the spacing on the x-axis remains intact.

uk <- variable.name[country=="UK"] # the UK series, including NAs
miss <- !is.na(uk) # TRUE where we have data (despite the name)
plot(which(miss), uk[miss], col="red", type="b", lwd=2)

miss_miss
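To try this without the original data, here is a small sketch with a made-up series. The vector y and its gaps are invented; the point is that which() returns the positions of the observed values, so they keep their place on the x-axis.

```r
# Hypothetical series with gaps (NA = missing)
y <- c(3.1, NA, 3.4, NA, NA, 4.0, 4.2, NA, 3.9, 4.5)

observed <- !is.na(y)     # TRUE where we have data
x_pos <- which(observed)  # positions of the observed values on the x-axis

# Connect only the observed points, at their original positions
plot(x_pos, y[observed], type = "b", col = "red", lwd = 2,
     xlim = c(1, length(y)), xlab = "Index", ylab = "Value")
```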

It is important to include an xlim argument if we add multiple lines on the same plot, because otherwise the x-axis only spans the observed positions of the first series. Typically I draw the axes separately, as this gives me more control over them, especially the labels on the x-axis.

uk <- variable.name[country=="UK"]
miss <- !is.na(uk)
plot(which(miss), uk[miss], col="red", type="b", lwd=2, axes=FALSE, xlim=c(1,16))
axis(2)
axis(1, at=c(1,6,11,16), labels=c("1995","2000","2005","2010"))

miss_typical
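A sketch of the multiple-lines case, using invented data for two countries over 16 years (1995 to 2010). The helper obs_xy() and the country vectors are hypothetical. Fixing xlim and ylim in the first plot() call makes sure the second series, added with lines(), is not clipped.

```r
# Hypothetical data: 16 yearly values (1995-2010) per country, with gaps
set.seed(42)
years <- 1995:2010
uk <- round(rnorm(16, mean = 5), 1); uk[c(1, 2, 5, 9, 16)] <- NA
de <- round(rnorm(16, mean = 6), 1); de[c(3, 4, 10)] <- NA

# Hypothetical helper: positions and values of the observed points
obs_xy <- function(v) {
  ok <- !is.na(v)
  list(x = which(ok), y = v[ok])
}
p_uk <- obs_xy(uk)
p_de <- obs_xy(de)

# Fix xlim and ylim up front so that both series fit
plot(p_uk$x, p_uk$y, type = "b", col = "red", lwd = 2, axes = FALSE,
     xlim = c(1, 16), ylim = range(uk, de, na.rm = TRUE), xlab = "", ylab = "")
lines(p_de$x, p_de$y, type = "b", col = "blue", lwd = 2)  # second series
axis(2)
axis(1, at = c(1, 6, 11, 16), labels = years[c(1, 6, 11, 16)])
```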