How to run a regression on a subset in R

Sometimes we need to run a regression analysis on a subset or sub-sample. That’s quite simple to do in R; all we need is the subset() command. Let’s look at a linear regression:

lm(y ~ x + z, data=myData)

Rather than run the regression on all of the data, let’s do it for only women, or only people with a certain characteristic:

lm(y ~ x + z, data=subset(myData, sex=="female"))

lm(y ~ x + z, data=subset(myData, age > 30))

The subset() command takes the data set and a logical condition that identifies the subset.
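The conditions can also be combined with R’s logical operators; for instance, putting the two examples above together (purely illustrative):

lm(y ~ x + z, data=subset(myData, sex=="female" & age > 30))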

Same Explanatory Variables, Multiple Dependent Variables in R

I needed to run variations of the same regression model: the same explanatory variables with multiple dependent variables. In R, we can do this with a simple for() loop and assign().

First I specify the dependent variables:

dv <- c("dv1", "dv2", "dv3")

Then I create a for() loop to cycle through the different dependent variables:

for(i in 1:length(dv)){

Within this loop, I need to create the name of the object that will hold each model. I need a separate object for each model, so I build the name with paste(). For the first dependent variable, this will be model1; for the second dependent variable model2, and so on.

model <- paste("model",i, sep="")

With this name in place, I can run the model using the ith dependent variable; the result is stored in an object called m.

m <- lm(as.formula(paste(dv[i],"~ ev1 + ev2")), data=mydata)

Now, I assign the model m to the model object created above: model1 for the first dependent variable, etc. That’s also the end of the for() loop.

assign(model,m)}
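Putting the pieces together, the whole loop looks like this (assuming, as above, that mydata contains the columns dv1, dv2, dv3, ev1, and ev2):

dv <- c("dv1", "dv2", "dv3")
for(i in 1:length(dv)) {
  model <- paste("model", i, sep="")
  m <- lm(as.formula(paste(dv[i], "~ ev1 + ev2")), data=mydata)
  assign(model, m)
}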

We can now look at the results:

summary(model1); summary(model2); summary(model3)

or, more practically, to compare the models side by side:

library(memisc)
mtable(model1, model2, model3)

Using Shading in Coefficient Plots

Having recently discovered the R package denstrip, I was struck by how intuitive it is to use shading to express uncertainty. If you need convincing, check out the paper by Jennifer Lundquist and Ken-Hou Lin cited below.

[Figure: coefficient plot using shading (coefplot.fade)]

I wondered whether this would also work well for generic coefficient plots to examine regression results. The proof is in the pudding, so here’s some quickly thrown-together code (also on Gist).

library(denstrip)
fade <- function(x, labels=names(coef(x)), expo=FALSE, xlab="", ylab="", bty="n", ...) {
  # argument: a fitted regression model; additional arguments are passed on to plot() and text()
  coe <- summary(x)$coefficients[, 1] # extract coefficients
  cse <- summary(x)$coefficients[, 2] # standard errors
  len <- length(coe) # how many coefficients (including the intercept, which the loop below skips)
  if(expo == TRUE) { # exponential form
    coe <- exp(coe)
    cse <- exp(cse)
  }
  ran <- c(min(coe) - 3*max(cse), max(coe) + 3*max(cse)) # range of values
  plot(0, ylim=c(1.5, len+0.5), xlim=ran, xlab=xlab, ylab=ylab, bty=bty, type="n", ..., axes=FALSE) # empty plot
  # title, xlab, ylab, etc. can be used here
  for(i in 2:len) {
    dens <- rnorm(mean=coe[i], sd=cse[i], n=1000) # simulate the sampling distribution of coefficient i
    denstrip(dens, at=i) # passing arguments conflicts with plot()
    text(ran[1], i, labels[i], adj=c(0, 0), ...) # adj to left-align
    # cex, col can be used here
  }
  axis(1)
}

Well, it does work (see figure above). If we add an abline(v=0, lty=3), we can easily highlight the zero line.
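Here’s a quick sketch of how the function might be called, reusing the Duncan example data that also appears further below (any fitted lm object would do):

library(car)       # for the Duncan example data
data(Duncan)
m <- lm(prestige ~ income + education + type, data=Duncan)
fade(m)            # shaded coefficient plot
abline(v=0, lty=3) # highlight the zero line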

[Figure: coefficient plot produced by the arm package (coefplot.arm)]

However, intuitive as it is, I’m not really convinced. Here’s the coefficient plot the arm package produces. While it doesn’t use shading, lines of different thickness are used to indicate the standard errors, nicely de-emphasizing the end-points. There’s more emphasis on the point estimate (i.e. the coefficient itself: the square), while my code hides this. I did try adding a white line at the point estimate (third figure, below), but this doesn’t resolve my biggest worry: the amount of ink that is used to convey the message… I’m just not convinced that the shading adds that much to the relatively simple coefficient plots in the arm package.

[Figure: modified coefficient plot with a white line at the point estimate (coefplot.mod)]

(There’s a width argument that could be added to the line calling denstrip(), like width=0.2, but it hasn’t convinced me enough.)
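For reference, that would amount to changing the denstrip() call inside the loop to something like:

denstrip(dens, at=i, width=0.2) # narrower strips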

Jackson, Christopher H. 2008. “Displaying Uncertainty with Shading.” The American Statistician 62 (4): 340–47. doi:10.1198/000313008X370843.

Lundquist, Jennifer H., and Ken-Hou Lin. 2015. “Is Love (Color) Blind? The Economy of Race among Gay and Straight Daters.” Social Forces, March, sov008. doi:10.1093/sf/sov008.

How to Create Coefficient Plots in R the Easy Way

Presenting regression analyses as figures (rather than tables) has many advantages, despite what some reviewers may think.

[Figure: coefficient plot (coefplot)]

tables2graphs has useful examples including R code, but there’s a simpler way. There’s an R package for (almost) everything, and (of course) you’ll find one to produce coefficient plots. Actually, there are several.

The one I end up using most is the coefplot function in the package arm. It handles most common models out of the box. For those it doesn’t, you can simply supply the coefficients. Here’s the code for the coefficient plot shown. The first two lines are just to get the data in case you’re interested in full replication.

The default in arm is to use a vertical layout, so coefplot(m1) works wonderfully. Often I prefer the horizontal layout, which is easily done with vertical=FALSE; I also add custom margins so that the variable names are fully visible.

library(car) # for the example data
data(Duncan) # example data
m1 <- lm(prestige ~ income + education + type, data=Duncan)
library(arm) # for coefplot()
coefplot(m1, vertical=FALSE, mar=c(5.5,2.5,2,2))

If I want to plot the coefficients of a model that isn’t supported, like a Cox proportional hazards survival model, all it takes is to supply the coefficients. The coefplot function takes many arguments, as we would expect. Here’s an example: I supply the coefficients and standard errors as required (using subsets from the results), specify the variable names, and set the limits of the y-axis. In this example, I specify the columns with the coefficients [,1] and the standard errors [,2].

Specifying variable names is something I often do, because we don’t want to communicate using our (internal) variable names; even I find myself struggling to understand what some variables stood for when I get back to old results after a few months. Usually there is no need to specify the limits of the axes (ylim), but I included this here to show that the function takes many standard arguments, and because this can be useful when we compare different models. Speaking of comparing models, running coefplot a second time with add=TRUE does just this.

variableNames = c("Intercept", "Income", "Education", "Type Prof", "Type W.Coll")
coefplot(summary(m1)$coefficients[,1], summary(m1)$coefficients[,2], vertical=FALSE, varnames=variableNames, ylim=c(-5, 25), main="")
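To illustrate the comparison, here’s a minimal sketch with a made-up second model (m2 is hypothetical, purely for the example; the offset argument nudges the points so the two models don’t overlap):

m2 <- lm(prestige ~ income + education, data=Duncan) # hypothetical reduced model
coefplot(m1, vertical=FALSE, mar=c(5.5,2.5,2,2))
coefplot(m2, vertical=FALSE, add=TRUE, offset=0.1, col.pts="red")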

There’s also a dedicated package for coefficient plots called coefplot, but somehow it often failed me. There’s also Ben Bolker’s coefplot2; it requires installation with the following code (it’s not on CRAN):

install.packages("coefplot2", repos="http://www.math.mcmaster.ca/bolker/R", type="source")

You might also be interested in the visreg package. Although it doesn’t do coefficient plots, it visualizes regression analyses so that you can see the data alongside the results.
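As a minimal sketch of what that looks like, reusing m1 from above (visreg plots the fitted relationship for one predictor, with the data shown as partial residuals):

library(visreg)
visreg(m1, "income") # fitted line plus partial residuals for income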

Finally, for those happy to code in R, have a look at the figures (and code) by Carlisle Rainey. That’s nice for polishing the results for publication, but it seems a bit complicated for a first look at the results. Indeed, even if we go for the table in the end, coefficient plots are a very useful tool for us researchers to understand the analyses we run, and to actually see what these figures amount to!

P.S. There are also legitimate reasons to keep the table, of course, but that’s something for the supplementary file.

That Feeling

I got Andrew Gelman to post a question on that I-just-fit-a-model feeling.

How do you describe that feeling when you hit enter and get the first results on your screen? That’s after going through the theory, collecting data, specifying the model, perhaps debugging the code, etc.

To me, it’s one of the exciting moments in doing research. It’s got that excitement I get when, for no apparent reason, I bet on the colour of the eggs in a carton (here they sell both brown and white eggs…) and lift the lid to find out. At the same time, there is a slightly less frivolous side to it, as I’m wagering (part of) my view of how the world functions.

After reading Andrew’s reply: yes, I also often get that anti-climax mixed in, seeing numbers on the screen and trying to make sense of them. After all, it usually isn’t as straightforward as a box of brown or white eggs (or both)…