Calculating VIF by hand

A widely used measure of multicollinearity is the VIF (short for variance inflation factor). Multicollinearity describes the situation where the predictor variables in a multiple regression model are highly correlated with one another, which is usually undesirable (assuming you haven’t gone Bayesian yet).

In R, the VIF can easily be calculated with a function in library car. It’s actually not difficult to do by hand, which incidentally helps us understand what the VIF measures, why there is no separate VIF for logistic regression models, and why the VIF is better than looking at bivariate correlations between predictors.

We start with some random data to run the multiple regression model. Here we create one outcome (y) and three predictor variables (x, z, a), all filled with random numbers. That’ll do for a demonstration.

x = runif(50)
y = runif(50)
z = runif(50)
a = runif(50)
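
Since runif() draws fresh random numbers on every run, you can make the example reproducible by setting a seed before the runif() calls (the seed value is arbitrary):

set.seed(42) # any number will do; it only fixes the random draws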

Here’s a simple OLS model:

m = lm(y ~ x + z + a)

If you have library car installed, you can easily calculate the VIF:

library(car)
vif(m)

To do it by hand, though, we run a linear regression model (OLS) for each of the predictors. Here’s the code for predictor x: the predictor in question (x) becomes the outcome variable, while the other predictors remain predictors. The variable used as the outcome previously (y) does not appear in this auxiliary model.

mx = lm(x ~ z + a)

The VIF for x is then simply 1/(1-R²) of this auxiliary model. In R, we can run the following:

1/(1-summary(mx)$r.squared)
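
To complete the by-hand calculation, here’s a minimal sketch that repeats this for all three predictors (using the objects defined above); the three values should match what vif(m) from library car reports:

# VIF by hand: regress each predictor on the other predictors,
# then apply 1/(1 - R^2)
vif_x = 1/(1 - summary(lm(x ~ z + a))$r.squared)
vif_z = 1/(1 - summary(lm(z ~ x + a))$r.squared)
vif_a = 1/(1 - summary(lm(a ~ x + z))$r.squared)
round(c(x=vif_x, z=vif_z, a=vif_a), 3)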

How to run a regression on a subset in R

Sometimes we need to run a regression analysis on a subset or sub-sample. That’s quite simple to do in R: all we need is the subset() command. Let’s look at a linear regression:

lm(y ~ x + z, data=myData)

Rather than run the regression on all of the data, let’s do it for only women, or only people with a certain characteristic:

lm(y ~ x + z, data=subset(myData, sex=="female"))

lm(y ~ x + z, data=subset(myData, age > 30))

The subset() command takes the data set and a logical condition that identifies the subset.
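
As an aside, lm() itself has a subset argument, so the same model can be written without subset(); conditions can also be combined with the usual logical operators (myData, sex and age are the hypothetical names from above):

lm(y ~ x + z, data=myData, subset = sex == "female")            # equivalent to subset()
lm(y ~ x + z, data=subset(myData, sex == "female" & age > 30))  # combined condition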

How to Create Coefficient Plots in R the Easy Way

Presenting regression analyses as figures (rather than tables) has many advantages, despite what some reviewers may think.

Tables2Graphs has useful examples, including R code, but there’s a simpler way. There’s an R package for (almost) everything, and (of course) you’ll find one to produce coefficient plots. Actually, there are several.

The one I end up using most is the coefplot function in the package arm. It handles most common models out of the box; for those it doesn’t, you can simply supply the coefficients. Here’s the code for the coefficient plot. The first two lines are just there to get the data, in case you’re interested in full replication.

The default in arm is to use a vertical layout, so coefplot(m1) works wonderfully. Often I prefer the horizontal layout, which is easily done with vertical=FALSE; I also add custom margins so that the variable names are fully visible.

library(car) # for the example data
data(Duncan) # example data
m1 = lm(prestige ~ income + education + type, data=Duncan)
library(arm) # for coefplot()
coefplot(m1, vertical=FALSE, mar=c(5.5,2.5,2,2))

If I want to plot the coefficients of a model that isn’t supported, like a Cox proportional hazards survival model, all it takes is to supply the coefficients. The coefplot function takes many arguments, as we would expect. Here’s an example: I supply the coefficients and SD as required (using subsets of the results matrix), specify the variable names, and set the limits of the y-axis. In this example, column [,1] holds the coefficients and column [,2] the SD.

Specifying variable names is something I often do, because we don’t want to communicate using our (internal) variable names; even I find myself struggling to remember what some variables stood for when I return to old results after a few months. Usually there is no need to specify the limits of the axes (ylim), but I included it here to show that the function takes many standard arguments, and because it can be useful when comparing different models. Speaking of comparing models: running coefplot a second time with add=TRUE does just that.

variableNames = c("Intercept", "Income", "Education", "Type Prof", "Type W.Coll")
coefplot(summary(m1)$coefficients[,1], summary(m1)$coefficients[,2],
         vertical=FALSE, varnames=variableNames, ylim=c(-5, 25), main="")
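
For instance, a minimal sketch of such a comparison (m2 is a hypothetical second model fitted on a subset of the Duncan data; offset and col.pts simply keep the two sets of points apart):

m2 = lm(prestige ~ income + education + type, data=Duncan, subset = income > 40)
coefplot(m1, vertical=FALSE, mar=c(5.5,2.5,2,2))
coefplot(m2, vertical=FALSE, add=TRUE, col.pts="red", offset=0.1)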

There’s also a dedicated package for coefficient plots called coefplot, but somehow it has often failed me. There’s also Ben Bolker’s coefplot2; it’s not on CRAN, so it needs to be installed with the following code:

install.packages("coefplot2", repos="http://www.math.mcmaster.ca/bolker/R", type="source")

You might also be interested in the visreg package. Although it doesn’t do coefficient plots, it visualizes regression analyses so that you can see the data alongside the results.
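
For example, a minimal sketch of what visreg does with the Duncan model m1 from above (it plots one predictor at a time, with the fitted line, a confidence band, and partial residuals):

library(visreg)
visreg(m1, "income") # partial relationship between income and prestige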

Finally, for those happy to code in R, have a look at the figures (and code) by Carlisle Rainey. That’s nice for polishing the results for publication, but seems a bit complicated for a first look at the results. Indeed, even if we go for the table in the end, coefficient plots are a very useful tool for us researchers to understand the analyses we run, and to actually see what the figures amount to!

P.S. There are also legitimate reasons to keep the table, of course, but that’s something for the supplementary file.