I got recently asked how to calculate predicted probabilities in R. Of course we could do this by hand, but often it’s preferable to do this in R. The command we need is
predict(); here’s how to use it.
First we need to run a regression model. In this example, I predict whether a person voted in the previous election (binary dependent variable) with variables on education, income, and age. I use logistic regression:
m <- glm(voted ~ edu + income + age, family="binomial", data=voting)
Now comes the not so obvious part: we need to specify the cases we are interested in. We’re using
data.frame() to do so. In this case, I’m interested in the predicted probabilities of two people, both with average (mean) education and income, but one aged 20 and the other aged 60. In this step, we need to cover all the independent variables.
newdata <- with(voting, data.frame(age = c(20, 60), edu=mean(edu, na.rm=TRUE), income=mean(income, na.rm=TRUE)))
Finally we can get the predictions:
predict(m, newdata, type="response")
That’s our model
newdata we’ve just specified.
type="response" calculates the predicted probabilities. We get
So 36% for the person aged 20, and 64% for the person aged 60.
Often, however, a picture will be more useful. The logic is the same. We use the same model, and ask R to predict for every age from 18 to 90 (I guess you don’t want to do this by hand).
newdata2 <- with(voting, data.frame(age = 18:90, edu=mean(edu, na.rm=TRUE), income=mean(income, na.rm=TRUE)))
We once again use
predict(), but this time also ask for standard errors.
preds <- predict(m, newdata2, type="response", se.fit=TRUE)
For the plot, I want the predicted probabilities +/- 1.96 standard errors (that’s the 95% confidence interval; use
qnorm(0.975) if 1.96 is not precise enough). I extract and calculate the values for each line separately to better understand the code.
predf <- preds$fit # predicted
lower <- preds$fit - (1.96*preds$se.fit) # lower bounds
upper <- preds$fit + (1.96*preds$se.fit) # upper bounds
Finally the plot: It’s a simple line plot of the predicted probabilities plotted against the age (18 to 90). I choose not to show the borders of the plot, and then use
lines() twice to add the lower and upper bounds.
plot(18:90, predf, type="l", ylab="Predicted Probability to Vote", xlab="Age", bty="n")
lines(18:90, lower, lty=2)
lines(18:90, upper, lty=2)