I got recently asked how to calculate predicted probabilities in R. Of course we could do this by hand, but often it’s preferable to do this in R. The command we need is `predict()`

; here’s how to use it.

First we need to run a regression model. In this example, I predict whether a person voted in the previous election (binary dependent variable) with variables on education, income, and age. I use logistic regression:

`m <- glm(voted ~ edu + income + age, family="binomial", data=voting)`

Now comes the not so obvious part: we need to specify the cases we are interested in. We’re using `with()`

and `data.frame()`

to do so. In this case, I’m interested in the predicted probabilities of two people, both with average (mean) education and income, but one aged 20 and the other aged 60. In this step, we need to cover all the independent variables.

`newdata <- with(voting, data.frame(age = c(20, 60), edu=mean(edu, na.rm=TRUE), income=mean(income, na.rm=TRUE)))`

Finally we can get the predictions:

`predict(m, newdata, type="response")`

That’s our model `m`

and `newdata`

we’ve just specified. `type="response"`

calculates the predicted probabilities. We get

```
``` 1 2
0.3551121 0.6362611

So 36% for the person aged 20, and 64% for the person aged 60.

Often, however, a picture will be more useful. The logic is the same. We use the same model, and ask R to predict for every age from 18 to 90 (I guess you don’t want to do this by hand).

`newdata2 <- with(voting, data.frame(age = 18:90, edu=mean(edu, na.rm=TRUE), income=mean(income, na.rm=TRUE)))`

We once again use `predict()`

, but this time also ask for standard errors.

`preds <- predict(m, newdata2, type="response", se.fit=TRUE)`

For the plot, I want the predicted probabilities +/- 1.96 standard errors (that’s the 95% confidence interval; use `qnorm(0.975)`

if 1.96 is not precise enough). I extract and calculate the values for each line separately to better understand the code.

`predf <- preds$fit # predicted`

lower <- preds$fit - (1.96*preds$se.fit) # lower bounds

upper <- preds$fit + (1.96*preds$se.fit) # upper bounds

Finally the plot: It’s a simple line plot of the predicted probabilities plotted against the age (18 to 90). I choose not to show the borders of the plot, and then use `lines()`

twice to add the lower and upper bounds.

`plot(18:90, predf, type="l", ylab="Predicted Probability to Vote", xlab="Age", bty="n")`

lines(18:90, lower, lty=2)

lines(18:90, upper, lty=2)