I got recently asked how to calculate predicted probabilities in R. Of course we could do this by hand, but often it’s preferable to do this in R. The command we need is `predict()`

; here’s how to use it.

First we need to run a regression model. In this example, I predict whether a person voted in the previous election (binary dependent variable) with variables on education, income, and age. I use logistic regression:

`m <- glm(voted ~ edu + income + age, family="binomial", data=voting)`

Now comes the not so obvious part: we need to specify the cases we are interested in. We’re using `with()`

and `data.frame()`

to do so. In this case, I’m interested in the predicted probabilities of two people, both with average (mean) education and income, but one aged 20 and the other aged 60. In this step, we need to cover all the independent variables.

`newdata <- with(voting, data.frame(age = c(20, 60), edu=mean(edu, na.rm=TRUE), income=mean(income, na.rm=TRUE)))`

Finally we can get the predictions:

`predict(m, newdata, type="response")`

That’s our model `m`

and `newdata`

we’ve just specified. `type="response"`

calculates the predicted probabilities. We get

```
``` 1 2
0.3551121 0.6362611

So 36% for the person aged 20, and 64% for the person aged 60.

Often, however, a picture will be more useful. The logic is the same. We use the same model, and ask R to predict for every age from 18 to 90 (I guess you don’t want to do this my hand).

`newdata2 <- with(voting, data.frame(age = 18:90, edu=mean(edu, na.rm=TRUE), income=mean(income, na.rm=TRUE)))`

We once again use `predict()`

, but this time also ask for standard errors.

`preds <- predict(m, newdata2, type="response", se.fit=TRUE)`

For the plot, I want the predicted probabilities +/- 1.96 standard errors (that’s the 95% confidence interval; use `qnorm(0.975)`

if 1.96 is not precise enough). I extract and calculate the values for each line separately to better understand the code.

`predf <- preds$fit # predicted`

lower <- preds$fit - (1.96*preds$se.fit) # lower bounds

upper <- preds$fit + (1.96*preds$se.fit) # upper bounds

Finally the plot: It’s a simple line plot of the predicted probabilities plotted against the age (18 to 90). I choose not to show the borders of the plot, and then use `lines()`

twice to add the lower and upper bounds.

`plot(18:90, predf, type="l", ylab="Predicted Probability to Vote", xlab="Age", bty="n")`

lines(18:90, lower, lty=2)

lines(18:90, upper, lty=2)

Thank you!!!

I’m glad this helped.

Hi, this is extremely useful I have a question. In your examples, you constrained continuous variables like income and education at their means, while running the predicted probabilities. How do we handle factor variables? I tried to constrain them at their means but I wasn’t able to.

So for example, if your education variable was a factor variable (1=less than HS, 2 = HS, 3 = some college and 4 = college and higher). edu=mean(edu, na.rm=TRUE) wouldn’t work. There would be an error message that [educ was fitted with type “factor” but type “numeric” was supplied] or something like that.

Thanks for checking in. You simply specify the category you want to use, like edu=”less than” or 1, depending how you have coded this. You can also set continuous variables to something other than the mean, it’s really up to you to choose something meaningful for the comparison.

Hi Didier, I have a follow-up question. Let’s say I want to get the average difference in predicted probability of voting for a treatment vs. control group and I have continuous (income) and categorical covariates (education and gender). Would I set each categorical variable to a mean value (by making it numeric)? I want to get one average estimate for predicted probability holding everything else constant and not separate probability estimates for each level of treatment, gender and education? Thank you!

Sincerely, Confused grad student trying to use R

Hi Ariel,

You don’t have to use the mean value for continuous variables at all. You can compare the outcome at income 15k and 35k. I actually think, such an approach would be preferable if the 15k and 35k are meaningful values (I’m making this up as an example, but say if these were the mean salary for a nurse and a teacher, we can relate to the predicted probabilities we get).

In most situations, the “mean” of a categorical variable is utterly meaningless. If you have a binary variable like gender (I presume you only measure “male” and “female”), you’d normally set it to one of them, say you’d look at women only, and then vary the income to see how this affects the outcome. If you mention education as a categorical variable, I guess you measure it in an ordinal way (say “primary”, “secondary”, “tertiary”). In that case, you’d pick a level that is meaningful (in substantive terms).

Now, whether you should just pick one level (e.g. “women”, “primary”) or multiple ones, depends on what you want to show. If your model is strictly linear, it doesn’t matter, because the “effect” is the same no matter where you start. If the model relaxes linearity in one or another way, you’d no longer expect the same “effect” everywhere, and it can make sense to explore them at different values. Where I’d always choose different levels is when the categories I picked are truly meaningful (say “male teacher”, “female truck driver”). So it depends a bit on what you want to achieve.