I got recently asked how to calculate predicted probabilities in R. Of course we could do this by hand, but often it’s preferable to do this in R. The command we need is `predict()`

; here’s how to use it.

First we need to run a regression model. In this example, I predict whether a person voted in the previous election (binary dependent variable) with variables on education, income, and age. I use logistic regression:

`m <- glm(voted ~ edu + income + age, family="binomial", data=voting)`

Now comes the not so obvious part: we need to specify the cases we are interested in. We’re using `with()`

and `data.frame()`

to do so. In this case, I’m interested in the predicted probabilities of two people, both with average (mean) education and income, but one aged 20 and the other aged 60. In this step, we need to cover all the independent variables.

`newdata <- with(voting, data.frame(age = c(20, 60), edu=mean(edu, na.rm=TRUE), income=mean(income, na.rm=TRUE)))`

Finally we can get the predictions:

`predict(m, newdata, type="response")`

That’s our model `m`

and `newdata`

we’ve just specified. `type="response"`

calculates the predicted probabilities. We get

```
``` 1 2
0.3551121 0.6362611

So 36% for the person aged 20, and 64% for the person aged 60.

Often, however, a picture will be more useful. The logic is the same. We use the same model, and ask R to predict for every age from 18 to 90 (I guess you don’t want to do this by hand).

`newdata2 <- with(voting, data.frame(age = 18:90, edu=mean(edu, na.rm=TRUE), income=mean(income, na.rm=TRUE)))`

We once again use `predict()`

, but this time also ask for standard errors.

`preds <- predict(m, newdata2, type="response", se.fit=TRUE)`

For the plot, I want the predicted probabilities +/- 1.96 standard errors (that’s the 95% confidence interval; use `qnorm(0.975)`

if 1.96 is not precise enough). I extract and calculate the values for each line separately to better understand the code.

`predf <- preds$fit # predicted`

lower <- preds$fit - (1.96*preds$se.fit) # lower bounds

upper <- preds$fit + (1.96*preds$se.fit) # upper bounds

Finally the plot: It’s a simple line plot of the predicted probabilities plotted against the age (18 to 90). I choose not to show the borders of the plot, and then use `lines()`

twice to add the lower and upper bounds.

`plot(18:90, predf, type="l", ylab="Predicted Probability to Vote", xlab="Age", bty="n")`

lines(18:90, lower, lty=2)

lines(18:90, upper, lty=2)

Hi Didier,

This post is very Nice.

But i have one question, what is the difference between what you do and the visreg function, is exactly the same or existe differences? How can I interpret the Visreg function with continous and factor variables?

Thanks in advance

It seems you can pretty much do the same with library(visreg). There’s the argument to transform log(odds) to probabilities in visreg, too (scale=”response”). I guess doing it manually, as I do it in the post gives you easier control over the cases you compare. Put differently, I can choose specific cases with substantive meaning. I’m not sure I understand your second question on continuous and factor variables. I’d generally recommend using predict() to calculate the probabilities for specific (substantially meaningful) cases.

Thank you!!

The last question, when you make the plot of visreg for one variable that probability is estimated for the mean of the other variables, or for the median? Visreg calculate that probability for the mean or for the median of the sample?

I don´t know why the AMEs made with margins function do not coincide with the jumps of the plots made with visreg function. I think it is because the AMEs are obtained with the sample mean for the rest of the variables and Visreg for the median.

Thank you very very much

You’ll have to check the documentation of the visreg() function to get an answer to that, I’m afraid. As you note, there’s more than one way to do this.

Dear Didier,

Thank you so much for the post. Here is my question

What if I want to plot predicted probability of interaction variables between categoric variable [gender] and 3 level variable [income].

vote[Yes=1,No]~ gender[male=1,female=0] + income [low=0,middle=1,high=2] + age + education [high=1,low=0] +

location [urban=1, rural=0]+ gender [ref=Female] * income[ref=low].The interaction term is significant.

Hope this is clear. I look forward to hear from you :-)

Henry

I’m not sure I understand your question, Henry. If I understand you right, your variables are categorical (2 categories for gender, 3 categories for income). You can easily use the predicted values for each of the (2×3) possibilities. If you want to use a graph with so few actual predictions is another question: you’d have two lines defined by 3 income levels each, which is not really much for a line.

Thank you so much!!!

I spent so long looking for how to find confidence intervals for predicted probabilities and it turned out to be really easy. Really appreciate this.