I got recently asked how to calculate predicted probabilities in R. Of course we could do this by hand, but often it’s preferable to do this in R. The command we need is `predict()`

; here’s how to use it.

First we need to run a regression model. In this example, I predict whether a person voted in the previous election (binary dependent variable) with variables on education, income, and age. I use logistic regression:

`m <- glm(voted ~ edu + income + age, family="binomial", data=voting)`

Now comes the not so obvious part: we need to specify the cases we are interested in. We’re using `with()`

and `data.frame()`

to do so. In this case, I’m interested in the predicted probabilities of two people, both with average (mean) education and income, but one aged 20 and the other aged 60. In this step, we need to cover all the independent variables.

`newdata <- with(voting, data.frame(age = c(20, 60), edu=mean(edu, na.rm=TRUE), income=mean(income, na.rm=TRUE)))`

Finally we can get the predictions:

`predict(m, newdata, type="response")`

That’s our model `m`

and `newdata`

we’ve just specified. `type="response"`

calculates the predicted probabilities. We get

```
``` 1 2
0.3551121 0.6362611

So 36% for the person aged 20, and 64% for the person aged 60.

Often, however, a picture will be more useful. The logic is the same. We use the same model, and ask R to predict for every age from 18 to 90 (I guess you don’t want to do this by hand).

`newdata2 <- with(voting, data.frame(age = 18:90, edu=mean(edu, na.rm=TRUE), income=mean(income, na.rm=TRUE)))`

We once again use `predict()`

, but this time also ask for standard errors.

`preds <- predict(m, newdata2, type="response", se.fit=TRUE)`

For the plot, I want the predicted probabilities +/- 1.96 standard errors (that’s the 95% confidence interval; use `qnorm(0.975)`

if 1.96 is not precise enough). I extract and calculate the values for each line separately to better understand the code.

`predf <- preds$fit # predicted`

lower <- preds$fit - (1.96*preds$se.fit) # lower bounds

upper <- preds$fit + (1.96*preds$se.fit) # upper bounds

Finally the plot: It’s a simple line plot of the predicted probabilities plotted against the age (18 to 90). I choose not to show the borders of the plot, and then use `lines()`

twice to add the lower and upper bounds.

`plot(18:90, predf, type="l", ylab="Predicted Probability to Vote", xlab="Age", bty="n")`

lines(18:90, lower, lty=2)

lines(18:90, upper, lty=2)

Thank you!!!

I’m glad this helped.

Hi, this is extremely useful I have a question. In your examples, you constrained continuous variables like income and education at their means, while running the predicted probabilities. How do we handle factor variables? I tried to constrain them at their means but I wasn’t able to.

So for example, if your education variable was a factor variable (1=less than HS, 2 = HS, 3 = some college and 4 = college and higher). edu=mean(edu, na.rm=TRUE) wouldn’t work. There would be an error message that [educ was fitted with type “factor” but type “numeric” was supplied] or something like that.

Thanks for checking in. You simply specify the category you want to use, like edu=”less than” or 1, depending how you have coded this. You can also set continuous variables to something other than the mean, it’s really up to you to choose something meaningful for the comparison.

Hi Didier, I have a follow-up question. Let’s say I want to get the average difference in predicted probability of voting for a treatment vs. control group and I have continuous (income) and categorical covariates (education and gender). Would I set each categorical variable to a mean value (by making it numeric)? I want to get one average estimate for predicted probability holding everything else constant and not separate probability estimates for each level of treatment, gender and education? Thank you!

Sincerely, Confused grad student trying to use R

Hi Ariel,

You don’t have to use the mean value for continuous variables at all. You can compare the outcome at income 15k and 35k. I actually think, such an approach would be preferable if the 15k and 35k are meaningful values (I’m making this up as an example, but say if these were the mean salary for a nurse and a teacher, we can relate to the predicted probabilities we get).

In most situations, the “mean” of a categorical variable is utterly meaningless. If you have a binary variable like gender (I presume you only measure “male” and “female”), you’d normally set it to one of them, say you’d look at women only, and then vary the income to see how this affects the outcome. If you mention education as a categorical variable, I guess you measure it in an ordinal way (say “primary”, “secondary”, “tertiary”). In that case, you’d pick a level that is meaningful (in substantive terms).

Now, whether you should just pick one level (e.g. “women”, “primary”) or multiple ones, depends on what you want to show. If your model is strictly linear, it doesn’t matter, because the “effect” is the same no matter where you start. If the model relaxes linearity in one or another way, you’d no longer expect the same “effect” everywhere, and it can make sense to explore them at different values. Where I’d always choose different levels is when the categories I picked are truly meaningful (say “male teacher”, “female truck driver”). So it depends a bit on what you want to achieve.

This is really helpful! I appreciate how it is details of the explanation.

Here is one more and complex question. Hope you answer it. I am very suffering from this problem. What if I want to see pictured predicted probability of interaction variables with categorical variable and number variable? For example, in the case above, let’s assume firstly, education is a categorical variable (“primary”, “secondary”, “tertiary”). And, I want to see predicted probability of “interaction of primary and income”. In other words, predicted probability of a categorical variable and a number variable as interaction term… Hope you understand what I want.

Look forward to your reply :)

Thank you in advance.

Best regards,

Leroy Choi

Thanks for checking in, Leroy Choi! Yes, the example you describe is clear enough, but I don’t see the problem. You’ll have to decide whether you want to predict the probabilities of income at the three different levels of education, or the level of say primary education at different levels of income. Apart from that (which is a substantive question only you can answer), you can use the approach outlined here. For instance,

newdata2 <- with(voting, data.frame(age = mean(age, na.rm=TRUE), edu="primary", income=1000:10000))

could be a solution for the second possibility (primary education, different levels of income), which you could plot.

Other than that, check out library(effects) and library(visreg) to visualize the interaction terms specifically.

Hi Didier,

This post is very helpful.

I am working with two categorical predictors.

Both variables are binary.

Can I use predicted probabilities with categorical predictors?

Thanks in advance

Yes, there is no reason why this shouldn’t work for categorical predictors.

Hello Didier!

I have used your method to measure and plot predicted probabilities for a glm with a single variable! Thank you for this! I however found an upper bound of CI to be above 1,

how is this possible?

We’re in the world of the model here; the estimate (coefficient) is bound to [0,1], not the CI.

Dear Didier,

Thanks for this blog. I have applied the code to my dataset and it works well. I have a question about an extension of the use of the “predict” function to other questions. In my case, I am looking at whether two different groups have a different probability of survival given the titers of a virus. Two questions: (1) I’d like to have the curves of both plots on the same figure. How can I do that?. (2) Is it possible to test whether the predicted curves are statistically significant? Or is using the CI for both curves sufficient to make that inference?

Thanks in advance,

Maya

Thanks for checking in. For (1), you can use the above code, but when drawing the line of the second group, you simply use “lines” instead of “plot” (i.e. the same way the code above draws the confidence intervals. You’ll probably want to use the “col” or “lty” arguments to differentiate the two curves. Regarding (2), I’m not aware of any test comparing two curves all at once. It’s common practice to look at non-overlapping confidence intervals, though that’s not exactly what you’ve asked for since you don’t model the difference between the curves directly. In most contexts where statistical significance is something values, though, I’d guess non-overlapping confidence intervals will be accepted.

Dear Maya,

Plotting the model visually is very important, because coefficients being significant do not mean that they are significant at every possible value. What is important here is to construct 95% confidence intervals around the estimated curve. Where the interval consists of zero, the related variable is not significant because we cannot reject the null hypothesis in such situation. For a much better and technical explanation, please read

Zelner, Bennet A., 2009, Using Simulation to Interpret Results from Logit, Probit, and Other Nonlinear Models, Strategic Management Journal, 30, 1335-1348.

I hope this helps.

Mel

Dear Didier,

Thank you very much for your blog which is really helpful. I have a question and would be grateful for your help.

I used the predict function to predict the probability of voting for Party A in a past election. I wanted to see how differences in educational attainment affected the probability of voting for Party A. As the database is for an election that has already taken place, I assumed that the predicted probabilities would be equal to the actual proportions of voters who voted for Party A.

For example, the predict function predicted that voters with a high educational attainment had a 24.3% probability of voting for Party A while voters with a low educational attainment had a probability of voting for Party A of 0.6%. I assumed this meant that 24.3% of high-educational attainment voters had actually voted for Party A while only 0.6% of low-educational attainment voters had.

However, when I looked at the actual data, at the ex-post proportions of voters, only 22.6% of high-educational attainment voters actually voted for Party A –instead of 24.3%- while 0% of voters with low educational attainment did –instead of 0.6%.

How come the predicted probabilities don’t match the actual ex-post proportions ? Does this mean that the predict function predicts probabilities a priori, based on a model rather than taking into account the ex-post data ? I’m just wondering if the discrepancy I’m getting is due to the error term of the predict function’s probability model (if there is one) or to something else I should be worrying about.

Thanks in advance !

Emmanuel

Dear Emmanuel,

Yes, the predict() function simply predicts on the basis of the model. That’s not surprising to see differences between the world of the model and real data. The order of magnitude you describe doesn’t sound alarming to me, but how well the model should fit the data is also a matter of the research question.

Ok, thank you so much Didier!