Predicted Probabilities in R

I got recently asked how to calculate predicted probabilities in R. Of course we could do this by hand, but often it’s preferable to do this in R. The command we need is predict(); here’s how to use it.

First we need to run a regression model. In this example, I predict whether a person voted in the previous election (binary dependent variable) with variables on education, income, and age. I use logistic regression:

m <- glm(voted ~ edu + income + age, family="binomial", data=voting)

Now comes the not so obvious part: we need to specify the cases we are interested in. We’re using with() and data.frame() to do so. In this case, I’m interested in the predicted probabilities of two people, both with average (mean) education and income, but one aged 20 and the other aged 60. In this step, we need to cover all the independent variables.

newdata <- with(voting, data.frame(age = c(20, 60), edu=mean(edu, na.rm=TRUE), income=mean(income, na.rm=TRUE)))

Finally we can get the predictions:

predict(m, newdata, type="response")

That’s our model m and newdata we’ve just specified. type="response" calculates the predicted probabilities. We get

        1         2 
0.3551121 0.6362611 

So 36% for the person aged 20, and 64% for the person aged 60.

Often, however, a picture will be more useful. The logic is the same. We use the same model, and ask R to predict for every age from 18 to 90 (I guess you don’t want to do this by hand).

newdata2 <- with(voting, data.frame(age = 18:90, edu=mean(edu, na.rm=TRUE), income=mean(income, na.rm=TRUE)))

We once again use predict(), but this time also ask for standard errors.

preds <- predict(m, newdata2, type="response", se.fit=TRUE)

For the plot, I want the predicted probabilities +/- 1.96 standard errors (that’s the 95% confidence interval; use qnorm(0.975) if 1.96 is not precise enough). I extract and calculate the values for each line separately to better understand the code.

predf <- preds$fit # predicted
lower <- preds$fit - (1.96*preds$se.fit) # lower bounds
upper <- preds$fit + (1.96*preds$se.fit) # upper bounds

Finally the plot: It’s a simple line plot of the predicted probabilities plotted against the age (18 to 90). I choose not to show the borders of the plot, and then use lines() twice to add the lower and upper bounds.

plot(18:90, predf, type="l", ylab="Predicted Probability to Vote", xlab="Age", bty="n")
lines(18:90, lower, lty=2)
lines(18:90, upper, lty=2)

predweb

27 Replies to “Predicted Probabilities in R”

  1. Hi Didier,
    This post is very Nice.
    But i have one question, what is the difference between what you do and the visreg function, is exactly the same or existe differences? How can I interpret the Visreg function with continous and factor variables?
    Thanks in advance

    1. It seems you can pretty much do the same with library(visreg). There’s the argument to transform log(odds) to probabilities in visreg, too (scale=”response”). I guess doing it manually, as I do it in the post gives you easier control over the cases you compare. Put differently, I can choose specific cases with substantive meaning. I’m not sure I understand your second question on continuous and factor variables. I’d generally recommend using predict() to calculate the probabilities for specific (substantially meaningful) cases.

      1. Thank you!!
        The last question, when you make the plot of visreg for one variable that probability is estimated for the mean of the other variables, or for the median? Visreg calculate that probability for the mean or for the median of the sample?

        I don´t know why the AMEs made with margins function do not coincide with the jumps of the plots made with visreg function. I think it is because the AMEs are obtained with the sample mean for the rest of the variables and Visreg for the median.

        Thank you very very much

      2. You’ll have to check the documentation of the visreg() function to get an answer to that, I’m afraid. As you note, there’s more than one way to do this.

  2. Dear Didier,
    Thank you so much for the post. Here is my question
    What if I want to plot predicted probability of interaction variables between categoric variable [gender] and 3 level variable [income].
    vote[Yes=1,No]~ gender[male=1,female=0] + income [low=0,middle=1,high=2] + age + education [high=1,low=0] +
    location [urban=1, rural=0]+ gender [ref=Female] * income[ref=low].The interaction term is significant.
    Hope this is clear. I look forward to hear from you :-)
    Henry

    1. I’m not sure I understand your question, Henry. If I understand you right, your variables are categorical (2 categories for gender, 3 categories for income). You can easily use the predicted values for each of the (2×3) possibilities. If you want to use a graph with so few actual predictions is another question: you’d have two lines defined by 3 income levels each, which is not really much for a line.

  3. Thank you so much!!!
    I spent so long looking for how to find confidence intervals for predicted probabilities and it turned out to be really easy. Really appreciate this.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.