How to run a regression on a subset in R

Sometimes we need to run a regression analysis on a subset or sub-sample. That’s quite simple to do in R. All we need is the subset() function. Let’s look at a linear regression:

lm(y ~ x + z, data=myData)

Rather than run the regression on all of the data, let’s do it for only women, or only people with a certain characteristic:

lm(y ~ x + z, data=subset(myData, sex=="female"))

lm(y ~ x + z, data=subset(myData, age > 30))

The subset() function takes the data set as its first argument and, as its second, a logical condition that selects the rows to keep.
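
Here is a minimal sketch you can paste into R to try this out. The data are simulated (the values in myData are made up for illustration), and the second fit uses the subset argument that lm() itself provides, which should give the same result.

```r
# Simulated data: column names mirror the examples above, values are made up
set.seed(1)
n <- 200
myData <- data.frame(
  x   = rnorm(n),
  z   = rnorm(n),
  sex = sample(c("female", "male"), n, replace = TRUE),
  age = sample(18:80, n, replace = TRUE)
)
myData$y <- 1 + 2 * myData$x - myData$z + rnorm(n)

# Option 1: subset the data frame before it reaches lm()
fit_a <- lm(y ~ x + z, data = subset(myData, sex == "female"))

# Option 2: pass the condition to lm()'s own subset argument
fit_b <- lm(y ~ x + z, data = myData, subset = sex == "female")

# Both fits use the same rows, so the coefficients agree
all.equal(coef(fit_a), coef(fit_b))
nobs(fit_a)
```

Checking nobs() after fitting is a quick way to confirm that the model really used only the rows you intended.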

13 Replies to “How to run a regression on a subset in R”

  1. Hi,

    Is it possible to specify both “sex== female” and “age>30” at the same time? Is there a limit to how many specifications you can add?

    For example, suppose I wanted to include data from only one year (from the year column), only females (from the gender column), and weight > 50 (from the weight column), and then run a regression on weight for females born in the specified year.

    If this is possible, how would one write it?
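
Conditions like these can be combined with & (and) or | (or) inside a single subset() call, and there is no fixed limit on how many you chain. The sketch below assumes a data frame d with year, gender and weight columns as described in the question. The height column and the value 1982 are hypothetical and only there to give the regression a predictor and a year to filter on.

```r
# Simulated data; height is a made-up predictor for illustration
set.seed(2)
n <- 300
d <- data.frame(
  year   = sample(1980:1985, n, replace = TRUE),
  gender = sample(c("female", "male"), n, replace = TRUE),
  weight = rnorm(n, mean = 65, sd = 12),
  height = rnorm(n, mean = 170, sd = 10)
)

# Keep rows that satisfy all three conditions at once
d_sub <- subset(d, year == 1982 & gender == "female" & weight > 50)

# Regress weight on height within that subset
fit <- lm(weight ~ height, data = d_sub)
nrow(d_sub)
```

Note that gender and year are constant within d_sub, so they can no longer be used as predictors in this model.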

  2. Thanks for your article.
    I have a large data file with 67 countries. How can I run multiple regressions on different countries? Or how can I run multiple regressions on different periods? There are some long ways to get the results I am looking for in [R], but I would like to learn the shortcut. Thanks
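
One possible shortcut is to split the data frame by the grouping variable and fit the same model to each piece with lapply(). The sketch below uses a made-up data frame called panel with three countries. The names panel, country, x and y are assumptions for illustration, and the same pattern should work with a period column in place of country.

```r
# Simulated panel with three countries; all names and values are made up
set.seed(3)
panel <- data.frame(
  country = rep(c("A", "B", "C"), each = 50),
  x       = rnorm(150)
)
panel$y <- 0.5 + 1.5 * panel$x + rnorm(150)

# Split the data frame by country, then fit the same model to each piece
fits <- lapply(split(panel, panel$country), function(d) lm(y ~ x, data = d))

# One row of coefficients per country
t(sapply(fits, coef))
```

Each element of fits is an ordinary lm object, so summary(fits[["A"]]) works as usual.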

  3. Thanks, Didier! That was really helpful. However, I am still struggling to combine the “subset” argument with the “weights” argument (the variable for weights is not the one I’m using to subset). When I try to do that, I receive an error message telling me vectors have different lengths. I’d appreciate it if you could help me with that.

    1. Thanks for checking in. Without further details, I cannot replicate your issue. I have just double checked:

      m1 = lm(y ~ x, data=d, weights=weight)
      m2 = lm(y ~ x, data=subset(d, z==1), weights=weight)

      and both regression models work as expected.

      1. It worked now. I was probably insisting on some small, embarrassing mistake that was preventing R from running it correctly. Thank you once again.
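
One way a "lengths differ" error like this can arise is when the weights are stored in a separate vector rather than as a column of the data frame. The vector then keeps its full length while the subset has fewer rows. The sketch below uses simulated data with the names d, z and weight from the reply above. The vector w is hypothetical, and this may not be what happened in the commenter's case.

```r
# Simulated data; values are made up
set.seed(4)
d <- data.frame(x = rnorm(100), z = rep(0:1, 50), weight = runif(100, 0.5, 2))
d$y <- 1 + d$x + rnorm(100)

# Works: weight is a column of d, so it is subset together with y and x
m2 <- lm(y ~ x, data = subset(d, z == 1), weights = weight)

# Fails: w lives outside the data frame and still has 100 elements,
# while the subset has only 50 rows
w <- d$weight
res <- try(lm(y ~ x, data = subset(d, z == 1), weights = w), silent = TRUE)
inherits(res, "try-error")

# Also works: lm()'s subset argument is applied to the weights as well
m3 <- lm(y ~ x, data = d, subset = z == 1, weights = w)
all.equal(coef(m2), coef(m3))
```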

  4. Hi Didier!

    I am trying to run a logistic regression (family=binomial) using a subset of the X variable.

    mod1_sp<-glm(INF~SP, family=binomial, data=subset(PA, SP=="achatinus"))

    INF(infected)=1,0 dependent variable
    SP(species)=categorical, independent variable

    I want to work with just one species, but when I subset the variable in the model, it gives me this error:

    Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
    contrasts can be applied only to factors with 2 or more levels

    Any idea of what I am doing wrong? thank you!!!

    1. Is there any variation in your variables within the subset? I guess the easiest approach is to create the subset separately, PA_sub = subset(PA, SP=="achatinus"), and then explore the variables with table(PA_sub$INF) etc.
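
The check suggested above can be sketched as follows. The data are simulated, reusing the names PA, SP and INF from the question. After subsetting to a single species, SP takes only one value, which appears to be what triggers the contrasts error when SP is also used as a predictor. The intercept-only model at the end is one possible alternative, not necessarily what the commenter needs.

```r
# Simulated data; the second species name and all values are made up
set.seed(5)
PA <- data.frame(
  SP  = factor(sample(c("achatinus", "other"), 120, replace = TRUE)),
  INF = rbinom(120, 1, 0.4)
)
PA_sub <- subset(PA, SP == "achatinus")

# Only one species remains, so SP no longer varies in the subset
table(droplevels(PA_sub$SP))
table(PA_sub$INF)

# Using SP as a predictor within the subset reproduces the error
res <- try(glm(INF ~ SP, family = binomial, data = PA_sub), silent = TRUE)
inherits(res, "try-error")

# An intercept-only model estimates the infection rate for this species
mod0 <- glm(INF ~ 1, family = binomial, data = PA_sub)
plogis(coef(mod0))
```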
