# How to run a regression on a subset in R

Sometimes we need to run a regression analysis on a subset or sub-sample of the data. That’s quite simple to do in R. All we need is the `subset()` function. Let’s look at a linear regression:

`lm(y ~ x + z, data=myData)`

Rather than run the regression on all of the data, let’s do it for only women, or only people with a certain characteristic:

`lm(y ~ x + z, data=subset(myData, sex=="female"))`

`lm(y ~ x + z, data=subset(myData, age > 30))`

The `subset()` function takes the data set as its first argument and a logical condition as its second; only the rows for which the condition is true are kept.
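To try this without your own data, here is a minimal sketch on simulated data. The data frame and all of its columns (`x`, `z`, `sex`, `age`, `y`) are made up for illustration. It also shows that `lm()` has its own `subset` argument, which should select the same rows as wrapping the data in `subset()`.

```r
# Simulated data; every column name and value here is an assumption for illustration
set.seed(1)
n <- 200
myData <- data.frame(
  x   = rnorm(n),
  z   = rnorm(n),
  sex = sample(c("female", "male"), n, replace = TRUE),
  age = sample(18:80, n, replace = TRUE)
)
myData$y <- 1 + 2 * myData$x - myData$z + rnorm(n)

# Subset inside the data argument, as in the post
m_a <- lm(y ~ x + z, data = subset(myData, sex == "female"))

# Select the same rows through lm()'s own subset argument
m_b <- lm(y ~ x + z, data = myData, subset = sex == "female")

# Both fits should use only the rows where sex is "female"
nobs(m_a) == sum(myData$sex == "female")
all.equal(coef(m_a), coef(m_b))
```

Checking `nobs()` after fitting is a quick way to confirm that the model really used the subset you had in mind.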

## 13 Replies to “How to run a regression on a subset in R”

1. Ilias says:

thanks, that helped

2. Didier Ruedin says:

Thanks for checking in. I’m glad you found this useful!

3. Strollin says:

Hi,

Is it possible to specify both “sex== female” and “age>30” at the same time? Is there a limit to how many specifications you can add?

For example, if I wanted to include data from only one year (from the year column) and only females (from the gender column) and weight >50 (from the weight column) and regress weight on females born in the specified year.

If this is possible, how would one write it?

1. Didier Ruedin says:

Yes, you can combine as many criteria as you want. What you need is the & operator for AND, and perhaps the | operator for OR. See https://www.statmethods.net/management/operators.html for a quick overview. So you could run:

`lm(y ~ x + z, data=subset(myData, sex=="female" & age>30))`
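For the case described in the question (one year, only females, weight above 50), a sketch could look like the following. The column names `year`, `gender`, `weight` and `height`, the year 1982, and the simulated values are all assumptions for illustration.

```r
# Simulated data; column names and values are assumptions for illustration
set.seed(2)
n <- 300
myData <- data.frame(
  year   = sample(1980:1985, n, replace = TRUE),
  gender = sample(c("female", "male"), n, replace = TRUE),
  height = rnorm(n, mean = 170, sd = 10)
)
myData$weight <- 60 + 0.5 * (myData$height - 170) + rnorm(n, sd = 8)

# Three criteria joined with & (AND): all of them must hold for a row to be kept
sub <- subset(myData, year == 1982 & gender == "female" & weight > 50)
m <- lm(weight ~ height, data = sub)

# The | operator (OR) keeps rows that meet at least one of the criteria
sub_or <- subset(myData, year == 1982 | year == 1983)

nrow(sub)
summary(m)
```

Creating the subset as a separate object first makes it easy to check with `nrow()` how many observations are left before running the regression.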

1. Strollin says:

4. Hisham Alhawal says:

I do have a large data file with 67 countries. How can I run multiple regressions on different countries? Or how can I run multiple regressions on different periods? There are some long ways to follow in [R] and get the results I am looking for, but I would like to learn the shortcut. Thanks
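One common shortcut for this kind of question is to split the data by the grouping variable and fit the same model to each piece. The following is only a sketch on simulated data; the data frame `d` and its columns `country`, `x` and `y` are assumptions for illustration. The same idea should work for periods by splitting on a period column instead.

```r
# Simulated data with three made-up countries; names are assumptions for illustration
set.seed(3)
d <- data.frame(
  country = rep(c("A", "B", "C"), each = 50),
  x       = rnorm(150)
)
d$y <- 1 + 0.5 * d$x + rnorm(150)

# split() returns one data frame per country; lapply() fits the model to each
models <- lapply(split(d, d$country), function(part) lm(y ~ x, data = part))

# Collect the coefficients in a matrix with one row per country
t(sapply(models, coef))

# Individual models are available by name
summary(models[["B"]])
```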

5. Leonardo says:

Thanks, Didier! That was really helpful. However, I am still struggling to combine the “subset” argument with the “weights” argument (the variable for weights is not the one I’m using to subset). When I try to do that, I receive an error message telling me vectors have different lengths. I’d appreciate it if you could help me with that.

1. Didier Ruedin says:

Thanks for checking in. Without further details, I cannot replicate your issue. I have just double checked:

`m1 = lm(y ~ x, data=d, weights=weight)`

`m2 = lm(y ~ x, data=subset(d, z==1), weights=weight)`

and both regression models work as expected.
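One way an error about differing lengths can arise (an assumption, not necessarily what happened here) is when the weights are passed as a full-length vector such as `d$weight` while the data have been subsetted. A sketch on simulated data, with made-up column names:

```r
# Simulated data; column names and values are assumptions for illustration
set.seed(4)
d <- data.frame(
  x      = rnorm(100),
  z      = rep(0:1, 50),
  weight = runif(100, min = 0.5, max = 2)
)
d$y <- 1 + d$x + rnorm(100)

# Works: the bare name weight is looked up in the subsetted data frame
m2 <- lm(y ~ x, data = subset(d, z == 1), weights = weight)

# Fails: d$weight still has 100 values, but the subset has only 50 rows
res <- tryCatch(
  lm(y ~ x, data = subset(d, z == 1), weights = d$weight),
  error = function(e) conditionMessage(e)
)
res  # the error message, captured as text
```

Referring to the weights by their bare column name keeps them aligned with whatever rows the subset contains.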

1. leoscampos says:

It worked now. I was probably insisting on some small, embarrassing mistake that was preventing R from running it correctly. Thank you once again.

2. Didier Ruedin says:

I’m glad to hear this! If it makes you feel better, I once spent a day hunting down what turned out to be a misplaced comma in my R code!

6. Mar Moretta says:

Hi Didier!

I am trying to run a logistic regression (family=binomial) using a subset of the X variable.

mod1_sp<-glm(INF~SP, family=binomial, data=subset(PA, SP=="achatinus"))

INF(infected)=1,0 dependent variable
SP(species)=categorical, independent variable

I want to work with just one species, but when I’m subsetting the variable in the model it’s giving me this error:

Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels

Any idea of what I am doing wrong? thank you!!!

1. Didier Ruedin says:

Is there variance in your variables in the subset? I guess the easiest approach is to create the subset separately, `PA_sub = subset(PA, SP=="achatinus")`, and then explore the variables with `table(PA_sub$INF)` etc.
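To illustrate the check: after subsetting to a single species, `SP` no longer varies, so it cannot serve as a predictor in the same model. The sketch below uses simulated data; the values, the second species label `"other"`, and the intercept-only model at the end are assumptions for illustration, not a recommendation for the actual analysis.

```r
# Simulated stand-in for PA; values and the "other" label are made up
set.seed(5)
PA <- data.frame(
  SP  = factor(sample(c("achatinus", "other"), 120, replace = TRUE)),
  INF = rbinom(120, size = 1, prob = 0.4)
)

# Create the subset separately and drop factor levels that no longer occur
PA_sub <- droplevels(subset(PA, SP == "achatinus"))
nlevels(PA_sub$SP)  # 1: only one species is left
table(PA_sub$INF)   # the outcome itself still needs both 0s and 1s

# Using the now-constant SP as a predictor fails; capture the message as text
res <- tryCatch(
  glm(INF ~ SP, family = binomial, data = PA_sub),
  error = function(e) conditionMessage(e)
)
res

# Within one species, a model without SP can still be fitted
m0 <- glm(INF ~ 1, family = binomial, data = PA_sub)
```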
