Sometimes we need to run a regression analysis on a subset or sub-sample. That’s quite simple to do in R. All we need is the subset() function. Let’s look at a linear regression:
lm(y ~ x + z, data=myData)
Rather than run the regression on all of the data, let’s do it for only women, or only people with a certain characteristic:
lm(y ~ x + z, data=subset(myData, sex=="female"))
lm(y ~ x + z, data=subset(myData, age > 30))
The subset() function takes the data set as its first argument and, as its second, a logical condition that identifies which rows to keep.
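To see this in action without real data, here is a minimal sketch on simulated data. The values are made up for illustration; only the names myData, y, x, z, sex and age come from the examples above.

```r
# Simulated data for illustration only; column names follow the examples above.
set.seed(1)
n <- 200
myData <- data.frame(
  x   = rnorm(n),
  z   = rnorm(n),
  sex = sample(c("female", "male"), n, replace = TRUE),
  age = sample(18:80, n, replace = TRUE)
)
myData$y <- 1 + 2 * myData$x - myData$z + rnorm(n)

# Full sample versus the female sub-sample
m_all    <- lm(y ~ x + z, data = myData)
m_female <- lm(y ~ x + z, data = subset(myData, sex == "female"))

# nobs() reports how many rows each model actually used
nobs(m_all)
nobs(m_female)
sum(myData$sex == "female")  # should match nobs(m_female)
```

Comparing nobs() across the two models is a quick way to confirm that the condition selected the rows you intended.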
thanks, that helped
Thanks for checking in. I’m glad you found this useful!
Hi,
Is it possible to specify both “sex== female” and “age>30” at the same time? Is there a limit to how many specifications you can add?
For example, if I wanted to include data from only one year (from the year column) and only females (from the gender column) and weight >50 (from the weight column) and regress weight on females born in the specified year.
If this is possible, how would one write it?
Yes, you can combine as many criteria as you want. What you need is the & operator for AND, and perhaps the | operator for OR. See https://www.statmethods.net/management/operators.html for a quick overview. So you could run:
lm(y ~ x + z, data=subset(myData, sex=="female" & age>30))
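As a rough sketch of how several criteria can be chained, the example below uses simulated data with a hypothetical year column added to the variables from the post. The specific year (2002) and the cut-offs are assumptions chosen only for illustration.

```r
# Simulated data; the `year` column and all values are hypothetical.
set.seed(2)
n <- 300
myData <- data.frame(
  x    = rnorm(n),
  z    = rnorm(n),
  sex  = sample(c("female", "male"), n, replace = TRUE),
  age  = sample(18:80, n, replace = TRUE),
  year = sample(2000:2004, n, replace = TRUE)
)
myData$y <- 1 + 2 * myData$x - myData$z + rnorm(n)

# AND: every condition must hold for a row to be kept
sub_and <- subset(myData, sex == "female" & age > 30 & year == 2002)
m_and   <- lm(y ~ x + z, data = sub_and)

# OR inside AND: parentheses make the grouping explicit
sub_or <- subset(myData, sex == "female" & (age > 30 | year == 2002))
m_or   <- lm(y ~ x + z, data = sub_or)

nrow(sub_and)
nrow(sub_or)
```

Building the subset first and checking nrow() before fitting helps catch conditions that are so strict that hardly any observations remain.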
Thank you, that’s very helpful!
Thanks for your article.
I do have a large data file with 67 countries. How can I run multiple regressions on different countries? Or how can I run multiple regressions on different periods? There are some long ways to follow in [R] and get the results I am looking for, but I would like to learn the shortcut. Thanks
Thanks for checking in. If I understand you right, you want to run (e.g.) 67 regression models from your dataset? You’ll need a loop and the assign function as described here: https://druedin.com/2015/11/28/same-explanatory-variables-multiple-dependent-variables-in-r/
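For readers who want the general idea before following the link, here is one possible sketch. It stores the models in a named list rather than using assign(), which is my substitution and not necessarily what the linked post does. The data frame d, its country column and the three country labels are hypothetical.

```r
# Simulated panel with three hypothetical countries
set.seed(3)
d <- data.frame(
  country = rep(c("A", "B", "C"), each = 50),
  x       = rnorm(150)
)
d$y <- 0.5 * d$x + rnorm(150)

# One model per country, collected in a named list.
# The loop variable must not share a name with a column in d,
# otherwise subset() would compare the column with itself.
models <- list()
for (cty in unique(d$country)) {
  models[[cty]] <- lm(y ~ x, data = subset(d, country == cty))
}

# Coefficients side by side: one column per country
sapply(models, coef)
```

The same pattern should work for periods: replace country with the variable that identifies the period.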
Thanks, Didier! That was really helpful. However, I am still struggling to combine the “subset” argument with the “weights” argument (the variable for weights is not the one I’m using to subset). When I try to do that, I receive an error message telling me vectors have different lengths. I’d appreciate it if you could help me with that.
Thanks for checking in. Without further details, I cannot replicate your issue. I have just double checked:
m1 = lm(y ~ x, data=d, weights=weight)
m2 = lm(y ~ x, data=subset(d, z==1), weights=weight)
and both regression models work as expected.
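Here is a self-contained version of that check on simulated data (all values are made up). It also shows the subset argument that lm() itself accepts, and one plausible way to trigger a length error: passing the weights as a full-length vector such as d$weight alongside subsetted data. That last part is a guess, since the original code was not posted.

```r
# Simulated data; d, z and weight follow the names used in the reply above.
set.seed(4)
d <- data.frame(
  x      = rnorm(100),
  z      = rep(0:1, 50),
  weight = runif(100, 0.5, 2)
)
d$y <- 1 + d$x + rnorm(100)

# Weights are looked up inside the subsetted data, so the lengths match
m2 <- lm(y ~ x, data = subset(d, z == 1), weights = weight)

# Alternative: let lm() do the subsetting through its own subset argument
m3 <- lm(y ~ x, data = d, subset = z == 1, weights = weight)
all.equal(coef(m2), coef(m3))

# Possible cause of a length error (an assumption): weights taken from the
# full data frame (100 values) while the data has only 50 rows
res <- tryCatch(
  lm(y ~ x, data = subset(d, z == 1), weights = d$weight),
  error = function(e) conditionMessage(e)
)
res
```

Referring to the weights by bare column name, rather than with d$, lets R look them up in whatever data the model is given.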
It worked now. I was probably insisting on some small, embarrassing mistake that was preventing R from running it correctly. Thank you once again.
I’m glad to hear this! If it makes you feel better, I once spent a day hunting for what turned out to be a misplaced comma in my R code!
Hi Didier!
I am trying to run a logistic regression (family=binomial) using a subset of the X variable.
mod1_sp<-glm(INF~SP, family=binomial, data=subset(PA, SP=="achatinus"))
INF(infected)=1,0 dependent variable
SP(species)=categorical, independent variable
I want to work with just one species, but when I subset the variable in the model, it gives me this error:
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
Any idea of what I am doing wrong? thank you!!!
Is there any variance in your variables within the subset? I guess the easiest way is to create the subset separately with PA_sub = subset(PA, SP=="achatinus"), and then explore the variables with table(PA_sub$INF) etc.
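A small simulation may make the problem easier to see. Once only one species is kept, SP has a single remaining level, so it cannot serve as a predictor in the same model. The data below are simulated, and the second species label "other" is a placeholder I made up; PA, SP, INF and "achatinus" come from the question.

```r
# Simulated data; "other" is a hypothetical second species.
set.seed(5)
PA <- data.frame(
  SP  = factor(sample(c("achatinus", "other"), 100, replace = TRUE)),
  INF = rbinom(100, 1, 0.3)
)

PA_sub <- subset(PA, SP == "achatinus")

# Explore the subset: SP no longer varies
table(PA_sub$SP)
table(PA_sub$INF)
nlevels(droplevels(PA_sub)$SP)

# Using SP as a predictor within the subset fails; capture the message
err <- tryCatch(
  glm(INF ~ SP, family = binomial, data = PA_sub),
  error = function(e) conditionMessage(e)
)
err

# With a single species there is nothing to compare, so one option is an
# intercept-only model, or a model with other predictors if available
mod0 <- glm(INF ~ 1, family = binomial, data = PA_sub)
```

If the aim is to compare infection across species, the full data set with SP as a predictor is probably the better starting point; subsetting makes sense when other explanatory variables are of interest within one species.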