Data Analysis for Social Science: A Friendly and Practical Introduction

Looking to get started with data science, but worried it might be too complicated? A new book by Elena Llaudet and Kosuke Imai has you covered. Data Analysis for Social Science: A Friendly and Practical Introduction is now available as an e-book, and it truly delivers what the title promises: it is friendly and practical. It’s also up to date, focusing on experimental data and causal inference much more than on multiple regression analysis. I don’t think I’ve seen a more accessible introduction to R and RStudio, cheat sheets included!

How to specify RGB colours and transparency in R plots

R handles colours like “red” or “blue” out of the box, but what if we want more precise colours? Enter rgb():

Here’s some simple code to illustrate how to use rgb() in R. Let’s just draw 10 circles in pure red (red = 1 = 100%, green = 0%, blue = 0%).

plot(1:10, rep(1,10), col=rgb(1, 0, 0, 0.5), pch=16)

The additional fourth number here (0.5) is the alpha value, which sets the opacity: 0 is fully transparent and 1 is fully opaque, so 0.5 is 50% transparent.

Let’s add blue points:

points((1:10)+0.05, rep(1,10), col=rgb(0, 0, 1, 0.5), pch=16)

For intermediate colours, we need to specify the R, G, and B values as fractions between 0 and 1. Colour palettes typically list values from 0 to 255 and do not include the division by 255, so we add it ourselves:

rgb(68/255, 119/255, 170/255, 0.5)
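
As a variation, rgb() can do the scaling itself through its maxColorValue argument, and adjustcolor() can add transparency to a colour that already exists, including named ones. The following is only a small sketch added for illustration; the variable names (col_solid, col_half, red_half) are made up.

```r
# Same colour as rgb(68/255, 119/255, 170/255), but rgb() does the scaling
col_solid <- rgb(68, 119, 170, maxColorValue = 255)
col_solid  # "#4477AA"

# adjustcolor() makes an existing colour semi-transparent (alpha.f = 0.5)
col_half <- adjustcolor(col_solid, alpha.f = 0.5)
red_half <- adjustcolor("red", alpha.f = 0.5)

# Overlapping points show the transparency
plot(1:10, rep(1, 10), col = col_half, pch = 16)
points((1:10) + 0.05, rep(1, 10), col = red_half, pch = 16)
```

Note that when maxColorValue = 255 is set, an alpha value passed to rgb() is expected on the 0 to 255 scale as well, which is why the sketch uses adjustcolor() for the transparency instead.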

Calculating VIF by hand

A widespread measure of multicollinearity is the VIF (short for variance inflation factor). Multicollinearity describes the situation when the predictor variables in a multiple regression model are highly correlated, which is usually not desirable (assuming you haven’t gone Bayesian yet).

In R, the VIF can easily be calculated with the vif() function in the car package. It’s actually not difficult to do it by hand, which incidentally helps to understand what we measure with the VIF, why there is no different VIF for logistic regression models, and why the VIF is better than looking at bivariate correlations between predictors.
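
To illustrate the last point, here is a small sketch with made-up data, using the formula VIF = 1/(1-R²). The variables p1 to p4, the seed, and the sample size are assumptions chosen for this illustration only. One predictor is built as (almost) the sum of three others: no single pairwise correlation looks alarming, yet the VIF is very large.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
n  <- 200
p1 <- rnorm(n)
p2 <- rnorm(n)
p3 <- rnorm(n)
# p4 is almost a linear combination of the other three predictors
p4 <- p1 + p2 + p3 + rnorm(n, sd = 0.1)

# Pairwise correlations with p4 are only moderate (about 0.58 in expectation)
round(cor(cbind(p1, p2, p3, p4)), 2)

# Jointly, however, the other predictors explain p4 almost perfectly
r2 <- summary(lm(p4 ~ p1 + p2 + p3))$r.squared
vif_p4 <- 1 / (1 - r2)
vif_p4
```

The bivariate correlations only compare two predictors at a time, while the VIF asks how well one predictor is explained by all the others together.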

We start with some random data to run the multiple regression model. Here we create one outcome (y) and three predictor variables (x, z, a), full of random numbers. That’ll do for a demonstration.

x = runif(50) 
y = runif(50)
z = runif(50)
a = runif(50)

Here’s a simple OLS model:

m = lm(y ~ x + z + a)

If you have the car package installed, you can easily calculate the VIF:

library(car)
vif(m)

To do it by hand, though, we run a linear regression model (OLS) for each of the predictors. Here’s the code for predictor x. One of the predictors becomes the outcome variable (here x), and the other predictors remain predictors. The variable used as the outcome previously (y) does not appear here.

mx = lm(x ~ z + a)

The VIF is simply 1/(1-R²) of this model. In R, we can run the following:

1/(1-summary(mx)$r.squared)

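One way to write this calculation in R, and to repeat it for every predictor rather than just x, is sketched below. The helper vif_by_hand() is a hypothetical name introduced here, not a function from car or base R, and the predictors are recreated so the block runs on its own.

```r
# Hypothetical helper: VIF for every column of a data frame of predictors
# (needs at least two numeric predictors)
vif_by_hand <- function(predictors) {
  sapply(names(predictors), function(v) {
    # Regress predictor v on all remaining predictors
    f  <- reformulate(setdiff(names(predictors), v), response = v)
    r2 <- summary(lm(f, data = predictors))$r.squared
    1 / (1 - r2)
  })
}

# Random predictors as in the example above
x <- runif(50)
z <- runif(50)
a <- runif(50)

vifs <- vif_by_hand(data.frame(x, z, a))
vifs
```

The outcome y never enters the calculation, which is why the VIF is the same whether the predictors end up in a linear or a logistic regression. For numeric predictors like these, the values should match what vif(m) from car reports.
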
Why I (also) teach using R/RStudio

My colleagues are sometimes surprised to learn that I teach statistics using SPSS and R/RStudio in parallel. (Part of this is due to a misconception that R is hard to learn, ignoring that there are more difficult problems like proper model specifications and interpretation of results.) In my opinion, there are many benefits in doing so; here’s an unordered (and incomplete) list:
– introduction to a statistics package that remains available after students leave university and no longer have access to the SPSS site licence (between jobs, moving to another university, out of academia)
– exposure to a different paradigm, making the shift to other software like Stata or SAS appear less threatening
– understanding that it doesn’t matter what package we use for basic statistics (we could even do it by hand)
– that line on the CV
– overcoming limitations in SPSS (ever tried to plot an interaction effect the way we want it?)
– ensuring that those who want to progress to more advanced (contemporary) methods actually can (being “future ready”)
– encouraging a mindset that we are in control of the analyses, not the software package

At the same time, I acknowledge that many students have been exposed to SPSS before and feel more at ease when they can see the menu bar. (And the day the university gets rid of that site licence, PSPP will do nicely to work in parallel with R/RStudio.)

Is PSPP a replacement for SPSS?

PSPP is sometimes touted as a replacement for SPSS (including by its creators). Well, it isn’t (this is often the case with open source alternatives; the ambition and reality do not quite match). By stating plainly that PSPP is not a replacement for SPSS, I don’t mean to dismiss PSPP.

First off, PSPP is under active development, and getting hold of the latest version can be a bit difficult. For Windows, this site often has the most up-to-date version; for Linux/Debian you’ll need to be on an “unstable” release or compile your own (which I doubt many will want to do given that we’re looking at an SPSS replacement, not R or Octave).

Second, recent releases cover many basic functions needed for an introductory statistics course. The GUI frequently lags a bit behind the underlying capability, so some functions are only available using SYNTAX. Oddly enough, the PSPP team copy the SPSS interface quite well, including things that could readily be improved (e.g. why do we have tabs for the “Data View” and the “Variable View”, but a separate window for the results or syntax? Why mix the two?).

So PSPP can readily do tables, ANOVA, linear and logistic regressions, and recoding of variables. Unfortunately, and this is why PSPP is not a replacement even for basic SPSS users, there are bits and pieces missing even in the basic functions. On the positive side, PSPP has a cleaner interface than SPSS; on the negative side, some features are just not there. Unless users follow a course designed specifically with PSPP in mind, they will frequently hit a wall. The same is the case for SYNTAX. Users will be able to run SPSS syntax with no problem, as long as PSPP has the commands implemented. Again, when using code from the many websites helping SPSS users, PSPP users will unfortunately hit a wall quite often.

What do I mean by bits and pieces missing? Let’s take a linear regression. It’s there, the familiar box with arrows to choose variables. Now I may want some multicollinearity statistics, too. Ah, sorry, they don’t exist yet. So I can build a model, but do not even have one of the most basic means to check whether it is any good. For this reason I am not surprised that there are not many introductions to statistics using PSPP… it’s just not there yet.

One thing I missed a lot is that PSPP does not remember the last input. So if I run a regression and want to add another variable, I’ll have to start from scratch in PSPP, entering each variable. Graphing is lacking or very poor.

With the advancements in RStudio, R Commander, etc., I sometimes wonder whether PSPP is just advancing too slowly. Having said all this, I want to end on a positive note. PSPP has become quite stable in recent releases; its price tag is hard to beat, and it has the moral superiority of being truly open source. And finally, it is fast, much faster than SPSS!