Data Analysis for Social Science: A Friendly and Practical Introduction

Looking to get started with data science, but worried it might be too complicated? A new book by Elena Llaudet and Kosuke Imai has you covered. Data Analysis for Social Science: A Friendly and Practical Introduction is now available as an e-book, and it truly delivers what the title promises: it is friendly and practical. It’s also up to date, focusing much more on experimental data and causal inference than on multiple regression analysis. I don’t think I’ve seen a more accessible introduction to R and RStudio, cheat sheets included!

How to specify RGB colours and transparency in R plots

R handles colours like “red” or “blue” out of the box, but what if we want more precise colours? Enter rgb().

Here’s some simple code to illustrate how to use rgb() in R. Let’s draw 10 circles in pure red (red = 1 = 100%, green = 0%, blue = 0%):

plot(1:10, rep(1,10), col=rgb(1, 0, 0, 0.5), pch=16)

The fourth argument (0.5) is the alpha value, which sets the opacity: 0 is fully transparent and 1 is fully opaque, so 0.5 is 50% transparent.

Let’s add blue points:

points((1:10)+0.05, rep(1,10), col=rgb(0, 0, 1, 0.5), pch=16)

For intermediary colours, we need to specify the R, G, and B values as fractions. Colour palettes typically list each channel on a scale from 0 to 255, so we have to divide each value by 255 ourselves:

rgb(68/255, 119/255, 170/255, 0.5)
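
If you would rather keep the 0 to 255 values as they appear on the palette, rgb() also takes a maxColorValue argument. Here is a minimal sketch using the same colour as above. The variable names are only for illustration, and note that the alpha value then has to be given on the 0 to 255 scale as well:

```r
# Same colour as above, once as fractions and once on the 0-255 scale.
col_fraction = rgb(68/255, 119/255, 170/255)
col_255 = rgb(68, 119, 170, maxColorValue=255)

# Both calls return the same hex string.
print(col_fraction)
print(col_255)

# With maxColorValue=255, alpha is on the 0-255 scale too
# (128 is roughly 50% transparent).
col_255_alpha = rgb(68, 119, 170, 128, maxColorValue=255)
print(col_255_alpha)
```

Any of these values can be passed to col= in plot() or points() just like the rgb() calls above.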

Combining two strings in R when they are in a vector

Joining two strings in R can readily be done with the paste() command. However, if our strings are the elements of a single vector, paste() does not join them as we might expect.

To illustrate this, assume a vector of characters or symbols, and a set of values you want to convert into these characters. The characters “A” to “F” may make the example easier to follow, but it was the bars that led me to this problem: _ ▁ ▂ ▃ ▅ ▇. In the code, I use the Unicode escapes for these bars.

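# Letters, which are easier to follow in the example:
# characters = c("A", "B", "C", "D", "E", "F")
# Bars, written as Unicode escapes (used in the rest of this post):
characters = c("_", "\u2581", "\u2582", "\u2583", "\u2585", "\u2587")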

Let’s start with a couple of random values:

values = runif(5)

In my example, I got:

[1] 0.9568333 0.4533342 0.6775706 0.5726334 0.1029247

Then we want to select one of the characters or symbols based on these values, picking a taller bar for larger values. In effect, we are creating our own small histogram (a sparkline).

selected = characters[round(values * 5)+1]

This works well, except that we get a character vector with one element per symbol rather than a single string with all the characters included:

[1] "▇" "▂" "▃" "▃" "▁"

My next step was paste(), or rather the shorthand paste0(), which uses no separator. Called like this, however, it works element by element and returns the vector unchanged rather than a single string:

paste0(selected)

[1] "▇" "▂" "▃" "▃" "▁"

Enter the stringr package and its str_c() command:

library(stringr)
str_c(selected, sep="", collapse="")

[1] "▇▂▃▃▁"

There we go, a single string…
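
For what it’s worth, base R can do the same without an extra package: paste() and paste0() take a collapse argument that joins the elements of a vector into one string. Here is a minimal sketch that wraps the steps above into a small function. The name sparkline() is a hypothetical choice for illustration, and the values are the ones from the example run above:

```r
# Bars from lowest to highest, written as Unicode escapes.
characters = c("_", "\u2581", "\u2582", "\u2583", "\u2585", "\u2587")

# Hypothetical helper: turn values between 0 and 1 into one string of bars.
sparkline = function(values, symbols = characters) {
  # Scale each value to an index between 1 and length(symbols).
  index = round(values * (length(symbols) - 1)) + 1
  # collapse="" joins the selected symbols into a single string.
  paste0(symbols[index], collapse="")
}

values = c(0.9568333, 0.4533342, 0.6775706, 0.5726334, 0.1029247)
spark = sparkline(values)
print(spark)
```

The sketch assumes that all values lie between 0 and 1, as they do with runif(); other data would need rescaling first.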

Post-Doc position/ Quantitative Social Scientist/ Migration in Potsdam

Jasper Tjaden is looking for a postdoc to work with him in Potsdam starting October 2022. The contract is initially for two years, with the possibility of extension.

Interested candidates should send their CV and a short (one paragraph) motivation (in the body of the email).

Candidates should have:

  • Interest in migration/ integration studies (!)
  • Advanced R/ Stata skills
  • Good understanding of econometrics
  • Good understanding of causal inference
  • Experience collecting data/ survey methodology
  • Interest in/ experience with digital data (Facebook/ Google etc.)
  • Team player
  • Solid publication record

Prof. Dr. Jasper Tjaden

Faculty of Economic and Social Sciences
Professor of Applied Social Research & Public Policy

https://jaspertjaden.com/

Plotting a normal distribution in R

Apparently there are some unnecessarily complicated tutorials out there on how to draw a normal distribution (or other probability distributions) in R. No, there is no need for a loop; in fact, a single line of code is enough:

curve(dnorm(x, 0, 1), from=-4, to=4)

That’s a normal probability distribution with a mean of 0 and a standard deviation of 1, plotted from -4 to +4.

So it’s also easy to draw normal distributions with different means or standard deviations. Here’s one with a mean of 2, drawn without a box around the plot:

curve(dnorm(x, 2, 1), from=1, to=7, bty="n", xlab="")

How about a beta distribution? It’s not more difficult (assuming you know your shape parameters):

curve(dbeta(x, 10, 2), from=0, to=1, bty="n", xlab="")

So from now on, if you need to visualize your priors to get a feel for whether they constitute reasonable distributions, remember it’s just one line in R.
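
To compare several candidate priors in the same plot, curve() can draw on top of an existing plot with add=TRUE. Here is a minimal sketch, assuming two normal priors with standard deviations of 1 and 2; the particular values, line types, and variable names are only for illustration:

```r
# First prior: normal with mean 0 and standard deviation 1.
# curve() invisibly returns the x and y values it plotted.
first = curve(dnorm(x, 0, 1), from=-4, to=4, bty="n", xlab="", ylab="density")

# Second, wider prior drawn on top of the first as a dashed line.
second = curve(dnorm(x, 0, 2), from=-4, to=4, add=TRUE, lty=2)

# Label the two curves.
legend("topright", legend=c("sd = 1", "sd = 2"), lty=c(1, 2), bty="n")
```

Drawing the narrower distribution first matters here, because the first call to curve() sets the range of the y-axis.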