No, nationality is not a mechanism

This post might serve as a reminder to myself and others doing research on immigrants and their descendants that nationality is not a mechanism. Put differently, if you discover that people with nationality A differ from people with nationality B in a given characteristic, you have not explained anything at all.

It feels rather obvious when put this way, but it’s usually harder when it comes to multiple regression models. So often we throw in a control variable like “foreign national” or “foreign born” without thinking about why we do so, or what alternative explanation we think we are capturing. Obviously, a person’s passport or place of birth is used as a shorthand or proxy for something else, but what exactly?

Let’s consider the commonly used variables of migration background or migration origin. Short of calling a particular section of society different in essence (which we probably don’t want to), there is a range of concepts we might be trying to capture: the experience of (racial) discrimination, having a different skin colour, having a different religion, holding different values, having poor language skills, being of the working class, having additional cultural perspectives and experiences, transnational ties, or a combination of these.

Knowing what we’re after is essential for understanding. Sometimes it is necessary to use proxies like immigrant origin, but we need to specify the mechanism we’re trying to capture. Depending on the mechanism, who should be counted as of immigrant origin can be quite different, especially when it comes to children of immigrants, individuals of “mixed” background, and naturalized individuals. Having poor language skills, for example, is something most likely to affect (first-generation) immigrants; but the experience of racial discrimination probably does not disappear just because it was my grandparents rather than me who came to this country.

Barplots across variables in R

Here’s a good example of how useful sapply can be. I have some data from Qualtrics, and each response is coded in its own variable. Let’s say there is a question on what kind of organization respondents work in, with 10 response categories. Qualtrics produces 10 variables, each with 1 if the box was ticked, and empty otherwise (structure shown just below). With the default CSV import, these blank cells are turned into NA. Here’s a simple way to produce a barplot in this case (in R, of course).

[Figure: the Qualtrics data format, with one indicator variable (Q1_1 … Q1_10) per response category]
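If you want to try the snippets below without a Qualtrics export, here is a minimal sketch of that structure with made-up data; the variable names Q1_1 to Q1_10 match the example above, but the values (and the 25 respondents) are entirely hypothetical.

# hypothetical toy data mimicking the Qualtrics export after read.csv():
# each Q1_i is 1 if the box was ticked, and NA otherwise
set.seed(42)   # arbitrary seed, only for reproducibility
for (i in 1:10) {
  assign(paste("Q1_", i, sep=""), ifelse(runif(25) < 0.3, 1, NA))
}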

# count the ticked boxes for each of the ten variables Q1_1 ... Q1_10
types <- sapply(1:10, function(i) sum(get(paste("Q1_",i,sep="")), na.rm=TRUE))
barplot(types)

Let’s take this step by step. To count frequencies, we simply use sum(), with the argument na.rm=TRUE because the variables only contain 1 and NA. get() is used to find the variable specified by a string; the string is created with paste(). In this case, the variable names are Q1_1, Q1_2, Q1_3, … Q1_9, Q1_10. By using paste(), we combine the “Q1_” part with the counter variable i, with no separation (sep="").
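As a quick illustration of these two building blocks, using the third variable as an example:

paste("Q1_", 3, sep="")        # builds the string "Q1_3"
get(paste("Q1_", 3, sep=""))   # looks up and returns the object named Q1_3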

The whole thing is then wrapped up in sapply(), with the counter variable i defined to take values from 1 to 10; the function(i) part is there so that the counter variable is applied to the sum. So sapply() takes each value of the counter variable, and applies it to the function we specified, which calculates the sum for one variable Q1_i at a time.
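If sapply() feels opaque, the same counts can be written as an explicit loop; the sapply() call above is just a more compact version of this sketch:

# the same computation as the sapply() call, written as an explicit loop
types <- numeric(10)
for (i in 1:10) {
  types[i] <- sum(get(paste("Q1_", i, sep="")), na.rm=TRUE)
}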

Now I can simply draw the barplot again, and add the names.arg argument to specify the labels.
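For example, with placeholder labels (the real labels would come from the questionnaire):

# the labels here are placeholders; substitute the actual organization types
barplot(types, names.arg=paste("Type", 1:10))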

(Here I also specified the colours, barplot(types, col=rainbow(10)), to have a catchy image at the top of this post, albeit one where the colours have no meaning: so-called chart junk.)

R: empty cells in weighted cross-tabs across multiple variables

I’m not even sure how to succinctly describe the problem, but here’s what worked for me. Well, I have two sets of variables and want to run a cross-tabulation. I also want to weight the frequencies and then sum them up, and there are some empty (blank) cells to add to the mix. Three small problems in one; R to the rescue.

The two series of variables are as follows. First, there are attributes (Q5_1, Q5_2, … Q5_6), each coded into 6 categories that indicate frequencies, alas grouped: for attribute 1, variable Q5_1 indicates 0 occurrences, 1 to 5 occurrences, 6 to 10, etc. Second, there are sectors to identify subgroups, using a series of dummy variables (Q6_1, Q6_2, … Q6_10). So basically I want to run table(Q5_{1:6}, Q6_{1:10}), turning the categorical variables into approximate frequency counts.
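To make the steps below reproducible without the original survey, here is a hypothetical setup in the same format; the variable names match the description above, but the values and the 50 respondents are invented (with this toy data in the workspace, the attach() step below is not needed).

# invented data in the assumed format:
# Q5_i holds the grouped frequency category (1 to 6) for attribute i,
# Q6_j is 1 if the respondent belongs to sector j, and NA otherwise
set.seed(7)   # arbitrary seed
n <- 50       # made-up number of respondents
for (i in 1:6)  assign(paste("Q5_", i, sep=""), sample(1:6, n, replace=TRUE))
for (j in 1:10) assign(paste("Q6_", j, sep=""), ifelse(runif(n) < 0.4, 1, NA))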

First, I attach() my data, so that the get(paste(...)) code below can find the variables by their names.

Second, I create an empty matrix that I will subsequently fill with the (approximated) frequencies, one row per attribute and one column per sector: tbl <- matrix(data=NA, nrow=6, ncol=10).

Third, I cycle through each pair of variables: 1:6 attributes (attrvar), and 1:10 sectors (sectvar).
for(attrvar in 1:6) {
for(sectvar in 1:10) {

Here I create a simple cross-tabulation for the current pair of variables; get(paste(...)) does all the work.
raw <- table(get(paste("Q5_",attrvar,sep="")), get(paste("Q6_",sectvar,sep="")))

Since I want to weight the counts so as to approximate the actual frequencies from the categorical counts, I run into problems if there are empty cells in the previous step: table() simply leaves them out of the result. That’s usually fine, but here it is problematic because of the weights, so I have to add the zeros back in. Here’s one way to do this: create a vector with as many zeros as the attribute variable has categories: 6. (The package agrmt has a helper function for similar cases.)
raw2 <- rep(0, 6)

Next I replace the zeros with the actual values from the table raw where they exist; where a category does not occur, the zero stays in place. Looping over the cells of raw (rather than over 1:6) avoids indexing beyond the table when some categories are missing.
for(i in seq_along(raw)) {raw2[as.numeric(dimnames(raw)[[1]])[i]] <- raw[i]}

Now we have a complete frequency vector and I can apply my weights.
wei <- raw2 * c(0, 2.5, 5.5, 10.5, 15.5, 0)   # one weight per grouped category

and then sum up to approximate the actual count:
tbl[attrvar, sectvar] <- sum(wei)
}
}

print(tbl) now gives me the cross-tabulation with approximate counts.
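For reference, here is the whole procedure assembled in one block, under the same assumptions as above (the toy data from earlier, or your own attach()ed data with the same variable names):

# approximate counts: 6 attributes (rows) by 10 sectors (columns)
tbl <- matrix(data=NA, nrow=6, ncol=10)
for(attrvar in 1:6) {
for(sectvar in 1:10) {
raw <- table(get(paste("Q5_",attrvar,sep="")), get(paste("Q6_",sectvar,sep="")))
raw2 <- rep(0, 6)   # zeros for all 6 categories
for(i in seq_along(raw)) {raw2[as.numeric(dimnames(raw)[[1]])[i]] <- raw[i]}
wei <- raw2 * c(0, 2.5, 5.5, 10.5, 15.5, 0)   # one weight per grouped category
tbl[attrvar, sectvar] <- sum(wei)
}
}
print(tbl)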