Use Interpolated Median Values to Measure Brain Waste

For a while now, I have been coordinating an IMISCOE research group on brain waste with Marco Pecoraro. Brain waste (not my choice of term) is the underutilization of immigrants' education and skills in the country of destination, a specific form of educational mismatch also referred to as over-education, over-qualification, or over-schooling. The stereotypical case is an immigrant scientist working as a taxi driver.

One way to measure brain waste is to look at the average educational or skills level in a specific occupation or occupational group, and then check whether an individual has a higher or lower level of education or skills. That's quite neat, until it comes to choosing the average. Typically we measure skills and education using ordered scales, and depending on the researcher the mean, median, or mode is used (or sometimes a mix of them). None of them is really appropriate for ordered scales, but there is a more adequate measure out there: interpolated median values.

Interpolated median values are generally the most adequate measure of central tendency when there is a limited number of response categories, as with Likert scales or levels of education. To calculate interpolated median values, each response category is understood as a range with width w, and linear interpolation is used within the median category to locate the exact position of the median. In principle, we could estimate any quantile this way, but here we're interested in the median (q = 0.5).
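To make this concrete, here is a minimal sketch in R. It assumes responses coded as integers (say 1 to 5), each category a range of width w centred on its code, and that the median falls within a single category; the function name interpolated_median is mine:

interpolated_median <- function(x, w = 1) {
  m <- median(x)        # the median response category
  below <- sum(x < m)   # observations below the median category
  within <- sum(x == m) # observations within the median category
  n <- length(x)
  # lower boundary of the median category, plus linear interpolation
  (m - w/2) + w * (n/2 - below) / within
}

interpolated_median(c(1, 2, 2, 3, 3, 3, 4, 5))
[1] 2.833333

With a (hypothetical) data frame d containing education and occupation, individuals could then be flagged as over-educated when their education exceeds the interpolated median of their occupation, e.g. d$education > ave(d$education, d$occupation, FUN = interpolated_median).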

In addition, when comparing groups rather than individuals (which is what we typically do), superimposed kernel density plots would be quite helpful: one for the majority population, and one for the immigrant group studied. The interpolated median could readily be added to give a good sense of how much of a difference there is between the groups in substantive terms.
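With made-up data, such a plot might be sketched like this, reusing the interpolated_median() function from above:

majority <- sample(1:5, 500, replace = TRUE, prob = c(0.10, 0.20, 0.40, 0.20, 0.10))
immigrants <- sample(1:5, 500, replace = TRUE, prob = c(0.05, 0.15, 0.30, 0.30, 0.20))
plot(density(majority), main = "Level of education by group")
lines(density(immigrants), lty = 2)
abline(v = interpolated_median(majority))            # majority group
abline(v = interpolated_median(immigrants), lty = 2) # immigrant group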

Now, if you were thinking that the measure of central tendency does not matter, here's a bunch of distributions (shown as histograms because of the small number of observations in these examples, say levels of education), along with their mean (blue line), median (red line), and interpolated median (dashed black line). We can see that in some configurations the choice of central tendency makes no difference at all, in others there is a small difference, and in others still the differences are substantive. It's these substantive differences we should be worried about.
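For illustration, one such histogram could be drawn like this (again with made-up data):

x <- c(1, 1, 2, 2, 3, 3, 3, 3, 3)
hist(x, breaks = seq(0.5, 3.5, 1))
abline(v = mean(x), col = "blue")           # mean: 2.33
abline(v = median(x), col = "red")          # median: 3
abline(v = interpolated_median(x), lty = 2) # interpolated median: 2.6

Here the three measures clearly diverge: the mean is pulled down by the lower categories, the median sits at 3, and the interpolated median lands at 2.6.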

While I’m at it, here are some other challenges to measuring brain waste. Typically we do not (attempt to) adjust for quality differences in education, but take diplomas at face value. Differences in quality may occur across countries, but also within countries, across universities and so on. Typically we do not distinguish between being over-skilled and being over-educated, even though conceptually the two are different; here the lack of adequate questions in the data is a major limitation. Finally, we should also consider the counterfactual: would this over-qualified immigrant have been able to realize their potential in the country of origin (or elsewhere)? While being over-qualified is generally a problem, for the individual in question it may still be the ‘optimal’ outcome.

An Overview of Recent Correspondence Tests

In a recent IZA working paper, Stijn Baert offers a long list of correspondence tests: field experiments where equivalent CVs are sent to employers to capture discrimination in hiring. What’s quite exciting about this list is that it covers all kinds of characteristics, from nationality to gender, from religion to sexual orientation. What’s also great is the promise to keep this list up to date on his website. At the same time, the register does not describe the inclusion criteria in great detail. I was surprised not to find some of the studies Eva Zschirnt and I included in our meta-analysis on the list, despite our making all the material available on Dataverse. Was this an oversight (the title of the working paper does include an “almost”), or was it due to the inclusion criteria? What I found really disappointing was the misguided focus on p-values to identify the ‘treatment effect’. All in all, a useful list for those interested in hiring discrimination more generally.

Member Commitment in Political Parties

This is something someone could examine empirically. I think we can take the argument Dan Olson develops in “Why Do Small Religious Groups Have More Committed Members” (Review of Religious Research 49(4): 353-378) and apply it to political parties (or any organization). The argument is that (religious) groups located in areas where their members make up a smaller proportion of the population have more committed members. If we translate this to political parties, member commitment in relatively small parties should be stronger. Dan Olson argues that these groups have higher rates of leaving and joining, processes that select for more committed members: congregations with higher membership turnover rates have current members who are more committed (they attend services, they give money). If we translate this to political parties, member commitment should be higher in parties with higher membership turnover.

Calculating Standard Deviations on Specific Columns/Variables in R

When calculating the mean across a set of variables (or columns) in R, we have colMeans() at our disposal. What do we do if we want to calculate, say, the standard deviation? There are a couple of packages offering such a function, but there is no need, because base R has apply().

Let’s start by creating some data: a matrix with 3 columns of random numbers.

M <- matrix(rnorm(30), ncol=3) # 10 rows, 3 columns of draws from N(0,1)

This gives us something like this:
[,1] [,2] [,3]
[1,] -0.3533716 -1.12408752 0.09979301
[2,] 0.6099991 -0.48712761 0.22566861
[3,] -0.9374809 -1.10497004 -0.26493616
[4,] -0.5243967 -0.66074559 0.16858864
[5,] 0.2094733 -0.45156576 -0.27735151
[6,] 0.6800691 1.82395926 -0.18114150
[7,] 0.1862829 0.43073422 0.14464538
[8,] -1.0130029 -1.52320349 -1.74322076
[9,] 1.1886103 0.09653443 -1.95614608
[10,] -0.9953963 -1.15683775 1.61106346

Now comes apply(), where the second argument 1 indicates that we want to apply the specified function (here: sd(), but we can use any function we want) across rows; we would use 2 to apply it across columns.

apply(M, 1, sd)

This gives us the standard deviations for each row:

[1] 0.6187682 0.5566979 0.4446021 0.4447124 0.3426177 1.0058659 0.1545623
[8] 0.3745954 1.5966433 1.5535429

We can quickly check whether these numbers are correct:

sd(c(-0.3533716, -1.12408752, 0.09979301))

[1] 0.6187682

Of course, we can select just the variables or columns we want, for example apply(M[, 2:3], 1, sd), or by combining columns with cbind().
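And if we want the standard deviation of each column or variable, the direct analogue of colMeans(), we simply switch the margin:

apply(M, 2, sd) # one standard deviation per column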