From time to time I get asked when the data from the SOM Project on the politicization of immigration will be available. It’s already there!
Let there be no doubt: there are many reasons to advocate and use open-source software. But this post has nothing to do with the open-source versus closed-source debate; it is about the value of using different statistical software in the same project. It may sound counter-intuitive, but research projects can benefit if different software packages are used to do the statistical analysis.
Why would this be the case? By doing the analysis in different software packages, we are forced to replicate the analysis, to think independently about the implementation, and to understand what we are doing. Obviously this could also be achieved by two teams working independently with the same software, but with different software packages there is no harm in sharing code. Once the analysis has been done twice, typos and similar slips that do not lead to errors or obviously wrong results will be spotted, leading to more robust research results. At the same time, there will be fruitful discussions about methods, such as whether the default values in a software package are the most useful ones for the particular problem at hand.
Here’s just one more reason to make sure research is replicable. We’re working in a team and have divided tasks for greater efficiency. One person was blessed with the job of running cross tabs, and put everything into an Excel sheet. The project got delayed, and we now want to use the latest wave of the data. The person responsible for the original cross tabs is no longer here; the details provided turn out to be insufficient (variable description rather than variable name). Now we’re replicating that person’s research in order to update it. How efficient is that?
(probably should have checked this first way round…)
With the R command sapply() we can easily apply a function many times. Here I simply want to highlight that sapply() can be used within sapply(): it can be nested.
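Before nesting, a minimal sketch of what sapply() does on its own (this toy example is my own illustration, not from the post): it applies a function to each element of a vector and simplifies the result to a vector or matrix where possible.

```r
# Apply a function to each element of 1:5; the result is simplified to a vector
squares <- sapply(1:5, function(x) x^2)
squares
# 1  4  9 16 25

# With a named character vector as input, sapply() keeps the names
sapply(c(first = "AT", second = "DE"), nchar)
```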
First, a simple application: I have several countries in a dataset, and want to generate a table for each of them.
sapply(c("AT", "DE", "CH"), function(x) round(prop.table(table(object[country == x]))*100, 1))
Step by step: the table() function counts how many cases there are in each category of the variable object. The subscript [country == x] means that R replaces the x with one of the items provided in each round: "AT", then "DE", and finally "CH". The prop.table() function turns the counts into proportions; here I also multiply them by 100 to get percentages, and round them off to one digit. The function(x) part tells R that we define our own function, with x as its argument. The first argument to sapply() is the countries I want to use. I end up with a table with the countries across the top and the categories of the variable down the side.
With just three countries, using sapply() may seem rather trivial, but how about running the code on all the countries in the dataset? Using sapply() rather than for() loops has two important advantages. First, it is often faster. Second, we usually end up with (much) more compact code, reducing the risk of mistakes when copying and pasting.
Let’s assume our dataset includes variation over time as well as across countries. We can simply nest two sapply() commands. Here is code to calculate the median salience by country and year.
cy <- c("AT", "DE", "CH")
yr <- 2000:2010
country.salience <- sapply(cy, function(x) sapply(yr, function(y) median(salience[country == x & year == y], na.rm=TRUE)))
rownames(country.salience) <- yr
colnames(country.salience) <- cy
At the top I define the countries of interest and the years I want to examine. At the core of the code, we take the median of the variable salience for country x and year y. The first sapply() runs this across the countries chosen; the second across the years chosen. The last two lines simply name the rows and columns to give a readily accessible table.
What if we now want the interpolated median instead of the median? We simply replace that part of the code. Unlike with copied-and-pasted code, we make the change once, not once for every country/year combination.
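A runnable sketch of this single-point change, with invented data (for illustration I swap in mean(); an interpolated median function would slot into exactly the same place):

```r
cy <- c("AT", "DE", "CH")
yr <- 2000:2010

# Invented data for illustration
set.seed(42)
n <- 500
country  <- sample(cy, n, replace = TRUE)
year     <- sample(yr, n, replace = TRUE)
salience <- runif(n)

# Only the summary function changes; the nested sapply() structure stays put
country.salience <- sapply(cy, function(x)
  sapply(yr, function(y) mean(salience[country == x & year == y], na.rm = TRUE)))
rownames(country.salience) <- yr
colnames(country.salience) <- cy

dim(country.salience)
# 11  3
```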
I regularly get asked whether I’d be willing to share my data on the political representation of ethnic groups in national legislatures. The answer always is yes, as I believe in sharing data. Apart from the appendices in my 2009 article and 2013 monograph, the data are available on my Dataverse. There you can find a spreadsheet with the numbers behind the representation scores. As this is an ongoing project of mine, I do from time to time update my database with better estimates, and also expand coverage. This also means that I’d appreciate possible corrections.