No, we shouldn’t all be using Stata/R/SPSS/…

Let there be no doubt, there are many reasons to advocate and use open-source software. What I care about in this post has nothing to do with open-source and closed-source software, but is about the value of using different statistical software in the same project. It may sound counter-intuitive, but research projects can benefit if different software packages are used to do the statistical analysis.

Why would this be the case? By doing the analysis in different software packages, we are forced to replicate the analysis, to think independently about the implementation, and to truly understand what we are doing. Obviously this could also be achieved by two teams working independently with the same software, but with different software packages there is no harm in sharing code. Having done the analysis twice, we will spot the influence of typos and similar slips that do not throw errors or produce obviously wrong results, leading to more robust research results. At the same time, there will be fruitful discussions about methods, such as whether the default values in a software package are the most useful ones for the particular problem at hand.

One more reason for replicable research

Here’s just one more reason to make sure research is replicable. We’re working in a team and have divided tasks for greater efficiency. One person was blessed with the job of running cross tabs, and put everything into an Excel sheet. The project got delayed, and we now want to use the latest wave of the data. The person responsible for the original cross tabs is no longer here; the details provided turn out to be insufficient (a variable description rather than the variable name). Now we’re replicating that person’s research just in order to update it. How efficient is that?

(probably should have checked this the first time round…)

Double sapply()

With the R command sapply() we can easily apply a function many times. Here I simply want to highlight that sapply() can be used within sapply(): it can be nested.

First, a simple application: I have several countries in a dataset, and want to generate a table for each of them.

sapply(c("AT", "DE", "CH"), function(x) round(prop.table(table(object[country == x]))*100, 1))

Step by step: the table() function counts how many cases there are in each category of the variable object. The subscript [country == x] tells R to replace x with one of the supplied items in each round: “AT”, then “DE”, and finally “CH”. The prop.table() function turns the counts into proportions; here I also multiply these by 100 to get percentages, and round them to one digit. The function(x) part tells R that we define our own function, with x as its argument. The first argument to sapply() is the vector of countries I want to use. I end up with a table with the countries across the top and the categories of the variable object down the side.

With just three countries, using sapply() can be rather trivial, but how about running the code on all countries in the dataset? We can use unique(country). Using sapply() rather than for() loops has two important advantages. First, it is often faster. Second, we usually end up with (much) more compact code, reducing the risk of mistakes when copying and pasting code.
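As a self-contained sketch of this call, with made-up data (the names object and country simply mirror those in the text):

```r
# Made-up example data: a categorical variable and a country indicator
set.seed(42)
country <- sample(c("AT", "DE", "CH"), 300, replace = TRUE)
object  <- factor(sample(c("yes", "no"), 300, replace = TRUE))

# One column of percentages per country; unique(country) covers them all
pct <- sapply(unique(country), function(x)
  round(prop.table(table(object[country == x])) * 100, 1))
pct
```

Making object a factor guarantees that every country’s table has the same categories, so sapply() can simplify the result into a matrix rather than return a list.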

Let’s assume our dataset includes variation over time as well as across countries. We can simply nest two sapply() commands. Here we have code to calculate the median salience by country and year.

cy <- c("AT", "DE", "CH")
yr <- 2000:2010
country.salience <- sapply(cy, function(x) sapply(yr, function(y) median(salience[country == x & year == y], na.rm=TRUE)))
rownames(country.salience) <- yr
colnames(country.salience) <- cy

At the top I define the countries of interest and the years I want to examine. In the core of the code, we take the median of the variable salience for country x and year y. The first sapply() runs this over the chosen countries, the second sapply() over the chosen years. The last two lines simply name the rows and columns to give a readily accessible table.

What if we now want the interpolated median instead of the median? We simply replace that part of the code. In contrast to copied-and-pasted code, we make the change once, not once for every country/year combination.
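To make the swap concrete, here is a sketch that factors the statistic out into its own function; cell.stat is a name I made up for illustration, and the data are simulated:

```r
# Simulated data standing in for the real dataset
set.seed(1)
n <- 500
country  <- sample(c("AT", "DE", "CH"), n, replace = TRUE)
year     <- sample(2000:2010, n, replace = TRUE)
salience <- runif(n, 0, 10)

cy <- c("AT", "DE", "CH")
yr <- 2000:2010

# The statistic lives in one place: swap median for another function
# (e.g. an interpolated median) and every cell updates at once
cell.stat <- function(v) median(v, na.rm = TRUE)

country.salience <- sapply(cy, function(x)
  sapply(yr, function(y) cell.stat(salience[country == x & year == y])))
rownames(country.salience) <- yr
colnames(country.salience) <- cy
country.salience
```

Because the nested sapply() calls only ever refer to cell.stat, changing the statistic is a one-line edit.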

Data on Ethnic Group Representation

I regularly get asked whether I’d be willing to share my data on the political representation of ethnic groups in national legislatures. The answer always is yes, as I believe in sharing data. Apart from the appendices in my 2009 article and 2013 monograph, the data are available on my Dataverse. There you can find a spreadsheet with the numbers behind the representation scores. As this is an ongoing project of mine, I do from time to time update my database with better estimates, and also expand coverage. This also means that I’d appreciate possible corrections.

Why I Ditched Sweave for odfWeave

Well, I didn’t really – not completely. Sweave is an incredible tool for research. It facilitates replicable research, to the benefit of researchers, too. Here is the main reason that did it for me: getting back to a paper after a few weeks or months – think working on multiple papers, think hearing back from reviewers – and not remembering all the details. For example, a figure or number looks odd, and with Sweave I can immediately see where it comes from. Another reason is that I don’t want to end up on the other side of the following situation: I recently asked an author about a detail in a recently published paper. The response was “I can’t remember, I did the analysis some two years ago.” Using a single Sweave file, I also avoid confusion at the level of filenames (compare here or here).

So what is the problem with Sweave? There are two. First, on some papers I collaborate with others who don’t use R and Latex. Second, most journals in political science and sociology don’t accept Latex files. Enter odfWeave, which does almost everything Sweave does using LibreOffice rather than Latex. Creating Word documents for commenting and submission is easy. It also plays nicely with Zotero – which I find a bit easier to work with than Bibtex. (One annoyance: odfWeave hates relative paths.)

I said that I did not really ditch Sweave. For first drafts, I still like the accessibility, lack of distraction, and compatibility of a plain text file. Usually I use heavily commented R code, but Sweave is never far away, especially as I can keep all analysis and plots in a single file.