I have recently explored open-source approaches to computer-assisted qualitative data analysis (CAQDA). As is common with open-source software, there are several options available, but as is often also the case, few of them can keep up with the commercial packages, and some are abandoned.

Here I wanted to highlight just three options.

RQDA is built on top of R, which is perhaps not the most obvious choice, but it can have advantages. The documentation is steadily improving, making it more apparent that RQDA has the main features we’ve come to expect from CAQDA software. I find it a bit fiddly with the many windows it tends to open, especially when working on a small screen.

Colloquium is Java-based, which makes it run almost everywhere. It offers a rather basic feature set, and tags can only be assigned to lines (which also implies that lines are the unit of analysis). Where it shines, though, is how it enables working in two languages in parallel.

CATMA is web-based, but runs without Flash, so it should work pretty much anywhere. It offers basic manual and automatic coding, but there’s one feature we really should care about: CATMA does TEI. This means that CATMA offers a standardized XML export that should be usable in the future, and facilitate sharing the documents as well as the accompanying coding. That’s quite exciting.
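To give a flavour of what a TEI export can look like, here is a hand-written, generic TEI-style sketch (the element names are standard TEI, but this is not CATMA’s actual export format, which differs in detail): a transcript with one segment coded via a tag reference.

```xml
<!-- Generic TEI-style sketch, NOT CATMA's actual export: a transcript
     containing one coded segment, referenced by an analysis pointer -->
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Interview 01</title></titleStmt>
      <publicationStmt><p>Unpublished research data</p></publicationStmt>
      <sourceDesc><p>Transcript</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p><seg ana="#trust">I never really believed what they told us.</seg></p>
    </body>
  </text>
</TEI>
```

The point is less the exact tag names than the principle: text and coding travel together in one standardized, human-readable file.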

What I find difficult to judge at the moment is whether TEI will be adopted by other CAQDA software. Atlas.ti does some XML, but as far as I know it’s not TEI. And would TEI be more useful to future researchers than the SQLite databases RQDA produces?

Quantitative Social Science: An Introduction

Kosuke Imai has recently published a great introduction: Quantitative Social Science: An Introduction. Finally a statistical data analysis book that has arrived in the present! Yes, we can get away with very little mathematics and still do quantitative analysis. Yes, examples from published work are much more interesting than constructed toy examples. Yes, R can be accessible. Yes, we can talk about causality, measurement, and prediction (even Bayes) before getting to hypothesis testing. Yes, we can work with text and spatial data.

Wordscores and JFreq – an update

An old post of mine on using JFreq and Wordscores in R still gets frequent hits. For some documents, the current version of JFreq doesn’t work as well as the old one (which you can find here [I’m just hosting this, all credit to Will Lowe]). For very long documents, there is a Python script by Thiago Marzagão, archived here (I have never tried it). And then there is quanteda, the new R package that also implements Wordscores.
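For readers curious about the mechanics behind those tools, here is a minimal base-R sketch of the Wordscores scoring step in the spirit of Laver, Benoit and Garry, using made-up toy counts. The variable names and numbers are mine; for real analyses, use quanteda or JFreq rather than this sketch.

```r
# Minimal sketch of the Wordscores scoring step (toy counts, NOT real data).
# Word-by-document count matrix for two reference documents:
ref <- matrix(c(10,  0,
                 5,  5,
                 0, 10),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("left", "centre", "right"), c("R1", "R2")))
A <- c(R1 = -1, R2 = 1)                  # a priori reference scores

Fwr <- sweep(ref, 2, colSums(ref), "/")  # P(word | reference doc)
Pwr <- Fwr / rowSums(Fwr)                # P(reference doc | word)
S   <- drop(Pwr %*% A)                   # score of each word

# score a new ('virgin') document from its word counts
virgin <- c(left = 2, centre = 2, right = 6)
score  <- sum(virgin / sum(virgin) * S)
score                                    # 0.4: on the R2 side of the scale
```

The word exclusive to R1 gets score -1, the word exclusive to R2 gets +1, and the shared word lands at 0; the virgin document’s score is just the frequency-weighted average of these word scores.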

Having said this, a recent working paper by Bastiaan Bruinsma and Kostas Gemenis heavily criticizes Wordscores. While their work does not discredit Wordscores as such (merely the quick and easy approach Wordscores advertises, which, depending on your view, is the essence of Wordscores), I prefer to read it as a call to validate Wordscores before applying it. After all, in some situations it seems to ‘work’ pretty well, as Laura Morales and I show in our recent paper in Party Politics.

Alienating open-source contributors?

Some time ago, I came across a blog post highlighting how open-source contributors can be alienated by maintainers. Tim Jurka describes his unpleasant experience of submitting an updated version of an R package to CRAN: he highlights the short and impersonal messages from the CRAN maintainers and an apparent contradiction, and generally felt alienated by the process. He offers four lessons to be learnt:

– don’t alienate volunteers — everyone in the R community is a volunteer, and it doesn’t benefit the community when you’re unnecessarily rude.
– understand volunteers have other commitments — while the core R team is doing an excellent job building a statistical computing platform, not everyone can make the same commitment to an open-source project.
– open-source has limited resources — every contribution helps.
– be patient — not everyone can operate on the same level, and new members will need to be brought up to speed on best practices.

I guess everyone would sign up to this, but oddly enough my experience with the team running CRAN has always been of the nature Tim Jurka cites as a positive example: brief, but courteous. What is definitely missing from said blog post, though, is an appreciation that the team running R and CRAN are also volunteers!

Calculating Standard Deviations on Specific Columns/Variables in R

When calculating the mean across a set of variables (or columns) in R, we have colMeans() at our disposal. But what do we do if we want to calculate, say, the standard deviation? There are a couple of packages offering such a function, but there is no need for them, because we have apply() in base R.

Let’s start with creating some data, a matrix with 3 columns full of random numbers.

M <- matrix(rnorm(30), ncol=3)

This gives us something like this:
[,1] [,2] [,3]
[1,] -0.3533716 -1.12408752 0.09979301
[2,] 0.6099991 -0.48712761 0.22566861
[3,] -0.9374809 -1.10497004 -0.26493616
[4,] -0.5243967 -0.66074559 0.16858864
[5,] 0.2094733 -0.45156576 -0.27735151
[6,] 0.6800691 1.82395926 -0.18114150
[7,] 0.1862829 0.43073422 0.14464538
[8,] -1.0130029 -1.52320349 -1.74322076
[9,] 1.1886103 0.09653443 -1.95614608
[10,] -0.9953963 -1.15683775 1.61106346

Now comes apply(), where the second argument specifies the margin over which the function (here: sd(), but we can use any function we want) is applied: 1 applies it across each row, 2 across each column.

apply(M, 1, sd)

This gives us the standard deviations for each row:

[1] 0.6187682 0.5566979 0.4446021 0.4447124 0.3426177 1.0058659 0.1545623
[8] 0.3745954 1.5966433 1.5535429

We can quickly check whether these numbers are correct:

sd(c(-0.3533716, -1.12408752, 0.09979301))

[1] 0.6187682

Of course we can choose the variables or columns we want, for example apply(M[, 2:3], 1, sd), or by combining columns with cbind().
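And to connect back to the title, here is a short self-contained sketch of the column-wise case (margin 2, the analogue of colMeans()), with a seed so the check is reproducible; the variable names are my own.

```r
# Column-wise standard deviations: one sd per variable, reproducible data.
set.seed(42)
M <- matrix(rnorm(30), ncol = 3)

col_sds <- apply(M, 2, sd)          # 2 = over columns
sel_sds <- apply(M[, 2:3], 2, sd)   # restricted to columns 2 and 3

# sanity check against a manual calculation for the first column
all.equal(col_sds[1], sd(M[, 1]))   # TRUE
```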