Wordscores and JFreq – an update

An old post of mine on using JFreq and Wordscores in R still gets frequent hits. For some documents, the current version of JFreq doesn’t work as well as the old one (which you can find here [I’m just hosting this, all credit to Will Lowe]). For even longer documents, we have a Python script by Thiago Marzagão archived here (I have never tried this). And then there is quanteda, the new R package that also does Wordscores.

Having said this, a recent working paper by Bastiaan Bruinsma and Kostas Gemenis heavily criticizes Wordscores. While their work does not discredit Wordscores as such (merely the quick and easy approach Wordscores advertises, which depending on your view is the essence of Wordscores), I prefer to read it as a call to validate Wordscores before they are applied. After all, in some situations they seem to ‘work’ pretty well, as Laura Morales and I show in our recent paper in Party Politics.

R package not available on R-Forge?

Apparently it is still the case that some R packages on R-Forge are unavailable for download. For instance, the package Austin to calculate Wordscores and Wordfish has no download candidate for Windows. Does that mean Windows users cannot run Wordscores using R? Far from it, they just have to compile the package themselves. This might sound a bit scary, but it is actually quite easy if you follow the instructions I posted over a year ago.

Should We Use Stop Words?

When using automatic content analysis like Wordscores or Wordfish, we can apply stop words, that is, remove a list of words from the texts before the analysis. This is a contentious issue, with recommendations ranging from definitely using stop words to the view that stop words are a bad thing. What to do?

To me this sounded more like an empirical question than something beliefs could settle. Using professionally translated texts (i.e. party manifestos available in two languages), I examined how stop words affect predicted scores (i.e. party positions). Lowe & Benoit (2013) highlight that words considered a priori uninformative can still help predict party positions. This can be used as an argument against using stop words. In my analysis, I applied just a few stop words, consisting almost entirely of grammatical terms like articles and conjunctions (function words). It turns out that removing these words can almost entirely eliminate the impact of language on predicted scores. Put differently, removing words that really carry no meaning can improve the predictions.

So should we use stop words? Yes, but we don’t need many stop words, and using stop words that clearly carry no substantive information seems to be a good idea.
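
To make the mechanics concrete, here is a minimal sketch in plain Python of the basic Wordscores calculation, run with and without a stop word list. It is not the code from Austin, quanteda or JFreq, and it leaves out the rescaling of raw scores. The function names, the tiny stop word list and the toy texts are all made up for illustration.

```python
from collections import Counter

# Hypothetical, deliberately short stop word list (function words only).
STOP_WORDS = frozenset({"the", "a", "an", "and", "or", "of", "to", "in"})


def tokenize(text, stop_words=frozenset()):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    tokens = (t.strip(".,;:!?\"'()").lower() for t in text.split())
    return [t for t in tokens if t and t not in stop_words]


def relative_freqs(tokens):
    """Relative frequency of each word within one text."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def word_scores(reference_texts, reference_scores, stop_words=frozenset()):
    """Score each word as the average of the reference positions, weighted
    by the probability of reading each reference text given that word."""
    freqs = {name: relative_freqs(tokenize(text, stop_words))
             for name, text in reference_texts.items()}
    vocab = set().union(*freqs.values())
    scores = {}
    for w in vocab:
        total = sum(f.get(w, 0.0) for f in freqs.values())
        scores[w] = sum(f.get(w, 0.0) / total * reference_scores[name]
                        for name, f in freqs.items())
    return scores


def score_text(text, scores, stop_words=frozenset()):
    """Raw score of a new text: mean word score over its scored words."""
    tokens = [t for t in tokenize(text, stop_words) if t in scores]
    if not tokens:
        raise ValueError("text shares no words with the reference texts")
    return sum(scores[t] for t in tokens) / len(tokens)


# Toy reference texts with assumed positions (-1 = left, +1 = right).
refs = {"left": "the tax and the welfare",
        "right": "the market and the freedom"}
positions = {"left": -1.0, "right": 1.0}
new_text = "the tax and the market and the market"

raw = score_text(new_text, word_scores(refs, positions))
cleaned = score_text(new_text, word_scores(refs, positions, STOP_WORDS),
                     STOP_WORDS)
print(raw, cleaned)
```

In this toy example the function words appear equally often in both reference texts, so they receive a word score of zero and pull the text towards the centre: the raw score is 0.125 with them and about 0.333 without. Real manifestos are messier, which is exactly why checking this empirically is worthwhile.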

Lowe, Will, and Kenneth Benoit. 2013. “Validating estimates of latent traits from textual data using human judgment as a benchmark.” Political Analysis 21 (3): 298–313. doi:10.1093/pan/mpt002.

Ruedin, D. 2013. “The Role of Language in the Automatic Coding of Political Texts.” Swiss Political Science Review 19 (4): 539–45. doi:10.1111/spsr.12050.

The role of source language in Wordscores etc.

My paper on the role of source language in the automatic coding of political texts (Wordscores, dictionary coding) is now available online. I make use of Swiss party manifestos to examine the impact of source language on party positions derived from the manifestos: does it matter if a French or German manifesto is used? The conclusion is that both stemming and particularly stop words are important to obtain comparable results for Wordscores, while the keyword-based dictionary approach is not affected by language differences. Replication material is available on my Dataverse.