Wordscores in R

A while ago I wrote a step-by-step guide for using Wordscores in R. I did this as part of the FP7 project SOM, but ended up doing the analyses myself. I have recently posted some code for running Wordscores in R:

# #################################
# WORDSCORES AND WORDFISH ANALYSIS
# #################################

# setup
library(austin)

############################
# GETTING DOCUMENTS IN
############################
a <- wfm("SHORT.1995-2011.csv")
a[0,] # check the party order (header only)

############################
# A. WORDFISH
############################
wordfish(a, dir=c(23, 20), control=list(tol=1e-06, sigma=3, startparams=NULL), verbose=FALSE)
# identification strategy (dir):
# GPS 2003 and SVP 2003
# these are the extremes in the expert survey (moving average or alternative count)
# they are also the Benoit & Laver texts, in which we have some confidence

############################
# B. WORDSCORES
############################

# SET REFERENCES
ref <- c(10,11,15,20,23) # reference texts
vir <- 1:24 # SPS 2011 (short) is empty, thus not included
vir <- vir[-ref] # everything minus the reference texts

r <- getdocs(a, ref)
ws <- classic.wordscores(r, scores=c(5.971929825,1.252631579,4.665789474,9.206140351,0.935087719))
summary(ws)

# PREDICT
v <- getdocs(a, vir)
predict(ws, newdata = v)
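
To make the two Wordscores steps above less of a black box, here is a minimal base-R sketch of the scoring arithmetic described by Laver, Benoit and Garry, run on toy data. The matrix, the document names and the reference scores are all made up for illustration, and this is not Austin's implementation: it only computes the raw, unrescaled score of a single virgin text.

```r
# Toy word-by-document counts (hypothetical data, words in rows)
counts <- matrix(c(10,  0, 4,
                    5,  5, 5,
                    0, 10, 6),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(words = c("tax", "welfare", "green"),
                                 docs  = c("refA", "refB", "virgin")))

# Hypothetical positions assigned to the two reference texts
ref_scores <- c(refA = 1, refB = 9)

# Relative frequency of each word within each reference text
ref  <- counts[, names(ref_scores)]
F_wr <- sweep(ref, 2, colSums(ref), "/")

# Probability of reading each reference text, given the word
# (words absent from all reference texts would give NaN here)
P_rw <- F_wr / rowSums(F_wr)

# Word score: reference positions weighted by those probabilities
S_w <- as.vector(P_rw %*% ref_scores)
names(S_w) <- rownames(counts)

# Virgin text score: word scores weighted by the text's relative frequencies
F_wv <- counts[, "virgin"] / sum(counts[, "virgin"])
virgin_score <- sum(F_wv * S_w)

S_w
virgin_score
```

In this toy case "tax" only occurs in refA and "green" only in refB, so they inherit the positions of those texts, while "welfare" sits halfway between them.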

I use Will Lowe’s JFreq to get the word frequencies.

Wordscores/JFreq with Long Manifestos

Today I ran into an unexpected error when using Wordscores in R. I used JFreq 0.5.4 to calculate the word frequencies for 35 parties with rather long party manifestos. This resulted in a 3.4 MB CSV file with 42462 columns. R threw a read.table error when I called the wfm function in Austin (0.2) to import the word frequencies: “Error in read.table(file = file, header = header, sep = sep, quote = quote, : more columns than column names”. It seems the file is too wide to be read.

The solution I found was to use the old JFreq 0.2.5, which produces the output the other way around (rows/columns switched). Even if it is a bit slower than the newer JFreq, having a rather long (as opposed to wide) CSV with the word frequencies does not seem to pose problems.
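
For anyone who wants to check a file before switching JFreq versions, here is a minimal base-R sketch, assuming a comma-separated file with documents in rows and words in columns. The toy file below is hypothetical and stands in for the real JFreq output. It counts the fields on each line (a mismatch between the header and the data rows is what read.table reports as "more columns than column names") and then flips a wide table into a long one. I have not tried this on the 42462-column file, so take it as a diagnostic idea rather than a confirmed fix.

```r
# Hypothetical toy file standing in for the JFreq output
path <- tempfile(fileext = ".csv")
writeLines(c(",tax,green,border",
             "GPS2003,1,7,0",
             "SVP2003,4,0,6"), path)

# Number of fields on every line; all lines should agree
n_fields <- count.fields(path, sep = ",", quote = "\"")
table(n_fields)

# Read the wide table (documents in rows), keeping the words as they are
wide <- as.matrix(read.csv(path, row.names = 1, check.names = FALSE))

# Transpose so that words are in rows and documents in columns
long <- t(wide)
dim(long)
```

If count.fields reports different numbers for the header and the data rows, the problem lies in how the file was written rather than in its width as such.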