Today I ran into an unexpected error when using Wordscores in R. I used JFreq 0.5.4 to calculate the word frequencies from 35 parties with rather long party manifestos. This resulted in a 3.4M CSV file with 42462 columns. R threw a read.table error when I called the wfm function from Austin (0.2) to import the word frequencies: “Error in read.table(file = file, header = header, sep = sep, quote = quote, : more columns than column names”. Well, the file seems too wide to open.
The solution I found was to use the old JFreq 0.2.5, which produces the output the other way around (rows/columns switched). Even if it is a bit slower than the newer JFreq, having a rather long (as opposed to wide) CSV with the word frequencies does not seem to pose problems.
It seems the old version of JFreq is no longer easily found on the web. Get it here: https://drive.switch.ch/index.php/s/8RG7kVWtj1k0PuM (all credit for the software to Will Lowe). Licence: GPL.
Thanks for posting the link to the old JFreq program. I did run into the problem that my file was “too wide” and couldn’t be transposed using R (which probably has less to do with R than with an internal memory problem).
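For readers in the same situation, a wide frequency table can also be flipped outside R. The following is only a minimal sketch in Python, not part of JFreq or Austin: the function name transpose_csv is hypothetical, and it assumes a comma-separated file that fits in memory (which a file of a few megabytes should) and whose rows all have the same number of fields, apart from an R-style header that lacks the cell above the row names.

```python
import csv


def transpose_csv(src, dst, encoding="utf-8"):
    """Write the transpose of the CSV table in src to dst.

    Hypothetical helper: the whole table is held in memory.
    """
    with open(src, newline="", encoding=encoding) as fin:
        rows = [row for row in csv.reader(fin) if row]  # skip blank lines
    if not rows:
        raise ValueError("empty input file")
    # Assumption: a header that is exactly one field shorter than the data
    # rows follows the R convention of having no cell above the row names.
    if len(rows) > 1 and len(rows[0]) == len(rows[1]) - 1:
        rows[0] = [""] + rows[0]
    widths = {len(row) for row in rows}
    if len(widths) != 1:
        # Refuse to guess how ragged rows should be aligned.
        raise ValueError(f"ragged rows, field counts found: {sorted(widths)}")
    with open(dst, "w", newline="", encoding=encoding) as fout:
        csv.writer(fout).writerows(zip(*rows))
```

The sketch deliberately raises an error on ragged rows instead of padding them, since silent padding could shift counts into the wrong column.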
I tried JFreq with some large text files and it choked (“Java heap space” error).
I ended up building a term-frequency generator in Python:
http://thiagomarzagao.wordpress.com/2013/06/10/word-counting-in-python/
It reads the file in chunks, so the file can be arbitrarily large (I’ve used it with 3GB files).
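The idea of reading in chunks can be sketched roughly as follows. This is not the script from the linked post, only a minimal illustration of the technique; the function name count_words, the chunk size, and the tokenisation rule (lower-cased runs of letters) are all assumptions made for the example.

```python
import re
from collections import Counter

LETTER = r"[^\W\d_]"                   # any Unicode letter
TOKEN = re.compile(LETTER + "+")       # a word: one or more letters
TAIL = re.compile(LETTER + r"*\Z")     # letters at the very end of a string


def count_words(path, chunk_size=1 << 20, encoding="utf-8"):
    """Count lower-cased words in a text file, reading chunk_size characters at a time."""
    counts = Counter()
    tail = ""  # possibly incomplete word carried over from the previous chunk
    with open(path, encoding=encoding) as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            text = tail + chunk
            # Hold back trailing letters: the word may continue in the next chunk.
            cut = TAIL.search(text).start()
            tail = text[cut:]
            counts.update(w.lower() for w in TOKEN.findall(text[:cut]))
    # Whatever is left after the last chunk is a complete word (or empty).
    counts.update(w.lower() for w in TOKEN.findall(tail))
    return counts
```

Only one chunk plus the running counts are held in memory at a time, so memory use depends on the vocabulary size rather than on the file size. Calling count_words("manifesto.txt").most_common(20), with a hypothetical file name, would list the twenty most frequent words.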
Thanks for sharing! That’s certainly useful to know. Have you used the tm package in R? It might also work with very large text files.
No, I’m doing everything in Python. Have you tried tm? What’s your impression of it?
For anyone looking for Thiago Marzagão’s Python script, here’s an archive copy: https://web.archive.org/web/20140411070348/http://thiagomarzagao.com:80/2013/06/10/word-counting-in-python/
I haven’t really used tm myself.
Given that I still get hits on this post, https://cran.r-project.org/package=quanteda might be of interest to some readers…
Hi Didier,
I am trying the older version of JFreq to avoid the read.table error, but it has switched my rows and columns. How can I now access the reference text files and the virgin text files?
The getdocs command now returns a word rather than a document. Could you please help me understand how the code needs to change for the older version of JFreq?
For example: ref <- getdocs(a, c(4,3)) # How can I change this line so that it selects docs and not words?
You’ll need to change the code further up, when you create the wfm object: a <- wfm("frequencies.csv", word.margin=1) # not word.margin=2
You might also be interested in https://cran.r-project.org/package=quanteda
Thank you so much Didier!
Hi Didier,
Is there any way to extract the scorable words out of the total words? The code is:
predict(ws, newdata=vir)
Output: 5551 of 101159 words (5.49%) are scorable
How can I extract these 5551 words?
If your Wordscores model is called “ws”, I guess what you’re looking for is in ws$pi (use str(ws) to explore the structure of the model, or of any R object for that matter). The total words are in ws$data.
Yes, it worked. Thank you again Didier.