readODS and column specifications

The R package readODS allows you to import ODS spreadsheets to R. It’s slow, but it works. In an attempt to speed up things, I thought providing column types would help. I didn’t find an improvement, but I noticed that the documentation wasn’t really clear (“refer to readr::type_convert to specify cols specification”). It seems like you do need to refer to type_convert to understand how to specify column types, and then feed them to readODS like this:

col_types=cols(VAR1 = "f", VAR2 ="i")

etc.

So an entire call would be:

data = read_ods("spreadsheet.ods", col_names=TRUE, col_types=cols(VAR1 = "f", VAR2 = "i", VAR3= "-"))

Note: I had to explicitly use library(readr) before calling read_ods(), otherwise the cols() function was not available.

Unfortunately, the short approach using col_types=as.col_spec(“fi-“) does not seem to work.

Counting Articles in Nexis Uni

We wanted to count the number of articles on a number of keywords that were published in specific newspapers. This is a measure of salience. So we have a list of keywords (e.g. Boris Johnson, Jeremy Corbyn, Jo Swinson, Caroline Lucas), a date range (e.g. 1 January 2017 to 31 December 2017), and Nexis Uni to get the articles. In this case, we were not interested in the contents of the articles, so I downloaded the meta data (headline, publication, an empty “summary” column, and the date of the publication). Nexis Uni gives me a spreadsheet per keyword. With some 30 keywords, I did not want a manual approach.

What we wanted in the end was a spreadsheet, with all dates in the date range and the number of articles for each keyword as a column. Here’s how I did this in R:

sheets = list.files(pattern="*.XLSX")

This gives me a list of all the XLSX files in the folder. I use library(readxl) because it imports the dates properly, unlike some of the other options to open XLSX files in R.

library(readxl)

for (i in 1:length(sheets)) {
assign(paste("N", i, sep=""), read_excel(sheets[i]))
}

Here I chose a loop to read in the data from the Nexis spreadsheets, each into a separate container. I guess I could have used a list or something, but for this project speed was no concern.

Next, we need a list of dates in the date range. The seq() already works, but to make this match the date format from the Excel documents, I need to wrap everything in the as.POSIXlt.

dat = as.POSIXlt(seq(from=as.Date("2017-01-01"), to=as.Date("2017-12-31"), by=1))

Now comes the actual work. First I create a vector with the dates of the articles in the Nexis spreadsheet for each keyword. Some dates in the date range have no articles, others have one, others still have more than one. I then use sapply() to match the dates, and colSums() to count the number of articles for a given day.

for (i in 1:length(sheets)) {
assign(paste("D", i, sep=""), eval(parse(text=paste("N", i, "$Date", sep=""))))
assign(paste("S", i, sep=""), eval(parse(text=paste("sapply(dat, function(x) D", i, "== x)", sep=""))))
assign(paste("H", i, sep=""), eval(parse(text=paste("colSums(S", i, ")", sep=""))))
}

At this stage I regret not using a list or data frame, but this code combines the different variables:

hits = data.frame(dat)
for (i in 1:length(sheets)) {
hits = cbind(hits, eval(parse(text=paste("H", i, sep=""))))
}

And then I add the names of the sheets and I have a data frame I can export with write.csv() or whatever.

colnames(hits) = c("date", sheets)