We wanted to count the number of articles on a number of keywords that were published in specific newspapers. This is a measure of salience. So we have a list of keywords (e.g. Boris Johnson, Jeremy Corbyn, Jo Swinson, Caroline Lucas), a date range (e.g. 1 January 2017 to 31 December 2017), and Nexis Uni to get the articles. In this case, we were not interested in the contents of the articles, so I downloaded the meta data (headline, publication, an empty “summary” column, and the date of the publication). Nexis Uni gives me a spreadsheet per keyword. With some 30 keywords, I did not want a manual approach.
What we wanted in the end was a spreadsheet, with all dates in the date range and the number of articles for each keyword as a column. Here’s how I did this in R:
sheets = list.files(pattern="*.XLSX")
This gives me a list of all the XLSX files in the folder. I use library(readxl) because it imports the dates properly, unlike some of the other options to open XLSX files in R.
library(readxl)
for (i in 1:length(sheets)) {
assign(paste("N", i, sep=""), read_excel(sheets[i]))
}
Here I chose a loop to read in the data from the Nexis spreadsheets, each into a separate container. I guess I could have used a list or something, but for this project speed was no concern.
Next, we need a list of dates in the date range. The seq() already works, but to make this match the date format from the Excel documents, I need to wrap everything in the as.POSIXlt.
dat = as.POSIXlt(seq(from=as.Date("2017-01-01"), to=as.Date("2017-12-31"), by=1))
Now comes the actual work. First I create a vector with the dates of the articles in the Nexis spreadsheet for each keyword. Some dates in the date range have no articles, others have one, others still have more than one. I then use sapply() to match the dates, and colSums() to count the number of articles for a given day.
for (i in 1:length(sheets)) {
assign(paste("D", i, sep=""), eval(parse(text=paste("N", i, "$Date", sep=""))))
assign(paste("S", i, sep=""), eval(parse(text=paste("sapply(dat, function(x) D", i, "== x)", sep=""))))
assign(paste("H", i, sep=""), eval(parse(text=paste("colSums(S", i, ")", sep=""))))
}
At this stage I regret not using a list or data frame, but this code combines the different variables:
hits = data.frame(dat)
for (i in 1:length(sheets)) {
hits = cbind(hits, eval(parse(text=paste("H", i, sep=""))))
}
And then I add the names of the sheets and I have a data frame I can export with write.csv() or whatever.
colnames(hits) = c("date", sheets)