Counting Articles in Nexis Uni

We wanted to count the number of articles published in specific newspapers for a number of keywords, as a measure of salience. So we have a list of keywords (e.g. Boris Johnson, Jeremy Corbyn, Jo Swinson, Caroline Lucas), a date range (e.g. 1 January 2017 to 31 December 2017), and Nexis Uni to get the articles. In this case, we were not interested in the contents of the articles, so I downloaded only the metadata (headline, publication, an empty “summary” column, and the date of publication). Nexis Uni gives me one spreadsheet per keyword. With some 30 keywords, I did not want to do this by hand.

What we wanted in the end was a spreadsheet with one row for each date in the date range and one column per keyword, containing the number of articles on that day. Here’s how I did this in R:

# pattern is a regular expression, not a shell wildcard
sheets = list.files(pattern="\\.XLSX$")

This gives me a list of all the XLSX files in the folder. I use library(readxl) because it imports the dates properly, unlike some of the other options to open XLSX files in R.

library(readxl)

for (i in seq_along(sheets)) {
  # creates N1, N2, ... with one data frame per spreadsheet
  assign(paste0("N", i), read_excel(sheets[i]))
}

Here I chose a loop to read in the data from the Nexis spreadsheets, each into a separate container. I guess I could have used a list or something, but for this project speed was no concern.
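For comparison, here is a minimal sketch of what the list variant could look like. The name articles is made up for this example, and ignore.case is only added in case the export uses a lower-case extension.

```r
library(readxl)

# Find the exported spreadsheets (pattern is a regular expression).
sheets <- list.files(pattern = "\\.XLSX$", ignore.case = TRUE)

# Read every file into one list; element i corresponds to sheets[i].
articles <- lapply(sheets, read_excel)

# Name the elements after the files so each keyword can be looked up by name.
names(articles) <- sheets
```

With this, articles[[1]]$Date would take the place of N1$Date, and later steps can loop over the list directly instead of building variable names with paste().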

Next, we need a list of dates in the date range. seq() on its own already produces the dates, but to make them match the date format that read_excel() returns for the Excel documents, I wrap everything in as.POSIXlt().

dat = as.POSIXlt(seq(from=as.Date("2017-01-01"), to=as.Date("2017-12-31"), by=1))

Now comes the actual work. First I create a vector with the dates of the articles in the Nexis spreadsheet for each keyword. Some dates in the date range have no articles, others have one, others still have more than one. I then use sapply() to match the dates, and colSums() to count the number of articles for a given day.

for (i in seq_along(sheets)) {
  # D: publication dates of the articles for keyword i
  assign(paste0("D", i), get(paste0("N", i))$Date)
  # S: TRUE/FALSE matrix, one row per article and one column per day
  assign(paste0("S", i), sapply(dat, function(x) get(paste0("D", i)) == x))
  # H: number of articles per day
  assign(paste0("H", i), colSums(get(paste0("S", i))))
}

At this stage I regret not using a list or data frame, but this code combines the different variables:

hits = data.frame(dat)
for (i in seq_along(sheets)) {
  # append the daily counts for keyword i as a new column
  hits = cbind(hits, get(paste0("H", i)))
}

And then I add the names of the sheets as column names, and I have a data frame I can export with write.csv() or whatever.

colnames(hits) = c("date", sheets)
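Looking back, the counting step can be sketched more compactly with a list and table(). This is not the code I ran: the keyword names and dates below are made up, the date range is shortened to four days, and the dates are plain Date values rather than what read_excel() returns (for real data, something like as.Date(N1$Date) would probably be needed first).

```r
# Made-up publication dates for two hypothetical keywords, standing in for
# the Date columns of the Nexis spreadsheets.
dates <- list(
  keyword_a = as.Date(c("2017-01-01", "2017-01-01", "2017-01-03")),
  keyword_b = as.Date("2017-01-02")
)

# Every day in the (shortened) date range.
days <- seq(from = as.Date("2017-01-01"), to = as.Date("2017-01-04"), by = 1)

# Count the articles per day; days without articles get a zero because
# every day in the range is a factor level.
count_per_day <- function(d) {
  as.integer(table(factor(as.character(d), levels = as.character(days))))
}

# One row per day, one column per keyword.
hits <- data.frame(date = days, lapply(dates, count_per_day))
print(hits)
```

Because the counts come from table() rather than colSums(), this also copes with a keyword that has only a single article, where sapply() would hand colSums() a plain vector instead of a matrix.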

Quickly Count Cases in Excel Using =subtotal()

Here’s a quick way to count the number of cases in subgroups in Excel. Let’s say you have an Excel sheet with information on survey respondents, and you want to count how many women and how many men there are. The first thing is to add a filter to the data. You can simply select the header and click Filter. The second thing is to add a little formula to do the counting. I usually put this in the first row, right next to the header. In that cell, type =SUBTOTAL(3,A2:A5). SUBTOTAL summarizes a range, here by counting (it can also do other things like averages), and it leaves out rows that a filter has hidden. The 3 at the beginning means counting non-empty cells. The range (here A2:A5) indicates what to count; it should include the entire data, so if there are 10,000 cases in the data, the range would probably be A2:A10001. Press Enter, and the cell shows the number of cases.

Now, obviously we didn’t need a fancy formula for just this. We could, for example, simply have selected the first column and checked the status bar, which reads Count: 5, giving us 4 cases once we take the header into account. But we wanted to know how many women there are. Now we can use the filter: click on the small arrow next to Gender and choose only F. Once you click OK, Excel filters accordingly, and the cell with our count is updated. We could have clicked on the column to check the count, but guess what, we can copy the value the formula gives: no more typos, no more forgetting to subtract 1 because of the header, and fewer clicks, too. We can also combine multiple filters, for example to find out how many women called Berta there are.

I find this useful for a quick look; for more complicated tables, a decent statistical program will be easier to use.
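For what it’s worth, the same counts take only a few lines in R. This is just a sketch with made-up respondents; the column names Name and Gender are assumptions about what such a sheet might look like.

```r
# Made-up survey respondents; names and values are for illustration only.
respondents <- data.frame(
  Name   = c("Anna", "Berta", "Carl", "Dora"),
  Gender = c("F", "F", "M", "F")
)

# Number of cases per gender (what filtering on Gender shows in Excel).
by_gender <- table(respondents$Gender)
print(by_gender)

# Combining two conditions, like using two filters at once.
berta_women <- sum(respondents$Gender == "F" & respondents$Name == "Berta")
print(berta_women)
```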