Getting Qualtrics data into R

Data collected in Qualtrics come in a funny way when exported to CSV: the first two lines are headers. Simply using read.csv() will mess things up, because typically we only have one line as header. We can skip empty lines at the beginning, but there is no immediately obvious way to skip only the second line.

Of course there is an R package for that, but when I tried, the qualtRics package was very slow:

devtools::install_github("ropensci/qualtRics")
library(qualtRics)
raw_data <- readSurvey("qualtrics_survey.csv")

library(qualtRics)
raw_data <- readSurvey("qualtrics_survey_legacy.csv", legacyFormat=T) # if two rows at the top

As an alternative, you could import just the header of your survey, and then join it to an import where you skip the header lines. Actually, here’s a better way of doing just this:

everything = readLines("qualtrics_survey_legacy.csv")
wanted = everything[-2]
mydata = read.csv(textConnection(wanted), header = TRUE, stringsAsFactors = FALSE)

If you get an error “EOF within quoted string”, don’t ignore it: It indicates problems with double quoting, so add quote = "" to your import code.

If you are willing to violate the principle of not touching the raw data file, you could open the survey in a spreadsheet like Excel or LibreOffice Calc and delete the unwanted rows.

Given all these options, I found the most reliable way (as in: contrary to the above, it hasn’t failed me so far) to get Qualtrics data into R yet another one:
1. export as SPSS (rather than CSV)
2. use library(haven)
3. read_spss()

Counting Articles in Nexis Uni

We wanted to count the number of articles on a number of keywords that were published in specific newspapers. This is a measure of salience. So we have a list of keywords (e.g. Boris Johnson, Jeremy Corbyn, Jo Swinson, Caroline Lucas), a date range (e.g. 1 January 2017 to 31 December 2017), and Nexis Uni to get the articles. In this case, we were not interested in the contents of the articles, so I downloaded the meta data (headline, publication, an empty “summary” column, and the date of the publication). Nexis Uni gives me a spreadsheet per keyword. With some 30 keywords, I did not want a manual approach.

What we wanted in the end was a spreadsheet, with all dates in the date range and the number of articles for each keyword as a column. Here’s how I did this in R:

sheets = list.files(pattern="*.XLSX")

This gives me a list of all the XLSX files in the folder. I use library(readxl) because it imports the dates properly, unlike some of the other options to open XLSX files in R.

library(readxl)

for (i in 1:length(sheets)) {
assign(paste("N", i, sep=""), read_excel(sheets[i]))
}

Here I chose a loop to read in the data from the Nexis spreadsheets, each into a separate container. I guess I could have used a list or something, but for this project speed was no concern.

Next, we need a list of dates in the date range. The seq() already works, but to make this match the date format from the Excel documents, I need to wrap everything in the as.POSIXlt.

dat = as.POSIXlt(seq(from=as.Date("2017-01-01"), to=as.Date("2017-12-31"), by=1))

Now comes the actual work. First I create a vector with the dates of the articles in the Nexis spreadsheet for each keyword. Some dates in the date range have no articles, others have one, others still have more than one. I then use sapply() to match the dates, and colSums() to count the number of articles for a given day.

for (i in 1:length(sheets)) {
assign(paste("D", i, sep=""), eval(parse(text=paste("N", i, "$Date", sep=""))))
assign(paste("S", i, sep=""), eval(parse(text=paste("sapply(dat, function(x) D", i, "== x)", sep=""))))
assign(paste("H", i, sep=""), eval(parse(text=paste("colSums(S", i, ")", sep=""))))
}

At this stage I regret not using a list or data frame, but this code combines the different variables:

hits = data.frame(dat)
for (i in 1:length(sheets)) {
hits = cbind(hits, eval(parse(text=paste("H", i, sep=""))))
}

And then I add the names of the sheets and I have a data frame I can export with write.csv() or whatever.

colnames(hits) = c("date", sheets)

Anonymizing Microdata

In the age of datalinkage, protecting microdata is as relevant as ever. Fortunately, there are R packages available to help:

That’s another excuse for not sharing data busted.

R-Forge: File ‘/pkg/DESCRIPTION’ is out of date

I came across a puzzling “File ‘/pkg/DESCRIPTION’ is out of date” error when updating an R-package of mine on R-Forge today. I was pretty sure the version number and date were correct. I couldn’t figure out any other issue, so I made a copy of my documents, checked out the code again, and copied the modified files back. All worked. I’m still a bit puzzled as I only use one machine for this, but who cares if the code is working…