Getting Qualtrics data into R

Data collected in Qualtrics come in a funny way when exported to CSV: the first two lines are headers. Simply using read.csv() will mess things up, because typically we only have one line as header. We can skip empty lines at the beginning, but there is no immediately obvious way to skip only the second line.

Of course there is an R package for that, but when I tried, the qualtRics package was very slow:

raw_data <- readSurvey("qualtrics_survey.csv")

raw_data <- readSurvey("qualtrics_survey_legacy.csv", legacyFormat=T) # if two rows at the top

As an alternative, you could import just the header of your survey, and then join it to an import where you skip the header lines. Actually, here’s a better way of doing just this:

everything = readLines("qualtrics_survey_legacy.csv")
wanted = everything[-2]
mydata = read.csv(textConnection(wanted), header = TRUE, stringsAsFactors = FALSE)

If you get an error “EOF within quoted string”, don’t ignore it: It indicates problems with double quoting, so add quote = "" to your import code.

If you are willing to violate the principle of not touching the raw data file, you could open the survey in a spreadsheet like Excel or LibreOffice Calc and delete the unwanted rows.

Given all these options, I found the most reliable way (as in: contrary to the above, it hasn’t failed me so far) to get Qualtrics data into R yet another one:
1. export as SPSS (rather than CSV)
2. use library(haven)
3. read_spss()

Counting Articles in Nexis Uni

We wanted to count the number of articles on a number of keywords that were published in specific newspapers. This is a measure of salience. So we have a list of keywords (e.g. Boris Johnson, Jeremy Corbyn, Jo Swinson, Caroline Lucas), a date range (e.g. 1 January 2017 to 31 December 2017), and Nexis Uni to get the articles. In this case, we were not interested in the contents of the articles, so I downloaded the meta data (headline, publication, an empty “summary” column, and the date of the publication). Nexis Uni gives me a spreadsheet per keyword. With some 30 keywords, I did not want a manual approach.

What we wanted in the end was a spreadsheet, with all dates in the date range and the number of articles for each keyword as a column. Here’s how I did this in R:

sheets = list.files(pattern="*.XLSX")

This gives me a list of all the XLSX files in the folder. I use library(readxl) because it imports the dates properly, unlike some of the other options to open XLSX files in R.


for (i in 1:length(sheets)) {
assign(paste("N", i, sep=""), read_excel(sheets[i]))

Here I chose a loop to read in the data from the Nexis spreadsheets, each into a separate container. I guess I could have used a list or something, but for this project speed was no concern.

Next, we need a list of dates in the date range. The seq() already works, but to make this match the date format from the Excel documents, I need to wrap everything in the as.POSIXlt.

dat = as.POSIXlt(seq(from=as.Date("2017-01-01"), to=as.Date("2017-12-31"), by=1))

Now comes the actual work. First I create a vector with the dates of the articles in the Nexis spreadsheet for each keyword. Some dates in the date range have no articles, others have one, others still have more than one. I then use sapply() to match the dates, and colSums() to count the number of articles for a given day.

for (i in 1:length(sheets)) {
assign(paste("D", i, sep=""), eval(parse(text=paste("N", i, "$Date", sep=""))))
assign(paste("S", i, sep=""), eval(parse(text=paste("sapply(dat, function(x) D", i, "== x)", sep=""))))
assign(paste("H", i, sep=""), eval(parse(text=paste("colSums(S", i, ")", sep=""))))

At this stage I regret not using a list or data frame, but this code combines the different variables:

hits = data.frame(dat)
for (i in 1:length(sheets)) {
hits = cbind(hits, eval(parse(text=paste("H", i, sep=""))))

And then I add the names of the sheets and I have a data frame I can export with write.csv() or whatever.

colnames(hits) = c("date", sheets)

Mac *.txt.rtfd to *.txt

In a recent project, an assistant used TextEdit to supposedly save documents as pure (UTF-8) text files. We managed to fix the workflow, but I was left with a bunch of Zip files full of *.rtf from TextEdit. On a Windows or GNU/Linux machine, these files show up as what they are: folders that contain a rich text document (and potentially other stuff). I needed text documents.

After a bit of searching and tweaking, I got the following shell script to convert all the rich text documents in these folders/containers into text documents:

find . -name '*.rtf' -exec unoconv -f txt {} \;

There was a problem, though. The files all had a name containing important meta data. So I had the folder with the name of the file, and inside this folder the file but it was called TXT.txt (converted from TXT.rtf). I’m sure there’s a quick way in a shell script (if you know one, please share it in the comments), but I got stuck with the shell.

Enter LiveCode. Here’s a script that does just that. I guess I could have called the above shell script, but I already had this.

on mouseup
-- INPUT: select a folder with the *.txt.rtfd folders
answer folder "Input: Choose folder:"
put it into infoldername
set the defaultFolder to infoldername
put the folders into listoffolders
-- filter . and .. can cause problems
filter listoffolders without "."
filter listoffolders without ".."
-- OUTPUT: select a destination folder
answer folder "Output: Choose folder:"
put it into outfoldername
repeat with i = 1 to the number of lines of listoffolders
put line i of listoffolders into currentfolder
revCopyFile infoldername & slash & currentfolder &
slash & "TXT.txt", outfoldername & slash & textname
end repeat
end mouseup

Full LiveCode stack here on OSF (it’s nothing more than a button and a text field with a basic log).