Visualize correlations in R

There are few cases where a graphic is not better than a table of numbers at helping us understand our quantitative results. A simple yet common table we stare at every so often is the table of correlation coefficients: how strongly do different variables correlate with one another? We scan such tables for numbers close to +1 and close to -1, but there’s a better way: visualize!

The R package corrplot offers a ready-made solution:

library(corrplot)
dat=matrix(c(0.11128257, -0.38968561, 0.11765272, -0.07089879, -0.19715366, -0.48083950, 0.54760745, -0.49410370, -0.42443391), nrow=3)
corrplot(dat)

Here we load the corrplot package and create some data so that we have something to plot; normally this would be the correlation matrix of a selection of variables. Then we simply call corrplot() and we’re done.
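With real data, the matrix would usually come from cor(). Here is a minimal sketch of that step, assuming the built-in mtcars data as a stand-in for your own variables; the choice of variables and the plotting options are only for illustration:

```r
# Correlation matrix for a few variables of the built-in mtcars data
vars <- c("mpg", "hp", "wt", "qsec")
M <- cor(mtcars[, vars])
round(M, 2)

# Plot it (only if corrplot is installed); show the upper triangle
# without the diagonal, since the matrix is symmetric
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(M, method = "circle", type = "upper", diag = FALSE)
}
```

Unlike the made-up numbers above, a matrix from cor() is symmetric with ones on the diagonal, so showing only one triangle loses nothing.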

There are many ways to tweak the plots, but in all versions we get a quicker and better overview of the variables that correlate than by staring at a large table.

Here are some variants of the above:

par(mfrow=c(2,2))
corrplot(dat, method = "shade")
corrplot(dat, diag=FALSE)
corrplot(dat, method = "square")
corrplot(dat, method = "number")

rtweet with Premium to search the archive

The R package rtweet does a great job of connecting R to Twitter. If you want to look beyond the past 7 days, Twitter offers two additional APIs (with different syntax).

If you access Twitter archives with rtweet and have a Premium subscription on Twitter, the current version of rtweet sends requests in batches of n=100, but Premium (currently) allows batches of up to n=500. This means you use 5 requests where 1 would suffice. Kevin Taylor has provided a fix for this, which he also mentioned in the issues of rtweet. Using the fix is easy (much easier than the description in the issues thread suggests):

library(devtools)
install_github("kevintaylor/rtweet")

This will replace any installed version of rtweet. You probably want this version if you’re on Twitter Premium; for the free Sandbox, n=100 is correct. Perhaps this is why rtweet has not implemented the fix yet?
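To see what the batch size means for a larger download, here is a back-of-the-envelope sketch. The helper requests_needed() and the figure of 5000 tweets are made up for illustration; they are not part of rtweet:

```r
# Number of requests needed to fetch n_tweets at a given batch size
# (hypothetical helper, not an rtweet function)
requests_needed <- function(n_tweets, batch_size) {
  ceiling(n_tweets / batch_size)
}

n_tweets <- 5000                   # made-up size of the download
requests_needed(n_tweets, 100)     # batches of 100: 50 requests
requests_needed(n_tweets, 500)     # batches of 500: 10 requests
```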

Image credit: CC-by-nc by diarnst

Contour Plot Breaks Off?

Today I experimented with the good old contour plots in R. I plotted my points rather large, because there is quite some uncertainty around their precise placement. In this particular case, I start with an empty plot and a custom range, and add the points separately. Note the cex=8 to draw extra large points.

plot(c(80, 740), c(180, 740) , type='n', xlab="", ylab="", bty="n", main="")
points(jitter(x), jitter(y), cex=8, pch=19, col="#AA449950")

Then I added contours, and they were cut off, breaking off where I expected them to go around the dots. Why are there incomplete lines at the top and bottom?

It turns out (a.k.a. read the manual) that kde2d sets the default limits to the range of the data, lims = c(range(x), range(y)), which I guess is quite reasonable in other cases. Now my big dots obviously cover more than the strict range of values, so all I needed to do was set my own lims in kde2d.

Here’s the entire code for the plot:
plot(c(80, 740), c(180, 740) , type='n', xlab="", ylab="", bty="n", main="")
points(jitter(x), jitter(y), cex=8, pch=19, col="#AA449950")
library(MASS)
# z = kde2d(x, y, n=50) # this one didn't work out
z = kde2d(x, y, n=50, lims=c(80, 740, 180, 740))
contour(z, drawlabels=FALSE, nlevels=6, col="#AA4499", add=TRUE)
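
If you want to try this without my data, here is a self-contained sketch with simulated points (the rnorm() values are made up and are not the data from the plot above). It draws both versions on top of each other: the default grid as dashed grey lines, the widened grid in colour.

```r
library(MASS)  # provides kde2d()

set.seed(1)    # simulated points, only for illustration
x <- rnorm(40, mean = 400, sd = 120)
y <- rnorm(40, mean = 450, sd = 100)

# Default: the density grid stops at the most extreme data points
z_default <- kde2d(x, y, n = 50)
# Custom: the density grid covers the whole plotting region
z_wide <- kde2d(x, y, n = 50, lims = c(80, 740, 180, 740))

range(z_default$x)  # identical to range(x)
range(z_wide$x)     # spans 80 to 740

plot(c(80, 740), c(180, 740), type = "n", xlab = "", ylab = "", bty = "n")
points(x, y, cex = 8, pch = 19, col = "#AA449950")
contour(z_default, drawlabels = FALSE, nlevels = 6, col = "grey50", lty = 2, add = TRUE)
contour(z_wide, drawlabels = FALSE, nlevels = 6, col = "#AA4499", add = TRUE)
```

The dashed contours end wherever the default grid ends, which is exactly the cut-off effect described above.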

Getting Qualtrics data into R

Data collected in Qualtrics come in a funny way when exported to CSV: the first two lines are headers. Simply using read.csv() will mess things up, because it expects only one header line. We can skip lines at the beginning (the skip argument), but there is no immediately obvious way to skip only the second line.

Of course there is an R package for that, but when I tried, the qualtRics package was very slow:

devtools::install_github("ropensci/qualtRics")
library(qualtRics)
raw_data <- readSurvey("qualtrics_survey.csv")

library(qualtRics)
raw_data <- readSurvey("qualtrics_survey_legacy.csv", legacyFormat=TRUE) # if two rows at the top

As an alternative, you could import just the header of your survey, and then join it to an import where you skip the header lines. Actually, here’s a better way of doing just this:

everything = readLines("qualtrics_survey_legacy.csv")
wanted = everything[-2]
mydata = read.csv(textConnection(wanted), header = TRUE, stringsAsFactors = FALSE)
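
For comparison, the two-step import mentioned above (header first, then the data with the header lines skipped) could be sketched like this. The temporary file and its contents are made up to stand in for a legacy Qualtrics export:

```r
# Build a small stand-in for a legacy Qualtrics export (made-up data):
# line 1 holds the variable names, line 2 the question texts
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "ResponseID,Q1,Q2",
  "ResponseID,How old are you?,How satisfied are you?",
  "R_001,34,4",
  "R_002,51,2"
), tmp)

# Step 1: read just enough of the file to get the variable names
header <- names(read.csv(tmp, nrows = 1, header = TRUE))

# Step 2: skip both header lines and attach the names from step 1
mydata <- read.csv(tmp, skip = 2, header = FALSE,
                   col.names = header, stringsAsFactors = FALSE)

str(mydata)
unlink(tmp)  # clean up the temporary file
```

Because the question texts never reach the data rows, the numeric columns come out as numbers rather than as character strings.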

If you get an error “EOF within quoted string”, don’t ignore it: it indicates problems with double quotes in the file, so add quote = "" to your import code.

If you are willing to violate the principle of not touching the raw data file, you could open the survey in a spreadsheet like Excel or LibreOffice Calc and delete the unwanted rows.

Given all these options, the way I found most reliable (as in: contrary to the above, it hasn’t failed me so far) to get Qualtrics data into R is yet another one:
1. export as SPSS (rather than CSV)
2. use library(haven)
3. read_spss()
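
In code, steps 2 and 3 might look like the sketch below. Since I can’t ship a Qualtrics export here, it first writes a tiny made-up .sav file with haven’s write_sav() as a stand-in; with a real export you would pass your own file name to read_spss() instead.

```r
if (requireNamespace("haven", quietly = TRUE)) {
  # Stand-in for a Qualtrics SPSS export: a tiny .sav file with made-up data
  sav_file <- tempfile(fileext = ".sav")
  haven::write_sav(data.frame(id = 1:3, satisfaction = c(4, 2, 5)), sav_file)

  # The actual import: a single call, no header rows to worry about
  mydata <- haven::read_spss(sav_file)
  print(mydata)

  # Optional: convert any labelled variables into ordinary R factors
  mydata <- haven::as_factor(mydata)

  unlink(sav_file)  # clean up the temporary file
}
```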