readODS and column specifications

The R package readODS allows you to import ODS spreadsheets into R. It’s slow, but it works. In an attempt to speed things up, I thought providing column types would help. It made no noticeable difference, but I did notice that the documentation isn’t really clear (“refer to readr::type_convert to specify cols specification”). It seems you do need to read the type_convert documentation to understand how to specify column types, and then feed them to readODS like this:

col_types = cols(VAR1 = "f", VAR2 = "i")

etc.

So an entire call would be:

data = read_ods("spreadsheet.ods", col_names = TRUE, col_types = cols(VAR1 = "f", VAR2 = "i", VAR3 = "-"))

Note: I had to explicitly use library(readr) before calling read_ods(), otherwise the cols() function was not available.

Unfortunately, the short approach using col_types=as.col_spec("fi-") does not seem to work.
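
Since the documentation points to readr::type_convert(), one way to check what a cols() specification does is to try it on a small character data frame first. This is only a sketch: the data frame raw is made up, and I’m assuming read_ods() treats the specification the same way type_convert() does.

```r
library(readr)

# Made-up data, with every column stored as text
raw <- data.frame(VAR1 = c("a", "b", "a"),
                  VAR2 = c("1", "2", "3"),
                  stringsAsFactors = FALSE)

# "f" = factor, "i" = integer (same shortcuts as in the read_ods() call above)
spec <- cols(VAR1 = "f", VAR2 = "i")
converted <- type_convert(raw, col_types = spec)

# Inspect the resulting column classes
sapply(converted, class)
```

If the classes come out as expected here, the same cols() call can be passed to col_types in read_ods().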

Error: `quantile.haven_labelled()` not implemented.

Today I got an Error: `quantile.haven_labelled()` not implemented on a dataset imported into R via library(haven) when trying to see the results of a simple linear regression model. What I needed was the zap_labels() function, which strips the value labels (and user-defined NA). Then I ran the model on the new dataset, and all was good.

dataset2 = zap_labels(dataset)
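
To see what zap_labels() does, here is a minimal sketch on a made-up data frame. The column names and labels are invented for illustration; a real dataset would come from one of haven’s read functions.

```r
library(haven)

# Made-up data with one labelled column, similar to what haven produces on import
dataset <- data.frame(id = 1:4)
dataset$answer <- labelled(c(1, 2, 2, 1), labels = c(yes = 1, no = 2))

class(dataset$answer)    # includes "haven_labelled"

# Strip the value labels from every labelled column
dataset2 <- zap_labels(dataset)

class(dataset2$answer)   # plain numeric, the labels are gone
```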

Visualize correlations in R

There are few cases where a graphic is not better than a table of figures at helping us understand our quantitative results. A simple yet common example is the table of correlation coefficients: how strongly do different variables correlate with one another? We scan such tables for numbers close to +1 and close to -1, but there’s a better way: visualize!

The R package corrplot offers a ready-made solution:

library(corrplot)
dat=matrix(c(0.11128257, -0.38968561, 0.11765272, -0.07089879, -0.19715366, -0.48083950, 0.54760745, -0.49410370, -0.42443391), nrow=3)
corrplot(dat)

Here we load the corrplot package and create some made-up data so that we have something to plot; normally this would be the correlation matrix of a selection of variables. Then we simply call corrplot() and we’re done.

There are many ways to tweak the plots, but in all versions we get a quicker and better overview of which variables correlate than by staring at a large table.

Here are some variants of the above:

par(mfrow=c(2,2))
corrplot(dat, method = "shade")
corrplot(dat, diag=FALSE)
corrplot(dat, method = "square")
corrplot(dat, method = "number")
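
With real data, the matrix would typically come from cor(). A minimal sketch, assuming the built-in mtcars dataset and an arbitrary selection of four of its variables as a stand-in for your own data:

```r
library(corrplot)

# Pick a few numeric variables (mtcars ships with R; the selection is arbitrary)
vars <- mtcars[, c("mpg", "hp", "wt", "qsec")]

# Pairwise Pearson correlations: a symmetric matrix with 1s on the diagonal
cor_mat <- cor(vars)

# Plot the coefficients as numbers, leaving out the uninformative diagonal
corrplot(cor_mat, method = "number", diag = FALSE)
```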

rtweet with Premium to search the archive

The R package rtweet does a great job of connecting R to Twitter. If you need tweets older than the past 7 days, Twitter offers two additional APIs (with different syntax).

If you access Twitter archives with rtweet and have a Premium subscription on Twitter, the current version of rtweet sends requests in batches of n=100, but Premium (currently) allows batches of up to n=500. This means you use 5 requests where 1 would suffice. Kevin Taylor has provided a fix for this, which he also mentioned in the issues of rtweet. Using the fix is easy (much easier than the description in the issues thread suggests):

library(devtools)
install_github("kevintaylor/rtweet")

This will replace any installed version of rtweet. You probably want this version if you’re on Twitter Premium; for the free Sandbox, n=100 is correct. Perhaps this is why rtweet has not implemented the fix yet?

Image credit: CC-by-nc by diarnst

Contour Plot Breaks Off?

Today I experimented with the good old contour plots in R. I plotted my points rather large, because there is quite some uncertainty around their precise placement. In this particular case, I start with an empty plot and a custom range, and add the points separately. Note the cex=8 to draw extra large points.

plot(c(80, 740), c(180, 740) , type='n', xlab="", ylab="", bty="n", main="")
points(jitter(x), jitter(y), cex=8, pch=19, col="#AA449950")

Then I added contours, and they were cut off, breaking off where I expected them to go around the dots. Why are there incomplete lines at the top and bottom?

It turns out (a.k.a. read the manual) that kde2d sets the default limits to the range of the data: lims = c(range(x), range(y)). I guess this is quite reasonable in other cases. My big dots obviously cover more than the strict range of values, so all I needed to do was set my own lims in kde2d.

Here’s the entire code for the plot:

plot(c(80, 740), c(180, 740) , type='n', xlab="", ylab="", bty="n", main="")
points(jitter(x), jitter(y), cex=8, pch=19, col="#AA449950")
library(MASS)
# z = kde2d(x, y, n=50) # this one didn't work out
z = kde2d(x, y, n=50, lims=c(80, 740, 180, 740))
contour(z, drawlabels=FALSE, nlevels=6, col="#AA4499", add=TRUE)
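
The effect of lims can be checked without plotting anything. A small sketch with simulated points (the x and y here are made up with runif(), not my actual data) comparing the grid that kde2d uses by default with the one obtained from custom limits:

```r
library(MASS)

# Simulated coordinates standing in for the real data
set.seed(1)
x <- runif(40, 100, 720)
y <- runif(40, 200, 720)

# Default: the density grid only spans the range of the data
z_default <- kde2d(x, y, n = 50)

# Custom limits: the grid spans the whole plotting region
z_custom <- kde2d(x, y, n = 50, lims = c(80, 740, 180, 740))

range(z_default$x)  # identical to range(x)
range(z_custom$x)   # 80 740
range(z_custom$y)   # 180 740
```

Because contour() can only draw lines within the grid it is given, the default version stops at the outermost data points, which is where the lines broke off.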