Custom Tables of Descriptive Statistics in R

Here’s how we can quite easily and flexibly create tables of descriptive statistics in R. Of course, we could simply use summary(variable_name), but this is not what you’d include in a manuscript, and so not what you want when compiling a document with knitr/R Markdown.

First, we identify the variables we want to summarize. Often our database includes many more variables:

vars <- c("variable_1", "variable_2", "variable_3")

Note that these are the variable names in quotes. Second, we use lapply() to calculate whatever summary statistic we want. This is where the flexibility kicks in: have you ever tried to include an interpolated median in such a table? In R it’s just as easy as the mean (a sketch follows below). Here’s an example with the mean, minimum, maximum, and median:

v_mean <- lapply(dataset[vars], mean, na.rm=TRUE)
v_min <- lapply(dataset[vars], min, na.rm=TRUE)
v_max <- lapply(dataset[vars], max, na.rm=TRUE)
v_med <- lapply(dataset[vars], median, na.rm=TRUE)
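
Since the interpolated median came up: it slots in just as easily. As a sketch, here with interp.median() from the psych package (this assumes psych is installed; any function that takes a numeric vector and returns a single value works the same way):

v_imed <- lapply(dataset[vars], psych::interp.median)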

Too many digits? We can use round() to get rid of them. There’s actually an argument ‘digits’ in the kable() command we’ll use in a minute that in principle allows rounding at the very end, but unfortunately it often fails for me. Rounding:

v_mean <- round(as.numeric(v_mean), 2)

Now we only need to bring the different summary statistics together. Convert v_min, v_max, and v_med with as.numeric() in the same way first, so that cbind() returns a plain numeric matrix rather than a matrix of lists:

v_tab <- cbind(mean=v_mean, min=v_min, max=v_max, median=v_med)

And add useful variable labels:

rownames(v_tab) <- c("Variable 1", "A description of variable 2", "Variable 3")

and we use kable() from the knitr package to generate a decent table:

kable(v_tab)

If this looks complicated, bear in mind that with no additional work you can change the order of the variables and include any summary statistics. That’s table A1 in the appendix sorted.
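
Putting everything together, here is a minimal end-to-end sketch using the built-in mtcars data; the variable selection and row labels are purely illustrative:

library(knitr)

vars <- c("mpg", "hp", "wt")

# compute, convert to numeric, and round each statistic in one go
v_mean <- round(as.numeric(lapply(mtcars[vars], mean, na.rm=TRUE)), 2)
v_min <- round(as.numeric(lapply(mtcars[vars], min, na.rm=TRUE)), 2)
v_max <- round(as.numeric(lapply(mtcars[vars], max, na.rm=TRUE)), 2)
v_med <- round(as.numeric(lapply(mtcars[vars], median, na.rm=TRUE)), 2)

v_tab <- cbind(mean=v_mean, min=v_min, max=v_max, median=v_med)
rownames(v_tab) <- c("Miles per gallon", "Horsepower", "Weight (1000 lbs)")
kable(v_tab)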

CATMA does TEI

I have recently explored open-source approaches to computer-assisted qualitative data analysis (CAQDA). As is common with open-source software, there are several options available; but, as is also often the case, few of them can keep up with the commercial packages, and some have been abandoned.

Here I wanted to highlight just three options.

RQDA is built on top of R, which is perhaps not the most obvious choice, but one that can have advantages. The documentation is steadily improving, making it more apparent that RQDA has the main features we’ve come to expect from CAQDA software. I find it a bit fiddly, though, with the many windows it tends to open, especially when working on a small screen.

Colloquium is Java-based, which makes it run almost everywhere. It offers a rather basic feature set, and tags can only be assigned to lines (which also implies that lines are the unit of analysis). Where it shines, though, is how it enables working in two languages in parallel.

CATMA is web-based, but runs without Flash, so it should run pretty much anywhere. It offers basic manual and automatic coding, but there’s one feature we should really care about: CATMA does TEI. This means that CATMA offers a standardized XML export that should remain usable in the future and facilitate sharing the documents along with the accompanying coding. That’s quite exciting.

What I find difficult to judge at the moment is whether TEI will be adopted by other CAQDA software. Atlas.ti does some XML, but as far as I know it’s not TEI. And would TEI be more useful to future researchers than an SQLite database like the one RQDA produces?

Quantitative Social Science: An Introduction

Kosuke Imai has recently published a great introduction: Quantitative Social Science: An Introduction. Finally, a statistics and data analysis book that has arrived in the present! Yes, we can get away with very little mathematics and still do quantitative analysis. Yes, examples from published work are much more interesting than constructed toy examples. Yes, R can be accessible. Yes, we can talk about causality, measurement, and prediction (even Bayes) before getting to hypothesis testing. Yes, we can work with text and spatial data.

Wordscores and JFreq – an update

An old post of mine on using JFreq and Wordscores in R still gets frequent hits. For some documents, the current version of JFreq doesn’t work as well as the old one (which you can find here [I’m just hosting this, all credit to Will Lowe]). For even longer documents, there is a Python script by Thiago Marzagão archived here (I have never tried it). And then there is quanteda, the new R package that also does Wordscores.
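
For completeness, here is a minimal sketch of the quanteda route. The texts and reference scores are made-up placeholders, and depending on your version, textmodel_wordscores() lives either in quanteda itself or in the companion quanteda.textmodels package:

library(quanteda)
library(quanteda.textmodels)

# two reference texts with known positions, one 'virgin' text to be scored (NA)
txts <- c(ref_left = "taxes spending welfare state",
          ref_right = "markets liberty enterprise freedom",
          virgin = "taxes markets welfare enterprise")
dfmat <- dfm(tokens(corpus(txts)))

ws <- textmodel_wordscores(dfmat, y = c(-1, 1, NA))
predict(ws)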

Having said this, a recent working paper by Bastiaan Bruinsma and Kostas Gemenis heavily criticizes Wordscores. While their work does not discredit Wordscores as such (merely the quick and easy approach Wordscores advertises, which, depending on your view, is the essence of Wordscores), I prefer to read it as a call to validate Wordscores before it is applied. After all, in some situations it seems to ‘work’ pretty well, as Laura Morales and I show in our recent paper in Party Politics.

Alienating open-source contributors?

Some time ago, I came across a blog post highlighting how open-source contributors can be alienated by maintainers. Tim Jurka describes his unpleasant experience of submitting an updated version of an R package to CRAN. He highlights the short and impersonal messages from CRAN maintainers and an apparent contradiction, and he generally felt alienated by the process. Interestingly, he offers four lessons to be learnt:

– don’t alienate volunteers — everyone in the R community is a volunteer, and it doesn’t benefit the community when you’re unnecessarily rude.
– understand volunteers have other commitments — while the core R team is doing an excellent job building a statistical computing platform, not everyone can make the same commitment to an open-source project.
– open-source has limited resources — every contribution helps.
– be patient — not everyone can operate on the same level, and new members will need to be brought up to speed on best practices.

I guess everyone would sign up to this, but oddly enough my experience with the team running CRAN has always been of the kind Tim Jurka cites as a positive example: brief, but courteous. What is definitely missing from the blog post, though, is an appreciation that the people running R and CRAN are volunteers, too!