Still Looking for the Perfect Workflow

I have written about Sweave and odfWeave, but the quest for the perfect workflow continues. For various reasons I prefer to do my statistical analyses in R, and manage my references in Zotero. In my area of research, the most common file format for papers is without doubt Microsoft Word; it’s the principal format for exchanging editable documents, and it’s the principal format most journals seem to want (actually insist on). Moreover, I seek a workflow that keeps my research reproducible.

Here are a few options I regularly use: (1) To me Sweave is an elegant and fast solution. It gives full control over the necessary parts of the output. The downsides are the lack of direct integration with Zotero, the difficulty of collaborating with colleagues who do not use LaTeX, and the need to convert papers for journals, at least once they are through peer review. Yes, I could use LyX to work with Zotero more easily than by exporting the required references to BibTeX (at least Zotero allows dragging references into the .bib file), but this would end the elegance of working in a text file.

(2) In principle odfWeave fits the requirements rather well. It works very well with Zotero, and conversion to Word (or to PDF) is usually quite simple. On the downside, odfWeave is really quite slow, and often chokes when (invisible) formatting gets in the way. I find that in-text code (\Sexpr{}) is particularly prone to breaking. Frequent compiling would be a workaround to catch problems early, but unfortunately the lack of speed makes this more challenging than it first seems. Moreover, I have had documents compile, and then choke the next time I open them, despite not having touched the code in the document. Usually removing the formatting of the section in question helps, but there can be tough cases, for example with spaces that aren’t quite what they seem to be. I find hunting for the reasons why a document does not compile really quite frustrating; at least Sweave is able to give more precise indications of where the problem lies, if there is one.

(3) Something I do frequently when working with others is using R scripts and Word/LibreOffice side by side. While this disconnects the code and the output to some degree, keeping all the code in a single file maintains some degree of reproducibility. Comments in the code are important: why did I do this that way? Often I use a Sweave document to keep all the analysis in one place. A benefit over just commented code is that the compiled PDF has all figures and output, making it easier to pinpoint the code of interest.

(4) Another possibility I sometimes use, usually in combination with R scripts, is using comments in Word to attach the underlying code to the figures and numbers in the document. This works relatively well as long as I am diligent in keeping the code up-to-date. A final version of the document can be created simply by choosing “remove all comments”. This works fine when working with others, but together with many other comments it can lead to crowded documents when track changes are used.

One downside of using Word or LibreOffice is that it’s just too easy to quickly tweak the column width of a table, or manually add an extra reference line to a graph (etc.). This becomes a problem when tables and graphs are updated, and all these small steps have to be done once again. The fact that the link to the code is indirect also means that it is generally difficult to be certain all the numbers have been updated, for example if a miscoded case is fixed in the data.
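One way to limit this problem is to keep the data fix, the table, and the reference line in the script itself, so that rerunning it regenerates everything. The following is only a minimal sketch in R: the data frame, the miscoded value 999, and the file names are all hypothetical and chosen for illustration.

```r
# Hypothetical data with one miscoded case (age entered as 999).
survey <- data.frame(
  id    = 1:6,
  group = c("a", "a", "a", "b", "b", "b"),
  age   = c(34, 51, 999, 45, 62, 28)
)

# Fix the miscoded case in code rather than by hand, so the fix is recorded.
survey$age[survey$age == 999] <- NA

# The table is rebuilt from the data every time the script runs.
# aggregate() with a formula drops rows with missing values by default.
age_by_group <- aggregate(age ~ group, data = survey, FUN = mean)
write.csv(age_by_group, "table_age_by_group.csv", row.names = FALSE)

# The extra reference line is drawn by the script, not added manually.
pdf("fig_age_by_group.pdf", width = 5, height = 4)
boxplot(age ~ group, data = survey, xlab = "Group", ylab = "Age")
abline(h = mean(survey$age, na.rm = TRUE), lty = 2)
invisible(dev.off())
```

The table and figure still have to be brought into the Word or LibreOffice document by hand, so this narrows the gap between code and output rather than closing it.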

So for me, at the moment, there is no single workflow, but different ones, depending on whom I am working with and what the output is expected to be.

6 thoughts on “Still Looking for the Perfect Workflow”

  1. Another update: I find myself using R scripts in parallel to Pandoc or Word files quite a lot these days. Markdown for Pandoc is very easy to write, and with no effort we get the lovely PDF we’ve learned to like. At the same time, with little effort we can create slides for presentations (PDF, but also dynamic HTML). Further still, a Word file (.docx) is only a click away, and unlike all the tools that convert LaTeX to Word, Pandoc’s Word files work like a charm. That’s quite useful when sharing documents with others, or indeed when submitting to journals. Having the code in a separate file rather than in the source document (as we’d do in the case of Sweave or odfWeave) is a potential source of errors/typos, but it gives me more flexibility when writing. To ensure some degree of replicability, it is important to comment the code a great deal (we should be doing this anyway), including notes on which code I’ve used for exploration, and which numbers and figures are actually in the paper. Because a reference to figure 1 is ambiguous over the life course of a paper, substantive labels are important, like “### FIGURE USED IN PAPER: CORRELATION BETWEEN AGE AND ATTITUDES”.
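A minimal sketch of what such a labelled script might look like. The data are simulated purely so that the script runs on its own; the variable names, the file name, and the wording of the labels other than the one quoted above are assumptions for illustration, not taken from an actual paper.

```r
# Hypothetical data, simulated only so that the script is self-contained.
set.seed(1)
survey <- data.frame(age = sample(18:80, 200, replace = TRUE))
survey$attitude <- 5 - 0.02 * survey$age + rnorm(200)

### EXPLORATION ONLY, NOT IN PAPER: DISTRIBUTION OF AGE
summary(survey$age)

### NUMBER USED IN PAPER: CORRELATION BETWEEN AGE AND ATTITUDES
# Rounded here so the value typed into the manuscript matches the script.
r_age_attitude <- round(cor(survey$age, survey$attitude), 2)

### FIGURE USED IN PAPER: CORRELATION BETWEEN AGE AND ATTITUDES
# A substantive file name (an assumption here) survives figure renumbering.
pdf("fig_age_attitudes.pdf", width = 6, height = 4)
plot(attitude ~ age, data = survey, xlab = "Age", ylab = "Attitude score")
abline(lm(attitude ~ age, data = survey))
invisible(dev.off())
```

Searching the script for “USED IN PAPER” then lists everything that has to be checked against the manuscript after the data or the code change.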
