Factiva is one of the major newspaper archives out there. Often in research projects, we’d want to have a single text file for each article. Factiva lets you download articles individually, but with some 5 or 6 clicks to save an article, this quickly becomes burdensome if you’re looking at more than a handful of articles.
We simply selected the articles of interest and saved them as HTML (PDF and Word seemed less flexible). After trying about a dozen software solutions for splitting HTML files, it turned out that probably the easiest way to split the files from Factiva is via MS Word and a VBA macro. I have modified a script from here (apparently from here) to do the job in MS Word; my version uses the document name in the output and produces UTF-8 text files.
So, for each of the HTML files containing articles, the following procedure will split them. Open the file in MS Word. Use advanced find to search for bold fonts, and replace with “///^&” (without quotes; this adds three slashes in front of each title, and the “^&” bit simply puts back the text that was found). Hit Alt + F11 to open the VBA editor, add the code mentioned, and hit F5 to run the macro and split the file. You’ll be asked to confirm before sitting back to see your computer do all the work.
It’s still dull, but at least it is now feasible to work with more than a handful of articles. Yes, if I knew Python…
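For readers who do use Python, the splitting step itself can be sketched in a few lines. This is not the VBA macro referred to above, only a minimal illustration of the same idea. It assumes the document has already been through the find-and-replace step in Word and was then saved as a UTF-8 plain-text file, so that each title is preceded by “///”. The function names, the output folder and the file-naming pattern are made up for the example.

```python
from pathlib import Path

# Marker placed in front of each title during the find-and-replace step.
DELIMITER = "///"


def split_articles(text, delimiter=DELIMITER):
    """Return the non-empty pieces of text found between delimiters."""
    pieces = (piece.strip() for piece in text.split(delimiter))
    return [piece for piece in pieces if piece]


def write_articles(source_path, out_dir, delimiter=DELIMITER):
    """Write each article to its own UTF-8 text file.

    Output names reuse the source file name (hypothetical pattern:
    <source name>_001.txt, <source name>_002.txt, ...).
    """
    source = Path(source_path)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    articles = split_articles(source.read_text(encoding="utf-8"), delimiter)
    written = []
    for number, article in enumerate(articles, start=1):
        target = out / f"{source.stem}_{number:03d}.txt"
        target.write_text(article + "\n", encoding="utf-8")
        written.append(target)
    return written
```

Calling write_articles("factiva_export.txt", "articles") (both names are placeholders) would write one file per article into the articles folder. Two caveats: any text before the first marker ends up as a file of its own, and any “///” that occurs inside an article, or in front of bold text that is not a title, will also cause a split, so the output is worth a quick check.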