Divide Text Files by Keyword

How can we use the computer to divide a pile of text files into two piles? We have a pile of hundreds of newspaper articles on a specific topic. Now we want to do a content analysis on some of these, namely those that are about protest.

Using grep we can do this quite easily, assuming we have the right keywords to identify protest (or specific kinds of protest). Grep is part of UNIX and Linux systems, and it also ships with macOS. On Windows there is the FINDSTR program, but you can install grep for Windows, too: http://gnuwin32.sourceforge.net.

First, we can test our string of keywords, which is quite useful if you already have a few documents known to be about what you want (here: protest). egrep -l "protest" *.txt lists all text files in the current directory that contain the string protest, so protests and protesters count as hits, too. (egrep is equivalent to grep -E, the spelling that recent versions of GNU grep recommend.) The string of keywords can be made more complex, specifying logical OR, AND, and NOT.
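To make the OR, AND, and NOT combinations concrete, here is a minimal sketch. The three sample files, their contents, and the second keyword (demonstration) are made up for illustration; it also assumes file names without spaces and a grep that supports -L (GNU and BSD grep both do).

```shell
# Work in a throwaway directory with three tiny, hypothetical sample files.
demo=$(mktemp -d)
cd "$demo" || exit 1
printf 'Thousands joined the protest downtown.\n' > a.txt
printf 'A peaceful demonstration, watched by police.\n' > b.txt
printf 'The council debated the budget.\n' > c.txt

# OR: files matching either keyword (-E enables | alternation, -i ignores case).
or_hits=$(grep -E -l -i 'protest|demonstration' *.txt)

# AND: files matching the first keyword, narrowed to those that also match the second.
and_hits=$(grep -l -i 'demonstration' *.txt | xargs grep -l -i 'police')

# NOT: files in which neither keyword occurs (-L lists files without a match).
not_hits=$(grep -E -L -i 'protest|demonstration' *.txt)

# Unquoted on purpose so that several file names end up on one line.
echo "OR: " $or_hits
echo "AND:" $and_hits
echo "NOT:" $not_hits
```

Running the keyword string against a handful of documents you have already read is a quick way to see whether it is too broad or too narrow before sorting the whole pile.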

Now for the pile sorting: here’s the code to copy each hit (i.e. each document that contains the word protest) to a subdirectory called out, which we create beforehand: egrep -l "protest" *.txt | xargs -I {} cp {} out/{}. The xargs part is necessary to make the code work in all cases, including when there are no hits. Otherwise, you could also escape the grep code… What this code does is list all files that contain the word protest and then (via the pipe) copy them to the subdirectory.
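The copy step can be tried out safely on a scratch directory first. The following is only a sketch: the two sample files and the second keyword (strike) are invented, and grep -E stands in for egrep.

```shell
# Scratch directory with one hit and one non-hit (hypothetical content).
demo=$(mktemp -d)
cd "$demo" || exit 1
printf 'A protest took place.\n' > a.txt
printf 'Nothing relevant here.\n' > b.txt
mkdir -p out

# -l prints one matching file name per line; xargs -I {} runs cp once per name.
grep -E -l 'protest' *.txt | xargs -I {} cp {} out/{}

# With a keyword that matches nothing, xargs receives no input lines
# and therefore never calls cp, so no error is raised.
grep -E -l 'strike' *.txt | xargs -I {} cp {} out/{}

ls out
```

Because the files are copied rather than moved, the original pile stays intact, and the command can be re-run with a revised keyword string.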

On a related note, grep can also do basic content analysis using keywords, namely by counting: grep -c "string" *.txt reports, for each file, the number of lines that contain the string. Did you know your computer already had software for content analysis pre-installed…? (Just save the results for analysis: grep -c "string" *.txt > out.txt.)
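A small sketch of the counting step, again on invented sample files. Two details are worth seeing in action: -c counts matching lines rather than individual occurrences, and a results file named out.txt would itself match *.txt on a second run, so this sketch writes to counts.out instead (a name chosen here for illustration).

```shell
# Scratch directory with hypothetical content.
demo=$(mktemp -d)
cd "$demo" || exit 1
printf 'protest, protest and more protest\nno match on this line\nanother protest\n' > a.txt
printf 'budget talks\n' > b.txt

# One "filename:count" row per file; the count is the number of matching lines.
grep -c 'protest' *.txt > counts.out
cat counts.out

# To count every occurrence instead, print each match on its own line (-o)
# and count those lines; tr strips the padding some wc versions add.
occurrences=$(grep -o 'protest' a.txt | wc -l | tr -d ' ')
echo "occurrences in a.txt: $occurrences"
```

The rows in the results file use a colon as separator, so they can be read into a spreadsheet or statistics package as two columns.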
