6. one of these things is not like the others
=============================================

If you're the sort of person who took a few detours into the history of religion in college, you might be familiar with some of the ways people used to do textual comparison. When pen, paper, and typesetting were what scholars had to work with, they did some fairly sophisticated things in order to expose the relationships between multiple pieces of text.

Here's a book I got in college: _Gospel Parallels: A Comparison of the Synoptic Gospels_, Burton H. Throckmorton, Jr., Ed. It breaks up three books from the New Testament by the stories and themes that they contain, and shows the overlapping sections of each book that contain parallel texts. You can work your way through and see what parts only show up in one book, or in two but not the other, or in all three. Pages are arranged like so:

                 § JESUS DOES SOME STUFF
     ________________________________________________
    |  MAT            |    MAR             |  LUK    |
    |-----------------+--------------------+---------|
    | Stuff           |                    |         |
    |                 | Stuff              |         |
    |                 | Stuff              | Stuff   |
    |                 | Stuff              |         |
    |                 | Stuff              |         |
    |                 |                    |         |

The way I understand it, a book like this one only scratches the surface of the field. Tools like this support a lot of theory about which books copied each other and how, and what other sources they might have copied that we've since lost.

This is some _incredibly_ dry material, even if you kind of dig thinking about the questions it addresses. It takes a special temperament to actually sit poring over fragmentary texts in ancient languages and do these painstaking comparisons. Even if you're a writer or editor and work with a lot of revisions of a text, there's a good chance you rarely do this kind of comparison on your own work, because that shit is _tedious_.

diff
----

It turns out that academics aren't the only people who need tools for comparing different versions of a text. Working programmers, in fact, need to do this _constantly_. Programmers are also happiest when putting off the _actual_ task at hand to solve some incidental problem that cropped up along the way, so by now there are a lot of ways to say "here's how this file is different from this file", or "here's how this file is different from itself a year ago".

Let's look at a couple of shell scripts from an earlier chapter:

    $ cat ../script/okpoems
    #!/bin/bash

    # find all the marker files and get the name of
    # the directory containing each
    find ~/p1k3/archives -name 'meta-ok-poem' | xargs -n1 dirname

    exit 0

    $ cat ../script/findprop
    #!/bin/bash

    if [ ! $1 ]
    then
      echo "usage: findprop "
      exit
    fi

    # find all the marker files and get the name of
    # the directory containing each
    find ~/p1k3/archives -name $1 | xargs -n1 dirname

    exit 0

It's pretty obvious these are similar files, but do we know what _exactly_ changed between them at a glance? It wouldn't be hard to figure out, once. If you wanted to be really certain about it, you could print them out, set them side by side, and go over them with a highlighter.

Now imagine doing that for a bunch of files, some of them hundreds or thousands of lines long.
I've actually done that before, colored markers and all, but I didn't feel smart while I was doing it. This is a job for software.

    $ diff ../script/okpoems ../script/findprop
    2a3,8
    > if [ ! $1 ]
    > then
    >   echo "usage: findprop "
    >   exit
    > fi
    >
    5c11
    < find ~/p1k3/archives -name 'meta-ok-poem' | xargs -n1 dirname
    ---
    > find ~/p1k3/archives -name $1 | xargs -n1 dirname

That's not the most human-friendly output, but it's a little simpler than it seems at first glance. It's basically just a way of describing the changes needed to turn `okpoems` into `findprop`. The string `2a3,8` can be read as "at line 2, add lines 3 through 8". Lines with a `>` in front of them are added. `5c11` can be read as "line 5 in the original file becomes line 11 in the new file", and the `<` line is replaced with the `>` line. If you wanted, you could take a copy of the original file and apply these instructions by hand in your text editor, and you'd wind up with the new file.

A lot of people (me included) prefer what's known as a "unified" diff, because it's easier to read and offers context for the changed lines. We can ask for one of these with `diff -u`:

    $ diff -u ../script/okpoems ../script/findprop
    --- ../script/okpoems 2014-04-19 00:08:03.321230818 -0600
    +++ ../script/findprop 2014-04-21 21:51:29.360846449 -0600
    @@ -1,7 +1,13 @@
     #!/bin/bash

    +if [ ! $1 ]
    +then
    +  echo "usage: findprop "
    +  exit
    +fi
    +
     # find all the marker files and get the name of
     # the directory containing each
    -find ~/p1k3/archives -name 'meta-ok-poem' | xargs -n1 dirname
    +find ~/p1k3/archives -name $1 | xargs -n1 dirname

     exit 0

That's a little longer, and has some metadata we might not always care about, but if you look for lines starting with `+` and `-`, it's easy to read as "added these, took away these". This diff tells us at a glance that we added some lines to complain if we didn't get a command line argument, and replaced `'meta-ok-poem'` in the `find` command with that argument.
Since it shows us some context, we have a pretty good idea where those lines are in the file and what they're for.

What if we don't care exactly _how_ the files differ, but only whether they do?

    $ diff -q ../script/okpoems ../script/findprop
    Files ../script/okpoems and ../script/findprop differ

I use `diff` a lot in the course of my day job, because I spend a lot of time needing to know just how two programs differ. Just as importantly, I often need to know how (or whether!) the _output_ of programs differs. As a concrete example, I want to make sure that `findprop meta-ok-poem` is really a suitable replacement for `okpoems`. Since I expect their output to be identical, I can do this:

    $ ../script/okpoems > okpoem_output
    $ ../script/findprop meta-ok-poem > findprop_output
    $ diff -s okpoem_output findprop_output
    Files okpoem_output and findprop_output are identical

The `-s` just means that `diff` should explicitly tell us if files are the **s**ame. Otherwise, it'd output nothing at all, because there aren't any differences.

As with many other tools, `diff` doesn't very much care whether it's looking at shell scripts or a list of filenames or what-have-you. If you read the man page, you'll find some features geared towards people writing C-like programming languages, but its real specialty is just text files with lines made out of characters, which works well for lots of code, but certainly could be applied to English prose.

Since I have a couple of versions ready to hand, let's apply this to a text with some well-known variations and a bit of a literary legacy. Here's the first day of the Genesis creation narrative in a couple of English translations:

    $ cat genesis_nkj
    In the beginning God created the heavens and the earth. The earth was without
    form, and void; and darkness was on the face of the deep. And the Spirit of
    God was hovering over the face of the waters. Then God said, "Let there be
    light"; and there was light. And God saw the light, that it was good; and God
    divided the light from the darkness. God called the light Day, and the darkness
    He called Night. So the evening and the morning were the first day.

    $ cat genesis_nrsv
    In the beginning when God created the heavens and the earth, the earth was a
    formless void and darkness covered the face of the deep, while a wind from
    God swept over the face of the waters. Then God said, "Let there be light";
    and there was light. And God saw that the light was good; and God separated
    the light from the darkness. God called the light Day, and the darkness he
    called Night. And there was evening and there was morning, the first day.

What happens if we diff them?

    $ diff -u genesis_nkj genesis_nrsv
    --- genesis_nkj 2014-05-11 16:28:29.692508461 -0600
    +++ genesis_nrsv 2014-05-11 16:28:29.744508459 -0600
    @@ -1,6 +1,6 @@
    -In the beginning God created the heavens and the earth. The earth was without
    -form, and void; and darkness was on the face of the deep. And the Spirit of
    -God was hovering over the face of the waters. Then God said, "Let there be
    -light"; and there was light. And God saw the light, that it was good; and God
    -divided the light from the darkness. God called the light Day, and the darkness
    -He called Night. So the evening and the morning were the first day.
    +In the beginning when God created the heavens and the earth, the earth was a
    +formless void and darkness covered the face of the deep, while a wind from
    +God swept over the face of the waters. Then God said, "Let there be light";
    +and there was light. And God saw that the light was good; and God separated
    +the light from the darkness. God called the light Day, and the darkness he
    +called Night. And there was evening and there was morning, the first day.

Kind of useless, right? If a given line differs by so much as a character, it's not the same line.
This highlights the limitations of `diff` for comparing things that

- aren't logically grouped by line
- aren't easily thought of as versions of the same text with some lines changed

We could edit the files into a more logically defined structure, like one-line-per-verse, and try again:

    $ diff -u genesis_nkj_by_verse genesis_nrsv_by_verse
    --- genesis_nkj_by_verse 2014-05-11 16:51:14.312457198 -0600
    +++ genesis_nrsv_by_verse 2014-05-11 16:53:02.484453134 -0600
    @@ -1,5 +1,5 @@
    -In the beginning God created the heavens and the earth.
    -The earth was without form, and void; and darkness was on the face of the deep. And the Spirit of God was hovering over the face of the waters.
    +In the beginning when God created the heavens and the earth,
    +the earth was a formless void and darkness covered the face of the deep, while a wind from God swept over the face of the waters.
     Then God said, "Let there be light"; and there was light.
    -And God saw the light, that it was good; and God divided the light from the darkness.
    -God called the light Day, and the darkness He called Night. So the evening and the morning were the first day.
    +And God saw that the light was good; and God separated the light from the darkness.
    +God called the light Day, and the darkness he called Night. And there was evening and there was morning, the first day.

It might be a little more descriptive, but editing all that text just for a quick comparison felt suspiciously like work, and anyway the output still doesn't seem very useful.

wdiff
-----

For cases like this, I'm fond of a tool called `wdiff`:

    $ wdiff genesis_nkj genesis_nrsv
    In the beginning {+when+} God created the heavens and the [-earth. The-]
    {+earth, the+} earth was [-without form, and void;-] {+a formless void+} and
    darkness [-was on-] {+covered+} the face of the [-deep. And the Spirit of-]
    {+deep, while a wind from+} God [-was hovering-] {+swept+} over the face of
    the waters. Then God said, "Let there be light"; and there was light. And
    God saw [-the light,-] that [-it-] {+the light+} was good; and God
    [-divided-] {+separated+} the light from the darkness. God called the light
    Day, and the darkness [-He-] {+he+} called Night. [-So the-] {+And there
    was+} evening and [-the morning were-] {+there was morning,+} the first day.

Deleted words are surrounded by `[- -]` and inserted ones by `{+ +}`. You can even ask it to spit out HTML tags for insertion and deletion...

    $ wdiff -w '<del>' -x '</del>' -y '<ins>' -z '</ins>' genesis_nkj genesis_nrsv

...and come up with something your browser will render like this:

In the beginning <ins>when</ins> God created the heavens and the <del>earth. The</del> <ins>earth, the</ins> earth was <del>without form, and void;</del> <ins>a formless void</ins> and darkness <del>was on</del> <ins>covered</ins> the face of the <del>deep. And the Spirit of</del> <ins>deep, while a wind from</ins> God <del>was hovering</del> <ins>swept</ins> over the face of the waters. Then God said, "Let there be light"; and there was light. And God saw <del>the light,</del> that <del>it</del> <ins>the light</ins> was good; and God <del>divided</del> <ins>separated</ins> the light from the darkness. God called the light Day, and the darkness <del>He</del> <ins>he</ins> called Night. <del>So the</del> <ins>And there was</ins> evening and <del>the morning were</del> <ins>there was morning,</ins> the first day.

Burton H. Throckmorton, Jr. this ain't. Still, it has its uses.
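
If `wdiff` isn't installed, you can fake a rough version of the same idea with plain `diff` by putting every word on its own line first, so that "line" and "word" mean the same thing. This is only a sketch of that trick, not how `wdiff` itself works. The function names `words_per_line` and `worddiff` are made up for illustration, and it assumes `mktemp -d` is available for a scratch directory.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; neither ships with diff or wdiff.

# Print the words of a file one per line, by turning every run of
# whitespace into a single newline.
words_per_line() {
    tr -s '[:space:]' '\n' < "$1"
}

# Compare two files word by word using plain diff.
# Assumes mktemp -d exists for creating a scratch directory.
worddiff() {
    tmpdir=$(mktemp -d) || return 2
    words_per_line "$1" > "$tmpdir/old"
    words_per_line "$2" > "$tmpdir/new"
    # diff exits 0 when the inputs match and 1 when they differ;
    # hold on to that so we can clean up before passing it along
    status=0
    diff "$tmpdir/old" "$tmpdir/new" || status=$?
    rm -r "$tmpdir"
    return "$status"
}
```

With those defined, `worddiff genesis_nkj genesis_nrsv` should report each changed run of words as its own small hunk. The line numbers in the output are really word positions. It's uglier than `wdiff`'s inline markers, but it gets at the same information.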