If you're the sort of person who took a few detours into the history of religion in college, you might be familiar with some of the ways people used to do textual comparison. When pen, paper, and typesetting were what scholars had to work with, they did some fairly sophisticated things in order to expose the relationships between multiple pieces of text.
-> <-
Here's a book I got in college: Gospel Parallels: A Comparison of the Synoptic Gospels, Burton H. Throckmorton, Jr., Ed. It breaks up three books from the New Testament by the stories and themes that they contain, and shows the overlapping sections of each book that contain parallel texts. You can work your way through and see what parts only show up in one book, or in two but not the other, or in all three. Pages are arranged like so:
§ JESUS DOES SOME STUFF ________________________________________________ | MAT | MAR | LUK | |-----------------+--------------------+---------| | Stuff | | | | | Stuff | | | | Stuff | Stuff | | | Stuff | | | | Stuff | | | | | |
The way I understand it, a book like this one only scratches the surface of the field. Tools like this support a lot of theory about which books copied each other and how, and what other sources they might have copied that we've since lost.
This is some incredibly dry material, even if you kind of dig thinking about the questions it addresses. It takes a special temperament to actually sit poring over fragmentary texts in ancient languages and do these painstaking comparisons. Even if you're a writer or editor and work with a lot of revisions of a text, there's a good chance you rarely do this kind of comparison on your own work, because that shit is tedious.
It turns out that academics aren't the only people who need tools for comparing different versions of a text. Working programmers, in fact, need to do this constantly. Programmers are also happiest when putting off the actual task at hand to solve some incidental problem that cropped up along the way, so by now there are a lot of ways to say "here's how this file is different from this file", or "here's how this file is different from itself a year ago".
Let's look at a couple of shell scripts from an earlier chapter:
$ cat ../script/okpoems
#!/bin/bash
# find all the marker files and get the name of
# the directory containing each
find ~/p1k3/archives -name 'meta-ok-poem' | xargs -n1 dirname
exit 0
$ cat ../script/findprop
#!/bin/bash
if [ ! $1 ]
then
echo "usage: findprop <property>"
exit
fi
# find all the marker files and get the name of
# the directory containing each
find ~/p1k3/archives -name $1 | xargs -n1 dirname
exit 0
It's pretty obvious these are similar files, but do we know what exactly changed between them at a glance? It wouldn't be hard to figure out, once. If you wanted to be really certain about it, you could print them out, set them side by side, and go over them with a highlighter.
Now imagine doing that for a bunch of files, some of them hundreds or thousands of lines long. I've actually done that before, colored markers and all, but I didn't feel smart while I was doing it. This is a job for software.
$ diff ../script/okpoems ../script/findprop
2a3,8
> if [ ! $1 ]
> then
> echo "usage: findprop <property>"
> exit
> fi
>
5c11
< find ~/p1k3/archives -name 'meta-ok-poem' | xargs -n1 dirname
---
> find ~/p1k3/archives -name $1 | xargs -n1 dirname
That's not the most human-friendly output, but it's a little simpler than it
seems at first glance. It's basically just a way of describing the changes
needed to turn okpoems
into findprop
. The string 2a3,8
can be read as
"at line 2, add lines 3 through 8". Lines with a >
in front of them are
added. 5c11
can be read as "line 5 in the original file becomes line 11 in
the new file", and the <
line is replaced with the >
line. If you wanted,
you could take a copy of the original file and apply these instructions by hand
in your text editor, and you'd wind up with the new file.
A lot of people (me included) prefer what's known as a "unified" diff, because
it's easier to read and offers context for the changed lines. We can ask for
one of these with diff -u
:
$ diff -u ../script/okpoems ../script/findprop
--- ../script/okpoems 2014-04-19 00:08:03.321230818 -0600
+++ ../script/findprop 2014-04-21 21:51:29.360846449 -0600
@@ -1,7 +1,13 @@
#!/bin/bash
+if [ ! $1 ]
+then
+ echo "usage: findprop <property>"
+ exit
+fi
+
# find all the marker files and get the name of
# the directory containing each
-find ~/p1k3/archives -name 'meta-ok-poem' | xargs -n1 dirname
+find ~/p1k3/archives -name $1 | xargs -n1 dirname
exit 0
That's a little longer, and has some metadata we might not always care about,
but if you look for lines starting with +
and -
, it's easy to read as
"added these, took away these". This diff tells us at a glance that we added
some lines to complain if we didn't get a command line argument, and replaced
'meta-ok-poem'
in the find
command with that argument. Since it shows us
some context, we have a pretty good idea where those lines are in the file
and what they're for.
What if we don't care exactly how the files differ, but only whether they do?
$ diff -q ../script/okpoems ../script/findprop
Files ../script/okpoems and ../script/findprop differ
I use diff
a lot in the course of my day job, because I spend a lot of time
needing to know just how two programs differ. Just as importantly, I often
need to know how (or whether!) the output of programs differs. As a concrete
example, I want to make sure that findprop meta-ok-poem
is really a suitable
replacement for okpoems
. Since I expect their output to be identical, I can
do this:
$ ../script/okpoems > okpoem_output
$ ../script/findprop meta-ok-poem > findprop_output
$ diff -s okpoem_output findprop_output
Files okpoem_output and findprop_output are identical
The -s
just means that diff
should explicitly tell us if files are the
same. Otherwise, it'd output nothing at all, because there aren't any
differences.
As with many other tools, diff
doesn't very much care whether it's looking at
shell scripts or a list of filenames or what-have-you. If you read the man
page, you'll find some features geared towards people writing C-like
programming languages, but its real specialty is just text files with lines
made out of characters, which works well for lots of code, but certainly could
be applied to English prose.
Since I have a couple of versions ready to hand, let's apply this to a text with some well-known variations and a bit of a literary legacy. Here's the first day of the Genesis creation narrative in a couple of English translations:
$ cat genesis_nkj
In the beginning God created the heavens and the earth. The earth was without
form, and void; and darkness was on the face of the deep. And the Spirit of
God was hovering over the face of the waters. Then God said, "Let there be
light"; and there was light. And God saw the light, that it was good; and God
divided the light from the darkness. God called the light Day, and the darkness
He called Night. So the evening and the morning were the first day.
$ cat genesis_nrsv
In the beginning when God created the heavens and the earth, the earth was a
formless void and darkness covered the face of the deep, while a wind from
God swept over the face of the waters. Then God said, "Let there be light";
and there was light. And God saw that the light was good; and God separated
the light from the darkness. God called the light Day, and the darkness he
called Night. And there was evening and there was morning, the first day.
What happens if we diff them?
$ diff -u genesis_nkj genesis_nrsv
--- genesis_nkj 2014-05-11 16:28:29.692508461 -0600
+++ genesis_nrsv 2014-05-11 16:28:29.744508459 -0600
@@ -1,6 +1,6 @@
-In the beginning God created the heavens and the earth. The earth was without
-form, and void; and darkness was on the face of the deep. And the Spirit of
-God was hovering over the face of the waters. Then God said, "Let there be
-light"; and there was light. And God saw the light, that it was good; and God
-divided the light from the darkness. God called the light Day, and the darkness
-He called Night. So the evening and the morning were the first day.
+In the beginning when God created the heavens and the earth, the earth was a
+formless void and darkness covered the face of the deep, while a wind from
+God swept over the face of the waters. Then God said, "Let there be light";
+and there was light. And God saw that the light was good; and God separated
+the light from the darkness. God called the light Day, and the darkness he
+called Night. And there was evening and there was morning, the first day.
Kind of useless, right? If a given line differs by so much as a character,
it's not the same line. This highlights the limitations of diff
for comparing
things that
We could edit the files into a more logically defined structure, like one-line-per-verse, and try again:
$ diff -u genesis_nkj_by_verse genesis_nrsv_by_verse
--- genesis_nkj_by_verse 2014-05-11 16:51:14.312457198 -0600
+++ genesis_nrsv_by_verse 2014-05-11 16:53:02.484453134 -0600
@@ -1,5 +1,5 @@
-In the beginning God created the heavens and the earth.
-The earth was without form, and void; and darkness was on the face of the deep. And the Spirit of God was hovering over the face of the waters.
+In the beginning when God created the heavens and the earth,
+the earth was a formless void and darkness covered the face of the deep, while a wind from God swept over the face of the waters.
Then God said, "Let there be light"; and there was light.
-And God saw the light, that it was good; and God divided the light from the darkness.
-God called the light Day, and the darkness He called Night. So the evening and the morning were the first day.
+And God saw that the light was good; and God separated the light from the darkness.
+God called the light Day, and the darkness he called Night. And there was evening and there was morning, the first day.
It might be a little more descriptive, but editing all that text just for a quick comparison felt suspiciously like work, and anyway the output still doesn't seem very useful.
For cases like this, I'm fond of a tool called wdiff
:
$ wdiff genesis_nkj genesis_nrsv
In the beginning {+when+} God created the heavens and the [-earth. The-] {+earth, the+} earth was [-without
form, and void;-] {+a
formless void+} and darkness [-was on-] {+covered+} the face of the [-deep. And the Spirit of-] {+deep, while a wind from+}
God [-was hovering-] {+swept+} over the face of the waters. Then God said, "Let there be light";
and there was light. And God saw [-the light,-] that [-it-] {+the light+} was good; and God
[-divided-] {+separated+}
the light from the darkness. God called the light Day, and the darkness
[-He-] {+he+}
called Night. [-So the-] {+And there was+} evening and [-the morning were-] {+there was morning,+} the first day.
Deleted words are surrounded by [- -]
and inserted ones by {+ +}
. You can
even ask it to spit out HTML tags for insertion and deletion...
$ wdiff -w '<del>' -x '</del>' -y '<ins>' -z '</ins>' genesis_nkj genesis_nrsv
...and come up with something your browser will render like this:
In the beginning when God created the heavens and the
earth. Theearth, the earth waswithout form, and void;a formless void and darknesswas oncovered the face of thedeep. And the Spirit ofdeep, while a wind from Godwas hoveringswept over the face of the waters. Then God said, "Let there be light"; and there was light. And God sawthe light,thatitthe light was good; and Goddividedseparated the light from the darkness. God called the light Day, and the darknessHehe called Night.So theAnd there was evening andthe morning werethere was morning, the first day.
Burton H. Throckmorton, Jr. this ain't. Still, it has its uses.