A book about the command line for humans.

1. one of these things is not like the others
=============================================

If you're the sort of person who took a few detours into the history of religion in college, you might be familiar with some of the ways people used to do textual comparison. When pen, paper, and typesetting were what scholars had to work with, they did some fairly sophisticated things in order to expose the relationships between multiple pieces of text.

Here's a book I got in college: Gospel Parallels: A Comparison of the Synoptic Gospels, Burton H. Throckmorton, Jr., Ed. It breaks up three books from the New Testament by the stories and themes that they contain, and shows the overlapping sections of each book that contain parallel texts. You can work your way through and see what parts only show up in one book, or in two but not the other, or in all three. Pages are arranged like so:

                 § JESUS DOES SOME STUFF
     ________________________________________________
    |  MAT            |    MAR             |  LUK    |
    |-----------------+--------------------+---------|
    | Stuff           |                    |         |
    |                 | Stuff              |         |
    |                 | Stuff              | Stuff   |
    |                 | Stuff              |         |
    |                 | Stuff              |         |
    |                 |                    |         |

The way I understand it, a book like this one only scratches the surface of the field. Tools like this support a lot of theory about which books copied each other and how, and what other sources they might have copied that we've since lost.

This is some incredibly dry material, even if you kind of dig thinking about the questions it addresses. It takes a special temperament to actually sit poring over fragmentary texts in ancient languages and do these painstaking comparisons. Even if you're a writer or editor and work with a lot of revisions of a text, there's a good chance you rarely do this kind of comparison on your own work, because that shit is tedious.

diff
----

It turns out that academics aren't the only people who need tools for comparing different versions of a text. Working programmers, in fact, need to do this constantly. Programmers are also happiest when putting off the actual task at hand to solve some incidental problem that cropped up along the way, so by now there are a lot of ways to say "here's how this file is different from this file", or "here's how this file is different from itself a year ago".

Let's look at a couple of shell scripts from an earlier chapter:

$ cat ../script/okpoems
#!/bin/bash

# find all the marker files and get the name of
# the directory containing each
find ~/p1k3/archives -name 'meta-ok-poem' | xargs -n1 dirname

exit 0
$ cat ../script/findprop
#!/bin/bash

if [ ! $1 ]
then
  echo "usage: findprop <property>"
  exit
fi

# find all the marker files and get the name of
# the directory containing each
find ~/p1k3/archives -name $1 | xargs -n1 dirname

exit 0

It's pretty obvious these are similar files, but do we know what exactly changed between them at a glance? It wouldn't be hard to figure out, once. If you wanted to be really certain about it, you could print them out, set them side by side, and go over them with a highlighter.

Now imagine doing that for a bunch of files, some of them hundreds or thousands of lines long. I've actually done that before, colored markers and all, but I didn't feel smart while I was doing it. This is a job for software.

$ diff ../script/okpoems ../script/findprop
2a3,8
> if [ ! $1 ]
> then
>   echo "usage: findprop <property>"
>   exit
> fi
> 
5c11
< find ~/p1k3/archives -name 'meta-ok-poem' | xargs -n1 dirname
---
> find ~/p1k3/archives -name $1 | xargs -n1 dirname

That's not the most human-friendly output, but it's a little simpler than it seems at first glance. It's basically just a way of describing the changes needed to turn okpoems into findprop. The string 2a3,8 can be read as "after line 2 of the first file, add lines 3 through 8 of the second file". Lines with a > in front of them come from the second file, and lines with a < come from the first. 5c11 can be read as "line 5 in the original file changes into line 11 in the new file": the < line is replaced with the > line. If you wanted, you could take a copy of the original file and apply these instructions by hand in your text editor, and you'd wind up with the new file.
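
If doing that by hand sounds like a chore, the patch utility can usually follow the same instructions for you. Here's a minimal sketch, assuming patch is installed. The fruit files and the scratch directory are made up for illustration and have nothing to do with the scripts above:

```shell
#!/bin/bash
# Hypothetical scratch files; the names old.txt and new.txt are made up.
workdir=$(mktemp -d)
printf 'apple\nbanana\ncherry\n' > "$workdir/old.txt"
printf 'apple\nblueberry\ncherry\ndate\n' > "$workdir/new.txt"

# Save the change commands. diff exits with status 1 when the files
# differ, which is expected here and not an error.
diff "$workdir/old.txt" "$workdir/new.txt" > "$workdir/changes.diff"
cat "$workdir/changes.diff"

# If patch is available, let it apply the commands to a copy of the original.
if command -v patch > /dev/null
then
  cp "$workdir/old.txt" "$workdir/copy.txt"
  patch -s "$workdir/copy.txt" "$workdir/changes.diff"
  cmp "$workdir/copy.txt" "$workdir/new.txt" && echo "copy now matches new.txt"
fi
```

On these files, diff reports a 2c2 change (banana becomes blueberry) and a 3a4 addition (date), and patch replays both against the copy.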

A lot of people (me included) prefer what's known as a "unified" diff, because it's easier to read and offers context for the changed lines. We can ask for one of these with diff -u:

$ diff -u ../script/okpoems ../script/findprop
--- ../script/okpoems	2014-04-19 00:08:03.321230818 -0600
+++ ../script/findprop	2014-04-21 21:51:29.360846449 -0600
@@ -1,7 +1,13 @@
 #!/bin/bash
 
+if [ ! $1 ]
+then
+  echo "usage: findprop <property>"
+  exit
+fi
+
 # find all the marker files and get the name of
 # the directory containing each
-find ~/p1k3/archives -name 'meta-ok-poem' | xargs -n1 dirname
+find ~/p1k3/archives -name $1 | xargs -n1 dirname
 
 exit 0

That's a little longer, and has some metadata we might not always care about, but if you look for lines starting with + and -, it's easy to read as "added these, took away these". This diff tells us at a glance that we added some lines to complain if we didn't get a command line argument, and replaced 'meta-ok-poem' in the find command with that argument. Since it shows us some context, we have a pretty good idea where those lines are in the file and what they're for.
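
The @@ line is worth a second look, since it says where a chunk of changes sits in each file. Here's a small sketch with two made-up fruit files (not the scripts above). In its output, -1,3 means the chunk covers 3 lines of the first file starting at line 1, and +1,4 means 4 lines of the second file starting at line 1. The tail is only there to drop the two filename-and-timestamp lines, so the result is the same from run to run:

```shell
#!/bin/bash
# Hypothetical scratch files, made up for this example.
workdir=$(mktemp -d)
printf 'apple\nbanana\ncherry\n' > "$workdir/old.txt"
printf 'apple\nblueberry\ncherry\ndate\n' > "$workdir/new.txt"

# Skip the ---/+++ header lines, which carry timestamps, and keep the rest.
hunk=$(diff -u "$workdir/old.txt" "$workdir/new.txt" | tail -n +3)
echo "$hunk"
# @@ -1,3 +1,4 @@
#  apple
# -banana
# +blueberry
#  cherry
# +date
```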

What if we don't care exactly how the files differ, but only whether they do?

$ diff -q ../script/okpoems ../script/findprop
Files ../script/okpoems and ../script/findprop differ
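
The same answer is available to scripts through diff's exit status: 0 when the files match, 1 when they differ, and something larger when diff ran into trouble, such as a missing file. A minimal sketch follows. The same_or_not function and the scratch files are hypothetical names invented for the example:

```shell
#!/bin/bash
# same_or_not is a hypothetical helper name, not a standard command.
same_or_not() {
  # Throw away diff's messages; only the exit status matters here.
  diff -q "$1" "$2" > /dev/null 2>&1
  case $? in
    0) echo "same" ;;
    1) echo "different" ;;
    *) echo "trouble" ;;
  esac
}

workdir=$(mktemp -d)
echo "one" > "$workdir/a"
echo "one" > "$workdir/b"
echo "two" > "$workdir/c"

same_or_not "$workdir/a" "$workdir/b"        # same
same_or_not "$workdir/a" "$workdir/c"        # different
same_or_not "$workdir/a" "$workdir/missing"  # trouble
```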

I use diff a lot in the course of my day job, because I spend a lot of time needing to know just how two programs differ. Just as importantly, I often need to know how (or whether!) the output of programs differs. As a concrete example, I want to make sure that findprop meta-ok-poem is really a suitable replacement for okpoems. Since I expect their output to be identical, I can do this:

$ ../script/okpoems > okpoem_output
$ ../script/findprop meta-ok-poem > findprop_output
$ diff -s okpoem_output findprop_output
Files okpoem_output and findprop_output are identical

The -s just means that diff should explicitly tell us if files are the same. Otherwise, it'd output nothing at all, because there aren't any differences.
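
If the extra output files feel like clutter, diff will read one side of the comparison from standard input when you give it - in place of a filename. Here's a sketch of that, where old_way and new_way are made-up stand-ins for two programs (like okpoems and findprop) that are supposed to print the same thing:

```shell
#!/bin/bash
# old_way and new_way are hypothetical stand-ins for two real programs.
old_way() { printf 'b\na\nc\n' | sort; }
new_way() { printf 'c\nb\na\n' | sort; }

expected=$(mktemp)
old_way > "$expected"

# "-" tells diff to read the second file from standard input.
# The pipeline's exit status is diff's: 0 means no differences.
if new_way | diff "$expected" - > /dev/null
then
  verdict="same output"
else
  verdict="different output"
fi
echo "$verdict"
rm -f "$expected"
```

Only one side can come from standard input this way, so one temporary file is still needed.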

As with many other tools, diff doesn't much care whether it's looking at shell scripts or a list of filenames or what-have-you. If you read the man page, you'll find some features geared towards people writing in C-like programming languages, but its real specialty is just text files with lines made out of characters, which works well for lots of code, but certainly could be applied to English prose.

Since I have a couple of versions ready to hand, let's apply this to a text with some well-known variations and a bit of a literary legacy. Here's the first day of the Genesis creation narrative in a couple of English translations:

$ cat genesis_nkj
In the beginning God created the heavens and the earth.  The earth was without
form, and void; and darkness was on the face of the deep.  And the Spirit of
God was hovering over the face of the waters.  Then God said, "Let there be
light"; and there was light.  And God saw the light, that it was good; and God
divided the light from the darkness.  God called the light Day, and the darkness
He called Night.  So the evening and the morning were the first day.
$ cat genesis_nrsv
In the beginning when God created the heavens and the earth, the earth was a
formless void and darkness covered the face of the deep, while a wind from
God swept over the face of the waters.  Then God said, "Let there be light";
and there was light.  And God saw that the light was good; and God separated
the light from the darkness.  God called the light Day, and the darkness he
called Night.  And there was evening and there was morning, the first day.

What happens if we diff them?

$ diff -u genesis_nkj genesis_nrsv
--- genesis_nkj	2014-05-11 16:28:29.692508461 -0600
+++ genesis_nrsv	2014-05-11 16:28:29.744508459 -0600
@@ -1,6 +1,6 @@
-In the beginning God created the heavens and the earth.  The earth was without
-form, and void; and darkness was on the face of the deep.  And the Spirit of
-God was hovering over the face of the waters.  Then God said, "Let there be
-light"; and there was light.  And God saw the light, that it was good; and God
-divided the light from the darkness.  God called the light Day, and the darkness
-He called Night.  So the evening and the morning were the first day.
+In the beginning when God created the heavens and the earth, the earth was a
+formless void and darkness covered the face of the deep, while a wind from
+God swept over the face of the waters.  Then God said, "Let there be light";
+and there was light.  And God saw that the light was good; and God separated
+the light from the darkness.  God called the light Day, and the darkness he
+called Night.  And there was evening and there was morning, the first day.

Kind of useless, right? If a given line differs by so much as a character, it's not the same line. This highlights the limitations of diff for comparing things that

  • aren't logically grouped by line
  • aren't easily thought of as versions of the same text with some lines changed

We could edit the files into a more logically defined structure, like one-line-per-verse, and try again:

$ diff -u genesis_nkj_by_verse genesis_nrsv_by_verse
--- genesis_nkj_by_verse	2014-05-11 16:51:14.312457198 -0600
+++ genesis_nrsv_by_verse	2014-05-11 16:53:02.484453134 -0600
@@ -1,5 +1,5 @@
-In the beginning God created the heavens and the earth.
-The earth was without form, and void; and darkness was on the face of the deep.  And the Spirit of God was hovering over the face of the waters.
+In the beginning when God created the heavens and the earth,
+the earth was a formless void and darkness covered the face of the deep, while a wind from God swept over the face of the waters.
 Then God said, "Let there be light"; and there was light.
-And God saw the light, that it was good; and God divided the light from the darkness.
-God called the light Day, and the darkness He called Night.  So the evening and the morning were the first day.
+And God saw that the light was good; and God separated the light from the darkness.
+God called the light Day, and the darkness he called Night.  And there was evening and there was morning, the first day.

It might be a little more descriptive, but editing all that text just for a quick comparison felt suspiciously like work, and anyway the output still doesn't seem very useful.
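
One way to skip the hand-editing is to let tr do the restructuring. This is only a sketch: it assumes words are separated by spaces and newlines, the one_word_per_line function is a hypothetical helper, and the two one-sentence files are cut down from the translations above:

```shell
#!/bin/bash
# one_word_per_line is a hypothetical helper: it turns every space into a
# newline and squeezes runs of newlines, leaving one word on each line.
one_word_per_line() { tr -s ' ' '\n' < "$1"; }

workdir=$(mktemp -d)
echo 'and God divided the light from the darkness.' > "$workdir/nkj"
echo 'and God separated the light from the darkness.' > "$workdir/nrsv"

# Compare word lists instead of lines of prose.
one_word_per_line "$workdir/nkj" > "$workdir/nkj_words"
word_diff=$(one_word_per_line "$workdir/nrsv" | diff "$workdir/nkj_words" -)
echo "$word_diff"
# 3c3
# < divided
# ---
# > separated
```

That pins the change to a single word, but it throws away the original line breaks and gets noisy fast on longer passages.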

wdiff
-----

For cases like this, I'm fond of a tool called wdiff:

$ wdiff genesis_nkj genesis_nrsv
In the beginning {+when+} God created the heavens and the [-earth.  The-] {+earth, the+} earth was [-without
form, and void;-] {+a
formless void+} and darkness [-was on-] {+covered+} the face of the [-deep.  And the Spirit of-] {+deep, while a wind from+}
God [-was hovering-] {+swept+} over the face of the waters.  Then God said, "Let there be light";
and there was light.  And God saw [-the light,-] that [-it-] {+the light+} was good; and God
[-divided-] {+separated+}
the light from the darkness.  God called the light Day, and the darkness
[-He-] {+he+}
called Night.  [-So the-]  {+And there was+} evening and [-the morning were-] {+there was morning,+} the first day.

Deleted words are surrounded by [- -] and inserted ones by {+ +}. You can even ask it to spit out HTML tags for insertion and deletion...

$ wdiff -w '<del>' -x '</del>' -y '<ins>' -z '</ins>' genesis_nkj genesis_nrsv

...and come up with something your browser will render like this:

In the beginning <ins>when</ins> God created the heavens and the <del>earth. The</del> <ins>earth, the</ins> earth was <del>without form, and void;</del> <ins>a formless void</ins> and darkness <del>was on</del> <ins>covered</ins> the face of the <del>deep. And the Spirit of</del> <ins>deep, while a wind from</ins> God <del>was hovering</del> <ins>swept</ins> over the face of the waters. Then God said, "Let there be light"; and there was light. And God saw <del>the light,</del> that <del>it</del> <ins>the light</ins> was good; and God <del>divided</del> <ins>separated</ins> the light from the darkness. God called the light Day, and the darkness <del>He</del> <ins>he</ins> called Night. <del>So the</del> <ins>And there was</ins> evening and <del>the morning were</del> <ins>there was morning,</ins> the first day.

Burton H. Throckmorton, Jr. this ain't. Still, it has its uses.