3. programmerthink
==================
In the [preceding chapter](#a-literary-problem), I worked through accumulating
a big piece of text from some other, smaller texts. I started with a bunch of
files and wound up with one big file called `possible_poems_full`.
Let's talk for a minute about how programmers approach problems like this one.
What I've just done is sort of an old-school humanities take on things:
Metaphorically speaking, I took a book off the shelf and hauled it down to the
copy machine to xerox a bunch of pages, and now I'm going to start in on them
with a highlighter and some Post-Its or something. A process like this will
often trigger a cascade of questions in the programmer-mind:
- What if, halfway through the project, I realize my selection criteria were all
wrong and have to backtrack?
- What if I discover corrections that also need to be made in the source documents?
- What if I want to access metadata, like the original location of a file?
- What if I want to quickly re-order the poems according to some new criteria?
- Why am I storing the same text in two different places?
A unifying theme of these questions is that they could all be answered by
involving a little more abstraction.
-> ★ <-
Some kinds of abstraction are so common in the physical world that we can
forget they're part of a sophisticated technology. For example, a good deal of
bicycle maintenance can be accomplished with a cheap multi-tool containing a
few different sizes of hex wrench and a couple of screwdrivers.
A hex wrench or screwdriver doesn't really know anything about bicycles. All
it _really_ knows about is fitting into a space and allowing torque to be
applied. Standardized fasteners and adjustment mechanisms on a bicycle ensure
that the work can be done anywhere, by anyone with a certain set of tools.
Standard tools mean that if you can work on a particular bike, you can work on
_most_ bikes, and even on things that aren't bikes at all, but were designed by
people with the same abstractions in mind.
The relationship between a wrench, a bolt, and the purpose of a bolt is a lot
like something we call _indirection_ in software. Programs like `grep` or
`cat` don't really know anything about poetry. All they _really_ know about is
finding lines of text in input, or sticking inputs together. Files, lines, and
text are like standardized fasteners that allow a user who can work on one kind
of data (be it poetry, a list of authors, the source code of a program) to use
the same tools for other problems and other data.
-> ★ <-
When I first started writing stuff on the web, I edited a page --- a single HTML
file --- by hand. When the entries on my nascent blog got old, I manually
cut-and-pasted them to archive files with names like `old_main97.html`, which
held all of the stuff I'd written in 1997.
I'm not holding this up as an example of youthful folly. In fact, it worked
fine, and just having a single, static file that you can open in any text
editor has turned out to be a _lot_ more future-proof than the sophisticated
blogging software people were starting to write at the time.
And yet. Something about this habit nagged at my developing programmer mind
after a few years. It was just a little bit too manual and repetitive, a
little bit silly to have to write things like a table of contents by hand, or
move entries around by copy-and-pasting them to different files. Since I knew
the date for each entry, and wanted to make them navigable on that basis, why
not define a directory structure for the years and months, and then write a
file to hold each day? That way, all I'd have to do is concatenate the files
in one directory to display any given month:
$ cat ~/p1k3/archives/2014/1/* | head -10
Sunday, January 12
the one casey is waiting for
after a while
the thing about drinking
is that it just feeds
what you drink to kill
and kills
I ultimately wound up writing a few thousand lines of Perl to do the actual
work, but the essential idea of the thing is still little more than invoking
`cat` on some stuff.
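The layout idea is easy to try on a small scale. What follows is only a sketch, assuming a throwaway directory from `mktemp` instead of my real archive; the `archive` variable and the entry text are made up for illustration.

```shell
# Build a tiny year/month/day archive in a scratch directory.
archive=$(mktemp -d)
mkdir -p "$archive/2014/1"
printf 'Sunday, January 5\nfirst entry\n' > "$archive/2014/1/5"
printf 'Sunday, January 12\nsecond entry\n' > "$archive/2014/1/12"

# Displaying a month is just concatenating every file in its directory.
# The shell expands * in sorted text order, so "12" comes before "5".
month=$(cat "$archive"/2014/1/*)
printf '%s\n' "$month"

rm -rf "$archive"
```

Note that the glob sorts names as text, not as numbers, which is one reason a real version needs a little more than `cat`.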
I didn't know the word for it at the time, but what I was reaching for was a
kind of indirection. By putting blog posts in a specific directory layout, I
was creating a simple model of the temporal structure that I considered their
most important property. Now, if I want to write commands that ask questions
about my blog posts or re-combine them in certain ways, I can address my
concerns to this model. Maybe, for example, I want a rough idea how many words
I've written in blog posts so far in 2014:
$ find ~/p1k3/archives/2014/ -type f | xargs cat | wc -w
6677
`xargs` is not the most intuitive command, but it's useful and common enough to
explain here. At the end of the last chapter, when I said:
$ cat `grep -ril '<poem>' ~/p1k3/archives` > ~/possible_poems_full
I could also have written this as:
$ grep -ril '<poem>' ~/p1k3/archives | xargs cat > ~/possible_poems_full
What this does is take its input, which starts like:
/home/brennen/p1k3/archives/2002/10/16
/home/brennen/p1k3/archives/2002/10/27
/home/brennen/p1k3/archives/2002/10/10
...and run `cat` on all the things in it:
cat /home/brennen/p1k3/archives/2002/10/16 /home/brennen/p1k3/archives/2002/10/27 /home/brennen/p1k3/archives/2002/10/10 ...
It can be a better idea to use `xargs`, because while backticks are
incredibly useful, they have some limitations. If you're dealing with a very
large list of files, for example, you might exceed the maximum allowed length
for arguments to a command on your system. `xargs` is smart enough to know
that limit and run `cat` more than once if needed.
`xargs` is actually sort of a pain to think about, and will make you jump
through some irritating hoops if you have spaces or other weirdness in your
filenames, but I wind up using it quite a bit.
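One common way through those hoops is to separate filenames with NUL bytes instead of newlines. This is a sketch, assuming your `find` and `xargs` support the `-print0` and `-0` options (the GNU and BSD versions do, but they are extensions, not something every system guarantees). The scratch directory and file names are invented for the example.

```shell
# Scratch directory with an awkward filename (it contains a space).
dir=$(mktemp -d)
printf 'one two three\n' > "$dir/plain"
printf 'four five\n' > "$dir/has space"

# -print0 ends each name with a NUL byte instead of a newline, and
# xargs -0 splits its input on NUL, so the space passes through intact.
words=$(find "$dir" -type f -print0 | xargs -0 cat | wc -w | tr -d ' ')
echo "$words"

rm -rf "$dir"
```

Without `-print0` and `-0`, `xargs` would split `has space` into two arguments and `cat` would complain about two files that don't exist.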
Maybe I want to see a table of contents:
$ find ~/p1k3/archives/2014/ -type d | xargs ls -v | head -10
/home/brennen/p1k3/archives/2014/:
1
2
3
4
/home/brennen/p1k3/archives/2014/1:
5
12
14
Or find the subtitles I used in 2012:
$ find ~/p1k3/archives/2012/ -type f | xargs perl -ne 'print "$1\n" if m{<h2>(.*?)</h2>}'
pursuit
fragment
this poem again
i'll do better next time
timebinding animals
more observations on gear nerdery & utility fetishism
thrift
A miracle, in fact, means work
technical notes for late october, or it gets dork out earlier these days
radio
light enough to travel
12:06am
"figures like Heinlein and Gingrich"
The crucial thing about this is that the filesystem _itself_ is just like `cat`
and `grep`: It doesn't know anything about blogs (or poetry), and it's
basically indifferent to the actual _structure_ of a file like
`~/p1k3/archives/2014/1/12`. What the filesystem knows is that there are files
with certain names in certain places. It need not know anything about the
_meaning_ of those names in order to be useful; in fact, it's best if it stays
agnostic about the question, for this enables us to assign our own meaning to a
structure and manipulate that structure with standard tools.
-> ★ <-
Back to the problem at hand: I have this collection of files, and I know how
to extract the ones that contain poems. My goal is to see all the poems and
collect the subset of them that I still find worthwhile. Just knowing how to
grep and then edit a big file solves my problem, in a basic sort of way. And
yet: Something about this nags at my mind. I find that, just as I can already
use standard tools and the filesystem to ask questions about all of my blog
posts in a given year or month, I would like to be able to ask questions about
the set of interesting poems.
If I want the freedom to execute many different sorts of commands against this
set of poems, it begins to seem that I need a model.
When programmers talk about models, they often mean something that people in
the sciences would recognize: We find ways to represent the arrangement of
facts so that we can think about them. A structured representation of things
often means that we can _change_ those things, or at least derive new
understanding of them.
-> ★ <-
At this point in the narrative, I could pretend that my next step is
immediately obvious, but in fact it's not. I spend a couple of days thinking
off and on about how to proceed, scribbling notes during bus rides and while
drinking beers at the pizza joint down the street. I assess and discard ideas
which fall into a handful of broad approaches:
- Store blog entries in a relational database system which would allow me to
associate them with data like "this entry is in a collection called 'ok
poems'".
- Selectively build up a file containing the list of files with ok poems, and use
it to do other tasks.
- Define a format for metadata that lives within entry files.
- Turn each interesting file into a directory of its own which contains a file
with the original text and another file with metadata.
I discard the relational database idea immediately: I like working with files,
and I don't feel like abandoning a model that's served me well for my entire
adult life.
Building up an index file to point at the other files I'm working with has a
certain appeal. I'm already most of the way there with the `grep` output in
`possible_poems`. It would be easy to write shell commands to add, remove,
sort, and search entries. Still, it doesn't feel like a very satisfying
solution unto itself. I'd like to know that an entry is part of the collection
just by looking at the entry, without having to cross-reference it to a list
somewhere else.
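To give a sense of how little code the index-file approach would take, here is a rough sketch. It assumes a plain text file with one path per line; the `index` variable and the paths are placeholders, not real entries.

```shell
# A scratch index file; the paths are placeholders, not real entries.
index=$(mktemp)

# Add entries by appending paths, one per line.
printf '%s\n' archives/2013/2/9 archives/2012/10/16 archives/2014/1/12 >> "$index"

# Remove one entry: keep every line that is not an exact match.
grep -vxF 'archives/2012/10/16' "$index" > "$index.tmp" && mv "$index.tmp" "$index"

# Sort in place, dropping any duplicate lines.
sort -u -o "$index" "$index"

# Search: is a given entry in the collection?
if grep -qxF 'archives/2013/2/9' "$index"; then
    found=yes
else
    found=no
fi

entries=$(cat "$index")
printf '%s\n' "$entries"
rm -f "$index"
```

Every operation is a line or two, which is the appeal. The weakness is the one I described: the list lives apart from the entries it describes.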
What about putting some meaningful text in the file itself? I think about
a bunch of different ways to do this, some of them really complicated, and
eventually arrive at this:
<!-- collection: ok-poems -->
The `<!-- -->` bits are how you define a comment in HTML, which means that
neither my blog code nor web browsers nor my text editor have to know anything
about the format, but I can easily find files with certain values. Check it:
$ find ~/p1k3/archives -type f | xargs perl -ne 'print "$ARGV: $1 -> $2\n" if m{<!-- (\w+): (.*?) -->};'
/home/brennen/p1k3/archives/2014/2/9: collection -> ok-poems
That's an ugly one-liner, and I haven't explained half of what it does, but the
comment format actually seems pretty workable for this. It's a little tacky to
look at, but it's simple and searchable.
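If all I want is the list of files in one collection, plain `grep` may be enough. This is a sketch, assuming the comment reads exactly `<!-- collection: ok-poems -->` and using a scratch directory with invented entries in place of my archive.

```shell
# Scratch archive with one tagged entry and one untagged entry.
archive=$(mktemp -d)
mkdir -p "$archive/2013/2" "$archive/2014/1"
printf 'saturday, february 9\n<!-- collection: ok-poems -->\n' > "$archive/2013/2/9"
printf 'Sunday, January 12\n' > "$archive/2014/1/12"

# -r recurses into directories, -l prints only the names of matching
# files, and -F treats the pattern as a fixed string, not a regex.
tagged=$(grep -rlF '<!-- collection: ok-poems -->' "$archive")
printf '%s\n' "$tagged"

rm -rf "$archive"
```

The Perl version earns its ugliness only when I want to pull out every key and value, whatever they happen to be.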
Before we settle, though, let's turn to the notion of making each entry into a
directory that can contain some structured metadata in a separate file.
Imagine something like:
$ ls ~/p1k3/archives/2013/2/9
index Meta
Here I use the name "index" for the main part of the entry because it's a
convention of web sites for the top-level page in a directory to be called
something like `index.html`. As it happens, my blog software already supports
this kind of file layout for entries which contain multiple parts, image files,
and so forth.
$ head ~/p1k3/archives/2013/2/9/index
saturday, february 9
midwinter midafternoon; depressed as hell
sitting in a huge cabin in the rich-people mountains
writing a sprawl, pages, of melancholic midlife bullshit
outside the snow gives way to broken clouds and the
clear unyielding light of the high country sun fills
$ cat ~/p1k3/archives/2013/2/9/Meta
collection: ok-poems
It would then be easy to `find` files called `Meta` and grep them for
`collection: ok-poems`.
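That search might look something like the following sketch. It assumes each `Meta` file holds one `key: value` pair per line; the scratch directory and the `drafts` collection are made up for the example.

```shell
# Scratch archive where each entry is a directory with index and Meta.
archive=$(mktemp -d)
mkdir -p "$archive/2013/2/9" "$archive/2014/1/12"
printf 'saturday, february 9\n' > "$archive/2013/2/9/index"
printf 'collection: ok-poems\n' > "$archive/2013/2/9/Meta"
printf 'Sunday, January 12\n' > "$archive/2014/1/12/index"
printf 'collection: drafts\n' > "$archive/2014/1/12/Meta"

# Find every file named Meta and hand them to grep in batches.
# -l lists matching files, -x requires the whole line to match,
# and -F treats the pattern as a fixed string.
matches=$(find "$archive" -type f -name Meta -exec grep -lxF 'collection: ok-poems' {} +)
printf '%s\n' "$matches"

rm -rf "$archive"
```

Using `-exec ... {} +` does roughly what `xargs` does, without the trouble over odd filenames.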
What if I put metadata right in the filename itself, and dispense with the grep
altogether?
$ ls ~/p1k3/archives/2013/2/9
index meta-ok-poem
$ find ~/p1k3/archives -name 'meta-ok-poem'
/home/brennen/p1k3/archives/2013/2/9/meta-ok-poem
There's a lot to like about this. For one thing, it's immediately visible in a
directory listing. For another, it doesn't require searching through thousands
of lines of text to extract a specific string. If a directory has a
`meta-ok-poem` in it, I can be pretty sure that it will contain an interesting
`index`.
What are the downsides? Well, it requires transforming lots of text files into
directories-containing-files. I might automate that process, but it's still a
little tedious and it makes the layout of the entry archive more complicated
overall. There's a cost to doing things this way. It lets me extend my
existing model of a blog entry to include arbitrary metadata, but it also adds
steps to writing or finding blog entries.
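For a sense of what automating the conversion might involve, here is a sketch. The `entry_to_dir` function is hypothetical, something I'm making up for illustration and not part of my blog software, and it runs against a scratch directory.

```shell
# Hypothetical helper: turn a plain entry file into a directory that
# holds the original text as "index" plus an empty marker file.
entry_to_dir() {
    entry=$1
    marker=$2
    [ -f "$entry" ] || return 1    # only convert plain files
    mv "$entry" "$entry.tmp" &&
        mkdir "$entry" &&
        mv "$entry.tmp" "$entry/index" &&
        : > "$entry/$marker"       # create the empty marker file
}

archive=$(mktemp -d)
mkdir -p "$archive/2013/2"
printf 'saturday, february 9\n' > "$archive/2013/2/9"

entry_to_dir "$archive/2013/2/9" meta-ok-poem
listing=$(ls "$archive/2013/2/9")
text=$(cat "$archive/2013/2/9/index")
printf '%s\n' "$listing"

rm -rf "$archive"
```

It's only a few lines, but it's a few lines I would have to run, and trust, for every entry I care about.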
Abstractions usually cost you something. Is this one worth the hassle?
Sometimes the best way to answer that question is to start writing code that
handles a given abstraction.