
3. programmerthink
==================

In the [preceding chapter](#a-literary-problem), I worked through accumulating a big piece of text from some other, smaller texts.  I started with a bunch of files and wound up with one big file called `potential_poems_full`.

Let's talk for a minute about how programmers approach problems like this one.  What I've just done is sort of an old-school humanities take on things:  Metaphorically speaking, I took a book off the shelf and hauled it down to the copy machine to xerox a bunch of pages, and now I'm going to start in on them with a highlighter and some Post-Its or something.  A process like this will often trigger a cascade of questions in the programmer-mind:

- What if, halfway through the project, I realize my selection criteria were all wrong and have to backtrack?
- What if I discover corrections that also need to be made in the source documents?
- What if I want to access metadata, like the original location of a file?
- What if I want to quickly re-order the poems according to some new criteria?
- Why am I storing the same text in two different places?

A unifying theme of these questions is that they could all be answered by involving a little more abstraction.
-> ★ <-
Some kinds of abstraction are so common in the physical world that we can forget they're part of a sophisticated technology.  For example, a good deal of bicycle maintenance can be accomplished with a cheap multi-tool containing a few different sizes of hex wrench and a couple of screwdrivers.

A hex wrench or screwdriver doesn't really know anything about bicycles.  All it _really_ knows about is fitting into a space and allowing torque to be applied.  Standardized fasteners and adjustment mechanisms on a bicycle ensure that the work can be done anywhere, by anyone with a certain set of tools.  Standard tools mean that if you can work on a particular bike, you can work on _most_ bikes, and even on things that aren't bikes at all, but were designed by people with the same abstractions in mind.

The relationship between a wrench, a bolt, and the purpose of a bolt is a lot like something we call _indirection_ in software.  Programs like `grep` or `cat` don't really know anything about poetry.  All they _really_ know about is finding lines of text in input, or sticking inputs together.  Files, lines, and text are like standardized fasteners that allow a user who can work on one kind of data (be it poetry, a list of authors, the source code of a program) to use the same tools for other problems and other data.
-> ★ <-
When I first started writing stuff on the web, I edited a page --- a single HTML file --- by hand.  When the entries on my nascent blog got old, I manually cut-and-pasted them to archive files with names like `old_main97.html`, which held all of the stuff I'd written in 1997.

I'm not holding this up as an example of youthful folly.  In fact, it worked fine, and just having a single, static file that you can open in any text editor has turned out to be a _lot_ more future-proof than the sophisticated blogging software people were starting to write at the time.

And yet.  Something about this habit nagged at my developing programmer mind after a few years.  It was just a little bit too manual and repetitive, a little bit silly to have to write things like a table of contents by hand, or move entries around by copy-and-pasting them to different files.  Since I knew the date for each entry, and wanted to make them navigable on that basis, why not define a directory structure for the years and months, and then write a file to hold each day?  That way, all I'd have to do is concatenate the files in one directory to display any given month:

    $ cat ~/p1k3/archives/2014/1/* | head -10
    <h1>Sunday, January 12</h1>

    <h2>the one casey is waiting for</h2>

    <freeverse>
    after a while
    the thing about drinking
    is that it just feeds
    what you drink to kill
    and kills

I ultimately wound up writing a few thousand lines of Perl to do the actual work, but the essential idea of the thing is still little more than invoking `cat` on some stuff.

I didn't know the word for it at the time, but what I was reaching for was a kind of indirection.  By putting blog posts in a specific directory layout, I was creating a simple model of the temporal structure that I considered their most important property.  Now, if I want to write commands that ask questions about my blog posts or re-combine them in certain ways, I can address my concerns to this model.  Maybe, for example, I want a rough idea how many words I've written in blog posts so far in 2014:

    $ find ~/p1k3/archives/2014/ -type f | xargs cat | wc -w
    6677

`xargs` is not the most intuitive command, but it's useful and common enough to explain here.  At the end of last chapter, when I said:
    $ cat `grep -ril '<freeverse>' ~/p1k3/archives` > ~/possible_poems_full
I could also have written this as:
    $ grep -ril '<freeverse>' ~/p1k3/archives | xargs cat > ~/possible_poems_full
What this does is take its input, which starts like:
    /home/brennen/p1k3/archives/2002/10/16
    /home/brennen/p1k3/archives/2002/10/27
    /home/brennen/p1k3/archives/2002/10/10
...and run `cat` on all the things in it:
    cat /home/brennen/p1k3/archives/2002/10/16 /home/brennen/p1k3/archives/2002/10/27 /home/brennen/p1k3/archives/2002/10/10 ...
It can be a better idea to use `xargs`, because while backticks are incredibly useful, they have some limitations.  If you're dealing with a very large list of files, for example, you might exceed the maximum allowed length for arguments to a command on your system.  `xargs` is smart enough to know that limit and run `cat` more than once if needed.
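If you want to watch that batching happen, here's a small sketch.  The `-n 2` flag is only there to fake a tiny limit, and `a`, `b`, and `c` are stand-ins for real filenames:

```shell
# Print the system's limit on the combined size of command
# arguments and environment, in bytes.
getconf ARG_MAX

# Tell xargs to pass at most two arguments per invocation, so it has
# to run echo twice to get through three names. The names are placeholders.
batches=$(printf '%s\n' a b c | xargs -n 2 echo)
printf '%s\n' "$batches"
```

The first `echo` gets `a b` and the second gets `c`.  With a real limit the batches are much bigger, but the mechanism is the same.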
`xargs` is actually sort of a pain to think about, and will make you jump through some irritating hoops if you have spaces or other weirdness in your filenames, but I wind up using it quite a bit.
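For the spaces problem, the usual hoop looks something like the following.  This is a sketch in a throwaway directory with made-up filenames, and it assumes a `find` and `xargs` that understand `-print0` and `-0`, which the GNU and BSD versions do:

```shell
# Work in a temporary directory so nothing real is touched.
dir=$(mktemp -d)
printf 'one two three\n' > "$dir/a poem"   # note the space in the name
printf 'four five\n' > "$dir/plain"

# -print0 and -0 separate names with a NUL byte instead of whitespace,
# so "a poem" reaches cat as one argument rather than two.
words=$(find "$dir" -type f -print0 | xargs -0 cat | wc -w)
echo "$words"

rm -r "$dir"
```

Without the `-print0` and `-0` pair, `xargs` would split `a poem` at the space and `cat` would go looking for two files that don't exist.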
Maybe I want to see a table of contents:
<!-- exec -->
    $ find ~/p1k3/archives/2014/ -type d | xargs ls -v | head -10
    /home/brennen/p1k3/archives/2014/:
    1
    2
    3
    4

    /home/brennen/p1k3/archives/2014/1:
    5
    12
    14
<!-- end -->
Or find the subtitles I used in 2012:
<!-- exec -->
    $ find ~/p1k3/archives/2012/ -type f | xargs perl -ne 'print "$1\n" if m{<h2>(.*?)</h2>}'
    pursuit
    fragment
    this poem again
    i'll do better next time
    timebinding animals
    more observations on gear nerdery &amp; utility fetishism
    thrift
    A miracle, in fact, means work
    <em>technical notes for late october</em>, or <em>it gets dork out earlier these days</em>
    radio
    light enough to travel
    12:06am
    "figures like Heinlein and Gingrich"
<!-- end -->
The crucial thing about this is that the filesystem _itself_ is just like `cat` and `grep`:  It doesn't know anything about blogs (or poetry), and it's basically indifferent to the actual _structure_ of a file like `~/p1k3/archives/2014/1/12`.  What the filesystem knows is that there are files with certain names in certain places.  It need not know anything about the _meaning_ of those names in order to be useful; in fact, it's best if it stays agnostic about the question, for this enables us to assign our own meaning to a structure and manipulate that structure with standard tools.
-> ★ <-
Back to the problem at hand:  I have this collection of files, and I know how to extract the ones that contain poems.  My goal is to see all the poems and collect the subset of them that I still find worthwhile.  Just knowing how to grep and then edit a big file solves my problem, in a basic sort of way.  And yet: Something about this nags at my mind.  I find that, just as I can already use standard tools and the filesystem to ask questions about all of my blog posts in a given year or month, I would like to be able to ask questions about the set of interesting poems.

If I want the freedom to execute many different sorts of commands against this set of poems, it begins to seem that I need a model.

When programmers talk about models, they often mean something that people in the sciences would recognize:  We find ways to represent the arrangement of facts so that we can think about them.  A structured representation of things often means that we can _change_ those things, or at least derive new understanding of them.
-> ★ <-
At this point in the narrative, I could pretend that my next step is immediately obvious, but in fact it's not.  I spend a couple of days thinking off and on about how to proceed, scribbling notes during bus rides and while drinking beers at the pizza joint down the street.  I assess and discard ideas which fall into a handful of broad approaches:

- Store blog entries in a relational database system which would allow me to associate them with data like "this entry is in a collection called 'ok poems'".
- Selectively build up a file containing the list of files with ok poems, and use it to do other tasks.
- Define a format for metadata that lives within entry files.
- Turn each interesting file into a directory of its own which contains a file with the original text and another file with metadata.

I discard the relational database idea immediately:  I like working with files, and I don't feel like abandoning a model that's served me well for my entire adult life.

Building up an index file to point at the other files I'm working with has a certain appeal.  I'm already most of the way there with the `grep` output in `potential_poems`.  It would be easy to write shell commands to add, remove, sort, and search entries.  Still, it doesn't feel like a very satisfying solution unto itself.  I'd like to know that an entry is part of the collection just by looking at the entry, without having to cross-reference it to a list somewhere else.
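As a rough sketch of what those index commands might look like (the paths are examples, and the index lives in a temporary file rather than anywhere real):

```shell
# A hypothetical index file: one entry path per line.
index=$(mktemp)

# Add entries by appending lines.
echo "archives/2013/2/9" >> "$index"
echo "archives/2002/10/16" >> "$index"
echo "archives/2002/10/27" >> "$index"

# Remove an entry by copying every other line to a new file.
# -v inverts the match, -x matches whole lines only, and -F treats
# the pattern as a fixed string rather than a regular expression.
grep -vxF "archives/2002/10/27" "$index" > "$index.new"
mv "$index.new" "$index"

# Sort the index in place, dropping any duplicate lines.
sort -u -o "$index" "$index"

# Search: which entries are from 2002?
found=$(grep '/2002/' "$index")
echo "$found"

remaining=$(cat "$index")
rm "$index"
```

Every operation is a line or two, which is the appeal.  The cost is the one already mentioned: the entry itself never finds out it's on the list.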
What about putting some meaningful text in the file itself?  I thought about a bunch of different ways to do this, some of them really complicated, and eventually arrived at this:
    <!-- collection: ok-poems -->
The `<!-- -->` bits are how you define a comment in HTML, which means that neither my blog code nor web browsers nor my text editor have to know anything about the format, but I can easily find files with certain values.  Check it:
    $ find ~/p1k3/archives -type f | xargs perl -ne 'print "$ARGV: $1 -> $2\n" if m{<!-- ([a-z]+): (.*?) -->};'
    /home/brennen/p1k3/archives/2013/2/9: collection -> ok-poems
That's an ugly one-liner, and I haven't explained half of what it does, but the comment format actually seems pretty workable for this.  It's a little tacky to look at, but it's simple and searchable.
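For the simpler question of which files belong to one collection, plain `grep` would probably do.  A sketch against a throwaway directory, with two invented entries standing in for the archive:

```shell
# Two stand-in entries; only the first carries the collection comment.
dir=$(mktemp -d)
printf '<h1>saturday, february 9</h1>\n<!-- collection: ok-poems -->\n' > "$dir/9"
printf '<h1>sunday, february 10</h1>\n' > "$dir/10"

# -r searches the directory recursively, -l prints only the names of
# matching files, and -F treats the pattern as a fixed string.
matches=$(grep -rlF '<!-- collection: ok-poems -->' "$dir")
echo "$matches"

rm -r "$dir"
```

This gives up the ability to list every key and value at once, which is what the Perl version is for.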
Before we settle, though, let's turn to the notion of making each entry into a directory that can contain some structured metadata in a separate file.  Imagine something like:

    $ ls ~/p1k3/archives/2013/2/9
    index  Meta

Here I use the name "index" for the main part of the entry because it's a convention of web sites for the top-level page in a directory to be called something like `index.html`.  As it happens, my blog software already supports this kind of file layout for entries which contain multiple parts, image files, and so forth.

    $ head ~/p1k3/archives/2013/2/9/index
    <h1>saturday, february 9</h1>

    <freeverse>
    midwinter midafternoon; depressed as hell
    sitting in a huge cabin in the rich-people mountains
    writing a sprawl, pages, of melancholic midlife bullshit

    outside the snow gives way to broken clouds and the
    clear unyielding light of the high country sun fills

    $ cat ~/p1k3/archives/2013/2/9/Meta
    collection: ok-poems

It would then be easy to `find` files called `Meta` and grep them for `collection: ok-poems`.
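Something like this, say.  It's a sketch against a throwaway directory, and the `drafts` collection is invented to give `grep` something to reject:

```shell
# Build a small stand-in for the archive layout.
dir=$(mktemp -d)
mkdir -p "$dir/2013/2/9" "$dir/2013/2/10"
echo 'collection: ok-poems' > "$dir/2013/2/9/Meta"
echo 'collection: drafts' > "$dir/2013/2/10/Meta"

# List every file named Meta, then keep the ones containing the line.
# -l prints file names only, -x requires the whole line to match,
# and -F treats the pattern as a fixed string.
hits=$(find "$dir" -type f -name Meta -print0 | xargs -0 grep -lxF 'collection: ok-poems')
echo "$hits"

rm -r "$dir"
```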
What if I put metadata right in the filename itself, and dispense with the grep altogether?

    $ ls ~/p1k3/archives/2013/2/9
    index  meta-ok-poem
    $ find ~/p1k3/archives -name 'meta-ok-poem'
    /home/brennen/p1k3/archives/2013/2/9/meta-ok-poem
There's a lot to like about this.  For one thing, it's immediately visible in a directory listing.  For another, it doesn't require searching through thousands of lines of text to extract a specific string.  If a directory has a `meta-ok-poem` in it, I can be pretty sure that it will contain an interesting `index`.

What are the downsides?  Well, it requires transforming lots of text files into directories-containing-files.  I might automate that process, but it's still a little tedious and it makes the layout of the entry archive more complicated overall.  There's a cost to doing things this way.  It lets me extend my existing model of a blog entry to include arbitrary metadata, but it also adds steps to writing or finding blog entries.
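The automation might not amount to much.  Here's one way it could go, assuming nothing else is using the `.tmp` name; `entry_to_dir` is a hypothetical helper invented for this example:

```shell
# entry_to_dir (hypothetical): replace the file at $1 with a directory
# of the same name that holds the original text as "index".
entry_to_dir() {
    entry=$1
    [ -f "$entry" ] || return 1      # only convert regular files
    mv "$entry" "$entry.tmp"         # step the file out of the way
    mkdir "$entry"                   # the directory takes over its name
    mv "$entry.tmp" "$entry/index"   # the text becomes the index file
}

# Try it on a throwaway copy of the layout.
dir=$(mktemp -d)
mkdir -p "$dir/2013/2"
echo '<freeverse>' > "$dir/2013/2/9"
entry_to_dir "$dir/2013/2/9"
touch "$dir/2013/2/9/meta-ok-poem"   # mark the entry as part of the collection

listing=$(ls "$dir/2013/2/9")
echo "$listing"

rm -r "$dir"
```

The code is short.  The tedium is in everything else that now has to know an entry might be either a file or a directory.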
Abstractions usually cost you something.  Is this one worth the hassle?  Sometimes the best way to answer that question is to start writing code that handles a given abstraction.