A book about the command line for humans.

3. programmerthink
==================

In the [preceding chapter](#a-literary-problem), I worked through accumulating
a big piece of text from some other, smaller texts. I started with a bunch of
files and wound up with one big file called `potential_poems_full`.

Let's talk for a minute about how programmers approach problems like this one.
What I've just done is sort of an old-school humanities take on things:
Metaphorically speaking, I took a book off the shelf and hauled it down to the
copy machine to xerox a bunch of pages, and now I'm going to start in on them
with a highlighter and some Post-Its or something. A process like this will
often trigger a cascade of questions in the programmer-mind:

- What if, halfway through the project, I realize my selection criteria were all
  wrong and have to backtrack?
- What if I discover corrections that also need to be made in the source documents?
- What if I want to access metadata, like the original location of a file?
- What if I want to quickly re-order the poems according to some new criteria?
- Why am I storing the same text in two different places?

A unifying theme of these questions is that they could all be answered by
involving a little more abstraction.

-> ★ <-

Some kinds of abstraction are so common in the physical world that we can
forget they're part of a sophisticated technology. For example, a good deal of
bicycle maintenance can be accomplished with a cheap multi-tool containing a
few different sizes of hex wrench and a couple of screwdrivers.

A hex wrench or screwdriver doesn't really know anything about bicycles. All
it _really_ knows about is fitting into a space and allowing torque to be
applied. Standardized fasteners and adjustment mechanisms on a bicycle ensure
that the work can be done anywhere, by anyone with a certain set of tools.
Standard tools mean that if you can work on a particular bike, you can work on
_most_ bikes, and even on things that aren't bikes at all, but were designed by
people with the same abstractions in mind.

The relationship between a wrench, a bolt, and the purpose of a bolt is a lot
like something we call _indirection_ in software. Programs like `grep` or
`cat` don't really know anything about poetry. All they _really_ know about is
finding lines of text in input, or sticking inputs together. Files, lines, and
text are like standardized fasteners that allow a user who can work on one kind
of data (be it poetry, a list of authors, the source code of a program) to use
the same tools for other problems and other data.

-> ★ <-

When I first started writing stuff on the web, I edited a page --- a single HTML
file --- by hand. When the entries on my nascent blog got old, I manually
cut-and-pasted them to archive files with names like `old_main97.html`, which
held all of the stuff I'd written in 1997.

I'm not holding this up as an example of youthful folly. In fact, it worked
fine, and just having a single, static file that you can open in any text
editor has turned out to be a _lot_ more future-proof than the sophisticated
blogging software people were starting to write at the time.

And yet. Something about this habit nagged at my developing programmer mind
after a few years. It was just a little bit too manual and repetitive, a
little bit silly to have to write things like a table of contents by hand, or
move entries around by copy-and-pasting them to different files. Since I knew
the date for each entry, and wanted to make them navigable on that basis, why
not define a directory structure for the years and months, and then write a
file to hold each day? That way, all I'd have to do is concatenate the files
in one directory to display any given month:

    $ cat ~/p1k3/archives/2014/1/* | head -10
    <h1>Sunday, January 12</h1>
    <h2>the one casey is waiting for</h2>
    <freeverse>
    after a while
    the thing about drinking
    is that it just feeds
    what you drink to kill
    and kills

I ultimately wound up writing a few thousand lines of Perl to do the actual
work, but the essential idea of the thing is still little more than invoking
`cat` on some stuff.

I didn't know the word for it at the time, but what I was reaching for was a
kind of indirection. By putting blog posts in a specific directory layout, I
was creating a simple model of the temporal structure that I considered their
most important property. Now, if I want to write commands that ask questions
about my blog posts or re-combine them in certain ways, I can address my
concerns to this model. Maybe, for example, I want a rough idea how many words
I've written in blog posts so far in 2014:

    $ find ~/p1k3/archives/2014/ -type f | xargs cat | wc -w
    6677

`xargs` is not the most intuitive command, but it's useful and common enough to
explain here. At the end of last chapter, when I said:

    $ cat `grep -ril '<freeverse>' ~/p1k3/archives` > ~/possible_poems_full

I could also have written this as:

    $ grep -ril '<freeverse>' ~/p1k3/archives | xargs cat > ~/possible_poems_full

What this does is take its input, which starts like:

    /home/brennen/p1k3/archives/2002/10/16
    /home/brennen/p1k3/archives/2002/10/27
    /home/brennen/p1k3/archives/2002/10/10

...and run `cat` on all the things in it:

    cat /home/brennen/p1k3/archives/2002/10/16 /home/brennen/p1k3/archives/2002/10/27 /home/brennen/p1k3/archives/2002/10/10 ...

It can be a better idea to use `xargs`, because while backticks are
incredibly useful, they have some limitations. If you're dealing with a very
large list of files, for example, you might exceed the maximum allowed length
for arguments to a command on your system. `xargs` is smart enough to know
that limit and run `cat` more than once if needed.
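
To make the batching visible, here's a minimal sketch. The `-n 2` flag is only there to force tiny batches; left alone, `xargs` sizes its batches against the system limit (which `getconf ARG_MAX` will report on most systems). The words fed in are placeholders rather than real filenames, and `echo` sits in front of `cat` so the command lines get printed instead of run.

```shell
# Placeholder names stand in for file paths. "echo cat" prints each
# command line xargs would build instead of actually running cat.
batches=$(printf '%s\n' one two three four five | xargs -n 2 echo cat)
printf '%s\n' "$batches"
```

With five inputs and a batch size of two, that should print three `cat` command lines: two with a pair of names each, and one with the leftover.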

`xargs` is actually sort of a pain to think about, and will make you jump
through some irritating hoops if you have spaces or other weirdness in your
filenames, but I wind up using it quite a bit.
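
One common hoop, sketched here under the assumption that your `find` and `xargs` support `-print0` and `-0` (the GNU and BSD versions do): separate filenames with a null byte instead of whitespace, so that a space inside a name can't be mistaken for a boundary between names. The directory and file names below are made up for the demonstration.

```shell
# Throwaway directory so the sketch doesn't touch any real files.
demo=$(mktemp -d)
printf 'one two\n' > "$demo/a file with spaces"
printf 'three\n' > "$demo/plain"

# find emits null-terminated names; xargs -0 splits its input on nulls only.
words=$(find "$demo" -type f -print0 | xargs -0 cat | wc -w | tr -d ' ')
echo "$words"

rm -r "$demo"
```

Without the `-print0` and `-0` pair, `xargs` would split `a file with spaces` into four separate arguments and `cat` would complain that none of them exist.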

Maybe I want to see a table of contents:

<!-- exec -->

    $ find ~/p1k3/archives/2014/ -type d | xargs ls -v | head -10
    /home/brennen/p1k3/archives/2014/:
    1
    2
    3
    4
    /home/brennen/p1k3/archives/2014/1:
    5
    12
    14

<!-- end -->

Or find the subtitles I used in 2012:

<!-- exec -->

    $ find ~/p1k3/archives/2012/ -type f | xargs perl -ne 'print "$1\n" if m{<h2>(.*?)</h2>}'
    pursuit
    fragment
    this poem again
    i'll do better next time
    timebinding animals
    more observations on gear nerdery &amp; utility fetishism
    thrift
    A miracle, in fact, means work
    <em>technical notes for late october</em>, or <em>it gets dork out earlier these days</em>
    radio
    light enough to travel
    12:06am
    "figures like Heinlein and Gingrich"

<!-- end -->

The crucial thing about this is that the filesystem _itself_ is just like `cat`
and `grep`: It doesn't know anything about blogs (or poetry), and it's
basically indifferent to the actual _structure_ of a file like
`~/p1k3/archives/2014/1/12`. What the filesystem knows is that there are files
with certain names in certain places. It need not know anything about the
_meaning_ of those names in order to be useful; in fact, it's best if it stays
agnostic about the question, for this enables us to assign our own meaning to a
structure and manipulate that structure with standard tools.

-> ★ <-

Back to the problem at hand: I have this collection of files, and I know how
to extract the ones that contain poems. My goal is to see all the poems and
collect the subset of them that I still find worthwhile. Just knowing how to
grep and then edit a big file solves my problem, in a basic sort of way. And
yet: Something about this nags at my mind. I find that, just as I can already
use standard tools and the filesystem to ask questions about all of my blog
posts in a given year or month, I would like to be able to ask questions about
the set of interesting poems.

If I want the freedom to execute many different sorts of commands against this
set of poems, it begins to seem that I need a model.

When programmers talk about models, they often mean something that people in
the sciences would recognize: We find ways to represent the arrangement of
facts so that we can think about them. A structured representation of things
often means that we can _change_ those things, or at least derive new
understanding of them.

-> ★ <-

At this point in the narrative, I could pretend that my next step is
immediately obvious, but in fact it's not. I spend a couple of days thinking
off and on about how to proceed, scribbling notes during bus rides and while
drinking beers at the pizza joint down the street. I assess and discard ideas
which fall into a handful of broad approaches:

- Store blog entries in a relational database system which would allow me to
  associate them with data like "this entry is in a collection called 'ok
  poems'".
- Selectively build up a file containing the list of files with ok poems, and use
  it to do other tasks.
- Define a format for metadata that lives within entry files.
- Turn each interesting file into a directory of its own which contains a file
  with the original text and another file with metadata.

I discard the relational database idea immediately: I like working with files,
and I don't feel like abandoning a model that's served me well for my entire
adult life.

Building up an index file to point at the other files I'm working with has a
certain appeal. I'm already most of the way there with the `grep` output in
`potential_poems`. It would be easy to write shell commands to add, remove,
sort, and search entries. Still, it doesn't feel like a very satisfying
solution unto itself. I'd like to know that an entry is part of the collection
just by looking at the entry, without having to cross-reference it to a list
somewhere else.
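
For a sense of what those shell commands might look like, here's a rough sketch. It's one way to do it, not a record of anything I actually ran: the index is a temporary file standing in for a list like `potential_poems`, and the entry paths are invented for the example.

```shell
index=$(mktemp)   # stand-in for a list of entry paths, one per line

# Add: append paths to the list (these paths are made up).
printf '%s\n' archives/2013/2/9 archives/2002/10/16 archives/2002/10/27 >> "$index"

# Remove: keep every line except an exact, whole-line match.
grep -vxF 'archives/2002/10/27' "$index" > "$index.tmp" && mv "$index.tmp" "$index"

# Sort: order the list in place, dropping any duplicates.
sort -u -o "$index" "$index"

# Search: is a given entry in the collection?
if grep -qxF 'archives/2013/2/9' "$index"; then found=yes; else found=no; fi
echo "$found"

remaining=$(cat "$index")
rm "$index"
```

The `-x` and `-F` flags make `grep` compare whole lines as plain strings, so `archives/2013/2/9` can't accidentally match `archives/2013/2/90`.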

What about putting some meaningful text in the file itself? I thought about
a bunch of different ways to do this, some of them really complicated, and
eventually arrived at this:

    <!-- collection: ok-poems -->

The `<!-- -->` bits are how you define a comment in HTML, which means that
neither my blog code nor web browsers nor my text editor have to know anything
about the format, but I can easily find files with certain values. Check it:

    $ find ~/p1k3/archives -type f | xargs perl -ne 'print "$ARGV: $1 -> $2\n" if m{<!-- ([a-z]+): (.*?) -->};'
    /home/brennen/p1k3/archives/2014/2/9: collection -> ok-poems

That's an ugly one-liner, and I haven't explained half of what it does, but the
comment format actually seems pretty workable for this. It's a little tacky to
look at, but it's simple and searchable.
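
If all I want is the list of files carrying one particular value, plain `grep` will do, with no Perl needed. This is a minimal sketch against a throwaway directory with made-up entries; it assumes a `grep` that supports `-r`, as the GNU and BSD versions do.

```shell
# Build a tiny stand-in archive: one tagged entry, one untagged.
archives=$(mktemp -d)
mkdir -p "$archives/2013/2"
printf '<h1>an entry</h1>\n<!-- collection: ok-poems -->\n' > "$archives/2013/2/9"
printf '<h1>another entry, no marker</h1>\n' > "$archives/2013/2/10"

# -r: recurse into directories; -l: print only the names of matching
# files; -F: treat the pattern as a plain string, not a regex.
matches=$(grep -rlF '<!-- collection: ok-poems -->' "$archives")
printf '%s\n' "$matches"

rm -r "$archives"
```

This only answers "which files are in this collection?" The Perl version is uglier, but it pulls out every key and value in one pass.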

Before we settle, though, let's turn to the notion of making each entry into a
directory that can contain some structured metadata in a separate file.
Imagine something like:

    $ ls ~/p1k3/archives/2013/2/9
    index Meta

Here I use the name "index" for the main part of the entry because it's a
convention of web sites for the top-level page in a directory to be called
something like `index.html`. As it happens, my blog software already supports
this kind of file layout for entries which contain multiple parts, image files,
and so forth.

    $ head ~/p1k3/archives/2013/2/9/index
    <h1>saturday, february 9</h1>
    <freeverse>
    midwinter midafternoon; depressed as hell
    sitting in a huge cabin in the rich-people mountains
    writing a sprawl, pages, of melancholic midlife bullshit
    outside the snow gives way to broken clouds and the
    clear unyielding light of the high country sun fills

    $ cat ~/p1k3/archives/2013/2/9/Meta
    collection: ok-poems

It would then be easy to `find` files called `Meta` and grep them for
`collection: ok-poems`.
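
Something like the following would do it. Treat it as a sketch: the demo tree is built in a temporary directory, and the second collection name, `drafts`, is invented just to give `grep` something to reject.

```shell
# Two entry directories, each with a Meta file naming a collection.
archives=$(mktemp -d)
mkdir -p "$archives/2013/2/9" "$archives/2013/2/10"
printf 'collection: ok-poems\n' > "$archives/2013/2/9/Meta"
printf 'collection: drafts\n' > "$archives/2013/2/10/Meta"

# find locates every Meta file; grep -l keeps the names of those that
# contain a line exactly equal (-x) to the plain string (-F) we want.
ok=$(find "$archives" -type f -name Meta | xargs grep -lxF 'collection: ok-poems')
printf '%s\n' "$ok"

rm -r "$archives"
```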

What if I put metadata right in the filename itself, and dispense with the grep
altogether?

    $ ls ~/p1k3/archives/2013/2/9
    index meta-ok-poem

    $ find ~/p1k3/archives -name 'meta-ok-poem'
    /home/brennen/p1k3/archives/2013/2/9/meta-ok-poem

There's a lot to like about this. For one thing, it's immediately visible in a
directory listing. For another, it doesn't require searching through thousands
of lines of text to extract a specific string. If a directory has a
`meta-ok-poem` in it, I can be pretty sure that it will contain an interesting
`index`.
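
Getting from a marker file to the `index` beside it takes one more small step. Here's a hedged sketch of that step, again using a temporary directory and invented entry text rather than my real archive:

```shell
# Two entry directories; only one carries the (empty) marker file.
archives=$(mktemp -d)
mkdir -p "$archives/2013/2/9" "$archives/2013/2/10"
printf 'a poem worth keeping\n' > "$archives/2013/2/9/index"
printf 'something else entirely\n' > "$archives/2013/2/10/index"
: > "$archives/2013/2/9/meta-ok-poem"

# For every marker found, print the path of the index file next to it.
indexes=$(find "$archives" -type f -name meta-ok-poem |
  while IFS= read -r marker; do
    printf '%s/index\n' "$(dirname "$marker")"
  done)
printf '%s\n' "$indexes"

# Those paths can go straight to the usual tools, e.g. a word count.
words=$(printf '%s\n' "$indexes" | xargs cat | wc -w | tr -d ' ')
echo "$words"

rm -r "$archives"
```

The marker file never gets opened. Its name and location carry all the information.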

What are the downsides? Well, it requires transforming lots of text files into
directories-containing-files. I might automate that process, but it's still a
little tedious and it makes the layout of the entry archive more complicated
overall. There's a cost to doing things this way. It lets me extend my
existing model of a blog entry to include arbitrary metadata, but it also adds
steps to writing or finding blog entries.
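
Automating the transformation might look roughly like this. The function name `entry_to_dir` is hypothetical, as is the demo entry; it's a sketch of the idea, not a description of how my blog software handles it.

```shell
# Hypothetical helper: turn the flat file ENTRY into a directory ENTRY/
# that holds the original text as ENTRY/index.
entry_to_dir() {
  entry=$1
  [ -f "$entry" ] || return 1      # only convert regular files
  mv "$entry" "$entry.tmp" &&      # move the text out of the way,
    mkdir "$entry" &&              # make a directory with the old name,
    mv "$entry.tmp" "$entry/index" # and put the text back inside it
}

# Try it on a throwaway entry.
demo=$(mktemp -d)
mkdir -p "$demo/2013/2"
printf '<freeverse>\n' > "$demo/2013/2/9"

entry_to_dir "$demo/2013/2/9"
: > "$demo/2013/2/9/meta-ok-poem"  # tag the entry with an empty marker file

listing=$(ls "$demo/2013/2/9")
printf '%s\n' "$listing"

rm -r "$demo"
```

Even this little function has to pick a temporary name and decide what to do with non-files, which is a fair illustration of the extra steps involved.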

Abstractions usually cost you something. Is this one worth the hassle?
Sometimes the best way to answer that question is to start writing code that
handles a given abstraction.