A book about the command line for humans.
  1. 1. the command line as literary environment
  2. ===========================================
There are a lot of ways to structure an introduction to the command line. I'm
going to start with writing as a point of departure because, aside from web
development, it's what I use a computer for most. I want to shine a light on
the humane potential of ideas that are usually understood as nerd trivia.
  7. Computers have utterly transformed the practice of writing within the space of
  8. my lifetime, but it seems to me that writers as a class miss out on many of the
  9. software tools and patterns taken as a given in more "technical" fields.
  10. Writing, particularly writing of any real scope or complexity, is very much a
  11. technical task. It makes demands, both physical and psychological, of its
  12. practitioners. As with woodworkers, graphic artists, and farmers, writers
  13. exhibit strong preferences in their tools, materials, and environment, and they
  14. do so because they're engaged in a physically and cognitively challenging task.
  15. My thesis is that the modern Linux command line is a pretty good environment
  16. for working with English prose and prosody, and that maybe this will illuminate
  17. the ways it could be useful in your own work with a computer, whatever that
  18. work happens to be.
  19. terms and definitions
  20. ---------------------
  21. What software are we actually talking about when we say "the command line"?
  22. For the purposes of this discussion, we're talking about an environment built
  23. on a very old paradigm called Unix.
  24. -> <img src="images/jp_unix.jpg" height=320 width=470> <-
  25. ...except what classical Unix really looks like is this:
  26. -> <img src="images/blinking.gif" width=470> <-
  27. The Unix-like environment we're going to use isn't very classical, really.
  28. It's an operating system kernel called Linux, combined with a bunch of things
  29. written by other people (people in the GNU and Debian projects, and many
  30. others). Purists will tell you that this isn't properly Unix at all. In
  31. strict historical terms they're right, or at least a certain kind of right, but
  32. for the purposes of my cultural agenda I'm going to ignore them right now.
  33. -> <img src="images/debian.png"> <-
  34. This is what's called a shell. There are many different shells, but they
  35. pretty much all operate on the same idea: You navigate a filesystem and run
  36. programs by typing commands. Commands can be combined in various ways to make
  37. programs of their own, and in fact the way you use the computer is often just
  38. to write little programs that invoke other programs, turtles-all-the-way-down
  39. style.
  40. The standard shell these days is something called Bash, so we'll use Bash.
  41. It's what you'll most often see in the wild. Like most shells, Bash is ugly
  42. and stupid in more ways than it is possible to easily summarize. It's also an
  43. incredibly powerful and expressive piece of software.
  44. twisty little passages
  45. ----------------------
  46. Have you ever played a text-based adventure game or MUD, of the kind that
  47. describes a setting and takes commands for movement and so on? Readers of a
  48. certain age and temperament might recognize the opening of Crowther & Woods'
  49. _Adventure_, the great-granddaddy of text adventure games:
YOU ARE STANDING AT THE END OF A ROAD BEFORE A SMALL BRICK BUILDING.
AROUND YOU IS A FOREST. A SMALL STREAM FLOWS OUT OF THE BUILDING AND
DOWN A GULLY.
  53. > GO EAST
  54. YOU ARE INSIDE A BUILDING, A WELL HOUSE FOR A LARGE SPRING.
  55. THERE ARE SOME KEYS ON THE GROUND HERE.
  56. THERE IS A SHINY BRASS LAMP NEARBY.
  57. THERE IS FOOD HERE.
  58. THERE IS A BOTTLE OF WATER HERE.
  59. You can think of the shell as a kind of environment you inhabit, in much the
  60. way your character inhabits an adventure game. The difference is that instead
  61. of navigating around virtual rooms and hallways with commands like `LOOK` and
  62. `EAST`, you navigate between directories by typing commands like `ls` and `cd
  63. notes`:
  64. $ ls
  65. code Downloads notes p1k3 photos scraps userland-book
  66. $ cd notes
  67. $ ls
  68. notes.txt sparkfun TODO.txt
  69. `ls` lists files. Some files are directories, which means they can contain
  70. other files, and you can step inside of them by typing `cd` (for **c**hange
  71. **d**irectory).
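If you'd like to poke at this without touching your real files, here's a minimal sketch. It assumes `mktemp` is available, and it leans on a few commands we haven't met yet: `mkdir` makes a directory, `touch` makes an empty file, and `pwd` prints where you currently are. The `playground` name and the files are made up for illustration.

```shell
# Hypothetical scratch directory so nothing real gets touched.
playground=$(mktemp -d)
cd "$playground"

mkdir notes                           # make a directory
touch notes/TODO.txt notes/notes.txt  # and put two empty files in it

ls                                    # shows just: notes
cd notes                              # step inside
listing=$(ls)                         # capture what ls prints from in here
echo "$listing"
here=$(basename "$(pwd)")             # pwd reports where you are standing
echo "$here"
```

The order in which `ls` lists the two files depends on your locale settings, so don't be alarmed if yours differs from someone else's.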
  72. In the Macintosh and Windows world, directories have been called
  73. "folders" for a long time now. This isn't the _worst_ metaphor for what's
  74. going on, and it's so pervasive by now that it's not worth fighting about.
  75. It's also not exactly a _great_ metaphor, since computer filesystems aren't
  76. built very much like the filing cabinets of yore. A directory acts a lot like
  77. a container of some sort, but it's an infinitely expandable one which may
  78. contain nested sub-spaces much larger than itself. Directories are frequently
  79. like the TARDIS: Bigger on the inside.
  80. cat
  81. ---
  82. When you're in the shell, you have many tools at your disposal - programs that
  83. can be used on many different files, or chained together with other programs.
  84. They tend to have weird, cryptic names, but a lot of them do very simple
  85. things. Tasks that might be a menu item in a big program like Word, like
  86. counting the number of words in a document or finding a particular phrase, are
  87. often programs unto themselves. We'll start with something even more basic
  88. than that.
  89. Suppose you have some files, and you're curious what's in them. For example,
  90. suppose you've got a list of authors you're planning to reference, and you just
  91. want to check its contents real quick-like. This is where our friend `cat`
  92. comes in:
  93. <!-- exec -->
  94. $ cat authors_sff
  95. Ursula K. Le Guin
  96. Jo Walton
  97. Pat Cadigan
  98. John Ronald Reuel Tolkien
  99. Vanessa Veselka
  100. James Tiptree, Jr.
  101. John Brunner
  102. <!-- end -->
  103. "Why," you might be asking, "is the command to dump out the contents of a file
  104. to a screen called `cat`? What do felines have to do with anything?"
  105. It turns out that `cat` is actually short for "catenate", which is a long
  106. word basically meaning "stick things together". In programming, we usually
  107. refer to sticking two bits of text together as "string concatenation", probably
  108. because programmers like to feel like they're being very precise about very
  109. simple actions.
  110. Suppose you wanted to see the contents of a _set_ of author lists:
  111. <!-- exec -->
  112. $ cat authors_sff authors_contemporary_fic authors_nat_hist
  113. Ursula K. Le Guin
  114. Jo Walton
  115. Pat Cadigan
  116. John Ronald Reuel Tolkien
  117. Vanessa Veselka
  118. James Tiptree, Jr.
  119. John Brunner
  120. Eden Robinson
  121. Vanessa Veselka
  122. Miriam Toews
  123. Gwendolyn L. Waring
  124. <!-- end -->
  125. wildcards
  126. ---------
  127. We're working with three filenames: `authors_sff`, `authors_contemporary_fic`,
  128. and `authors_nat_hist`. That's an awful lot of typing every time we want to do
  129. something to all three files. Fortunately, our shell offers a shorthand for
  130. "all the files that start with `authors_`":
  131. <!-- exec -->
  132. $ cat authors_*
  133. Eden Robinson
  134. Vanessa Veselka
  135. Miriam Toews
  136. Gwendolyn L. Waring
  137. Ursula K. Le Guin
  138. Jo Walton
  139. Pat Cadigan
  140. John Ronald Reuel Tolkien
  141. Vanessa Veselka
  142. James Tiptree, Jr.
  143. John Brunner
  144. <!-- end -->
  145. In Bash-land, `*` basically means "anything", and is known in the vernacular,
  146. somewhat poetically, as a "wildcard". You should always be careful with
  147. wildcards, especially if you're doing anything destructive. They can and will
  148. surprise the unwary. Still, once you're used to the idea, they will save you a
  149. lot of RSI.
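One habit that helps with the "surprise the unwary" problem: preview what a wildcard expands to before handing it to anything destructive. Here's a sketch under made-up conditions (a scratch directory from `mktemp`, plus a hypothetical `keep_me` file that we don't want to lose). `rm`, which deletes files, hasn't been introduced yet.

```shell
scratch=$(mktemp -d)
cd "$scratch"
touch authors_sff authors_nat_hist authors_contemporary_fic keep_me

# The shell expands the pattern before the command ever runs, so printf
# just receives the matching filenames and prints one per line.
preview=$(printf '%s\n' authors_*)
echo "$preview"

# Only once the preview looks right, give the same pattern to rm.
rm authors_*
remaining=$(ls)
echo "$remaining"
```

The point is that the program never sees the `*` itself. It only sees the list of names the shell substituted for it.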
  150. sort
  151. ----
  152. There's a problem here. Our author list is out of order, and thus confusing to
  153. reference. Fortunately, since one of the most basic things you can do to a
  154. list is to sort it, someone else has already solved this problem for us.
  155. Here's a command that will give us some organization:
  156. <!-- exec -->
  157. $ sort authors_*
  158. Eden Robinson
  159. Gwendolyn L. Waring
  160. James Tiptree, Jr.
  161. John Brunner
  162. John Ronald Reuel Tolkien
  163. Jo Walton
  164. Miriam Toews
  165. Pat Cadigan
  166. Ursula K. Le Guin
  167. Vanessa Veselka
  168. Vanessa Veselka
  169. <!-- end -->
  170. Does it bother you that they aren't sorted by last name? Me too. As a partial
  171. solution, we can ask `sort` to use the second "field" in each line as its sort
  172. **k**ey (by default, sort treats whitespace as a division between fields):
  173. <!-- exec -->
  174. $ sort -k2 authors_*
  175. John Brunner
  176. Pat Cadigan
  177. Ursula K. Le Guin
  178. Gwendolyn L. Waring
  179. Eden Robinson
  180. John Ronald Reuel Tolkien
  181. James Tiptree, Jr.
  182. Miriam Toews
  183. Vanessa Veselka
  184. Vanessa Veselka
  185. Jo Walton
  186. <!-- end -->
  187. That's closer, right? It sorted on "Cadigan" and "Veselka" instead of "Pat"
  188. and "Vanessa". (Of course, it's still far from perfect, because the
  189. second field in each line isn't necessarily the person's last name.)
  190. options
  191. -------
  192. Above, when we wanted to ask `sort` to behave differently, we gave it what is
  193. known as an option. Most programs with command-line interfaces will allow
  194. their behavior to be changed by adding various options. Options usually
  195. (but not always!) look like `-o` or `--option`.
  196. For example, if we wanted to see just the unique lines, irrespective of case,
  197. for a file called colors:
  198. <!-- exec -->
  199. $ cat colors
  200. RED
  201. blue
  202. red
  203. BLUE
  204. Green
  205. green
  206. GREEN
  207. <!-- end -->
  208. We could write this:
  209. <!-- exec -->
  210. $ sort -uf colors
  211. blue
  212. Green
  213. RED
  214. <!-- end -->
  215. Here `-u` stands for **u**nique and `-f` stands for **f**old case, which means
  216. to treat upper- and lower-case letters as the same for comparison purposes. You'll
  217. often see a group of short options following the `-` like this.
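To make the grouping concrete, here's a small sketch showing three spellings of the same request. The long option names (`--unique`, `--ignore-case`) are the ones GNU `sort` documents; I'm assuming a GNU-style `sort` here, and the temporary file stands in for `colors`.

```shell
colors=$(mktemp)
printf '%s\n' RED blue red BLUE Green green GREEN > "$colors"

bundled=$(sort -uf "$colors")                      # short options grouped
separate=$(sort -u -f "$colors")                   # the same options, one at a time
long_form=$(sort --unique --ignore-case "$colors") # GNU long spellings

echo "$bundled"
```

All three should print the same three lines. Which capitalization survives for each color is up to `sort`, so don't rely on it.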
  218. uniq
  219. ----
  220. Did you notice how Vanessa Veselka shows up twice in our list of authors?
  221. That's useful if we want to remember that she's in more than one category, but
  222. it's redundant if we're just worried about membership in the overall set of
  223. authors. We can make sure our list doesn't contain repeating lines by using
  224. `sort`, just like with that list of colors:
  225. <!-- exec -->
  226. $ sort -u -k2 authors_*
  227. John Brunner
  228. Pat Cadigan
  229. Ursula K. Le Guin
  230. Gwendolyn L. Waring
  231. Eden Robinson
  232. John Ronald Reuel Tolkien
  233. James Tiptree, Jr.
  234. Miriam Toews
  235. Vanessa Veselka
  236. Jo Walton
  237. <!-- end -->
  238. But there's another approach to this --- `sort` is good at only displaying a line
  239. once, but suppose we wanted to see a count of how many different lists an
  240. author shows up on? `sort` doesn't do that, but a command called `uniq` does,
  241. if you give it the option `-c` for **c**ount.
  242. `uniq` moves through the lines in its input, and if it sees a line more than
  243. once in sequence, it will only print that line once. If you have a bunch of
  244. files and you just want to see the unique lines across all of those files, you
  245. probably need to run them through `sort` first. How do you do that?
  246. <!-- exec -->
  247. $ sort authors_* | uniq -c
  248. 1 Eden Robinson
  249. 1 Gwendolyn L. Waring
  250. 1 James Tiptree, Jr.
  251. 1 John Brunner
  252. 1 John Ronald Reuel Tolkien
  253. 1 Jo Walton
  254. 1 Miriam Toews
  255. 1 Pat Cadigan
  256. 1 Ursula K. Le Guin
  257. 2 Vanessa Veselka
  258. <!-- end -->
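The "in sequence" part is easy to forget, so here's a tiny sketch of what happens with and without the `sort`. The three input names are just sample data fed in with `printf`.

```shell
# uniq only compares each line with the one right before it.
without_sort=$(printf '%s\n' Veselka Walton Veselka | uniq)
with_sort=$(printf '%s\n' Veselka Walton Veselka | sort | uniq)

echo "$without_sort"   # Veselka still appears twice: the copies aren't adjacent
echo "$with_sort"      # sorting puts the copies side by side, so one is dropped
```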
  259. standard IO
  260. -----------
  261. The `|` is called a "pipe". In the command above, it tells your shell that
  262. instead of printing the output of `sort authors_*` right to your terminal, it
  263. should send it to `uniq -c`.
  264. -> <img src="images/pipe.gif"> <-
  265. Pipes are some of the most important magic in the shell. When the people who
  266. built Unix in the first place give interviews about the stuff they remember
  267. from the early days, a lot of them reminisce about the invention of pipes and
  268. all of the new stuff it immediately made possible.
  269. Pipes help you control a thing called "standard IO". In the world of the
  270. command line, programs take **i**nput and produce **o**utput. A pipe is a way
  271. to hook the output from one program to the input of another.
  272. Unlike a lot of the weirdly named things you'll encounter in software, the
  273. metaphor here is obvious and makes pretty good sense. It even kind of looks
  274. like a physical pipe.
  275. What if, instead of sending the output of one program to the input of another,
  276. you'd like to store it in a file for later use?
  277. Check it out:
  278. <!-- exec -->
  279. $ sort authors_* | uniq > ./all_authors
  280. <!-- end -->
  281. <!-- exec -->
  282. $ cat all_authors
  283. Eden Robinson
  284. Gwendolyn L. Waring
  285. James Tiptree, Jr.
  286. John Brunner
  287. John Ronald Reuel Tolkien
  288. Jo Walton
  289. Miriam Toews
  290. Pat Cadigan
  291. Ursula K. Le Guin
  292. Vanessa Veselka
  293. <!-- end -->
  294. I like to think of the `>` as looking like a little funnel. It can be
  295. dangerous --- you should always make sure that you're not going to clobber
  296. an existing file you actually want to keep.
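If clobbering worries you, the shell has a safety catch called `noclobber`. This is a sketch rather than a recommendation for every setup; the file name `chapter_one` and its contents are invented, and the exact error message varies from shell to shell.

```shell
scratch=$(mktemp -d)
cd "$scratch"
echo 'precious draft' > chapter_one

set -o noclobber     # same as: set -C
# With noclobber on, > refuses to overwrite a file that already exists.
echo 'oops' > chapter_one || echo 'refused to overwrite chapter_one'
after_refusal=$(cat chapter_one)

# >| says "yes, really" for a single command.
echo 'deliberate rewrite' >| chapter_one
after_override=$(cat chapter_one)

set +o noclobber     # back to the default behavior
echo "$after_refusal / $after_override"
```

Appending with `>>` is unaffected, since it never throws away what's already in the file.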
  297. If you want to tack more stuff on to the end of an existing file, you can use
  298. `>>` instead. To test that, let's use `echo`, which prints out whatever string
  299. you give it on a line by itself:
  300. <!-- exec -->
  301. $ echo 'hello' > hello_world
  302. <!-- end -->
  303. <!-- exec -->
  304. $ echo 'world' >> hello_world
  305. <!-- end -->
  306. <!-- exec -->
  307. $ cat hello_world
  308. hello
  309. world
  310. <!-- end -->
  311. You can also take a file and pull it directly back into the input of a given
  312. program, which is a bit like a funnel going the other direction:
  313. <!-- exec -->
  314. $ nl < all_authors
  315. 1 Eden Robinson
  316. 2 Gwendolyn L. Waring
  317. 3 James Tiptree, Jr.
  318. 4 John Brunner
  319. 5 John Ronald Reuel Tolkien
  320. 6 Jo Walton
  321. 7 Miriam Toews
  322. 8 Pat Cadigan
  323. 9 Ursula K. Le Guin
  324. 10 Vanessa Veselka
  325. <!-- end -->
`nl` is just a way to **n**umber **l**ines. This command accomplishes pretty much
the same thing as `cat all_authors | nl`, or `nl all_authors`. You won't see
`<` used as often as `|` and `>`, since most utilities can read files on their
own, but it can save you from typing `cat` quite so often.
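Here's a quick sketch comparing the three spellings side by side. The temporary file and its three names are stand-ins for `all_authors`.

```shell
list=$(mktemp)
printf '%s\n' 'Eden Robinson' 'Jo Walton' 'Pat Cadigan' > "$list"

from_redirect=$(nl < "$list")     # the shell opens the file and feeds it to nl
from_pipe=$(cat "$list" | nl)     # cat reads the file, the pipe feeds nl
from_argument=$(nl "$list")       # nl opens the file itself

echo "$from_redirect"
```

In the first two cases `nl` never learns the file's name; it just reads whatever arrives on its standard input.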
  330. We'll use these features liberally from here on out.
  331. `--help` and man pages
  332. ----------------------
  333. You can change the behavior of most tools by giving them different options.
  334. This is all well and good if you already know what options are available,
  335. but what if you don't?
  336. Often, you can ask the tool itself:
  337. $ sort --help
  338. Usage: sort [OPTION]... [FILE]...
  339. or: sort [OPTION]... --files0-from=F
  340. Write sorted concatenation of all FILE(s) to standard output.
  341. Mandatory arguments to long options are mandatory for short options too.
  342. Ordering options:
  343. -b, --ignore-leading-blanks ignore leading blanks
  344. -d, --dictionary-order consider only blanks and alphanumeric characters
  345. -f, --ignore-case fold lower case to upper case characters
  346. -g, --general-numeric-sort compare according to general numerical value
  347. -i, --ignore-nonprinting consider only printable characters
  348. -M, --month-sort compare (unknown) < 'JAN' < ... < 'DEC'
  349. -h, --human-numeric-sort compare human readable numbers (e.g., 2K 1G)
  350. -n, --numeric-sort compare according to string numerical value
  351. -R, --random-sort sort by random hash of keys
  352. --random-source=FILE get random bytes from FILE
  353. -r, --reverse reverse the result of comparisons
  354. ...and so on. (It goes on for a while in this vein.)
  355. If that doesn't work, or doesn't provide enough info, the next thing to try is
  356. called a man page. ("man" is short for "manual". It's sort of an unfortunate
  357. abbreviation.)
  358. $ man sort
  359. SORT(1) User Commands SORT(1)
  360. NAME
  361. sort - sort lines of text files
  362. SYNOPSIS
  363. sort [OPTION]... [FILE]...
  364. sort [OPTION]... --files0-from=F
  365. DESCRIPTION
  366. Write sorted concatenation of all FILE(s) to standard output.
  367. ...and so on. Manual pages vary in quality, and it can take a while to get
  368. used to reading them, but they're very often the best place to look for help.
  369. If you're not sure what _program_ you want to use to solve a given problem, you
  370. might try searching all the man pages on the system for a keyword. `man`
  371. itself has an option to let you do this - `man -k keyword` - but most systems
  372. also have a shortcut called `apropos`, which I like to use because it's easy to
  373. remember if you imagine yourself saying "apropos of [some problem I have]..."
  374. <!-- exec -->
  375. $ apropos -s1 sort
  376. apt-sortpkgs (1) - Utility to sort package index files
  377. bunzip2 (1) - a block-sorting file compressor, v1.0.6
  378. bzip2 (1) - a block-sorting file compressor, v1.0.6
  379. comm (1) - compare two sorted files line by line
  380. sort (1) - sort lines of text files
  381. tsort (1) - perform topological sort
  382. <!-- end -->
  383. It's useful to know that the manual represented by `man` has numbered sections
  384. for different kinds of manual pages. Most of what the average user needs to
  385. know about lives in section 1, "User Commands", so you'll often see the names
  386. of different tools written like `sort(1)` or `cat(1)`. This can be a good way
  387. to make it clear in writing that you're talking about a specific piece of
  388. software rather than a verb or a small carnivorous mammal. (I specified `-s1`
  389. for section 1 above just to cut down on clutter, though in practice I usually
  390. don't bother.)
  391. Like other literary traditions, Unix is littered with this sort of convention.
  392. This one just happens to date from a time when the manual was still a physical
  393. book.
  394. wc
  395. --
  396. `wc` stands for **w**ord **c**ount. It does about what you'd expect - it
  397. counts the number of words in its input.
  398. $ wc index.md
  399. 736 4117 24944 index.md
736 is the number of lines, 4117 the number of words, and 24944 the number of
bytes (for plain ASCII text, the same as the number of characters) in the file
I'm writing right now. I use this constantly. Most obviously, it's a good way
to get an idea of how much you've written. `wc` is the tool I used to track my
progress the last time I tried National Novel Writing Month:
  405. $ find ~/p1k3/archives/2010/11 -regextype egrep -regex '.*([0-9]+|index)' -type f | xargs wc -w | tail -1
  406. 6585 total
  407. <!-- exec -->
$ cowsay 'embarrassing.'
 _______________
< embarrassing. >
 ---------------
        \   ^__^
         \  (oo)\_______
            (__)\       )\/\
                ||----w |
                ||     ||
  417. <!-- end -->
  418. Anyway. The less obvious thing about `wc` is that you can use it to count the
  419. output of other commands. Want to know _how many_ unique authors we have?
  420. <!-- exec -->
  421. $ sort authors_* | uniq | wc -l
  422. 10
  423. <!-- end -->
  424. This kind of thing is trivial, but it comes in handy more often than you might
  425. think.
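If you only want one of those numbers, `wc` takes `-l`, `-w`, and `-c` for lines, words, and bytes. A small sketch, using a made-up three-line snippet in place of a real file; the `$(( ... ))` wrapper is only there to trim the padding some versions of `wc` put around their numbers.

```shell
text='the quick brown fox
jumps over
the lazy dog'

lines=$(( $(printf '%s\n' "$text" | wc -l) ))
words=$(( $(printf '%s\n' "$text" | wc -w) ))
bytes=$(( $(printf '%s\n' "$text" | wc -c) ))

echo "$lines lines, $words words, $bytes bytes"
# prints: 3 lines, 9 words, 44 bytes
```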
  426. head, tail, and cut
  427. -------------------
  428. Remember our old pal `cat`, which just splats everything it's given back to
  429. standard output?
  430. Sometimes you've got a piece of output that's more than you actually want to
  431. deal with at once. Maybe you just want to glance at the first few lines in a
  432. file:
  433. <!-- exec -->
  434. $ head -3 colors
  435. RED
  436. blue
  437. red
  438. <!-- end -->
  439. ...or maybe you want to see the last thing in a list:
  440. <!-- exec -->
  441. $ sort colors | uniq -i | tail -1
  442. red
  443. <!-- end -->
  444. ...or maybe you're only interested in the first "field" in some list. You might
  445. use `cut` here, asking it to treat spaces as delimiters between fields and
  446. return only the first field for each line of its input:
  447. <!-- exec -->
  448. $ cut -d' ' -f1 ./authors_*
  449. Eden
  450. Vanessa
  451. Miriam
  452. Gwendolyn
  453. Ursula
  454. Jo
  455. Pat
  456. John
  457. Vanessa
  458. James
  459. John
  460. <!-- end -->
  461. Suppose we're curious what the few most commonly occurring first names on our
  462. author list are? Here's an approach, silly but effective, that combines a lot
  463. of what we've discussed so far and looks like plenty of one-liners I wind up
  464. writing in real life:
  465. <!-- exec -->
  466. $ cut -d' ' -f1 ./authors_* | sort | uniq -ci | sort -n | tail -3
  467. 1 Ursula
  468. 2 John
  469. 2 Vanessa
  470. <!-- end -->
  471. Let's walk through this one step by step:
  472. First, we have `cut` extract the first field of each line in our author lists.
  473. cut -d' ' -f1 ./authors_*
  474. Then we sort these results
  475. | sort
  476. and pass them to `uniq`, asking it for a case-insensitive count of each
  477. repeated line
  478. | uniq -ci
  479. then sort again, numerically,
  480. | sort -n
  481. and finally, we chop off everything but the last three lines:
  482. | tail -3
  483. If you wanted to make sure to count an individual author's first name
  484. only once, even if that author appears more than once in the files,
  485. you could instead do:
  486. <!-- exec -->
  487. $ sort -u ./authors_* | cut -d' ' -f1 | uniq -ci | sort -n | tail -3
  488. 1 Ursula
  489. 1 Vanessa
  490. 2 John
  491. <!-- end -->
  492. tab separated values
  493. --------------------
  494. Notice above how we had to tell `cut` that "fields" in `authors_*` are
  495. delimited by spaces? It turns out that if you don't use `-d`, `cut` defaults
  496. to using tab characters for a delimiter.
  497. Tab characters are sort of weird little animals. You can't usually _see_ them
  498. directly --- they're like a space character that takes up more than one space
  499. when displayed. By convention, one tab is usually rendered as 8 spaces, but
  500. it's up to the software that's displaying the character what it wants to do.
  501. (In fact, it's more complicated than that: Tabs are often rendered as marking
  502. _tab stops_, which is a concept I remember from 7th grade typing classes, but
  503. haven't actually thought about in my day-to-day life for nearly 20 years.)
  504. Here's a version of our `all_authors` that's been rearranged so that the first
  505. field is the author's last name, the second is their first name, the third is
  506. their middle name or initial (if we know it) and the fourth is any suffix.
  507. Fields are separated by a single tab character:
  508. <!-- exec -->
  509. $ cat all_authors.tsv
  510. Robinson Eden
  511. Waring Gwendolyn L.
  512. Tiptree James Jr.
  513. Brunner John
  514. Tolkien John Ronald Reuel
  515. Walton Jo
  516. Toews Miriam
  517. Cadigan Pat
  518. Le Guin Ursula K.
  519. Veselka Vanessa
  520. <!-- end -->
  521. That looks kind of garbled, right? In order to make it a little more obvious
  522. what's happening, let's use `cat -T`, which displays tab characters as `^I`:
  523. <!-- exec -->
  524. $ cat -T all_authors.tsv
  525. Robinson^IEden
  526. Waring^IGwendolyn^IL.
  527. Tiptree^IJames^I^IJr.
  528. Brunner^IJohn
  529. Tolkien^IJohn^IRonald Reuel
  530. Walton^IJo
  531. Toews^IMiriam
  532. Cadigan^IPat
  533. Le Guin^IUrsula^IK.
  534. Veselka^IVanessa
  535. <!-- end -->
  536. It looks odd when displayed because some names are at or nearly at 8 characters long.
  537. "Robinson", at 8 characters, overshoots the first tab stop, so "Eden" gets indented
  538. further than other first names, and so on.
Fortunately, in order to make this more human-readable, we can pass it through
`expand`, which replaces each tab with enough spaces to reach the next tab stop
(every 8 columns by default; `-t` picks a different width):
  541. <!-- exec -->
$ expand -t14 all_authors.tsv
Robinson      Eden
Waring        Gwendolyn     L.
Tiptree       James                       Jr.
Brunner       John
Tolkien       John          Ronald Reuel
Walton        Jo
Toews         Miriam
Cadigan       Pat
Le Guin       Ursula        K.
Veselka       Vanessa
  553. <!-- end -->
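It's worth seeing that `expand` pads out to the next tab stop rather than swapping each tab for a fixed number of spaces. Here's a sketch with stops every 4 columns; the `ab` and `abcd` strings are arbitrary sample text.

```shell
# "ab" ends at column 2, so the tab needs 2 spaces to reach the stop at column 4.
short=$(printf 'ab\tX\n' | expand -t4)

# "abcd" already ends on a stop, so the tab jumps a full 4 columns to the next one.
longer=$(printf 'abcd\tX\n' | expand -t4)

printf '[%s]\n[%s]\n' "$short" "$longer"
```

This is the same thing that happens to "Robinson" with the default 8-column stops, and it's why picking a `-t` value wider than your longest field makes the columns line up.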
  554. Now it's easy to sort by last name:
  555. <!-- exec -->
$ sort -k1 all_authors.tsv | expand -t14
Brunner       John
Cadigan       Pat
Le Guin       Ursula        K.
Robinson      Eden
Tiptree       James                       Jr.
Toews         Miriam
Tolkien       John          Ronald Reuel
Veselka       Vanessa
Walton        Jo
Waring        Gwendolyn     L.
  567. <!-- end -->
  568. Or just extract middle names and initials:
  569. <!-- exec -->
  570. $ cut -f3 all_authors.tsv
  571. L.
  572. Ronald Reuel
  573. K.
  574. <!-- end -->
  575. It probably won't surprise you to learn that there's a corresponding `paste`
  576. command, which takes two or more files and stitches them together with tab
  577. characters. Let's extract a couple of things from our author list and put them
  578. back together in a different order:
  579. <!-- exec -->
  580. $ cut -f1 all_authors.tsv > lastnames
  581. <!-- end -->
  582. <!-- exec -->
  583. $ cut -f2 all_authors.tsv > firstnames
  584. <!-- end -->
  585. <!-- exec -->
$ paste firstnames lastnames | sort -k2 | expand -t12
John        Brunner
Pat         Cadigan
Ursula      Le Guin
Eden        Robinson
James       Tiptree
Miriam      Toews
John        Tolkien
Vanessa     Veselka
Jo          Walton
Gwendolyn   Waring
  597. <!-- end -->
  598. As these examples show, TSV is something very like a primitive spreadsheet: A
  599. way to represent information in columns and rows. In fact, it's a close cousin
  600. of CSV, which is often used as a lowest-common-denominator format for
  601. transferring spreadsheets, and which represents data something like this:
  602. last,first,middle,suffix
  603. Tolkien,John,Ronald Reuel,
  604. Tiptree,James,,Jr.
  605. The advantage of tabs is that they're supported by a bunch of the standard
  606. tools. A disadvantage is that they're kind of ugly and can be weird to deal
  607. with, but they're useful anyway, and character-delimited rows are often a
  608. good-enough way to hack your way through problems that call for basic
  609. structure.
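For very simple CSV, you can get from commas to tabs with `tr`, which translates one character into another. This is a sketch that assumes no field contains a comma or quotation marks; real-world CSV often does, and this approach will mangle it. `tr` hasn't come up yet, and the data is the sample from above.

```shell
csv='last,first,middle,suffix
Tolkien,John,Ronald Reuel,
Tiptree,James,,Jr.'

# Swap every comma for a tab, then let cut use its default delimiter.
tsv=$(printf '%s\n' "$csv" | tr ',' '\t')

printf '%s\n' "$tsv" | cut -f1,2                      # last and first names
suffix=$(printf '%s\n' "$tsv" | tail -n 1 | cut -f4)  # fourth field of the last row
echo "$suffix"
```

Note that the empty middle name for Tiptree survives as two tabs in a row, so the suffix still lands in the fourth field.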
  610. finding text: grep
  611. ------------------
  612. After all those contortions, what if you actually just want to see _which lists_
  613. an individual author appears on?
  614. <!-- exec -->
  615. $ grep 'Vanessa' ./authors_*
  616. ./authors_contemporary_fic:Vanessa Veselka
  617. ./authors_sff:Vanessa Veselka
  618. <!-- end -->
  619. `grep` takes a string to search for and, optionally, a list of files to search
  620. in. If you don't specify files, it'll look through standard input instead:
  621. <!-- exec -->
  622. $ cat ./authors_* | grep 'Vanessa'
  623. Vanessa Veselka
  624. Vanessa Veselka
  625. <!-- end -->
  626. Most of the time, piping the output of `cat` to `grep` is considered silly,
  627. because `grep` knows how to find things in files on its own. Many thousands of
  628. words have been written on this topic by leading lights of the nerd community.
  629. You've probably noticed that this result doesn't contain filenames (and thus
  630. isn't very useful to us). That's because all `grep` saw was the lines in the
  631. files, not the names of the files themselves.
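If the filenames are all you're after, `grep -l` lists just the files that contain a match. A sketch, using cut-down copies of the author lists in a scratch directory (the contents here are abbreviated for illustration):

```shell
scratch=$(mktemp -d)
cd "$scratch"
printf '%s\n' 'Jo Walton' 'Vanessa Veselka' > authors_sff
printf '%s\n' 'Eden Robinson' 'Vanessa Veselka' > authors_contemporary_fic
printf '%s\n' 'Gwendolyn L. Waring' > authors_nat_hist

# -l prints each matching file's name once, and none of the matching lines.
lists=$(grep -l 'Vanessa' authors_*)
echo "$lists"
```

This only works when `grep` is handed the files directly; once the text has gone through `cat` and a pipe, the names are gone.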
  632. now you have n problems
  633. -----------------------
  634. To close out this introductory chapter, let's spend a little time on a topic
  635. that will likely vex, confound, and (occasionally) delight you for as long as
  636. you are acquainted with the command line.
  637. When I was talking about `grep` a moment ago, I fudged the details more than a
  638. little by saying that it expects a string to search for. What `grep`
  639. _actually_ expects is a _pattern_. Moreover, it expects a specific kind of
  640. pattern, what's known as a _regular expression_, a cumbersome phrase frequently
  641. shortened to regex.
  642. There's a lot of theory about what makes up a regular expression. Fortunately,
  643. very little of it matters to the short version that will let you get useful
  644. stuff done. The short version is that a regex is like using wildcards in the
  645. shell to match groups of files, but for text in general and with more magic.
  646. <!-- exec -->
  647. $ grep 'Jo.*' ./authors_*
  648. ./authors_sff:Jo Walton
  649. ./authors_sff:John Ronald Reuel Tolkien
  650. ./authors_sff:John Brunner
  651. <!-- end -->
  652. The pattern `Jo.*` says that we're looking for lines which contain a literal
  653. `Jo`, followed by any quantity (including none) of any character. In a regex,
  654. `.` means "anything" and `*` means "any amount of the preceding thing".
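This is a common trip hazard, because `*` means something different in a regex than it does as a shell wildcard. Here's a sketch of the difference. The `case` statement is just a way to test a shell pattern against a string, and the three names are sample input.

```shell
# Shell wildcard: * on its own stands for any run of characters.
case 'John' in
  Jo*) glob_result='matches' ;;
  *)   glob_result='no match' ;;
esac
echo "$glob_result"

# Regex: * repeats whatever precedes it, so 'Jo*' is a J followed by
# zero or more o's, which means "Jim" qualifies too.
loose=$(printf '%s\n' Jim Jo John | grep 'Jo*')

# 'Jo.*' is a literal Jo followed by any quantity of any character.
strict=$(printf '%s\n' Jim Jo John | grep 'Jo.*')

echo "$loose"
echo "$strict"
```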
  655. `.` and `*` are magical. In the particular dialect of regexen understood
  656. by `grep`, other magical things include:
  657. <table>
  658. <tr><td><code>^</code> </td> <td>start of a line </td></tr>
  659. <tr><td><code>$</code> </td> <td>end of a line </td></tr>
  660. <tr><td><code>[abc]</code></td> <td>one of a, b, or c </td></tr>
  661. <tr><td><code>[a-z]</code></td> <td>a character in the range a through z</td></tr>
  662. <tr><td><code>[0-9]</code></td> <td>a character in the range 0 through 9</td></tr>
  663. <tr><td><code>+</code> </td> <td>one or more of the preceding thing </td></tr>
  664. <tr><td><code>?</code> </td> <td>0 or 1 of the preceding thing </td></tr>
  665. <tr><td><code>*</code> </td> <td>any number of the preceding thing </td></tr>
  666. <tr><td><code>(foo|bar)</code></td> <td>"foo" or "bar"</td></tr>
  667. <tr><td><code>(foo)?</code></td> <td>optional "foo"</td></tr>
  668. </table>
  669. It's actually a little more complicated than that: By default, if you want to
  670. use a lot of the magical characters, you have to prefix them with `\`. This is
  671. both ugly and confusing, so unless you're writing a very simple pattern, it's
  672. often easiest to call `grep -E`, for **E**xtended regular expressions, which
  673. means that lots of characters will have special meanings.
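To see what the backslash business looks like in practice, here's the same pattern written both ways. This is a sketch with a handful of names piped in as sample data; the `{4}` means "exactly four of the preceding thing".

```shell
names='Eden Robinson
Jo Walton
John Brunner
Gwendolyn L. Waring'

# Basic dialect: the braces have to be escaped to be magical.
basic=$(printf '%s\n' "$names" | grep '^[A-Za-z]\{4\} ')

# Extended dialect: the same pattern, minus the backslashes.
extended=$(printf '%s\n' "$names" | grep -E '^[A-Za-z]{4} ')

echo "$extended"
```

Both should pick out the same two lines, the ones starting with "Eden" and "John".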
  674. Authors with 4-letter first names:
  675. <!-- exec -->
  676. $ grep -iE '^[a-z]{4} ' ./authors_*
  677. ./authors_contemporary_fic:Eden Robinson
  678. ./authors_sff:John Ronald Reuel Tolkien
  679. ./authors_sff:John Brunner
  680. <!-- end -->
  681. A count of authors named John:
  682. <!-- exec -->
  683. $ grep -c '^John ' ./all_authors
  684. 2
  685. <!-- end -->
  686. Lines in this file matching the words "magic" or "magical":
  687. $ grep -iE 'magic(al)?' ./index.md
  688. Pipes are some of the most important magic in the shell. When the people who
  689. shell to match groups of files, but with more magic.
  690. `.` and `*` are magical. In the particular dialect of regexen understood
  691. by `grep`, other magical things include:
  692. use a lot of the magical characters, you have to prefix them with `\`. This is
  693. Lines in this file matching the words "magic" or "magical":
  694. $ grep -iE 'magic(al)?' ./index.md
  695. Find some "-agic" words in a big list of words:
  696. <!-- exec -->
  697. $ grep -iE '(m|tr|pel)agic' /usr/share/dict/words
  698. magic
  699. magic's
  700. magical
  701. magically
  702. magician
  703. magician's
  704. magicians
  705. pelagic
  706. tragic
  707. tragically
  708. tragicomedies
  709. tragicomedy
  710. tragicomedy's
  711. <!-- end -->
  712. `grep` isn't the only - or even the most important - tool that makes use of
  713. regular expressions, but it's a good place to start because it's one of the
  714. fundamental building blocks for so many other operations. Filtering lists of
  715. things, matching patterns within collections, and writing concise descriptions
  716. of how text should be transformed are at the heart of a practical approach to
  717. Unix-like systems. Regexen turn out to be a seductively powerful way to do
  718. these things - so much so that they've crept their way into text editors,
  719. databases, and full-featured programming languages.
  720. There's a dark side to all of this, for the truth about regular expressions is
  721. that they are ugly, inconsistent, brittle, and _incredibly_ difficult to think
  722. clearly about. They take years to master and reward the wielder with great
  723. power, but they are also a trap: a temptation towards the path of cleverness
  724. masquerading as wisdom.
  725. -> ✑ <-
  726. I'll be returning to this theme, but for the time being let's move on. Now
  727. that we've established, however haphazardly, some of the basics, let's consider
  728. their application to a real-world task.