A book about the command line for humans.

1. the command line as literary environment
===========================================

There are a lot of ways to structure an introduction to the command line. I'm
going to start with writing as a point of departure because, aside from web
development, it's what I use a computer for most. I want to shine a light on
the humane potential of ideas that are usually understood as nerd trivia.
Computers have utterly transformed the practice of writing within the space of
my lifetime, but it seems to me that writers as a class miss out on many of the
software tools and patterns taken as a given in more "technical" fields.

Writing, particularly writing of any real scope or complexity, is very much a
technical task. It makes demands, both physical and psychological, of its
practitioners. As with woodworkers, graphic artists, and farmers, writers
exhibit strong preferences in their tools, materials, and environment, and they
do so because they're engaged in a physically and cognitively challenging task.

My thesis is that the modern Linux command line is a pretty good environment
for working with English prose and prosody, and that maybe this will illuminate
the ways it could be useful in your own work with a computer, whatever that
work happens to be.

terms and definitions
---------------------

What software are we actually talking about when we say "the command line"?

For the purposes of this discussion, we're talking about an environment built
on a very old paradigm called Unix.

-> <img src="images/jp_unix.jpg" height=320 width=470> <-

...except what classical Unix really looks like is this:

-> <img src="images/blinking.gif" width=470> <-

The Unix-like environment we're going to use isn't very classical, really.
It's an operating system kernel called Linux, combined with a bunch of things
written by other people (people in the GNU and Debian projects, and many
others). Purists will tell you that this isn't properly Unix at all. In
strict historical terms they're right, or at least a certain kind of right, but
for the purposes of my cultural agenda I'm going to ignore them right now.

-> <img src="images/debian.png"> <-

This is what's called a shell. There are many different shells, but they
pretty much all operate on the same idea: You navigate a filesystem and run
programs by typing commands. Commands can be combined in various ways to make
programs of their own, and in fact the way you use the computer is often just
to write little programs that invoke other programs, turtles-all-the-way-down
style.

The standard shell these days is something called Bash, so we'll use Bash.
It's what you'll most often see in the wild. Like most shells, Bash is ugly
and stupid in more ways than it is possible to easily summarize. It's also an
incredibly powerful and expressive piece of software.

get you a shell
---------------

{TODO: Make this section useful.}

twisty little passages
----------------------

Have you ever played a text-based adventure game or MUD, of the kind that
describes a setting and takes commands for movement and so on? Readers of a
certain age and temperament might recognize the opening of Crowther & Woods'
_Adventure_, the great-granddaddy of text adventure games:

    > GO EAST

In much the same way, you can think of the shell as a kind of environment you
inhabit, the same way your character might inhabit an adventure game. Or as a
sort of vehicle for getting around inside of computers. The difference is that
instead of navigating around virtual rooms and hallways with commands like
`LOOK` and `EAST`, you navigate between directories by typing commands like
`ls` and `cd notes`:

    $ ls
    code  Downloads  notes  p1k3  photos  scraps  userland-book
    $ cd notes
    $ ls
    notes.txt  sparkfun  TODO.txt

`ls` lists files. Some files are directories, which means they can contain
other files, and you can step inside of them by typing `cd` (for **c**hange
**d**irectory).

In the Macintosh and Windows world, directories have been called
"folders" for a long time now. This isn't the _worst_ metaphor for what's
going on, and it's so pervasive by now that it's not worth fighting about.
It's also not exactly a _great_ metaphor, since computer filesystems aren't
built very much like the filing cabinets of yore. A directory acts a lot like
a container of some sort, but it's an infinitely expandable one which may
contain nested sub-spaces much larger than itself. Directories are frequently
like the TARDIS: Bigger on the inside.

cat
---

When you're in the shell, you have many tools at your disposal - programs that
can be used on many different files, or chained together with other programs.
They tend to have weird, cryptic names, but a lot of them do very simple
things. Tasks that might be a menu item in a big program like Word, like
counting the number of words in a document or finding a particular phrase, are
often programs unto themselves. We'll start with something even more basic
than that.

Suppose you have some files, and you're curious what's in them. For example,
suppose you've got a list of authors you're planning to reference, and you just
want to check its contents real quick-like. This is where our friend `cat`
comes in:

<!-- exec -->

    $ cat authors_sff
    Ursula K. Le Guin
    Jo Walton
    Pat Cadigan
    John Ronald Reuel Tolkien
    Vanessa Veselka
    James Tiptree, Jr.
    John Brunner

<!-- end -->

"Why," you might be asking, "is the command to dump out the contents of a file
to a screen called `cat`? What do felines have to do with anything?"

It turns out that `cat` is actually short for "concatenate", which is a long
word basically meaning "stick things together". In programming, we usually
refer to sticking two bits of text together as "string concatenation", probably
because programmers like to feel like they're being very precise about very
simple actions.

Suppose you wanted to see the contents of a _set_ of author lists:

<!-- exec -->

    $ cat authors_sff authors_contemporary_fic authors_nat_hist
    Ursula K. Le Guin
    Jo Walton
    Pat Cadigan
    John Ronald Reuel Tolkien
    Vanessa Veselka
    James Tiptree, Jr.
    John Brunner
    Eden Robinson
    Vanessa Veselka
    Miriam Toews
    Gwendolyn L. Waring

<!-- end -->

wildcards
---------

We're working with three filenames: `authors_sff`, `authors_contemporary_fic`,
and `authors_nat_hist`. That's an awful lot of typing every time we want to do
something to all three files. Fortunately, our shell offers a shorthand for
"all the files that start with `authors_`":

<!-- exec -->

    $ cat authors_*
    Eden Robinson
    Vanessa Veselka
    Miriam Toews
    Gwendolyn L. Waring
    Ursula K. Le Guin
    Jo Walton
    Pat Cadigan
    John Ronald Reuel Tolkien
    Vanessa Veselka
    James Tiptree, Jr.
    John Brunner

<!-- end -->

In Bash-land, `*` basically means "anything", and is known in the vernacular,
somewhat poetically, as a "wildcard". You should always be careful with
wildcards, especially if you're doing anything destructive. They can and will
surprise the unwary. Still, once you're used to the idea, they will save you a
lot of RSI.
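
One habit that takes some of the surprise out of wildcards is previewing what a pattern expands to before handing it to anything destructive. The shell does the expansion itself, so `echo` shows exactly the list of names a command would receive. This is only a sketch: the scratch directory and the empty files in it are stand-ins created for the illustration.

```shell
# Work in a throwaway directory so nothing real is touched.
cd "$(mktemp -d)"
touch authors_sff authors_contemporary_fic authors_nat_hist colors

# The shell expands the pattern before the command runs;
# echo just prints the resulting list of names.
echo authors_*

# Quoting the pattern turns the expansion off, so echo sees a literal *.
echo 'authors_*'
```

The first `echo` prints the three matching filenames and leaves `colors` out; the second prints the pattern itself, untouched.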

sort
----

There's a problem here. Our author list is out of order, and thus confusing to
reference. Fortunately, since one of the most basic things you can do to a
list is to sort it, someone else has already solved this problem for us.
Here's a command that will give us some organization:

<!-- exec -->

    $ sort authors_*
    Eden Robinson
    Gwendolyn L. Waring
    James Tiptree, Jr.
    John Brunner
    John Ronald Reuel Tolkien
    Jo Walton
    Miriam Toews
    Pat Cadigan
    Ursula K. Le Guin
    Vanessa Veselka
    Vanessa Veselka

<!-- end -->

Does it bother you that they aren't sorted by last name? Me too. As a partial
solution, we can ask `sort` to use the second "field" in each line as the start
of its sort **k**ey (by default, `sort` treats whitespace as a division between
fields, and `-k2` makes the key run from the second field to the end of the
line):

<!-- exec -->

    $ sort -k2 authors_*
    John Brunner
    Pat Cadigan
    Ursula K. Le Guin
    Gwendolyn L. Waring
    Eden Robinson
    John Ronald Reuel Tolkien
    James Tiptree, Jr.
    Miriam Toews
    Vanessa Veselka
    Vanessa Veselka
    Jo Walton

<!-- end -->

That's closer, right? It sorted on "Cadigan" and "Veselka" instead of "Pat"
and "Vanessa". (Of course, it's still far from perfect, because the
second field in each line isn't necessarily the person's last name.)
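
A detail worth sketching: `-k2` starts the key at field 2 and runs to the end of the line, while `-k2,2` stops the key at the end of field 2. The two sample lines below are made up, and `LC_ALL=C` is an assumption added here so that the ordering is plain byte order and therefore predictable.

```shell
# Assumption: force byte-order comparison so results don't vary by locale.
export LC_ALL=C

# Two made-up lines that share the same second field.
sample='alpha same 2
bravo same 1'

# -k2: the key runs from field 2 to the end of the line, so "1" beats "2"
# and the bravo line comes first.
printf '%s\n' "$sample" | sort -k2

# -k2,2: the key is field 2 only. The keys tie, and sort falls back to
# comparing the whole lines, so the alpha line comes first.
printf '%s\n' "$sample" | sort -k2,2
```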

options
-------

Above, when we wanted to ask `sort` to behave differently, we gave it what is
known as an option. Most programs with command-line interfaces will allow
their behavior to be changed by adding various options. Options usually
(but not always!) look like `-o` or `--option`.

For example, if we wanted to see just the unique lines, irrespective of case,
for a file called `colors`:

<!-- exec -->

    $ cat colors
    RED
    blue
    red
    BLUE
    Green
    green
    GREEN

<!-- end -->

We could write this:

<!-- exec -->

    $ sort -uf colors
    blue
    Green
    RED

<!-- end -->

Here `-u` stands for **u**nique and `-f` stands for **f**old case, which means
to treat upper- and lower-case letters as the same for comparison purposes. You'll
often see a group of short options following the `-` like this.
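
To make the grouping concrete, here is a small sketch showing three spellings of the same request producing identical output. It rebuilds a `colors` file in a scratch directory; the long option names `--unique` and `--ignore-case` are assumed to be available, as they are in GNU `sort`.

```shell
cd "$(mktemp -d)"
printf '%s\n' RED blue red BLUE Green green GREEN > colors

# Three spellings of the same request. The long names assume GNU sort
# (the usual one on Linux).
sort -uf colors > grouped
sort -u -f colors > separate
sort --unique --ignore-case colors > long

# cmp -s is silent and succeeds only if the files are byte-for-byte identical.
cmp -s grouped separate && cmp -s grouped long && echo "all three agree"
```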

uniq
----

Did you notice how Vanessa Veselka shows up twice in our list of authors?
That's useful if we want to remember that she's in more than one category, but
it's redundant if we're just worried about membership in the overall set of
authors. We can make sure our list doesn't contain repeating lines by using
`sort`, just like with that list of colors:

<!-- exec -->

    $ sort -u -k2 authors_*
    John Brunner
    Pat Cadigan
    Ursula K. Le Guin
    Gwendolyn L. Waring
    Eden Robinson
    John Ronald Reuel Tolkien
    James Tiptree, Jr.
    Miriam Toews
    Vanessa Veselka
    Jo Walton

<!-- end -->

But there's another approach to this --- `sort` is good at only displaying a line
once, but suppose we wanted to see a count of how many different lists an
author shows up on? `sort` doesn't do that, but a command called `uniq` does,
if you give it the option `-c` for **c**ount.

`uniq` moves through the lines in its input, and if it sees a line more than
once in sequence, it will only print that line once. If you have a bunch of
files and you just want to see the unique lines across all of those files, you
probably need to run them through `sort` first. How do you do that?

<!-- exec -->

    $ sort authors_* | uniq -c
          1 Eden Robinson
          1 Gwendolyn L. Waring
          1 James Tiptree, Jr.
          1 John Brunner
          1 John Ronald Reuel Tolkien
          1 Jo Walton
          1 Miriam Toews
          1 Pat Cadigan
          1 Ursula K. Le Guin
          2 Vanessa Veselka

<!-- end -->
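
The "in sequence" part is easy to miss, so here is a minimal sketch with three made-up lines. `uniq` only compares each line with the one immediately before it, which is why the sorting step matters.

```shell
# The two "a" lines are separated by "b", so uniq keeps all three lines.
printf '%s\n' a b a | uniq

# Sorting first puts the duplicates next to each other, leaving two lines.
printf '%s\n' a b a | sort | uniq
```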

standard IO
-----------

The `|` is called a "pipe". In the command above, it tells your shell that
instead of printing the output of `sort authors_*` right to your terminal, it
should send it to `uniq -c`.

-> <img src="images/pipe.gif"> <-

Pipes are some of the most important magic in the shell. When the people who
built Unix in the first place give interviews about the stuff they remember
from the early days, a lot of them reminisce about the invention of pipes and
all of the new stuff it immediately made possible.

Pipes help you control a thing called "standard IO". In the world of the
command line, programs take **i**nput and produce **o**utput. A pipe is a way
to hook the output from one program to the input of another.

Unlike a lot of the weirdly named things you'll encounter in software, the
metaphor here is obvious and makes pretty good sense. It even kind of looks
like a physical pipe.

What if, instead of sending the output of one program to the input of another,
you'd like to store it in a file for later use?

Check it out:

<!-- exec -->

    $ sort authors_* | uniq > ./all_authors

<!-- end -->

<!-- exec -->

    $ cat all_authors
    Eden Robinson
    Gwendolyn L. Waring
    James Tiptree, Jr.
    John Brunner
    John Ronald Reuel Tolkien
    Jo Walton
    Miriam Toews
    Pat Cadigan
    Ursula K. Le Guin
    Vanessa Veselka

<!-- end -->

I like to think of the `>` as looking like a little funnel. It can be
dangerous --- you should always make sure that you're not going to clobber
an existing file you actually want to keep.
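
If the clobbering risk worries you, the shell has a safety catch. The sketch below uses the `noclobber` option, which makes `>` refuse to overwrite an existing file, and the `>|` operator, which overrides it. The filename `notes.txt` and the scratch directory are made up for the illustration.

```shell
cd "$(mktemp -d)"
echo 'keep me' > notes.txt

# With noclobber set, > refuses to overwrite a file that already exists.
set -o noclobber
{ echo 'oops' > notes.txt; } 2>/dev/null || echo 'refused to overwrite notes.txt'

# The file still holds its original contents at this point.
cat notes.txt

# >| says "yes, really overwrite", even with noclobber on.
echo 'replaced on purpose' >| notes.txt

# Put the shell back to its default behavior.
set +o noclobber
cat notes.txt
```

Whether to leave `noclobber` on all the time is a matter of taste; some people find that it gets in the way more than it helps.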

If you want to tack more stuff on to the end of an existing file, you can use
`>>` instead. To test that, let's use `echo`, which prints out whatever string
you give it on a line by itself:

<!-- exec -->

    $ echo 'hello' > hello_world

<!-- end -->

<!-- exec -->

    $ echo 'world' >> hello_world

<!-- end -->

<!-- exec -->

    $ cat hello_world
    hello
    world

<!-- end -->

You can also take a file and pull it directly back into the input of a given
program, which is a bit like a funnel going the other direction:

<!-- exec -->

    $ nl < all_authors
         1  Eden Robinson
         2  Gwendolyn L. Waring
         3  James Tiptree, Jr.
         4  John Brunner
         5  John Ronald Reuel Tolkien
         6  Jo Walton
         7  Miriam Toews
         8  Pat Cadigan
         9  Ursula K. Le Guin
        10  Vanessa Veselka

<!-- end -->

`nl` is just a way to **n**umber **l**ines. This command accomplishes pretty much
the same thing as `cat all_authors | nl`, or `nl all_authors`. You won't see
`<` used as often as `|` and `>`, since most utilities can read files on their
own, but it can save you from typing `cat` quite so often.

We'll use these features liberally from here on out.
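
As a quick check on the claim that those three spellings are interchangeable, here is a sketch that runs each one and compares the results. The three-line `all_authors` file is a shortened stand-in, and the `via_*` filenames are made up.

```shell
cd "$(mktemp -d)"
printf '%s\n' 'Eden Robinson' 'Jo Walton' 'Pat Cadigan' > all_authors

nl < all_authors     > via_redirect   # the shell opens the file for nl
cat all_authors | nl > via_pipe       # cat reads it and pipes it along
nl all_authors       > via_argument   # nl opens the file itself

# cmp -s succeeds only when two files are byte-for-byte identical.
cmp -s via_redirect via_pipe && cmp -s via_pipe via_argument && echo "same output"
```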

`--help` and man pages
----------------------

You can change the behavior of most tools by giving them different options.
This is all well and good if you already know what options are available,
but what if you don't?

Often, you can ask the tool itself:

    $ sort --help
    Usage: sort [OPTION]... [FILE]...
      or:  sort [OPTION]... --files0-from=F
    Write sorted concatenation of all FILE(s) to standard output.
    Mandatory arguments to long options are mandatory for short options too.
    Ordering options:
      -b, --ignore-leading-blanks  ignore leading blanks
      -d, --dictionary-order      consider only blanks and alphanumeric characters
      -f, --ignore-case           fold lower case to upper case characters
      -g, --general-numeric-sort  compare according to general numerical value
      -i, --ignore-nonprinting    consider only printable characters
      -M, --month-sort            compare (unknown) < 'JAN' < ... < 'DEC'
      -h, --human-numeric-sort    compare human readable numbers (e.g., 2K 1G)
      -n, --numeric-sort          compare according to string numerical value
      -R, --random-sort           sort by random hash of keys
          --random-source=FILE    get random bytes from FILE
      -r, --reverse               reverse the result of comparisons

...and so on. (It goes on for a while in this vein.)

If that doesn't work, or doesn't provide enough info, the next thing to try is
called a man page. ("man" is short for "manual". It's sort of an unfortunate
abbreviation.)

    $ man sort
    SORT(1)                     User Commands                     SORT(1)
    NAME
           sort - sort lines of text files
           sort [OPTION]... [FILE]...
           sort [OPTION]... --files0-from=F
           Write sorted concatenation of all FILE(s) to standard output.

...and so on. Manual pages vary in quality, and it can take a while to get
used to reading them, but they're very often the best place to look for help.

If you're not sure what _program_ you want to use to solve a given problem, you
might try searching all the man pages on the system for a keyword. `man`
itself has an option to let you do this - `man -k keyword` - but most systems
also have a shortcut called `apropos`, which I like to use because it's easy to
remember if you imagine yourself saying "apropos of [some problem I have]..."

<!-- exec -->

    $ apropos -s1 sort
    apt-sortpkgs (1)     - Utility to sort package index files
    bunzip2 (1)          - a block-sorting file compressor, v1.0.6
    bzip2 (1)            - a block-sorting file compressor, v1.0.6
    comm (1)             - compare two sorted files line by line
    sort (1)             - sort lines of text files
    tsort (1)            - perform topological sort

<!-- end -->

It's useful to know that the manual represented by `man` has numbered sections
for different kinds of manual pages. Most of what the average user needs to
know about lives in section 1, "User Commands", so you'll often see the names
of different tools written like `sort(1)` or `cat(1)`. This can be a good way
to make it clear in writing that you're talking about a specific piece of
software rather than a verb or a small carnivorous mammal. (I specified `-s1`
for section 1 above just to cut down on clutter, though in practice I usually
don't bother.)

Like other literary traditions, Unix is littered with this sort of convention.
This one just happens to date from a time when the manual was still a physical
book.

wc
--

`wc` stands for **w**ord **c**ount. It does about what you'd expect - it
counts the number of words in its input.

    $ wc index.md
      736  4117 24944 index.md

736 is the number of lines, 4117 the number of words, and 24944 the number of
bytes (the same as the number of characters when the text is plain ASCII) in
the file I'm writing right now. I use this constantly. Most obviously, it's a
good way to get an idea of how much you've written. `wc` is the tool I used to
track my progress the last time I tried National Novel Writing Month:

    $ find ~/p1k3/archives/2010/11 -regextype egrep -regex '.*([0-9]+|index)' -type f | xargs wc -w | tail -1
    6585 total

<!-- exec -->

    $ cowsay 'embarrassing.'
     _______________
    < embarrassing. >
     ---------------
            \   ^__^
             \  (oo)\_______
                (__)\       )\/\
                    ||----w |
                    ||     ||

<!-- end -->

Anyway. The less obvious thing about `wc` is that you can use it to count the
output of other commands. Want to know _how many_ unique authors we have?

<!-- exec -->

    $ sort authors_* | uniq | wc -l
    10

<!-- end -->

This kind of thing is trivial, but it comes in handy more often than you might
think.
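
Each of the three numbers `wc` prints can also be requested on its own. A minimal sketch with a made-up two-line input; note that `-c` counts bytes, and that newlines count too.

```shell
# Two lines, three words, fourteen bytes (each newline is one byte).
printf 'one two\nthree\n' | wc -l   # lines
printf 'one two\nthree\n' | wc -w   # words
printf 'one two\nthree\n' | wc -c   # bytes
```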

head, tail, and cut
-------------------

Remember our old pal `cat`, which just splats everything it's given back to
standard output?

Sometimes you've got a piece of output that's more than you actually want to
deal with at once. Maybe you just want to glance at the first few lines in a
file:

<!-- exec -->

    $ head -3 colors
    RED
    blue
    red

<!-- end -->

...or maybe you want to see the last thing in a list:

<!-- exec -->

    $ sort colors | uniq -i | tail -1
    red

<!-- end -->

...or maybe you're only interested in the first "field" in some list. You might
use `cut` here, asking it to treat spaces as delimiters between fields and
return only the first field for each line of its input:

<!-- exec -->

    $ cut -d' ' -f1 ./authors_*
    Eden
    Vanessa
    Miriam
    Gwendolyn
    Ursula
    Jo
    Pat
    John
    Vanessa
    James
    John

<!-- end -->

Suppose we're curious what the few most commonly occurring first names on our
author list are? Here's an approach, silly but effective, that combines a lot
of what we've discussed so far and looks like plenty of one-liners I wind up
writing in real life:

<!-- exec -->

    $ cut -d' ' -f1 ./authors_* | sort | uniq -ci | sort -n | tail -3
          1 Ursula
          2 John
          2 Vanessa

<!-- end -->

Let's walk through this one step by step:

First, we have `cut` extract the first field of each line in our author lists.

    cut -d' ' -f1 ./authors_*

Then we sort these results

    | sort

and pass them to `uniq`, asking it for a case-insensitive count of each
repeated line

    | uniq -ci

then sort again, numerically,

    | sort -n

and finally, we chop off everything but the last three lines:

    | tail -3

If you wanted to make sure to count an individual author's first name
only once, even if that author appears more than once in the files,
you could instead do:

<!-- exec -->

    $ sort -u ./authors_* | cut -d' ' -f1 | uniq -ci | sort -n | tail -3
          1 Ursula
          1 Vanessa
          2 John

<!-- end -->

tab separated values
--------------------

Notice above how we had to tell `cut` that "fields" in `authors_*` are
delimited by spaces? It turns out that if you don't use `-d`, `cut` defaults
to using tab characters for a delimiter.

Tab characters are sort of weird little animals. You can't usually _see_ them
directly --- they're like a space character that takes up more than one space
when displayed. By convention, one tab is usually rendered as 8 spaces, but
it's up to the software that's displaying the character what it wants to do.
(In fact, it's more complicated than that: Tabs are often rendered as marking
_tab stops_, which is a concept I remember from 7th grade typing classes, but
haven't actually thought about in my day-to-day life for nearly 20 years.)

Here's a version of our `all_authors` that's been rearranged so that the first
field is the author's last name, the second is their first name, the third is
their middle name or initial (if we know it) and the fourth is any suffix.
Fields are separated by a single tab character:

<!-- exec -->

    $ cat all_authors.tsv
    Robinson        Eden
    Waring  Gwendolyn       L.
    Tiptree James           Jr.
    Brunner John
    Tolkien John    Ronald Reuel
    Walton  Jo
    Toews   Miriam
    Cadigan Pat
    Le Guin Ursula  K.
    Veselka Vanessa

<!-- end -->

That looks kind of garbled, right? In order to make it a little more obvious
what's happening, let's use `cat -T`, which displays tab characters as `^I`:

<!-- exec -->

    $ cat -T all_authors.tsv
    Robinson^IEden
    Waring^IGwendolyn^IL.
    Tiptree^IJames^I^IJr.
    Brunner^IJohn
    Tolkien^IJohn^IRonald Reuel
    Walton^IJo
    Toews^IMiriam
    Cadigan^IPat
    Le Guin^IUrsula^IK.
    Veselka^IVanessa

<!-- end -->

It looks odd when displayed because some names are at or nearly at 8 characters long.
"Robinson", at 8 characters, overshoots the first tab stop, so "Eden" gets indented
further than other first names, and so on.

Fortunately, in order to make this more human-readable, we can pass it through
`expand`, which replaces each tab with enough spaces to reach the next tab stop
(every 8 columns by default; `-t14` below puts a stop every 14 columns):

<!-- exec -->

    $ expand -t14 all_authors.tsv
    Robinson      Eden
    Waring        Gwendolyn     L.
    Tiptree       James                       Jr.
    Brunner       John
    Tolkien       John          Ronald Reuel
    Walton        Jo
    Toews         Miriam
    Cadigan       Pat
    Le Guin       Ursula        K.
    Veselka       Vanessa

<!-- end -->

Now it's easy to sort by last name:

<!-- exec -->

    $ sort -k1 all_authors.tsv | expand -t14
    Brunner       John
    Cadigan       Pat
    Le Guin       Ursula        K.
    Robinson      Eden
    Tiptree       James                       Jr.
    Toews         Miriam
    Tolkien       John          Ronald Reuel
    Veselka       Vanessa
    Walton        Jo
    Waring        Gwendolyn     L.

<!-- end -->

Or just extract middle names and initials:

<!-- exec -->

    $ cut -f3 all_authors.tsv
    L.
    Ronald Reuel
    K.

<!-- end -->

It probably won't surprise you to learn that there's a corresponding `paste`
command, which takes two or more files and stitches them together with tab
characters. Let's extract a couple of things from our author list and put them
back together in a different order:

<!-- exec -->

    $ cut -f1 all_authors.tsv > lastnames

<!-- end -->

<!-- exec -->

    $ cut -f2 all_authors.tsv > firstnames

<!-- end -->

<!-- exec -->

    $ paste firstnames lastnames | sort -k2 | expand -t12
    John        Brunner
    Pat         Cadigan
    Ursula      Le Guin
    Eden        Robinson
    James       Tiptree
    Miriam      Toews
    John        Tolkien
    Vanessa     Veselka
    Jo          Walton
    Gwendolyn   Waring

<!-- end -->

As these examples show, TSV is something very like a primitive spreadsheet: A
way to represent information in columns and rows. In fact, it's a close cousin
of CSV, which is often used as a lowest-common-denominator format for
transferring spreadsheets, and which represents data something like this:

    last,first,middle,suffix
    Tolkien,John,Ronald Reuel,
    Tiptree,James,,Jr.

The advantage of tabs is that they're supported by a bunch of the standard
tools. A disadvantage is that they're kind of ugly and can be weird to deal
with, but they're useful anyway, and character-delimited rows are often a
good-enough way to hack your way through problems that call for basic
structure.
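
The same tools will work with other delimiters if you ask them to. As a rough sketch, `paste -d,` joins columns with a comma instead of a tab, and `cut -d,` splits on one. The two-line files and the name `names.csv` are made up for the illustration.

```shell
cd "$(mktemp -d)"
printf '%s\n' Tolkien Tiptree > lastnames
printf '%s\n' John James > firstnames

# -d, asks paste to join columns with a comma instead of a tab.
paste -d, lastnames firstnames > names.csv
cat names.csv

# cut can split on the same character to pull a column back out.
cut -d, -f2 names.csv
```

This only holds up while no field contains a comma of its own. A value like `Tiptree, Jr.` would be split in two, which is the kind of thing real CSV handles with quoting and these tools don't know about.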

finding text: grep
------------------

After all those contortions, what if you actually just want to see _which lists_
an individual author appears on?

<!-- exec -->

    $ grep 'Vanessa' ./authors_*
    ./authors_contemporary_fic:Vanessa Veselka
    ./authors_sff:Vanessa Veselka

<!-- end -->

`grep` takes a string to search for and, optionally, a list of files to search
in. If you don't specify files, it'll look through standard input instead:

<!-- exec -->

    $ cat ./authors_* | grep 'Vanessa'
    Vanessa Veselka
    Vanessa Veselka

<!-- end -->

Most of the time, piping the output of `cat` to `grep` is considered silly,
because `grep` knows how to find things in files on its own. Many thousands of
words have been written on this topic by leading lights of the nerd community.

You've probably noticed that this result doesn't contain filenames (and thus
isn't very useful to us). That's because all `grep` saw was the lines in the
files, not the names of the files themselves.
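
If the filenames are all you're after, `grep` has an option for exactly that: `-l` prints only the names of files containing at least one match. A small sketch, using shortened stand-ins for the author lists in a scratch directory:

```shell
cd "$(mktemp -d)"
printf '%s\n' 'Jo Walton' 'Vanessa Veselka' > authors_sff
printf '%s\n' 'Eden Robinson' 'Vanessa Veselka' > authors_contemporary_fic
printf '%s\n' 'Gwendolyn L. Waring' > authors_nat_hist

# -l lists each file with a match once, and prints no matching lines.
grep -l 'Vanessa' authors_*
```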

now you have n problems: regex and rabbit holes
-----------------------------------------------

To close out this introductory chapter, let's spend a little time on a topic
that will likely vex, confound, and (occasionally) delight you for as long as
you are acquainted with the command line.

When I was talking about `grep` a moment ago, I fudged the details more than a
little by saying that it expects a string to search for. What `grep`
_actually_ expects is a _pattern_. Moreover, it expects a specific kind of
pattern, what's known as a _regular expression_, a cumbersome phrase frequently
shortened to regex.

There's a lot of theory about what makes up a regular expression. Fortunately,
very little of it matters to the short version that will let you get useful
stuff done. The short version is that a regex is like using wildcards in the
shell to match groups of files, but for text in general and with more magic.

<!-- exec -->

    $ grep 'Jo.*' ./authors_*
    ./authors_sff:Jo Walton
    ./authors_sff:John Ronald Reuel Tolkien
    ./authors_sff:John Brunner

<!-- end -->

The pattern `Jo.*` says that we're looking for lines which contain a literal
`Jo`, followed by any quantity (including none) of any character. In a regex,
`.` means "any single character" and `*` means "any amount of the preceding
thing".

`.` and `*` are magical. In the particular dialect of regexen understood
by `grep`, other magical things include:

<table>
<tr><td><code>^</code> </td> <td>start of a line </td></tr>
<tr><td><code>$</code> </td> <td>end of a line </td></tr>
<tr><td><code>[abc]</code></td> <td>one of a, b, or c </td></tr>
<tr><td><code>[a-z]</code></td> <td>a character in the range a through z</td></tr>
<tr><td><code>[0-9]</code></td> <td>a character in the range 0 through 9</td></tr>
<tr><td><code>+</code> </td> <td>one or more of the preceding thing </td></tr>
<tr><td><code>?</code> </td> <td>0 or 1 of the preceding thing </td></tr>
<tr><td><code>*</code> </td> <td>any number of the preceding thing </td></tr>
<tr><td><code>(foo|bar)</code></td> <td>"foo" or "bar"</td></tr>
<tr><td><code>(foo)?</code></td> <td>optional "foo"</td></tr>
</table>

It's actually a little more complicated than that: By default, if you want to
use a lot of the magical characters, you have to prefix them with `\`. This is
both ugly and confusing, so unless you're writing a very simple pattern, it's
often easiest to call `grep -E`, for **E**xtended regular expressions, which
means that lots of characters will have special meanings.
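
To see the difference side by side, here is a sketch that writes one pattern both ways. It uses a repeat count, `{4}`, meaning "exactly four of the preceding thing": the basic syntax needs backslashes on the braces, while `-E` does not. The sample names are a made-up subset.

```shell
# A few sample lines, made up for this illustration.
names='Eden Robinson
Jo Walton
John Brunner
Gwendolyn L. Waring'

# Basic syntax: the braces of the repeat count need backslashes.
printf '%s\n' "$names" | grep '^[A-Za-z]\{4\} '

# Extended syntax (-E): the same pattern without the backslashes.
printf '%s\n' "$names" | grep -E '^[A-Za-z]{4} '
```

Both commands print the same two lines, the ones whose first word is exactly four letters long.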

Authors with 4-letter first names:

<!-- exec -->

    $ grep -iE '^[a-z]{4} ' ./authors_*
    ./authors_contemporary_fic:Eden Robinson
    ./authors_sff:John Ronald Reuel Tolkien
    ./authors_sff:John Brunner

<!-- end -->

A count of authors named John:

<!-- exec -->

    $ grep -c '^John ' ./all_authors
    2

<!-- end -->

Lines in this file matching the words "magic" or "magical":

    $ grep -iE 'magic(al)?' ./index.md
    Pipes are some of the most important magic in the shell. When the people who
    shell to match groups of files, but with more magic.
    `.` and `*` are magical. In the particular dialect of regexen understood
    by `grep`, other magical things include:
    use a lot of the magical characters, you have to prefix them with `\`. This is
    Lines in this file matching the words "magic" or "magical":
        $ grep -iE 'magic(al)?' ./index.md

Find some "-agic" words in a big list of words:

<!-- exec -->

    $ grep -iE '(m|tr|pel)agic' /usr/share/dict/words
    magic
    magic's
    magical
    magically
    magician
    magician's
    magicians
    pelagic
    tragic
    tragically
    tragicomedies
    tragicomedy
    tragicomedy's

<!-- end -->

`grep` isn't the only - or even the most important - tool that makes use of
regular expressions, but it's a good place to start because it's one of the
fundamental building blocks for so many other operations. Filtering lists of
things, matching patterns within collections, and writing concise descriptions
of how text should be transformed are at the heart of a practical approach to
Unix-like systems. Regexen turn out to be a seductively powerful way to do
these things - so much so that they've crept their way into text editors,
databases, and full-featured programming languages.

There's a dark side to all of this, for the truth about regular expressions is
that they are ugly, inconsistent, brittle, and _incredibly_ difficult to think
clearly about. They take years to master and reward the wielder with great
power, but they are also a trap: a temptation towards the path of cleverness
masquerading as wisdom.

-> ✑ <-

I'll be returning to this theme, but for the time being let's move on. Now
that we've established, however haphazardly, some of the basics, let's consider
their application to a real-world task.