There're a lot of ways to structure an introduction to the command line. I'm going to start with writing as a point of departure because, aside from web development, it's what I use a computer for most. I want to shine a light on the humane potential of ideas that are usually understood as nerd trivia. Computers have utterly transformed the practice of writing within the space of my lifetime, but it seems to me that writers as a class miss out on many of the software tools and patterns taken as a given in more "technical" fields.
Writing, particularly writing of any real scope or complexity, is very much a technical task. It makes demands, both physical and psychological, of its practitioners. As with woodworkers, graphic artists, and farmers, writers exhibit strong preferences in their tools, materials, and environment, and they do so because they're engaged in a physically and cognitively challenging task.
My thesis is that the modern Linux command line is a pretty good environment for working with English prose and prosody, and that maybe this will illuminate the ways it could be useful in your own work with a computer, whatever that work happens to be.
What software are we actually talking about when we say "the command line"?
For the purposes of this discussion, we're talking about an environment built on a very old paradigm called Unix.
The Unix-like environment we're going to use isn't very classical, really. It's an operating system kernel called Linux, combined with a bunch of things written by other people (people in the GNU and Debian projects, and many others). Purists will tell you that this isn't properly Unix at all. In strict historical terms they're right, or at least a certain kind of right, but for the purposes of my cultural agenda I'm going to ignore them right now.
The program you're typing commands into is called a shell. There are many different shells, but they pretty much all operate on the same idea: You navigate a filesystem and run programs by typing commands. Commands can be combined in various ways to make programs of their own, and in fact the way you use the computer is often just to write little programs that invoke other programs, turtles-all-the-way-down style.
The standard shell these days is something called Bash, so we'll use Bash. It's what you'll most often see in the wild. Like most shells, Bash is ugly and stupid in more ways than it is possible to easily summarize. It's also an incredibly powerful and expressive piece of software.
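If you're not sure what shell you're actually running, you can usually just ask. A quick check, with hypothetical output (your paths and version will differ):

$ echo "$SHELL"
/bin/bash
$ echo "$BASH_VERSION"
5.1.16(1)-release

`$SHELL` records your login shell, and `$BASH_VERSION` is set by Bash itself, so if the second command prints anything at all, you're in Bash.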
Have you ever played a text-based adventure game or MUD, of the kind that describes a setting and takes commands for movement and so on? Readers of a certain age and temperament might recognize the opening of Crowther & Woods' Adventure, the great-granddaddy of text adventure games:
YOU ARE STANDING AT THE END OF A ROAD BEFORE A SMALL BRICK BUILDING.
AROUND YOU IS A FOREST. A SMALL STREAM FLOWS OUT OF THE BUILDING AND
DOWN A GULLY.
> GO EAST
YOU ARE INSIDE A BUILDING, A WELL HOUSE FOR A LARGE SPRING.
THERE ARE SOME KEYS ON THE GROUND HERE.
THERE IS A SHINY BRASS LAMP NEARBY.
THERE IS FOOD HERE.
THERE IS A BOTTLE OF WATER HERE.
You can think of the shell as a kind of environment you inhabit, in much the way your character inhabits an adventure game. The difference is that instead of navigating around virtual rooms and hallways with commands like `LOOK` and `EAST`, you navigate between directories by typing commands like `ls` and `cd notes`:
$ ls
code Downloads notes p1k3 photos scraps userland-book
$ cd notes
$ ls
notes.txt sparkfun TODO.txt
`ls` lists files. Some files are directories, which means they can contain other files, and you can step inside of them by typing `cd` (for change directory).
In the Macintosh and Windows world, directories have been called "folders" for a long time now. This isn't the worst metaphor for what's going on, and it's so pervasive by now that it's not worth fighting about. It's also not exactly a great metaphor, since computer filesystems aren't built very much like the filing cabinets of yore. A directory acts a lot like a container of some sort, but it's an infinitely expandable one which may contain nested sub-spaces much larger than itself. Directories are frequently like the TARDIS: Bigger on the inside.
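To make the navigation concrete, here's a tiny session that builds a nested directory and steps into it. The names are invented for illustration; `pwd` prints the path of wherever you're currently standing:

$ mkdir -p writing/novel/chapters
$ cd writing/novel/chapters
$ pwd
/home/brennen/writing/novel/chapters

`mkdir -p` creates any missing directories along the path, and `cd ..` would step you back up a level.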
When you're in the shell, you have many tools at your disposal - programs that can be used on many different files, or chained together with other programs. They tend to have weird, cryptic names, but a lot of them do very simple things. Tasks that might be a menu item in a big program like Word, like counting the number of words in a document or finding a particular phrase, are often programs unto themselves. We'll start with something even more basic than that.
Suppose you have some files, and you're curious what's in them. For example, suppose you've got a list of authors you're planning to reference, and you just want to check its contents real quick-like. This is where our friend `cat` comes in:
$ cat authors_sff
Ursula K. Le Guin
Jo Walton
Pat Cadigan
John Ronald Reuel Tolkien
Vanessa Veselka
James Tiptree, Jr.
John Brunner
"Why," you might be asking, "is the command to dump out the contents of a file
to a screen called cat
? What do felines have to do with anything?"
It turns out that cat
is actually short for "catenate", which is a long
word basically meaning "stick things together". In programming, we usually
refer to sticking two bits of text together as "string concatenation", probably
because programmers like to feel like they're being very precise about very
simple actions.
Suppose you wanted to see the contents of a set of author lists:
$ cat authors_sff authors_contemporary_fic authors_nat_hist
Ursula K. Le Guin
Jo Walton
Pat Cadigan
John Ronald Reuel Tolkien
Vanessa Veselka
James Tiptree, Jr.
John Brunner
Eden Robinson
Vanessa Veselka
Miriam Toews
Gwendolyn L. Waring
We're working with three filenames: `authors_sff`, `authors_contemporary_fic`, and `authors_nat_hist`. That's an awful lot of typing every time we want to do something to all three files. Fortunately, our shell offers a shorthand for "all the files that start with `authors_`":
$ cat authors_*
Eden Robinson
Vanessa Veselka
Miriam Toews
Gwendolyn L. Waring
Ursula K. Le Guin
Jo Walton
Pat Cadigan
John Ronald Reuel Tolkien
Vanessa Veselka
James Tiptree, Jr.
John Brunner
In Bash-land, `*` basically means "anything", and is known in the vernacular, somewhat poetically, as a "wildcard". You should always be careful with wildcards, especially if you're doing anything destructive. They can and will surprise the unwary. Still, once you're used to the idea, they will save you a lot of RSI.
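One way to stay out of trouble: since `echo` (a command we'll meet again shortly) just prints whatever you give it, you can use it to preview what a wildcard will expand to before pointing anything destructive at it:

$ echo authors_*
authors_contemporary_fic authors_nat_hist authors_sff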
There's a problem here. Our author list is out of order, and thus confusing to reference. Fortunately, since one of the most basic things you can do to a list is to sort it, someone else has already solved this problem for us. Here's a command that will give us some organization:
$ sort authors_*
Eden Robinson
Gwendolyn L. Waring
James Tiptree, Jr.
John Brunner
John Ronald Reuel Tolkien
Jo Walton
Miriam Toews
Pat Cadigan
Ursula K. Le Guin
Vanessa Veselka
Vanessa Veselka
Does it bother you that they aren't sorted by last name? Me too. As a partial solution, we can ask `sort` to use the second "field" in each line as its sort key (by default, sort treats whitespace as a division between fields):
$ sort -k2 authors_*
John Brunner
Pat Cadigan
Ursula K. Le Guin
Gwendolyn L. Waring
Eden Robinson
John Ronald Reuel Tolkien
James Tiptree, Jr.
Miriam Toews
Vanessa Veselka
Vanessa Veselka
Jo Walton
That's closer, right? It sorted on "Cadigan" and "Veselka" instead of "Pat" and "Vanessa". (Of course, it's still far from perfect, because the second field in each line isn't necessarily the person's last name.)
Above, when we wanted to ask `sort` to behave differently, we gave it what is known as an option. Most programs with command-line interfaces will allow their behavior to be changed by adding various options. Options usually (but not always!) look like `-o` or `--option`.
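Many options come in both a short and a long spelling, and the two behave identically. For instance, `sort -r` and `sort --reverse` both flip the sort order; given how we've already seen our lists sort, that looks like:

$ sort -r authors_sff
Vanessa Veselka
Ursula K. Le Guin
Pat Cadigan
Jo Walton
John Ronald Reuel Tolkien
John Brunner
James Tiptree, Jr.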
As another example: suppose we want to see just the unique lines, irrespective of case, in a file called `colors`. Here are its contents:
$ cat colors
RED
blue
red
BLUE
Green
green
GREEN
We could write this:
$ sort -uf colors
blue
Green
RED
Here `-u` stands for unique and `-f` stands for fold case, which means to treat upper- and lower-case letters as the same for comparison purposes. You'll often see a group of short options following the `-` like this.
Did you notice how Vanessa Veselka shows up twice in our list of authors? That's useful if we want to remember that she's in more than one category, but it's redundant if we're just worried about membership in the overall set of authors. We can make sure our list doesn't contain repeating lines by using `sort`, just like with that list of colors:
$ sort -u -k2 authors_*
John Brunner
Pat Cadigan
Ursula K. Le Guin
Gwendolyn L. Waring
Eden Robinson
John Ronald Reuel Tolkien
James Tiptree, Jr.
Miriam Toews
Vanessa Veselka
Jo Walton
But there's another approach to this --- `sort` is good at only displaying a line once, but suppose we wanted to see a count of how many different lists an author shows up on? `sort` doesn't do that, but a command called `uniq` does, if you give it the option `-c` for count.

`uniq` moves through the lines in its input, and if it sees a line more than once in sequence, it will only print that line once. If you have a bunch of files and you just want to see the unique lines across all of those files, you probably need to run them through `sort` first. How do you do that?
$ sort authors_* | uniq -c
1 Eden Robinson
1 Gwendolyn L. Waring
1 James Tiptree, Jr.
1 John Brunner
1 John Ronald Reuel Tolkien
1 Jo Walton
1 Miriam Toews
1 Pat Cadigan
1 Ursula K. Le Guin
2 Vanessa Veselka
The `|` is called a "pipe". In the command above, it tells your shell that instead of printing the output of `sort authors_*` right to your terminal, it should send it to `uniq -c`.
Pipes are some of the most important magic in the shell. When the people who built Unix in the first place give interviews about the stuff they remember from the early days, a lot of them reminisce about the invention of pipes and all of the new stuff it immediately made possible.
Pipes help you control a thing called "standard IO". In the world of the command line, programs take input and produce output. A pipe is a way to hook the output from one program to the input of another.
Unlike a lot of the weirdly named things you'll encounter in software, the metaphor here is obvious and makes pretty good sense. It even kind of looks like a physical pipe.
What if, instead of sending the output of one program to the input of another, you'd like to store it in a file for later use?
Check it out:
$ sort authors_* | uniq > ./all_authors
$ cat all_authors
Eden Robinson
Gwendolyn L. Waring
James Tiptree, Jr.
John Brunner
John Ronald Reuel Tolkien
Jo Walton
Miriam Toews
Pat Cadigan
Ursula K. Le Guin
Vanessa Veselka
I like to think of the `>` as looking like a little funnel. It can be dangerous --- you should always make sure that you're not going to clobber an existing file you actually want to keep.
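If the possibility of clobbering something makes you nervous, Bash has a safety catch called `noclobber`. With it switched on, `>` refuses to overwrite files that already exist (the exact wording of the error may vary between versions):

$ set -o noclobber
$ sort authors_* | uniq > ./all_authors
bash: ./all_authors: cannot overwrite existing file
$ set +o noclobber

You can still force a single overwrite while `noclobber` is on by writing `>|` instead of `>`.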
If you want to tack more stuff on to the end of an existing file, you can use `>>` instead. To test that, let's use `echo`, which prints out whatever string you give it on a line by itself:
$ echo 'hello' > hello_world
$ echo 'world' >> hello_world
$ cat hello_world
hello
world
You can also take a file and pull it directly back into the input of a given program, which is a bit like a funnel going the other direction:
$ nl < all_authors
1 Eden Robinson
2 Gwendolyn L. Waring
3 James Tiptree, Jr.
4 John Brunner
5 John Ronald Reuel Tolkien
6 Jo Walton
7 Miriam Toews
8 Pat Cadigan
9 Ursula K. Le Guin
10 Vanessa Veselka
`nl` is just a way to number lines. This command accomplishes pretty much the same thing as `cat all_authors | nl`, or `nl all_authors`. You won't see it used as often as `|` and `>`, since most utilities can read files on their own, but it can save you from typing `cat` quite so often.
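To convince yourself that `<` really is interchangeable with naming a file directly, here's the colors example from earlier rewritten with an input redirect. The output is identical:

$ sort -uf < colors
blue
Green
RED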
We'll use these features liberally from here on out.
## `--help` and man pages

You can change the behavior of most tools by giving them different options. This is all well and good if you already know what options are available, but what if you don't?
Often, you can ask the tool itself:
$ sort --help
Usage: sort [OPTION]... [FILE]...
or: sort [OPTION]... --files0-from=F
Write sorted concatenation of all FILE(s) to standard output.
Mandatory arguments to long options are mandatory for short options too.
Ordering options:
-b, --ignore-leading-blanks ignore leading blanks
-d, --dictionary-order consider only blanks and alphanumeric characters
-f, --ignore-case fold lower case to upper case characters
-g, --general-numeric-sort compare according to general numerical value
-i, --ignore-nonprinting consider only printable characters
-M, --month-sort compare (unknown) < 'JAN' < ... < 'DEC'
-h, --human-numeric-sort compare human readable numbers (e.g., 2K 1G)
-n, --numeric-sort compare according to string numerical value
-R, --random-sort sort by random hash of keys
--random-source=FILE get random bytes from FILE
-r, --reverse reverse the result of comparisons
...and so on. (It goes on for a while in this vein.)
If that doesn't work, or doesn't provide enough info, the next thing to try is called a man page. ("man" is short for "manual". It's sort of an unfortunate abbreviation.)
$ man sort
SORT(1) User Commands SORT(1)
NAME
sort - sort lines of text files
SYNOPSIS
sort [OPTION]... [FILE]...
sort [OPTION]... --files0-from=F
DESCRIPTION
Write sorted concatenation of all FILE(s) to standard output.
...and so on. Manual pages vary in quality, and it can take a while to get used to reading them, but they're very often the best place to look for help.
If you're not sure what program you want to use to solve a given problem, you might try searching all the man pages on the system for a keyword. `man` itself has an option to let you do this - `man -k keyword` - but most systems also have a shortcut called `apropos`, which I like to use because it's easy to remember if you imagine yourself saying "apropos of [some problem I have]..."
$ apropos -s1 sort
apt-sortpkgs (1) - Utility to sort package index files
bunzip2 (1) - a block-sorting file compressor, v1.0.6
bzip2 (1) - a block-sorting file compressor, v1.0.6
comm (1) - compare two sorted files line by line
sort (1) - sort lines of text files
tsort (1) - perform topological sort
It's useful to know that the manual represented by `man` has numbered sections for different kinds of manual pages. Most of what the average user needs to know about lives in section 1, "User Commands", so you'll often see the names of different tools written like `sort(1)` or `cat(1)`. This can be a good way to make it clear in writing that you're talking about a specific piece of software rather than a verb or a small carnivorous mammal. (I specified `-s1` for section 1 above just to cut down on clutter, though in practice I usually don't bother.)
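Section numbers also matter when one name lives in more than one section. `printf`, for example, is both a shell utility (section 1) and a C library function (section 3), so you can specify which page you mean. (Section 3 pages may live in an extra package, like Debian's manpages-dev, so the second command might not work everywhere.)

$ man 1 printf
$ man 3 printf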
Like other literary traditions, Unix is littered with this sort of convention. This one just happens to date from a time when the manual was still a physical book.
`wc` stands for word count. It does about what you'd expect - it counts the number of words in its input.
$ wc index.md
736 4117 24944 index.md
736 is the number of lines, 4117 the number of words, and 24944 the number of characters in the file I'm writing right now. I use this constantly. Most obviously, it's a good way to get an idea of how much you've written. `wc` is the tool I used to track my progress the last time I tried National Novel Writing Month:
$ find ~/p1k3/archives/2010/11 -regextype egrep -regex '.*([0-9]+|index)' -type f | xargs wc -w | tail -1
6585 total
$ cowsay 'embarrassing.'
_______________
< embarrassing. >
---------------
\ ^__^
\ (oo)\_______
(__)\ )\/\
||----w |
|| ||
Anyway. The less obvious thing about `wc` is that you can use it to count the output of other commands. Want to know how many unique authors we have?
$ sort authors_* | uniq | wc -l
10
This kind of thing is trivial, but it comes in handy more often than you might think.
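For instance, here's a rough count of the files in the current directory. (Rough because a filename containing a newline would be counted more than once, though in practice that almost never comes up. The number here is hypothetical; yours will depend on what your directory holds.)

$ ls | wc -l
11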
Remember our old pal `cat`, which just splats everything it's given back to standard output?
Sometimes you've got a piece of output that's more than you actually want to deal with at once. Maybe you just want to glance at the first few lines in a file:
$ head -3 colors
RED
blue
red
...or maybe you want to see the last thing in a list:
$ sort colors | uniq -i | tail -1
red
...or maybe you're only interested in the first "field" in some list. You might use `cut` here, asking it to treat spaces as delimiters between fields and return only the first field for each line of its input:
$ cut -d' ' -f1 ./authors_*
Eden
Vanessa
Miriam
Gwendolyn
Ursula
Jo
Pat
John
Vanessa
James
John
Suppose we're curious what the few most commonly occurring first names on our author list are? Here's an approach, silly but effective, that combines a lot of what we've discussed so far and looks like plenty of one-liners I wind up writing in real life:
$ cut -d' ' -f1 ./authors_* | sort | uniq -ci | sort -n | tail -3
1 Ursula
2 John
2 Vanessa
Let's walk through this one step by step:
First, we have `cut` extract the first field of each line in our author lists:

cut -d' ' -f1 ./authors_*

Then we sort these results:

| sort

...and pass them to `uniq`, asking it for a case-insensitive count of each repeated line:

| uniq -ci

...then sort again, numerically:

| sort -n

...and finally, we chop off everything but the last three lines:

| tail -3
If you wanted to make sure to count an individual author's first name only once, even if that author appears more than once in the files, you could instead do:
$ sort -u ./authors_* | cut -d' ' -f1 | uniq -ci | sort -n | tail -3
1 Ursula
1 Vanessa
2 John
Notice above how we had to tell `cut` that "fields" in `authors_*` are delimited by spaces? It turns out that if you don't use `-d`, `cut` defaults to using tab characters for a delimiter.
Tab characters are sort of weird little animals. You can't usually see them directly --- they're like a space character that takes up more than one space when displayed. By convention, one tab is usually rendered as 8 spaces, but it's up to the software that's displaying the character what it wants to do.
(In fact, it's more complicated than that: Tabs are often rendered as marking tab stops, which is a concept I remember from 7th grade typing classes, but haven't actually thought about in my day-to-day life for nearly 20 years.)
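You can watch tab stops at work with `printf`, which understands `\t` as a tab. The exact alignment depends on your terminal, but with conventional 8-column stops it looks like this:

$ printf 'Jo\tWalton\nGwendolyn\tWaring\n'
Jo      Walton
Gwendolyn       Waring

Each line contains exactly one tab, but "Jo" falls short of the first tab stop while "Gwendolyn" overshoots it, so the last names land at different columns.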
Here's a version of our `all_authors` that's been rearranged so that the first field is the author's last name, the second is their first name, the third is their middle name or initial (if we know it), and the fourth is any suffix. Fields are separated by a single tab character:
$ cat all_authors.tsv
Robinson        Eden
Waring  Gwendolyn       L.
Tiptree James           Jr.
Brunner John
Tolkien John    Ronald Reuel
Walton  Jo
Toews   Miriam
Cadigan Pat
Le Guin Ursula  K.
Veselka Vanessa
That looks kind of garbled, right? In order to make it a little more obvious what's happening, let's use `cat -T`, which displays tab characters as `^I`:
$ cat -T all_authors.tsv
Robinson^IEden
Waring^IGwendolyn^IL.
Tiptree^IJames^I^IJr.
Brunner^IJohn
Tolkien^IJohn^IRonald Reuel
Walton^IJo
Toews^IMiriam
Cadigan^IPat
Le Guin^IUrsula^IK.
Veselka^IVanessa
It looks odd when displayed because some names are at or nearly at 8 characters long. "Robinson", at 8 characters, overshoots the first tab stop, so "Eden" gets indented further than other first names, and so on.
Fortunately, in order to make this more human-readable, we can pass it through `expand`, which turns tabs into a given number of spaces (8 by default):
$ expand -t14 all_authors.tsv
Robinson      Eden
Waring        Gwendolyn     L.
Tiptree       James                       Jr.
Brunner       John
Tolkien       John          Ronald Reuel
Walton        Jo
Toews         Miriam
Cadigan       Pat
Le Guin       Ursula        K.
Veselka       Vanessa
Now it's easy to sort by last name:
$ sort -k1 all_authors.tsv | expand -t14
Brunner       John
Cadigan       Pat
Le Guin       Ursula        K.
Robinson      Eden
Tiptree       James                       Jr.
Toews         Miriam
Tolkien       John          Ronald Reuel
Veselka       Vanessa
Walton        Jo
Waring        Gwendolyn     L.
Or just extract middle names and initials:
$ cut -f3 all_authors.tsv
L.
Ronald Reuel
K.
It probably won't surprise you to learn that there's a corresponding `paste` command, which takes two or more files and stitches them together with tab characters. Let's extract a couple of things from our author list and put them back together in a different order:
$ cut -f1 all_authors.tsv > lastnames
$ cut -f2 all_authors.tsv > firstnames
$ paste firstnames lastnames | sort -k2 | expand -t12
John        Brunner
Pat         Cadigan
Ursula      Le Guin
Eden        Robinson
James       Tiptree
Miriam      Toews
John        Tolkien
Vanessa     Veselka
Jo          Walton
Gwendolyn   Waring
As these examples show, TSV is something very like a primitive spreadsheet: A way to represent information in columns and rows. In fact, it's a close cousin of CSV, which is often used as a lowest-common-denominator format for transferring spreadsheets, and which represents data something like this:
last,first,middle,suffix
Tolkien,John,Ronald Reuel,
Tiptree,James,,Jr.
The advantage of tabs is that they're supported by a bunch of the standard tools. A disadvantage is that they're kind of ugly and can be weird to deal with, but they're useful anyway, and character-delimited rows are often a good-enough way to hack your way through problems that call for basic structure.
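For simple records like ours, you can hop from tabs to commas with `tr`, which swaps one character for another. This is a naive sketch: real CSV has quoting rules for fields that contain commas, which `tr` knows nothing about.

$ tr '\t' ',' < all_authors.tsv
Robinson,Eden
Waring,Gwendolyn,L.
Tiptree,James,,Jr.
Brunner,John
Tolkien,John,Ronald Reuel
Walton,Jo
Toews,Miriam
Cadigan,Pat
Le Guin,Ursula,K.
Veselka,Vanessa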
After all those contortions, what if you actually just want to see which lists an individual author appears on?
$ grep 'Vanessa' ./authors_*
./authors_contemporary_fic:Vanessa Veselka
./authors_sff:Vanessa Veselka
`grep` takes a string to search for and, optionally, a list of files to search in. If you don't specify files, it'll look through standard input instead:
$ cat ./authors_* | grep 'Vanessa'
Vanessa Veselka
Vanessa Veselka
Most of the time, piping the output of `cat` to `grep` is considered silly, because `grep` knows how to find things in files on its own. Many thousands of words have been written on this topic by leading lights of the nerd community.

You've probably noticed that this result doesn't contain filenames (and thus isn't very useful to us). That's because all `grep` saw was the lines in the files, not the names of the files themselves.
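If the filenames are all you actually care about, `grep -l` skips printing the matching lines and just names the files that contain a match:

$ grep -l 'Vanessa' ./authors_*
./authors_contemporary_fic
./authors_sff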
To close out this introductory chapter, let's spend a little time on a topic that will likely vex, confound, and (occasionally) delight you for as long as you are acquainted with the command line.
When I was talking about `grep` a moment ago, I fudged the details more than a little by saying that it expects a string to search for. What `grep` actually expects is a pattern. Moreover, it expects a specific kind of pattern, what's known as a regular expression, a cumbersome phrase frequently shortened to regex.
There's a lot of theory about what makes up a regular expression. Fortunately, very little of it matters to the short version that will let you get useful stuff done. The short version is that a regex is like using wildcards in the shell to match groups of files, but for text in general and with more magic.
$ grep 'Jo.*' ./authors_*
./authors_sff:Jo Walton
./authors_sff:John Ronald Reuel Tolkien
./authors_sff:John Brunner
The pattern `Jo.*` says that we're looking for lines which contain a literal `Jo`, followed by any quantity (including none) of any character. In a regex, `.` means "anything" and `*` means "any amount of the preceding thing".

`.` and `*` are magical. In the particular dialect of regexen understood by `grep`, other magical things include:
`+` | one or more of the preceding thing |
`?` | 0 or 1 of the preceding thing |
`*` | any number of the preceding thing |
`(foo\|bar)` | "foo" or "bar" |
`(foo)?` | optional "foo" |
`^` | start of a line |
`$` | end of a line |
`[abc]` | one of a, b, or c |
`[a-z]` | a character in the range a through z |
`[0-9]` | a character in the range 0 through 9 |
It's actually a little more complicated than that: By default, if you want to use a lot of the magical characters, you have to prefix them with `\`. This is both ugly and confusing, so unless you're writing a very simple pattern, it's often easiest to call `grep -E`, for Extended regular expressions, which means that lots of characters will have special meanings.
Authors with 4-letter first names:
$ grep -iE '^[a-z]{4} ' ./authors_*
./authors_contemporary_fic:Eden Robinson
./authors_sff:John Ronald Reuel Tolkien
./authors_sff:John Brunner
A count of authors named John:
$ grep -c '^John ' ./all_authors
2
Lines in this file matching the words "magic" or "magical":
$ grep -iE 'magic(al)?' ./index.md
Pipes are some of the most important magic in the shell. When the people who
shell to match groups of files, but with more magic.
`.` and `*` are magical. In the particular dialect of regexen understood
by `grep`, other magical things include:
use a lot of the magical characters, you have to prefix them with `\`. This is
Lines in this file matching the words "magic" or "magical":
$ grep -iE 'magic(al)?' ./index.md
Find some "-agic" words in a big list of words:
$ grep -iE '(m|tr|pel)agic' /usr/share/dict/words
magic
magic's
magical
magically
magician
magician's
magicians
pelagic
tragic
tragically
tragicomedies
tragicomedy
tragicomedy's
`grep` isn't the only - or even the most important - tool that makes use of regular expressions, but it's a good place to start because it's one of the fundamental building blocks for so many other operations. Filtering lists of things, matching patterns within collections, and writing concise descriptions of how text should be transformed are at the heart of a practical approach to Unix-like systems. Regexen turn out to be a seductively powerful way to do these things - so much so that they've crept their way into text editors, databases, and full-featured programming languages.
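For a taste of that, here's `sed`, a stream editor built almost entirely around regexen, using a substitution to flip a name around. (A sketch: `s/pattern/replacement/` substitutes, and `\1` and `\2` in the replacement refer back to the parenthesized groups in the pattern.)

$ echo 'Le Guin, Ursula K.' | sed -E 's/(.+), (.+)/\2 \1/'
Ursula K. Le Guin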
There's a dark side to all of this, for the truth about regular expressions is that they are ugly, inconsistent, brittle, and incredibly difficult to think clearly about. They take years to master and reward the wielder with great power, but they are also a trap: a temptation towards the path of cleverness masquerading as wisdom.
I'll be returning to this theme, but for the time being let's move on. Now that we've established, however haphazardly, some of the basics, let's consider their application to a real-world task.