|
@@ -76,8 +76,9 @@ mirror</a>, and welcome feedback there.</p>
|
76
|
76
|
<li><a href="#code-help-code-and-man-pages"><code>--help</code> and man pages</a></li>
|
77
|
77
|
<li><a href="#wc">wc</a></li>
|
78
|
78
|
<li><a href="#head-tail-and-cut">head, tail, and cut</a></li>
|
|
79
|
+<li><a href="#tab-separated-values">tab separated values</a></li>
|
79
|
80
|
<li><a href="#finding-text-grep">finding text: grep</a></li>
|
80
|
|
-<li><a href="#now-you-have-n-problems-regex-rabbit-holes">now you have n problems: regex + rabbit holes</a></li>
|
|
81
|
+<li><a href="#now-you-have-n-problems-regex-and-rabbit-holes">now you have n problems: regex and rabbit holes</a></li>
|
81
|
82
|
</ul>
|
82
|
83
|
</li>
|
83
|
84
|
<li><a href="#a-literary-problem">2. a literary problem</a></li>
|
|
@@ -861,6 +862,190 @@ you could instead do:</p>
|
861
|
862
|
<!-- end -->
|
862
|
863
|
|
863
|
864
|
|
|
865
|
+<h2><a name=tab-separated-values href=#tab-separated-values>#</a> tab separated values</h2>
|
|
866
|
+
|
|
867
|
+<p>Notice above how we had to tell <code>cut</code> that “fields” in <code>authors_*</code> are
|
|
868
|
+delimited by spaces? It turns out that if you don’t use <code>-d</code>, <code>cut</code> defaults
|
|
869
|
+to using tab characters for a delimiter.</p>
|
|
870
|
+
|
|
871
|
+<p>Tab characters are sort of weird little animals. You can’t usually <em>see</em> them
|
|
872
|
+directly – they’re like a space character that takes up more than one space
|
|
873
|
+when displayed. By convention, one tab is usually rendered as 8 spaces, but
|
|
874
|
+it’s up to the software that’s displaying the character what it wants to do.</p>
|
|
875
|
+
|
|
876
|
+<p>(In fact, it’s more complicated than that: a tab usually moves the cursor to the next in a series of
|
|
877
|
+<em>tab stops</em>, which is a concept I remember from 7th grade typing classes, but
|
|
878
|
+haven’t actually thought about in my day-to-day life for nearly 20 years.)</p>
|
|
879
|
+
|
|
880
|
+<p>Here’s a version of our <code>all_authors</code> that’s been rearranged so that the first
|
|
881
|
+field is the author’s last name, the second is their first name, the third is
|
|
882
|
+their middle name or initial (if we know it) and the fourth is any suffix.
|
|
883
|
+Fields are separated by a single tab character:</p>
|
|
884
|
+
|
|
885
|
+<!-- exec -->
|
|
886
|
+
|
|
887
|
+
|
|
888
|
+<pre><code>$ cat all_authors.tsv
|
|
889
|
+Robinson Eden
|
|
890
|
+Waring Gwendolyn L.
|
|
891
|
+Tiptree James Jr.
|
|
892
|
+Brunner John
|
|
893
|
+Tolkien John Ronald Reuel
|
|
894
|
+Walton Jo
|
|
895
|
+Toews Miriam
|
|
896
|
+Cadigan Pat
|
|
897
|
+Le Guin Ursula K.
|
|
898
|
+Veselka Vanessa
|
|
899
|
+</code></pre>
|
|
900
|
+
|
|
901
|
+<!-- end -->
|
|
902
|
+
|
|
903
|
+
|
|
904
|
+<p>That looks kind of garbled, right? In order to make it a little more obvious
|
|
905
|
+what’s happening, let’s use <code>cat -T</code>, which displays tab characters as <code>^I</code>:</p>
|
|
906
|
+
|
|
907
|
+<!-- exec -->
|
|
908
|
+
|
|
909
|
+
|
|
910
|
+<pre><code>$ cat -T all_authors.tsv
|
|
911
|
+Robinson^IEden
|
|
912
|
+Waring^IGwendolyn^IL.
|
|
913
|
+Tiptree^IJames^I^IJr.
|
|
914
|
+Brunner^IJohn
|
|
915
|
+Tolkien^IJohn^IRonald Reuel
|
|
916
|
+Walton^IJo
|
|
917
|
+Toews^IMiriam
|
|
918
|
+Cadigan^IPat
|
|
919
|
+Le Guin^IUrsula^IK.
|
|
920
|
+Veselka^IVanessa
|
|
921
|
+</code></pre>
|
|
922
|
+
|
|
923
|
+<!-- end -->
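If you want to poke at invisible tabs yourself, here is a minimal sketch that builds a two-line sample with `printf` and counts the tabs on each line. The scratch file and the variable names are made up for this illustration; they are not files from this chapter.

```shell
# Hypothetical scratch file; not one of the files used in this chapter.
sample=$(mktemp)

# printf turns \t into a real tab. Two tabs in a row mean an empty field.
printf 'Waring\tGwendolyn\tL.\n' > "$sample"
printf 'Tiptree\tJames\t\tJr.\n' >> "$sample"

# Delete everything except tabs, then count what is left.
tabs_line1=$(sed -n '1p' "$sample" | tr -cd '\t' | wc -c | tr -d ' ')
tabs_line2=$(sed -n '2p' "$sample" | tr -cd '\t' | wc -c | tr -d ' ')

echo "line 1 has $tabs_line1 tabs, line 2 has $tabs_line2 tabs"
rm "$sample"
```

A line always has one more field than it has tabs, so the Tiptree line has four fields, and the third one (the middle name) is empty.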
|
|
924
|
+
|
|
925
|
+
|
|
926
|
+<p>It looks odd when displayed because some names are at or nearly at 8 characters long.
|
|
927
|
+“Robinson”, at exactly 8 characters, runs right up to the first tab stop, so its tab jumps to the next one and “Eden” gets indented
|
|
928
|
+further than other first names, and so on.</p>
|
|
929
|
+
|
|
930
|
+<p>Fortunately, in order to make this more human-readable, we can pass it through
|
|
931
|
+<code>expand</code>, which turns each tab into enough spaces to reach the next tab stop (every 8 columns by default; <code>-t</code> sets a different interval):</p>
|
|
932
|
+
|
|
933
|
+<!-- exec -->
|
|
934
|
+
|
|
935
|
+
|
|
936
|
+<pre><code>$ expand -t14 all_authors.tsv
|
|
937
|
+Robinson Eden
|
|
938
|
+Waring Gwendolyn L.
|
|
939
|
+Tiptree James Jr.
|
|
940
|
+Brunner John
|
|
941
|
+Tolkien John Ronald Reuel
|
|
942
|
+Walton Jo
|
|
943
|
+Toews Miriam
|
|
944
|
+Cadigan Pat
|
|
945
|
+Le Guin Ursula K.
|
|
946
|
+Veselka Vanessa
|
|
947
|
+</code></pre>
|
|
948
|
+
|
|
949
|
+<!-- end -->
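To see the tab stop arithmetic on a small scale, here is a rough sketch that runs two of the names through `expand`, once with the default stops and once with `-t14`. The variable names are invented for this example, and the brackets are only there to make the padding visible.

```shell
# With default stops every 8 columns, the 8-letter "Robinson" ends exactly on a
# stop, so the tab has to travel all the way to column 16.
default_out=$(printf 'Robinson\tEden\n' | expand)

# "Waring" is 6 letters, so its tab only needs 2 spaces to reach column 8.
short_out=$(printf 'Waring\tGwendolyn\n' | expand)

# With stops every 14 columns, both first names start at column 14.
wide_robinson=$(printf 'Robinson\tEden\n' | expand -t14)
wide_waring=$(printf 'Waring\tGwendolyn\n' | expand -t14)

# Print each result between brackets so the padding is visible.
printf '[%s]\n' "$default_out" "$short_out" "$wide_robinson" "$wide_waring"
```

The point of picking a wider stop is that every last name in the list fits inside it, so all of the second columns line up.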
|
|
950
|
+
|
|
951
|
+
|
|
952
|
+<p>Now it’s easy to sort by last name:</p>
|
|
953
|
+
|
|
954
|
+<!-- exec -->
|
|
955
|
+
|
|
956
|
+
|
|
957
|
+<pre><code>$ sort -k1 all_authors.tsv | expand -t14
|
|
958
|
+Brunner John
|
|
959
|
+Cadigan Pat
|
|
960
|
+Le Guin Ursula K.
|
|
961
|
+Robinson Eden
|
|
962
|
+Tiptree James Jr.
|
|
963
|
+Toews Miriam
|
|
964
|
+Tolkien John Ronald Reuel
|
|
965
|
+Veselka Vanessa
|
|
966
|
+Walton Jo
|
|
967
|
+Waring Gwendolyn L.
|
|
968
|
+</code></pre>
|
|
969
|
+
|
|
970
|
+<!-- end -->
|
|
971
|
+
|
|
972
|
+
|
|
973
|
+<p>Or just extract middle names and initials:</p>
|
|
974
|
+
|
|
975
|
+<!-- exec -->
|
|
976
|
+
|
|
977
|
+
|
|
978
|
+<pre><code>$ cut -f3 all_authors.tsv | grep .
|
|
979
|
+L.
|
|
980
|
+Ronald Reuel
|
|
981
|
+K.
|
|
982
|
+</code></pre>
|
|
983
|
+
|
|
984
|
+<!-- end -->
|
|
985
|
+
|
|
986
|
+
|
|
987
|
+<p>It probably won’t surprise you to learn that there’s a corresponding <code>paste</code>
|
|
988
|
+command, which takes two or more files and stitches them together with tab
|
|
989
|
+characters. Let’s extract a couple of things from our author list and put them
|
|
990
|
+back together in a different order:</p>
|
|
991
|
+
|
|
992
|
+<!-- exec -->
|
|
993
|
+
|
|
994
|
+
|
|
995
|
+<pre><code>$ cut -f1 all_authors.tsv > lastnames
|
|
996
|
+</code></pre>
|
|
997
|
+
|
|
998
|
+<!-- end -->
|
|
999
|
+
|
|
1000
|
+
|
|
1001
|
+
|
|
1002
|
+
|
|
1003
|
+<!-- exec -->
|
|
1004
|
+
|
|
1005
|
+
|
|
1006
|
+<pre><code>$ cut -f2 all_authors.tsv > firstnames
|
|
1007
|
+</code></pre>
|
|
1008
|
+
|
|
1009
|
+<!-- end -->
|
|
1010
|
+
|
|
1011
|
+
|
|
1012
|
+
|
|
1013
|
+
|
|
1014
|
+<!-- exec -->
|
|
1015
|
+
|
|
1016
|
+
|
|
1017
|
+<pre><code>$ paste firstnames lastnames | sort -k2 | expand -t12
|
|
1018
|
+John Brunner
|
|
1019
|
+Pat Cadigan
|
|
1020
|
+Ursula Le Guin
|
|
1021
|
+Eden Robinson
|
|
1022
|
+James Tiptree
|
|
1023
|
+Miriam Toews
|
|
1024
|
+John Tolkien
|
|
1025
|
+Vanessa Veselka
|
|
1026
|
+Jo Walton
|
|
1027
|
+Gwendolyn Waring
|
|
1028
|
+</code></pre>
|
|
1029
|
+
|
|
1030
|
+<!-- end -->
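One wrinkle worth knowing about: unless told otherwise, `sort` splits fields on blanks, which includes spaces as well as tabs, so a last name like "Le Guin" can count as two fields. The sketch below is one way to make the field boundaries explicit with `-t`. The scratch directory, file names, and variables are hypothetical and exist only for this example.

```shell
# Hypothetical scratch directory and file names, just for this sketch.
dir=$(mktemp -d)
printf 'Jo\nUrsula\nJames\n' > "$dir/first"
printf 'Walton\nLe Guin\nTiptree\n' > "$dir/last"

# paste glues line N of each file together with a tab in between.
joined=$(paste "$dir/first" "$dir/last")

# A literal tab, so sort splits fields only on tabs and not on spaces.
tab=$(printf '\t')

# Sort on the second field only, then pull that field back out with cut.
sorted_last=$(printf '%s\n' "$joined" | sort -t "$tab" -k2,2 | cut -f2)

printf '%s\n' "$sorted_last"
rm -r "$dir"
```

Writing `-k2,2` rather than `-k2` limits the sort key to the second field instead of everything from the second field to the end of the line.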
|
|
1031
|
+
|
|
1032
|
+
|
|
1033
|
+<p>As these examples show, TSV (tab separated values) is a lot like a primitive spreadsheet: a
|
|
1034
|
+way to represent information in columns and rows. In fact, it’s a close cousin
|
|
1035
|
+of CSV, which is often used as a lowest-common-denominator format for
|
|
1036
|
+transferring spreadsheets, and which represents data something like this:</p>
|
|
1037
|
+
|
|
1038
|
+<pre><code>last,first,middle,suffix
|
|
1039
|
+Tolkien,John,Ronald Reuel,
|
|
1040
|
+Tiptree,James,,Jr.
|
|
1041
|
+</code></pre>
|
|
1042
|
+
|
|
1043
|
+<p>The advantage of tabs is that they’re supported by a bunch of the standard
|
|
1044
|
+tools. A disadvantage is that they’re kind of ugly and can be weird to deal
|
|
1045
|
+with, but they’re useful anyway, and character-delimited rows are often a
|
|
1046
|
+good-enough way to hack your way through problems that call for basic
|
|
1047
|
+structure.</p>
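As a rough illustration of how close the two formats are, here is a naive sketch that turns tab-separated rows into comma-separated ones with `tr`. The sample rows and variable names are made up for this example, and the approach assumes no field contains a comma, a quote, or a newline; real CSV has quoting rules for those cases that this does not handle.

```shell
# Hypothetical sample rows in a four-column layout: last, first, middle, suffix.
tsv=$(printf 'Tolkien\tJohn\tRonald Reuel\t\nTiptree\tJames\t\tJr.\n')

# Swap every tab for a comma. This is naive: it only works while no field
# contains a comma, a quote, or a newline, which real CSV has rules for.
csv=$(printf '%s\n' "$tsv" | tr '\t' ',')

printf '%s\n' "$csv"
```

For anything messier than a quick hack, a tool that actually understands CSV quoting is the safer choice.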
|
|
1048
|
+
|
864
|
1049
|
<h2><a name=finding-text-grep href=#finding-text-grep>#</a> finding text: grep</h2>
|
865
|
1050
|
|
866
|
1051
|
<p>After all those contortions, what if you actually just want to see <em>which lists</em>
|
|
@@ -899,7 +1084,7 @@ words have been written on this topic by leading lights of the nerd community.</
|
899
|
1084
|
isn’t very useful to us). That’s because all <code>grep</code> saw was the lines in the
|
900
|
1085
|
files, not the names of the files themselves.</p>
|
901
|
1086
|
|
902
|
|
-<h2><a name=now-you-have-n-problems-regex-rabbit-holes href=#now-you-have-n-problems-regex-rabbit-holes>#</a> now you have n problems: regex + rabbit holes</h2>
|
|
1087
|
+<h2><a name=now-you-have-n-problems-regex-and-rabbit-holes href=#now-you-have-n-problems-regex-and-rabbit-holes>#</a> now you have n problems: regex and rabbit holes</h2>
|
903
|
1088
|
|
904
|
1089
|
<p>To close out this introductory chapter, let’s spend a little time on a topic
|
905
|
1090
|
that will likely vex, confound, and (occasionally) delight you for as long as
|
|
@@ -936,18 +1121,18 @@ shell to match groups of files, but for text in general and with more magic.</p>
|
936
|
1121
|
by <code>grep</code>, other magical things include:</p>
|
937
|
1122
|
|
938
|
1123
|
<table>
|
939
|
|
- <tr><td><code>^</code> </td> <td>start of a line </td></tr>
|
940
|
|
- <tr><td><code>$</code> </td> <td>end of a line </td></tr>
|
941
|
|
- <tr><td><code>[abc]</code></td> <td>one of a, b, or c </td></tr>
|
942
|
|
- <tr><td><code>[a-z]</code></td> <td>a character in the range a through z </td></tr>
|
943
|
|
- <tr><td><code>[0-9]</code></td> <td>a character in the range 0 through 9 </td></tr>
|
944
|
|
-
|
945
|
|
- <tr><td><code>+</code> </td> <td>one or more of the preceding thing </td></tr>
|
946
|
|
- <tr><td><code>?</code> </td> <td>0 or 1 of the preceding thing </td></tr>
|
947
|
|
- <tr><td><code>*</code> </td> <td>any number of the preceding thing </td></tr>
|
948
|
|
-
|
949
|
|
- <tr><td><code>(foo|bar)</code></td> <td>"foo" or "bar"</td></tr>
|
950
|
|
- <tr><td><code>(foo)?</code></td> <td>optional "foo"</td></tr>
|
|
1124
|
+ <tr><td><code>^</code> </td> <td>start of a line </td></tr>
|
|
1125
|
+ <tr><td><code>$</code> </td> <td>end of a line </td></tr>
|
|
1126
|
+ <tr><td><code>[abc]</code></td> <td>one of a, b, or c </td></tr>
|
|
1127
|
+ <tr><td><code>[a-z]</code></td> <td>a character in the range a through z</td></tr>
|
|
1128
|
+ <tr><td><code>[0-9]</code></td> <td>a character in the range 0 through 9</td></tr>
|
|
1129
|
+
|
|
1130
|
+ <tr><td><code>+</code> </td> <td>one or more of the preceding thing </td></tr>
|
|
1131
|
+ <tr><td><code>?</code> </td> <td>0 or 1 of the preceding thing </td></tr>
|
|
1132
|
+ <tr><td><code>*</code> </td> <td>any number of the preceding thing </td></tr>
|
|
1133
|
+
|
|
1134
|
+ <tr><td><code>(foo|bar)</code></td> <td>"foo" or "bar"</td></tr>
|
|
1135
|
+ <tr><td><code>(foo)?</code></td> <td>optional "foo"</td></tr>
|
951
|
1136
|
</table>
|
952
|
1137
|
|
953
|
1138
|
|
|
@@ -1549,6 +1734,9 @@ the same thing as `cat all_authors | nl`, or `nl all_authors`. You won't see
|
1549
|
1734
|
$ sort colors | uniq -i | tail -1
|
1550
|
1735
|
$ cut -d' ' -f1 ./authors_* | sort | uniq -ci | sort -n | tail -3
|
1551
|
1736
|
$ sort -u ./authors_* | cut -d' ' -f1 | uniq -ci | sort -n | tail -3
|
|
1737
|
+ $ sort -k1 all_authors.tsv | expand -t14
|
|
1738
|
+ $ cut -f3 all_authors.tsv | grep .
|
|
1739
|
+ $ paste firstnames lastnames | sort -k2 | expand -t12
|
1552
|
1740
|
$ cat ./authors_* | grep 'Vanessa'
|
1553
|
1741
|
</code></pre>
|
1554
|
1742
|
|
|
@@ -2447,11 +2635,9 @@ If you squint, these look kind of like paths to files on your filesystem.</p>
|
2447
|
2635
|
<hr />
|
2448
|
2636
|
<script>
|
2449
|
2637
|
$(document).ready(function () {
|
2450
|
|
-
|
2451
|
|
- // ☜ ☝ ☞ ☟
|
2452
|
|
- // ☆ ✠ ✡ ✢ ✣ ✤ ✥ ✦ ✧ ✩ ✪
|
2453
|
|
- var closed_sigil = '⇩';
|
2454
|
|
- var open_sigil = '⇧';
|
|
2638
|
+ // ☜ ☝ ☞ ☟ ☆ ✠ ✡ ✢ ✣ ✤ ✥ ✦ ✧ ✩ ✪
|
|
2639
|
+ var closed_sigil = 'show';
|
|
2640
|
+ var open_sigil = 'hide';
|
2455
|
2641
|
|
2456
|
2642
|
var togglesigil = function (elem) {
|
2457
|
2643
|
var sigil = $(elem).html();
|
|
@@ -2462,20 +2648,18 @@ $(document).ready(function () {
|
2462
|
2648
|
}
|
2463
|
2649
|
};
|
2464
|
2650
|
|
2465
|
|
- var togglebutton = function (e) {
|
2466
|
|
- e.preventDefault();
|
2467
|
|
- $details_full.toggle({
|
2468
|
|
- duration: 550
|
2469
|
|
- });
|
2470
|
|
- togglesigil(this);
|
2471
|
|
- };
|
2472
|
|
-
|
2473
|
2651
|
$(".details").each(function () {
|
2474
|
2652
|
var $this = $(this);
|
2475
|
2653
|
var $button = $('<button class=clicker-button>' + closed_sigil + '</button>');
|
2476
|
2654
|
var $details_full = $(this).find('.full');
|
2477
|
2655
|
|
2478
|
|
- $button.click(togglebutton);
|
|
2656
|
+ $button.click(function (e) {
|
|
2657
|
+ e.preventDefault();
|
|
2658
|
+ $details_full.toggle({
|
|
2659
|
+ duration: 550
|
|
2660
|
+ });
|
|
2661
|
+ togglesigil(this);
|
|
2662
|
+ });
|
2479
|
2663
|
|
2480
|
2664
|
$(this).find('.clicker').append($button);
|
2481
|
2665
|
$button.show();
|