WalaWiki content from p1k3.com

<[Alan]> This does make sense. What's missing here is the concept of records. Some of the standard unix tools are definitely record-based (sort, uniq, join, comm, cut, paste, wc, shuf, tsort). Some not necessarily (cat, tr, split, dd, cksum, head, tail).
For the traditional unix tsv format, records are newline delimited. Within a record, fields are tab delimited and often externally named and typed.
For a json stream (like those emitted by eg the twitter apis) the records are also newline delimited, but within that the fields can be hierarchical, and the whole record is self-describing.
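For concreteness, here's the same made-up record in each (names and values invented for illustration). As tsv, one line, tab-separated, with the schema carried out of band:
21686005800	loydcase	hello world
As a newline-delimited json record, one line, self-describing and hierarchical:
{"id": 21686005800, "user": {"name": "loydcase"}, "text": "hello world"}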
For xml, I'm not sure if there's a consensus as to how to delimit the records in a stream. There's violence inherent in the system, and it usually involves parsing a whole document into memory.
S-expressions? Also not sure about the consensus on streaming records. Surely you could follow json and escape newlines within records.
Protocol buffers are record oriented and hierarchical, but not self-describing. This makes them more compact, at the expense of some extra song and dance at the outset to initialize a schema. You're in a similar boat with a tsv that doesn't include a header line naming the columns.
Anyway, I think it's high time I published "vinyl", this record-processing library that's been in my toolbox for the last half year or so. It has these subcommands so far:
vinyl io # this is like cat(1), grep(1), map, but can also do aggregation over record groups
vinyl lookup # this is like join(1)
vinyl index # this builds an index for vinyl lookup
vinyl split # this is pretty much split(1)
vinyl part # this does a more flexible split(1)
It's in Python. It ain't perfect. It's more of a "what would a multi-format record processing interface look like" experiment. I keep toying with the idea of moving it to a natively compilable functional language. All that said ... it's been insanely, insanely useful, mostly by virtue of the fact that you can freely mix it into pipelines with the standard unix tools. Which are still for me the ultimate definition of insanely useful. You don't *have* to rewrite sort(1) for json records. Just convert to hierarchical tsv, pipe to sort(1), and convert back -- the copying overhead is way down in the noise.
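The round trip looks roughly like this (schema options elided for brevity):
vinyl io -r json -w tsv ... < tweets.json | sort | vinyl io -r tsv -w json ...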
Supposing I finally get off my ass and release this thing, would you be interested in helping improve it? Not that you'd have to dig into the Python. A critique of the command line interface would be extremely helpful.
<[Brennen]> This is fairly illustrative of how little thought I've actually given the problem.
Anyway, I'm certainly interested in vinyl. I've been feeling a growing itch for some fresh abstractions at around the granularity and composability of the standard tools - something a little higher-order, if that's the way to put it.
(I should mention that I got to thinking along these lines tonight based on [http://twitter.com/#!/stevedekorte/status/93116610373619712 a tweet] from Steve Dekorte.)
<[Alan]> What I was looking for, and what you might be looking for, was a modernized awk.
Sure, if you're completely nuts ^1 you can write an xml parser in awk. But why not lean on a modern language with lots of modern libraries you can call from your one-liner? For a while, I did a lot of perl -lape, but that makes some assumptions about the way records and fields are structured; some formats just can't be accommodated. If you have binary data (documents, say) you can go down the insane escaping path, or you can use a binary format that does length prefixing.
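By length prefixing I mean something like this (one common framing convention; protobuf streams, for instance, tend to use a varint length rather than a fixed-width one):
import struct

def write_record(f, payload):
    # fixed 4-byte little-endian length header, then the raw record bytes
    f.write(struct.pack('<I', len(payload)))
    f.write(payload)

def read_record(f):
    header = f.read(4)
    if not header:
        return None  # clean EOF at a record boundary
    (length,) = struct.unpack('<I', header)
    return f.read(length)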
With vinyl, you specify the input and output record formats, like in your "s j" example. For record formats that aren't self-describing, you also pass parameters that specify input and output schema. The rest is one-liner fun: -e "..." gets executed on every record, -B "..." is like BEGIN { } code, and -E "..." is like END { } code, with the refinement that you can also have -B "..." and -E "..." run at every key boundary (think uniq -c, or SELECT some-aggregate() FROM ... GROUP BY).
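So a count-per-group aggregation might look something like this -- hand-waving the key-boundary plumbing, which is exactly the part of the interface I'm least settled on:
vinyl io -r json -B 'n = 0' -e 'n += 1' -E 'print(n)' < tweets.json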
The nice thing about doing this in Python -- I had no idea this would happen at the outset -- is that there's a facility for executing code inside a local scope specified by a dict.^2 Key / values in the dict become local variables for the code. Also, if the code creates new local variables, these are saved in the dict afterwards.
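In plain Python that facility is just exec() with a dict supplied as the locals mapping -- a minimal sketch (Python 3 spelling):
record = {"text": "RT @loydcase: 3.6 gigapixel tribute to Blade Runner"}
# names in the executed code resolve against the record dict,
# and new assignments land back in it
exec("nwords = len(text.split())", {}, record)
print(record["nwords"])  # 8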
The upshot is that things like this work:
vinyl io -r json -w tsv -W fields=nwords -e 'nwords=len(text.split())' < tweets.json
This outputs the number of words in each tweet. An input json record looks like this:
{"in_reply_to_user_id":null,"text":"RT @loydcase: 3.6 gigapixel tribute to Blade Runner: http://bit.ly/dlEKvP",...,"user":{...},"id":21686005800,"retweet_count":null}
Since the input record is used as the local variable dict for -e '...', there's automatically a variable called "text". A variable called nwords is created by virtue of the assignment, and becomes a part of the output record. The output format, tsv, only specifies one field, "nwords." Everything else in the record gets thrown away.
With perl -lape, you'd have to first parse the json in your -e code, then set $_ to the field you wanted to print, joining on '\t' if there were multiple fields. Well you know what? Screw that.
With awk it's even more annoying. It *does* set up automatic local variables for the input record, except they're named $1, $2, $3 ... $n. Want to output the whole record, minus some trailing field, plus some new field? Gotta print all those $1, $2, $3 ... $n-1 again.
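E.g., to emit the whole record minus the last field, plus one computed field, you end up writing the copy loop yourself:
awk -F'\t' '{ for (i = 1; i < NF; i++) printf "%s\t", $i; print length($NF) }'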
TL;DR my point is, stop wasting your time arranging for i/o, parsing, and formatting, and get down to the computation already. I think that was also your original point, just felt like expanding on it a bit. ;)
^1 http://pastebin.com/Vwvz3gzb
^2 http://acg.github.com/2011/02/08/mapping-python-over-records-with-lwpb.html
<[Brennen]> That looks slick as shit.
re: ^1, ''gah'' that's terrifying.
<[Alan]> I feel obliged to mention [https://github.com/benbernard/RecordStream]
<[Brennen]> (I feel sort of obliged at this point to apologize once more for the considerable infelicities of the painfully unmaintained regex substitution [https://github.com/brennen/display/blob/master/lib/Wala.pm#L323 markup] hereabouts. Will fix everything I swear one of these decades.)
Anyhow, interesting find...