Web browsers are really complicated these days. They're full of rendering engines, audio and video players, programming languages, development tools, databases --- you name it, and there's a fair chance it's in there somewhere. The modern web browser is kitchen sink software, and to make matters worse, it is totally surrounded by technobabble. It can take years to come to terms with the ocean of words about web stuff and sort out the meaningful ones from the snake oil and bureaucratic mysticism.
All of which can make the web itself seem like a really complicated landscape, and obscure the simplicity of its basic design, which is this:
Some programs pass text around to one another.
Which might sound familiar.
The gist of it is that the web is made out of URLs, "Uniform Resource Locators", which are paths to things. If you squint, these look kind of like paths to files on your filesystem. When you visit a URL in your browser, it asks a server for a certain path, and the server gives it back some text. When you click a button to submit a form, your browser sends some text to the server and waits to see what it says back. The text that gets passed around is (usually) written in a language with particular significance to web browsers, but if you look at it directly, it's a format that humans can understand.
Let's illustrate this. I've written a really simple web page that lives at
http://p1k3.com/hello_world.html
.
$ curl 'https://p1k3.com/hello_world.html'
<html>
<head>
<title>hello, world</title>
</head>
<body>
<h1>hi everybody</h1>
<p>How are things?</p>
</body>
</html>
curl
is a program with lots and lots of features --- it too is a little bit
of a kitchen sink --- but it has one core purpose, which is to grab things from
URLs and spit them back out. It's a little bit like cat
for things that live
on the web. Try the above command with just about any URL you can think of,
and you'll probably get something back. Let's try this book:
$ curl 'https://p1k3.com/userland-book/' | head
<!DOCTYPE html>
<html lang=en>
<head>
<meta charset="utf-8">
<title>userland: a book about the command line for humans</title>
<link rel=stylesheet href="userland.css" />
<script src="js/jquery.js" type="text/javascript"></script>
</head>
<body>
hello_world.html
and userland-book
are both written in HyperText Markup
Language. HTML is just text with a specific kind of structure. It's been
around for quite a while now, and has grown up a lot in 20 years, but at heart
it still looks a lot [like it did in 1991][www].
The basic idea is that the contents of a web page are marked up with tags. A tag looks like this:
<title>hi!</title> -,
| | |
| `- content |
| `- closing tag
`-opening tag
Sometimes you'll see tags with what are known as "attributes":
<a href="https://p1k3.com/userland-book">userland</a>
This is how links are written in HTML. href="..."
tells the browser where to
go when the user clicks on "userland".
Tags are a way to describe not so much what something looks like as what something means. Browsers are, in large part, big collections of knowledge about the meanings of tags and ways to represent those meanings.
While the browser you use day-to-day has (probably) a graphical interface and does all sorts of things impossible to render in a terminal, some of the earliest web browsers were entirely text-based, and text-mode browsers still exist. Lynx, which originated at the University of Kansas in the early 1990s, is still actively maintained:
$ lynx -dump 'http://p1k3.com/userland-book/' | head
userland
__________________________________________________________________
[1]# a book about the command line for humans
Late last year, [2]a side trip into text utilities got me thinking
about how much my writing habits depend on the Linux command line. This
struck me as a good hook for talking about the tools I use every day
with an audience of mixed technical background.
If you invoke Lynx without any options, it'll start up in interactive mode, and
you can navigate between links with the arrow keys. lynx -dump
spits a
rendered version of a page to standard output, with links annotated in square
brackets and printed as footnotes. Another useful option here is -listonly
,
which will print just the list of links contained within a page:
$ lynx -dump -listonly 'http://p1k3.com/userland-book/' | head
References
2. http://p1k3.com/2013/8/4
3. http://p1k3.com/userland-book.git
4. https://github.com/brennen/userland-book
5. http://p1k3.com/userland-book/
6. https://twitter.com/brennen
9. http://p1k3.com/userland-book/#a-book-about-the-command-line-for-humans
10. http://p1k3.com/userland-book/#copying
An alternative to Lynx is w3m, which copes a little more gracefully with the complexities of modern web layout.
$ w3m -dump 'http://p1k3.com/userland-book/' | head
userland
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# a book about the command line for humans
Late last year, a side trip into text utilities got me thinking about how much
my writing habits depend on the Linux command line. This struck me as a good
hook for talking about the tools I use every day with an audience of mixed
technical background.
Neither of these tools can easily replace enormously capable applications like Chrome or Firefox, but they have their place in the toolbox, and help to demonstrate how the web is built (in part) on principles we've already seen at work.