WalaWiki content from p1k3.com

Everyone uses regular expressions to match and pull text out of web documents. This usually works, but it is painful and tends to produce collections of special cases rather than anything that generalizes well. The 34th time you do it is nearly as much effort as the first, though by that time you hate yourself a lot more for it. Open question: since HTML generally has enough structure to display, and since display appearance is often the targeted layer of structure (people tend to encode this kind of meaning for other people, not for machine readability), shouldn't there be a graceful way to do this from the browser?
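To make the "collection of special cases" complaint concrete, here is a minimal sketch in Python using only the standard library. The sample markup, the `NAIVE` pattern, and the names `LinkCollector`, `regex_links` and `parsed_links` are all invented for illustration; none of them come from the tools listed on this page. The regex is written against the first link and silently misses the second, while a parser-based collector picks up both.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample page; the markup is invented for illustration.
PAGE = """
<ul>
  <li><a href="/a.html">First</a></li>
  <li><a class="x" href='/b.html'>Second</a></li>
</ul>
"""

# A regex written against the first link only: href comes first, double quotes.
NAIVE = re.compile(r'<a href="([^"]+)">([^<]+)</a>')


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> we are currently inside, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def regex_links(html):
    """Return whatever the naive pattern happens to match."""
    return NAIVE.findall(html)


def parsed_links(html):
    """Return (href, text) pairs found by walking the parsed tags."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links
```

With the sample above, `regex_links(PAGE)` finds only the first link, because the second has an extra attribute and single quotes. Each such miss is one more special case to bolt onto the pattern, which is the treadmill described above.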
* MozillaFirefox: [http://simile.mit.edu/solvent/ Solvent], a screen scraping aid. Halfway to useful: it lets you graphically find an XPath for something on the page you're looking at. Meant as an aid for building scrapers for PiggyBank.
* LinuxMagazine: [http://www.stonehenge.com/merlyn/LinuxMag/col55.html using xsh to scrape web pages] - Perl; xsh is both a Term::ReadLine-based shell and a module. Looks promising. You could find an XPath using Solvent and feed it to your Perl script with a little output template.
* DownThemAll - Very useful for grabbing collections of files. Does regex matching on links.
* MozillaFirebug and hpricot.
<[[Brennen]]> One conclusion is that I probably need to learn JavaScript.
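The "find an xpath, then feed it to a script with a little output template" idea from the list above can be sketched roughly as follows. This is Python rather than the Perl/xsh setup in the linked column, and it rests on loud assumptions: the XHTML fragment, the `ROW_XPATH` expression, the `ROW_TEMPLATE` format and the `scrape` function are all hypothetical, and the input is assumed to be well-formed XML, which real pages often are not. The standard library's ElementTree also supports only a subset of XPath, so an expression copied out of a browser tool may need simplifying or a fuller XPath engine.

```python
import xml.etree.ElementTree as ET
from string import Template

# Hypothetical, well-formed XHTML fragment (assumption: real pages may not parse).
XHTML = """<html><body>
<div id="posts">
  <div class="post"><h2>Alpha</h2><a href="/alpha">read</a></div>
  <div class="post"><h2>Beta</h2><a href="/beta">read</a></div>
</div>
</body></html>"""

# The path you would find by pointing at the page; hypothetical here.
ROW_XPATH = ".//div[@class='post']"

# The "little output template": one line of output per matched row.
ROW_TEMPLATE = Template("$title -> $url")


def scrape(xml_text, row_xpath=ROW_XPATH, template=ROW_TEMPLATE):
    """Apply row_xpath to the document and format each match with template."""
    root = ET.fromstring(xml_text)
    lines = []
    for row in root.findall(row_xpath):
        title = row.findtext("h2", default="")
        link = row.find("a")
        url = link.get("href", "") if link is not None else ""
        lines.append(template.substitute(title=title, url=url))
    return lines


if __name__ == "__main__":
    for line in scrape(XHTML):
        print(line)
```

The point of splitting it this way is that the path and the template are the only parts that change from site to site; the loop stays the same.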