|
|
- <[Brennen]> WareLogging: First thought on writing a decent web-scraping script: if you're working with a known set of files, the first thing to do is cache them locally before you even think about processing them. Broadband or no, local storage is so much faster that you'll save time unless you get the processing right on the first try - and how likely is that? A server administrator somewhere will also be a lot happier with you than if you download the same batch of a thousand files about fifteen times. Even if you're going to spider through the pages looking for other files to add to the set, it _might_ be a good idea to separate that logic from the rest of the processing, so that you never load a page from the server more than once while you hammer things into shape.
-
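The cache-first idea can be sketched as a small fetch wrapper. This is a hypothetical illustration, not a fixed recipe: the cache directory name and the injectable `fetch` parameter are my own inventions, the latter so the download step can be swapped out or stubbed while you're debugging the rest.

```python
import hashlib
import pathlib
import urllib.request

CACHE_DIR = pathlib.Path("scrape-cache")  # hypothetical cache directory

def cached_fetch(url, fetch=lambda u: urllib.request.urlopen(u).read()):
    """Return the body for url, hitting the network only on a cache miss."""
    CACHE_DIR.mkdir(exist_ok=True)
    # One file per URL, named by a hash so any URL maps to a safe filename.
    path = CACHE_DIR / hashlib.sha256(url.encode()).hexdigest()
    if path.exists():
        return path.read_bytes()
    body = fetch(url)
    path.write_bytes(body)
    return body
```

Run your processing against `cached_fetch` instead of the raw download call, and the fifteenth attempt at getting the parsing right costs the server nothing.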
- If you're ever responsible for generating HTML that someone might want to scrape, it would reflect very well on you if you A) cleanly formatted the text, and B) marked the start and end of the real content in an easily located fashion, standard across every page.
-
- It would reflect even better on you if the HTML was either parseable XML or accompanied by a sane, easily parsed data format.
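As a sketch of point B: if every page brackets its real content with the same pair of markers (the comment strings and sample page below are made up for illustration), extraction reduces to one pattern match instead of page-by-page guesswork.

```python
import re

# Hypothetical page: navigation and footer junk around clearly marked content.
PAGE = """<html><body>
<div class="nav">junk</div>
<!-- content:start -->
<p>The real article text.</p>
<!-- content:end -->
<div class="footer">more junk</div>
</body></html>"""

def extract_content(html):
    # re.S lets . span newlines; the non-greedy .*? stops at the first end marker.
    m = re.search(r"<!-- content:start -->(.*?)<!-- content:end -->", html, re.S)
    return m.group(1).strip() if m else None
```

The markers themselves don't matter - comments, a fixed `id` on a wrapper element, whatever - so long as they're identical on every page the scraper will see.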
|