I’m going to give a quick introduction to a tool that I built to ease my own web scraping efforts in Python. Aspirationally, I’ve decided to call it PWSU, Python Web Scraping Utilities, which leaves me room to add more functions as time goes on.
It’s available at http://github.com/pariser/pwsu. It’s actually a pretty simple piece of code, with really only one feature right now: the HTMLCache.
When writing a web scraper, you’re rarely able to write the processing code correctly on the first try. The HTMLCache makes it easy to iterate on your processing code, because repeated runs read pages from a local cache instead of downloading them again. All of the methods are designated as @classmethod, so the HTMLCache need not be instantiated. To use it, you need only import it:
from pwsu import HTMLCache
html = HTMLCache.read_url("http://github.com/pariser/pwsu")
First, the HTMLCache checks a local file cache to see whether this URL has been downloaded before. If it has, it reads the HTML document from the cached file. If not, it makes a live request to download the source of the given URL and writes the HTML document to the cache before returning it to the user.
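To make the lookup-then-download flow concrete, here is a minimal sketch of the same idea using only the standard library. This is not PWSU’s actual code: the names `cache_path`, `fetch_live`, and `read_url_cached`, the hashed file names, and the `cache_dir` argument are all assumptions made for illustration.

```python
import hashlib
import os
import urllib.request


def cache_path(url, cache_dir):
    """Map a URL to a file inside cache_dir (hypothetical naming scheme)."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return os.path.join(cache_dir, digest + ".html")


def fetch_live(url):
    """Download the raw bytes of a page over the network."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def read_url_cached(url, cache_dir, fetch=fetch_live):
    """Return the page bytes, touching the network only on a cache miss."""
    path = cache_path(url, cache_dir)

    # Cache hit: the URL was downloaded before, so read it from disk.
    if os.path.exists(path):
        with open(path, "rb") as f:
            return f.read()

    # Cache miss: download the page, store it, then return it.
    body = fetch(url)
    os.makedirs(cache_dir, exist_ok=True)
    with open(path, "wb") as f:
        f.write(body)
    return body
```

Passing the downloader in as the `fetch` argument is just a convenience of this sketch: it lets you swap in a fake while testing your processing code offline.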
HTMLCache also provides other conveniences beyond caching:
- It sends a User-Agent header that identifies the request as coming from Firefox 8.0 on Mac OS X.
- It correctly encodes incoming HTML documents in UTF-8.
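If you’re curious what those two conveniences might look like, here is a rough standard-library sketch. Again, this is not taken from PWSU: the `FIREFOX_UA` string, the helper names `build_request` and `to_utf8`, and the fallback order of character sets are assumptions for illustration only.

```python
import urllib.request

# Illustrative Firefox 8.0 / Mac OS X User-Agent string (an assumption,
# not necessarily the exact value PWSU sends).
FIREFOX_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:8.0) "
    "Gecko/20100101 Firefox/8.0"
)


def build_request(url, user_agent=FIREFOX_UA):
    """Build a request object that carries a browser-like User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def to_utf8(raw, declared_charset=None):
    """Re-encode raw page bytes as UTF-8.

    Tries the charset the server declared (if any), then UTF-8,
    and finally falls back to latin-1.
    """
    for charset in (declared_charset, "utf-8"):
        if not charset:
            continue
        try:
            return raw.decode(charset).encode("utf-8")
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown charset: try the next candidate.
            continue
    # latin-1 maps every byte to a character, so this never raises.
    return raw.decode("latin-1").encode("utf-8")
```

With a live response, the declared charset would typically come from `response.headers.get_content_charset()`, and the request from `build_request` can be passed straight to `urllib.request.urlopen`.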
Have fun scraping!