At Lexity (where I work and you should too!) we recently had a hack day. I’ve just put the results of my efforts up on Github in a repository called git_post_receive, available at http://github.com/pariser/git_post_receive.
I have always disliked the fact that Github and JIRA speak different languages. If you don’t know what I’m talking about, then you’re probably thankful that you are not the target demographic of this post. In case you want a bit more background: we use Github to manage our code and JIRA to manage our bugs and issues. That terrible “Men are from Mars…” idiom applies all too well to these two systems…
You would think that when a JIRA issue is referenced in a Github commit message, the issue would be updated with information about the associated commit. This project, git_post_receive, is a lightweight server I wrote that does just that, increasing my (and hopefully your) productivity!
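To make the idea concrete, here is a rough Python sketch of the first half of that job: finding JIRA issue keys in the commits of a Github post-receive payload. It is an illustration only, not the code in git_post_receive. The function name `commits_by_issue` and the key pattern (an upper-case project prefix, a hyphen, a number) are assumptions, and the sketch only relies on each commit carrying `id`, `message` and `url` fields.

```python
import re
from collections import defaultdict

# Assumed JIRA key shape: upper-case project prefix, hyphen, number (e.g. "WEB-12")
ISSUE_KEY = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")


def commits_by_issue(payload):
    """Map each issue key to the commits whose messages mention it."""
    mentions = defaultdict(list)
    for commit in payload.get("commits", []):
        # dict.fromkeys drops repeated keys within one message, keeping first-seen order
        keys = dict.fromkeys(ISSUE_KEY.findall(commit.get("message", "")))
        for key in keys:
            mentions[key].append({"id": commit.get("id"), "url": commit.get("url")})
    return dict(mentions)
```

A server built along these lines would then post a comment to each issue found, which is the part that needs JIRA credentials and is left out here.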
I’m going to give a quick introduction to a tool that I built to ease my own web scraping efforts in Python. Aspirationally, I’ve decided to call it PWSU, Python Web Scraping Utilities, thus allowing me to add more functions as time goes on…
It’s available at http://github.com/pariser/pwsu. It’s actually a pretty simple piece of code, with really one use right now, the HTMLCache.
When writing a web scraper, you’re rarely able to write the processing code correctly on the first try. The HTMLCache makes it easy to iterate on your processing code. All of its methods are decorated with @classmethod, so the HTMLCache need not be instantiated. To use it, you need only import it:
from pwsu import HTMLCache
html = HTMLCache.read_url("http://github.com/pariser/pwsu")
First, the HTMLCache checks a local file cache to see whether this URL has been downloaded before. If it has, it reads the HTML document from file. If not, it makes a live call to download the source of the given URL and writes the HTML document to file before returning it to the caller.
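If you’re curious what that read-through logic looks like, here is a minimal sketch of the pattern. It is not PWSU’s actual source: the names `read_url_cached`, `cache_path` and `fetch_live`, the `html_cache` directory, and the choice of a SHA-1 hash of the URL as the file name are all assumptions made for illustration.

```python
import hashlib
import os
import urllib.request


def fetch_live(url):
    """Download the page source over the network."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def cache_path(url, cache_dir):
    # Hash the URL so that any URL maps to a safe, fixed-length file name
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return os.path.join(cache_dir, digest + ".html")


def read_url_cached(url, cache_dir="html_cache", fetch=fetch_live):
    """Return the page bytes, downloading only on a cache miss."""
    path = cache_path(url, cache_dir)
    if os.path.exists(path):
        # Cache hit: read the stored document, no network call
        with open(path, "rb") as f:
            return f.read()
    # Cache miss: download, store, then return
    html = fetch(url)
    os.makedirs(cache_dir, exist_ok=True)
    with open(path, "wb") as f:
        f.write(html)
    return html
```

Passing the download function in as `fetch` is just a convenience of this sketch: it lets you swap in a fake while testing your processing code.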
HTMLCache also provides other conveniences beyond caching:
- It adds a user agent pretending to be Firefox 8.0 on Mac OS X.
- It correctly encodes incoming HTML documents in UTF-8.
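Both conveniences can be sketched with the standard library alone. Again, this is an illustration rather than PWSU’s own code: the function names are made up, the User-Agent string is an assumed example in the style Firefox 8.0 used on Mac OS X, and falling back to a lossy UTF-8 decode when the declared charset is missing or wrong is my choice for the sketch.

```python
import urllib.request

# Assumed example of a Firefox 8.0 on Mac OS X User-Agent string
FIREFOX_8_MAC = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:8.0) "
    "Gecko/20100101 Firefox/8.0"
)


def build_request(url):
    """Build a request that identifies itself as Firefox."""
    return urllib.request.Request(url, headers={"User-Agent": FIREFOX_8_MAC})


def to_utf8(raw, declared_charset=None):
    """Re-encode raw page bytes as UTF-8."""
    try:
        text = raw.decode(declared_charset or "utf-8")
    except (LookupError, UnicodeDecodeError):
        # Unknown or incorrect charset: decode as UTF-8, replacing bad bytes
        text = raw.decode("utf-8", errors="replace")
    return text.encode("utf-8")


def fetch_as_utf8(url):
    """Download a page and return its source as UTF-8 bytes (network call)."""
    with urllib.request.urlopen(build_request(url)) as response:
        charset = response.headers.get_content_charset()
        return to_utf8(response.read(), charset)
```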
Have fun scraping!
There are only a few things that frustrate me about traveling, and one of them is the guide book. Don’t get me wrong: there’s a ton of value in a guide book and I’d never leave home without one. But really, it’s not the book that’s of value but the content inside it. In this age of ubiquitous iPhones and Android-powered devices, tell me why I should carry a five-pound monster everywhere I go?
A screenshot of the Itinerary Builder prototype tool I've written, available at http://pariser.me/itinerary
This is the realization that led me to build a prototype tool, which I’ve blandly named Itinerary Builder (http://pariser.me/itinerary), with the hope of re-conceptualizing the format of a “guide book”, at least for the planning stages of a trip.