How to web scrape (in Python)

Sooner or later, even a desktop programmer (such as myself) wants to data mine some page because the page itself provides no way to search its content. I am referring only to scraping non-confidential material, so if you were looking for tips on how to hack a page: sorry, this is not the place to be.

Without further ado I would like to share with you a few tips regarding web scraping in Python.

Opening a page – not as easy as you are used to

There are two ways to open a page: an easy one and a not-so-easy one.

To just open a page you will most probably use urllib2 to invoke urllib2.urlopen("").read(). That was easy.
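
As a minimal sketch of the easy case on Python 3, where the urllib2 functionality lives in urllib.request, fetching a page into a text string could look like the following. The helper name fetch_page, the timeout value and the UTF-8 fallback are assumptions made for illustration only.

```python
import urllib.request


def fetch_page(url, timeout=10.0):
    """Download a page and return its body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        raw = response.read()
        # Use the charset declared in the Content-Type header, if any.
        # Falling back to UTF-8 is an assumption of this sketch.
        charset = response.headers.get_content_charset() or "utf-8"
    # errors="replace" keeps going even if a few bytes are mis-encoded.
    return raw.decode(charset, errors="replace")


# Hypothetical usage:
# html = fetch_page("http://example.com/")
```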

Sometimes, though, you will be forced to log in to a given page. If you are lucky, the page will use a standard HTTP authentication mechanism; in this case the library examples will show you the way. If, however, you are required to use some custom authentication mechanism, then you will have to play around with POST requests.
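
For the custom case, a login is often just a form submitted via POST. The sketch below (Python 3, urllib.request) only builds such a request. The function name build_login_request and the form field names "username" and "password" are hypothetical; a real site will use whatever names its login form defines.

```python
import urllib.parse
import urllib.request


def build_login_request(login_url, username, password):
    """Build a POST request carrying form-encoded credentials."""
    # The field names below are hypothetical; inspect the site's login
    # form to find the names it actually expects.
    form = urllib.parse.urlencode({"username": username, "password": password})
    # Passing a bytes body makes urllib send the request as POST.
    return urllib.request.Request(login_url, data=form.encode("ascii"))


# Hypothetical usage, given some opener:
# opener.open(build_login_request("http://example.com/login", "me", "secret"))
```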

Login state, among other things, is kept in cookies. To send some cookies of your own, use:

import urllib2
opener = urllib2.build_opener()
# Every request made through this opener will carry the Cookie header.
opener.addheaders.append(("Cookie", "foo=0; bar=baz"))

Remember to keep the cookies set by the page, as they matter for things such as preserving your session ID.
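
One way to keep those cookies without copying them by hand is to let a cookie jar do it. This is a sketch for Python 3 (http.cookiejar and urllib.request; the Python 2 counterparts are cookielib and urllib2). The helper name make_session_opener is made up for this example.

```python
import http.cookiejar
import urllib.request


def make_session_opener():
    """Return an opener that remembers cookies, plus its cookie jar."""
    jar = http.cookiejar.CookieJar()
    # The processor stores cookies from each response in the jar and
    # sends matching ones back on later requests made via this opener.
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


# Hypothetical usage:
# opener, jar = make_session_opener()
# opener.open("http://example.com/login")   # the session cookie lands in jar
# opener.open("http://example.com/data")    # and is sent back here
```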

Some pages will not let you read them unless you supply the correct (i.e., faked) cookies, such as the browser-id. In such cases it’s wise to visit the site with your browser beforehand and determine which cookies are required.

Parsing = bread & butter

When you finally get your page in a nice Unicode string, it’s time to scrape it. Manual parsing is not advised, so we shall use a library.

A library that you might hear a lot about is lxml. While it is generally a great library (which we will use, indirectly), I do not recommend using it directly. The reason is that it cannot parse incorrectly written HTML pages, of which there are a lot. The lxml documentation presents ways of dealing with badly formatted pages but, alas, I wasn’t able to force lxml to read my pages.

The best library for page scraping that I’ve found so far is BeautifulSoup. It is a very nice library that can parse even the worst pages. Parsing pages with it is very convenient, so RTFM and start rolling!

One important note: for Python 2.x it’s recommended that you install lxml and let BS use it as its parser. I’ve done some performance checks, and the html.parser-based solution (the default) is slower: for just a few simple HTML searches, lxml was 10% faster. You can imagine how much faster lxml will be if you do more parsing on larger pages. In Python 3.x html.parser got much better, and you can safely use it instead of lxml.
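
To make the parser choice concrete, here is a small sketch that assumes the bs4 package (BeautifulSoup 4) is installed. It asks for the lxml parser and falls back to html.parser when lxml is missing. The helper names make_soup and extract_links are invented for this example.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html):
    """Parse HTML with lxml if it is installed, else with html.parser."""
    try:
        return BeautifulSoup(html, "lxml")
    except FeatureNotFound:
        # Raised by bs4 when the requested parser is not available.
        return BeautifulSoup(html, "html.parser")


def extract_links(html):
    """Return the href of every <a> tag that has one."""
    soup = make_soup(html)
    return [a["href"] for a in soup.find_all("a", href=True)]


# Sloppy markup (unclosed <p> tags, no <html> or <body>) is still usable:
# extract_links('<p>First <a href="/one">one</a><p>Second <a href="/two">two</a>')
```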

Performance issues? Threads to the rescue!

Scraping one page is fast. In just a few seconds you can mine all the interesting data. But what if you want to scrape thousands of pages? Now we are talking!

Careful profiling revealed that 50% of the execution time is spent fetching the page. During that time, our processor is sleeping. We can’t allow that. To gain a HUGE speedup, I used multiple worker threads, each processing one page at a time. With this change, I measured an 11x performance increase!
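
The idea can be sketched in a few lines. Note that this version uses the standard library’s ThreadPoolExecutor (Python 3) rather than hand-managed threads, so it is not necessarily how my scraper was structured. The names scrape_all and scrape_one and the default of 8 workers are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def scrape_all(urls, scrape_one, workers=8):
    """Run scrape_one(url) for every URL on a pool of worker threads.

    Each worker handles one page at a time; while one thread waits for
    the network, the others keep working. Results come back in the same
    order as the input URLs.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(scrape_one, urls))


# Hypothetical usage, where scrape_one fetches and parses a single page:
# results = scrape_all(list_of_urls, scrape_one, workers=16)
```

The right number of workers depends on the site and your connection, so it is worth measuring a few values instead of guessing.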

I’ve also tried to fight the GIL by using processes instead of threads. The results were unsatisfactory. The reason, AFAIK, is that only a little Python code actually runs concurrently in the threads (most of the time they wait for the network), so the GIL is rarely the bottleneck. For computation-heavy algorithms processes might be of great help, but in my case the additional cost of keeping processes alive was bigger than the potential gain.

When mining a lot of data, you might be tempted to create many output files, because writing to a single sqlite database or a single file from multiple threads can result in either an exception or corrupted data. I once created ~100k output files, and believe me: it wasn’t a pretty sight. Eclipse died, bash required some hacks to avoid the “argument list too long” problem, and everything went to hell. To solve this I used the thread-safe Queue. This allowed my worker threads to put their output strings concurrently, while the queue handler (running on yet another thread) wrote the strings to a single output file.
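
A sketch of that arrangement follows, written for Python 3 (where the module is called queue; in Python 2 it is Queue). All function names, the sentinel object and the one-string-per-line output format are assumptions made for the example.

```python
import queue
import threading

_STOP = object()  # sentinel telling the writer thread to finish


def writer_loop(out_queue, path):
    """Drain the queue and write every item to a single output file."""
    with open(path, "w", encoding="utf-8") as out:
        while True:
            item = out_queue.get()
            if item is _STOP:
                break
            out.write(item + "\n")


def run_workers(inputs, process, path, workers=4):
    """Process inputs on several threads; only the writer touches the file."""
    out_queue = queue.Queue()
    writer = threading.Thread(target=writer_loop, args=(out_queue, path))
    writer.start()

    in_queue = queue.Queue()
    for item in inputs:
        in_queue.put(item)

    def worker():
        while True:
            try:
                item = in_queue.get_nowait()
            except queue.Empty:
                return  # no work left
            out_queue.put(process(item))

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    out_queue.put(_STOP)  # all workers are done, so nothing follows this
    writer.join()
```

Because only one thread ever writes, the output lines cannot interleave, although their order depends on which worker finishes first.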

If you ever start optimizing your page scraper, remember to use time.time() instead of time.clock(). On Unix the latter measures processor time, whereas idle waiting is a big part of the whole scraping process.
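
The difference is easy to see in a small experiment. time.clock() was removed in Python 3.8, so this sketch uses the Python 3 functions time.perf_counter() (wall-clock time) and time.process_time() (CPU time) to show the same contrast. The helper name measure is made up, and time.sleep() stands in for waiting on the network.

```python
import time


def measure(func):
    """Return (wall_seconds, cpu_seconds) spent running func()."""
    wall_start = time.perf_counter()  # wall-clock time, includes waiting
    cpu_start = time.process_time()   # CPU time of this process only
    func()
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return wall, cpu


# Sleeping stands in for a slow download: the wall-clock figure covers
# the wait, while the CPU figure stays close to zero.
wall, cpu = measure(lambda: time.sleep(0.2))
print("wall: %.3f s, cpu: %.3f s" % (wall, cpu))
```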

