Tuesday, May 15, 2007

An unscientific comparison 3.1 - Getting the cookie, Python

Part 3.1 of N in a series. Click here for Part 3.

So, without further ado, here's some actual code. I've started with Python as it's my 'default' language.

import urllib
import urllib2
import cookielib

def getCookie(user, pwd, uri, ua):
    # The jar collects any cookies set during the login exchange.
    ckCont = cookielib.LWPCookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(ckCont))
    # addheaders is a list of (name, value) tuples sent with every request.
    opener.addheaders = [('User-Agent', ua)]
    data = urllib.urlencode({'id': user, 'pwd': pwd, 'dologin': 'yes'})
    # Passing data makes this a POST; the opener follows the redirect itself.
    opener.open(uri, data)
    return ckCont

def storeCookie(ck, file_path):
    ck.save(file_path)

def retrieveCookie(file_path):
    ckCont = cookielib.LWPCookieJar(file_path)
    ckCont.load()
    return ckCont

Comments

Duplicate libraries
This is one of the messier areas of the stdlib in my mind, and I'm hoping it gets overhauled for Python 3000. If that's not currently planned, I should probably volunteer. urllib and urllib2 overlap heavily when it comes to HTTP, but only urllib2 can handle cookies and proxies sanely. The only thing urllib is really necessary for (in my experience) is urlencode.
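For reference, a minimal sketch of the one job urllib still does here, namely turning form fields into a POST body. The field values are made up for illustration, and the fallback import is only there so the snippet also runs where urlencode lives in urllib.parse (Python 3).

```python
try:  # Python 2 location, as used in the post
    from urllib import urlencode
except ImportError:  # same function, Python 3 location
    from urllib.parse import urlencode

# A list of pairs keeps the field order predictable; a dict works too.
# The values are hypothetical and chosen to show the escaping.
fields = [('id', 'some user'), ('pwd', 'p&ss=word'), ('dologin', 'yes')]
body = urlencode(fields)
print(body)  # id=some+user&pwd=p%26ss%3Dword&dologin=yes
```

Spaces become '+', and characters that would confuse the form encoding ('&', '=') are percent-escaped.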

Documentation
The documentation for urllib2 is... sparse. Some more examples would've been brilliant, as would some clarification of what needs to go with what. For instance, I'm assuming a proxy requiring login will need a ProxyHandler and a ProxyBasicAuthHandler, but such (I'd assume) common usages aren't commented on. Perhaps something else I should volunteer to correct.
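If that assumption about proxies holds, the wiring might look roughly like the sketch below. It is untested against a real proxy: the proxy address and credentials are hypothetical, and it only builds the opener without sending anything.

```python
try:  # Python 2 name, as used in the post
    import urllib2 as request
except ImportError:  # same module under its Python 3 name
    import urllib.request as request

# Hypothetical proxy address and credentials, for illustration only.
PROXY = 'http://proxy.example.com:3128'

password_mgr = request.HTTPPasswordMgrWithDefaultRealm()
# realm=None: use these credentials whatever realm the proxy announces.
password_mgr.add_password(None, PROXY, 'proxyuser', 'proxypass')

opener = request.build_opener(
    request.ProxyHandler({'http': PROXY}),        # route http:// via the proxy
    request.ProxyBasicAuthHandler(password_mgr),  # answer its 407 challenge
)

# No request has been sent yet; opener.open(uri) would go through the proxy.
handler_names = [h.__class__.__name__ for h in opener.handlers]
```

Passing a ProxyHandler instance replaces the default one that build_opener would otherwise add, so the two don't fight each other.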

Edit: A commenter pointed out these two great resources: urllib2 - The Missing Manual and cookielib and ClientCookie. Seems someone else is way ahead of me in that respect.

Out-and-out broken libraries
I had to upgrade to Python 2.5.1 to make this work, as urllib2 in Python 2.4 is apparently broken when trying to use cookie handlers, and my experiences with it anecdotally confirm that. First time I've ever encountered that in Python...

Minor quibble
opener.addheaders - phrased like a method, it's actually a plain attribute: a list of (name, value) header tuples to add to each request. I know it's more verbose, but perhaps it should be called headersToAdd.
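A quick way to see the quibble in action (the user-agent string is made up for the example):

```python
try:  # Python 2 name, as used in the post
    import urllib2 as request
except ImportError:  # same module under its Python 3 name
    import urllib.request as request

opener = request.build_opener()

# addheaders is plain data: a list of (name, value) tuples, not a method.
is_list = isinstance(opener.addheaders, list)
is_method = callable(opener.addheaders)

# Assigning a new list throws away the default User-agent header...
opener.addheaders = [('User-Agent', 'example-agent/0.1')]  # made-up UA string
# ...while appending keeps whatever is already there.
opener.addheaders.append(('Accept-Language', 'en'))
```

Calling opener.addheaders(...) as if it were a method would fail with a TypeError, since a list isn't callable.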
Edit: I did have another quibble here about CookieJars not being serialisable, but with LWPCookieJar, I don't need to serialise anything.
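To illustrate why no extra serialisation is needed, here is a small round trip through LWPCookieJar's own save() and load(). It's a sketch under a few assumptions: the cookie is built by hand (every value is invented) so nothing touches the network, and it goes to a temporary file rather than a real path.

```python
import os
import tempfile
import time

try:  # Python 2 name, as used in the post
    import cookielib as cookiejar
except ImportError:  # same module under its Python 3 name
    import http.cookiejar as cookiejar

# Build a cookie by hand; all values are made up for illustration.
cookie = cookiejar.Cookie(
    version=0, name='auth', value='abc123',
    port=None, port_specified=False,
    domain='www.example.com', domain_specified=False, domain_initial_dot=False,
    path='/', path_specified=True,
    secure=False,
    expires=int(time.time()) + 3600,  # one hour from now
    discard=False,                    # not a session cookie
    comment=None, comment_url=None, rest={},
)

jar = cookiejar.LWPCookieJar()
jar.set_cookie(cookie)

fd, path = tempfile.mkstemp(suffix='.lwp')
os.close(fd)
try:
    jar.save(path)                         # what storeCookie does
    loaded = cookiejar.LWPCookieJar(path)
    loaded.load()                          # what retrieveCookie does
finally:
    os.remove(path)

restored = [(c.name, c.value, c.domain) for c in loaded]
```

One thing to watch: by default save() skips session cookies (those marked discard) and expired ones. If the login cookie has no expiry date, save(file_path, ignore_discard=True) is needed to keep it.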

Overall
urllib2 is quite an in-depth library, and handles moderately complex behaviour reasonably well, especially when you use the ability to 'chain' handlers to build a custom opener, as I've done with the cookie handler. It also automates some of the mucking around, which is always nice, although it does add a trace of 'magic' to the code. For example, in the above code, there are actually two requests made: if the initial POST is successful, an HTTP 302 (Object moved) response is returned, with a Location header pointing to a new URI, which sets the authentication cookie. urllib2 sends the GET to this new location without prompting.
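The two-request behaviour can be watched with a toy server on localhost. This is only a sketch: the ToyLogin handler, its /login and /welcome paths, and the cookie value are all invented stand-ins for the real site, which presumably behaves along the same lines.

```python
import threading

try:  # Python 2 module names, as used in the post
    import urllib2 as request
    import cookielib as cookiejar
    from urllib import urlencode
    from BaseHTTPServer import BaseHTTPRequestHandler, HTTPServer
except ImportError:  # the same modules under their Python 3 names
    import urllib.request as request
    import http.cookiejar as cookiejar
    from urllib.parse import urlencode
    from http.server import BaseHTTPRequestHandler, HTTPServer

seen = []  # (method, path) of every request the toy server receives


class ToyLogin(BaseHTTPRequestHandler):
    """Hypothetical stand-in for the site: POST /login redirects to /welcome."""

    def do_POST(self):
        seen.append(('POST', self.path))
        self.rfile.read(int(self.headers.get('Content-Length', 0)))
        self.send_response(302)
        self.send_header('Location', '/welcome')
        self.send_header('Content-Length', '0')
        self.end_headers()

    def do_GET(self):
        seen.append(('GET', self.path))
        self.send_response(200)
        self.send_header('Set-Cookie', 'auth=abc123; Path=/')
        self.send_header('Content-Length', '2')
        self.end_headers()
        self.wfile.write(b'ok')

    def log_message(self, *args):
        pass  # keep the demo quiet


server = HTTPServer(('127.0.0.1', 0), ToyLogin)  # port 0: any free port
worker = threading.Thread(target=server.serve_forever)
worker.daemon = True
worker.start()

jar = cookiejar.CookieJar()
opener = request.build_opener(
    request.ProxyHandler({}),          # no proxies: talk to localhost directly
    request.HTTPCookieProcessor(jar),
)
data = urlencode([('id', 'user'), ('pwd', 'secret'), ('dologin', 'yes')])
url = 'http://127.0.0.1:%d/login' % server.server_port
try:
    response = opener.open(url, data.encode('ascii'))  # one call from our side
    final_url = response.geturl()
    response.close()
finally:
    server.shutdown()
    server.server_close()

cookie_names = [c.name for c in jar]
```

After the single open() call, seen holds the POST to /login followed by a GET to /welcome, final_url points at /welcome, and the jar contains the auth cookie that only the second response set.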

While the documentation is a bit bare, it's still more fleshed out than an auto-generated API document. That said, needs work, needs work.

The code feels reasonably concise, and fairly self-documenting. The only part that could be confusing is how the cookie is getting into the CookieJar. (The opener's HTTPCookieProcessor stuck it there.)

Click here for Part 3.2, the Scheme version

1 comment:

Lawrence Oluyede said...

Have look at these:
http://www.voidspace.org.uk/python/articles/cookielib.shtml
http://www.voidspace.org.uk/python/articles/urllib2.shtml