Python HTTP Client using urllib2

More on downloading HTTP URLs using urllib2.

“I like the night. Without the dark, we’d never see the stars.” ― Stephenie Meyer, Twilight

1. Introduction

Python 2 provides the well-regarded urllib2 module for opening URLs (in Python 3 its functionality lives in urllib.request and urllib.error). This article walks through some of the module's capabilities. Most use cases are better served by the higher-level Requests package, but it is still worth knowing what the standard library offers.

2. Opening a URL

The urlopen() function accepts a URL and opens it, returning a response object. This is a file-like object, so we can use read() on it.

import urllib2
print urllib2.urlopen('http://httpbin.org/uuid').read()
# prints
{
  "uuid": "9a06b604-9f2c-4993-a981-a687d2795152"
}

The response object also provides geturl(), which returns the URL that was actually retrieved, and getcode(), which returns the HTTP status code of the response.

r = urllib2.urlopen('http://amazon.com')
print r.geturl(), '=>', r.getcode()
s = r.read()
print 'Read', len(s), ' chars:', s[:20], '...'
# prints
https://www.amazon.com/ => 200
Read 386498  chars: <!doctype html><html ...

Notice that, in the above request, the HTTP URL has been redirected to HTTPS. The urlopen() function follows redirects (301, 302, 303 and 307) automatically, and geturl() reports the final URL.
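To watch the redirect handling without depending on an external site, here is a minimal sketch that starts a throwaway local server and requests a URL that answers with a 302. The handler class, the /old and /new paths, and the function name are made up for this illustration. The Python 3 import fallback and the ProxyHandler({}) argument (which only keeps the local request away from any configured proxy) are assumptions added so the sketch runs in more environments.

```python
import threading

try:  # Python 2, as used in this article
    import urllib2
    from BaseHTTPServer import BaseHTTPRequestHandler, HTTPServer
except ImportError:  # Python 3 fallback (assumption: same API under new names)
    import urllib.request as urllib2
    from http.server import BaseHTTPRequestHandler, HTTPServer


class RedirectDemo(BaseHTTPRequestHandler):  # hypothetical demo handler
    def do_GET(self):
        if self.path == '/old':
            # Answer with a 302 that points at /new.
            self.send_response(302)
            self.send_header('Location', '/new')
            self.send_header('Content-Length', '0')
            self.end_headers()
        else:
            body = b'arrived'
            self.send_response(200)
            self.send_header('Content-Length', str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, *args):  # keep the demo output quiet
        pass


def fetch_following_redirect():
    server = HTTPServer(('127.0.0.1', 0), RedirectDemo)  # port 0 = any free port
    worker = threading.Thread(target=server.serve_forever)
    worker.daemon = True
    worker.start()
    try:
        # Same redirect handling as urlopen(); ProxyHandler({}) bypasses proxies.
        opener = urllib2.build_opener(urllib2.ProxyHandler({}))
        r = opener.open('http://127.0.0.1:%d/old' % server.server_port)
        return r.geturl(), r.getcode(), r.read()
    finally:
        server.shutdown()
        server.server_close()
```

Calling fetch_following_redirect() requests /old, yet the returned URL ends in /new and the status code is 200: the 302 itself is never seen by the caller.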

3. Response Headers

What if you would like to get the headers from the response? The response.info() method returns a mimetools.Message object that holds the headers. Looping over this object yields the available header names.

r = urllib2.urlopen('http://httpbin.org/uuid')
hdr = r.info()
for x in hdr:
    print x, '=>', hdr.getheader(x)
# prints
content-length => 53
x-processed-time => 0.000524044036865
x-powered-by => Flask
server => meinheld/0.6.1
connection => close
via => 1.1 vegur
access-control-allow-credentials => true
date => Mon, 05 Feb 2018 05:41:46 GMT
access-control-allow-origin => *
content-type => application/json

The header object also behaves like a dictionary so you can access individual headers like this:

print hdr['content-type']
# prints
application/json
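The dictionary-style access is case-insensitive, and get() accepts a default for headers that are absent. The sketch below checks this against a throwaway local server so that the header values are known in advance. The handler class, the X-Demo header, and the function name are invented for the illustration, and the Python 3 fallback and ProxyHandler({}) are assumptions added for portability.

```python
import threading

try:  # Python 2, as used in this article
    import urllib2
    from BaseHTTPServer import BaseHTTPRequestHandler, HTTPServer
except ImportError:  # Python 3 fallback (assumption: same API under new names)
    import urllib.request as urllib2
    from http.server import BaseHTTPRequestHandler, HTTPServer


class HeaderDemo(BaseHTTPRequestHandler):  # hypothetical demo handler
    def do_GET(self):
        body = b'{}'
        self.send_response(200)
        self.send_header('Content-Type', 'application/json')
        self.send_header('X-Demo', 'yes')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo output quiet
        pass


def read_headers():
    server = HTTPServer(('127.0.0.1', 0), HeaderDemo)  # port 0 = any free port
    worker = threading.Thread(target=server.serve_forever)
    worker.daemon = True
    worker.start()
    try:
        opener = urllib2.build_opener(urllib2.ProxyHandler({}))  # bypass proxies
        r = opener.open('http://127.0.0.1:%d/' % server.server_port)
        hdr = r.info()
        r.close()
        # Lookups ignore case; get() returns the default for a missing header.
        return hdr['content-type'], hdr['CONTENT-TYPE'], hdr.get('x-missing', 'n/a')
    finally:
        server.shutdown()
        server.server_close()
```

read_headers() returns the same value for both spellings of the header name, followed by the 'n/a' default.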

4. POSTing data

To perform an HTTP POST, pass in the data as the second argument. By default urllib2 sends it with a Content-Type of application/x-www-form-urlencoded, so the data should be in that format.

r = urllib2.urlopen('http://httpbin.org/post', 'search=hello+world')
print r.geturl(), '=>', r.getcode()
s = r.read()
print 'Read', len(s), ' chars:', s
# prints
http://httpbin.org/post => 200
Read 416  chars: {
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "search": "hello world"
  },
  "headers": {
    "Accept-Encoding": "identity",
    "Connection": "close",
    "Content-Length": "18",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "User-Agent": "Python-urllib/2.7"
  },
  "json": null,
  "origin": "157.49.184.92",
  "url": "http://httpbin.org/post"
}

To properly encode the data, use urllib.urlencode() (note that it lives in urllib, not urllib2). Pass in a dict of parameters and get back a string that can be passed directly as the data argument.

import urllib
d = {'search': 'hello world'}
print 'encoded:', urllib.urlencode(d)
r = urllib2.urlopen('http://httpbin.org/post', urllib.urlencode(d))
print r.geturl(), '=>', r.getcode()
s = r.read()
print 'Read', len(s), ' chars:', s
# prints
encoded: search=hello+world
http://httpbin.org/post => 200
Read 416  chars: {
  "args": {},
...
}
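As a small offline illustration of what urlencode() does with several parameters, the sketch below encodes a list of pairs (which keeps the order predictable) and a list-valued parameter. The parameter names are invented for the example, and the Python 3 import fallback is an assumption added for portability.

```python
try:  # Python 2, as used in this article
    from urllib import urlencode
except ImportError:  # Python 3 fallback
    from urllib.parse import urlencode

# A list of pairs keeps the parameters in a predictable order.
pairs = [('search', 'hello world'), ('lang', 'en&fr'), ('page', 2)]
query = urlencode(pairs)
print(query)  # search=hello+world&lang=en%26fr&page=2

# doseq=True expands a list value into one key=value pair per item.
tags = urlencode([('tag', ['red', 'blue'])], doseq=True)
print(tags)  # tag=red&tag=blue
```

Spaces become "+" and reserved characters such as "&" are percent-escaped, so the values survive the trip to the server unchanged.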

5. Uploading a File

Uploading a file is more involved because the data must be encoded in the multipart/form-data format. Since this format is more complex, we use an external module called poster, which handles the encoding details for us.

We also use the urllib2.Request class to be able to set the headers correctly. This class is discussed more fully below.

The code below uploads an image file after properly encoding it.

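from poster.encode import multipart_encode
from poster.streaminghttp import register_openers
# Install poster's streaming handlers so urllib2 can send the encoded body.
register_openers()
# Open the image in binary mode so its bytes are not altered.
with open("image.jpg", "rb") as f:
    data, headers = multipart_encode({"file": f})
    req = urllib2.Request("http://httpbin.org/post", data, headers)
    r = urllib2.urlopen(req)
    print r.geturl(), '=>', r.getcode()
    s = r.read()
    print 'Read', len(s), ' chars:', s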
# prints
http://httpbin.org/post => 200
Read 65015  chars: {
  "args": {}, 
  "data": "", 
  "files": {
    "file": "data:image/jpeg;base64,/9j/4..."
  }, 
  "form": {}, 
  "headers": {
    "Accept-Encoding": "identity", 
    "Connection": "close", 
    "Content-Length": "48588", 
    "Content-Type": "multipart/form-data; boundary=ac01e827dc254f1ba5c22d8dc5c05cc0", 
    "Host": "httpbin.org", 
    "User-Agent": "Python-urllib/2.7"
  }, 
  "json": null, 
  ...
}

A simpler way to handle file uploads is offered by the Requests package. We will cover the use of this package in a future article. For now, you should know that file uploads are possible using the urllib2 module (though the documentation leaves out the details).
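For readers curious about what a multipart/form-data body looks like, here is a minimal standard-library sketch that encodes a single file held in memory. It is not how poster is implemented (poster streams the file rather than building the whole body), the function name and arguments are invented for this illustration, and it assumes ASCII field and file names.

```python
import uuid


def encode_multipart(field_name, filename, content,
                     content_type='application/octet-stream'):
    """Return (body, headers) for a single-file multipart/form-data upload."""
    boundary = uuid.uuid4().hex  # random separator unlikely to occur in the file
    head = (
        '--%s\r\n'
        'Content-Disposition: form-data; name="%s"; filename="%s"\r\n'
        'Content-Type: %s\r\n'
        '\r\n' % (boundary, field_name, filename, content_type)
    ).encode('utf-8')
    tail = ('\r\n--%s--\r\n' % boundary).encode('utf-8')
    body = head + content + tail  # content must already be bytes
    headers = {
        'Content-Type': 'multipart/form-data; boundary=%s' % boundary,
        'Content-Length': str(len(body)),
    }
    return body, headers


# Stand-in bytes instead of a real image so the sketch runs without any file.
body, headers = encode_multipart('file', 'image.jpg', b'\xff\xd8 fake jpeg',
                                 'image/jpeg')
print(headers['Content-Type'].split(';')[0])  # multipart/form-data
```

The resulting body and headers could then be handed to urllib2.Request() in the same way as the values returned by multipart_encode() above.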

6. Setting Request Headers

While the URL can be directly passed to urllib2.urlopen() for simple requests, you need to create an instance of the urllib2.Request class for setting request headers.

Here is a simple example.

req = urllib2.Request('http://httpbin.org/post', '{"args": "hello world"}')
req.add_header('Content-Type', 'application/json')
r = urllib2.urlopen(req)
print r.geturl(), '=>', r.getcode()
s = r.read()
print '1. Read', len(s), ' chars:', s
# prints
http://httpbin.org/post => 200
1. Read 422  chars: {
  "args": {},
  "data": "{\"args\": \"hello world\"}",
  "files": {},
  "form": {},
  "headers": {
    "Accept-Encoding": "identity",
    "Connection": "close",
    "Content-Length": "23",
    "Content-Type": "application/json",
    "Host": "httpbin.org",
    "User-Agent": "Python-urllib/2.7"
  },
  "json": {
    "args": "hello world"
  },
  ...
}
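Headers can also be supplied up front through the headers argument of Request, and the object can be inspected before anything is sent. The sketch below does only that and makes no network call. The extra Accept header is just an example, and the Python 3 import fallback is an assumption added for portability.

```python
import json

try:  # Python 2, as used in this article
    import urllib2
except ImportError:  # Python 3 fallback
    import urllib.request as urllib2

payload = json.dumps({'args': 'hello world'}).encode('utf-8')
req = urllib2.Request(
    'http://httpbin.org/post',
    payload,
    headers={'Content-Type': 'application/json', 'Accept': 'application/json'})

# Request stores header names capitalized ('Content-type'), so use that spelling.
print(req.get_header('Content-type'))  # application/json
print(req.has_header('Accept'))        # True
print(req.get_method())                # POST
```

Note that get_header() is an exact-match lookup on the stored name, so asking for 'Content-Type' returns None even though the header will be sent.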

7. Save and Restore Cookies

Python provides a module, cookielib, which helps with managing cookies. A few extra steps are required when opening a URL where cookies need to be stored and sent back.

  1. Create a cookie jar.
    import cookielib
    cj = cookielib.CookieJar()
    
  2. Associate the cookie jar with an opener.
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    
  3. Create a Request object for a URL and open it using the opener. This is a URL which sets some cookies.
    req = urllib2.Request('http://httpbin.org/cookies/set?animal=cat&sound=meow')
    rsp = opener.open(req)
    
  4. Process the output from the URL, including the status code.
    print rsp.geturl(), '=>', rsp.getcode()
    s = rsp.read()
    print '1. Read', len(s), ' chars:', s
    # prints
    http://httpbin.org/cookies => 200
    1. Read 65  chars: {
    "cookies": {
      "animal": "cat",
      "sound": "meow"
    }
    }
    
  5. List the cookies set in the cookie jar.
    for x in cj:
      print x
    # prints
    <Cookie animal=cat for httpbin.org/>
    <Cookie sound=meow for httpbin.org/>
    
  6. Now create another Request object where the cookies are expected by the server and open it using the same opener. Note that we are using the same cookie jar as in the previous request.
    req = urllib2.Request('http://httpbin.org/cookies')
    rsp = opener.open(req)
    
  7. Process the output from the server, noting that the server has indeed processed the cookies.
    print rsp.geturl(), '=>', rsp.getcode()
    s = rsp.read()
    print '2. Read', len(s), ' chars:', s
    # prints
    http://httpbin.org/cookies => 200
    2. Read 65  chars: {
    "cookies": {
      "animal": "cat",
      "sound": "meow"
    }
    }
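The steps above keep the cookies in memory for the lifetime of the jar. If they should survive between runs, cookielib also offers file-backed jars such as LWPCookieJar. The sketch below saves and reloads a jar without any network access. The cookies are built by hand only so that it runs offline (normally opener.open() fills the jar), and the helper name and file name are invented for the illustration.

```python
import os
import tempfile

try:  # Python 2, as used in this article
    import cookielib
except ImportError:  # Python 3 fallback
    import http.cookiejar as cookielib


def make_cookie(name, value, domain='httpbin.org'):
    # Hand-built session cookie, standing in for one set by a server.
    return cookielib.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path='/', path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={})


path = os.path.join(tempfile.mkdtemp(), 'cookies.txt')  # hypothetical file name

jar = cookielib.LWPCookieJar(path)
jar.set_cookie(make_cookie('animal', 'cat'))
jar.set_cookie(make_cookie('sound', 'meow'))
# Session cookies (discard=True) are skipped unless ignore_discard is set.
jar.save(ignore_discard=True)

restored = cookielib.LWPCookieJar(path)
restored.load(ignore_discard=True)
names = sorted(c.name for c in restored)
print(names)  # ['animal', 'sound']
```

The restored jar can be passed to urllib2.HTTPCookieProcessor() exactly as in step 2.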
    

8. How to do a PUT or a DELETE?

HTTP PUT is a request type which is similar to POST in that it includes data along with the request. It is used commonly in REST APIs with the following semantics:

  • HTTP POST is used to mean “create a new resource with this data”.
  • HTTP PUT is used to mean “update an existing resource with this data”.

Note that this is a mere convention; there is no requirement that these requests have to be used this way.

Let us now see how to perform a PUT using urllib2. Normally, when you include data with the request, urllib2 sends it as a POST. To send a PUT instead, override get_method on the request so that it returns 'PUT':

req = urllib2.Request('http://httpbin.org/put', '{"customerId": 455, "firstName": "Dan", "lastName": "Smith"}')
req.add_header('Content-Type', 'application/json')
req.get_method = lambda: 'PUT'
r = urllib2.urlopen(req)
print r.geturl(), '=>', r.getcode()
s = r.read()
print 'Read', len(s), ' chars:', s
# prints
http://httpbin.org/put => 200
Read 511  chars: {
  "args": {},
  "data": "{\"customerId\": 455, \"firstName\": \"Dan\", \"lastName\": \"Smith\"}",
  "files": {},
  "form": {},
  "headers": {
    "Accept-Encoding": "identity",
    "Connection": "close",
    "Content-Length": "60",
    "Content-Type": "application/json",
    "Host": "httpbin.org",
    "User-Agent": "Python-urllib/2.7"
  },
  "json": {
    "customerId": 455,
    "firstName": "Dan",
    "lastName": "Smith"
  },
  "origin": "157.49.184.56",
  "url": "http://httpbin.org/put"
}

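A DELETE request is handled in urllib2 in a similar way. A DELETE request normally means “delete the resource identified by the request”. REST APIs usually identify that resource by its URL; the example below also sends a body, mirroring the PUT above.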

req = urllib2.Request('http://httpbin.org/delete', '{"customerId": 455, "firstName": "Dan", "lastName": "Smith"}')
req.add_header('Content-Type', 'application/json')
req.get_method = lambda: 'DELETE'
r = urllib2.urlopen(req)
print r.geturl(), '=>', r.getcode()
s = r.read()
print 'Read', len(s), ' chars:', s
# prints
http://httpbin.org/delete => 200
Read 514  chars: {
  "args": {},
  "data": "{\"customerId\": 455, \"firstName\": \"Dan\", \"lastName\": \"Smith\"}",
  "files": {},
  "form": {},
...
}
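If the get_method override is needed in several places, one option is to wrap it in a small Request subclass. The class below is a hypothetical helper written for this article, not part of urllib2, and the Python 3 import fallback is an assumption added for portability. The sketch only builds requests and makes no network call.

```python
try:  # Python 2, as used in this article
    import urllib2
except ImportError:  # Python 3 fallback
    import urllib.request as urllib2


class MethodRequest(urllib2.Request):
    """Request whose HTTP verb is chosen by the caller (hypothetical helper)."""

    def __init__(self, url, data=None, headers=None, verb=None):
        self._verb = verb
        urllib2.Request.__init__(self, url, data, headers or {})

    def get_method(self):
        # Fall back to the default: POST when data is present, GET otherwise.
        return self._verb or urllib2.Request.get_method(self)


put = MethodRequest('http://httpbin.org/put', b'{"customerId": 455}',
                    {'Content-Type': 'application/json'}, verb='PUT')
print(put.get_method())  # PUT
```

An instance is passed to urllib2.urlopen() like any other Request object.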

9. Use urllib2 to Download a File with a Progress Monitor

How can you download a large file using urllib2 without holding the entire file in memory (which is what a plain read() does)? You can read and write the file in chunks as follows:

r = urllib2.urlopen(<large file URL here>)
sz = 2048
# Binary mode keeps the downloaded bytes intact on every platform.
with open('file.dat', 'wb') as f:
    n = 0
    while True:
        s = r.read(sz)
        if not s:
            break
        n += len(s)
        f.write(s)
        print '\r{0:10} bytes'.format(n),
    print '.. done.'
# prints
   5266467 bytes .. done.
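The chunked loop can be pulled into a reusable function that works with any file-like source. The sketch below is one way to do it. The function name and the report callback are invented for this illustration, and io.BytesIO objects stand in for the response and the output file so that it runs offline.

```python
import io


def copy_in_chunks(src, dst, chunk_size=2048, total=None, report=None):
    """Copy src to dst chunk by chunk; return the number of bytes copied."""
    copied = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:  # an empty read means the source is exhausted
            break
        dst.write(chunk)
        copied += len(chunk)
        if report is not None:
            report(copied, total)
    return copied


# Offline stand-ins for the urllib2 response and the output file.
source = io.BytesIO(b'x' * 5000)
target = io.BytesIO()
progress = []
n = copy_in_chunks(source, target, chunk_size=2048, total=5000,
                   report=lambda done, total: progress.append(done))
print('%d bytes in %d chunks' % (n, len(progress)))  # 5000 bytes in 3 chunks
```

With a real download, src would be the object returned by urllib2.urlopen() and dst a file opened in 'wb' mode. If the server sends a Content-Length header, its value could be passed as total so that the callback can show a percentage.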

Conclusion

This article presented some tips and tricks for using urllib2. We covered 1) opening simple HTTP URLs, 2) reading response headers, 3) working with POST data, 4) uploading a file using multipart/form-data, 5) setting request headers, 6) saving and restoring cookies, 7) performing PUT and DELETE requests, and 8) downloading a file in chunks with progress monitoring. If you need help with more issues, please leave a comment below.
