Getting Started with Scrapy

Learn how to use Python Scrapy to extract information from websites.

“When life gives you lemons, chunk it right back.”
― Bill Watterson

1. Introduction

Scrapy is a Python-based web crawler which can be used to extract information from websites. It is fast, simple, and can navigate pages just like a browser can.

However, note that it is not suitable for websites and apps which use JavaScript to manipulate the user interface. Scrapy loads just the HTML. It has no facilities to execute JavaScript that a website might use to tailor the user's experience.

2. Installation

We use virtualenv to install Scrapy. This allows us to install Scrapy without affecting other system-installed modules.

Create a working directory and initialize a virtual environment in that directory.

mkdir working
cd working
virtualenv venv
. venv/bin/activate

Install Scrapy now.

pip install scrapy

Check that it is working. The following output shows the version of Scrapy as 1.4.0.

scrapy
# prints
Scrapy 1.4.0 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
...

3. Writing a Spider

Scrapy works by loading a spider, a Python class which inherits from scrapy.Spider.

Let us write a simple Spider class to load the top posts from Reddit.

To begin with, create a file called redditspider.py and add the following to it. This is a complete spider class, though one which does not yet do anything useful. A spider class requires, at a minimum, the following:

  • a name identifying the spider
  • a start_urls list variable containing the URLs from which to begin crawling.
  • a parse() method, which can be a no-op as shown

import scrapy

class redditspider(scrapy.Spider):
    name = 'reddit'
    start_urls = ['https://www.reddit.com/']

    def parse(self, response):
        pass

This class can now be executed as follows:

scrapy runspider redditspider.py

# prints
...
2017-06-16 10:42:34 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-06-16 10:42:34 [scrapy.core.engine] INFO: Spider opened
2017-06-16 10:42:34 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
...

4. Turn Off Logging

As you can see, this spider runs and prints a bunch of messages which can be useful for debugging. However, since they obscure the output of our program, let us turn them off for now.

Add these lines to the beginning of the file.

import logging
logging.getLogger('scrapy').setLevel(logging.WARNING)

Now when we run the spider, those messages should no longer clutter the output.
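
Scrapy also lets a spider override settings for itself through a custom_settings class attribute. As a minimal sketch (depending on the Scrapy version, a few startup messages may still appear before the setting takes effect), the same spider could carry the log level itself:

class redditspider(scrapy.Spider):
    name = 'reddit'
    start_urls = ['https://www.reddit.com/']

    # Per-spider setting: only log warnings and above.
    custom_settings = {'LOG_LEVEL': 'WARNING'}

    def parse(self, response):
        pass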

5. Parsing the Response

Let us now parse the response returned by the crawl. This is done in the parse() method. In this method, we use response.css() to perform CSS-style selections on the HTML and extract the required elements.

To identify the CSS selections to extract, we use Chrome's DOM Inspector tool to pick the elements. On Reddit's front page, we see that each post is wrapped in a <div class="thing">...</div>.

So we select every div.thing on the page and work with those elements further.

def parse(self, response):
    for element in response.css('div.thing'):
        pass

We also implement the following helper methods within the spider class to extract the required text.

The following method extracts all text from an element as a list, joins the elements with a space and strips away the leading and trailing whitespace from the result.

def a(self, response, cssSel):
    return ' '.join(response.css(cssSel).extract()).strip()

And this method extracts text from the first element and returns it.

def f(self, response, cssSel):
    return response.css(cssSel).extract_first()
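
To see what these two helpers do in isolation, here is a minimal sketch using a scrapy.Selector built from an inline HTML snippet (the snippet and its class name are invented for illustration):

from scrapy import Selector

# A tiny made-up HTML fragment standing in for a real response.
sel = Selector(text='<div><p class="title">Hello</p><p class="title">World</p></div>')

# Like a(): join all matching text nodes with a space, then strip.
print(' '.join(sel.css('p.title::text').extract()).strip())  # 'Hello World'

# Like f(): take only the first matching text node.
print(sel.css('p.title::text').extract_first())  # 'Hello'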

6. Extracting Required Elements

Once these helper methods are in place, let us extract the title from each Reddit post. Within div.thing, the title is available at div.entry>p.title>a.title::text. As mentioned before, the CSS selection for the required elements can be determined from any browser's DOM Inspector.

def parse(self, resp):
    for e in resp.css('div.thing'):
        yield {
            'title': self.a(e,'div.entry>p.title>a.title::text'),
        }

The results are returned to the caller using Python's yield statement. The way yield works is as follows: calling a function which contains a yield statement returns a generator to the caller. The caller iterates over this generator, receiving one result per step, until the generator is exhausted.

In our case, the parse() method yields a dictionary containing a single key (title) on each iteration, until the list of div.thing elements is exhausted.
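
To make the behaviour of yield concrete, here is a tiny self-contained sketch, unrelated to the spider itself:

def count_up_to(n):
    # The presence of yield makes calling this function return a generator.
    for i in range(n):
        yield i

gen = count_up_to(3)
print(list(gen))  # prints [0, 1, 2]; the generator is then exhausted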

7. Running the Spider and Collecting Output

Let us now run the spider again. A part of the copious output is shown below (after reinstating the log statements).

scrapy runspider redditspider.py
# prints
...
2017-06-16 11:35:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.reddit.com/>
{'title': u'The Plight of a Politician'}
2017-06-16 11:35:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.reddit.com/>
{'title': u'Elephants foot compared to humans foot'}
...

It is hard to pick out the real output from the log messages. Let us write the output to a file (posts.json) instead.

scrapy runspider redditspider.py -o posts.json

And here is a part of posts.json.

...
{"title": "They got fit together"},
{"title": "Not all heroes wear capes"},
{"title": "This sub"},
{"title": "So I picked this up at a flea market.."},
...
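
The file can now be consumed by downstream code. A minimal sketch of reading it back, assuming the feed was written in one run as a single JSON array (which is what Scrapy's JSON exporter produces):

import json

# Hypothetical downstream consumer of the file produced by -o posts.json.
with open('posts.json') as fp:
    posts = json.load(fp)

for post in posts:
    print(post['title'])

Note that with the Scrapy version used here, -o appends to an existing file, which can leave posts.json as invalid JSON across runs, so it is worth deleting the file between runs.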

8. Extract All Required Information

Let us also extract the subreddit name and the number of votes for each post. To do that, we just add the corresponding fields to the dictionary yielded by parse().

def parse(self, response):
    for e in response.css('div.thing'):
        yield {
            'title': self.a(e,'div.entry>p.title>a.title::text'),
            'votes': self.f(e,'div.score.likes::attr(title)'),
            'subreddit': self.a(e,'div.entry>p.tagline>a.subreddit::text'),
        }

The resulting posts.json:

...
{"votes": "28962", "title": "They got fit together", "subreddit": "r/pics"},
{"votes": "6904", "title": "My puppy finally caught his Stub", "subreddit": "r/funny"},
{"votes": "3925", "title": "Reddit, please find this woman who went missing during E3!", "subreddit": "r/NintendoSwitch"},
{"votes": "30079", "title": "Yo-Yo Skills", "subreddit": "r/gifs"},
{"votes": "2379", "title": "For every upvote I won't smoke for a day", "subreddit": "r/stopsmoking"},
...
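
Since votes comes back as a string, and f() returns None when nothing matches, downstream code may want to convert it defensively. A small hypothetical helper:

def to_int(value, default=0):
    # votes arrives as a string and may be None if the score was not found.
    try:
        return int(value)
    except (TypeError, ValueError):
        return default

print(to_int("28962"))  # 28962
print(to_int(None))     # 0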

Conclusion

This article provided a basic overview of how to extract information from websites using Scrapy. To use Scrapy, we write a spider module which instructs Scrapy to crawl a website and extract structured information from it. This information can then be written out in JSON format for consumption by downstream software.
