PCrawler: a portable Python Web crawler

PCrawler is a suite of Python modules for building network graphs by crawling the World Wide Web.  These webgraphs represent the connectivity of information linking one web site to another.  Vertices are distinct pages (URLs), and a (directed) edge exists between two vertices if there is a hyperlink connecting one to the other.  While there are many programs designed to crawl the web and collect information, this toolkit provides a simple approach that is tailored to extracting topological information and can be fine-tuned to produce specific reports.  Each page visited is time-stamped and receives a unique hash-code value, so that aliases may be easily identified.  The crawler can also be used as a diagnostic tool, reporting broken links and invalid HTML pages within an organization.
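The alias detection mentioned above can be illustrated with a short sketch: pages fetched under different URLs that hash to the same checksum are the same document.  The use of SHA-1 via hashlib here is an assumption for illustration; PCrawler's exact digest may differ.

```python
import hashlib

def page_checksum(page_contents):
    """Return a hex SHA-1 digest of a page's contents."""
    return hashlib.sha1(page_contents.encode('utf-8')).hexdigest()

# two fetches of identical content yield the same checksum,
# so the second URL can be flagged as an alias of the first
a = page_checksum('<html><body>hello</body></html>')
b = page_checksum('<html><body>hello</body></html>')
assert a == b
```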


PCrawler is useful for building closed webgraphs of individual domains and sub-domains.  For example, the command

python modular_crawler.py math.nist.gov http://math.nist.gov/tnt 3

finds three pages under math.nist.gov/tnt, following links only within the math.nist.gov domain.  The output is in the format:

*  page header (date/time, size, URL)
#  url alias
SHA checksum
url link 1
url link 2

where the page header is
* 11292:42229:2009-11-13/20:22:35 http://www.foo.com/bar/...
    |      |      |         |                 |
  page   page    date      time              url
 number  size
followed by an optional alias (if one is found), a unique SHA checksum based on the page's contents, and the links found on that page, one per line.  For example,
*  1:3948:2010-12-03/20:11:09 http://math.nist.gov/tnt
#  http://math.nist.gov/tnt/
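A record in the format above is easy to consume from a script.  The following parser is an illustrative sketch (not part of PCrawler), written against the documented layout: a `*` header line, an optional `#` alias line, a checksum line, and then one link per line.

```python
def parse_record(lines):
    """Parse one crawl record (list of lines) into a dict."""
    # header: "* number:size:date/time url"
    meta, url = lines[0].lstrip('* ').split(None, 1)
    number, size, stamp = meta.split(':', 2)   # time also contains ':'
    record = {'number': int(number), 'size': int(size),
              'timestamp': stamp, 'url': url,
              'alias': None, 'sha': None, 'links': []}
    rest = lines[1:]
    if rest and rest[0].startswith('#'):       # optional url alias
        record['alias'] = rest[0].lstrip('# ')
        rest = rest[1:]
    if rest:                                   # SHA checksum line
        record['sha'] = rest[0]
        rest = rest[1:]
    record['links'] = rest                     # remaining lines are links
    return record
```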


The design separates the crawler (fetching the web pages) from the processing part (deciding what to do with the information).  Generating a webgraph is only one such option; typically, people are interested in mining this information for specific purposes.  The framework for the webcrawler, then, is simply to serve up the pages that have been requested.  We wish to keep the internal details of the crawler (and its thread-based version) out of the main program.  This is achieved by relying on an interface function with the following signature
links_to_follow = process_webpage(url, canonical_url, page_contents)
Here the function can maintain all the necessary data structures (via shared globals) with the main app.  This function is provided in the constructor of the webcrawler and will be called repeatedly, i.e.
       +---------------------------------------------+
       |                                             |
       v                                             |
   1. fetch a webpage from the to-visit list         |
   2. call process_webpage to analyze the page and   |
      determine new links to follow                  |
   3. put the new links on the to-visit list         |
       |                                             |
       +---------------------------------------------+
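The loop above can be sketched in a few lines of Python.  This is an illustrative stand-alone version, not PCrawler's actual API: the crawl() and fetch_page() names, the injectable fetch parameter, and the max_pages cutoff are all assumptions for the sketch.

```python
from urllib.request import urlopen

def fetch_page(url):
    """Fetch a URL; return (canonical_url, contents)."""
    with urlopen(url) as resp:
        return resp.geturl(), resp.read().decode('utf-8', errors='replace')

def crawl(process_webpage, seed_urls, fetch=fetch_page, max_pages=100):
    """Repeatedly fetch, process, and enqueue pages, as in the loop above."""
    to_visit = list(seed_urls)
    visited = set()
    while to_visit and len(visited) < max_pages:
        url = to_visit.pop(0)
        if url in visited:
            continue
        visited.add(url)
        try:
            canonical_url, contents = fetch(url)    # 1. fetch webpage
        except OSError:
            continue                                # skip broken links
        # 2. analyze the page, determine new links to follow
        new_links = process_webpage(url, canonical_url, contents)
        # 3. put new links on the to-visit list
        to_visit.extend(u for u in new_links if u not in visited)
    return visited
```

Injecting the fetch function keeps the loop testable without network access, which mirrors the document's goal of keeping crawler internals out of the main program.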
We can fine-tune the crawl with other options, such as stopping after processing a certain number of pages, but the basic gist is in the loop above.  The Python application, then, looks like
def process_webpage(url, canonical_url, page_contents):
    extract content & links, and manage information
    return new links to follow

process command-line args
read in from previous crawl, if necessary
modular_crawler(process_webpage, seed_urls)
post-process, if necessary
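The process_webpage body in the outline above might be fleshed out as follows.  This is a minimal sketch: the three-argument signature comes from the interface described earlier, but the html.parser-based link extraction, the shared webgraph dict, and the same-domain filter are illustrative assumptions, not PCrawler's implementation.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

webgraph = {}          # shared structure: url -> list of outgoing links

class LinkExtractor(HTMLParser):
    """Collect href targets from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

def process_webpage(url, canonical_url, page_contents):
    """Record this page's edges and return the links to follow next."""
    parser = LinkExtractor()
    parser.feed(page_contents)
    # resolve relative links against the canonical URL
    links = [urljoin(canonical_url, href) for href in parser.links]
    # stay within the starting domain, as in the command-line example
    domain = urlparse(canonical_url).netloc
    links = [u for u in links if urlparse(u).netloc == domain]
    webgraph[canonical_url] = links
    return links
```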
Here is a breakdown of the main components:

This software was developed at the National Institute of Standards and Technology (NIST) by employees of the Federal Government in the course of their official duties. Pursuant to title 17 Section 105 of the United States Code this software is not subject to copyright protection and is in the public domain. NIST assumes no responsibility whatsoever for its use by other parties, and makes no guarantees, expressed or implied, about its quality, reliability, or any other characteristic.

We would appreciate acknowledgement if the software is incorporated in redistributable libraries or applications.


Roldan Pozo