PCrawler: a portable Python Web crawler

PCrawler is a suite of Python modules for building network graphs by crawling the World Wide Web.  These webgraphs represent the connectivity of information linking one web site to another.  Vertices are distinct pages (URLs), and a (directed) edge exists between two vertices if there is a hyperlink connecting one to the other.  While there are many programs designed to crawl the web and collect information, this toolkit provides a simple approach that is tailored to extracting topological information and can be fine-tuned to produce specific reports.  Each page visited is time-stamped and receives a unique hash-code value, so that aliases may be easily identified.  The crawler can also be used as a diagnostic tool, reporting broken links and invalid HTML pages within an organization.
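The alias detection mentioned above can be illustrated with a short sketch: pages fetched under different URLs that hash to the same checksum are the same document.  The use of SHA-1 via hashlib here is an assumption for illustration; PCrawler's exact digest may differ.

```python
import hashlib

def page_checksum(page_contents):
    """Return a hex SHA-1 digest of a page's contents."""
    return hashlib.sha1(page_contents.encode('utf-8')).hexdigest()

# two fetches of identical content yield the same checksum,
# so the second URL can be flagged as an alias of the first
a = page_checksum('<html><body>hello</body></html>')
b = page_checksum('<html><body>hello</body></html>')
assert a == b
```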


PCrawler is useful for building closed webgraphs of individual domains and sub-domains.  For example, the command

python modular_crawler.py math.nist.gov http://math.nist.gov/tnt 3

finds three pages under math.nist.gov/tnt, following links only within the math.nist.gov domain.  The output is in the format:

*  page header (date/time, size, URL)
#  url alias
SHA checksum
url link 1
url link 2

where the page header is
* 11292:42229:2009-11-13/20:22:35 http://www.foo.com/bar/...
    |      |      |         |                 |
  page   page    date      time              url
 number  size
followed by an optional alias (if one is found), a unique SHA checksum based on the page's contents, and the links found on that page, one per line.  For example,
*  1:3948:2010-12-03/20:11:09 http://math.nist.gov/tnt
#  http://math.nist.gov/tnt/
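A record in the format above is easy to consume from a script.  The following parser is an illustrative sketch (not part of PCrawler), written against the documented layout: a `*` header line, an optional `#` alias line, a checksum line, and then one link per line.

```python
def parse_record(lines):
    """Parse one crawl record (list of lines) into a dict."""
    # header: "* number:size:date/time url"
    meta, url = lines[0].lstrip('* ').split(None, 1)
    number, size, stamp = meta.split(':', 2)   # time also contains ':'
    record = {'number': int(number), 'size': int(size),
              'timestamp': stamp, 'url': url,
              'alias': None, 'sha': None, 'links': []}
    rest = lines[1:]
    if rest and rest[0].startswith('#'):       # optional url alias
        record['alias'] = rest[0].lstrip('# ')
        rest = rest[1:]
    if rest:                                   # SHA checksum line
        record['sha'] = rest[0]
        rest = rest[1:]
    record['links'] = rest                     # remaining lines are links
    return record
```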


The design separates the crawler (fetching the web pages) from the processing part (deciding what to do with the information).  Generating a webgraph is only one such option; typically, people are interested in mining this information for specific purposes.  The framework for the webcrawler, then, is simply to serve up the pages that have been requested.  We wish to keep the internal details of the crawler (and its thread-based version) out of the main program.  This is achieved by relying on an interface function with the following signature
links_to_follow = process_webpage(url, canonical_url, page_contents)
Here the function can maintain all the necessary data structures (via shared globals) with the main app.  This function is provided in the constructor of the webcrawler and will be called repeatedly, i.e.
       +---------------------------------------------+
       |                                             |
       v                                             |
   1. fetch a webpage from the to-visit list         |
   2. call process_webpage to analyze the page and   |
      determine new links to follow                  |
   3. put the new links on the to-visit list         |
       |                                             |
       +---------------------------------------------+
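The loop above can be sketched in a few lines of Python.  This is an illustrative stand-alone version, not PCrawler's actual API: the crawl() and fetch_page() names, the injectable fetch parameter, and the max_pages cutoff are all assumptions for the sketch.

```python
from urllib.request import urlopen

def fetch_page(url):
    """Fetch a URL; return (canonical_url, contents)."""
    with urlopen(url) as resp:
        return resp.geturl(), resp.read().decode('utf-8', errors='replace')

def crawl(process_webpage, seed_urls, fetch=fetch_page, max_pages=100):
    """Repeatedly fetch, process, and enqueue pages, as in the loop above."""
    to_visit = list(seed_urls)
    visited = set()
    while to_visit and len(visited) < max_pages:
        url = to_visit.pop(0)
        if url in visited:
            continue
        visited.add(url)
        try:
            canonical_url, contents = fetch(url)    # 1. fetch webpage
        except OSError:
            continue                                # skip broken links
        # 2. analyze the page, determine new links to follow
        new_links = process_webpage(url, canonical_url, contents)
        # 3. put new links on the to-visit list
        to_visit.extend(u for u in new_links if u not in visited)
    return visited
```

Injecting the fetch function keeps the loop testable without network access, which mirrors the document's goal of keeping crawler internals out of the main program.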
We can fine-tune the crawl with other options, such as stopping after processing a certain number of pages, but the basic gist is in the loop above.  The Python application, then, looks like
def process_webpage(url, canonical_url, page_contents):
    extract content & links, and manage information
    return new links to follow

process command-line args
read in from previous crawl, if necessary
modular_crawler(process_webpage, seed_urls)
post-process, if necessary
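The process_webpage body in the outline above might be fleshed out as follows.  This is a minimal sketch: the three-argument signature comes from the interface described earlier, but the html.parser-based link extraction, the shared webgraph dict, and the same-domain filter are illustrative assumptions, not PCrawler's implementation.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

webgraph = {}          # shared structure: url -> list of outgoing links

class LinkExtractor(HTMLParser):
    """Collect href targets from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

def process_webpage(url, canonical_url, page_contents):
    """Record this page's edges and return the links to follow next."""
    parser = LinkExtractor()
    parser.feed(page_contents)
    # resolve relative links against the canonical URL
    links = [urljoin(canonical_url, href) for href in parser.links]
    # stay within the starting domain, as in the command-line example
    domain = urlparse(canonical_url).netloc
    links = [u for u in links if urlparse(u).netloc == domain]
    webgraph[canonical_url] = links
    return links
```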
Here is a breakdown of the main components:

This software was developed at the National Institute of Standards and Technology (NIST) by employees of the Federal Government in the course of their official duties. Pursuant to title 17 Section 105 of the United States Code this software is not subject to copyright protection and is in the public domain. NIST assumes no responsibility whatsoever for its use by other parties, and makes no guarantees, expressed or implied, about its quality, reliability, or any other characteristic.

We would appreciate acknowledgement if the software is incorporated in redistributable libraries or applications.


Roldan Pozo