finds three pages under math.nist.gov/tnt, following only pages under the math.nist.gov domain. The output is in the format:
    page header (date/time, size, URL)   # url alias
    SHA check sum
    url link 1
    url link 2
    ...

where page header is
    11292:42229:2009-11-13/20:22:35 http://www.foo.com/bar/...
      |     |       |         |      |
     page  page    date      time   url
    number size

followed by an optional alias, if one is found, a unique SHA check sum based on the page's contents, and the links found on that page, one per line. For example,
    1:3948:2010-12-03/20:11:09 http://math.nist.gov/tnt  # http://math.nist.gov/tnt/
    f2f952d3e9877e6983dfc6d580f7790c69638e27
    http://www.nist.gov/public_affairs/privacy.htm
    http://math.nist.gov/pozo
    http://math.nist.gov/tnt/download.html
    http://math.nist.gov/tnt/examples.html
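For illustration, the record format above can be read back with a small parser. This helper is not part of the crawler's API; the function name and the returned dictionary keys are hypothetical, chosen only to show the structure of one record (header line with optional "# alias", SHA line, then links one per line):

```python
def parse_record(lines):
    """Parse one crawl record: a header line ('pagenum:size:date/time URL',
    with an optional '# alias'), a SHA check sum line, then links, one per line."""
    header = lines[0]
    alias = None
    if "#" in header:
        header, alias = [s.strip() for s in header.split("#", 1)]
    meta, url = header.split()                 # e.g. '1:3948:2010-12-03/20:11:09'
    page_num, page_size, when = meta.split(":", 2)
    return {"page": int(page_num), "size": int(page_size), "when": when,
            "url": url, "alias": alias, "sha": lines[1].strip(),
            "links": [ln.strip() for ln in lines[2:] if ln.strip()]}
```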
    links_to_follow = process_webpage(url, canonical_url, page_contents)

Here the function can maintain all the necessary data structures (via shared global variables) with the main app. This function is provided in the constructor of the web crawler and will be called repeatedly, i.e.
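A minimal sketch of such a callback, using the signature above. The regex-based link extraction, the `visited` set, and the domain filter are illustrative assumptions, not part of the crawler's API:

```python
import re

visited = set()   # shared state with the main app (a module-level global here)

def process_webpage(url, canonical_url, page_contents):
    """Record this page as seen and return the new links to follow."""
    visited.add(canonical_url)
    # crude href extraction; a real callback might use html.parser instead
    links = re.findall(r'href="(http[^"]+)"', page_contents)
    # keep only unseen links under the starting domain (domain is an assumption)
    return [u for u in links
            if u.startswith("http://math.nist.gov") and u not in visited]
```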
    +------->----------+
    |                  |
    |                  v
    ^   1. fetch webpage from to-visit list
    |                  |
    |   2. call process_webpage to analyze page and
    ^      determine new links to follow
    |                  |
    |   3. put new links on the to-visit list
    ^                  |
    |                  v
    +-------<----------+

We can fine-tune with other options, like stopping after processing a certain number of pages, and so on, but the basic gist is in the loop above. The Python application, then, looks like
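The loop above can be sketched in a few lines of sequential Python. This is only an illustration of the control flow, not the real modular_crawler (which also parallelizes fetches); the `max_pages` stopping option and the injectable `fetch` function are assumptions added for clarity:

```python
from collections import deque
from urllib.request import urlopen

def modular_crawler(process_webpage, seed_urls, max_pages=100, fetch=None):
    """Breadth-first crawl: fetch, process, enqueue new links (sequential sketch)."""
    if fetch is None:
        fetch = lambda url: urlopen(url).read().decode("utf-8", "replace")
    to_visit = deque(seed_urls)
    seen = set(seed_urls)
    pages = 0
    while to_visit and pages < max_pages:
        url = to_visit.popleft()                      # 1. fetch from to-visit list
        try:
            page = fetch(url)
        except OSError:
            continue                                  # skip unreachable pages
        pages += 1
        for link in process_webpage(url, url, page):  # 2. analyze page
            if link not in seen:
                seen.add(link)
                to_visit.append(link)                 # 3. enqueue new links
    return pages
```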
    process_webpage():
        extract content & links, and manage information
        return new links to follow

    main():
        process command-line args
        read in from previous crawl, if necessary
        modular_crawler(process_webpage, seed_urls)
        post-process, if necessary

Here is the breakdown of the main components:
modular_crawler.py: the main generic crawler, which follows HTML links, possibly restricted to a domain. This version uses parallel threads to issue multiple simultaneous requests and streamline the URL fetching process.
meetup_crawler.py: this is an example of a modified crawler, specifically designed to mine the friendships (social network) of MeetUp.com members. It crawls only along friend-links, omitting unwanted pages. The result is that we generate a much smaller (but more meaningful) graph.
PCcrawler.py: an older stand-alone version of the crawler, now used to contain many of the helper functions. (See below.)
parallel.py: the parallel crawling engine, which uses the thread_pool model below to spawn multiple fetch operations. Python itself has a global interpreter lock (GIL) and runs in a single OS execution thread, but the crawler still benefits from fetching multiple locations simultaneously. (These types of applications tend to be I/O bound, rather than CPU bound.)
thread_pool.py: module that sets up and executes a group of parallel functions. For example, suppose the function foo() takes a single argument (possibly a list [arg1, arg2, ... argN]) and returns a single value (which could also be a list [r1, r2, ... rM]), i.e.

    [r1, r2, ... rM] = foo([arg1, arg2, ... argN])

Then

    num_threads = 16
    P = thread_pool.thread_pool(num_threads, foo)

sets up a 16-thread pool of parallel foo's, ready for execution. To call these, use the eval() and result() methods. For example, in the simple case where foo is just a real-valued function, we could write

    for i in range(0, 100):
        P.eval(i)
    # ...
    for i in range(0, 100):
        res = P.result()
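A minimal sketch of a pool with this eval()/result() interface, built on the standard library's queue and threading modules. This is an assumption about how such a pool could work, not the actual contents of thread_pool.py; note that with multiple workers, results may come back in a different order than the arguments were queued:

```python
import queue
import threading

class thread_pool:
    """N worker threads all applying the same function to queued arguments."""

    def __init__(self, num_threads, func):
        self.tasks = queue.Queue()
        self.results = queue.Queue()
        for _ in range(num_threads):
            t = threading.Thread(target=self._worker, args=(func,), daemon=True)
            t.start()

    def _worker(self, func):
        while True:                        # daemon threads exit with the program
            arg = self.tasks.get()
            self.results.put(func(arg))
            self.tasks.task_done()

    def eval(self, arg):
        """Queue one argument for asynchronous evaluation."""
        self.tasks.put(arg)

    def result(self):
        """Block until some result is ready and return it (order not guaranteed)."""
        return self.results.get()
```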
We would appreciate acknowledgement if the software is incorporated in redistributable libraries or applications.
This software was developed at the National Institute of Standards and Technology (NIST) by employees of the Federal Government in the course of their official duties. Pursuant to title 17 Section 105 of the United States Code this software is not subject to copyright protection and is in the public domain. NIST assumes no responsibility whatsoever for its use by other parties, and makes no guarantees, expressed or implied, about its quality, reliability, or any other characteristic.