Complex Network Resources
Complex Networks Data Sets
In analyzing large-scale complex networks, it is important to establish a standard dataset against which algorithms and claims can be compared and verified. Currently, it is often difficult to track down the original data used for computational experiments. Much of it is floating around the net in various formats, embedded in papers, and often difficult to get from the authors. Moreover, datasets are often modified (filtered) by research groups interested in different attributes, so that even when the name and description match a citation in a paper, there is no guarantee that the data are identical.
Major Complex Networks Resources:
Here are some of the basic datasets used in the literature; a list of sources follows the categories. While there is no standard taxonomy of the types of networks encountered, they fall into several general categories:
- citation graphs: (directional) edge (X,Y) exists if paper X cites paper Y.
- collaboration graphs: (bidirectional) edge (X,Y) if person X worked with person Y.
- semantic graphs: dictionaries, thesauri; edge (X,Y) exists if word X is associated with word Y.
- biological graphs: edge (X,Y) exists if process X is related to process Y (e.g. protein interactions, predators, food webs)
- communication graphs: physical computer networks, telephone networks
- news graphs: relationship of events, words, or people in the news (e.g. KEDS, Sept. 11th news stories).
- engineered graphs: circuit design, structural mechanics, mesh generation
- Yeast : protein-protein interaction network in budding yeast. [Undirected graph, 2361 vertices, 7182 edges (536 loops)] (Shiwei Sun, Lunjiang Ling, Nan Zhang, Guojie Li and Runsheng Chen: Topological structure analysis of the protein-protein interaction network in budding yeast. Nucleic Acids Research, 2003, Vol. 31, No. 9 2443-2450)
- US Patent Citations : citations from nearly 3 million US patents (granted Jan. 1963 - Dec. 1999) with over 16 million citations for patents between 1975-1999. Based on paper from the National Bureau of Economic Research. Categories include Chemical, Computer & Communications, Drugs & Medical, Electrical & Electronic, Mechanical, and Others.
- Computational Geometry collaborations: authors collaboration network (weighted graph) with 9072 vertices (authors) and 22577 edges (common publications) with each edge weighted by the number of common publications between two authors.
- Physics papers citations: HEP/KDD Cup 2003.
- US Corporate Ownership: 8343 vertices, 6726 edges. Edge (X,Y) exists if company X owns company Y. (Some companies are independent and therefore are isolated vertices.)
- Erdős Collaborations: List of mathematician Paul Erdős's coauthors and their respective coauthors. (More background info can be found here.)
- Associative Thesaurus: 23,219 vertices, 325,624 arcs (564 loops). This is not a traditional semantic network, but rather the result of experiments where words were shown to several people and they responded with the first word that came to mind. (The word could be a synonym, antonym, or some other causal relation.)
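Several entries above quote vertex, edge, and loop counts. A minimal sketch of how such figures might be recomputed from an edge list, to check that a downloaded copy matches its description, is shown here. The function name `summarize_edges` and the in-memory list of pairs are assumptions for illustration; they are not part of any dataset's tooling, and how duplicates were counted in the original sources is not known.

```python
def summarize_edges(edges, directed=False):
    """Count distinct vertices, distinct edges, and self-loops in an edge list.

    `edges` is any iterable of (x, y) pairs with comparable labels.
    For undirected graphs, (x, y) and (y, x) are treated as the same edge.
    Duplicate edges are counted once (an assumption of this sketch).
    """
    vertices = set()
    seen = set()
    loops = 0
    for x, y in edges:
        vertices.add(x)
        vertices.add(y)
        # Normalize the pair so that direction is ignored when undirected.
        key = (x, y) if directed or x <= y else (y, x)
        if key in seen:
            continue
        seen.add(key)
        if x == y:
            loops += 1
    return {"vertices": len(vertices), "edges": len(seen), "loops": loops}


if __name__ == "__main__":
    sample = [("a", "b"), ("b", "a"), ("c", "c"), ("a", "c")]
    print(summarize_edges(sample))                 # undirected view
    print(summarize_edges(sample, directed=True))  # directed view
```

Comparing these counts with the published ones is a quick way to detect a filtered or modified copy of a dataset.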
Complex Network Data Sets:
(V =59,912, E=5,165)
Contains email data for "Scale-free topology of e-mail networks," Ebel, Mielsch, Bornholdt (2002), Phys. Rev. E 66. Data was collected at Kiel University (Germany) over a 112-day period. Each line contains the date/time stamp of a message between sender and receiver. The details of the data can be found in their paper.
This data was also used by "From Centrality to Temporary Fame: Dynamic Centrality in Complex Networks," Braha, Bar-Yam, Complexity, Vol. 12 (2) Nov/Dec 2006.
part of the Pajek datasets, includes about 20 datasets, including Barabasi's data from Notre Dame.
network databases at CCNR (Barabasi's research group) at Notre Dame.
This includes 5 datasets: World-Wide-Web (html links at Notre Dame), a movie actor graph (based on imdb.com), a cellular network, a protein interaction network, and an email dataset (from Enron?).
Bill Cheswick, creator of several Internet maps, keeps some selected datasets of the internet here. (Rather large 2MB - 36MB files.)
The format of these files is an edge list of two IP addresses forming a directed graph, e.g. "18.104.22.168 22.214.171.124 13 4". (According to the author, Bill Cheswick, it seems the last two numbers signify the number of times the edge appears and the distance from the root node.)
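The line format described above can be read with a few lines of Python. This is a minimal sketch under the stated interpretation of the last two numbers (edge count, then distance from the root), which the page itself reports only tentatively; the names `InternetEdge` and `parse_edge_line` are hypothetical.

```python
from typing import NamedTuple, Optional


class InternetEdge(NamedTuple):
    source: str               # first IP address on the line
    target: str               # second IP address on the line
    count: Optional[int]      # assumed: number of times the edge appears
    distance: Optional[int]   # assumed: distance from the root node


def parse_edge_line(line: str) -> Optional[InternetEdge]:
    """Parse one whitespace-separated line; return None for blank or short lines."""
    fields = line.split()
    if len(fields) < 2:
        return None
    # The two trailing numbers are treated as optional in this sketch.
    count = int(fields[2]) if len(fields) > 2 else None
    distance = int(fields[3]) if len(fields) > 3 else None
    return InternetEdge(fields[0], fields[1], count, distance)


if __name__ == "__main__":
    print(parse_edge_line("18.104.22.168 22.214.171.124 13 4"))
```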
Interpreted as connectivity graphs, Matrix Market contains over 500 sparse matrices from various application areas. (Although these graphs are not self-organizing, they do represent unstructured complex geometries, usually the result of grid generation or human-inspected analysis.)
The files are in a text (Matrix Market) format, basically a coordinate list (i, j, val_ij) with a header and optional text. A simple filter (mm_extract_pattern) converts these files into a simple edge list "i j" and strips the header information. For example, "cat bcsstk05.mtx | mm_extract_pattern > bcsstk05.g" creates a portable graph text file.
The software tools for this conversion are in
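For readers without the original filter, the conversion can be approximated in Python. The following is a sketch, not the mm_extract_pattern tool itself: it assumes the Matrix Market coordinate layout (comment lines starting with "%", then one size line, then one entry per line), and the function name `extract_pattern` is hypothetical.

```python
import sys


def extract_pattern(lines):
    """Turn Matrix Market coordinate text into a list of "i j" edge strings.

    Header and comment lines (starting with '%') are dropped, the first
    remaining line (rows, columns, nonzeros) is skipped, and the value
    column of each entry, if present, is discarded.
    """
    edges = []
    size_line_seen = False
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("%"):
            continue
        if not size_line_seen:
            size_line_seen = True   # skip the "rows cols nonzeros" line
            continue
        i, j = line.split()[:2]
        edges.append(f"{i} {j}")
    return edges


if __name__ == "__main__":
    # Filter usage: python extract_pattern.py < bcsstk05.mtx > bcsstk05.g
    for edge in extract_pattern(sys.stdin):
        print(edge)
```

Symmetric matrices store only one triangle, so the resulting edge list is best read as undirected in that case.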
Email database from Enron, made public by the Federal Energy Regulatory Commission.
According to the authors, this is one of few public email datasets of "real" email from a corporate world. Contains 200,399 messages from 158 users. This version has removed artificial messages, such as folders containing discussion threads and other machine-generated emails. (Large, > 400 MB)
The second link contains a subset of about 1700 email messages (4.5MB) that focus on business-related topics, rather than jokes and personal messages.
(The first link is included in the Barabasi collection above.)
Resources to slides, talks, and examples of complex networks in social and ecosystems.
The Stanford WebBase Project: over 100 TB of archived web pages for US Govt, State and local governments, universities, newspapers, media outlets. Focuses on election and disaster press coverage.
KDD 2003 Challenge: citations from the Stanford Linear Accelerator Center, High Energy Physics (HEP) literature online since 1974, citing over 500,000 related articles. The citation graph has 27,771 vertices and 352,807 edges.
Note: This is a tarball archive of the arXiv:hep-th (High Energy Physics: Theory) citation database from www.arxiv.org, created specifically for the KDD Cup 2003 challenge held in conjunction with the 9th Annual ACM SIGKDD (Knowledge Discovery and Data Mining) conference. See http://www.cs.cornell.edu/projects/kddcup/index.html
- Wikipedia download, with all articles in XML format (150 GB) from which a network graph can be extracted for each page referencing another.
- Pcrawler: a Python web crawler to generate annotated link data for the generation of web-based information networks.
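Extracting a page-to-page graph from the Wikipedia XML download could be sketched as follows. This assumes the MediaWiki export layout in which each page element holds a title and a revision whose text element contains wikitext, and that internal links are written as [[Target]] or [[Target|label]]. The names `page_links` and `local_name` are hypothetical, and the sketch makes no attempt to drop File: or Category: links or to resolve redirects.

```python
import io
import re
import xml.etree.ElementTree as ET

# Capture the link target: everything after "[[" up to "]", "|" or "#".
LINK_RE = re.compile(r"\[\[([^\]|#]+)")


def local_name(tag):
    """Strip an XML namespace prefix such as '{http://...}page'."""
    return tag.rsplit("}", 1)[-1]


def page_links(xml_stream):
    """Yield (source_title, target_title) pairs from a MediaWiki XML stream."""
    title, text = None, ""
    # iterparse streams the file, so the full dump never sits in memory.
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            for target in LINK_RE.findall(text):
                yield title, target.strip()
            title, text = None, ""
            elem.clear()  # release the finished page's subtree


if __name__ == "__main__":
    sample = (
        b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">'
        b"<page><title>Graph</title><revision><text>"
        b"A [[Vertex]] and an [[Edge (graph)|edge]]."
        b"</text></revision></page></mediawiki>"
    )
    for edge in page_links(io.BytesIO(sample)):
        print(edge)
```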