- Yeast: protein-protein interaction network in budding yeast. [Undirected graph, 2361 vertices, 7182 edges (536 loops)] (Shiwei Sun, Lunjiang Ling, Nan Zhang, Guojie Li and Runsheng Chen: Topological structure analysis of the protein-protein interaction network in budding yeast. Nucleic Acids Research, 2003, Vol. 31, No. 9, 2443-2450)
- US Patent Citations: citations from nearly 3 million US patents (granted Jan. 1963 - Dec. 1999), with over 16 million citations for patents granted between 1975 and 1999. Based on a paper from the National Bureau of Economic Research. Categories include Chemical, Computer & Communications, Drugs & Medical, Electrical & Electronic, Mechanical, and Others.
- Computational Geometry collaborations: author collaboration network (weighted graph) with 9072 vertices (authors) and 22577 edges (common publications), with each edge weighted by the number of common publications between the two authors.
- Physics papers citations: HEP/KDD Cup 2003.
- US Corporate Ownership: 8343 vertices, 6726 edges. Edge (X,Y) exists if company X owns company Y. (Some companies are independent and therefore are isolated vertices.)
- Erdös Collaborations: list of mathematician Paul Erdös's coauthors and their respective coauthors. (More background info can be found here.)
- Associative Thesaurus: 23,219 vertices, 325,624 arcs (564 loops). This is not a traditional semantic network, but rather the result of experiments in which words were shown to several people, who responded with the first word that came to mind. (The response could be a synonym, an antonym, or a word related in some other way, such as a causal relation.)
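The edge weighting described for the Computational Geometry collaboration network above can be illustrated with a small sketch. It assumes the input is a list of per-paper author lists; the function name `collaboration_weights` and the sample authors are hypothetical and are not part of the dataset or its tools.

```python
from collections import Counter
from itertools import combinations


def collaboration_weights(papers):
    """Count common publications for every pair of coauthors.

    `papers` is an iterable of author-name lists, one list per paper
    (an assumed input layout, not the dataset's actual file format).
    """
    weights = Counter()
    for authors in papers:
        # Sort and de-duplicate so each pair has one canonical key.
        for a, b in combinations(sorted(set(authors)), 2):
            weights[(a, b)] += 1
    return weights


# Hypothetical sample data for illustration only.
papers = [["Alice", "Bob"], ["Alice", "Bob", "Carol"], ["Carol"]]
weights = collaboration_weights(papers)
```

A single-author paper contributes no edges, so an author with no coauthors would appear only as an isolated vertex.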
Complex Network Data Sets:
- http://www.theo-physik.uni-kiel.de/~ebel/email-net/email_net.html
(V = 59,912, E = 5,165)
Contains email data for "Scale-free topology of e-mail networks," Ebel, Mielsch, Bornholdt (2002), Phys. Rev. E 66. Data was collected at Kiel University (Germany) over a 112-day period. Each line contains the date/time stamp of a message between a sender and a receiver. The details of the data can be found in their paper.
This data was also used by "From Centrality to Temporary Fame: Dynamic Centrality in Complex Networks," Braha, Bar-Yam, Complexity, Vol. 12 (2), Nov/Dec 2006.
- http://vlado.fmf.uni-lj.si/pub/networks/data/:
part of the Pajek datasets, includes about 20 datasets, including Barabasi's data from Notre Dame.
- http://www.nd.edu/~networks/resources.htm:
network databases at CCNR (Barabasi's research group) at Notre Dame.
This includes 5 datasets: World-Wide-Web (HTML links at Notre Dame), a movie actor graph (based on imdb.com), a cellular network, a protein interaction network, and an email dataset (from Enron?).
- http://www.cheswick.com/ches/map/dbs/index.html:
Bill Cheswick, creator of several Internet maps, keeps some selected datasets of the internet here. (Rather large 2MB - 36MB files.)
The format of these files is an edge list of two IP addresses forming a directed graph, e.g. "12.118.106.6 68.86.96.65 13 4". (According to the author, Bill Cheswick, it seems the last two numbers signify the number of times the edge appears and the distance from the root node.)
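A minimal sketch of reading this edge-list format follows. It assumes every line has exactly four whitespace-separated fields, and it takes the tentative reading above for the last two (edge count and distance from the root). The names `Edge`, `parse_edge`, and `read_edges` are hypothetical.

```python
from typing import Iterable, Iterator, NamedTuple


class Edge(NamedTuple):
    src: str       # source IP address
    dst: str       # destination IP address
    count: int     # assumed: number of times the edge appears
    distance: int  # assumed: distance from the root node


def parse_edge(line: str) -> Edge:
    """Parse one line such as '12.118.106.6 68.86.96.65 13 4'."""
    src, dst, count, distance = line.split()
    return Edge(src, dst, int(count), int(distance))


def read_edges(lines: Iterable[str]) -> Iterator[Edge]:
    """Yield an Edge for each non-blank line."""
    for line in lines:
        if line.strip():
            yield parse_edge(line)


edges = list(read_edges(["12.118.106.6 68.86.96.65 13 4", ""]))
```

Since the files run to tens of megabytes, iterating over an open file object line by line avoids loading the whole file into memory.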
- http://math.nist.gov/matrixmarket:
Matrix Market contains over 500 sparse matrices from various application areas, each of which can be interpreted as a connectivity graph. (Although these graphs are not self-organizing, they do represent unstructured complex geometries, usually the result of grid generation or human-inspected analysis.)
The files are in a text (Matrix Market) format, basically a coordinate list (i, j, val_ij) with a header and optional text. A simple filter (mm_extract_pattern) converts these files into a simple edge list "i j" and strips the header information. For example, "cat bcsstk05.mtx | mm_extract_pattern > bcsstk05.g" creates a portable graph text file.
The software tools for this conversion are in ~/projects/MatrixMarket/tools.
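The conversion described above can be sketched in a few lines. This is not the actual mm_extract_pattern tool, only an illustration of the behavior described here. It assumes coordinate-format input: lines starting with "%" are header or comment text, the first remaining line gives the sizes (rows, columns, entries), and each later line starts with "i j". The function name `extract_pattern` and the sample matrix are hypothetical.

```python
def extract_pattern(lines):
    """Yield (i, j) index pairs from Matrix Market coordinate-format text."""
    size_line_seen = False
    for line in lines:
        line = line.strip()
        if not line or line.startswith("%"):
            continue  # skip the banner, comments, and blank lines
        if not size_line_seen:
            size_line_seen = True  # skip the "rows cols entries" line
            continue
        fields = line.split()
        # Keep only the indices; any value column is dropped.
        yield int(fields[0]), int(fields[1])


# Hypothetical sample input for illustration only.
sample = """%%MatrixMarket matrix coordinate real symmetric
% optional comment text
3 3 2
1 1 4.0
3 2 -1.5
""".splitlines()

edge_list = [f"{i} {j}" for i, j in extract_pattern(sample)]
```

Dense (array-format) Matrix Market files are not handled by this sketch. Where SciPy is available, `scipy.io.mmread` is an alternative way to load these files.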
- http://www.cs.cmu.edu/~enron/
http://bailando.sims.berkeley.edu/enron_email.html
Email database from Enron, made public by Federal Energy Regulatory Commission.
According to the authors, this is one of the few public email datasets of "real" email from the corporate world. Contains 200,399 messages from 158 users. This version has removed artificial messages, such as folders containing discussion threads and other machine-generated emails. (Large, > 400 MB)
The second link contains a subset of about 1700 email messages (4.5MB) that focus on business-related topics, rather than jokes and personal messages.
(The first link is included in the Barabasi collection above.)
- http://www.linkgroup.hu/
Links to slides, talks, and examples of complex networks in social systems and ecosystems.
- http://dbpubs.stanford.edu:8091/~testbed/doc2/WebBase/
The Stanford WebBase Project: over 100 TB of archived web pages from the US government, state and local governments, universities, newspapers, and media outlets. Focuses on election and disaster press coverage.
- http://www.cs.cornell.edu/projects/kddcup/index.html
KDD 2003 Challenge: citations from the Stanford Linear Accelerator Center, High Energy Physics (HEP) literature online since 1974, citing over 500,000 related articles. The citation graph has about 27,771 vertices and 352,807 edges.
Note: This is a tarball archive of the arXiv:hep-th (High Energy Physics: Theory) citation database from www.arxiv.org, created specifically for the KDD Cup 2003 challenge held in conjunction with the 9th Annual ACM SIGKDD conference (Knowledge Discovery and Data Mining). See http://www.cs.cornell.edu/projects/kddcup/index.html
- Wikipedia download, with all articles in XML format (150 GB), from which a network graph can be extracted with an edge for each page that references another.
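As a rough sketch of how such a page-reference graph might be extracted, the code below pulls internal link targets out of article wikitext, where links are written as [[Target]] or [[Target|label]]. It assumes the article text has already been read out of the XML dump into a dict of title to wikitext. The names `link_targets` and `build_link_graph` and the sample pages are hypothetical, and the regular expression ignores cases such as templates and redirects.

```python
import re

# Capture the target of [[Target]], [[Target|label]] or [[Target#section]].
LINK_RE = re.compile(r"\[\[([^\[\]|#]+)")


def link_targets(wikitext):
    """Return the internal link targets in one article, in order."""
    return [m.group(1).strip() for m in LINK_RE.finditer(wikitext)]


def build_link_graph(pages):
    """Map each page title to the set of page titles it links to."""
    return {title: set(link_targets(text)) for title, text in pages.items()}


# Hypothetical sample pages for illustration only.
pages = {
    "Graph theory": "See [[Scale-free network|scale-free]] and [[Random graph]].",
    "Random graph": "A model from [[Graph theory#History]].",
}
graph = build_link_graph(pages)
```

For a dump of this size, the XML would need to be streamed page by page rather than loaded whole; that step is omitted here.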
Software Resources
- Pcrawler: a Python web crawler to generate annotated link data for the generation of web-based information networks.