The Clair library (Clairlib) is a suite of open-source Perl modules that simplify a number of generic tasks in natural language processing (NLP), information retrieval (IR), and network analysis (NA). Its architecture also allows external software to be plugged in with very little effort. To download, please visit

MEAD: a prerequisite for all Clairlib versions

MEAD Evaluation add-on
an Evaluation Framework for Extractive Summarization: MEADeval (temporarily unavailable).

AAN: The AAN corpus includes three networks: paper citation, author citation, and author collaboration. The paper citation network (paper-citation-network.txt) is a directed network whose nodes are labeled with paper ids, each corresponding to an individual paper (acl-metadata.txt). The author citation network (author-citation-network.txt), also directed, is compiled from the paper network and the metadata file: for each citation in the paper network, where paper A cites paper B, an edge is created from each author of paper A to each author of paper B. The author collaboration network (author-collaboration-network.txt), an undirected network, is composed of authors: for each paper in the paper citation network, an edge is created between each pair of collaborators on that paper. Download
CSTBank: Cross-document Structure Theory Bank Download
Surveyor: paper collection Download
Cartoons: data set Download
CreateDebate: data set Download
Similarity: data set Download
FRAUD: CLAIR collection of fraud email Download
SUMMBank: a collection of summaries used in the JHU workshop in 2001 Download
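
The construction of the two AAN author networks described above can be sketched in a few lines of Python. This is only an illustration of the stated rules, not the code used to build the corpus: the function name, the in-memory input format, and the toy paper ids and author names are all assumptions, and the sketch collapses repeated edges into one, which the description does not specify.

```python
from itertools import combinations


def build_author_networks(paper_citations, paper_authors):
    """Derive author-level networks from a paper citation network.

    paper_citations: iterable of (citing_paper_id, cited_paper_id) pairs.
    paper_authors:   dict mapping paper id -> list of author names.
    Both input shapes are assumptions made for this sketch.
    """
    paper_citations = list(paper_citations)

    # Directed author citation edges: every author of the citing paper
    # points to every author of the cited paper. An author who appears
    # on both papers yields a self-loop under this rule.
    author_citation = set()
    for citing, cited in paper_citations:
        for a in paper_authors.get(citing, []):
            for b in paper_authors.get(cited, []):
                author_citation.add((a, b))

    # Undirected collaboration edges: one edge per pair of co-authors of
    # each paper in the citation network, stored as a sorted pair.
    collaboration = set()
    papers = {p for edge in paper_citations for p in edge}
    for paper in papers:
        authors = sorted(set(paper_authors.get(paper, [])))
        for a, b in combinations(authors, 2):
            collaboration.add((a, b))

    return author_citation, collaboration


# Toy data (hypothetical ids and names, not taken from the AAN files).
citations = [("P1", "P2")]
authors = {"P1": ["Ann", "Bob"], "P2": ["Cy"]}
cite_edges, collab_edges = build_author_networks(citations, authors)
```

With the toy data, `cite_edges` holds the two directed edges Ann to Cy and Bob to Cy, and `collab_edges` holds the single undirected edge between Ann and Bob.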
String Similarity Measures: A C++ package for computing similarity between strings. The package supports the following similarity measures:
  • Cosine Similarity
  • Jaccard Similarity
  • Similarity based on Levenshtein Distance
  • P-Spectrum Kernel
  • Length-Weighted Kernel
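
To make the measures listed above concrete, here is a minimal Python sketch of four of them using their textbook definitions. It is not the C++ package's API: all function names are hypothetical, the use of character bigrams as features for cosine and Jaccard is an assumption, and the Levenshtein-based similarity is normalized by the longer string's length, which is one common choice. The length-weighted kernel is left out because this page does not define it.

```python
from collections import Counter
from math import sqrt


def ngrams(s, n=2):
    """Return all contiguous character n-grams of s."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def cosine_similarity(a, b, n=2):
    """Cosine of the angle between n-gram count vectors (0.0 if either is empty)."""
    va, vb = Counter(ngrams(a, n)), Counter(ngrams(b, n))
    dot = sum(va[g] * vb[g] for g in va)
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


def jaccard_similarity(a, b, n=2):
    """Size of the n-gram set intersection divided by the size of the union."""
    sa, sb = set(ngrams(a, n)), set(ngrams(b, n))
    union = sa | sb
    if not union:  # both strings shorter than n
        return 1.0 if a == b else 0.0
    return len(sa & sb) / len(union)


def levenshtein_distance(a, b):
    """Minimum number of single-character insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def levenshtein_similarity(a, b):
    """1 minus the edit distance normalized by the longer length (an assumed normalization)."""
    longest = max(len(a), len(b))
    return 1.0 - levenshtein_distance(a, b) / longest if longest else 1.0


def p_spectrum_kernel(a, b, p=2):
    """Sum over shared length-p substrings of the product of their counts."""
    ca, cb = Counter(ngrams(a, p)), Counter(ngrams(b, p))
    return sum(ca[g] * cb[g] for g in ca)
```

For example, `levenshtein_distance("kitten", "sitting")` is 3, and `jaccard_similarity("night", "nacht")` is 1/7 because the two strings share one bigram ("ht") out of seven distinct bigrams.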
Node Similarity Measures: A C++ library for computing similarity between nodes in a graph. The library supports the following similarity measures:
  • SimRank
  • Random walk based similarity measure
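
As an illustration of the first measure, the sketch below implements the standard iterative SimRank definition in Python: two distinct nodes are similar to the extent that their in-neighbors are similar, scaled by a decay factor. It does not reflect the C++ library's interface; the function name, the dictionary-based graph format, and the default decay and iteration count are assumptions. The random walk based measure is not sketched because this page does not define it.

```python
def simrank(in_neighbors, decay=0.8, iterations=10):
    """Naive iterative SimRank.

    in_neighbors: dict mapping every node to a list of its in-neighbors.
                  Every node that appears as an in-neighbor must also be
                  a key (with an empty list if nothing points to it).
    Returns a nested dict sim[a][b] of similarity scores.
    """
    nodes = list(in_neighbors)
    # Start from the identity: each node is only similar to itself.
    sim = {a: {b: 1.0 if a == b else 0.0 for b in nodes} for a in nodes}

    for _ in range(iterations):
        new = {a: {} for a in nodes}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new[a][b] = 1.0
                    continue
                ia, ib = in_neighbors[a], in_neighbors[b]
                if not ia or not ib:
                    # A node with no in-neighbors is similar to no other node.
                    new[a][b] = 0.0
                    continue
                # Average similarity over all pairs of in-neighbors.
                total = sum(sim[i][j] for i in ia for j in ib)
                new[a][b] = decay * total / (len(ia) * len(ib))
        sim = new
    return sim


# Toy graph (hypothetical): A points to both B and C.
graph = {"A": [], "B": ["A"], "C": ["A"]}
scores = simrank(graph)
```

In the toy graph, B and C share their only in-neighbor A, so `scores["B"]["C"]` equals the decay factor 0.8, while `scores["A"]["B"]` is 0.0 because A has no in-neighbors. The naive loop is quadratic in the number of nodes per iteration, so it is suitable only for small graphs.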
Relational Classification Dataset
  • Contains 380 papers manually classified into three research areas: Machine Translation, Dependency Parsing and Summarization.
  • Contains authorship, venue, title, and citation information for all the papers.
Publication Classification
  • Contains 383 papers manually classified into 31 research areas using session information.
Near Duplicate Detection
A C++ package for detecting near-duplicate documents in a large corpus