Archive for May, 2007

My favorite corpora

Sunday, May 6th, 2007

Here are my favorite corpora:

Enron email
CIA world factbook
DBLP: papers in CS
US congressional speeches
AOL queries
Netflix recommendations
IMDb
PubMed: biomedical paper abstracts
Wikipedia
ACL Anthology
DOTGOV: download of .GOV
BioCreative: biomedical papers
WT100G: 100GB download of the web
Google n-grams
webfreq
SMS corpus
Citeseer
DMOZ
corpus of paraphrases
multilingual parallel parliamentary proceedings
textual entailment corpus
question answering corpus
summarization corpus
various text classification corpora (Reuters-21578, 20NG)
Peekaboom

The North American Linguistics Olympiad

Sunday, May 6th, 2007

Results and problem sets are here:
http://www.namclo.org