Text Corpus - Some Notable Text Corpora

English language:

  • Google N-Grams Corpus - Largest English corpus at 155 billion words. Also has corpora for other languages. (http://ngrams.googlelabs.com/datasets)
  • American National Corpus
  • Bank of English
  • British National Corpus
  • Corpus Juris Secundum
  • Corpus of Contemporary American English (COCA): 425 million words, 1990-2011. Freely searchable online.
  • Brown Corpus, forming part of the "Brown Family" of corpora, together with LOB, Frown and F-LOB.
  • International Corpus of English
  • Oxford English Corpus
  • Scottish Corpus of Texts & Speech

Other languages:

  • Hamshahri Corpus (Persian, a.k.a. Farsi)
  • Amarna letters (for Akkadian, Egyptian, Sumerograms, etc.)
  • TEP: Tehran English-Persian Parallel Corpus (http://ece.ut.ac.ir/nlp/)
  • TMC: Tehran Monolingual Corpus, a standard corpus for Persian language modeling (http://ece.ut.ac.ir/nlp/)
  • Bijankhan Corpus: a contemporary Persian corpus for NLP research
  • Bulgarian National Corpus (http://search.dcl.bas.bg)
  • CETENFolha
  • Croatian Language Corpus
  • Croatian National Corpus
  • Czech National Corpus
  • Neo-Assyrian Text Corpus Project
  • Russian National Corpus
  • Slovenian National Corpus
  • Thesaurus Linguae Graecae (Ancient Greek)
  • Quranic Arabic Corpus (Classical Arabic)
  • Eastern Armenian National Corpus (EANC): 110 million words. Freely searchable online.
  • National Corpus of Polish
  • German Reference Corpus (DeReKo): more than 4 billion words of contemporary written German.
  • Tatoeba: a parallel corpus containing about 913,000 sentences in 90 languages.
  • Spanish text corpus by Molino de Ideas, containing 660 million words.
  • Kotonoha Japanese language corpus
  • CorALit: the Corpus of Academic Lithuanian. Academic texts published in 1999-2009 (approx. 9 million words), compiled at the University of Vilnius, Lithuania.
