Create a new corpus from files. Sketch Engine also serves as corpus-building software: a corpus can be built by downloading content from the web or by uploading files. Uploading files is covered on this page. A corpus can also be built by combining both methods.

Wikipedia offers free copies of all available content to interested users. These databases can be used for mirroring, personal use, informal backups, offline use or database queries (such as for Wikipedia:Maintenance). All text content is multi-licensed under the Creative Commons Attribution-ShareAlike 3.0 License (CC-BY-SA) and the GNU Free Documentation License (GFDL).

If you have a text or collection of texts that you are willing to add to the corpus, feel free to contact us and we can discuss accuracy, fidelity, and markup as needed.

In natural language processing (NLP), and statistical NLP in particular, models and algorithms need to be trained on large amounts of data. For this purpose, researchers have assembled many text corpora.

@InProceedings{neveol14quaero, author = {Névéol, Aurélie and Grouin, Cyril and Leixa, Jeremy and Rosset, Sophie and Zweigenbaum, Pierre}, title = {The {Quaero} {French} Medical Corpus: A Ressource for Medical Entity Recognition and…

This repo contains a list of the 10,000 most common English words in order of frequency, as determined by n-gram frequency analysis of Google's Trillion Word Corpus. Here at Google Research we have been using word n-gram models for a variety of R&D projects, such as statistical machine
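Ranking words by frequency, as the word list above does at a much larger scale, can be illustrated with a minimal Python sketch. This is not the method used to produce that list: the function name `top_words`, the simple regular-expression tokenizer, and the sample sentence are all assumptions made for illustration.

```python
import re
from collections import Counter


def top_words(text, n=10):
    """Return the n most frequent word tokens in text, most frequent first.

    Tokenization here is a deliberately naive assumption: lowercase the
    text and keep runs of ASCII letters and apostrophes.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    return [word for word, _count in Counter(tokens).most_common(n)]


if __name__ == "__main__":
    # Hypothetical sample text, not taken from any real corpus.
    sample = "the cat and the dog and the bird"
    print(top_words(sample, 2))
```

On a real corpus the counts would normally be accumulated file by file rather than from a single string, but the ranking step stays the same.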

AntConc. A freeware corpus analysis toolkit for concordancing and text analysis. In order to use the corpus you can download the following corpus text files (.psd) (encoded in Mac OS Roman, at present). 
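Many current tools expect UTF-8 input, so files encoded in Mac OS Roman may need converting before analysis. The sketch below shows one way to do that with Python's built-in `mac_roman` codec. The function names and the idea of writing a converted copy are assumptions, not part of any tool mentioned here.

```python
from pathlib import Path


def mac_roman_to_utf8(data: bytes) -> bytes:
    """Re-encode raw Mac OS Roman bytes as UTF-8 bytes."""
    return data.decode("mac_roman").encode("utf-8")


def convert_file(src: Path, dst: Path) -> None:
    """Write a UTF-8 copy of a Mac OS Roman file (hypothetical helper).

    Working on bytes leaves line endings exactly as they are in the source.
    """
    dst.write_bytes(mac_roman_to_utf8(src.read_bytes()))
```

Calling `convert_file(Path("in.psd"), Path("out.psd"))`, with placeholder file names, would leave the original untouched and produce a UTF-8 copy alongside it.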

In some corpora, these files will not all contain the same type of data; for example, for the nltk.corpus.timit corpus, fileids() will return a list including text files, word segmentation files, phonetic transcription files, sound files, and metadata files.
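Because a single corpus can mix several kinds of files, it can help to group the file identifiers by extension before reading any of them. The sketch below uses only the standard library. The function name `group_by_extension` and the example identifiers are made up for illustration. With NLTK and the corpus data installed, the list returned by `fileids()` could be passed in instead.

```python
import os
from collections import defaultdict


def group_by_extension(fileids):
    """Map each file extension to the list of file IDs that carry it."""
    groups = defaultdict(list)
    for fid in fileids:
        # Files without an extension are collected under "(none)".
        ext = os.path.splitext(fid)[1].lower() or "(none)"
        groups[ext].append(fid)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical file IDs; a real corpus will have its own naming scheme.
    example_ids = [
        "speaker1/utt1.txt",
        "speaker1/utt1.wrd",
        "speaker1/utt1.phn",
        "speaker1/utt1.wav",
        "README",
    ]
    for ext, ids in sorted(group_by_extension(example_ids).items()):
        print(ext, ids)
```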

This site contains downloadable, full-text corpus data from nine large corpora of English (iWeb, NOW, Wikipedia, COCA, COHA, GloWbE, TV Corpus, Movies Corpus, SOAP Corpus) as well as the Corpus del Español. The data is being used at hundreds of universities throughout the world, as well as in a wide range of companies.