To create your own, you need a good source of raw material (i.e. written English) in the form of a text corpus, such as those available from the non-profit but pricey Linguistic Data Consortium. However, if you need a corpus with a permissive license (CC-BY-SA and GFDL) at no cost, Wikipedia now presents an excellent alternative. (Another is the set of Google Books n-grams.)
This post describes techniques for turning the contents of Wikipedia into a set of sentences and a vocabulary suitable for use with language modelling toolkits or other applications. You will need a reasonable amount of bandwidth and disk space, and some CPU time, to proceed.
Step 1: get that dump file
To start, download a Wikipedia database extract from the Wikimedia dumps site (dumps.wikimedia.org). For English, use the pages-articles dump, a single bzip2-compressed XML file named enwiki-&lt;date&gt;-pages-articles.xml.bz2; this post works from the 2011-01-15 snapshot.
Step 2: convert the dump file to sentences
The XML format of the dump file and the wiki markup of the articles contain a lot of information, such as formatting, that is irrelevant to statistical language modelling, where we are concerned simply with words and how they form sentences.
To process the XML file into something useful, I used the gwtwiki toolkit (bliki-core-3.0.16.jar) along with its dependency Apache Commons Compress (commons-compress-1.1.jar). There is a wide range of toolkits, in various languages and of varying quality, for processing Wikipedia content; gwtwiki appears to be one of the most functional and robust, handling both the parsing of the XML file and the conversion of each article from wiki markup to plain text.
A small Java wrapper (Wikipedia2Txt.java) invokes the gwtwiki parser and does some further filtering, such as excluding sentences of fewer than 6 words (a sketch of what such a wrapper looks like follows the examples below). A few hours of processing yields a set of sentences, one per line. Here are the first few from the 2011-01-15 snapshot of the Anarchism article:
Anarchism is a political philosophy which considers the state undesirable, unnecessary, and harmful, and instead promotes a stateless society, or anarchy.
The Concise Oxford Dictionary of Politics.
It seeks to diminish or even abolish authority in the conduct of human relations.

Note that some of these are not real subject-verb-object sentences. As the parser is purely syntactic, it will include collections of words that look like sentences. However, they still represent coherent examples of language use for modelling purposes.
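For reference, here is a minimal sketch of what such a wrapper can look like. This is an illustration, not the actual Wikipedia2Txt.java: the naive sentence splitter and the 6-word filter are stand-ins for the post's filtering, and the exact gwtwiki/bliki signatures (e.g. whether IArticleFilter.process takes a Siteinfo argument) vary between bliki versions.

import info.bliki.wiki.dump.IArticleFilter;
import info.bliki.wiki.dump.Siteinfo;
import info.bliki.wiki.dump.WikiArticle;
import info.bliki.wiki.dump.WikiXMLParser;
import info.bliki.wiki.filter.PlainTextConverter;
import info.bliki.wiki.model.WikiModel;
import org.xml.sax.SAXException;

public class Wikipedia2TxtSketch {

    static class SentenceFilter implements IArticleFilter {
        public void process(WikiArticle article, Siteinfo siteinfo) throws SAXException {
            // Keep only real articles: skip redirects, Talk:, Category:, Template:, etc.
            if (article.getText() == null || !article.isMain()) return;
            // Render the article's wiki markup to plain text.
            WikiModel model = new WikiModel("${image}", "${title}");
            String plain = model.render(new PlainTextConverter(), article.getText());
            // Naive split on terminal punctuation; keep "sentences" of 6+ words.
            for (String s : plain.split("(?<=[.!?])\\s+")) {
                s = s.trim().replaceAll("\\s+", " ");
                if (!s.isEmpty() && s.split(" ").length >= 6) {
                    System.out.println(s);
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // args[0] is the dump file, e.g. enwiki-20110115-pages-articles.xml.bz2;
        // WikiXMLParser decompresses .bz2 on the fly via commons-compress.
        new WikiXMLParser(args[0], new SentenceFilter()).parse();
    }
}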
Step 3: convert the sentence list to a corpus file
As most language modelling toolkits are distracted by punctuation, some post-processing (text conditioning) is required. A set of regular expressions (e.g. in a Perl script) is the easiest way to accomplish this. tocorpus.pl removes punctuation and excess space (a rough sketch of this step follows the example below), producing output like:
ANARCHISM IS A POLITICAL PHILOSOPHY WHICH CONSIDERS THE STATE UNDESIRABLE UNNECESSARY AND HARMFUL AND INSTEAD PROMOTES A STATELESS SOCIETY OR ANARCHY
THE CONCISE OXFORD DICTIONARY OF POLITICS
IT SEEKS TO DIMINISH OR EVEN ABOLISH AUTHORITY IN THE CONDUCT OF HUMAN RELATIONS

From a 28G uncompressed version of the English Wikipedia pages from the 2011-01-15 snapshot, the corpus file is 6.6G.
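The precise rules live in tocorpus.pl; purely as an illustration, an equivalent conditioning filter in Java, assuming we simply uppercase, map everything other than letters and apostrophes to spaces, and collapse runs of whitespace, might look like:

import java.io.BufferedReader;
import java.io.InputStreamReader;

public class ToCorpusSketch {

    // Rough stand-in for tocorpus.pl: uppercase, strip punctuation, collapse spaces.
    // The real script's rules (e.g. for apostrophes) are more involved.
    static String condition(String sentence) {
        return sentence.toUpperCase()
                       .replaceAll("[^A-Z' ]", " ")  // non-letters become spaces
                       .replaceAll(" +", " ")        // collapse runs of spaces
                       .trim();
    }

    public static void main(String[] args) throws Exception {
        // Reads sentences on stdin, writes conditioned corpus lines to stdout.
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        for (String line; (line = in.readLine()) != null; ) {
            String conditioned = condition(line);
            if (!conditioned.isEmpty()) System.out.println(conditioned);
        }
    }
}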
Step 4: create a vocabulary file
As Wikipedia includes many words which are in fact not words (for example misspellings and other weird and wonderful character sequences like AAA'BBB), it is helpful to create a vocabulary with frequency counts, imposing some restrictions on what is considered a word. mkvocab.pl restricts valid words to those occurring with a minimum frequency and of a minimum length, with some English-specific rules for acceptable use of the apostrophe (english-utils.pl).
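Setting the apostrophe rules aside, a minimal sketch of the frequency counting itself, assuming thresholds of minimum length 3 and minimum frequency 3 (which match the examples below), might look like:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

public class MkVocabSketch {

    // Rough stand-in for mkvocab.pl: count word frequencies in the corpus on stdin,
    // keeping words of length >= 3 seen at least 3 times (assumed thresholds).
    // The real script adds the English-specific apostrophe rules (english-utils.pl).
    public static void main(String[] args) throws IOException {
        final int MIN_LEN = 3, MIN_FREQ = 3;
        Map<String, Integer> counts = new HashMap<>();
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        for (String line; (line = in.readLine()) != null; ) {
            for (String word : line.split(" ")) {
                if (word.length() >= MIN_LEN) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= MIN_FREQ) {
                System.out.println(e.getKey() + " " + e.getValue());
            }
        }
    }
}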
Having created a vocabulary file by processing the corpus file through mkvocab.pl, it's easy to sort it in reverse order of frequency using:
sort -nr -k 2 enwiki-vocab.txt

which produces:

THE 84503449

for a total of 1714417 distinct words (with a minimum length of 3). Words with frequency 3 include the misspelt (AFTEROON), the unusual (AFGHANIZATION, AGRO-PASTORALIST), and the spurious (AAAABBBCCC).
It is also then trivial to produce a vocabulary of the most commonly used words, e.g.
head -n100000 enwiki-vocab-sorted.txt > enwiki-vocab-100K.txt

However, with a minimum length of 3, a range of useful short English words (a, as, an, ...) is excluded, so it's best to combine the resulting dictionary with a smaller dictionary of higher quality (such as CMUdict), which includes most of the valid 2-letter English words.
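As an illustration of that combination step, here is a hypothetical helper that unions the 100K Wikipedia list with the headwords of cmudict. The file names follow this post's examples; the entry format assumed for cmudict (headword first, variant markers like "(2)", comment lines starting with ";;;") is the 0.7a convention.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Set;
import java.util.TreeSet;

public class CombineVocabSketch {
    public static void main(String[] args) throws IOException {
        Set<String> vocab = new TreeSet<>();
        // The Wikipedia list: one word per line, followed by its frequency count.
        try (BufferedReader wiki = new BufferedReader(new FileReader("enwiki-vocab-100K.txt"))) {
            for (String line; (line = wiki.readLine()) != null; ) {
                if (!line.isEmpty()) vocab.add(line.split("\\s+")[0]);
            }
        }
        // cmudict: headword then phones; strip variant markers such as WORD(2).
        try (BufferedReader cmu = new BufferedReader(new FileReader("cmudict.0.7a"))) {
            for (String line; (line = cmu.readLine()) != null; ) {
                if (line.isEmpty() || line.startsWith(";;;")) continue;  // skip comments
                vocab.add(line.split("\\s+")[0].replaceAll("\\(\\d+\\)$", ""));
            }
        }
        try (PrintWriter out = new PrintWriter("enwiki-100K-cmu-combined.txt")) {
            for (String word : vocab) out.println(word);
        }
    }
}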
Step 5: create a language model
Using a language modelling toolkit, you can create an LM of your own design, using part or all of the Wikipedia corpus, optionally restricted to a specific vocabulary. For example, with mitlm using 1 out of every 30 Wikipedia sentences and a vocabulary restricted to the top 100,000 words from Wikipedia combined with the CMU 0.7a dictionary:
estimate-ngram -vocab enwiki-100K-cmu-combined.txt -text enwiki-sentences-1-from-30.corpus -write-lm enwiki.lm

The resulting LM (close to 700M in ARPA format) has:

ngram 1=163892

Constructing a language model with the full set of sentences and the full vocabulary (Wikipedia len=3 plus CMU) leads to an LM with:

ngram 1=1724335

which comes to about 12G in size (uncompressed ARPA format).