Tuesday, 15 March 2011

Creating a text corpus from Wikipedia

Speech recognition engines (and other natural language processing applications) need a good language model (LM). Open source speech recognition engines such as the CMU Sphinx toolkit include relatively small LMs, such as the WSJ model with 5000 terms. Some larger models are available online, such as Keith Vertanen's English Gigaword models.

To create your own, you need a good source of raw material (i.e. written English) in the form of a text corpus, such as those available from the non-profit but pricey Linguistic Data Consortium. However, if you need a corpus that is free of charge and permissively licensed (CC-BY-SA and GFDL), Wikipedia now presents an excellent alternative. (Another is the set of Google Books n-grams.)

This post describes techniques for turning the contents of Wikipedia into a set of sentences and a vocabulary suitable for use with language modelling toolkits or other applications. You will need a reasonable amount of bandwidth, disk space, and some CPU time to proceed.

Step 1: get that dump file

To start, download a Wikipedia database extract. For English, use:
which is 6G+ in size.

Step 2: convert the dump file to sentences

The XML format of the Wikipedia dump file and the MediaWiki markup of the articles contain a lot of information, such as formatting, that is irrelevant to statistical language modelling, where we are concerned simply with words and how they form sentences.

To process the XML file into something useful, I used the gwtwiki toolkit (bliki-core-3.0.16.jar) along with its dependency Apache Commons Compress (commons-compress-1.1.jar). There is a wide range of toolkits, written in different languages and of varying quality, for processing Wikipedia content. gwtwiki appears to be one of the most functional and robust, handling both the parsing of the XML file and the conversion of each article from markup into a plain text format.

A small Java wrapper (Wikipedia2Txt.java) invokes the gwtwiki parser and does some further filtering, such as excluding sentences of fewer than 6 words. After a few hours of processing, the result is a set of sentences (one per line). Here are the first few from the 2011-01-15 snapshot of the Anarchism article:
Anarchism is a political philosophy which considers the state undesirable, unnecessary, and harmful, and instead promotes a stateless society, or anarchy.
The Concise Oxford Dictionary of Politics.
It seeks to diminish or even abolish authority in the conduct of human relations.
Note that some of these are not real subject-verb-object sentences. As the parser is purely syntactic, it will include collections of words that look like sentences. However, they still represent coherent examples of language use for modelling purposes.
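
For readers who want to see the shape of this step without the Java toolchain, here is a minimal Python sketch. It is not Wikipedia2Txt.java and does not use gwtwiki: it only streams pages out of the dump XML with the standard library and applies a naive sentence split with the 6-word minimum mentioned above. The function names are my own, and the sketch assumes you have some other way of turning the raw markup into plain text before splitting (that conversion is the hard part gwtwiki handles).

```python
import re
import xml.etree.ElementTree as ET


def iter_article_text(xml_file):
    """Yield (title, raw markup) for each <page> in a MediaWiki XML dump."""
    title, text = None, None
    for _event, elem in ET.iterparse(xml_file, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]  # drop the XML namespace prefix
        if tag == "title":
            title = elem.text
        elif tag == "text":
            text = elem.text or ""
        elif tag == "page":
            yield title, text
            elem.clear()  # release this page's subtree before moving on


def split_sentences(plain_text, min_words=6):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace,
    then drop candidates with fewer than min_words words."""
    candidates = re.split(r"(?<=[.!?])\s+", plain_text.strip())
    return [s for s in candidates if len(s.split()) >= min_words]
```

Since the dump is distributed compressed, you could pass `bz2.open(path, "rb")` as `xml_file` to avoid unpacking it first. Like the real parser, the splitter is purely syntactic, so it will also let through fragments that merely look like sentences.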

Step 3: convert the sentence list to a corpus file

As most language modelling toolkits are distracted by punctuation, some post-processing (text conditioning) is required. A set of regular expressions (for example in a Perl script) is the easiest way to accomplish this. tocorpus.pl removes punctuation and excess space, producing output like:
From a 28G uncompressed version of the English Wikipedia pages from the 2011-01-15 snapshot, the corpus file is 6.6G.
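
To give a feel for this kind of text conditioning, here is a rough Python sketch. It is not tocorpus.pl, whose exact rules are not reproduced here. The assumptions are mine: output is upper-cased (as the vocabulary listing in the next step suggests), apostrophes and hyphens are kept only inside words, digits are kept, and anything outside plain ASCII is dropped.

```python
import re


def condition(sentence):
    """Upper-case a sentence and strip punctuation, keeping ' and - inside words."""
    text = sentence.upper()
    # Replace anything that is not a letter, digit, apostrophe, hyphen or space.
    text = re.sub(r"[^A-Z0-9'\- ]+", " ", text)
    # Drop apostrophes and hyphens that are not between two alphanumerics.
    text = re.sub(r"(?<![A-Z0-9])['\-]+|['\-]+(?![A-Z0-9])", " ", text)
    # Collapse runs of whitespace into single spaces.
    return " ".join(text.split())
```

Applied line by line to the sentence file, this yields one conditioned sentence per line. Expect to iterate on the expressions: getting the output clean is mostly trial and error.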

Step 4: create a vocabulary file

As Wikipedia includes many words which are in fact not words (for example misspellings and other weird and wonderful character sequences like AAA'BBB), it is helpful to create a vocabulary with frequency counts, imposing some restrictions on what is considered a word. mkvocab.pl restricts valid words to those occurring with a minimum frequency and of a minimum length, with some English-specific rules for acceptable use of the apostrophe (english-utils.pl).

Having created a vocabulary file by processing the corpus file through mkvocab.pl, it's easy to sort it in reverse order of frequency using:
sort -nr -k 2 enwiki-vocab.txt
which produces:
THE     84503449
AND     33700692
WAS     12911542
FOR     10342919
THAT    8318795
for a total of 1714417 vocabulary entries (with a minimum length of 3). Words with frequency 3 include the misspelt (AFTEROON), the unusual (AFGHANIZATION, AGRO-PASTORALIST), and the spurious (AAAABBBCCC).
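
The counting and filtering can be sketched in a few lines of Python. This is not mkvocab.pl: the apostrophe rules from english-utils.pl are not reproduced, the word pattern is a simple stand-in of my own, and the default minimum frequency of 3 is an assumption (the minimum length of 3 is the one used above).

```python
import re
from collections import Counter

# Stand-in word pattern: letters, optionally joined by single ' or - characters.
WORD_RE = re.compile(r"[A-Z]+(?:['\-][A-Z]+)*")


def build_vocab(lines, min_freq=3, min_len=3):
    """Count whitespace-separated words and keep those passing the filters."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return {
        word: n
        for word, n in counts.items()
        if n >= min_freq and len(word) >= min_len and WORD_RE.fullmatch(word)
    }


def sorted_vocab(vocab):
    """Highest frequency first, roughly what `sort -nr -k 2` produces."""
    return sorted(vocab.items(), key=lambda item: (-item[1], item[0]))
```

Passing an open corpus file as `lines` streams it one line at a time, so only the counts have to fit in memory.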

It is also then trivial to produce a vocabulary of the most commonly used words, e.g.
head -n100000 enwiki-vocab-sorted.txt > enwiki-vocab-100K.txt 
However, with a minimum length of 3, a range of useful English words (a, as, an, ...) are excluded, so it's best to combine the resulting dictionary with a smaller dictionary of higher quality (such as CMUdict), which includes most of the valid 2-letter English words.
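
Merging the two word lists is a simple set union. The sketch below assumes the CMUdict text format as I understand it: comment lines start with `;;;`, each entry line starts with the word, and alternate pronunciations are marked `WORD(2)`, `WORD(3)` and so on. The function names are hypothetical.

```python
import re


def read_cmudict_words(lines):
    """Return the set of headwords from CMUdict-style lines."""
    words = set()
    for line in lines:
        if not line.strip() or line.startswith(";;;"):
            continue  # skip blank lines and comments
        head = line.split()[0]
        # Strip the (2), (3), ... suffix used for alternate pronunciations.
        words.add(re.sub(r"\(\d+\)$", "", head).upper())
    return words


def combine_vocab(wiki_words, cmu_words):
    """Union of the two word lists, sorted for a stable output file."""
    return sorted(set(wiki_words) | set(cmu_words))
```

Here `wiki_words` would be the first column of the top-100K file. The real dictionary may also contain a few entries naming punctuation marks, which you may want to filter out before merging.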

Step 5: create a language model

Using a language modelling toolkit, you can create an LM of your own design, using part or all of the Wikipedia corpus, optionally restricted to a specific vocabulary. For example, with mitlm using 1 out of every 30 Wikipedia sentences and a vocabulary restricted to the top 100,000 words from Wikipedia combined with the CMU 0.7a dictionary:
estimate-ngram -vocab enwiki-100K-cmu-combined.txt -text enwiki-sentences-1-from-30.corpus -write-lm enwiki.lm
the resulting LM (close to 700M in ARPA format) has:
ngram 1=163892
ngram 2=6251876
ngram 3=17570560
Constructing a language model with the full set of sentences and full vocabulary (Wikipedia len=3 plus CMU) leads to an LM with
ngram 1=1724335
ngram 2=79579226
ngram 3=314047999
about 12G in size (uncompressed ARPA format).
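
The 1-in-30 subset used for the smaller model can be produced with a few lines of Python. This is a sketch under the assumption that the sample is systematic (every 30th line) rather than random; the input file name in the usage note is hypothetical.

```python
def every_nth(lines, n=30, offset=0):
    """Yield one line out of every n, starting with line number `offset`."""
    for i, line in enumerate(lines):
        if i % n == offset:
            yield line


def write_sample(src_path, dst_path, n=30):
    """Copy every n-th line of src_path to dst_path."""
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        dst.writelines(every_nth(src, n))
```

For example, `write_sample("enwiki-sentences.corpus", "enwiki-sentences-1-from-30.corpus")` would write the subset file passed to estimate-ngram above. Because the corpus follows article order, a systematic sample still draws from every part of Wikipedia.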

Happy modelling!


  1. Impressive. How long did it take you to work all this out?

  2. Quite a few days! I spent some time investigating the various toolkits and "prior art" tackling the problem (e.g. http://blog.afterthedeadline.com/2009/12/04/generating-a-plain-text-corpus-from-wikipedia/) before settling on gwtwiki.

    The other time-consuming part was trial-and-error evolution of a set of regular expressions to get the output as clean as possible.

    Generating very large LMs from the full corpus exceeded the capacity of my 4G MBP, so I used our local grid computing infrastructure to run those jobs on larger servers.

  3. Hi,

    Do you know a quick way to extract and create the corpus for only the articles of a specific domain?

  4. How do you define "a specific domain" in the Wikipedia context?

  5. Say I wanted to extract articles on history or computer science?

  6. Hi! I would also like to know if there is a way to extract a corpus from only the articles specific to a given domain (e.g. the water domain: water, river, tsunami, purification, freshwater...).

  7. Marwen and danajaatcse,

    To do that, you probably don't want to use the technique described here of processing the entire Wikipedia dump file, because most of the articles are not relevant.

    To construct a corpus on a specific topic, you'd need a collection of articles that are related to the topic. One approach would be to use Wikipedia categories or Wikipedia Book pages, and retrieve all linked articles.

    Another approach, which I am working on at the moment, is to start with Wikipedia search with some keywords, and then follow links based on the similarity of articles to each other. I will post a new article about that when I'm done.

    My approach uses vector space modelling and LSA to establish similarity, using the gensim toolkit (http://nlp.fi.muni.cz/projekty/gensim/).

  8. Thank you very much! With your help, I solved the whole problem in just one hour :)

  9. Thanks for your post. Could you upload the LM file you created with these steps?

  10. This comment has been removed by the author.

  11. We can assume this could be done in any of the wikipedias, for example, French or Spanish.
    This would appear to pick out single morphemes well. Is it adaptable to pulling out multiple word collocations?
    For example: "food poisoning," "an easy read," "light snack," and "absorbed in her book," while made up of adjective and noun or verb and prepositional or adverbial phrase, act as commonly occurring sequences of words whose collocation means more or other than the "sum of its parts."
    Any algorithm(s) available to extract collocative phrases?

    1. A naive solution will run in O(n): suppose phrases are made of 2 words, then split a sentence (for example) of n words into n-1 "phrases" and calculate..

  12. Hey,

    I dug up this article because I am trying to get pocketsphinx working in French.
    I was thinking about aggregating different French books into a single file and jumping right to step 3. My understanding is that this would do the trick, right?


    1. This comment has been removed by the author.

    2. Hi,
      Did you find a solution?
      I need to do the same for the Korean language, so any help would be appreciated.
      Could you please contact me at banna.kbet@gmail.com?

  13. How do I need to change Wikipedia2Txt.java to leave paragraphs in one piece, i.e. not break them into separate sentences?

  14. Out of interest, what version of mitlm did you use? I ask because I get segmentation faults when using the same commands (estimate-ngram).



  15. This comment has been removed by the author.

  16. Is your final model or cleaned wiki corpus available for download?

  17. Hi
    I found the link to this page on the CMUsphinx language model building page: http://cmusphinx.sourceforge.net/wiki/tutoriallm

    I have downloaded and installed gwtwiki, but I don't know how to use it. I have also downloaded the wiki dump file in .xml format.

    Could you explain step number 2 in more detail?

    thanks a lot

  18. Hi,
    I need to recognize only 100 specific words, so I don't want other unnecessary words in my dictionary. How could I do it? Please help me.

  19. Sir, I want to extract some books or articles in the Malayalam language from the web and also want to create the language model. Can you please help me to do this? What toolkits are available for pre-processing the text?

    1. @habi did u get any help ??

    2. no sir.. actually i am trying to prepare the corpus manually. it will be helpful if somebody share the info.

  20. Hi, first of all thank you for your blog, it has really helped me out. Secondly, I am a bit stuck trying to use the Perl script. Would you have an example of how you converted the HTML output from the Java code to clean text? At the moment I'm running the Java program on the XML dump file, and I thought that while it's running I'd try to get the next step working with some smaller data before it finishes.
