Truly Madly Wordly: Creating a text corpus from Wikipedia

Tuesday, 15 March 2011

Creating a text corpus from Wikipedia

Speech recognition engines (and other natural language processing applications) need a good language model (LM). Open source speech recognition engines such as the CMU Sphinx toolkit include relatively small LMs, such as the WSJ model with 5,000 terms. Some larger models are available online, such as Keith Vertanen's English Gigaword models.

To create your own, you need a good source of raw material (i.e. written English) in the form of a text corpus, such as those available from the non-profit but pricey Linguistic Data Consortium. However, if you need a corpus that costs nothing and carries a permissive license (CC-BY-SA and GFDL), Wikipedia now presents an excellent alternative. (Another is the set of Google Books n-grams.)

This post describes techniques for turning the contents of Wikipedia into a set of sentences and a vocabulary suitable for use with language modelling toolkits or other applications. You will need a reasonable amount of bandwidth, disk space, and some CPU time to proceed.

Step 1: get that dump file

To start, download a Wikipedia database extract. For English, use:

http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

which is 6G+ in size.

Step 2: convert the dump file to sentences

The Wikipedia dump file XML format and the Wikimedia markup of the articles contain lots of information such as formatting that is irrelevant to statistical language modelling, where we are concerned simply with words and how they form sentences.

To process the XML file into something useful, I used the gwtwiki toolkit (bliki-core-3.0.16.jar) along with its dependency Apache Commons Compress (commons-compress-1.1.jar). There is a wide range of toolkits for processing Wikipedia content, written in different languages and of varying quality. gwtwiki appears to be one of the most functional and robust, handling both the parsing of the XML file and the conversion of each article from markup into a plain text format.
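For readers who want to see what streaming the dump involves before setting up gwtwiki, here is a minimal Python sketch. It is not the gwtwiki API and not the wrapper used for this post: it only uses the standard library to walk the compressed XML and yield each page's title and raw wiki markup. The function names (iter_pages, iter_dump) are made up for illustration, and converting the markup to plain text is left out entirely, which is the hard part that gwtwiki handles.

```python
import bz2
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace, e.g. '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, wikitext) for each <page> element in a MediaWiki export stream."""
    title, text = None, None
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text or ""
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            elem.clear()  # drop the finished page's content to keep memory use modest


def iter_dump(path):
    """Hypothetical helper: stream pages from a .xml.bz2 dump without unpacking it to disk."""
    with bz2.open(path, "rb") as stream:
        yield from iter_pages(stream)
```

Something like `for title, text in iter_dump("enwiki-latest-pages-articles.xml.bz2"): ...` would then visit every page in turn, with the markup still to be cleaned.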

A small Java wrapper (Wikipedia2Txt.java) invokes the gwtwiki parser and does some further filtering, such as excluding sentences of fewer than 6 words. With a few hours of processing, a set of sentences results (one per line). Here are the first few from the 2011-01-15 snapshot of the Anarchism article:

Anarchism is a political philosophy which considers the state undesirable, unnecessary, and harmful, and instead promotes a stateless society, or anarchy.
The Concise Oxford Dictionary of Politics.
It seeks to diminish or even abolish authority in the conduct of human relations.

Note that some of these are not real subject-verb-object sentences. As the parser is purely syntactic, it will include collections of words that look like sentences. However, they still represent coherent examples of language use for modelling purposes.
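To illustrate the kind of purely syntactic filtering described above, here is a rough Python sketch. It is not a port of Wikipedia2Txt.java, whose exact splitting rules are not reproduced in this post; the boundary rule (sentence-ending punctuation, whitespace, then a capital letter) is an assumption, and only the 6-word minimum comes from the text.

```python
import re

# Assumed, naive boundary: ., ! or ? followed by whitespace and a capital letter.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def split_sentences(text, min_words=6):
    """Split plain text into candidate sentences, dropping those shorter than min_words."""
    sentences = []
    for candidate in SENTENCE_END.split(text.strip()):
        candidate = candidate.strip()
        if len(candidate.split()) >= min_words:
            sentences.append(candidate)
    return sentences
```

A rule this simple will mis-split on abbreviations such as "e.g. Some", and it happily keeps fragments like the dictionary title above, which is exactly the behaviour the note describes.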

Step 3: convert the sentence list to a corpus file

As most language modelling toolkits are distracted by punctuation, some post-processing (text conditioning) is required. A set of regular expressions (such as in a Perl script) is the easiest way to accomplish this. tocorpus.pl removes punctuation and excess space and converts the text to upper case, producing output like:

ANARCHISM IS A POLITICAL PHILOSOPHY WHICH CONSIDERS THE STATE UNDESIRABLE UNNECESSARY AND HARMFUL AND INSTEAD PROMOTES A STATELESS SOCIETY OR ANARCHY
THE CONCISE OXFORD DICTIONARY OF POLITICS
IT SEEKS TO DIMINISH OR EVEN ABOLISH AUTHORITY IN THE CONDUCT OF HUMAN RELATIONS

From a 28G uncompressed version of the English Wikipedia pages from the 2011-01-15 snapshot, the corpus file is 6.6G.
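As an illustration of this kind of text conditioning, here is a small Python sketch. It is not tocorpus.pl: the two regular expressions are assumptions chosen to reproduce the sample output above, keeping apostrophes and hyphens inside words (since forms like AGRO-PASTORALIST show up in the vocabulary later) and treating only ASCII letters and digits as word characters.

```python
import re

# Assumption: anything other than A-Z, 0-9, apostrophe or hyphen becomes a space.
NOT_WORD_CHARS = re.compile(r"[^A-Z0-9'\-]+")
# Apostrophes or hyphens that are not sandwiched between two word characters.
STRAY_MARKS = re.compile(r"(?<![A-Z0-9])['\-]+|['\-]+(?![A-Z0-9])")


def condition(sentence):
    """Upper-case a sentence, strip punctuation, and collapse runs of whitespace."""
    text = sentence.upper()
    text = NOT_WORD_CHARS.sub(" ", text)
    text = STRAY_MARKS.sub(" ", text)
    return " ".join(text.split())
```

Applied line by line to the sentence file, this yields one upper-case, punctuation-free sentence per line. Accented letters would be dropped by the ASCII-only rule, so a corpus in another language would need a different character class.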

Step 4: create a vocabulary file

As Wikipedia includes many words which are in fact not words (for example misspellings and other weird and wonderful character sequences like AAA'BBB), it is helpful to create a vocabulary with frequency counts, imposing some restrictions on what is considered a word. mkvocab.pl restricts valid words to those occurring with a minimum frequency and of a minimum length, with some English-specific rules for acceptable use of the apostrophe (english-utils.pl).
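The counting and filtering idea can be sketched in a few lines of Python. This is not mkvocab.pl or english-utils.pl: the apostrophe rule below (an apostrophe is only accepted before a short suffix such as S, T or LL) is an assumption made for illustration, and the default thresholds of 3 are simply the values mentioned in this post.

```python
import re
from collections import Counter

# Assumed rule: letters, optional internal hyphens, and at most one apostrophe
# followed by a common English suffix (as in CAT'S, DON'T, WE'LL).
VALID_WORD = re.compile(r"[A-Z]+(?:-[A-Z]+)*(?:'(?:S|T|D|M|LL|RE|VE))?")


def build_vocab(lines, min_freq=3, min_len=3):
    """Count words in corpus lines and keep those passing the frequency, length and shape tests."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return {
        word: n
        for word, n in counts.items()
        if n >= min_freq and len(word) >= min_len and VALID_WORD.fullmatch(word)
    }
```

Writing each surviving word and its count on one line, separated by whitespace, gives a file that the sort command in the next step can order by frequency. Note that a shape test like this still lets through spurious but well-formed strings.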

Having created a vocabulary file by processing the corpus file through mkvocab.pl, it's easy to sort it in reverse order of frequency using:

sort -nr -k 2 enwiki-vocab.txt > enwiki-vocab-sorted.txt

The sorted file begins:

THE     84503449
AND     33700692
WAS     12911542
FOR     10342919
THAT    8318795

for a total of 1714417 distinct words (with a minimum length of 3). Words with frequency 3 include the misspelt (AFTEROON), the unusual (AFGHANIZATION, AGRO-PASTORALIST), and the spurious (AAAABBBCCC).

It is also then trivial to produce a vocabulary of the most commonly used words, e.g.

head -n100000 enwiki-vocab-sorted.txt > enwiki-vocab-100K.txt

However, with a minimum length of 3, a range of useful English words (a, as, an, ...) are excluded, so it's best to combine the resulting dictionary with a smaller dictionary of higher quality (such as CMUdict), which includes most of the valid 2-letter English words.
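One way to do the combining is sketched below in Python. It assumes the sorted vocabulary has one word and count per line, and that the second dictionary uses the CMUdict text layout (comment lines starting with ";;;", the head word first on each line, alternate pronunciations marked like WORD(1)). The function names are invented for this example.

```python
import re


def top_words(vocab_lines, limit):
    """Return the first `limit` words from frequency-sorted 'WORD COUNT' lines."""
    words = []
    for line in vocab_lines:
        if len(words) >= limit:
            break
        fields = line.split()
        if fields:
            words.append(fields[0])
    return words


def dictionary_words(dict_lines):
    """Collect head words from CMUdict-style lines, skipping comments and blanks."""
    words = set()
    for line in dict_lines:
        if not line.strip() or line.startswith(";;;"):
            continue
        head = line.split()[0]
        # Alternate pronunciations appear as WORD(1), WORD(2), ...; keep the bare word.
        words.add(re.sub(r"\(\d+\)$", "", head))
    return words


def combine(vocab_lines, dict_lines, limit=100000):
    """Union of the top Wikipedia words and the dictionary's head words, sorted."""
    return sorted(set(top_words(vocab_lines, limit)) | dictionary_words(dict_lines))
```

The result is a plain word list, one entry per line once written out, which is the form a vocabulary restriction usually expects. CMUdict also contains entries for punctuation names, which this sketch does not filter.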

Step 5: create a language model

Using a language modelling toolkit, you can create an LM of your own design, using part or all of the Wikipedia corpus, optionally restricted to a specific vocabulary. For example, with mitlm using 1 out of every 30 Wikipedia sentences and a vocabulary restricted to the top 100,000 words from Wikipedia combined with the CMU 0.7a dictionary:

estimate-ngram -vocab enwiki-100K-cmu-combined.txt -text enwiki-sentences-1-from-30.corpus -write-lm enwiki.lm

the resulting LM (close to 700M in ARPA format) has:

ngram 1=163892
ngram 2=6251876
ngram 3=17570560
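The 1-in-30 sentence file used for this model can be produced with a few lines of Python. The post does not say how the subset was chosen, so taking every 30th line in order is an assumption (a random sample would be another reasonable choice), and the function names here are hypothetical.

```python
def every_nth(lines, n=30, offset=0):
    """Yield every n-th item from an iterable, starting at position `offset`."""
    for index, line in enumerate(lines):
        if index % n == offset:
            yield line


def subsample_file(src_path, dst_path, n=30):
    """Copy every n-th line of a corpus file to a new file, streaming line by line."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        dst.writelines(every_nth(src, n))
```

Because the file is streamed, this works on a multi-gigabyte corpus without loading it into memory.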

Constructing a language model with the full set of sentences and full vocabulary (Wikipedia len=3 plus CMU) leads to an LM with

ngram 1=1724335
ngram 2=79579226
ngram 3=314047999

about 12G in size (uncompressed ARPA format).
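The n-gram counts quoted above come from the header of the ARPA file, which lists them after a \data\ marker. With files this large it can be handy to read just that header rather than open the file in an editor; the following Python sketch does so. The function name is made up for illustration.

```python
import re

NGRAM_COUNT = re.compile(r"ngram (\d+)=(\d+)")


def arpa_counts(lines):
    """Return {order: count} from the \\data\\ header of an ARPA-format LM."""
    counts = {}
    in_header = False
    for line in lines:
        line = line.strip()
        if line == "\\data\\":
            in_header = True
        elif in_header:
            match = NGRAM_COUNT.fullmatch(line)
            if match:
                counts[int(match.group(1))] = int(match.group(2))
            elif line:
                break  # first non-blank line that is not a count ends the header
    return counts
```

Passing an open file object (for example `arpa_counts(open("enwiki.lm", encoding="utf-8"))`) stops reading as soon as the header ends, so it returns quickly even for a 12G model.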

Happy modelling!

27 comments:

Anonymous, 15 March 2011 at 21:10
Impressive. How long did it take you to work all this out?

Stephen Marquard, 16 March 2011 at 08:30
Quite a few days! I spent some time investigating the various toolkits and "prior art" tackling the problem (e.g. http://blog.afterthedeadline.com/2009/12/04/generating-a-plain-text-corpus-from-wikipedia/) before settling on gwtwiki.

The other time-consuming part was trial-and-error evolution of a set of regular expressions to get the output as clean as possible.

Generating very large LMs from the full corpus exceeded the capacity of my 4G MBP, so I used our local grid computing infrastructure to run those jobs on larger servers.

Anonymous, 7 April 2011 at 20:41
Hi,

Do you know a quick way to extract and create the corpus for only the articles of a specific domain?

Stephen Marquard, 7 April 2011 at 21:51
How do you define "a specific domain" in the Wikipedia context?

Anonymous, 8 April 2011 at 07:38
Say I wanted to extract articles on history or computer science?

Marwen, 30 April 2011 at 14:19
Hi! I want to know too if there is a way to extract a corpus from only articles that are specific to a given domain (e.g. water domain [water, river, tsunami, purification, freshwater...]).

Stephen Marquard, 30 April 2011 at 16:41
Marwen and danajaatcse,

To do that, you probably don't want to use the technique described here of processing the entire Wikipedia dump file, because most of the articles are not relevant.

To construct a corpus on a specific topic, you'd need a collection of articles that are related to the topic. One approach would be to use Wikipedia categories or Wikipedia Book pages, and retrieve all linked articles.

Another approach, which I am working on at the moment, is to start with Wikipedia search with some keywords, and then follow links based on the similarity of articles to each other. I will post a new article about that when I'm done.

My approach uses vector space modelling and LSA to establish similarity, using the gensim toolkit (http://nlp.fi.muni.cz/projekty/gensim/).

Anonymous, 20 March 2013 at 12:28
Thank you very much! With your help, I solved the whole problem in just one hour :)

Vignesh C, 23 March 2013 at 13:49
Thanks for your post. Could you upload the LM file you created with these steps?

Unknown, 12 June 2013 at 19:43
We can assume this could be done in any of the wikipedias, for example, French or Spanish.
This would appear to pick out single morphemes well. Is it adaptable to pulling out multiple word collocations?
For example: "food poisoning," "an easy read," "light snack," and "absorbed in her book," while made up of adjective and noun or verb and prepositional or adverbial phrase, act as commonly occurring sequences of words whose collocation means more or other than the "sum of its parts."
Any algorithm(s) available to extract collocative phrases?

Unknown, 2 January 2014 at 14:33
Hey,

I dug up this article because I am trying to get pocketsphinx working in French.
I was thinking about using different French books aggregated into a single file and jumping right to step 3. My understanding is that this would do the trick, right?

Regards

Unknown, 22 January 2014 at 23:21
How do I need to change Wikipedia2Txt.java to leave paragraphs in one piece, i.e. not break them into separate sentences?

Unknown, 5 March 2014 at 08:02
Out of interest, what version of mitlm did you use? I ask because I get segmentation faults when using the same commands (estimate-ngram).

Thanks

Chris

Abhishek Sharma, 12 December 2014 at 11:15
Is your final model or cleaned wiki corpus available for download?

Anonymous, 1 October 2016 at 14:43
Hi
I found the link to this page on the CMUSphinx language model building page: http://cmusphinx.sourceforge.net/wiki/tutoriallm

I have downloaded and installed gwtwiki, but don't know how to use it. I have also downloaded the wiki dump file in .xml.

Could you explain step number 2 in more detail?

thanks a lot

Unknown, 17 March 2017 at 17:24
Hi,
I need to recognize only 100 specific words, so I don't want other unnecessary words in my dictionary. How could I do it? Please help me.

Habi, 4 May 2017 at 14:23
Sir, I want to extract some books or articles in the Malayalam language from the web and also want to create the language model. Can you please help me to do this? What toolkits are available for pre-processing the text?

Unknown, 3 August 2017 at 15:42
Hi, first of all thank you for your blog, it has really helped me out. Secondly, I am a bit stuck trying to use the Perl script; would you have an example of how you converted the HTML output from the Java code to clean text? At the moment I'm running the Java program on the XML dump file and thought whilst it's running I'd try to get the next step working before it finishes with some smaller data.

John, 20 December 2017 at 06:29
What are the command-line arguments for the mkvocab.pl file? Please help me!