Friday, 25 March 2011

Recognizing specialized vocabulary with large dictionaries

One of the goals of the work which inspired this blog is to integrate a speech recognition engine into a lecture capture system (specifically, integrating CMU Sphinx into Opencast Matterhorn).

Many university lectures include a high proportion of specialist terms (e.g. medical and scientific terms, discipline-specific terminology and jargon). These are important words. They are the "content anchors" of the lecture, and are likely to be used as search terms should a student want to locate a particular lecture dealing with a topic, or jump to a section of a recording.

Hence applications of speech recognition in an academic context need to pay special attention to recognizing these words correctly. ASR engines use linguistic resources to recognize words: a pronunciation dictionary, which maps words to typical pronunciations, and a language model, which is a statistical model of the frequency with which words and word combinations (n-grams) occur in a body of text.
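
As a toy illustration of the statistics behind a language model, word-combination (bigram) counts can be produced with standard tools. The sample text and file names here are made up for illustration:

```shell
# Count word bigrams (2-grams) in a tiny sample text with awk: counts like
# these are the raw statistics a language model is estimated from.
# sample.txt is an illustrative file, not part of the original post.
printf 'the liver is a large organ\nthe liver filters blood\n' > sample.txt

awk '{ for (i = 1; i < NF; i++) count[$i " " $(i+1)]++ }
     END { for (b in count) print count[b], b }' sample.txt |
  sort -nr > bigrams.txt

head -n 1 bigrams.txt
```

The most frequent bigram ("the liver", count 2) sorts to the top; a real toolkit computes the same kind of counts over millions of sentences.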

This post examines the "size and shape" of dictionary that would be required to recognize most specialist terms correctly in a particular domain. The reference text is an edited transcript of a lecture delivered to undergraduate Health Sciences (Medical School) students on "Chemical Pathology of the Liver".

The dictionaries evaluated come from a variety of sources. Google's ngram dictionary is a list of words from English-language books with a minimum frequency cutoff of 40. BEEP and CMU are ASR pronunciation dictionaries. The Bing dictionary is a list of the 100,000 most frequent terms in documents indexed by Bing, and WSJ 5K is a small vocabulary from the Wall Street Journal (WSJ) corpus.

The Wikipedia dictionaries were created from a plain text list of sentences from Wikipedia articles. The complete list of words was sorted by descending frequency of use, with a cutoff of 3. Wikipedia 100K, for example, contains the most frequent 100,000 terms from Wikipedia.

The dictionaries all contain variant forms as separate words rather than stem words (e.g. speak, speaker, speaks). The comparison of the lecture text to the dictionary compares only words which are 3 or more characters in length (on the assumption that 1- and 2-letter English words are not problematic in this context, and excluding them from the Wikipedia dictionaries avoids some noise).

The reference text contains 7810 words which meet this requirement, using a vocabulary of 1407 unique words. Compared against the candidate dictionaries, we find:

Dictionary              Size        OOV     OOV%    Unique OOV   Unique OOV%
Google 1gram Eng 2009   4 631 186      12   0.15%        8         0.57%
Wikipedia Full          1 714 417      22   0.28%       13         0.92%
Wikipedia 1M            1 000 000      27   0.35%       16         1.14%
Wikipedia 500K            500 000      41   0.52%       23         1.63%
Wikipedia 250K            250 000     112   1.43%       43         3.06%
Wikipedia 100K            100 000     269   3.44%       90         6.40%
BEEP 1.0                  257 560     413   5.29%      124         8.81%
CMU 0.7.a                 133 367     455   5.83%      146        10.38%
Bing Top100K Apr2010       98 431     514   6.58%      125         8.88%
WSJ                         4 986   2 177  27.87%      696        49.47%
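
A comparison like the one in the table can be sketched with standard Unix text tools; the sample transcript and dictionary below are made up for illustration:

```shell
# Find transcript words of 3+ characters missing from a dictionary (OOV).
# transcript.txt and dictionary.txt are illustrative samples.
printf 'bilirubin is conjugated in the liver\n' > transcript.txt
printf 'conjugated\nis\nliver\nthe\n' > dictionary.txt

# Unique transcript words of length >= 3, one per line, sorted
tr -s ' ' '\n' < transcript.txt | awk 'length($0) >= 3' | sort -u > vocab.txt

# Lines in vocab.txt but not in dictionary.txt (comm needs both sorted)
comm -23 vocab.txt dictionary.txt > oov.txt
cat oov.txt
```

Here "bilirubin" is the only OOV word; dividing the OOV count by the transcript word count gives the OOV% figures above.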

So if we are hoping to find more than 99% of the words in our lecture in a generic English dictionary, i.e. an out of vocabulary (OOV) rate of < 1%, we require a dictionary of between 250K and 500K terms.

Looking at the nature of the words which are OOV at different dictionary sizes, 250K to 500K is also the region where the number of unrecognized general English words becomes insignificant, leaving only specialist vocabulary. So in Wikipedia 250K, missing words include:
sweetish, re-expressed, ex-boss
which are slightly unusual but arguably generic English. Using Wikipedia 500K, the remaining missing words are almost completely domain-specific, for example:
sulfhydryls, aminophenyl, preicteric, methimine, fibrosed, haematemesis, paracetamols, prehepatic, icteric, urobilin, clottability, hepatoma, sclerae, hypergonadism, extravasates, clottable, necroses, necrose
So the unsurprising conclusion is that a lecture on a narrow, specialist topic may contain a lot of words which are very infrequent in general English. Another way of visualizing this is comparing the word frequency distribution from a lecture transcript to a text from another genre.

This scatter plot shows term frequency in the transcript against dictionary rank (i.e. the position of the word in a dictionary sorted from most-to-least frequent), for the lecture transcript (blue) and the first 10,000 words or so from Alice's Adventures in Wonderland (i.e. a similar word count to the lecture).

The narrative fictional text shows the type of distribution we would expect from Zipf's law. The lecture text shows many more outliers -- for example terms with a document frequency of between 10 and 100, and a dictionary rank of 10,000 and below.

So is the solution to recognizing these terms to use a very large dictionary? In this case, larger is not always better. While we may want to recognize a word such as "fibrosed" which occurs with frequency 3 only in the full 1.7M Wikipedia dictionary, in practical terms a dictionary is only as useful as the accompanying language model.

LMs generated with an unrestricted vocabulary from a very large text corpus such as Wikipedia are not only impractical to use (requiring significant memory), but also lose an essential element of context, which is that a lecture is typically about one topic, rather than the whole of human knowledge. Hence we need to take into account that "fibrosed" is significantly more likely to occur in a lecture on liver pathology than "fibro-cement".

This leads to the specialized area of language model adaptation, a topic for future posts.

Wednesday, 23 March 2011

Language modelling on the grid

Working with large data sets such as a Wikipedia plain text corpus creates certain challenges. A raw Wikipedia XML dump file is about 28G uncompressed (as of Jan 2011), and a set of plain text sentences from this is about 6.6G uncompressed.

Tools designed to process text corpora often have working memory requirements proportional to the size of the corpus or the size of the output set. In the case of language models, the size of the model can significantly exceed the size of the input corpus.

With my goal being to create a language model from the full Wikipedia English corpus using the mitlm toolkit and evaluate the perplexity of the resulting model against a reference text, it became clear that my MacBook Pro's humble 4G of memory was insufficient.

Happily, UCT's Information and Communication Technology Services directed me to the relatively new grid computing infrastructure in South Africa in the form of the SAGrid. UCT has both a local computing element (gridspeak for a cluster of grid-linked machines) and a clued-up team in the form of Andrew Lewis and Timothy Carr, who helped me get up and running.

While the grid is capable of extraordinary feats of distributed computing, I basically just needed to be able to execute my single-threaded process on a server with lots of memory (16G or ideally 32G) against a large data set. This turned out to be fairly straightforward. Here are my crib notes (which assume some Linux familiarity):

1. Figure out what the grid is and how it works

Watch the informative GridCafe Tutorial Screencasts from the EGEE Direct User Support Group. These explain basic concepts and how to carry out the most common procedures.

2. Get set up as a grid user and associated with a VOMS

Follow steps 1 to 3 on SAGrid's Getting Started page, with help where needed from your local grid computing support staff. You will need a South African Registration Authority to verify your identity and provide you with the key that allows you to request a digital certificate via INFN.

Once you have been issued with a personal certificate, you need to install it in your browser and register with the SAGrid VOMS (virtual organization).

Cryptic clue: the VOMS registration page needs TLS-1 disabled before it will allow you to connect; otherwise you will get a "secure connection failed" error. To disable TLS-1 in Firefox, go to about:config and set the property security.enable_tls to false. You can re-enable it once you've registered successfully.

3. Set up your personal certificate on a UI server

A grid "user interface" just means a server which has the grid glite middleware installed, allowing you to submit jobs to the grid and retrieve the results. I used a UI server running Scientific Linux. Once you have a shell account (for ssh login), follow the process outlined in the screencast to create and copy your certificate files to the UI server.


Cryptic clue: if you installed your personal certificate in Firefox on MacOS, you can export it through

Firefox / Preferences / Advanced / Encryption / View Certificates / Backup

which will save a certificate file in PKCS12 format (usually with a .p12 extension). You can convert this to the PEM format required by the glite middleware using openssl, as helpfully described by the NCSA's Useful SSL Commands.

4. Initialize your grid and VOMS proxy credentials

This sets up your authorization to submit jobs on the grid for the next 12 hours: 

voms-proxy-init -voms sagrid

(If you have a job which will take longer than that, you need a further proxy authentication step.)

5. Build the toolkit and create a script to execute it with the right data

If the application you want to run is not installed on the servers which will execute your job, then you need to build it on a similar platform and include it in your job.

In my case, I built mitlm from source, and then created a tar bundle with the executable and its libraries, viz. mitlm.tgz containing


A wrapper script then unpacks the app, fetches the data set, runs the toolkit, and compresses the results:

#! /bin/sh

# Unpack the mitlm toolkit
tar zxf mitlm.tgz

# Get our large data set
wget --quiet --no-proxy

# Run the LM toolkit

usr/bin/estimate-ngram -text enwiki-sentences.corpus.bz2 -vocab enwiki-500K-cmu-combined.txt.bz2 -wl wiki.lm

# Compress the resulting LM file
bzip2 wiki.lm

6. Configure and submit the job

With the script set to go, all that remains is to create a Job Description Language (JDL) file for the job and submit it. For the mitlm task above, the lm-big.jdl file contains:

Executable = "";
Arguments = "";
StdOutput = "std.out";
StdError = "std.err";
InputSandbox = { "", "mitlm.tgz", "enwiki-500K-cmu-combined.txt.bz2" };
OutputSandbox = { "std.out", "std.err", "wiki.lm.bz2" };
Requirements = other.GlueCEUniqueID=="";

Small files are sent along with the job in the InputSandbox (here they are located on the portal UI server in the same directory as the JDL file). Large data sets are retrieved separately from some location by the wrapper script. In this case the script does a simple wget from a local server, as an alternative to using grid storage services. The OutputSandbox defines which files will get returned as part of the job output, in this case stdout and stderr, and the resulting language model file.

For this job, I defined a particular computing element on which the job should run (a local cluster) using Requirements. This is to ensure that the process executes on worker nodes which have sufficient memory, and as the input and output data sets are relatively large (approx 6G and 12G), it also helps to keep file transfers on a fast network.

To submit the job, simply run:

glite-wms-job-submit -a -o lm-big.jdl

which saves the resulting job identifier into the file given with -o.

7. Get the results

To check on the status of the job, run:

glite-wms-job-status -i

and to retrieve the results and output (i.e. fetch the files defined in the OutputSandbox):

glite-wms-job-output --dir ./results -i


This particular job used around 16G of working memory and took 1 hour to execute. The resulting language model is around 2.6G in ARPA format after bzip2 compression.

A follow-up job evaluated the perplexity of the model against two reference documents (although with mitlm one could in fact do this at the same time as creating the model).

With most of the hard work done, it is now easy to put those grid computing resources to work running multiple variants of the job, for example to evaluate the perplexity of models of different sizes.

Tuesday, 15 March 2011

Creating a text corpus from Wikipedia

Speech recognition engines (and other natural language processing applications) need a good language model. Open source speech recognition engines such as the CMU Sphinx toolkit include relatively small LMs, such as the WSJ model with 5000 terms. Some larger models are available online, such as Keith Vertanen's English Gigaword models.

To create your own, you need a good source of raw material (i.e. written English) in the form of a text corpus such as those available from the non-profit but pricey Linguistic Data Consortium. However, if you need a corpus with a permissive license (CC-BY-SA and GFDL) and at no cost, Wikipedia now presents an excellent alternative. (Another is the set of Google Books n-grams).

This post describes techniques for turning the contents of Wikipedia into a set of sentences and a vocabulary suitable for use with language modelling toolkits or other applications. You will need a reasonable amount of bandwidth, disk space, and some CPU time to proceed.

Step 1: get that dump file

To start, download a Wikipedia database extract. For English, use:
The English dump file is 6G+ in size.

Step 2: convert the dump file to sentences

The Wikipedia dump file XML format and the Wikimedia markup of the articles contain lots of information such as formatting that is irrelevant to statistical language modelling, where we are concerned simply with words and how they form sentences.

To process the XML file into something useful, I used the gwtwiki toolkit (bliki-core-3.0.16.jar) along with the dependency Apache Commons Compress (commons-compress-1.1.jar). There is a wide range of toolkits, in various languages and of varying quality, for processing Wikipedia content. gwtwiki appears to be one of the most functional and robust, handling both the parsing of the XML file and the conversion of each article from wiki markup into plain text.

A small Java wrapper invokes the gwtwiki parser and does some further filtering, such as excluding sentences of fewer than 6 words. A few hours of processing yields a set of sentences (one per line). Here are the first few from the 2011-01-15 snapshot of the Anarchism article:
Anarchism is a political philosophy which considers the state undesirable, unnecessary, and harmful, and instead promotes a stateless society, or anarchy.
The Concise Oxford Dictionary of Politics.
It seeks to diminish or even abolish authority in the conduct of human relations.
Note that some of these are not real subject-verb-object sentences. As the parser is purely syntactic, it will include collections of words that look like sentences. However, they still represent coherent examples of language use for modelling purposes.
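
The sentence-length filter can be sketched in isolation with awk (the actual wrapper is Java; the sample file here is illustrative):

```shell
# Drop lines (sentences) with fewer than 6 words, as in the filtering step.
# sentences.txt is an illustrative sample, not real parser output.
printf 'The Concise Oxford Dictionary of Politics.\nToo short.\n' > sentences.txt
awk 'NF >= 6' sentences.txt > filtered.txt
cat filtered.txt
```

awk's NF variable is the number of whitespace-separated fields on the line, so the two-word line is dropped.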

Step 3: convert the sentence list to a corpus file

As most language modelling toolkits are distracted by punctuation, some post-processing (text conditioning) is required. A set of regular expressions (such as in a perl script) is the easiest way to accomplish this, removing punctuation and excess space.
From a 28G uncompressed version of the English Wikipedia pages from the 2011-01-15 snapshot, the corpus file is 6.6G.
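
The conditioning step can be sketched with tr; this is an illustrative stand-in for the Perl script, not the original:

```shell
# Remove punctuation and squeeze repeated spaces; an illustrative stand-in
# for the text-conditioning script described above.
printf 'It seeks to diminish, or even abolish, authority!\n' |
  tr -d '[:punct:]' | tr -s ' ' > conditioned.txt
cat conditioned.txt
```

A real conditioning script also needs rules for cases where punctuation carries meaning (apostrophes, hyphens), which a blanket delete does not handle.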

Step 4: create a vocabulary file

As Wikipedia includes many words which are in fact not words (for example misspellings and other weird and wonderful character sequences like AAA'BBB), it is helpful to create a vocabulary with frequency counts, imposing some restrictions on what is considered a word: valid words are restricted to those occurring with a minimum frequency and of a minimum length, with some English-specific rules for acceptable use of the apostrophe.
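
A frequency-and-length filter of this kind can be sketched as follows (the corpus file, minimum frequency of 2, and file names are illustrative):

```shell
# Uppercase the corpus, count word frequencies, and keep words occurring
# at least twice with length >= 3; corpus.txt is an illustrative sample.
printf 'the liver the liver the bile\n' > corpus.txt

tr '[:lower:]' '[:upper:]' < corpus.txt | tr -s ' ' '\n' |
  sort | uniq -c |
  awk '$1 >= 2 && length($2) >= 3 { print $2, $1 }' > vocab-counts.txt

cat vocab-counts.txt
```

BILE (frequency 1) is dropped by the frequency cutoff, leaving LIVER and THE with their counts.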

Having created a vocabulary file from the corpus, it's easy to sort it in reverse order of frequency using:
sort -nr -k 2 enwiki-vocab.txt
which produces:
THE     84503449
AND     33700692
WAS     12911542
FOR     10342919
THAT    8318795
for a total of 1714417 distinct words (with a minimum length of 3). Words with frequency 3 include the misspelt (AFTEROON), the unusual (AFGHANIZATION, AGRO-PASTORALIST), and the spurious (AAAABBBCCC).

It is also then trivial to produce a vocabulary of the most commonly used words, e.g.
head -n100000 enwiki-vocab-sorted.txt > enwiki-vocab-100K.txt 
However, with a minimum length of 3, a range of useful English words (a, as, an, ...) are excluded, so it's best to combine the resulting dictionary with a smaller dictionary of higher quality (such as CMUdict), which includes most of the valid 2-letter English words.
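
Combining the two word lists can be sketched with sort -u (both sample files are illustrative):

```shell
# Merge a Wikipedia-derived word list with a second dictionary's word list,
# dropping duplicates; both files here are illustrative samples.
printf 'THE\nLIVER\n' > enwiki-words.txt
printf 'AS\nAN\nTHE\n' > cmu-words.txt

sort -u enwiki-words.txt cmu-words.txt > combined-words.txt
cat combined-words.txt
```

Duplicates such as THE appear once in the merged list, and the short words contributed by the second dictionary (AS, AN) are retained.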

Step 5: create a language model

Using a language modelling toolkit, you can create an LM of your own design, using part or all of the Wikipedia corpus, optionally restricted to a specific vocabulary. For example, with mitlm using 1 out of every 30 Wikipedia sentences and a vocabulary restricted to the top 100,000 words from Wikipedia combined with the CMU 0.7a dictionary:
estimate-ngram -vocab enwiki-100K-cmu-combined.txt -text enwiki-sentences-1-from-30.corpus -write-lm enwiki.lm
the resulting LM (close to 700M in ARPA format) has:
ngram 1=163892
ngram 2=6251876
ngram 3=17570560
Constructing a language model with the full set of sentences and full vocabulary (Wikipedia len=3 plus CMU) leads to an LM with
ngram 1=1724335
ngram 2=79579226
ngram 3=314047999
about 12G in size (uncompressed ARPA format).

Happy modelling!