Wednesday, 23 March 2011

Language modelling on the grid

Working with large data sets such as a Wikipedia plain text corpus creates certain challenges. A raw Wikipedia XML dump file is about 28G uncompressed (as of Jan 2011), and a set of plain text sentences from this is about 6.6G uncompressed.

Tools designed to process text corpora often have working memory requirements proportional to the size of the corpus or the size of the output set. In the case of language models, the size of the model can significantly exceed the size of the input corpus.

My goal was to create a language model from the full Wikipedia English corpus using the mitlm toolkit and evaluate the perplexity of the resulting model against a reference text. It soon became clear that my MacBook Pro's humble 4G of memory was insufficient.

Happily, UCT's Information and Communication Technology Services directed me to the relatively new grid computing infrastructure in South Africa in the form of the SAGrid. UCT has both a local computing element (gridspeak for a cluster of grid-linked machines) and a clued-up team in the form of Andrew Lewis and Timothy Carr, who helped me get up and running.

While the grid is capable of extraordinary feats of distributed computing, I basically just needed to be able to execute my single-threaded process on a server with lots of memory (16G or ideally 32G) against a large data set. This turned out to be fairly straightforward. Here are my crib notes (which assume some Linux familiarity):

1. Figure out what the grid is and how it works

Watch the informative GridCafe Tutorial Screencasts from the EGEE Direct User Support Group. These explain basic concepts and how to carry out the most common procedures.

2. Get set up as a grid user and associated with a VOMS

Follow steps 1 to 3 on SAGrid's Getting Started page, with help where needed from your local grid computing support staff. You will need a South African Registration Authority to verify your identity and provide you with the key that allows you to request a digital certificate via INFN.

Once you have been issued with a personal certificate, you need to install it in your browser and register with the SAGrid VOMS (virtual organization).

Cryptic clue: the VOMS registration page needs TLS-1 disabled before it will allow you to connect; otherwise you will get a "secure connection failed" error. To disable TLS-1 in Firefox, go to about:config and set the property security.enable_tls to false. You can re-enable it once you've registered successfully.

3. Set up your personal certificate on a UI server

A grid "user interface" just means a server which has the grid glite middleware installed, allowing you to submit jobs to the grid and retrieve the results. I used the portal UI server, which runs Scientific Linux. Once you have a shell account (for ssh login), follow the process outlined in the screencast to create your certificate files and copy them to the UI server.


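For reference, the certificate layout that the glite tools expect on the UI server looks like the following. This is a sketch using placeholder files so it can be run anywhere; in practice the two PEM files are the ones converted from your personal certificate and copied over with scp.

```shell
#!/bin/sh
# Certificate layout expected by the glite middleware on a UI server.
# Placeholder files are created here for illustration; in practice you
# would scp your real usercert.pem and userkey.pem into place.
mkdir -p "$HOME/.globus"
touch "$HOME/.globus/usercert.pem" "$HOME/.globus/userkey.pem"   # placeholders
chmod 644 "$HOME/.globus/usercert.pem"   # certificate may be world-readable
chmod 400 "$HOME/.globus/userkey.pem"    # private key: owner read-only
```

The restrictive permissions on the key are not optional: the middleware refuses to use a key file that other users could read.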
Cryptic clue: if you installed your personal certificate in Firefox on MacOS, you can export it through

Firefox / Preferences / Advanced / Encryption / View Certificates / Backup

which will save a certificate file in PKCS12 format (usually with a .p12 extension). You can convert this to the PEM format required by the glite middleware using openssl, as helpfully described by the NCSA's Useful SSL Commands.
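The openssl invocations look something like this. To keep the example self-contained, a throwaway key and self-signed certificate are generated and bundled as a stand-in for the Firefox export; in practice you would start from your exported .p12 file, and omit -nodes so the key stays passphrase-protected.

```shell
#!/bin/sh
set -e
# Illustration only: create a throwaway key and self-signed certificate,
# then bundle them as PKCS12 to stand in for the Firefox export.
openssl req -x509 -newkey rsa:2048 -keyout key.pem -out cert.pem \
    -days 1 -nodes -subj "/CN=demo"
openssl pkcs12 -export -in cert.pem -inkey key.pem -out mycert.p12 -passout pass:demo

# The actual conversion to the PEM pair glite expects:
# first the certificate only (no key), then the key only (no certs).
openssl pkcs12 -in mycert.p12 -clcerts -nokeys -out usercert.pem -passin pass:demo
openssl pkcs12 -in mycert.p12 -nocerts -nodes -out userkey.pem -passin pass:demo
chmod 644 usercert.pem
chmod 400 userkey.pem
```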

4. Initialize your grid and VOMS proxy credentials

This sets up your authorization to submit jobs on the grid for the next 12 hours: 

voms-proxy-init -voms sagrid

(If you have a job which will take longer than that, you need a further proxy authentication step.)

5. Build the toolkit and create a script to execute it with the right data

If the application you want to run is not installed on the servers which will execute your job, then you need to build it on a similar platform and include it in your job.

In my case, I built mitlm from source and then created a tar bundle, mitlm.tgz, containing the executable and its libraries.


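The bundle can be assembled along these lines. This is a sketch with a placeholder standing in for the built binary; the usr/bin path matches how the wrapper script below invokes the toolkit, and any shared libraries it needs would go under usr/lib.

```shell
#!/bin/sh
set -e
# Sketch: package a locally built toolkit for shipping with a grid job.
# The placeholder below stands in for the real estimate-ngram binary.
mkdir -p usr/bin usr/lib
touch usr/bin/estimate-ngram        # placeholder for the built executable
tar czf mitlm.tgz usr
tar tzf mitlm.tgz                   # check the bundle contents
```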
A wrapper script then unpacks the app, fetches the data set, runs the toolkit, and compresses the results:

#! /bin/sh

# Unpack the mitlm toolkit
tar zxf mitlm.tgz

# Get our large data set
wget --quiet --no-proxy

# Run the LM toolkit

usr/bin/estimate-ngram -text enwiki-sentences.corpus.bz2 -vocab enwiki-500K-cmu-combined.txt.bz2 -wl wiki.lm

# Compress the resulting LM file
bzip2 wiki.lm

6. Configure and submit the job

With the script set to go, all that remains is to create a Job Description Language (JDL) file for the job and submit it. For the mitlm task above, the lm-big.jdl file contains:

Executable = "";
Arguments = "";
StdOutput = "std.out";
StdError = "std.err";
InputSandbox = { "", "mitlm.tgz", "enwiki-500K-cmu-combined.txt.bz2" };
OutputSandbox = { "std.out", "std.err", "wiki.lm.bz2" };
Requirements = other.GlueCEUniqueID=="";

Small files are sent along with the job in the InputSandbox (here they are located on the portal UI server in the same directory as the JDL file). Large data sets are retrieved separately from some location by the wrapper script. In this case the script does a simple wget from a local server, as an alternative to using grid storage services. The OutputSandbox defines which files will get returned as part of the job output, in this case stdout and stderr, and the resulting language model file.

For this job, I defined a particular computing element on which the job should run (a local cluster) using Requirements. This is to ensure that the process executes on worker nodes which have sufficient memory, and as the input and output data sets are relatively large (approx 6G and 12G), it also helps to keep file transfers on a fast network.

To submit the job, simply run:

glite-wms-job-submit -a -o lm-big.jdl

which saves the resulting job identifier into the file specified with the -o option.

7. Get the results

To check on the status of the job, run:

glite-wms-job-status -i

and to retrieve the results and output (i.e. fetch the files defined in the OutputSandbox):

glite-wms-job-output --dir ./results -i


This particular job used around 16G of working memory and took 1 hour to execute. The resulting language model is around 2.6G in ARPA format after bzip2 compression.

A follow-up job evaluated the perplexity of the model against two reference documents (although with mitlm one could in fact do this at the same time as creating the model).
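The evaluation step in that follow-up job looks something like the following sketch, using mitlm's evaluate-ngram tool. The reference filenames here are hypothetical, and the exact option names should be checked against your mitlm build.

```shell
#!/bin/sh
# Sketch: evaluate an existing ARPA-format model against reference texts
# (filenames are hypothetical placeholders).
usr/bin/evaluate-ngram -lm wiki.lm -eval-perp reference1.txt reference2.txt
```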

With most of the hard work done, it is now easy to put those grid computing resources to work running multiple variants of the job, for example to evaluate the perplexity of models of different sizes.


  1. This makes me weep with joy. please keep up the good work and let us know if we can do anything to make your life a pleasure.

  2. Thanks for the tutorials and thoughts. By any chance, do you have your language model available for download anywhere?