Truly Madly Wordly: Open source language modelling and speech recognition<br />
Stephen Marquard<br />
<br />
<b><span style="font-size: large;">Improving searchability of automatically transcribed lectures through dynamic language modelling</span></b> (19 December 2012)<br />
<br />
Here is the abstract of my recently completed Master's dissertation on <a href="http://pubs.cs.uct.ac.za/archive/00000846/" target="_blank">topic-specific language modelling with Wikipedia</a> to improve the accuracy of lecture recording transcriptions:<br />
<blockquote class="tr_bq">
Recording university lectures through lecture capture systems is
increasingly common. However, a single continuous audio recording is
often unhelpful for users, who may wish to navigate quickly to a
particular part of a lecture, or locate a specific lecture within a set
of recordings.<br />
<br />
A transcript of the recording can enable faster
navigation and searching. Automatic speech recognition (ASR)
technologies may be used to create automated transcripts, to avoid the
significant time and cost involved in manual transcription.<br />
<br />
Low
accuracy of ASR-generated transcripts may however limit their
usefulness. In particular, ASR systems optimized for general speech
recognition may not recognize the many technical or discipline-specific
words occurring in university lectures. To improve the usefulness of ASR
transcripts for the purposes of information retrieval (search) and
navigating within recordings, the lexicon and language model used by the
ASR engine may be dynamically adapted for the topic of each lecture.<br />
<br />
A
prototype is presented which uses the English Wikipedia as a
semantically dense, large language corpus to generate a custom lexicon
and language model for each lecture from a small set of keywords. Two
strategies for extracting a topic-specific subset of Wikipedia articles
are investigated: a naïve crawler which follows all article links from a
set of seed articles produced by a Wikipedia search from the initial
keywords, and a refinement which follows only links to articles
sufficiently similar to the parent article. Pair-wise article similarity
is computed from a pre-computed vector space model of Wikipedia article
term scores generated using latent semantic indexing.<br />
<br />
The CMU
Sphinx4 ASR engine is used to generate transcripts from thirteen
recorded lectures from Open Yale Courses, using the English HUB4
language model as a reference and the two topic-specific language models
generated for each lecture from Wikipedia.<br />
<br />
Three standard
metrics – Perplexity, Word Error Rate and Word Correct Rate – are used
to evaluate the extent to which the adapted language models improve the
searchability of the resulting transcripts, and in particular improve
the recognition of specialist words. Ranked Word Correct Rate is
proposed as a new metric better aligned with the goals of improving
transcript searchability and specialist word recognition.<br />
<br />
Analysis
of recognition performance shows that the language models derived using
the similarity-based Wikipedia crawler outperform models created using
the naïve crawler, and that transcripts using similarity-based language
models have better perplexity and Ranked Word Correct Rate scores than
those created using the HUB4 language model, but worse Word Error Rates.<br />
<br />
It
is concluded that English Wikipedia may successfully be used as a
language resource for unsupervised topic adaptation of language models
to improve recognition performance for better searchability of lecture
recording transcripts, although possibly at the expense of other
attributes such as readability.</blockquote>
Links to the open source software toolkits, data sets and custom-written code used for the analysis, including the <a href="http://source.cet.uct.ac.za/svn/people/smarquard/wikicrawler/trunk/" target="_blank">Wikipedia Similarity Crawler</a>, are included in Appendices 2 and 3.<br />
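The similarity-gated crawl strategy described in the abstract can be sketched as follows. This is a minimal illustration, not the actual Wikipedia Similarity Crawler code: <code>links</code> and <code>vectors</code> are hypothetical stand-ins for the article link graph and the precomputed LSI article vectors, and setting the threshold to zero recovers the naïve crawler.

```python
from collections import deque

import numpy as np

def similarity_crawl(seeds, links, vectors, threshold=0.5, max_articles=1000):
    """Breadth-first crawl over article titles, following a link only when
    the target article's vector is sufficiently similar to its parent's.

    links:   dict mapping article title -> list of linked article titles
    vectors: dict mapping article title -> 1-D numpy array (LSI term scores)
    """
    def cosine(a, b):
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return float(a @ b / denom) if denom else 0.0

    collected = set(seeds)
    queue = deque(seeds)
    while queue and len(collected) < max_articles:
        parent = queue.popleft()
        for child in links.get(parent, []):
            if child in collected or child not in vectors:
                continue
            # Refined strategy: gate each link on parent/child similarity.
            # The naive crawler corresponds to threshold = 0.0.
            if cosine(vectors[parent], vectors[child]) >= threshold:
                collected.add(child)
                queue.append(child)
    return collected
```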
<br />
<b><span style="font-size: large;">Sphinx4 speech recognition results for selected lectures from Open Yale Courses</span></b> (18 December 2011)<br />
<br />
<span style="font-size: large;">About this dataset</span><br />
Data on the performance of large-vocabulary, continuous speech recognition engines in real contexts is sometimes hard to find. This dataset describes the performance of the <a href="http://cmusphinx.sourceforge.net/" target="_blank">CMU Sphinx Speech Recognition Toolkit</a> (specifically the Sphinx4 Java implementation) in recognizing selected lectures from <a href="http://oyc.yale.edu/" target="_blank">Open Yale Courses</a> using the HUB4 acoustic and language models.<br />
<br />
This is the same data summarized on Slide 10 in <a href="http://trulymadlywordly.blogspot.com/2011/06/open-text-speech-recognition-in.html" target="_blank">Speech Recognition in Opencast Matterhorn</a> and forms part of a larger research project on adapting language models to improve the searchability of automated transcriptions of recorded lectures.<br />
<br />
<span style="font-size: large;">Source material</span><br />
The audio files are mp3 recordings which form part of <a href="http://oyc.yale.edu/" target="_blank">Open Yale Courses</a> (OYC), which helpfully includes both transcripts and a <a href="http://oyc.yale.edu/terms-of-use" target="_blank">research-friendly Creative Commons license</a> (CC-BY-NC-SA) permitting reuse and derivative works.<br />
<br />
<span style="font-size: small;">The 13 recordings were selected for audio quality, reasonable match of the speaker accent to a North American English accent (presumed to align reasonably with the acoustic model), a variety of speakers and topics, consistent length (around 50min each), and to primarily consist of a single speaker (i.e. minimal audience involvement). Of the 13 lectures, 11 are by male speakers and 2 by female speakers.</span><br />
<br />
<span style="font-size: small;">The transcripts provided by OYC have been normalized to a single continuous set of words without punctuation or linebreaks for calculating speech recognition accuracy, and are also <a href="http://source.cet.uct.ac.za/svn/people/smarquard/datasets/oyc-transcripts/" target="_blank">provided as a separate dataset</a>.</span><br />
<br />
<span style="font-size: large;">Sphinx4 Configuration</span><br />
The Sphinx4 <a href="http://source.cet.uct.ac.za/svn/people/smarquard/datasets/sphinx4-hub4-oyc/sphinx-custom.xml" target="_blank">configuration</a> is for large-vocabulary, continuous speech recognition, using the <a href="https://sourceforge.net/projects/cmusphinx/files/Acoustic%20and%20Language%20Models/US%20English%20HUB4%20Acoustic%20Model/" target="_blank">HUB4 US English Acoustic Model</a>, <a href="https://sourceforge.net/projects/cmusphinx/files/Acoustic%20and%20Language%20Models/US%20English%20HUB4%20Language%20Model/" target="_blank">HUB4 Trigram Language Model</a> and <a href="https://cmusphinx.svn.sourceforge.net/svnroot/cmusphinx/trunk/cmudict/sphinxdict/cmudict.0.7a_SPHINX_40" target="_blank">CMUdict 0.7a</a> dictionary. The HUB4 vocabulary contains 64,000 terms and, even in the worst case below, covers just over 95% of the words in the transcripts (though little of the specialist vocabulary, which is a separate issue).<br />
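Given that lexicon, the out-of-vocabulary figures reported below are straightforward to compute. A sketch, with a toy word list standing in for the 64,000-term HUB4 vocabulary:

```python
def oov_stats(transcript_words, vocabulary):
    """Count transcript tokens whose word form is not in the recognizer's
    vocabulary, and return the OOV rate as a fraction of all tokens."""
    vocab = {w.lower() for w in vocabulary}
    oov = [w for w in transcript_words if w.lower() not in vocab]
    return len(oov), len(oov) / len(transcript_words)
```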
<br />
There are many ways to adjust Sphinx's configuration depending on the task at hand, and the configuration used here may not be optimal, though experimenting with settings such as beam width and word insertion probability did not have a significant effect on accuracy.<br />
<br />
<span style="font-size: large;">Results</span><br />
<br />
In the table below, click on the title to go to the OYC page for the lecture (which includes links to audio and transcript), and click on the WER to see the Sphinx recognition output.<br />
<br />
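For reference, the perplexity column measures how well the language model predicts the reference transcript. A minimal sketch of the computation from per-word log10 probabilities, as an n-gram toolkit would report them:

```python
def perplexity(word_log_probs):
    """Perplexity of a word sequence given per-word log10 probabilities
    from the language model: 10 ** (-(1/N) * sum(log10 p))."""
    n = len(word_log_probs)
    return 10 ** (-sum(word_log_probs) / n)
```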
<table border="0" cellpadding="0" cellspacing="0" class="MsoNormalTable" style="border-collapse: collapse; margin-left: 4.75pt; width: 427px;">
<tbody>
<tr style="height: 52.0pt; mso-yfti-firstrow: yes; mso-yfti-irow: 0;">
<td style="border: 1pt solid windowtext; height: 52pt; padding: 0cm 5.4pt; width: 130.1pt;" valign="bottom" width="130"><div class="MsoNormal">
<b><span style="font-size: x-small;"><span lang="EN-GB">Lecture</span></span></b></div>
</td>
<td style="-moz-border-bottom-colors: none; -moz-border-image: none; -moz-border-left-colors: none; -moz-border-right-colors: none; -moz-border-top-colors: none; border-color: windowtext windowtext windowtext -moz-use-text-color; border-style: solid solid solid none; border-width: 1pt 1pt 1pt medium; height: 52pt; padding: 0cm 5.4pt; width: 55pt;" valign="bottom" width="55"><div align="center" class="MsoNormal" style="text-align: center;">
<b><span style="font-size: x-small;"> Words </span></b></div>
</td>
<td style="-moz-border-bottom-colors: none; -moz-border-image: none; -moz-border-left-colors: none; -moz-border-right-colors: none; -moz-border-top-colors: none; border-color: windowtext windowtext windowtext -moz-use-text-color; border-style: solid solid solid none; border-width: 1pt 1pt 1pt medium; height: 52pt; padding: 0cm 5.4pt; width: 47.65pt;" valign="bottom" width="48"><div align="center" class="MsoNormal" style="text-align: center;">
<b><span style="font-size: x-small;">Word
Error Rate (WER)</span></b></div>
</td>
<td style="-moz-border-bottom-colors: none; -moz-border-image: none; -moz-border-left-colors: none; -moz-border-right-colors: none; -moz-border-top-colors: none; border-color: windowtext windowtext windowtext -moz-use-text-color; border-style: solid solid solid none; border-width: 1pt 1pt 1pt medium; height: 52pt; padding: 0cm 5.4pt; width: 71.35pt;" valign="bottom" width="71"><div align="center" class="MsoNormal" style="text-align: center;">
<b><span style="font-size: x-small;"> Perplexity (sentence transcript) </span></b></div>
</td>
<td style="-moz-border-bottom-colors: none; -moz-border-image: none; -moz-border-left-colors: none; -moz-border-right-colors: none; -moz-border-top-colors: none; border-color: windowtext windowtext windowtext -moz-use-text-color; border-style: solid solid solid none; border-width: 1pt 1pt 1pt medium; height: 52pt; padding: 0cm 5.4pt; width: 72.45pt;" valign="bottom" width="72"><div align="center" class="MsoNormal" style="text-align: center;">
<b><span style="font-size: x-small;"> Out of vocabulary (OOV) words </span></b></div>
</td>
<td style="-moz-border-bottom-colors: none; -moz-border-image: none; -moz-border-left-colors: none; -moz-border-right-colors: none; -moz-border-top-colors: none; border-color: windowtext windowtext windowtext -moz-use-text-color; border-style: solid solid solid none; border-width: 1pt 1pt 1pt medium; height: 52pt; padding: 0cm 5.4pt; width: 49.95pt;" valign="bottom" width="50"><div align="center" class="MsoNormal" style="text-align: center;">
<b><span style="font-size: x-small;"> OOV % </span></b></div>
</td>
</tr>
<tr style="height: 21pt;">
<td nowrap="nowrap" style="-moz-border-bottom-colors: none; -moz-border-image: none; -moz-border-left-colors: none; -moz-border-right-colors: none; -moz-border-top-colors: none; border-color: -moz-use-text-color windowtext windowtext; border-style: none solid solid; border-width: medium 1pt 1pt; height: 21pt; padding: 0cm 5.4pt; width: 130.1pt;" valign="top" width="130"><div class="MsoNormal">
<span style="font-size: x-small;"><a href="http://oyc.yale.edu/astronomy/frontiers-and-controversies-in-astrophysics/content/sessions/lecture21.html" target="_blank"><span lang="EN-GB">Dark Energy and the Accelerating Universe and the Big Rip</span></a></span></div>
</td>
<td nowrap="nowrap" style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 21pt; padding: 0cm 5.4pt; width: 55pt;" width="55"><div align="center" class="MsoNormal" style="text-align: center;">
<span style="font-size: x-small;">6704</span></div>
</td>
<td nowrap="nowrap" style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 21pt; padding: 0cm 5.4pt; width: 47.65pt;" width="48"><div align="center" class="MsoNormal" style="text-align: center;">
<span style="font-size: x-small;"><a href="http://source.cet.uct.ac.za/svn/people/smarquard/datasets/sphinx4-hub4-oyc/astr160_21_041707-output.txt" target="_blank">35%</a></span></div>
</td>
<td nowrap="nowrap" style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 21pt; padding: 0cm 5.4pt; width: 71.35pt;" width="71"><div align="center" class="MsoNormal" style="text-align: center;">
<span style="font-size: x-small;"> 228 </span></div>
</td>
<td nowrap="nowrap" style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 21pt; padding: 0cm 5.4pt; width: 72.45pt;" width="72"><div align="center" class="MsoNormal" style="text-align: center;">
<span style="font-size: x-small;"> 110 </span></div>
</td>
<td nowrap="nowrap" style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 21pt; padding: 0cm 5.4pt; width: 49.95pt;" width="50"><div align="center" class="MsoNormal" style="text-align: center;">
<span style="font-size: x-small;">1.6%</span></div>
</td>
</tr>
<tr style="height: 21pt;">
<td nowrap="nowrap" style="-moz-border-bottom-colors: none; -moz-border-image: none; -moz-border-left-colors: none; -moz-border-right-colors: none; -moz-border-top-colors: none; border-color: -moz-use-text-color windowtext windowtext; border-style: none solid solid; border-width: medium 1pt 1pt; height: 21pt; padding: 0cm 5.4pt; width: 130.1pt;" valign="top" width="130"><div class="MsoNormal">
<span style="font-size: x-small;"><a href="http://oyc.yale.edu/biomedical-engineering/frontiers-in-biomedical-engineering/content/sessions/session-5-cell-culture-engineering" target="_blank"><span lang="EN-GB">Cell Culture Engineering</span></a></span></div>
</td>
<td nowrap="nowrap" style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 21pt; padding: 0cm 5.4pt; width: 55pt;" width="55"><div align="center" class="MsoNormal" style="text-align: center;">
<span style="font-size: x-small;">7385</span></div>
</td>
<td nowrap="nowrap" style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 21pt; padding: 0cm 5.4pt; width: 47.65pt;" width="48"><div align="center" class="MsoNormal" style="text-align: center;">
<span style="font-size: x-small;"><a href="http://source.cet.uct.ac.za/svn/people/smarquard/datasets/sphinx4-hub4-oyc/beng100_05_012908-output.txt" target="_blank">35%</a></span></div>
</td>
<td nowrap="nowrap" style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 21pt; padding: 0cm 5.4pt; width: 71.35pt;" width="71"><div align="center" class="MsoNormal" style="text-align: center;">
<span style="font-size: x-small;"> 307 </span></div>
</td>
<td nowrap="nowrap" style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 21pt; padding: 0cm 5.4pt; width: 72.45pt;" width="72"><div align="center" class="MsoNormal" style="text-align: center;">
<span style="font-size: x-small;"> 164 </span></div>
</td>
<td nowrap="nowrap" style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 21pt; padding: 0cm 5.4pt; width: 49.95pt;" width="50"><div align="center" class="MsoNormal" style="text-align: center;">
<span style="font-size: x-small;">2.2%</span></div>
</td>
</tr>
<tr style="height: 21pt;">
<td nowrap="nowrap" style="-moz-border-bottom-colors: none; -moz-border-image: none; -moz-border-left-colors: none; -moz-border-right-colors: none; -moz-border-top-colors: none; border-color: -moz-use-text-color windowtext windowtext; border-style: none solid solid; border-width: medium 1pt 1pt; height: 21pt; padding: 0cm 5.4pt; width: 130.1pt;" valign="top" width="130"><div class="MsoNormal">
<span style="font-size: x-small;"><a href="http://oyc.yale.edu/biomedical-engineering/frontiers-in-biomedical-engineering/content/sessions/session-9-biomolecular-engineering-engineering-of" target="_blank"><span lang="EN-GB">Biomolecular Engineering: Engineering of Immunity</span></a></span></div>
</td>
<td nowrap="nowrap" style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 21pt; padding: 0cm 5.4pt; width: 55pt;" width="55"><div align="center" class="MsoNormal" style="text-align: center;">
<span style="font-size: x-small;">6974</span></div>
</td>
<td nowrap="nowrap" style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 21pt; padding: 0cm 5.4pt; width: 47.65pt;" width="48"><div align="center" class="MsoNormal" style="text-align: center;">
<span style="font-size: x-small;"><a href="http://source.cet.uct.ac.za/svn/people/smarquard/datasets/sphinx4-hub4-oyc/beng100_09_021208-output.txt" target="_blank">32%</a></span></div>
</td>
<td nowrap="nowrap" style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 21pt; padding: 0cm 5.4pt; width: 71.35pt;" width="71"><div align="center" class="MsoNormal" style="text-align: center;">
<span style="font-size: x-small;"> 211 </span></div>
</td>
<td nowrap="nowrap" style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 21pt; padding: 0cm 5.4pt; width: 72.45pt;" width="72"><div align="center" class="MsoNormal" style="text-align: center;">
<span style="font-size: x-small;"> 96 </span></div>
</td>
<td nowrap="nowrap" style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 21pt; padding: 0cm 5.4pt; width: 49.95pt;" width="50"><div align="center" class="MsoNormal" style="text-align: center;">
<span style="font-size: x-small;">1.4%</span></div>
</td>
</tr>
<tr style="height: 21pt;">
<td nowrap="nowrap" style="-moz-border-bottom-colors: none; -moz-border-image: none; -moz-border-left-colors: none; -moz-border-right-colors: none; -moz-border-top-colors: none; border-color: -moz-use-text-color windowtext windowtext; border-style: none solid solid; border-width: medium 1pt 1pt; height: 21pt; padding: 0cm 5.4pt; width: 130.1pt;" valign="top" width="130"><div class="MsoNormal">
<span style="font-size: x-small;"><a href="http://oyc.yale.edu/ecology-and-evolutionary-biology/principles-of-evolution-ecology-and-behavior/content/sessions/lecture34.html" target="_blank"><span lang="EN-GB">Mating Systems and Parental Care</span></a></span></div>
</td>
<td nowrap="nowrap" style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 21pt; padding: 0cm 5.4pt; width: 55pt;" width="55"><div align="center" class="MsoNormal" style="text-align: center;">
<span style="font-size: x-small;">5795</span></div>
</td>
<td nowrap="nowrap" style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 21pt; padding: 0cm 5.4pt; width: 47.65pt;" width="48"><div align="center" class="MsoNormal" style="text-align: center;">
<span style="font-size: x-small;"><a href="http://source.cet.uct.ac.za/svn/people/smarquard/datasets/sphinx4-hub4-oyc/eeb122_34_042009-output.txt" target="_blank">40%</a></span></div>
</td>
<td nowrap="nowrap" style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 21pt; padding: 0cm 5.4pt; width: 71.35pt;" width="71"><div align="center" class="MsoNormal" style="text-align: center;">
<span style="font-size: x-small;"> 331 </span></div>
</td>
<td nowrap="nowrap" style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 21pt; padding: 0cm 5.4pt; width: 72.45pt;" width="72"><div align="center" class="MsoNormal" style="text-align: center;">
<span style="font-size: x-small;"> 145 </span></div>
</td>
<td nowrap="nowrap" style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 21pt; padding: 0cm 5.4pt; width: 49.95pt;" width="50"><div align="center" class="MsoNormal" style="text-align: center;">
<span style="font-size: x-small;">2.5%</span></div>
</td>
</tr>
<tr style="height: 21pt;">
<td nowrap="nowrap" style="-moz-border-bottom-colors: none; -moz-border-image: none; -moz-border-left-colors: none; -moz-border-right-colors: none; -moz-border-top-colors: none; border-color: -moz-use-text-color windowtext windowtext; border-style: none solid solid; border-width: medium 1pt 1pt; height: 21pt; padding: 0cm 5.4pt; width: 130.1pt;" valign="top" width="130"><div class="MsoNormal">
<span style="font-size: x-small;"><a href="http://oyc.yale.edu/english/milton/content/sessions/session-6-lycidas" target="_blank"><span lang="EN-GB">Lycidas</span></a></span></div>
</td>
<td nowrap="nowrap" style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 21pt; padding: 0cm 5.4pt; width: 55pt;" width="55"><div align="center" class="MsoNormal" style="text-align: center;">
<span style="font-size: x-small;">7350</span></div>
</td>
<td nowrap="nowrap" style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 21pt; padding: 0cm 5.4pt; width: 47.65pt;" width="48"><div align="center" class="MsoNormal" style="text-align: center;">
<span style="font-size: x-small;"><a href="http://source.cet.uct.ac.za/svn/people/smarquard/datasets/sphinx4-hub4-oyc/engl220_06_092407-output.txt" target="_blank">43%</a></span></div>
</td>
<td nowrap="nowrap" style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 21pt; padding: 0cm 5.4pt; width: 71.35pt;" width="71"><div align="center" class="MsoNormal" style="text-align: center;">
<span style="font-size: x-small;"> 535 </span></div>
</td>
<td nowrap="nowrap" style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 21pt; padding: 0cm 5.4pt; width: 72.45pt;" width="72"><div align="center" class="MsoNormal" style="text-align: center;">
<span style="font-size: x-small;"> 314 </span></div>
</td>
<td nowrap="nowrap" style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 21pt; padding: 0cm 5.4pt; width: 49.95pt;" width="50"><div align="center" class="MsoNormal" style="text-align: center;">
<span style="font-size: x-small;">4.3%</span></div>
</td>
</tr>
<tr style="height: 21pt;">
<td nowrap="nowrap" style="-moz-border-bottom-colors: none; -moz-border-image: none; -moz-border-left-colors: none; -moz-border-right-colors: none; -moz-border-top-colors: none; border-color: -moz-use-text-color windowtext windowtext; border-style: none solid solid; border-width: medium 1pt 1pt; height: 21pt; padding: 0cm 5.4pt; width: 130.1pt;" valign="top" width="130"><div class="MsoNormal">
<span style="font-size: x-small;"><a href="http://oyc.yale.edu/english/american-novel-since-1945/content/sessions/session-12-thomas-pynchon-the-crying-of-lot-49" target="_blank"><span lang="EN-GB">Thomas Pynchon, The Crying of Lot 49</span></a></span></div>
</td>
<td nowrap="nowrap" style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 21pt; padding: 0cm 5.4pt; width: 55pt;" width="55"><div align="center" class="MsoNormal" style="text-align: center;">
<span style="font-size: x-small;">6201</span></div>
</td>
<td nowrap="nowrap" style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 21pt; padding: 0cm 5.4pt; width: 47.65pt;" width="48"><div align="center" class="MsoNormal" style="text-align: center;">
<span style="font-size: x-small;"><a href="http://source.cet.uct.ac.za/svn/people/smarquard/datasets/sphinx4-hub4-oyc/engl291_12_022008-output.txt" target="_blank">32%</a></span></div>
</td>
<td nowrap="nowrap" style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 21pt; padding: 0cm 5.4pt; width: 71.35pt;" width="71"><div align="center" class="MsoNormal" style="text-align: center;">
<span style="font-size: x-small;"> 379 </span></div>
</td>
<td nowrap="nowrap" style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 21pt; padding: 0cm 5.4pt; width: 72.45pt;" width="72"><div align="center" class="MsoNormal" style="text-align: center;">
<span style="font-size: x-small;"> 174 </span></div>
</td>
<td nowrap="nowrap" style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 21pt; padding: 0cm 5.4pt; width: 49.95pt;" width="50"><div align="center" class="MsoNormal" style="text-align: center;">
<span style="font-size: x-small;">2.8%</span></div>
</td>
</tr>
<tr style="height: 21pt;">
<td nowrap="nowrap" style="-moz-border-bottom-colors: none; -moz-border-image: none; -moz-border-left-colors: none; -moz-border-right-colors: none; -moz-border-top-colors: none; border-color: -moz-use-text-color windowtext windowtext; border-style: none solid solid; border-width: medium 1pt 1pt; height: 21pt; padding: 0cm 5.4pt; width: 130.1pt;" valign="top" width="130"><div class="MsoNormal">
<span style="font-size: x-small;"><a href="http://oyc.yale.edu/english/introduction-to-theory-of-literature/content/sessions/lecture15.html" target="_blank"><span lang="EN-GB">The Postmodern Psyches</span></a></span></div>
</td>
<td nowrap="nowrap" style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 21pt; padding: 0cm 5.4pt; width: 55pt;" width="55"><div align="center" class="MsoNormal" style="text-align: center;">
<span style="font-size: x-small;">6701</span></div>
</td>
<td nowrap="nowrap" style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 21pt; padding: 0cm 5.4pt; width: 47.65pt;" width="48"><div align="center" class="MsoNormal" style="text-align: center;">
<span style="font-size: x-small;"><a href="http://source.cet.uct.ac.za/svn/people/smarquard/datasets/sphinx4-hub4-oyc/engl300_15_030309-output.txt" target="_blank">47%</a></span></div>
</td>
<td nowrap="nowrap" style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 21pt; padding: 0cm 5.4pt; width: 71.35pt;" width="71"><div align="center" class="MsoNormal" style="text-align: center;">
<span style="font-size: x-small;"> 274 </span></div>
</td>
<td nowrap="nowrap" style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 21pt; padding: 0cm 5.4pt; width: 72.45pt;" width="72"><div align="center" class="MsoNormal" style="text-align: center;">
<span style="font-size: x-small;"> 265 </span></div>
</td>
<td nowrap="nowrap" style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 21pt; padding: 0cm 5.4pt; width: 49.95pt;" width="50"><div align="center" class="MsoNormal" style="text-align: center;">
<span style="font-size: x-small;">4.0%</span></div>
</td>
</tr>
<tr style="height: 21pt;">
<td nowrap="nowrap" style="-moz-border-bottom-colors: none; -moz-border-image: none; -moz-border-left-colors: none; -moz-border-right-colors: none; -moz-border-top-colors: none; border-color: -moz-use-text-color windowtext windowtext; border-style: none solid solid; border-width: medium 1pt 1pt; height: 21pt; padding: 0cm 5.4pt; width: 130.1pt;" valign="top" width="130"><div class="MsoNormal">
<span style="font-size: x-small;"><a href="http://oyc.yale.edu/history/the-american-revolution/content/sessions/lecture08.html" target="_blank"><span lang="EN-GB">The Logic of Resistance</span></a></span></div>
</td>
<td nowrap="nowrap" style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 21pt; padding: 0cm 5.4pt; width: 55pt;" width="55"><div align="center" class="MsoNormal" style="text-align: center;">
<span style="font-size: x-small;">7902</span></div>
</td>
<td nowrap="nowrap" style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 21pt; padding: 0cm 5.4pt; width: 47.65pt;" width="48"><div align="center" class="MsoNormal" style="text-align: center;">
<span style="font-size: x-small;"><a href="http://source.cet.uct.ac.za/svn/people/smarquard/datasets/sphinx4-hub4-oyc/hist116_08_020410-output.txt" target="_blank">40%</a></span></div>
</td>
<td nowrap="nowrap" style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 21pt; padding: 0cm 5.4pt; width: 71.35pt;" width="71"><div align="center" class="MsoNormal" style="text-align: center;">
<span style="font-size: x-small;"> 309 </span></div>
</td>
<td nowrap="nowrap" style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 21pt; padding: 0cm 5.4pt; width: 72.45pt;" width="72"><div align="center" class="MsoNormal" style="text-align: center;">
<span style="font-size: x-small;"> 74 </span></div>
</td>
<td nowrap="nowrap" style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 21pt; padding: 0cm 5.4pt; width: 49.95pt;" width="50"><div align="center" class="MsoNormal" style="text-align: center;">
<span style="font-size: x-small;">0.9%</span></div>
</td>
</tr>
<tr style="height: 21pt;">
<td nowrap="nowrap" style="-moz-border-bottom-colors: none; -moz-border-image: none; -moz-border-left-colors: none; -moz-border-right-colors: none; -moz-border-top-colors: none; border-color: -moz-use-text-color windowtext windowtext; border-style: none solid solid; border-width: medium 1pt 1pt; height: 21pt; padding: 0cm 5.4pt; width: 130.1pt;" valign="top" width="130"><div class="MsoNormal">
<span style="font-size: x-small;"><a href="http://oyc.yale.edu/history/european-civilization-1648-1945/content/sessions/lecture06.html" target="_blank"><span lang="EN-GB">Maximilien Robespierre and the French Revolution</span></a></span></div>
</td>
<td nowrap="nowrap" style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 21pt; padding: 0cm 5.4pt; width: 55pt;" width="55"><div align="center" class="MsoNormal" style="text-align: center;">
<span style="font-size: x-small;">6643</span></div>
</td>
<td nowrap="nowrap" style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 21pt; padding: 0cm 5.4pt; width: 47.65pt;" width="48"><div align="center" class="MsoNormal" style="text-align: center;">
<span style="font-size: x-small;"><a href="http://source.cet.uct.ac.za/svn/people/smarquard/datasets/sphinx4-hub4-oyc/hist202_06_092208-output.txt" target="_blank">61%</a></span></div>
</td>
<td nowrap="nowrap" style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 21pt; padding: 0cm 5.4pt; width: 71.35pt;" width="71"><div align="center" class="MsoNormal" style="text-align: center;">
<span style="font-size: x-small;"> 252 </span></div>
</td>
<td nowrap="nowrap" style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 21pt; padding: 0cm 5.4pt; width: 72.45pt;" width="72"><div align="center" class="MsoNormal" style="text-align: center;">
<span style="font-size: x-small;"> 212 </span></div>
</td>
<td nowrap="nowrap" style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 21pt; padding: 0cm 5.4pt; width: 49.95pt;" width="50"><div align="center" class="MsoNormal" style="text-align: center;">
<span style="font-size: x-small;">3.2%</span></div>
</td>
</tr>
<tr style="height: 21pt;">
<td nowrap="nowrap" style="-moz-border-bottom-colors: none; -moz-border-image: none; -moz-border-left-colors: none; -moz-border-right-colors: none; -moz-border-top-colors: none; border-color: -moz-use-text-color windowtext windowtext; border-style: none solid solid; border-width: medium 1pt 1pt; height: 21pt; padding: 0cm 5.4pt; width: 130.1pt;" valign="top" width="130"><div class="MsoNormal">
<span style="font-size: x-small;"><a href="http://oyc.yale.edu/philosophy/death/content/sessions/lecture13.html" target="_blank"><span lang="EN-GB">Personal identity, Part IV; What matters?</span></a></span></div>
</td>
<td nowrap="nowrap" style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 21pt; padding: 0cm 5.4pt; width: 55pt;" width="55"><div align="center" class="MsoNormal" style="text-align: center;">
<span style="font-size: x-small;">6603</span></div>
</td>
<td nowrap="nowrap" style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 21pt; padding: 0cm 5.4pt; width: 47.65pt;" width="48"><div align="center" class="MsoNormal" style="text-align: center;">
<span style="font-size: x-small;"><a href="http://source.cet.uct.ac.za/svn/people/smarquard/datasets/sphinx4-hub4-oyc/phil176_13_022707-output.txt" target="_blank">49%</a></span></div>
</td>
<td nowrap="nowrap" style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 21pt; padding: 0cm 5.4pt; width: 71.35pt;" width="71"><div align="center" class="MsoNormal" style="text-align: center;">
<span style="font-size: x-small;"> 475 </span></div>
</td>
<td nowrap="nowrap" style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 21pt; padding: 0cm 5.4pt; width: 72.45pt;" width="72"><div align="center" class="MsoNormal" style="text-align: center;">
<span style="font-size: x-small;"> 97 </span></div>
</td>
<td nowrap="nowrap" style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 21pt; padding: 0cm 5.4pt; width: 49.95pt;" width="50"><div align="center" class="MsoNormal" style="text-align: center;">
<span style="font-size: x-small;">1.5%</span></div>
</td>
</tr>
<tr style="height: 21pt;">
<td nowrap="nowrap" style="-moz-border-bottom-colors: none; -moz-border-image: none; -moz-border-left-colors: none; -moz-border-right-colors: none; -moz-border-top-colors: none; border-color: -moz-use-text-color windowtext windowtext; border-style: none solid solid; border-width: medium 1pt 1pt; height: 21pt; padding: 0cm 5.4pt; width: 130.1pt;" valign="top" width="130"><div class="MsoNormal">
<span style="font-size: x-small;"><a href="http://oyc.yale.edu/political-science/introduction-to-political-philosophy/content/sessions/lecture02.html" target="_blank"><span lang="EN-GB">Socratic Citizenship: Plato, Apology</span></a></span></div>
</td>
<td nowrap="nowrap" style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 21pt; padding: 0cm 5.4pt; width: 55pt;" width="55"><div align="center" class="MsoNormal" style="text-align: center;">
<span style="font-size: x-small;">5473</span></div>
</td>
<td nowrap="nowrap" style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 21pt; padding: 0cm 5.4pt; width: 47.65pt;" width="48"><div align="center" class="MsoNormal" style="text-align: center;">
<span style="font-size: x-small;"><a href="http://source.cet.uct.ac.za/svn/people/smarquard/datasets/sphinx4-hub4-oyc/plsc114_02_091306-output.txt" target="_blank">32%</a></span></div>
</td>
<td nowrap="nowrap" style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 21pt; padding: 0cm 5.4pt; width: 71.35pt;" width="71"><div align="center" class="MsoNormal" style="text-align: center;">
<span style="font-size: x-small;"> 357 </span></div>
</td>
<td nowrap="nowrap" style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 21pt; padding: 0cm 5.4pt; width: 72.45pt;" width="72"><div align="center" class="MsoNormal" style="text-align: center;">
<span style="font-size: x-small;"> 103 </span></div>
</td>
<td nowrap="nowrap" style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 21pt; padding: 0cm 5.4pt; width: 49.95pt;" width="50"><div align="center" class="MsoNormal" style="text-align: center;">
<span style="font-size: x-small;">1.9%</span></div>
</td>
</tr>
<tr style="height: 21pt;">
<td nowrap="nowrap" style="-moz-border-bottom-colors: none; -moz-border-image: none; -moz-border-left-colors: none; -moz-border-right-colors: none; -moz-border-top-colors: none; border-color: -moz-use-text-color windowtext windowtext; border-style: none solid solid; border-width: medium 1pt 1pt; height: 21pt; padding: 0cm 5.4pt; width: 130.1pt;" valign="top" width="130"><div class="MsoNormal">
<span style="font-size: x-small;"><a href="http://oyc.yale.edu/yale/psychology/introduction-to-psychology/content/sessions/lecture05.html" target="_blank"><span lang="EN-GB">What Is It Like to Be a Baby: The Development of Thought</span></a></span></div>
</td>
<td nowrap="nowrap" style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 21pt; padding: 0cm 5.4pt; width: 55pt;" width="55"><div align="center" class="MsoNormal" style="text-align: center;">
<span style="font-size: x-small;">7085</span></div>
</td>
<td nowrap="nowrap" style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 21pt; padding: 0cm 5.4pt; width: 47.65pt;" width="48"><div align="center" class="MsoNormal" style="text-align: center;">
<span style="font-size: x-small;"><a href="http://source.cet.uct.ac.za/svn/people/smarquard/datasets/sphinx4-hub4-oyc/psyc110_05_013107-output.txt" target="_blank">42%</a></span></div>
</td>
<td nowrap="nowrap" style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 21pt; padding: 0cm 5.4pt; width: 71.35pt;" width="71"><div align="center" class="MsoNormal" style="text-align: center;">
<span style="font-size: x-small;"> 275 </span></div>
</td>
<td nowrap="nowrap" style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 21pt; padding: 0cm 5.4pt; width: 72.45pt;" width="72"><div align="center" class="MsoNormal" style="text-align: center;">
<span style="font-size: x-small;"> 119 </span></div>
</td>
<td nowrap="nowrap" style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 21pt; padding: 0cm 5.4pt; width: 49.95pt;" width="50"><div align="center" class="MsoNormal" style="text-align: center;">
<span style="font-size: x-small;">1.7%</span></div>
</td>
</tr>
<tr style="height: 21pt;">
<td nowrap="nowrap" style="-moz-border-bottom-colors: none; -moz-border-image: none; -moz-border-left-colors: none; -moz-border-right-colors: none; -moz-border-top-colors: none; border-color: -moz-use-text-color windowtext windowtext; border-style: none solid solid; border-width: medium 1pt 1pt; height: 21pt; padding: 0cm 5.4pt; width: 130.1pt;" valign="top" width="130"><div class="MsoNormal">
<span style="font-size: x-small;"><a href="http://oyc.yale.edu/religious-studies/introduction-to-new-testament/content/sessions/lecture26.html" target="_blank"><span lang="EN-GB">The "Afterlife" of the New Testament and Postmodern Interpretation</span></a></span></div>
</td>
<td nowrap="nowrap" style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 21pt; padding: 0cm 5.4pt; width: 55pt;" width="55"><div align="center" class="MsoNormal" style="text-align: center;">
<span style="font-size: x-small;">8196</span></div>
</td>
<td nowrap="nowrap" style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 21pt; padding: 0cm 5.4pt; width: 47.65pt;" width="48"><div align="center" class="MsoNormal" style="text-align: center;">
<span style="font-size: x-small;"><a href="http://source.cet.uct.ac.za/svn/people/smarquard/datasets/sphinx4-hub4-oyc/rlst152_26_042209-output.txt" target="_blank">41%</a></span></div>
</td>
<td nowrap="nowrap" style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 21pt; padding: 0cm 5.4pt; width: 71.35pt;" width="71"><div align="center" class="MsoNormal" style="text-align: center;">
<span style="font-size: x-small;"> 286 </span></div>
</td>
<td nowrap="nowrap" style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 21pt; padding: 0cm 5.4pt; width: 72.45pt;" width="72"><div align="center" class="MsoNormal" style="text-align: center;">
<span style="font-size: x-small;"> 91 </span></div>
</td>
<td nowrap="nowrap" style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 21pt; padding: 0cm 5.4pt; width: 49.95pt;" width="50"><div align="center" class="MsoNormal" style="text-align: center;">
<span style="font-size: x-small;">1.1%</span></div>
</td>
</tr>
<tr style="height: 22.0pt;">
<td nowrap="nowrap" style="-moz-border-bottom-colors: none; -moz-border-image: none; -moz-border-left-colors: none; -moz-border-right-colors: none; -moz-border-top-colors: none; border-color: -moz-use-text-color windowtext windowtext; border-style: none solid solid; border-width: medium 1pt 1pt; height: 22pt; padding: 0cm 5.4pt; width: 130.1pt;" valign="bottom" width="130"><div class="MsoNormal">
<b><span style="font-size: x-small;">Average</span></b></div>
</td>
<td nowrap="nowrap" style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 22pt; padding: 0cm 5.4pt; width: 55pt;" valign="bottom" width="55"><div align="center" class="MsoNormal" style="text-align: center;">
<b><span style="font-size: x-small;">6847</span></b></div>
</td>
<td nowrap="nowrap" style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 22pt; padding: 0cm 5.4pt; width: 47.65pt;" valign="bottom" width="48"><div align="center" class="MsoNormal" style="text-align: center;">
<b><span style="font-size: x-small;">41%</span></b></div>
</td>
<td nowrap="nowrap" style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 22pt; padding: 0cm 5.4pt; width: 71.35pt;" valign="bottom" width="71"><div align="center" class="MsoNormal" style="text-align: center;">
<b><span style="font-size: x-small;"> 324 </span></b></div>
</td>
<td nowrap="nowrap" style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 22pt; padding: 0cm 5.4pt; width: 72.45pt;" valign="bottom" width="72"><div align="center" class="MsoNormal" style="text-align: center;">
<b><span style="font-size: x-small;"> 151 </span></b></div>
</td>
<td nowrap="nowrap" style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 22pt; padding: 0cm 5.4pt; width: 49.95pt;" valign="bottom" width="50"><div align="center" class="MsoNormal" style="text-align: center;">
<b><span style="font-size: x-small;">2.2%</span></b></div>
</td>
</tr>
</tbody></table>
<br />
The full data set with <a href="http://source.cet.uct.ac.za/svn/people/smarquard/datasets/sphinx4-hub4-oyc/sphinx4-hub4-oyc.pdf" target="_blank">more detailed statistics</a> is available at <a href="http://source.cet.uct.ac.za/svn/people/smarquard/datasets/sphinx4-hub4-oyc/" target="_blank">http://source.cet.uct.ac.za/svn/people/smarquard/datasets/sphinx4-hub4-oyc/</a> (start with the <a href="http://source.cet.uct.ac.za/svn/people/smarquard/datasets/sphinx4-hub4-oyc/README.txt" target="_blank">README</a>).<br />
<br />
<span style="font-size: large;">Analysis</span><br />
The error rate varies widely, from a minimum of 32% to a maximum of 61%, with the output ranging from <a href="http://source.cet.uct.ac.za/svn/people/smarquard/datasets/sphinx4-hub4-oyc/engl291_12_022008-output.txt" target="_blank">just readable</a> to <a href="http://source.cet.uct.ac.za/svn/people/smarquard/datasets/sphinx4-hub4-oyc/hist202_06_092208-output.txt" target="_blank">nonsensical</a>. While the perplexity and OOV figures show some mismatch between the language model and the text, it is likely that acoustic issues (audio quality and/or speaker/model mismatch) have the biggest impact on performance, particularly for the outliers with word error rates above 45%.<br />
<br />
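The word error rates linked above are conventional word-level Levenshtein edit distances divided by the reference length. As a generic illustration of the metric (a sketch, not the scoring tool used for these results):

```python
def wer(ref, hyp):
    """Word error rate: (substitutions + deletions + insertions) / reference length,
    computed by word-level Levenshtein alignment."""
    r, h = ref.split(), hyp.split()
    # dist[i][j] = edit distance between r[:i] and h[:j]
    dist = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dist[i][0] = i          # delete all of r[:i]
    for j in range(len(h) + 1):
        dist[0][j] = j          # insert all of h[:j]
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = 0 if r[i - 1] == h[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + sub)  # substitution or match
    return dist[len(r)][len(h)] / len(r)

print(wer("the logic of resistance", "a logic of resistance"))  # -> 0.25
```

A 41% average WER thus means roughly two in five reference words need an edit to recover the transcript.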
<span style="font-size: large;">Feedback and comparisons</span><br />
If you have suggestions for improving the Sphinx configuration to produce better accuracy on this dataset, or have comparative results for this set of lectures from another speech recognition engine or third-party service, please add a comment below.<br />
<br />Stephen Marquardhttp://www.blogger.com/profile/06185718122117108334noreply@blogger.com9tag:blogger.com,1999:blog-3105702472983275135.post-25069767642815462432011-06-13T23:19:00.000+02:002011-06-13T23:19:21.620+02:00Speech recognition in Opencast MatterhornPresented at the <a href="http://opencast.jira.com/wiki/display/MH/Opencast+Matterhorn+Workshop+-+June+2011+in+LA">Opencast Matterhorn Workshop</a> in LA, 13 June 2011.<br />
<div id="__ss_8296872" style="width: 425px;"><span style="display: block; margin: 12px 0pt 4px;"><a href="http://www.slideshare.net/smarquard/open-textspeech-recognition-in-opencast-matterhorn" title="Open Text: Speech recognition in Opencast Matterhorn ">Open Text: Speech recognition in Opencast Matterhorn </a></span> <iframe frameborder="0" height="355" marginheight="0" marginwidth="0" scrolling="no" src="http://www.slideshare.net/slideshow/embed_code/8296872" width="425"></iframe> <br />
<div style="padding: 5px 0 12px;">View more <a href="http://www.slideshare.net/">presentations</a> from <a href="http://www.slideshare.net/smarquard">Stephen Marquard</a> </div></div>Stephen Marquardhttp://www.blogger.com/profile/06185718122117108334noreply@blogger.com2tag:blogger.com,1999:blog-3105702472983275135.post-32422504508328680322011-05-16T22:47:00.000+02:002011-05-16T22:47:03.587+02:00Using freetts and phonetisaurus for creating custom Sphinx dictionariesIf you use CMU Sphinx (or other recognizers) for large vocabulary speech recognition, you may want to recognize words which are not contained in the <a href="http://www.speech.cs.cmu.edu/cgi-bin/cmudict">CMU Pronouncing Dictionary</a>.<br />
<br />
The <a href="https://cmusphinx.svn.sourceforge.net/svnroot/cmusphinx/trunk/cmudict/">current version of cmudict</a> (0.7a) contains around 133,000 words, which is adequate for general English, but if you want to generate your own language models for specific domains, you will need to create an accompanying custom dictionary with pronunciations for words that appear in the language model but not in cmudict.<br />
<br />
This process is also known as letter-to-sound conversion, or grapheme-to-phoneme conversion. There are several good open source options for accomplishing this, two of which are discussed here: freetts and phonetisaurus. (Others include <a href="http://www-i6.informatik.rwth-aachen.de/web/Software/g2p.html">Sequitur G2P</a> and <a href="http://espeak.sourceforge.net/">espeak</a>).<br />
<br />
<span style="font-size: large;">freetts </span><br />
<a href="http://freetts.sourceforge.net/">freetts</a> is a speech synthesis system which includes a <a href="http://freetts.sourceforge.net/javadoc/com/sun/speech/freetts/lexicon/LetterToSoundImpl.html">LetterToSound implementation</a>:<br />
<blockquote>"using the CMU6 letter-to-sound (LTS) rules, which are based on the Black, Lenzo, and Pagel paper, 'Issues in Building General Letter-to-Sound Rules.' Proceedings of ECSA Workshop on Speech Synthesis, pages 77-80, Australia, 1998."</blockquote>(also implemented in <a href="http://www.cstr.ed.ac.uk/projects/festival/">Festival</a>, according to the above paper.)<br />
<br />
Using freetts is quite straightforward. <a href="http://source.cet.uct.ac.za/svn/people/smarquard/sphinx/scripts/word2phones.py">word2phones.py</a> is a Jython script which calls getPhones() for a list of words. The output is tweaked to remove trailing stress marks (e.g. EY2 to EY) and to replace AX with AH for compatibility with Sphinx. Executing it with:<br />
<blockquote>echo ABBREVIATION | jython ./word2phones.py</blockquote>gives<br />
<blockquote>ABBREVIATION AH B R IY V IY EY SH AH N</blockquote><span style="font-size: large;">phonetisaurus </span><br />
<a href="http://code.google.com/p/phonetisaurus/">Phonetisaurus</a> is "a WFST-driven grapheme-to-phoneme (g2p) framework suitable for rapid development of high quality g2p or p2g systems". To install it, you need:<br />
<ul><li>a <a href="http://mercurial.selenic.com/downloads/">mercurial</a> client (to check out the source), plus the following dependencies:</li>
<li><a href="http://www.openfst.org/twiki/bin/view/FST/FstDownload" rel="nofollow">OpenFst</a></li>
<li><a href="http://code.google.com/p/m2m-aligner/" rel="nofollow">m2m-aligner</a></li>
<li> <a href="https://code.google.com/p/mitlm/" rel="nofollow">mitlm</a> </li>
</ul>You may also need python modules simplejson and argparse. See the <a href="http://code.google.com/p/phonetisaurus/wiki/PhoneticizerTutorial">PhoneticizerTutorial</a> and <a href="http://code.google.com/p/phonetisaurus/wiki/QuickStartExamples">QuickStartExamples</a>.<br />
<br />
phonetisaurus can use a model trained on a pronunciation dictionary (such as CMUdict) to create one or more pronunciation hypotheses for unseen words. Joseph Novak, the author, has helpfully created a model from CMUdict 0.7a which can be downloaded from <a href="http://www.google.com/url?sa=D&q=http://www.gavo.t.u-tokyo.ac.jp/%7Enovakj/cmudict.0.7a.tgz" rel="nofollow" target="_blank">http://www.gavo.t.u-tokyo.ac.jp/~novakj/cmudict.0.7a.tgz</a> (42M). To use it, run the <b>compile.sh</b> script which will create <b>cmudict.0.7a.fst</b>.<br />
To produce results in a format usable as a CMU Sphinx dictionary, I wrote the wrapper script <a href="http://source.cet.uct.ac.za/svn/people/smarquard/sphinx/scripts/phonetiwords.pl">phonetiwords.pl</a> which passes a word list to phonetisaurus and then processes the resulting output for Sphinx by removing stress marks.<br />
<blockquote>echo X-RAY | phonetiwords.pl</blockquote>gives<br />
<blockquote>X-RAY EH K S R EY</blockquote><span style="font-size: large;">How do they differ?</span><br />
On a quick test on a dictionary of the most frequent 40,000 words from Wikipedia (3 characters or longer), freetts (CMU6 LTS rules) and phonetisaurus (with the CMUdict model) produce the same results for 23,654 words, 59% of the sample, omitting stress markers.<br />
<br />
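The normalization both wrappers apply (dropping trailing stress digits, e.g. EY2 to EY) and the agreement check can be sketched in a few lines of Python; the word lists below are made-up examples, not the Wikipedia sample:

```python
import re

def strip_stress(pron):
    """Drop trailing stress digits from each phone: 'EY2' -> 'EY', 'AH0' -> 'AH'."""
    return " ".join(re.sub(r"\d+$", "", p) for p in pron.split())

def agreement(prons_a, prons_b):
    """Fraction of shared words whose stress-stripped pronunciations match."""
    shared = prons_a.keys() & prons_b.keys()
    same = sum(strip_stress(prons_a[w]) == strip_stress(prons_b[w]) for w in shared)
    return same / len(shared)

a = {"ABBREVIATION": "AH B R IY V IY EY2 SH AH N", "X-RAY": "EH K S R EY"}
b = {"ABBREVIATION": "AH B R IY V IY EY SH AH N", "X-RAY": "EH K S R EY2"}
print(agreement(a, b))  # -> 1.0: the two only differ in stress marks
```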
freetts is quite a lot faster at 8s for 40,000 words compared to over a minute for phonetisaurus, although for this application, speed is not likely to be a major issue.<br />
<br />
It will be interesting to see if there is any advantage to one over the other in real-world speech recognition applications, given the 40%+ difference in results. phonetisaurus allows producing multiple hypotheses for each word, which might also have value for speech recognition.Stephen Marquardhttp://www.blogger.com/profile/06185718122117108334noreply@blogger.com3tag:blogger.com,1999:blog-3105702472983275135.post-89970057145554821332011-05-05T20:36:00.000+02:002011-05-05T20:36:08.567+02:00Matrix Market to mysql and back again<i>[<b>Update</b>: Radim Rehurek, gensim's author, pointed out a way to achieve this with gensim indexes - see <a href="http://groups.google.com/group/gensim/browse_thread/thread/e2884f8e903dd063">this thread</a> for details.] </i><br />
<br />
Here are some perl scripts for converting a <a href="http://en.wikipedia.org/wiki/Sparse_matrix">sparse matrix</a> from <a href="http://math.nist.gov/MatrixMarket/formats.html">matrix market</a> (.mm) format to mysql (or another database) and back again.<br />
<br />
My purpose in creating these is to use subsets of a very large matrix with the <a href="http://nlp.fi.muni.cz/projekty/gensim/">gensim</a> <a href="http://en.wikipedia.org/wiki/Vector_space_model">vector space modelling</a> toolkit. For example, a matrix representing a <a href="http://en.wikipedia.org/wiki/Bag_of_words_model">bag-of-words model</a> of 3.3 million or so English Wikipedia articles with a vocabulary of 100,000 words is rather large, around 7.4G in its matrix market (mm) file format.<br />
<br />
To perform operations on a subset of the matrix (in my application, similarity queries on a small set of documents), it's useful to be able to quickly extract a given set of rows from the larger matrix, without reading the entire 7.4G file each time.<br />
<br />
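The coordinate (.mm) layout involved is simple: a banner line, optional % comment lines, a size line (rows, columns, nonzeros), then one 1-based `row col value` triple per line. Extracting a subset of rows straight from the file therefore means a full scan per query, which this Python sketch illustrates (the actual scripts here are Perl):

```python
import io

def mm_rows(f, wanted):
    """Scan a MatrixMarket coordinate stream, keeping entries whose
    1-based row index is in `wanted` -- an O(file size) pass per query,
    which is exactly the cost that moving the matrix into mysql avoids."""
    f.readline()                      # banner: %%MatrixMarket matrix coordinate ...
    line = f.readline()
    while line.startswith("%"):       # skip comment lines
        line = f.readline()
    nrows, ncols, nnz = map(int, line.split())  # size line
    return [(int(i), int(j), float(v))
            for i, j, v in (ln.split() for ln in f)
            if int(i) in wanted]

mm = """%%MatrixMarket matrix coordinate real general
% toy 3x4 matrix with 4 nonzeros
3 4 4
1 1 0.5
2 3 1.0
3 2 2.5
3 4 0.1
"""
print(mm_rows(io.StringIO(mm), {1, 3}))  # -> [(1, 1, 0.5), (3, 2, 2.5), (3, 4, 0.1)]
```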
Thus, the scripts allow converting the large mm file into a mysql database, which can be queried efficiently to return a specific set of rows that can be converted back into a much smaller mm file that gensim can load for use with memory-bound operations such as<span class="n"> <a href="http://nlp.fi.muni.cz/projekty/gensim/tut3.html">MatrixSimilarity</a>.</span><br />
<br />
<span style="font-size: large;">Dependencies </span><br />
<ul><li>perl with CPAN modules DBI and DBD::mysql</li>
<li>mysql </li>
</ul><span style="font-size: large;">Importing a matrix into mysql</span><br />
<br />
<a href="http://source.cet.uct.ac.za/svn/people/smarquard/gensim/mm2sql.pl">mm2sql.pl</a> reads a .mm file, and outputs a set of SQL statements to import the matrix into a mysql database. To create the schema which consists of the <span style="font-size: small;"><span style="font-family: "Courier New",Courier,monospace;">matrix_info</span></span> and<span style="font-size: small;"> <span style="font-family: "Courier New",Courier,monospace;">matrix</span></span> tables and indexes, first create a database (for examples here called <span style="font-family: "Courier New",Courier,monospace; font-size: small;">gensim</span>), and then run:<br />
<blockquote><span style="font-size: small;"><span style="font-family: "Courier New",Courier,monospace;">pod2text mm2sql.pl | mysql -u root -p gensim</span></span></blockquote>To import a matrix into the mysql database, run:<br />
<blockquote><div style="font-family: "Courier New",Courier,monospace;"><span style="font-size: small;">mm2sql.pl matrixname.mm | mysql -u root -p gensim</span></div></blockquote>If you want to import more than one matrix into the same database, then set the matrixid value for each by editing the mm2sql.pl file before running the import command. The default matrixid is 1.<br />
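The idea translates directly to any SQL store. Here is a self-contained sqlite3 sketch with a hypothetical minimal schema (the real mysql schema is embedded in mm2sql.pl's POD; the column names here are illustrative only):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical schema: one entry per nonzero, indexed for fast row lookup.
con.execute("CREATE TABLE matrix (matrixid INTEGER, row INTEGER, col INTEGER, value REAL)")
con.execute("CREATE INDEX matrix_row ON matrix (matrixid, row)")
con.executemany("INSERT INTO matrix VALUES (1, ?, ?, ?)",
                [(7, 1, 0.5), (8, 1, 9.0), (9, 2, 1.0), (17, 5, 2.5)])

# Pull rows 7, 9 and 17 and renumber them densely (7->1, 9->2, 17->3),
# mirroring db2mm.pl's default behaviour.
wanted = [7, 9, 17]
renum = {old: new for new, old in enumerate(wanted, start=1)}
placeholders = ",".join("?" * len(wanted))
rows = con.execute(
    f"SELECT row, col, value FROM matrix WHERE matrixid = 1 AND row IN ({placeholders})"
    " ORDER BY row, col", wanted).fetchall()
for r, c, v in rows:
    print(renum[r], c, v)
```

The indexed lookup returns only the requested rows, so query time scales with the subset size rather than the 7.4G of the full matrix file.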
<br />
<span style="font-size: large;">Exporting rows from a matrix</span><br />
<br />
Use <a href="http://source.cet.uct.ac.za/svn/people/smarquard/gensim/db2mm.pl">db2mm.pl</a> for the reverse operation. First edit the script to set your local mysql connection info (hostname, database name, username and password), then give the matrix id and row numbers on the command line. For example, to export rows 7, 9 and 17 from matrix id 1:<br />
<blockquote><span style="font-family: "Courier New",Courier,monospace; font-size: small;">db2mm.pl 1 7 9 17 > newmatrix.mm</span></blockquote>Note that the export script will renumber the rows in the matrix, so row 7 becomes row 1, row 9 becomes row 2, and row 17 becomes row 3. To preserve the original row numbering (and produce a very sparse matrix with many empty rows), it is fairly straightforward to edit the script to change this behaviour.Stephen Marquardhttp://www.blogger.com/profile/06185718122117108334noreply@blogger.com0tag:blogger.com,1999:blog-3105702472983275135.post-2495984685571149822011-04-14T15:40:00.000+02:002011-04-14T15:40:11.918+02:00Speech recognition for lecture recordingsHere are the slides from a seminar at UCT introducing speech recognition and the project to integrate CMU Sphinx into Opencast Matterhorn, looking inter alia at language modelling using Wikipedia.<br />
<br />
The project is at an early stage, so this is more an overview of the problem space and plans rather than specific results.<br />
<div id="__ss_7626932" style="width: 425px;"><b style="display: block; margin: 12px 0 4px;"><a href="http://www.slideshare.net/smarquard/wreck-a-nice-beach-adventures-in-speech-recognition" title="Wreck a nice beach: adventures in speech recognition">Wreck a nice beach: adventures in speech recognition</a></b> <iframe frameborder="0" height="355" marginheight="0" marginwidth="0" scrolling="no" src="http://www.slideshare.net/slideshow/embed_code/7626932" width="425"></iframe> <br />
<div style="padding: 5px 0 12px;">View more <a href="http://www.slideshare.net/">presentations</a> from <a href="http://www.slideshare.net/smarquard">Stephen Marquard</a> </div></div>Stephen Marquardhttp://www.blogger.com/profile/06185718122117108334noreply@blogger.com1tag:blogger.com,1999:blog-3105702472983275135.post-44224938659177096812011-03-25T11:34:00.000+02:002011-03-27T19:00:11.052+02:00Recognizing specialized vocabulary with large dictionariesOne of the goals of the work which inspired this blog is to integrate a speech recognition engine into a lecture capture system (specifically, integrating <a href="http://cmusphinx.sourceforge.net/">CMU Sphinx</a> into <a href="http://www.opencastproject.org/">Opencast Matterhorn</a>).<br />
<br />
Many university lectures include a high proportion of specialist terms (e.g. medical and scientific terms, discipline-specific terminology and jargon). These are important words. They are the "content anchors" of the lecture, and are likely to be used as search terms should a student want to locate a particular lecture dealing with a topic, or jump to a section of a recording.<br />
<br />
Hence applications of speech recognition in an academic context need to pay special attention to recognizing these words correctly. ASR engines use linguistic resources to recognize words: a <a href="http://en.wikipedia.org/wiki/CMU_Pronouncing_Dictionary">pronunciation dictionary</a> which maps words to typical pronunciations, and a <a href="http://en.wikipedia.org/wiki/Language_model">language model</a>, which is a statistical model of the frequency with which words and word combinations (n-grams) occur in a body of text.<br />
<br />
This post examines the "size and shape" of dictionary that would be required to recognize most specialist terms correctly in a particular domain. The reference text is an edited transcript of a lecture delivered to undergraduate Health Sciences (Medical School) students on "Chemical Pathology of the Liver".<br />
<br />
The dictionaries evaluated come from a variety of sources. Google's ngram dictionary is a list of words from English language books with a minimum frequency cutoff of 40. BEEP and CMU are ASR pronunciation dictionaries. The Bing dictionary is a list of the most frequent 1,000,000 terms in documents indexed by Bing, and WSJ 5K is a small vocabulary from the <a href="http://portal.acm.org/citation.cfm?id=1075614">Wall Street Journal (WSJ) corpus</a>.<br />
<br />
The Wikipedia dictionaries were created from a <a href="http://trulymadlywordly.blogspot.com/2011/03/creating-text-corpus-from-wikipedia.html">plain text list of sentences from Wikipedia articles</a>. The complete list of words was sorted by descending frequency of use, with a cutoff of 3. Wikipedia 100K, for example, contains the most frequent 100,000 terms from Wikipedia.<br />
<br />
The dictionaries all contain variant forms as separate words rather than stem words (e.g. speak, speaker, speaks). The comparison of the lecture text against each dictionary considers only words of 3 or more characters (on the assumption that 1- and 2-letter English words are not problematic in this context, and excluding them from the Wikipedia dictionaries avoids some noise).<br />
<br />
The reference text contains 7810 words which meet this requirement, using a vocabulary of 1407 unique words. Compared against the candidate dictionaries, we find:<br />
<br />
<table border="0" cellpadding="0" cellspacing="0" style="border-collapse: collapse; width: 555px;"><colgroup><col style="mso-width-alt: 4864; mso-width-source: userset;" width="133"></col> <col style="mso-width-alt: 2925; mso-width-source: userset;" width="80"></col> <col style="mso-width-alt: 2706; mso-width-source: userset;" width="74"></col> <col style="mso-width-alt: 2267; mso-width-source: userset;" width="62"></col> <col style="mso-width-alt: 4169; mso-width-source: userset;" width="114"></col> <col style="mso-width-alt: 3364; mso-width-source: userset;" width="92"></col> </colgroup><tbody>
<tr height="13"> <td class="xl24" height="13" style="font-family: inherit;" width="133"><b><span style="font-size: small;">Dictionary</span></b></td> <td class="xl25" style="font-family: inherit; text-align: right;" width="80"><b><span style="font-size: small;">Size</span></b></td> <td class="xl25" style="font-family: inherit; text-align: right;" width="74"><b><span style="font-size: small;">OOV<br />
words</span></b></td> <td class="xl26" style="font-family: inherit; text-align: right;" width="62"><b><span style="font-size: small;">OOV%</span></b></td> <td class="xl24" style="font-family: inherit; text-align: right;" width="114"><b><span style="font-size: small;">Unique<br />
OOV words</span></b></td> <td class="xl24" style="font-family: inherit; text-align: right;" width="92"><b><span style="font-size: small;">Unique <br />
OOV %</span></b></td> </tr>
<tr height="13"> <td class="xl27" height="13" style="font-family: inherit;"><a href="http://ngrams.googlelabs.com/datasets"><span style="font-size: small;">Google 1gram Eng 2009</span></a></td> <td align="right" class="xl28" style="font-family: inherit;"><span style="font-size: small;">4 631 186</span></td> <td align="right" class="xl28" style="font-family: inherit;"><span style="font-size: small;">12</span></td> <td align="right" class="xl29" style="font-family: inherit;"><span style="font-size: small;">0.15%</span></td> <td align="right" class="xl27" style="font-family: inherit;"><span style="font-size: small;">8</span></td> <td align="right" class="xl29" style="font-family: inherit;"><span style="font-size: small;">0.57%</span></td> </tr>
<tr height="13"> <td class="xl27" height="13" style="font-family: inherit;"><span style="font-size: small;">Wikipedia Full</span></td> <td align="right" class="xl28" style="font-family: inherit;"><span style="font-size: small;">1 714 417</span></td> <td align="right" class="xl28" style="font-family: inherit;"><span style="font-size: small;">22</span></td> <td align="right" class="xl29" style="font-family: inherit;"><span style="font-size: small;">0.28%</span></td> <td align="right" class="xl27" style="font-family: inherit;"><span style="font-size: small;">13</span></td> <td align="right" class="xl29" style="font-family: inherit;"><span style="font-size: small;">0.92%</span></td> </tr>
<tr height="13"> <td class="xl27" height="13" style="font-family: inherit;"><span style="font-size: small;">Wikipedia 1M</span></td> <td align="right" class="xl28" style="font-family: inherit;"><span style="font-size: small;">1 000 000</span></td> <td align="right" class="xl28" style="font-family: inherit;"><span style="font-size: small;">27</span></td> <td align="right" class="xl29" style="font-family: inherit;"><span style="font-size: small;">0.35%</span></td> <td align="right" class="xl27" style="font-family: inherit;"><span style="font-size: small;">16</span></td> <td align="right" class="xl29" style="font-family: inherit;"><span style="font-size: small;">1.14%</span></td> </tr>
<tr height="13"> <td class="xl27" height="13" style="font-family: inherit;"><span style="font-size: small;">Wikipedia 500K</span></td> <td align="right" class="xl28" style="font-family: inherit;"><span style="font-size: small;">500 000</span></td> <td align="right" class="xl28" style="font-family: inherit;"><span style="font-size: small;">41</span></td> <td align="right" class="xl29" style="font-family: inherit;"><span style="font-size: small;">0.52%</span></td> <td align="right" class="xl27" style="font-family: inherit;"><span style="font-size: small;">23</span></td> <td align="right" class="xl29" style="font-family: inherit;"><span style="font-size: small;">1.63%</span></td> </tr>
<tr height="13"> <td class="xl27" height="13" style="font-family: inherit;"><span style="font-size: small;">Wikipedia 250K</span></td> <td align="right" class="xl28" style="font-family: inherit;"><span style="font-size: small;">250 000</span></td> <td align="right" class="xl28" style="font-family: inherit;"><span style="font-size: small;">112</span></td> <td align="right" class="xl29" style="font-family: inherit;"><span style="font-size: small;">1.43%</span></td> <td align="right" class="xl27" style="font-family: inherit;"><span style="font-size: small;">43</span></td> <td align="right" class="xl29" style="font-family: inherit;"><span style="font-size: small;">3.06%</span></td> </tr>
<tr height="13"> <td class="xl27" height="13" style="font-family: inherit;"><span style="font-size: small;">Wikipedia 100K</span></td> <td align="right" class="xl28" style="font-family: inherit;"><span style="font-size: small;">100 000</span></td> <td align="right" class="xl28" style="font-family: inherit;"><span style="font-size: small;">269</span></td> <td align="right" class="xl29" style="font-family: inherit;"><span style="font-size: small;">3.44%</span></td> <td align="right" class="xl27" style="font-family: inherit;"><span style="font-size: small;">90</span></td> <td align="right" class="xl29" style="font-family: inherit;"><span style="font-size: small;">6.40%</span></td> </tr>
<tr height="13"> <td class="xl27" height="13" style="font-family: inherit;"><span style="font-size: small;"><a href="ftp://svr-ftp.eng.cam.ac.uk/pub/comp.speech/dictionaries/beep-1.0.README">BEEP 1.0</a></span></td> <td align="right" class="xl28" style="font-family: inherit;"><span style="font-size: small;">257 560</span></td> <td align="right" class="xl28" style="font-family: inherit;"><span style="font-size: small;">413</span></td> <td align="right" class="xl29" style="font-family: inherit;"><span style="font-size: small;">5.29%</span></td> <td align="right" class="xl27" style="font-family: inherit;"><span style="font-size: small;">124</span></td> <td align="right" class="xl29" style="font-family: inherit;"><span style="font-size: small;">8.81%</span></td> </tr>
<tr height="13"> <td class="xl27" height="13" style="font-family: inherit;"><a href="http://www.speech.cs.cmu.edu/cgi-bin/cmudict"><span style="font-size: small;">CMU 0.7.a</span></a></td> <td align="right" class="xl28" style="font-family: inherit;"><span style="font-size: small;">133 367</span></td> <td align="right" class="xl28" style="font-family: inherit;"><span style="font-size: small;">455</span></td> <td align="right" class="xl29" style="font-family: inherit;"><span style="font-size: small;">5.83%</span></td> <td align="right" class="xl27" style="font-family: inherit;"><span style="font-size: small;">146</span></td> <td align="right" class="xl29" style="font-family: inherit;"><span style="font-size: small;">10.38%</span></td> </tr>
<tr height="13"> <td class="xl27" height="13" style="font-family: inherit;"><a href="http://web-ngram.research.microsoft.com/info/"><span style="font-size: small;">Bing Top100K Apr2010</span></a></td> <td align="right" class="xl28" style="font-family: inherit;"><span style="font-size: small;">98 431</span></td> <td align="right" class="xl28" style="font-family: inherit;"><span style="font-size: small;">514</span></td> <td align="right" class="xl29" style="font-family: inherit;"><span style="font-size: small;">6.58%</span></td> <td align="right" class="xl27" style="font-family: inherit;"><span style="font-size: small;">125</span></td> <td align="right" class="xl29" style="font-family: inherit;"><span style="font-size: small;">8.88%</span></td> </tr>
<tr height="13"> <td class="xl27" height="13" style="font-family: inherit;"><a href="https://cmusphinx.svn.sourceforge.net/svnroot/cmusphinx/trunk/sphinx4/models/language/wsj"><span style="font-size: small;">WSJ</span></a></td> <td align="right" class="xl28" style="font-family: inherit;"><span style="font-size: small;">4 986</span></td> <td align="right" class="xl28" style="font-family: inherit;"><span style="font-size: small;">2 177</span></td> <td align="right" class="xl29" style="font-family: inherit;"><span style="font-size: small;">27.87%</span></td> <td align="right" class="xl27" style="font-family: inherit;"><span style="font-size: small;">696</span></td> <td align="right" class="xl29" style="font-family: inherit;"><span style="font-size: small;">49.47%</span></td> </tr>
</tbody></table><br />
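The OOV comparison itself needs nothing more exotic than sorted word lists; a minimal sketch with invented file contents (the real transcript and dictionaries are of course far larger):

```shell
# Sketch of the OOV comparison: transcript words (3+ characters) that are
# missing from a dictionary word list. File contents here are invented.
printf 'BILIRUBIN IS A PIGMENT\nBILIRUBIN LEVELS RISE\n' > transcript.txt
printf 'A\nIS\nLEVELS\nPIGMENT\nRISE\n' > dictionary.txt

# Tokenize and keep only words of 3 or more characters
tr ' ' '\n' < transcript.txt | awk 'length($0) >= 3' | sort > words.txt
sort -u words.txt > unique-words.txt
sort -u dictionary.txt > dict-sorted.txt

wc -l < words.txt          # running words considered
wc -l < unique-words.txt   # unique words (the vocabulary)

# OOV words: in the transcript but not in the dictionary
comm -23 unique-words.txt dict-sorted.txt
```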
So if we are hoping to find more than 99% of the words in our lecture in a generic English dictionary, i.e. an out-of-vocabulary (OOV) rate below 1%, we require a dictionary of between 250K and 500K terms.<br />
<br />
Looking at the nature of the words which are OOV at different dictionary sizes, 250K to 500K is also the region where the number of unrecognized general English words becomes insignificant, leaving only specialist vocabulary. So in Wikipedia 250K, missing words include:<br />
<blockquote>sweetish, re-expressed, ex-boss</blockquote>which are slightly unusual but arguably generic English. Using Wikipedia 500K, the remaining missing words are almost completely domain-specific, for example: <br />
<blockquote>sulfhydryls, aminophenyl, preicteric, methimine, fibrosed, haematemesis, paracetamols, prehepatic, icteric, urobilin, clottability, hepatoma, sclerae, hypergonadism, extravasates, clottable, necroses, necrose</blockquote>So the unsurprising conclusion is that a lecture on a narrow, specialist topic may contain a lot of words which are very infrequent in general English. Another way of visualizing this is comparing the word frequency distribution from a lecture transcript to a text from another genre.<br />
<br />
This scatter plot shows term frequency in the transcript against dictionary rank (i.e. the position of the word in a dictionary sorted from most-to-least frequent), for the lecture transcript (blue) and the first 10,000 words or so from <a href="http://www.gutenberg.org/ebooks/11">Alice's Adventures in Wonderland</a> (i.e. a similar wordcount to the lecture).<br />
<br />
<div class="separator" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em; text-align: center;"><br />
</div><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgAzjJiotD5kO832MB4H5Runi6UYQrpMkSgag2kI9XeCQry8L_WIZ6CMXcOKxH6CaVhIYFKjl2fsQL-UO0Zb42BtqFdqDm7wPIs4cv50P8U1aF106zAzoSGam1Fdf7cCPmAk4IuW66ojpw/s640/wordfreq-berman-alice.png" width="600" /><br />
<br />
The narrative fictional text shows the type of distribution we would expect from <a href="http://en.wikipedia.org/wiki/Zipf%27s_law">Zipf's law</a>. The lecture text shows many more outliers -- for example, terms which occur between 10 and 100 times in the transcript, yet have a dictionary rank of 10,000 or beyond.<br />
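The (frequency, rank) pairs behind such a plot can be assembled with a join over two sorted word lists; a hypothetical sketch with made-up counts and ranks:

```shell
# Sketch: join per-word transcript frequencies with dictionary ranks to
# get the points for the scatter plot. All numbers here are invented.
printf 'BILIRUBIN 40\nLIVER 30\nTHE 500\n' | sort > transcript-freq.txt
printf 'BILIRUBIN 600000\nLIVER 90000\nTHE 1\n' | sort > dict-rank.txt

# Each output line: word, frequency in transcript, dictionary rank
join transcript-freq.txt dict-rank.txt
```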
<br />
So is the solution to recognizing these terms to use a very large dictionary? In this case, larger is not always better. While we may want to recognize a word such as "fibrosed", which appears (with frequency 3) only in the full 1.7M-word Wikipedia dictionary, in practical terms a dictionary is only as useful as the accompanying language model.<br />
<br />
LMs generated with an unrestricted vocabulary from a very large text corpus such as Wikipedia are not only impractical to use (requiring significant memory); they also lose an essential element of context, which is that a lecture is typically about one topic, rather than <a href="http://xkcd.com/863/">the whole of human knowledge</a>. Hence we need to take into account that "fibrosed" is significantly more likely to occur in a lecture on liver pathology than "fibro-cement".<br />
<br />
This leads to the specialized field of <a href="http://scholar.google.com/scholar?q=language+model+adaptation">language model adaptation</a>, a topic of future posts.Stephen Marquardhttp://www.blogger.com/profile/06185718122117108334noreply@blogger.com1tag:blogger.com,1999:blog-3105702472983275135.post-49539433021218669832011-03-23T13:32:00.000+02:002011-03-23T13:35:20.595+02:00Language modelling on the gridWorking with large data sets such as a <a href="http://trulymadlywordly.blogspot.com/2011/03/creating-text-corpus-from-wikipedia.html">Wikipedia plain text corpus</a> creates certain challenges. A raw Wikipedia XML dump file is about 28G uncompressed (as of Jan 2011), and a set of plain text sentences from this is about 6.6G uncompressed.<br />
<br />
Tools designed to process text corpora often have working memory requirements proportional to the size of the corpus or the size of the output set. In the case of language models, the size of the model can significantly exceed the size of the input corpus.<br />
<br />
With my goal being to create a language model from the full Wikipedia English corpus using the <a href="http://code.google.com/p/mitlm/">mitlm</a> toolkit and evaluate the <a href="http://en.wikipedia.org/wiki/Perplexity#Perplexity_per_word">perplexity</a> of the resulting model against a reference text, it became clear that my MacBook Pro's humble 4G of memory was insufficient.<br />
<br />
Happily, UCT's <a href="http://www.icts.uct.ac.za/">Information and Communication Technology Services</a> directed me to the relatively new grid computing infrastructure in South Africa in the form of the <a href="http://www.sagrid.ac.za/">SAGrid</a>. UCT has both a local computing element (gridspeak for a cluster of grid-linked machines) and a clued-up team in the form of Andrew Lewis and Timothy Carr, who helped me get up and running.<br />
<br />
While the grid is capable of extraordinary feats of distributed computing, I basically just needed to be able to execute my single-threaded process on a server with lots of memory (16G or ideally 32G) against a large data set. This turned out to be fairly straightforward. Here are my crib notes (which assume some Linux familiarity):<br />
<br />
<span style="font-size: large;">1. Figure out what the grid is and how it works</span><br />
<br />
Watch the informative <a href="http://www.gridcafe.org/tutorials.html">GridCafe Tutorial Screencasts</a> from the EGEE <a href="https://twiki.cern.ch/twiki/bin/view/EGEE/DirectUserSupport">Direct User Support Group</a>. These explain basic concepts and how to carry out the most common procedures.<br />
<br />
<span style="font-size: large;">2. Get set up as a grid user and associated with a VOMS</span><br />
<br />
Follow steps 1 to 3 on <a href="http://www.sagrid.ac.za/index.php/getting-started">SAGrid's Getting Started</a> page, with help where needed from your local grid computing support staff. You will need a South African Registration Authority to verify your identity and provide you with the key that allows you to request a digital certificate via INFN. <br />
<br />
Once you have been issued with a personal certificate, you need to install it in your browser and register with the SAGrid VOMS (virtual organization).<br />
<br />
<i>Cryptic clue: </i>the VOMS registration page needs "TLS-1 disabled" before it will allow you to connect; otherwise you will get a "secure connection failed" error. To disable TLS-1 in Firefox, go to about:config and set the property security.enable_tls to false. You can re-enable it once you've registered successfully.<br />
<span style="font-size: large;"><br />
3. Set up your personal certificate on a UI server</span><br />
<br />
A grid "user interface" just means a server which has the grid <a href="http://glite.cern.ch/">glite middleware</a> installed, allowing you to submit jobs to the grid and retrieve the results. I used portal.sagrid.ac.za, which runs <a href="https://www.scientificlinux.org/">Scientific Linux</a>. Once you have a shell account (for ssh login), follow the process outlined in the screencast to create and copy your certificate files to the UI server, viz.<br />
<div style="font-family: "Courier New",Courier,monospace;"><span style="font-size: small;"><br />
</span></div><div style="font-family: "Courier New",Courier,monospace;"><span style="font-size: small;">.globus/usercert.pem</span></div><div style="font-family: "Courier New",Courier,monospace;"><span style="font-size: small;">.globus/userkey.pem</span></div><br />
<i>Cryptic clue:</i> if you installed your personal certificate in Firefox on MacOS, you can export it through<br />
<br />
Firefox / Preferences / Advanced / Encryption / View Certificates / Backup<br />
<br />
which will save a certificate file in PKCS12 format (usually with a .p12 extension). You can convert this to the PEM format required by the glite middleware using openssl, as helpfully described by the NCSA's <a href="http://security.ncsa.illinois.edu/research/grid-howtos/usefulopenssl.html">Useful SSL Commands</a>.<br />
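Following the NCSA recipe, the conversion uses openssl's pkcs12 subcommand. A sketch (the throwaway certificate generated first is only there to make the example self-contained; in practice mycert.p12 is the file exported from Firefox):

```shell
# Fabricate a throwaway certificate and bundle it as PKCS12, standing in
# for the .p12 file exported from Firefox
openssl req -x509 -newkey rsa:2048 -keyout demo.key -out demo.crt \
    -days 1 -nodes -subj "/CN=demo" 2>/dev/null
openssl pkcs12 -export -in demo.crt -inkey demo.key \
    -out mycert.p12 -passout pass:secret

# Extract the certificate and the private key as separate PEM files
openssl pkcs12 -in mycert.p12 -clcerts -nokeys \
    -out usercert.pem -passin pass:secret
openssl pkcs12 -in mycert.p12 -nocerts -nodes \
    -out userkey.pem -passin pass:secret

# The glite middleware expects restrictive permissions on the key
chmod 444 usercert.pem
chmod 400 userkey.pem
```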
<br />
<span style="font-size: large;">4. Initialize your grid and VOMS proxy credentials</span><br />
<br />
This sets up your authorization to submit jobs on the grid for the next 12 hours:<br />
<br />
<span style="font-size: x-small;"><span style="font-size: small;"><span style="font-family: "Courier New",Courier,monospace;">grid-proxy-init</span><br style="font-family: "Courier New",Courier,monospace;" /><span style="font-family: "Courier New",Courier,monospace;">voms-proxy-init -voms sagrid</span></span></span><br />
<br />
(If you have a job which will take longer than that, you need a further proxy authentication step.)<br />
<br />
<span style="font-size: large;">5. Build the toolkit and create a script to execute it with the right data</span><br />
<br />
If the application you want to run is not installed on the servers which will execute your job, then you need to build it on a similar platform and include it in your job.<br />
<br />
In my case, I built mitlm from source, and then created a tar bundle with the executable and its libraries, viz. <span style="font-family: "Courier New",Courier,monospace;">mitlm.tgz</span> containing<br />
<br />
<span style="font-family: "Courier New",Courier,monospace;">usr/bin/interpolate-ngram</span><br />
<span style="font-family: "Courier New",Courier,monospace;">usr/bin/estimate-ngram</span><br />
<span style="font-family: "Courier New",Courier,monospace;">usr/bin/evaluate-ngram</span><br />
<span style="font-family: "Courier New",Courier,monospace;">usr/lib/libmitlm.a</span><br />
<span style="font-family: "Courier New",Courier,monospace;">usr/lib/libmitlm.la</span><br />
<span style="font-family: "Courier New",Courier,monospace;">usr/lib/libmitlm.so.0</span><br />
<span style="font-family: "Courier New",Courier,monospace;">usr/lib/libmitlm.so</span><br />
<span style="font-family: "Courier New",Courier,monospace;">usr/lib/libmitlm.so.0.0.0</span><br />
<br />
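The bundling step can be sketched as follows (here the files are empty stand-ins; in reality they come from building mitlm, e.g. an autotools install into a staging prefix):

```shell
# Create the bundle layout listed above with empty stand-in files
mkdir -p usr/bin usr/lib
touch usr/bin/interpolate-ngram usr/bin/estimate-ngram usr/bin/evaluate-ngram
touch usr/lib/libmitlm.a usr/lib/libmitlm.la usr/lib/libmitlm.so \
      usr/lib/libmitlm.so.0 usr/lib/libmitlm.so.0.0.0

# Pack it up for the InputSandbox
tar czf mitlm.tgz usr

# Verify the contents the wrapper script will unpack
tar tzf mitlm.tgz
```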
A wrapper script (<span style="font-family: "Courier New",Courier,monospace;">lmrun.sh</span>) then unpacks the app, fetches the data set, runs the toolkit, and compresses the results:<br />
<br />
<span style="font-family: "Courier New",Courier,monospace;">#! /bin/sh</span><br />
<br />
<span style="font-family: "Courier New",Courier,monospace;"># Unpack the mitlm toolkit</span><br />
<span style="font-family: "Courier New",Courier,monospace;">tar zxf mitlm.tgz</span><br />
<br />
<span style="font-family: "Courier New",Courier,monospace;"># Get our large data set</span><br />
<span style="font-family: "Courier New",Courier,monospace;">wget --quiet --no-proxy http://arabica.cet.uct.ac.za/tmp/enwiki-sentences.corpus.bz2</span><br />
<br />
<span style="font-family: "Courier New",Courier,monospace;"># Run the LM toolkit</span><br />
<span style="font-family: "Courier New",Courier,monospace;">HERE=`pwd`</span><br />
<span style="font-family: "Courier New",Courier,monospace;">LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HERE/usr/lib</span><br />
<br />
<span style="font-family: "Courier New",Courier,monospace;">usr/bin/estimate-ngram -text enwiki-sentences.corpus.bz2 -vocab enwiki-500K-cmu-combined.txt.bz2 -wl wiki.lm</span><br />
<br />
<span style="font-family: "Courier New",Courier,monospace;"># Compress the resulting LM file</span><br />
<span style="font-family: "Courier New",Courier,monospace;">bzip2 wiki.lm</span><br />
<br />
<span style="font-size: large;">6. Configure and submit the job</span><br />
<br />
With the script set to go, all that remains is to create a Job Description Language (JDL) file for the job and submit it. For the mitlm task above, the <span style="font-family: "Courier New",Courier,monospace;">lm-big.jdl</span> file contains:<br />
<div style="font-family: "Courier New",Courier,monospace;"><br />
</div><span style="font-family: "Courier New",Courier,monospace;">Executable = "lmrun.sh";</span><br />
<span style="font-family: "Courier New",Courier,monospace;">Arguments = "";</span><br />
<span style="font-family: "Courier New",Courier,monospace;">StdOutput = "std.out";</span><br />
<span style="font-family: "Courier New",Courier,monospace;">StdError = "std.err";</span><br />
<span style="font-family: "Courier New",Courier,monospace;">InputSandbox = { "lmrun.sh", "mitlm.tgz", "enwiki-500K-cmu-combined.txt.bz2" };</span><br />
<span style="font-family: "Courier New",Courier,monospace;">OutputSandbox = { "std.out", "std.err", "wiki.lm.bz2" };</span><br />
<span style="font-family: "Courier New",Courier,monospace;">Requirements = other.GlueCEUniqueID=="srvslngrd004.uct.ac.za:8443/cream-pbs-sagrid";</span><br />
<br />
Small files are sent along with the job in the <b style="font-family: "Courier New",Courier,monospace;">InputSandbox</b> (here they are located on the portal UI server in the same directory as the JDL file). Large data sets are retrieved separately from some location by the wrapper script. In this case the script does a simple wget from a local server, as an alternative to using grid storage services. The <b><span style="font-family: "Courier New",Courier,monospace;">OutputSandbox</span> </b>defines which files will get returned as part of the job output, in this case stdout and stderr, and the resulting language model file.<br />
<br />
For this job, I defined a particular computing element on which the job should run (a local cluster) using <b><span style="font-family: "Courier New",Courier,monospace;">Requirements</span></b>. This is to ensure that the process executes on worker nodes which have sufficient memory, and as the input and output data sets are relatively large (approx 6G and 12G), it also helps to keep file transfers on a fast network.<br />
<br />
To submit the job, simply run:<br />
<br />
<span style="font-family: "Courier New",Courier,monospace;">glite-wms-job-submit -a -o job.id lm-big.jdl</span><br />
<br />
which saves the resulting job identifier into the <span style="font-family: "Courier New",Courier,monospace;">job.id</span> file.<br />
<br />
<span style="font-size: large;">7. Get the results</span><br />
<br />
To check on the status of the job, run:<br />
<br />
<span style="font-family: "Courier New",Courier,monospace;">glite-wms-job-status -i job.id</span><br />
<br />
and to retrieve the results and output (i.e. fetch the files defined in the OutputSandbox):<br />
<br />
<span style="font-family: "Courier New",Courier,monospace;">glite-wms-job-output --dir ./results -i job.id</span><br />
<br />
Success!<br />
<br />
This particular job used around 16G of working memory and took 1 hour to execute. The resulting language model is around 2.6G in ARPA format after bzip2 compression.<br />
<br />
A followup job evaluated the perplexity of the model against 2 reference documents (although with mitlm one could in fact do this at the same time as creating the model).<br />
<br />
With most of the hard work done, it is now easy to put those grid computing resources to work running multiple variants of the job, for example to evaluate the perplexity of models of different sizes.Stephen Marquardhttp://www.blogger.com/profile/06185718122117108334noreply@blogger.com2tag:blogger.com,1999:blog-3105702472983275135.post-60674896491172495462011-03-15T13:32:00.000+02:002011-03-15T13:50:08.815+02:00Creating a text corpus from WikipediaSpeech recognition engines (and other natural language processing applications) need a good <a href="http://en.wikipedia.org/wiki/Language_model">language model</a>. Open source engines such as the <a href="http://cmusphinx.sourceforge.net/">CMU Sphinx toolkit</a> include relatively small LMs, such as the WSJ model with 5000 terms. Some larger models are available online, such as <a href="http://www.keithv.com/software/giga/">Keith Vertanen's English Gigaword models</a>.<br />
<br />
To create your own, you need a good source of raw material (i.e. written English) in the form of a <a href="http://en.wikipedia.org/wiki/Text_corpus">text corpus</a> such as those available from the non-profit but pricey <a href="http://www.ldc.upenn.edu/">Linguistic Data Consortium</a>. However, if you need a corpus with a <a href="http://en.wikipedia.org/wiki/Wikipedia:Copyrights">permissive license</a> (CC-BY-SA and GFDL) and at no cost, <a href="http://www.wikipedia.org/">Wikipedia</a> now presents an excellent alternative. (Another is the set of <a href="http://ngrams.googlelabs.com/datasets">Google Books n-grams</a>).<br />
<br />
This post describes techniques for turning the contents of Wikipedia into a set of sentences and a vocabulary suitable for use with language modelling toolkits or other applications. You will need a reasonable amount of bandwidth, disk space, and some CPU time to proceed.<br />
<br />
<span style="font-size: large;">Step 1: get that dump file</span><br />
<br />
To start, download a <a href="http://en.wikipedia.org/wiki/Wikipedia:Database_download">Wikipedia database</a> extract. For English, use:<br />
<blockquote><a href="http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2"><span style="font-size: small;">http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2</span></a></blockquote>which is 6G+ in size.<br />
<br />
<span style="font-size: large;">Step 2: convert the dump file to sentences</span><br />
<br />
The Wikipedia dump file XML format and the Wikimedia markup of the articles contain lots of information such as formatting that is irrelevant to statistical language modelling, where we are concerned simply with words and how they form sentences.<br />
<br />
To process the XML file into something useful, I used the <a href="http://code.google.com/p/gwtwiki/">gwtwiki</a> toolkit (bliki-core-3.0.16.jar) along with the dependency <a href="http://commons.apache.org/compress/">Apache Commons Compress</a> (commons-compress-1.1.jar). There is a wide range of toolkits, of varying quality, for processing Wikipedia content in different languages. gwtwiki appears to be one of the most functional and robust, handling both the parsing of the XML file and the conversion of each article from markup into a plain text format.<br />
<br />
A small Java wrapper (<a href="http://source.cet.uct.ac.za/svn/people/smarquard/sphinx/experiments/scripts/Wikipedia2Txt.java">Wikipedia2Txt.java</a>) invokes the gwtwiki parser and does some further filtering, such as excluding sentences of fewer than 6 words. With a few hours of processing, a set of sentences results (one per line). Here are the first few from the 2011-01-15 snapshot of the <a href="http://en.wikipedia.org/wiki/Anarchism">Anarchism</a> article:<br />
<blockquote>Anarchism is a political philosophy which considers the state undesirable, unnecessary, and harmful, and instead promotes a stateless society, or anarchy.<br />
The Concise Oxford Dictionary of Politics.<br />
It seeks to diminish or even abolish authority in the conduct of human relations.</blockquote>Note that some of these are not real subject-verb-object sentences. As the parser is purely syntactic, it will include collections of words that look like sentences. However, they still represent coherent examples of language use for modelling purposes.<br />
<br />
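The length filter applied by the wrapper can be sketched independently of the Java code (the sample lines are invented):

```shell
# Keep only sentences of 6 or more words, as the wrapper does
printf 'The Concise Oxford Dictionary of Politics.\nToo short.\n' > sentences.txt
awk 'NF >= 6' sentences.txt
```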
<span style="font-size: large;">Step 3: convert the sentence list to a corpus file</span><br />
<br />
As most language modelling toolkits are distracted by punctuation, some post-processing (text conditioning) is required. A set of <a href="http://xkcd.com/208/">regular expressions</a> (such as in a perl script) is the easiest way to accomplish this. <a href="http://source.cet.uct.ac.za/svn/people/smarquard/sphinx/experiments/scripts/tocorpus.pl">tocorpus.pl</a> removes punctuation and excess space, producing output like:<br />
<blockquote><span style="font-size: x-small;">ANARCHISM IS A POLITICAL PHILOSOPHY WHICH CONSIDERS THE STATE UNDESIRABLE UNNECESSARY AND HARMFUL AND INSTEAD PROMOTES A STATELESS SOCIETY OR ANARCHY</span><br />
<span style="font-size: x-small;">THE CONCISE OXFORD DICTIONARY OF POLITICS</span><br />
<span style="font-size: x-small;">IT SEEKS TO DIMINISH OR EVEN ABOLISH AUTHORITY IN THE CONDUCT OF HUMAN RELATIONS</span></blockquote>From a 28G uncompressed version of the English Wikipedia pages from the 2011-01-15 snapshot, the corpus file is 6.6G.<br />
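A rough shell stand-in for this conditioning step (the real tocorpus.pl has more rules, e.g. around apostrophes and numerals):

```shell
# Uppercase, turn punctuation into spaces (keeping apostrophes), squeeze
# and trim whitespace -- a rough stand-in for tocorpus.pl
condition() {
    tr '[:lower:]' '[:upper:]' \
        | tr -c "[:alnum:]'\n" ' ' \
        | tr -s ' ' \
        | sed -e 's/^ //' -e 's/ $//'
}

echo "It seeks to diminish, or even abolish, authority." | condition
```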
<br />
<span style="font-size: large;">Step 4: create a vocabulary file</span><br />
<br />
As Wikipedia includes many words which are in fact not words (for example misspellings and other weird and wonderful character sequences like AAA'BBB), it is helpful to create a vocabulary with frequency counts, imposing some restrictions on what is considered a word. <a href="http://source.cet.uct.ac.za/svn/people/smarquard/sphinx/experiments/scripts/mkvocab.pl">mkvocab.pl</a> restricts valid words to those occurring with a minimum frequency and of a minimum length, with some English-specific rules for acceptable use of the apostrophe (<a href="http://source.cet.uct.ac.za/svn/people/smarquard/sphinx/experiments/scripts/english-utils.pl">english-utils.pl</a>).<br />
<br />
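The core of the frequency count can be sketched with standard tools (minimum length 3 and minimum frequency 2 here; mkvocab.pl adds the apostrophe rules):

```shell
# Count word frequencies, keeping words of 3+ characters seen at least twice
printf 'THE CAT SAT ON THE MAT\nTHE CAT RAN\n' > sample-corpus.txt

tr ' ' '\n' < sample-corpus.txt \
    | awk 'length($0) >= 3' \
    | sort | uniq -c \
    | awk '$1 >= 2 {print $2, $1}' \
    | sort -nr -k 2 > sample-vocab.txt

cat sample-vocab.txt
```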
Having created a vocabulary file by processing the corpus file through mkvocab.pl, it's easy to sort it in reverse order of frequency using:<br />
<blockquote style="font-family: "Courier New",Courier,monospace;">sort -nr -k 2 enwiki-vocab.txt</blockquote>which produces:<br />
<blockquote><span style="font-size: x-small;">THE 84503449</span><br />
<span style="font-size: x-small;">AND 33700692</span><br />
<span style="font-size: x-small;">WAS 12911542</span><br />
<span style="font-size: x-small;">FOR 10342919</span><br />
<span style="font-size: x-small;">THAT 8318795</span></blockquote>for a total of 1,714,417 distinct words (with a minimum length of 3). Words with frequency 3 include the misspelt (AFTEROON), the unusual (AFGHANIZATION, AGRO-PASTORALIST), and the spurious (AAAABBBCCC).<br />
<br />
It is also then trivial to produce a vocabulary of the most commonly used words, e.g.<br />
<blockquote style="font-family: "Courier New",Courier,monospace;">head -n100000 enwiki-vocab-sorted.txt > enwiki-vocab-100K.txt </blockquote>However, with a minimum length of 3, a range of useful English words (a, as, an, ...) are excluded, so it's best to combine the resulting dictionary with a smaller dictionary of higher quality (such as <a href="http://www.speech.cs.cmu.edu/cgi-bin/cmudict">CMUdict</a>), which includes most of the valid 2-letter English words.<br />
<br />
<span style="font-size: large;">Step 5: create a language model</span><br />
<br />
Using a <a href="http://www.google.com/search?q=language+modelling+toolkit">language modelling toolkit</a>, you can create an LM of your own design, using part or all of the Wikipedia corpus, optionally restricted to a specific vocabulary. For example, with <a href="http://code.google.com/p/mitlm/">mitlm</a> using 1 out of every 30 Wikipedia sentences and a vocabulary restricted to the top 100,000 words from Wikipedia combined with the CMU 0.7a dictionary:<br />
<blockquote style="font-family: "Courier New",Courier,monospace;">estimate-ngram -vocab enwiki-100K-cmu-combined.txt -text enwiki-sentences-1-from-30.corpus -write-lm enwiki.lm</blockquote>the resulting LM (close to 700M in ARPA format) has:<br />
<blockquote>ngram 1=163892<br />
ngram 2=6251876<br />
ngram 3=17570560</blockquote>Constructing a language model with the full set of sentences and full vocabulary (Wikipedia len=3 plus CMU) leads to an LM with<br />
<blockquote>ngram 1=1724335<br />
ngram 2=79579226<br />
ngram 3=314047999</blockquote>about 12G in size (uncompressed ARPA format).<br />
<br />
Happy modelling!Stephen Marquardhttp://www.blogger.com/profile/06185718122117108334noreply@blogger.com27