Wednesday 19 December 2012

Improving searchability of automatically transcribed lectures through dynamic language modelling

Here is the abstract of my recently completed Master's dissertation on topic-specific language modelling with Wikipedia to improve the accuracy of lecture recording transcriptions:
Recording university lectures through lecture capture systems is increasingly common. However, a single continuous audio recording is often unhelpful for users, who may wish to navigate quickly to a particular part of a lecture, or locate a specific lecture within a set of recordings.

A transcript of the recording can enable faster navigation and searching. Automatic speech recognition (ASR) technologies may be used to create automated transcripts, to avoid the significant time and cost involved in manual transcription.

Low accuracy of ASR-generated transcripts may however limit their usefulness. In particular, ASR systems optimized for general speech recognition may not recognize the many technical or discipline-specific words occurring in university lectures. To improve the usefulness of ASR transcripts for the purposes of information retrieval (search) and navigating within recordings, the lexicon and language model used by the ASR engine may be dynamically adapted for the topic of each lecture.

A prototype is presented which uses the English Wikipedia as a semantically dense, large language corpus to generate a custom lexicon and language model for each lecture from a small set of keywords. Two strategies for extracting a topic-specific subset of Wikipedia articles are investigated: a naïve crawler which follows all article links from a set of seed articles produced by a Wikipedia search from the initial keywords, and a refinement which follows only links to articles sufficiently similar to the parent article. Pair-wise article similarity is computed from a pre-computed vector space model of Wikipedia article term scores generated using latent semantic indexing.
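The difference between the two crawl strategies can be sketched as follows. This is only an illustration under stated assumptions, not the dissertation's crawler: the link graph, the toy term vectors, the `threshold` value and the function names are all hypothetical, and the real system computes similarity from an LSI model rather than from raw term weights.

```python
from collections import deque
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def crawl(seeds, links, vectors, threshold=None, max_articles=1000):
    """Breadth-first crawl from the seed articles.

    threshold=None follows every link (the naive strategy); otherwise a
    link is followed only if the child is similar enough to its parent.
    """
    visited = []
    queued = set(seeds)
    queue = deque(seeds)
    while queue and len(visited) < max_articles:
        parent = queue.popleft()
        visited.append(parent)
        for child in links.get(parent, []):
            if child in queued:
                continue
            if threshold is not None:
                if cosine(vectors[parent], vectors[child]) < threshold:
                    continue
            queued.add(child)
            queue.append(child)
    return visited


# Hypothetical toy data: article links and term-weight vectors.
links = {
    "Liver": ["Hepatitis", "Football"],
    "Hepatitis": ["Jaundice"],
    "Football": ["Stadium"],
}
vectors = {
    "Liver": {"a": 1.0, "b": 0.5},
    "Hepatitis": {"a": 0.9, "b": 0.6},
    "Jaundice": {"a": 0.8, "b": 0.4},
    "Football": {"c": 1.0},
    "Stadium": {"c": 0.9, "d": 0.3},
}

naive = crawl(["Liver"], links, vectors)
similar = crawl(["Liver"], links, vectors, threshold=0.5)
```

On this toy graph the naive crawl visits all five articles, while the similarity-based crawl stops at the off-topic "Football" link and returns only "Liver", "Hepatitis" and "Jaundice".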

The CMU Sphinx4 ASR engine is used to generate transcripts from thirteen recorded lectures from Open Yale Courses, using the English HUB4 language model as a reference and the two topic-specific language models generated for each lecture from Wikipedia.

Three standard metrics – Perplexity, Word Error Rate and Word Correct Rate – are used to evaluate the extent to which the adapted language models improve the searchability of the resulting transcripts, and in particular improve the recognition of specialist words. Ranked Word Correct Rate is proposed as a new metric better aligned with the goals of improving transcript searchability and specialist word recognition.

Analysis of recognition performance shows that the language models derived using the similarity-based Wikipedia crawler outperform models created using the naïve crawler, and that transcripts using similarity-based language models have better perplexity and Ranked Word Correct Rate scores than those created using the HUB4 language model, but worse Word Error Rates.

It is concluded that English Wikipedia may successfully be used as a language resource for unsupervised topic adaptation of language models to improve recognition performance for better searchability of lecture recording transcripts, although possibly at the expense of other attributes such as readability.
Links to the open source software toolkits, data sets and custom-written code used for the analysis, including the Wikipedia Similarity Crawler, are included in Appendices 2 and 3.

Sunday 18 December 2011

Sphinx4 speech recognition results for selected lectures from Open Yale Courses

About this dataset
Data on the performance of large-vocabulary, continuous speech recognition engines in real contexts is sometimes hard to find. This dataset describes the performance of the CMU Sphinx Speech Recognition Toolkit (specifically the Sphinx4 Java implementation) in recognizing selected lectures from Open Yale Courses using the HUB4 acoustic and language models.

This is the same data summarized on Slide 10 in Speech Recognition in Opencast Matterhorn and forms part of a larger research project on adapting language models to improve the searchability of automated transcriptions of recorded lectures.

Source material
The audio files are mp3 recordings which form part of Open Yale Courses (OYC), which helpfully includes both transcripts and a research-friendly Creative Commons license (CC-BY-NC-SA) permitting reuse and derivative works.

The 13 recordings were selected for audio quality, reasonable match of the speaker accent to a North American English accent (presumed to align reasonably with the acoustic model), a variety of speakers and topics, consistent length (around 50 min each), and a single primary speaker (i.e. minimal audience involvement). Of the 13 lectures, 11 are by male speakers and 2 by female speakers.

The transcripts provided by OYC have been normalized to a single continuous set of words without punctuation or linebreaks for calculating speech recognition accuracy, and are also provided as a separate dataset.
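As a rough sketch of how a normalized transcript is scored against recognizer output, the snippet below flattens text into a word list and computes Word Error Rate by word-level edit distance. The normalization rules (lowercasing, keeping apostrophes and hyphens) and the function names are assumptions for illustration; the exact rules used for this dataset are not described here.

```python
import re


def normalize(text):
    """Lowercase, replace punctuation with spaces, return a flat word list."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9'\s-]", " ", text)
    return text.split()


def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words."""
    if not reference:
        raise ValueError("reference transcript is empty")
    # Standard dynamic-programming edit distance, one row at a time.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(reference)


# Made-up example sentences, not taken from the OYC lectures.
reference = normalize("The liver, as you know, filters blood.")
hypothesis = normalize("the river as you know filters")
wer = word_error_rate(reference, hypothesis)
```

Here one substitution ("river" for "liver") and one deletion ("blood") against seven reference words give a WER of 2/7, or about 29%.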

Sphinx4 Configuration
The Sphinx4 configuration is for large-vocabulary, continuous speech recognition, using the HUB4 US English Acoustic Model, HUB4 Trigram Language Model and CMUdict 0.7a dictionary. HUB4 contains 64000 terms, and in the worst case below, matches just over 95% of the vocabulary (though not much of the specialist vocabulary, which is a different topic).

There are many ways to adjust Sphinx's configuration depending on the task at hand, and the configuration used here may not be optimal, though experimenting with settings such as beam width and word insertion probability did not have a significant effect on accuracy.

Results

In the table below, click on the title to go to the OYC page for the lecture (which includes links to audio and transcript), and click on the WER to see the Sphinx recognition output.

Lecture   Words   Word Error Rate (WER)   Perplexity (sentence transcript)   OOV words   OOV %
-         6704    -                       228                                110         1.6%
-         7385    -                       307                                164         2.2%
-         6974    -                       211                                96          1.4%
-         5795    -                       331                                145         2.5%
-         7350    -                       535                                314         4.3%
-         6201    -                       379                                174         2.8%
-         6701    -                       274                                265         4.0%
-         7902    -                       309                                74          0.9%
-         6643    -                       252                                212         3.2%
-         6603    -                       475                                97          1.5%
-         5473    -                       357                                103         1.9%
-         7085    -                       275                                119         1.7%
-         8196    -                       286                                91          1.1%
Average   6847    41%                     324                                151         2.2%

The full data set with more detailed statistics is available at http://source.cet.uct.ac.za/svn/people/smarquard/datasets/sphinx4-hub4-oyc/ (start with the README).

Analysis
The error rate varies widely, from a minimum of 32% to a maximum of 61%, with the output ranging from just readable to nonsensical. While the perplexity and OOV figures show some mismatch between the language model and the text, it is likely that acoustic issues (audio quality and/or speaker/model mismatch) have the biggest impact on performance, particularly for the outliers with word error rates above 45%.

Feedback and comparisons
If you have suggestions for improving the Sphinx configuration to produce better accuracy for this dataset, or have comparative results using this set of lectures using another speech recognition engine or 3rd-party service, please add a comment below.

Monday 16 May 2011

Using freetts and phonetisaurus for creating custom Sphinx dictionaries

If you use CMU Sphinx (or other recognizers) for large vocabulary speech recognition, you may want to recognize words which are not contained in the CMU Pronouncing Dictionary.

The current version of cmudict (0.7a) contains around 133,000 words, which is adequate for general English. If you want to generate your own language models for specific domains, however, you will need to generate an accompanying custom dictionary with pronunciations for words that are in the language model but not in cmudict.

This process is also known as letter-to-sound conversion, or grapheme-to-phoneme conversion. There are several good open source options for accomplishing this, two of which are discussed here: freetts and phonetisaurus. (Others include Sequitur G2P and espeak).

freetts
freetts is a speech synthesis system which includes a LetterToSound implementation:
"using the CMU6 letter-to-sound (LTS) rules, which are based on the Black, Lenzo, and Pagel paper, 'Issues in Building General Letter-to-Sound Rules.' Proceedings of ECSA Workshop on Speech Synthesis, pages 77-80, Australia, 1998."
(also implemented in Festival, according to the above paper.)

Using freetts is quite straightforward. word2phones.py is a Jython script which calls getPhones() for a list of words. The output is tweaked to remove trailing stress marks (e.g. EY2 to EY) and to replace AX with AH for compatibility with Sphinx. Executing it with:
echo ABBREVIATION | jython ./word2phones.py
gives
ABBREVIATION    AH B R IY V IY EY SH AH N
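The post-processing step described above (dropping trailing stress digits and mapping AX to AH) can be sketched in plain Python. This is not the word2phones.py script itself: the function names are hypothetical, and the raw phone list is a made-up input for illustration rather than actual freetts output.

```python
import re


def sphinx_phones(phones):
    """Strip trailing stress digits and map AX to AH for Sphinx."""
    cleaned = []
    for phone in phones:
        phone = re.sub(r"\d+$", "", phone.upper())  # e.g. EY2 -> EY
        if phone == "AX":
            phone = "AH"
        cleaned.append(phone)
    return cleaned


def dictionary_line(word, phones):
    """Format one entry as WORD<tab>PHONE PHONE ..."""
    return word.upper() + "\t" + " ".join(sphinx_phones(phones))


# Illustrative raw phones (assumed, not captured from freetts).
raw = ["ax", "b", "r", "iy1", "v", "iy", "ey1", "sh", "ax", "n"]
line = dictionary_line("abbreviation", raw)
```

With this input, `line` holds the word followed by a tab and the phones AH B R IY V IY EY SH AH N, the same phone sequence as the output shown above.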
phonetisaurus
Phonetisaurus is "a WFST-driven grapheme-to-phoneme (g2p) framework suitable for rapid development of high quality g2p or p2g systems". To install it, you need:
You may also need python modules simplejson and argparse. See the PhoneticizerTutorial and QuickStartExamples.

phonetisaurus can use a model trained on a pronunciation dictionary (such as CMUdict) to create one or more pronunciation hypotheses for unseen words. Joseph Novak, the author, has helpfully created a model from CMUdict 0.7a which can be downloaded from http://www.gavo.t.u-tokyo.ac.jp/~novakj/cmudict.0.7a.tgz (42M). To use it, run the compile.sh script which will create cmudict.0.7a.fst.
To produce results in a format usable as a CMU Sphinx dictionary, I wrote the wrapper script phonetiwords.pl which passes a word list to phonetisaurus and then processes the resulting output for Sphinx by removing stress marks.
echo X-RAY | phonetiwords.pl
gives
X-RAY    EH K S R EY
How do they differ?
On a quick test on a dictionary of the most frequent 40,000 words from Wikipedia (3 characters or longer), freetts (CMU6 LTS rules) and phonetisaurus (with the CMUdict model) produce the same results for 23,654 words, 59% of the sample, omitting stress markers.

freetts is quite a lot faster at 8s for 40,000 words compared to over a minute for phonetisaurus, although for this application, speed is not likely to be a major issue.

It will be interesting to see if there is any advantage to one over the other in real-world speech recognition applications, given the 40%+ difference in results. phonetisaurus allows producing multiple hypotheses for each word, which might also have value for speech recognition.

Thursday 5 May 2011

Matrix Market to mysql and back again

[Update: Radim Rehurek, gensim's author, pointed out a way to achieve this with gensim indexes - see this thread for details.]

Here are some perl scripts for converting a sparse matrix from matrix market (.mm) format to mysql (or another database) and back again.

My purpose in creating these is to use subsets of a very large matrix with the gensim vector space modelling toolkit. For example, a matrix representing a bag-of-words model of 3.3 million or so English Wikipedia articles with a vocabulary of 100,000 words is rather large, around 7.4G in its matrix market (mm) file format.

To perform operations on a subset of the matrix (in my application, similarity queries on a small set of documents), it's useful to be able to quickly extract a given set of rows from the larger matrix, without reading the entire 7.4G file each time.

Thus, the scripts allow converting the large mm file into a mysql database, which can be queried efficiently to return a specific set of rows that can be converted back into a much smaller mm file that gensim can load for use with memory-bound operations such as MatrixSimilarity.

Dependencies
  • perl with CPAN modules DBI and DBD::mysql
  • mysql
Importing a matrix into mysql

mm2sql.pl reads a .mm file and outputs a set of SQL statements to import the matrix into a mysql database. To create the schema, which consists of the matrix_info and matrix tables and indexes, first create a database (called gensim in the examples here), and then run:
pod2text mm2sql.pl | mysql -u root -p gensim
To import a matrix into the mysql database, run:
mm2sql.pl matrixname.mm | mysql -u root -p gensim
If you want to import more than one matrix into the same database, then set the matrixid value for each by editing the mm2sql.pl file before running the import command. The default matrixid is 1.

Exporting rows from a matrix

Use db2mm.sql for the reverse operation. First edit the script to set your local mysql connection info (hostname, database name, username and password). Then give the matrix id and row numbers on the command-line. For example to export rows 7, 9 and 17 from matrix id 1:
db2mm.sql 1 7 9 17 > newmatrix.mm
Note that the export script renumbers the rows in the matrix, so row 7 becomes row 1, row 9 becomes row 2, and row 17 becomes row 3. If you would rather preserve the original row numbering (producing a very sparse matrix with many empty rows), it is fairly straightforward to edit the script to change this behaviour.
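To make the renumbering concrete, here is a small Python sketch that selects rows from Matrix Market coordinate data and optionally renumbers them. It is not the export script: it works on an in-memory list of lines rather than a mysql database, it assumes the input starts with a `%%MatrixMarket` header line, and the `extract_rows` name and `renumber` flag are hypothetical.

```python
def extract_rows(mm_lines, wanted_rows, renumber=True):
    """Return Matrix Market lines containing only the wanted rows."""
    wanted = sorted(set(wanted_rows))
    # Map original row numbers to 1..n in ascending order.
    new_index = {old: new for new, old in enumerate(wanted, start=1)}
    header = None
    n_rows = n_cols = None
    entries = []
    for line in mm_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("%"):
            if header is None:
                header = line  # keep the %%MatrixMarket banner
            continue
        if n_rows is None:
            # First non-comment line: rows, columns, non-zero count.
            n_rows, n_cols, _ = (int(x) for x in line.split())
            continue
        row, col, value = line.split()
        row = int(row)
        if row in new_index:
            out_row = new_index[row] if renumber else row
            entries.append((out_row, int(col), value))
    entries.sort(key=lambda e: (e[0], e[1]))
    out_rows = len(wanted) if renumber else n_rows
    result = [header, "%d %d %d" % (out_rows, n_cols, len(entries))]
    result.extend("%d %d %s" % e for e in entries)
    return result


# Tiny made-up matrix: 20 rows, 5 columns, 4 non-zero entries.
mm = [
    "%%MatrixMarket matrix coordinate real general",
    "20 5 4",
    "7 1 2.0",
    "8 2 1.0",
    "9 3 4.0",
    "17 5 3.0",
]
renumbered = extract_rows(mm, [7, 9, 17])
preserved = extract_rows(mm, [7, 9, 17], renumber=False)
```

With renumbering, the result is a 3-row matrix whose entries sit in rows 1, 2 and 3; with `renumber=False` the size line still reports 20 rows and the entries keep rows 7, 9 and 17.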

Thursday 14 April 2011

Speech recognition for lecture recordings

Here are the slides from a seminar at UCT introducing speech recognition and the project to integrate CMU Sphinx into Opencast Matterhorn, looking inter alia at language modelling using Wikipedia.

The project is at an early stage, so this is more an overview of the problem space and plans rather than specific results.

Friday 25 March 2011

Recognizing specialized vocabulary with large dictionaries

One of the goals of the work which inspired this blog is to integrate a speech recognition engine into a lecture capture system (specifically, integrating CMU Sphinx into Opencast Matterhorn).

Many university lectures include a high proportion of specialist terms (e.g. medical and scientific terms, discipline-specific terminology and jargon). These are important words. They are the "content anchors" of the lecture, and are likely to be used as search terms should a student want to locate a particular lecture dealing with a topic, or jump to a section of a recording.

Hence applications of speech recognition in an academic context need to pay special attention to recognizing these words correctly. ASR engines use linguistic resources to recognize words: a pronunciation dictionary, which maps words to typical pronunciations, and a language model, which is a statistical model of the frequency with which words and word combinations (n-grams) occur in a body of text.
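As a minimal illustration of what an n-gram language model captures, the sketch below estimates bigram probabilities from raw counts. It is a toy under stated assumptions: the sample sentence and the `bigram_model` name are made up, and real language models typically add smoothing and back-off rather than using unsmoothed counts like this.

```python
from collections import Counter


def bigram_model(words):
    """Maximum-likelihood estimate of P(next word | previous word)."""
    histories = Counter(words[:-1])            # how often each word starts a pair
    bigrams = Counter(zip(words, words[1:]))   # how often each pair occurs
    return {pair: count / histories[pair[0]]
            for pair, count in bigrams.items()}


words = "the liver filters blood and the liver stores iron".split()
model = bigram_model(words)
p_liver_given_the = model[("the", "liver")]
p_filters_given_liver = model[("liver", "filters")]
```

In this tiny corpus "the" is always followed by "liver", so that probability is 1.0, while "liver" is followed by "filters" half the time.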

This post examines the "size and shape" of dictionary that would be required to recognize most specialist terms correctly in a particular domain. The reference text is an edited transcript of a lecture delivered to undergraduate Health Sciences (Medical School) students on "Chemical Pathology of the Liver".

The dictionaries evaluated come from a variety of sources. Google's ngram dictionary is a list of words from English language books with a minimum frequency cutoff of 40. BEEP and CMU are ASR pronunciation dictionaries. The Bing dictionary is a list of the 100,000 most frequent terms in documents indexed by Bing, and WSJ 5K is a small vocabulary from the Wall Street Journal (WSJ) corpus.

The Wikipedia dictionaries were created from a plain text list of sentences from Wikipedia articles. The complete list of words was sorted by descending frequency of use, with a cutoff of 3. Wikipedia 100K, for example, contains the most frequent 100,000 terms from Wikipedia.
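Building such a frequency-ranked dictionary can be sketched in a few lines of Python. The tokenization pattern, the function name and the sample sentences are assumptions for illustration, and the cutoff is read here as "keep words occurring at least 3 times".

```python
from collections import Counter
import re


def build_dictionary(sentences, min_count=3, min_length=3, top_n=None):
    """Return words sorted by descending frequency, most frequent first."""
    counts = Counter()
    for sentence in sentences:
        # Words may contain internal hyphens or apostrophes (assumed rule).
        for word in re.findall(r"[a-z]+(?:[-'][a-z]+)*", sentence.lower()):
            if len(word) >= min_length:
                counts[word] += 1
    ranked = sorted(
        ((w, c) for w, c in counts.items() if c >= min_count),
        key=lambda wc: (-wc[1], wc[0]),  # frequency, then alphabetical
    )
    return [word for word, _ in ranked[:top_n]]


# Made-up sentences standing in for the Wikipedia sentence list.
sentences = [
    "The liver stores iron.",
    "The liver filters blood.",
    "Liver disease affects the liver.",
    "Iron is stored in the liver.",
]
full = build_dictionary(sentences)
relaxed = build_dictionary(sentences, min_count=2)
top_one = build_dictionary(sentences, top_n=1)
```

Passing `top_n=100000` to a function like this would correspond to a dictionary such as Wikipedia 100K.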

The dictionaries all contain variant forms as separate words rather than stem words (e.g. speak, speaker, speaks). The comparison of the lecture text to the dictionary compares only words which are 3 or more characters in length (on the assumption that 1- and 2-letter English words are not problematic in this context, and excluding them from the Wikipedia dictionaries avoids some noise).

The reference text contains 7810 words which meet this requirement, using a vocabulary of 1407 unique words. Compared against the candidate dictionaries, we find:

Dictionary              Size        OOV words   OOV %    Unique OOV words   Unique OOV %
Google 1gram Eng 2009   4 631 186   12          0.15%    8                  0.57%
Wikipedia Full          1 714 417   22          0.28%    13                 0.92%
Wikipedia 1M            1 000 000   27          0.35%    16                 1.14%
Wikipedia 500K          500 000     41          0.52%    23                 1.63%
Wikipedia 250K          250 000     112         1.43%    43                 3.06%
Wikipedia 100K          100 000     269         3.44%    90                 6.40%
BEEP 1.0                257 560     413         5.29%    124                8.81%
CMU 0.7.a               133 367     455         5.83%    146                10.38%
Bing Top100K Apr2010    98 431      514         6.58%    125                8.88%
WSJ                     4 986       2 177       27.87%   696                49.47%

So if we are hoping to find more than 99% of the words in our lecture in a generic English dictionary, i.e. an out of vocabulary (OOV) rate of < 1%, we require a dictionary of between 250K and 500K terms.
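The OOV figures in the table are simple ratios: for Google 1gram, for example, 12 of 7810 words gives 0.15% and 8 of 1407 unique words gives 0.57%. A minimal sketch of the comparison, with a hypothetical function name and a made-up sentence and dictionary, might look like this:

```python
def oov_stats(words, dictionary, min_length=3):
    """Count out-of-vocabulary words, ignoring words under min_length."""
    tokens = [w for w in words if len(w) >= min_length]
    oov_tokens = [w for w in tokens if w not in dictionary]
    vocab = set(tokens)
    oov_vocab = set(oov_tokens)
    return {
        "words": len(tokens),
        "oov_words": len(oov_tokens),
        "oov_pct": 100.0 * len(oov_tokens) / len(tokens),
        "unique_words": len(vocab),
        "unique_oov": len(oov_vocab),
        "unique_oov_pct": 100.0 * len(oov_vocab) / len(vocab),
    }


# Toy text and dictionary, not the lecture transcript.
text = "the icteric patient has a fibrosed liver and the liver is fibrosed"
dictionary = {"the", "patient", "has", "liver", "and"}
stats = oov_stats(text.split(), dictionary)
```

In this toy case 3 of the 10 counted words are OOV (30%), and 2 of the 7 unique words are OOV.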

Looking at the nature of the words which are OOV at different dictionary sizes, 250K to 500K is also the region where the number of unrecognized general English words becomes insignificant, leaving only specialist vocabulary. So in Wikipedia 250K, missing words include:
sweetish, re-expressed, ex-boss
which are slightly unusual but arguably generic English. Using Wikipedia 500K, the remaining missing words are almost completely domain-specific, for example:
sulfhydryls, aminophenyl, preicteric, methimine, fibrosed, haematemesis, paracetamols, prehepatic, icteric, urobilin, clottability, hepatoma, sclerae, hypergonadism, extravasates, clottable, necroses, necrose
So the unsurprising conclusion is that a lecture on a narrow, specialist topic may contain a lot of words which are very infrequent in general English. Another way of visualizing this is comparing the word frequency distribution from a lecture transcript to a text from another genre.

This scatter plot shows term frequency in the transcript against dictionary rank (i.e. the position of the word in a dictionary sorted from most-to-least frequent), for the lecture transcript (blue) and the first 10,000 words or so from Alice's Adventures in Wonderland (i.e. a similar wordcount to the lecture).




The narrative fictional text shows the type of distribution we would expect from Zipf's law. The lecture text shows many more outliers -- for example terms with a document frequency of between 10 and 100, and a dictionary rank of 10,000 and below.

So is the solution to recognizing these terms to use a very large dictionary? In this case, larger is not always better. While we may want to recognize a word such as "fibrosed" which occurs with frequency 3 only in the full 1.7M Wikipedia dictionary, in practical terms a dictionary is only as useful as the accompanying language model.

LMs generated with an unrestricted vocabulary from a very large text corpus such as Wikipedia are not only impractical to use (requiring significant memory), but also lose an essential element of context, which is that a lecture is typically about one topic, rather than the whole of human knowledge. Hence we need to take into account that "fibrosed" is significantly more likely to occur in a lecture on liver pathology than "fibro-cement".

This leads to the specialization of language model adaptation, a topic of future posts.