Data on the performance of large-vocabulary, continuous speech recognition engines in real contexts is sometimes hard to find. This dataset describes the performance of the CMU Sphinx Speech Recognition Toolkit (specifically the Sphinx4 Java implementation) in recognizing selected lectures from Open Yale Courses using the HUB4 acoustic and language models.
This is the same data summarized on slide 10 of "Speech Recognition in Opencast Matterhorn", and forms part of a larger research project on adapting language models to improve the searchability of automated transcriptions of recorded lectures.
Source material
The audio files are MP3 recordings from Open Yale Courses (OYC), which helpfully provides both transcripts and a research-friendly Creative Commons license (CC-BY-NC-SA) permitting reuse and derivative works.
The 13 recordings were selected for audio quality, a speaker accent reasonably close to North American English (presumed to align with the acoustic model), a variety of speakers and topics, consistent length (around 50 minutes each), and minimal audience involvement (i.e. primarily a single speaker). Of the 13 lectures, 11 are by male speakers and 2 by female speakers.
For calculating speech recognition accuracy, the transcripts provided by OYC have been normalized to a single continuous sequence of words without punctuation or line breaks; the normalized transcripts are also provided as a separate dataset.
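As an illustration, this normalization amounts to lower-casing the text, stripping punctuation and line breaks, and collapsing whitespace. The sketch below shows one way to do it; the exact rules applied to the published transcripts (for example the handling of numbers and hyphenated words) may differ, so treat the regular expression as an assumption.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class NormalizeTranscript {

    /** Reduce a transcript to one continuous, lower-case sequence of words. */
    static String normalize(String text) {
        return text.toLowerCase()
                   // replace any run of punctuation, digits, line breaks or other
                   // whitespace with a single space, keeping letters and apostrophes
                   .replaceAll("[^a-z']+", " ")
                   .trim();
    }

    public static void main(String[] args) throws Exception {
        String raw = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        System.out.println(normalize(raw));
    }
}
```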
Sphinx4 Configuration
The Sphinx4 configuration is for large-vocabulary, continuous speech recognition, using the HUB4 US English Acoustic Model, the HUB4 Trigram Language Model and the CMUdict 0.7a dictionary. The HUB4 language model contains 64,000 terms and, even in the worst case below, covers just over 95% of the words spoken (though little of the specialist vocabulary, which is a separate topic).
There are many ways to adjust Sphinx's configuration depending on the task at hand, and the configuration used here may not be optimal, though experimenting with settings such as beam width and word insertion probability did not have a significant effect on accuracy.
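For context, the recognition runs follow the usual Sphinx4 pattern: load an XML configuration that wires together the acoustic model, language model, dictionary and front end, then decode the audio utterance by utterance. The sketch below is modeled on the stock Sphinx4 Transcriber demo rather than the exact setup used here; the config.xml file and the component names ("recognizer", "audioFileDataSource") are the demo's conventions, and the MP3s are assumed to have been converted to PCM WAV first, since the Sphinx4 front end does not read MP3 directly.

```java
import edu.cmu.sphinx.frontend.util.AudioFileDataSource;
import edu.cmu.sphinx.recognizer.Recognizer;
import edu.cmu.sphinx.result.Result;
import edu.cmu.sphinx.util.props.ConfigurationManager;

import java.io.File;

public class LectureTranscriber {
    public static void main(String[] args) throws Exception {
        // Load the XML configuration that wires up the acoustic model, trigram
        // language model, dictionary and front end (file name is a placeholder).
        ConfigurationManager cm = new ConfigurationManager(
                LectureTranscriber.class.getResource("config.xml"));

        Recognizer recognizer = (Recognizer) cm.lookup("recognizer");
        recognizer.allocate();

        // Point the front end at the lecture audio, assumed already converted
        // from MP3 to PCM WAV at the sample rate the acoustic model expects.
        AudioFileDataSource dataSource = (AudioFileDataSource) cm.lookup("audioFileDataSource");
        dataSource.setAudioFile(new File(args[0]).toURI().toURL(), null);

        // Decode utterance by utterance until the end of the file.
        Result result;
        while ((result = recognizer.recognize()) != null) {
            System.out.println(result.getBestFinalResultNoFiller());
        }
        recognizer.deallocate();
    }
}
```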
Results
In the table below, the lecture title links to the OYC page for the lecture (which includes links to audio and transcript), and the WER links to the Sphinx recognition output.
| Lecture | Words | Word Error Rate (WER) | Perplexity (sentence transcript) | Out of vocabulary (OOV) words | OOV % |
|---------|-------|-----------------------|----------------------------------|-------------------------------|-------|
|         | 6704  |                       | 228                              | 110                           | 1.6%  |
|         | 7385  |                       | 307                              | 164                           | 2.2%  |
|         | 6974  |                       | 211                              | 96                            | 1.4%  |
|         | 5795  |                       | 331                              | 145                           | 2.5%  |
|         | 7350  |                       | 535                              | 314                           | 4.3%  |
|         | 6201  |                       | 379                              | 174                           | 2.8%  |
|         | 6701  |                       | 274                              | 265                           | 4.0%  |
|         | 7902  |                       | 309                              | 74                            | 0.9%  |
|         | 6643  |                       | 252                              | 212                           | 3.2%  |
|         | 6603  |                       | 475                              | 97                            | 1.5%  |
|         | 5473  |                       | 357                              | 103                           | 1.9%  |
|         | 7085  |                       | 275                              | 119                           | 1.7%  |
|         | 8196  |                       | 286                              | 91                            | 1.1%  |
| Average | 6847  | 41%                   | 324                              | 151                           | 2.2%  |
The full data set with more detailed statistics is available at http://source.cet.uct.ac.za/svn/people/smarquard/datasets/sphinx4-hub4-oyc/ (start with the README).
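For readers recalculating these figures, WER is the word-level edit distance (substitutions + insertions + deletions) between the normalized reference transcript and the recognizer output, divided by the number of reference words. The following is a minimal, generic sketch of that calculation, not necessarily the exact scoring tool used to produce the table above.

```java
public class WordErrorRate {

    /**
     * Word error rate: (substitutions + insertions + deletions) divided by the
     * number of reference words, via a standard dynamic-programming alignment.
     */
    static double wer(String reference, String hypothesis) {
        String[] ref = reference.trim().split("\\s+");
        String[] hyp = hypothesis.trim().split("\\s+");
        int[][] d = new int[ref.length + 1][hyp.length + 1];
        for (int i = 0; i <= ref.length; i++) d[i][0] = i;  // all deletions
        for (int j = 0; j <= hyp.length; j++) d[0][j] = j;  // all insertions
        for (int i = 1; i <= ref.length; i++) {
            for (int j = 1; j <= hyp.length; j++) {
                int sub = d[i - 1][j - 1] + (ref[i - 1].equals(hyp[j - 1]) ? 0 : 1);
                d[i][j] = Math.min(sub, Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1));
            }
        }
        return (double) d[ref.length][hyp.length] / ref.length;
    }

    public static void main(String[] args) {
        String ref = "the quick brown fox jumps over the lazy dog";
        String hyp = "the quick brown box jumps over lazy dog";
        // One substitution and one deletion against 9 reference words: 2/9 = 22.2%
        System.out.printf("WER = %.1f%%%n", 100 * wer(ref, hyp));
    }
}
```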
Analysis
The error rate varies widely, from a minimum of 32% to a maximum of 61%, with the output ranging from just readable to nonsensical. While the perplexity and OOV figures show some mismatch between the language model and the text, it is likely that acoustic issues (audio quality and/or speaker/model mismatch) have the biggest impact on performance, particularly for the outliers with word error rates above 45%.
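For reference, the perplexity column quantifies that language-model mismatch: if the language model assigns probability P(w_i | history) to each of the N transcript words, perplexity is 10 raised to the negative average log10 probability, i.e. roughly how many equally likely word choices the model faces at each position. The sketch below shows the arithmetic; the log-probability values are invented for illustration, while the real figures presumably come from scoring the sentence transcripts against the HUB4 trigram model.

```java
public class Perplexity {

    /** Perplexity from the per-word log10 probabilities assigned by a language model. */
    static double perplexity(double[] log10Probs) {
        double sum = 0;
        for (double lp : log10Probs) {
            sum += lp;
        }
        return Math.pow(10, -sum / log10Probs.length);
    }

    public static void main(String[] args) {
        // Invented values: a model assigning each word probability 10^-2.5 on
        // average gives a perplexity of about 316 (i.e. the model behaves as if
        // choosing among ~316 equally likely words at each position).
        double[] log10Probs = {-2.1, -2.9, -2.4, -2.6, -2.5};
        System.out.printf("perplexity = %.0f%n", perplexity(log10Probs));
    }
}
```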
Feedback and comparisons
If you have suggestions for improving the Sphinx configuration to produce better accuracy on this dataset, or have comparative results for this set of lectures from another speech recognition engine or third-party service, please add a comment below.