The current version of cmudict (0.7a) contains around 133,000 words which is adequate for general English, but if you want to generate your own language models for specific domains, you will need to generate an accompanying custom dictionary with pronunciations for words in the language model but not in cmudict.
This process is also known as letter-to-sound conversion, or grapheme-to-phoneme conversion. There are several good open source options for accomplishing this, two of which are discussed here: freetts and phonetisaurus. (Others include Sequitur G2P and espeak).
freetts
freetts is a speech synthesis system which includes a LetterToSound implementation:
"using the CMU6 letter-to-sound (LTS) rules, which are based on the Black, Lenzo, and Pagel paper, 'Issues in Building General Letter-to-Sound Rules.' Proceedings of ECSA Workshop on Speech Synthesis, pages 77-80, Australia, 1998."(also implemented in Festival, according to the above paper.)
Using freetts is quite straightforward. word2phones.py is a Jython script which calls getPhones() for a list of words. The output is tweaked to remove trailing stress marks (e.g. EY2 to EY) and replaces AX with AH for compatibility with Sphinx. Executing it with:
echo ABBREVIATION | jython ./word2phones.pygives
ABBREVIATION AH B R IY V IY EY SH AH Nphonetisaurus
Phonetisaurus is "a WFST-driven grapheme-to-phoneme (g2p) framework suitable for rapid development of high quality g2p or p2g systems". To install it, you need:
- a mercurial client (to check out the source), and dependencies
- OpenFst
- m2m-aligner
- mitlm
phonetisaurus can use a model trained on a pronunciation dictionary (such as CMUdict) to create one or more pronunciation hypotheses for unseen words. Joseph Novak, the author, has helpfully created a model from CMUdict 0.7a which can be downloaded from http://www.gavo.t.u-tokyo.ac.jp/~novakj/cmudict.0.7a.tgz (42M). To use it, run the compile.sh script which will create cmudict.0.7a.fst.
To produce results in a format usable as a CMU Sphinx dictionary, I wrote the wrapper script phonetiwords.pl which passes a word list to phonetisaurus and then processes the resulting output for Sphinx by removing stress marks.
echo X-RAY | phonetiwords.plgives
X-RAY EH K S R EYHow do they differ?
On a quick test on a dictionary of the most frequent 40,000 words from Wikipedia (3 characters or longer), freetts (CMU6 LTS rules) and phonetisaurus (with the CMUdict model) produce the same results for 23,654 words, 59% of the sample, omitting stress markers.
freetts is quite a lot faster at 8s for 40,000 words compared to over a minute for phonetisaurus, although for this application, speed is not likely to be a major issue.
It will be interesting to see if there is any advantage to one over the other in real-world speech recognition applications, given the 40%+ difference in results. phonetisaurus allows producing multiple hypotheses for each word, which might also have value for speech recognition.