Monday, 16 May 2011

Using freetts and phonetisaurus for creating custom Sphinx dictionaries

If you use CMU Sphinx (or other recognizers) for large vocabulary speech recognition, you may want to recognize words which are not contained in the CMU Pronouncing Dictionary.

The current version of cmudict (0.7a) contains around 133,000 words, which is adequate for general English. If you want to generate your own language models for specific domains, however, you will need an accompanying custom dictionary with pronunciations for words that are in the language model but not in cmudict.
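As a rough illustration of finding the words that need pronunciations, the sketch below reads cmudict-style lines and reports which vocabulary words have no entry. The function names and the sample entries are hypothetical, and it assumes the usual cmudict layout: comment lines starting with `;;;` and alternate pronunciations marked with a suffix such as `(1)`.

```python
import re

def load_cmudict_words(lines):
    """Return the set of headwords found in cmudict-style lines."""
    words = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";;;"):
            continue  # skip blank lines and comment lines
        head = line.split()[0]
        # Alternate pronunciations look like WORD(1); drop the numeric suffix.
        head = re.sub(r"\(\d+\)$", "", head)
        words.add(head.upper())
    return words

def missing_words(vocabulary, dictionary_words):
    """Return the vocabulary words that have no dictionary entry, sorted."""
    return sorted({w.upper() for w in vocabulary} - dictionary_words)

# Illustrative sample lines only, not a real excerpt of the dictionary file.
sample_dict = [
    ";;; comment line",
    "ABBREVIATION  AH0 B R IY2 V IY0 EY1 SH AH0 N",
    "READ  R EH1 D",
    "READ(1)  R IY1 D",
]
known = load_cmudict_words(sample_dict)
print(missing_words(["read", "abbreviation", "phonetisaurus"], known))
```

The words it reports are the ones that would be passed to a letter-to-sound tool.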

Generating pronunciations for such words is known as letter-to-sound conversion, or grapheme-to-phoneme conversion. There are several good open source options for accomplishing this, two of which are discussed here: freetts and phonetisaurus. (Others include Sequitur G2P and espeak.)

freetts is a speech synthesis system which includes a LetterToSound implementation:
"using the CMU6 letter-to-sound (LTS) rules, which are based on the Black, Lenzo, and Pagel paper, 'Issues in Building General Letter-to-Sound Rules.' Proceedings of ECSA Workshop on Speech Synthesis, pages 77-80, Australia, 1998."
(also implemented in Festival, according to the above paper.)

Using freetts is quite straightforward. A Jython script calls getPhones() for a list of words. The output is tweaked to remove trailing stress marks (e.g. EY2 becomes EY) and to replace AX with AH for compatibility with Sphinx. Executing it with:
echo ABBREVIATION | jython ./
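
The post-processing described above (dropping trailing stress digits and mapping AX to AH) can be sketched in plain Python. This is a minimal illustration and not the script used in this post: the function names are hypothetical, and the input phones are made up for the example rather than taken from real freetts output.

```python
import re

def normalize_phone(phone):
    """Strip a trailing stress digit (EY2 -> EY) and map AX to AH."""
    base = re.sub(r"\d+$", "", phone.upper())
    return "AH" if base == "AX" else base

def to_sphinx_entry(word, phones):
    """Format one dictionary line: the word, then its normalized phones."""
    return word.upper() + " " + " ".join(normalize_phone(p) for p in phones)

# Hypothetical phone sequence used only to exercise the function.
phones = ["ax", "b", "r", "iy2", "v", "iy", "ey1", "sh", "ax", "n"]
print(to_sphinx_entry("abbreviation", phones))
# prints: ABBREVIATION AH B R IY V IY EY SH AH N
```
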
Phonetisaurus is "a WFST-driven grapheme-to-phoneme (g2p) framework suitable for rapid development of high quality g2p or p2g systems". To install it, you need:
You may also need python modules simplejson and argparse. See the PhoneticizerTutorial and QuickStartExamples.

phonetisaurus can use a model trained on a pronunciation dictionary (such as CMUdict) to create one or more pronunciation hypotheses for unseen words. Joseph Novak, the author, has helpfully created a model from CMUdict 0.7a which can be downloaded from (42M). To use it, run the script which will create cmudict.0.7a.fst.
To produce results in a format usable as a CMU Sphinx dictionary, I wrote the wrapper script which passes a word list to phonetisaurus and then processes the resulting output for Sphinx by removing stress marks.
echo X-RAY |
How do they differ?
In a quick test on a dictionary of the 40,000 most frequent words from Wikipedia (3 characters or longer), freetts (CMU6 LTS rules) and phonetisaurus (with the CMUdict model) produced the same results for 23,654 words (59% of the sample) when stress markers are ignored.
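
A comparison of this kind can be sketched as follows. The function names and the two tiny dictionaries are hypothetical stand-ins for the real tool outputs. The last line simply rechecks the percentage quoted above.

```python
import re

def strip_stress(phones):
    """Remove trailing stress digits so EY1 and EY compare as equal."""
    return tuple(re.sub(r"\d+$", "", p) for p in phones)

def agreement(dict_a, dict_b):
    """Count words, among those present in both, with identical pronunciations."""
    shared = dict_a.keys() & dict_b.keys()
    same = sum(
        1 for w in shared
        if strip_stress(dict_a[w]) == strip_stress(dict_b[w])
    )
    return same, len(shared)

# Made-up pronunciations standing in for the output of two different tools.
tool_a = {"XRAY": ["EH1", "K", "S", "R", "EY2"], "WIKI": ["W", "IH1", "K", "IY0"]}
tool_b = {"XRAY": ["EH", "K", "S", "R", "EY"], "WIKI": ["W", "IY", "K", "IY"]}

same, total = agreement(tool_a, tool_b)
print(same, total)                         # prints: 1 2
print(round(100.0 * 23654 / 40000, 1))     # prints: 59.1
```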

freetts is considerably faster: 8s for 40,000 words, compared to over a minute for phonetisaurus. For this application, though, speed is not likely to be a major issue.

It will be interesting to see if there is any advantage to one over the other in real-world speech recognition applications, given the 40%+ difference in results. phonetisaurus allows producing multiple hypotheses for each word, which might also have value for speech recognition.
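
If multiple hypotheses per word are kept, they need to be written as alternate pronunciations in the dictionary. The sketch below assumes the `WORD(2)`, `WORD(3)` suffix convention commonly seen in Sphinx-format dictionaries, and it takes hypotheses that are already parsed into phone lists. It does not read phonetisaurus output directly, and the function name is hypothetical.

```python
def sphinx_entries(word, hypotheses):
    """Turn a list of phone sequences into dictionary lines with alternates."""
    unique = []
    for phones in hypotheses:
        pron = " ".join(phones)
        if pron not in unique:  # drop duplicate hypotheses, keep ranking order
            unique.append(pron)
    lines = []
    for i, pron in enumerate(unique, start=1):
        # The first pronunciation uses the bare word; later ones get (2), (3), ...
        head = word.upper() if i == 1 else "%s(%d)" % (word.upper(), i)
        lines.append(head + " " + pron)
    return lines

# Made-up hypotheses for illustration, with one duplicate.
for line in sphinx_entries("read", [["R", "IY", "D"], ["R", "EH", "D"], ["R", "IY", "D"]]):
    print(line)
# prints:
# READ R IY D
# READ(2) R EH D
```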


  1. Hi,

    I'm the author/developer of phonetisaurus. Thanks for trying it out!

    I have a couple of questions about the setup and test set. cmu6-lts seems pretty fast; I doubt my programming-fu is sufficient to get phonetisaurus up to that kind of speed, but I'd like to try and improve it a bit further if possible.

    * For the Wikipedia word list you are using, what kind of overlap is there with the original cmudict.0.7a file? I imagine that most of the top 40,000 words are covered in the cmudict, but I'd be curious to know exactly what the overlap is.

    * I wonder how the cmu6-lts model handles words found in the training data? I seem to recall that in festival/speech_tools the lexicon is a combination of the original dictionary (which is used for previously seen words), and the lts rules (which are used for unseen words).

    In the case of phonetisaurus, the training entries are not recorded explicitly, which means that in its current form it will probably perform somewhat worse on previously seen data than the lts model, if the lts model is using a lookup for previously seen words. Nevertheless with an 11-gram model the majority of the previously seen words should produce correct hypotheses.

    In order to test this guess, I ran phonetisaurus on the cmudict.0.7a training data, and did the same using freetts and your script, but using the cmudict0.4 included in festival (which I think is what the lts model is trained on). Phonetisaurus achieved a word accuracy of ~92%, using all 133k entries in 0.7a, while freetts and the lts model achieved about 61% on the 111k entries in the 0.4 dictionary. I also ran words2phones on the cmudict.0.7a where it achieved about 55% word accuracy. In all three cases a substantial portion of the errors are due to the fact that the input dictionaries contain many alternative pronunciations, but the g2p tools produce the same output for the same word (when Nbest=1). This means that where we have entries like,

    the best we can do with Nbest=1 is to get just one right. No doubt both systems would produce significantly higher WACC rates if I had accounted for this in the simple eval above.

    This is definitely not a valid comparison given that the training data is very different, and the underlying phoneme sets are also different. Nevertheless it does suggest that the lts model does not index the training data (or that I completely screwed up in cleaning up the 0.4 dictionary training data). I would definitely be interested in hearing more about what impact, if any, this has on your ASR experiments, and if possible also taking a look at the Wikipedia wordlist/dictionary.

    Your post also suggested to me that it might be worth adding another option to index seen words directly as opposed to just dealing with them in the pronunciation model. This might significantly speed up the search in situations where the majority of words were actually seen during training, although I doubt it will ever be as fast as cmu6-lts.

    Thanks again for taking the time to try it out.


  2. On the 40K word list, 33,858 were in cmudict 0.7a.

    I did a further comparison this time using only words NOT in cmudict 0.7a, by taking the top 250K words from Wikipedia, and excluding those in cmudict, leaving 165,595 words.

    phonetisaurus took 4m59s on this set, compared to 19s for freetts. freetts and phonetisaurus produced identical results for 64,305 of the words (39%).

    So the speed difference on unknown words is real, but I'd consider accuracy more important than speed here. If speed were a real issue, one could always pre-generate a set of pronunciations for a very large dictionary, and then cherry-pick from that each time as needed.

    I've put the full Wikipedia word list at (not a high-quality dictionary by any means - it includes lots of tokens that are not really words).

  3. Thanks for the additional info and suggestions!

    I'd previously only compared the speed with reports on sequitur and directl, so this was good to find out, although I'm not sure how much faster I'll be able to make it given the current architecture.

    I'll definitely be interested to know if it works ok in any ASR experiments, so I hope there will be further updates~
