Monday, 16 May 2011

Using freetts and phonetisaurus for creating custom Sphinx dictionaries

If you use CMU Sphinx (or other recognizers) for large vocabulary speech recognition, you may want to recognize words which are not contained in the CMU Pronouncing Dictionary.

The current version of cmudict (0.7a) contains around 133,000 words, which is adequate for general English. But if you want to generate your own language models for specific domains, you will need to generate an accompanying custom dictionary with pronunciations for the words which appear in the language model but not in cmudict.

This process is also known as letter-to-sound conversion, or grapheme-to-phoneme conversion. There are several good open source options for accomplishing this, two of which are discussed here: freetts and phonetisaurus. (Others include Sequitur G2P and espeak.)

freetts
freetts is a speech synthesis system which includes a LetterToSound implementation:
"using the CMU6 letter-to-sound (LTS) rules, which are based on the Black, Lenzo, and Pagel paper, 'Issues in Building General Letter-to-Sound Rules.' Proceedings of ECSA Workshop on Speech Synthesis, pages 77-80, Australia, 1998."
(According to the paper above, these rules are also implemented in Festival.)

Using freetts is quite straightforward. word2phones.py is a Jython script which calls getPhones() for a list of words. The output is tweaked to remove trailing stress marks (e.g. EY2 becomes EY) and to replace AX with AH for compatibility with Sphinx. Executing it with:
echo ABBREVIATION | jython ./word2phones.py
gives
ABBREVIATION    AH B R IY V IY EY SH AH N
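For reference, the core of such a script might look like this (a minimal sketch, not the original word2phones.py; it assumes freetts.jar and its CMU lexicon data are on the Jython classpath, and reaches getPhones() through FreeTTS's CMULexicon class, which falls back to the LTS rules for words not in its dictionary):

import sys
from com.sun.speech.freetts.en.us import CMULexicon

lexicon = CMULexicon()
lexicon.load()

for line in sys.stdin:
    word = line.strip()
    if not word:
        continue
    # getPhones() returns a Java String[] of phonemes
    phones = lexicon.getPhones(word.lower(), None)
    cleaned = []
    for p in phones:
        p = str(p).upper().rstrip('0123456789')  # strip stress: EY2 -> EY
        if p == 'AX':                            # AX -> AH for Sphinx
            p = 'AH'
        cleaned.append(p)
    print '%s\t%s' % (word.upper(), ' '.join(cleaned))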
phonetisaurus
Phonetisaurus is "a WFST-driven grapheme-to-phoneme (g2p) framework suitable for rapid development of high quality g2p or p2g systems". To install it, you need:
You may also need python modules simplejson and argparse. See the PhoneticizerTutorial and QuickStartExamples.

phonetisaurus can use a model trained on a pronunciation dictionary (such as CMUdict) to create one or more pronunciation hypotheses for unseen words. Josef Novak, the author, has helpfully created a model from CMUdict 0.7a which can be downloaded from http://www.gavo.t.u-tokyo.ac.jp/~novakj/cmudict.0.7a.tgz (42M). To use it, run the included compile.sh script, which will create cmudict.0.7a.fst.
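Fetching and compiling the model amounts to something like this (assuming the archive unpacks into a directory of the same name):
wget http://www.gavo.t.u-tokyo.ac.jp/~novakj/cmudict.0.7a.tgz
tar xzf cmudict.0.7a.tgz
cd cmudict.0.7a
./compile.sh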
To produce results in a format usable as a CMU Sphinx dictionary, I wrote the wrapper script phonetiwords.pl, which passes a word list to phonetisaurus and then processes the resulting output for Sphinx by removing stress marks.
echo X-RAY | phonetiwords.pl
gives
X-RAY    EH K S R EY
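The Sphinx-specific post-processing is just stripping the stress digits; in Python it would amount to something like this (the tab-separated word/phones layout is an assumption about the intermediate output format):

import sys

for line in sys.stdin:
    word, phones = line.rstrip('\n').split('\t', 1)
    # drop the trailing stress digit from each phoneme: EY2 -> EY, AH0 -> AH
    phones = ' '.join(p.rstrip('0123456789') for p in phones.split())
    print '%s\t%s' % (word, phones)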
How do they differ?
In a quick test on a dictionary of the 40,000 most frequent words from Wikipedia (3 characters or longer), freetts (CMU6 LTS rules) and phonetisaurus (with the CMUdict model) produce the same results, omitting stress markers, for 23,654 words: 59% of the sample.

freetts is quite a lot faster, at 8s for the 40,000 words compared to over a minute for phonetisaurus, although speed is not likely to be a major issue for this application.

It will be interesting to see whether either has an advantage in real-world speech recognition applications, given that the two disagree on over 40% of the words. phonetisaurus also allows producing multiple hypotheses for each word, which might have value for speech recognition.

Thursday, 5 May 2011

Matrix Market to mysql and back again

[Update: Radim Rehurek, gensim's author, pointed out a way to achieve this with gensim indexes - see this thread for details.]

Here are some perl scripts for converting a sparse matrix from Matrix Market (.mm) format to mysql (or another database) and back again.

My purpose in creating these is to use subsets of a very large matrix with the gensim vector space modelling toolkit. For example, a matrix representing a bag-of-words model of some 3.3 million English Wikipedia articles with a vocabulary of 100,000 words is rather large: around 7.4G in its Matrix Market (mm) file format.
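For reference, the Matrix Market coordinate format is plain text: a header line, a "rows columns non-zeros" line, and then one "row column value" triple per non-zero entry. A small made-up example:

%%MatrixMarket matrix coordinate real general
3 100000 4
1 12 1.0
1 4078 2.0
2 591 1.0
3 12 3.0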

To perform operations on a subset of the matrix (in my application, similarity queries on a small set of documents), it's useful to be able to quickly extract a given set of rows from the larger matrix, without reading the entire 7.4G file each time.

Thus, the scripts allow converting the large mm file into a mysql database, which can be queried efficiently to return a specific set of rows; those rows can then be converted back into a much smaller mm file that gensim can load for use with memory-bound operations such as MatrixSimilarity.
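Once the small matrix is written out, loading it in gensim might look like this (a minimal sketch; newmatrix.mm is the file produced by the export script described below):

from gensim import corpora, similarities

# load the extracted submatrix as a corpus
corpus = corpora.MmCorpus('newmatrix.mm')

# build an in-memory similarity index over just those rows
index = similarities.MatrixSimilarity(corpus)

# query with any vector in the same space, e.g. the first document itself
docs = list(corpus)
sims = index[docs[0]]
print list(enumerate(sims))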

Dependencies
  • perl with CPAN modules DBI and DBD::mysql
  • mysql
Importing a matrix into mysql

mm2sql.pl reads a .mm file and outputs a set of SQL statements to import the matrix into a mysql database. The schema consists of the matrix_info and matrix tables and their indexes. To create it, first create a database (called gensim in these examples), and then run:
pod2text mm2sql.pl | mysql -u root -p gensim
To import a matrix into the mysql database, run:
mm2sql.pl matrixname.mm | mysql -u root -p gensim
If you want to import more than one matrix into the same database, then set the matrixid value for each by editing the mm2sql.pl file before running the import command. The default matrixid is 1.

Exporting rows from a matrix

Use db2mm.sql for the reverse operation. First edit the script to set your local mysql connection info (hostname, database name, username and password). Then give the matrix id and row numbers on the command line. For example, to export rows 7, 9 and 17 from matrix id 1:
db2mm.sql 1 7 9 17 > newmatrix.mm
Note that the export script will renumber the rows in the matrix: row 7 becomes row 1, row 9 becomes row 2, and row 17 becomes row 3. To preserve the original row numbering (and produce a very sparse matrix with many empty rows), it is fairly straightforward to edit the script to change this behaviour.
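Concretely (with made-up entries), exporting rows 7, 9 and 17 produces a file whose header declares just 3 rows, numbered 1 to 3:

%%MatrixMarket matrix coordinate real general
3 100000 4
1 12 1.0
2 40 2.0
3 7 1.0
3 99 3.0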