Interview with Melvyn Hunt: Voice search of lists of millions of entries
Below is a copy of a recent interview with Melvyn Hunt carried out by Bill Meisel. The original article, entitled "Interview with Melvyn Hunt, Novauris: Voice search of lists of millions of entries", can be found in the September 2006 issue of Speech Strategy News (formerly Speech Recognition Update).
Please outline the history and company focus for Novauris.
Novauris was founded by Jim Baker, John Bridle and myself. We began operations in England in March 2002 after succeeding in reassembling almost all the former Dragon Systems UK R&D team.
Jim Baker funded our initial development work and originally owned the company. In September 2004, with the help of some private investors, John Bridle and I negotiated a management buyout from Jim, and Yoon Kim, who had been heading the TTS company, NeoSpeech, joined the company as CEO. Yoon is based in our California office. Thank goodness for VoIP!
Although Novauris has developed all its own speech recognition technology, we do not regard ourselves as a general speech recognition company. Rather, we concentrate specifically on developing technology for single-utterance (therefore, dialogue-free) access to large sets of data, without need for user enrolment. We are interested in both server-based applications using standard telephone or data networks and applications running entirely on mobile platforms.
Please describe the features of your NovaSearch technology.
In 2003 we demonstrated spoken selection of a name and address from a set of 245 million US names and addresses (with artificial names). We were able to get extremely high accuracy (error rate << 1%), with rapid response (< 1 sec) on a standard PC and with modest memory requirements (< 50 MB).
Since then, we have been developing applications with shorter and sometimes less structured items, which for us are more challenging even though the number of such items is smaller (for example, the 5 million street-city-state combinations in the continental U.S.). This has required an increase in the intrinsic accuracy of our technology, and we have in fact been able to cut our error rates by a factor of around four.
For server-based applications, we have a software product called NovaSystem, which runs under Linux or MS Windows. It accepts multiple channels of speech input over IP links, farms out the recognition and search computations to multiple processors if needed and if available, and returns an ordered list of matches with associated probability estimates.
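As a rough illustration of what "an ordered list of matches with associated probability estimates" can look like, here is a minimal sketch. It is not NovaSystem's interface: the function name `rank_matches`, the example entries and the log-scores are all hypothetical, and the normalization shown (a softmax over candidate log-scores) is just one common way to turn scores into probability estimates.

```python
import math

def rank_matches(log_scores, top_n=3):
    """Return the top_n entries, best first, with normalized probabilities.

    log_scores maps each candidate entry to a hypothetical recognizer
    log-score (higher is better). This is an assumption for illustration.
    """
    # Subtract the peak before exponentiating to avoid numeric underflow.
    peak = max(log_scores.values())
    weights = {entry: math.exp(s - peak) for entry, s in log_scores.items()}
    total = sum(weights.values())
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    return [(entry, w / total) for entry, w in ranked[:top_n]]

# Hypothetical scores for three street-city-state candidates.
example_scores = {
    "Main St, Springfield, IL": -2.0,
    "Maine St, Springfield, IL": -3.5,
    "Main St, Springfield, MA": -5.0,
}
ranked = rank_matches(example_scores)
for entry, prob in ranked:
    print(f"{prob:.3f}  {entry}")
```

An application receiving such a list could accept the top match outright when its probability is high, or show the first few candidates for confirmation when it is not.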
As for mobile platforms, we expect to complete our first large-scale entirely handheld implementation in two or three months.
How can you achieve this difficult technical task?
There is tight integration between our speech recognition and data search techniques. We are able to get the dramatic performance on names and addresses by exploiting the structure and redundancy present in them. However, even on much simpler sets of items without obvious structure or redundancy we still get rather striking results.
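The value of redundancy can be shown with a toy calculation. This is only a sketch under stated assumptions, not Novauris's method: it treats the fields of an entry as independently scored, the per-field likelihoods are invented, and `joint_posteriors` is a hypothetical helper. The point is that when only certain field combinations actually exist in the list, the combined evidence can be more decisive than any single field.

```python
from math import prod

def joint_posteriors(entries, field_likelihoods):
    """Score each listed entry by the product of its per-field likelihoods,
    then normalize over the entries that actually exist in the list."""
    scores = {
        entry: prod(lk[value] for lk, value in zip(field_likelihoods, entry))
        for entry in entries
    }
    total = sum(scores.values())
    return {entry: s / total for entry, s in scores.items()}

# Hypothetical list: only these name/street/city combinations exist.
entries = [
    ("smith", "oak street", "boston"),
    ("smyth", "elm street", "austin"),
    ("jones", "pine road", "denver"),
]

# Invented per-field likelihoods; no single field is very certain.
field_likelihoods = [
    {"smith": 0.5, "smyth": 0.4, "jones": 0.1},
    {"oak street": 0.5, "elm street": 0.3, "pine road": 0.2},
    {"boston": 0.6, "austin": 0.3, "denver": 0.1},
]

posteriors = joint_posteriors(entries, field_likelihoods)
best_entry = max(posteriors, key=posteriors.get)
best_single_field = max(max(lk.values()) for lk in field_likelihoods)
print(best_entry, round(posteriors[best_entry], 3), best_single_field)
```

Here the winning entry's posterior exceeds the confidence of even the most certain individual field, because the competing entries are penalized in every field at once.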
Our success stems partly from what we call “statistical phonetics”. That is, we analyze statistically large amounts of speech and the response of our decoder to it, and draw phonetic insights from the analysis, which we usually express in probabilistic terms. For example, we make use of knowledge of English syllable structure, which is normally ignored in automatic speech recognition. I'd like to make a personal comment if I may: As someone fascinated by phonetics and having spent decades working on ASR, I have often been dismayed by how little phonetics has apparently contributed to progress in ASR, despite many announcements over the years that some group or other planned to use phonetic knowledge to revolutionize speech recognition. We aren't claiming a revolution, but we are seeing solid advantages from exploiting phonetic insights. The reason, I think, is that our insights are drawn from large-scale observations and are implemented not as rigid rules but as statistical expectations.
On another level, I would say that our technical strengths benefit, paradoxically, from our being a relatively small team of bright people who work closely together without ego problems and who each feel a responsibility for our technology from its theoretical basis to practical issues raised by our customers. In fact, our customers have remarked on our responsiveness and flexibility.
What are some of the telephone applications of NovaSearch?
For our server-based technology, we have seen most interest in entertainment-related applications: selection of music, ringtones, TV programs, games, etc.
We are also seeing a growing interest in “location-based” applications, such as specifying one’s location when ordering taxis, pizzas, etc. However, location-based applications are probably going to be more important in mobile-platform implementations.
What are some of the other applications for NovaSearch?
Well, as I just mentioned, we see location-based applications as being important on mobile platforms, the main search application being the entry of destinations into in-vehicle navigation systems. In North America and Europe this mainly means some reduced form of the postal address. However, in Japan and Korea it appears that there is more interest in specifying what are called points of interest or landmarks rather than addresses.
There is also a less obvious application of our technology on mobile platforms, namely flexible phrase recognition for automatic spoken translation and commands for domestic robots (in which there is more interest in Asia than in the West). Although a concept demonstrator of a voice translator to Korean and Japanese was surprisingly successful, I should say that we see this class of application as being some way off yet.
How is NovaSearch sold for the various markets (e.g., is it licensed by the port for telephony)?
Although we are open to alternatives, for server-based applications we currently prefer a per-transaction or revenue-related arrangement. This gives us and our customer a common interest in the success of an application. The customer obviously benefits from our ongoing interest, but we think that Novauris benefits as well from having a close link to the ultimate users of our technology.
Obviously, for embedded applications a quantity-related unit royalty is more appropriate.
Any final comments?
I've been working in speech technology R&D for a long time. Although it's been intellectually absorbing, it's sometimes been frustrating. For instance, I'm a devoted user of Dragon NaturallySpeaking, but general automatic dictation has never really taken off as I thought it should. Similarly, we thought that voice command in military aircraft would revolutionize pilot-machine interaction, but I don't think that it has. However, it does seem that a really important application of ASR may finally be emerging, namely mobile voice search. My optimism is strengthened by your article on Tellme's successes in the September edition of Telephone Strategy News (advance copies e-mailed to subscribers), and by Amol Sharma’s piece in the Wall Street Journal on July 27 drawing attention to the mobile search service being introduced by Verizon Wireless. It makes me think that sticking around in this field for all these years has been worthwhile after all.