Mobile Opportunity: Speech recognition: Almost ready for mobile prime time

I've always wanted to see speech recognition incorporated into mobile devices. Since you don't have a big keyboard when you're on the go, you ought to be able to just talk to your phone and tell it what to do, or dictate memos to it and have it convert them into e-mails or SMS messages. In addition to being incredibly convenient, this would make a lot of drivers safer. It's a spooky fact, but in surveys I've done, more than 10 percent of the US population admitted to sometimes sending text messages while driving.

Not smart, not safe.

So, is voice recognition good enough to let you just talk to your mobile device and then send the converted text as a message?

I first asked myself that question a couple of years ago when I bought a copy of Dragon NaturallySpeaking and a small voice recorder. I tried recording weblog posts and other documents while driving, and then brought the recorded sound back to my computer to convert it into text. The result was a disaster. Dragon was unable to keep pace with the recorded sound in the files, and started dropping sentences, paragraphs, and eventually entire pages of spoken text. I was so disgusted, and so disappointed, that I gave up and went back to listening to sports talk radio while I drove.

Recently a newly appointed product manager at Nuance (publisher of Dragon) sent out a survey asking for feedback on the product. Unlike most product managers, she signed the survey form with her own name and with her own e-mail address. Most product managers wouldn't do that because they don't want to be overwhelmed with feedback. I don't know how much feedback she got in general, or how overwhelming it was, but she got a note back from me describing my problems with the product and explaining why I really wasn't satisfied with it.

I didn't expect to get any reply from the company; Nuance has a remarkably restrictive policy on providing technical support unless you pay extra for it. Usually, companies that do that aren't interested in getting any sort of conversation going with their customers. But to my surprise, I got a note from the product manager not only sympathizing with my problems but offering to send me a copy of the latest version of the software and a voice recorder that she said would work well with the software. I wish my weblog address hadn't been in my signature, so I would know if they do this sort of thing for every frustrated user. But anyway I took her up on the offer.

You can see the results here. I dictated this weblog post using the voice recorder, synced it onto my computer for recognition, and then corrected the (few) errors by hand. There are pluses and minuses to the dictation system. The good news is that the program can now keep up with my dictated speech. I no longer lose sentences or paragraphs of text. I'm also surprised by the way the product recognizes trade names, so for instance when I say Home Depot or McDonald's or Nike or Apple or IKEA or Lowes, Dragon gets the names correct and properly capitalized (I didn't have to fix anything in that sentence).

On the other hand it does make mistakes -- the packaging claims about 99% accuracy, which means that you should expect one word in every hundred to be incorrect. My guess is that I'm getting somewhere between 97 and 99% accuracy. That's not bad. In fact, it's pretty darned impressive. But in practice it still means you have to go back and do a lot of corrections.
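To put rough numbers on that, here's a back-of-the-envelope sketch. The 1,000-word post length is an assumption picked purely for illustration, and the calculation treats every word as equally likely to be wrong, which a real recognizer doesn't guarantee.

```python
def expected_errors(word_count: int, accuracy: float) -> float:
    """Expected number of misrecognized words at a given per-word accuracy."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return word_count * (1.0 - accuracy)


# Hypothetical 1,000-word post, at the accuracy range estimated above.
for accuracy in (0.99, 0.98, 0.97):
    errors = expected_errors(1000, accuracy)
    print(f"{accuracy:.0%} accurate: about {errors:.0f} words to fix")
```

Even at the advertised 99%, that's a correction every few sentences, and each one means finding the mistake as well as retyping it.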

The training is close to torture: reading aloud a 20-minute excerpt from a Dilbert book while trying to pronounce every word correctly. Later I tried setting up the program without any training, and it worked exactly the same. So my advice is to skip the training.

The software is not great at understanding where punctuation should be placed in the text. I have learned that I have to give grammatical guidance by saying things like "comma," "period," and "new paragraph" in order to make sure that the text will be reasonably well formatted.

If I just speak naturally the text will come out like this making it very difficult for anyone else to read and even making it hard for me to edit without punctuation inserted it is very hard to get tell where a sentence was supposed to end and another one start add in a few wreck cognition errors by the soft wear and the text is not something you would want to send to someone uncorrected
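To make the spoken-punctuation idea concrete, here's a minimal sketch of how dictated commands like "comma," "period," and "new paragraph" could be turned into formatting after recognition. This is only an illustration, not how Dragon actually works; the function name and the command table are hypothetical, and the table covers just the three commands mentioned above.

```python
import re

# Hypothetical command table: only the three commands named in the post.
SPOKEN_COMMANDS = {
    "new paragraph": "\n\n",
    "comma": ",",
    "period": ".",
}

# Try longer commands first so multi-word commands win over shorter ones.
_COMMAND_PATTERN = re.compile(
    r"\s*\b("
    + "|".join(sorted(map(re.escape, SPOKEN_COMMANDS), key=len, reverse=True))
    + r")\b",
    flags=re.IGNORECASE,
)


def apply_spoken_punctuation(transcript: str) -> str:
    """Replace spoken commands in a raw transcript with punctuation.

    Limitation: a literal use of a command word (for example
    "a period of time") is converted too, and capitalization is untouched.
    """
    text = _COMMAND_PATTERN.sub(
        lambda match: SPOKEN_COMMANDS[match.group(1).lower()], transcript
    )
    # Drop the space that would otherwise start each new paragraph.
    text = re.sub(r"\n\n\s+", "\n\n", text)
    return text.strip()


raw = "it works comma mostly period new paragraph next idea period"
print(apply_spoken_punctuation(raw))
```

The sketch also shows why this style of dictation takes concentration: the speaker, not the software, is deciding where every sentence ends.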

Speaking with punctuation is unnatural, and could be somewhat distracting while driving. I have to think carefully about the text that I'm dictating, and I believe that could cause some people not to pay enough attention to what's happening on the road. I think I can do it safely or I wouldn't do it, but it definitely is an issue to consider.

Overall, I think this approach will make me a bit more productive, so I should be able to produce a little bit more weblog content and maybe get some other sorts of things done as well.

So it's nice for me, and I finally feel like I got my money's worth from Dragon. But is the technology ready for broad deployment in mobile devices?

I think the answer is technically yes, but practically no. Mobile devices are casual-use; features that require too much commitment or effort just don't get used. Without careful attention to spoken punctuation, the software produces errors and the sort of run-on text you saw above. Even in a short message, I think it's likely that you'd get more mistakes than you'd find acceptable. Correcting those errors on a small screen with no mouse would be tedious at best (it's an annoying task even on a PC).

More importantly, the software is very sensitive to the quality of the sound file coming into it. I believe most phone microphones and headsets wouldn't produce the required quality. You'd probably get better results with a service that just records your speech and has someone in India retype it (such services exist today).

So, the news from the world of voice recognition is hopeful for mobile users but not yet wonderful. The technology is good enough that you can definitely use it as a substitute for typing if you have physical problems. It's also a useful PC productivity tool for someone who generates a lot of text for a living.

However, I think we're not yet quite at the point where you can just talk to your phone and have it reliably transform all of your speech into text. It's getting better, but it's not all the way there yet. For a mobile device, the dream of just talking is still a dream. But I do think it's a dream that's getting closer to reality.

===========

PS: I'd also like to compliment Kristen Wylie, the product manager at Nuance who responded to my message. Take notes, folks, this is the right way to communicate with customers online -- sign your real name, use an address they can respond to rather than a no-replies mailbox, and when someone has a problem help them solve it.

5 comments:

Anonymous -- Sunday, February 15, 2009 4:08:00 PM
Timely post given this announcement:

http://recite.microsoft.com/m/index.aspx

Tried it?
Anonymous -- Monday, February 16, 2009 10:36:00 AM
Michael: In this context, the difference between "ready" and "almost ready" is the difference between lightning and lightning bug.

Performance of ASR (automatic speech recognition) for natural mobile dictation apps, like email and voicemail, is quite limited. Obstacles include noise, microphone quality, audio compression formats, and open-ended vocabulary. For product developers, one good solution is to use ASR for cost reduction, a human editor "wrapper" for error correction, and appropriate confidentiality protection systems for broad market acceptance.
Interactive Speech -- Friday, January 08, 2010 1:52:00 PM
I agree with Vipul Bhatt: dealing with ASR in varying noise environments is not an easy task. It also depends on the task and vocabulary size. We can hope that ASR will reach perfection in the next 10-15 years.
Bill -- Wednesday, December 08, 2010 6:59:00 AM
There is an interesting voice recognition technology called Me Me Me. All of the current voice recognition technologies (e.g. Dragon) are not speaker independent. In other words, the technology recognizes speech based on the average white American male. The accuracy, even with someone who has no accent, is 70% at best.

Me Me Me has a patented technology that learns each user's unique voice patterns. This takes a few minutes of speaking for the system to calibrate. Once completed, the user's unique voice profile is stored in the cloud. Then any mobile phone application that uses Me Me Me to power its voice recognition will achieve 95%-plus accuracy for the individual. I tried the service and it's fantastic. They are rolling out the first version with Buy.com's mobile app soon.
Michael Mace -- Wednesday, December 08, 2010 11:12:00 AM
Bill, no offense, but did you even read my post? I had to spend about half an hour training Dragon. And the accuracy was way over 70%.