Machine language: Computers are on the brink of mastering speech recognition


Computers are very close to understanding what you're saying as well as another human could, even if they don't yet know what you're talking about.

"Speech recognition is really close to reaching parity with humans, in the next three years," Xuedong Huang, Microsoft's Chief Speech Scientist, told techradar pro.

"If we can achieve this goal it will be a major landmark for civilisation. Language is only something we humans understand and master. The moment a computer can transcribe your conversation over the phone almost as accurately as humans is a major landmark for AI." And for the typical conversation over the phone, he believes we'll get there in three years – at least in terms of recognising what's being said.

"Transcription is different from understanding; understanding is a different story," he cautions. "To understand the message, the subtlety of what's being said – that's a long way off. To understand intent and meaning, we still have a long way to go."

Xuedong Huang showing off some of the design behind Microsoft's open source deep learning toolkit

Constant progress

He's been working on speech recognition for over 30 years, and he says the improvements have been consistent throughout. The benchmark researchers use to measure accuracy is the transcription of a telephone conversation between two people, and each year he has seen the error rate on that task fall by about 20% relative to the previous year.
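The article does not say how the benchmark's error rate is calculated. A common measure for transcription tasks is word error rate: the number of word substitutions, insertions and deletions needed to turn the system's transcript into the human reference, divided by the length of the reference. The sketch below assumes that measure; the function name and the whitespace tokenisation are illustrative choices, not details from Microsoft's benchmark.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: words are separated by whitespace and compared exactly.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

On this measure, a transcript that gets one word wrong in a four-word reference scores 0.25, or 25%.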

Thanks to deep learning, the best systems, like Cortana, are now making only twice as many errors as humans do. "The transcription error is around 8% now; that's about twice as high as human error, which is around 4%. If we can maintain a 25% reduction every year – well, you do the math! I hope the last 4% is not too hard, and in the next three years we can achieve this."
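Doing the math Huang invites is straightforward. The sketch below is a simple compounding projection, not a model from Microsoft: it assumes the error rate falls by the same relative amount every year, starting from the 8% he quotes and aiming for the 4% human figure. The function names are illustrative.

```python
def project_error_rate(start: float, yearly_reduction: float, years: int) -> list[float]:
    """Error rate at year 0, 1, ..., years under a constant relative reduction."""
    return [start * (1 - yearly_reduction) ** year for year in range(years + 1)]


def years_to_reach(start: float, target: float, yearly_reduction: float) -> int:
    """Whole years until the projected error rate is at or below the target."""
    rate, years = start, 0
    while rate > target:
        rate *= 1 - yearly_reduction
        years += 1
    return years
```

With a 25% yearly reduction, `project_error_rate(8.0, 0.25, 3)` gives 8%, 6%, 4.5% and 3.375%, so the projection drops below the 4% human rate in the third year, which matches Huang's timeline. With the 20% yearly reduction mentioned earlier, the third year lands at roughly 4.1% and parity takes a fourth year.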

The recent advances in speech recognition are down to a relatively new machine learning technique, deep learning.

"Machine learning as a whole is important, but deep learning has been critical to these improvements," Huang explains. Now Microsoft is making the Computational Network Toolkit (CNTK) it uses to build systems like Cortana's speech recognition available, free, as open source on GitHub.

"We believe the work we're doing internally can benefit the whole community. If you have better tools and better recipes, better dishes will be prepared. We believe the tools we're sharing can accelerate the progress of AI."

CNTK was previously available to academic researchers for non-commercial projects through the CodePlex site; now anyone can use it to build commercial systems. "We did it in a quiet way, to get feedback," he says. "Now we're trying to broaden the audience. This is one of our best kept secrets. We're moving forward and making it more open."