Researchers are training AI to listen just like humans

Listen up, AI

Artificial intelligence researchers are making progress towards their goals of training AI systems to understand speech from audio input alone, just like humans do.

At the moment, the majority of AI can only recognize speech by first translating it into text. A lot of progress has been made in terms of lowering word error rates and increasing the number of languages supported.

However, having AI understand speech through audio input alone is a big jump from this stage, so researchers at MIT's Computer Science and Artificial Intelligence Laboratory have taken a step towards it by mapping speech to images rather than text.

AI hear you

It doesn’t sound like much on the surface, but the phrase 'a picture is worth a thousand words' makes it clear just how big an impact it could have. 

At the Neural Information Processing Systems conference the researchers demonstrated their method in a presentation based on a paper they've written.

The idea behind their research is that if several words can be grouped under a single related image it should be possible for the AI to make a “likely” translation without the need for rigorous training.

To create a training dataset for the AI systems, the researchers used the Places205 dataset, which has over 2.5 million images split into 205 different subjects. The researchers paid groups of people to record spoken descriptions of what they saw in four random images each from the dataset. They’ve managed to collect over 120,000 captions from 1,163 individuals.

The AI has then been trained to link words in each caption to relevant images, scoring the similarity of each pairing to select the most accurate translation. If a caption is relevant to the image, it should score high; if not, it should score low.
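To make the scoring idea concrete, here is a minimal sketch in Python. It assumes each caption and each image has already been turned into a list of numbers (an "embedding") and that similarity is a simple dot product; the article doesn't say how the researchers actually compute their score, and the vectors and the names `similarity` and `best_match` are made up for illustration.

```python
# Hypothetical sketch: the vectors below are invented, not outputs of the
# researchers' network, and the dot product is an assumed scoring function.

def similarity(caption_vec, image_vec):
    """Score a caption/image pairing as the dot product of their vectors."""
    return sum(c * i for c, i in zip(caption_vec, image_vec))


def best_match(caption_vec, image_vecs):
    """Return the name of the image whose vector scores highest."""
    return max(image_vecs, key=lambda name: similarity(caption_vec, image_vecs[name]))


# A spoken caption and two candidate images, all as toy 3-number vectors.
caption = [0.9, 0.1, 0.0]
images = {
    "beach": [0.8, 0.2, 0.1],
    "kitchen": [0.1, 0.9, 0.3],
}

print(best_match(caption, images))
```

In this toy setup the relevant pairing scores higher than the irrelevant one, which is the behaviour the training is meant to produce.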

In testing, the network was fed audio recordings describing a picture saved in its database and was asked to select ten images that best matched the audio caption. Unfortunately, out of the ten images selected, the correct one would only be in there 31% of the time.
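The test described above amounts to checking how often the right image lands in the top ten. A rough sketch of that calculation, assuming the network returns a ranked list of image IDs for each caption, could look like this. The function name `recall_at_k` and the toy data are hypothetical, not taken from the paper.

```python
# Hypothetical sketch of a "correct image in the top k" measurement.
# The ranked lists and IDs below are invented for illustration.

def recall_at_k(ranked_results, correct_ids, k=10):
    """Fraction of captions whose correct image appears in the top k results."""
    hits = sum(
        1
        for ranked, correct in zip(ranked_results, correct_ids)
        if correct in ranked[:k]
    )
    return hits / len(correct_ids)


# Three captions, each with a ranked list of candidate image IDs.
ranked = [
    ["a", "b", "c"],
    ["c", "a", "b"],
    ["b", "c", "a"],
]
correct = ["a", "b", "c"]

# With k=2, the first and third captions count as hits; the second does not.
score = recall_at_k(ranked, correct, k=2)
print(round(score, 2))
```

A reported figure of 31% would correspond to this function returning roughly 0.31 with `k=10` over the full set of test captions.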

This is a disappointing score for the researchers as it’s a fairly basic way of training AI to recognize words without any text or language data to assist its understanding. 

However, it’s believed that with improvement, this means of training could help speech recognition software to adapt more quickly to different languages and provide a new means of teaching it to translate. We can already see how image recognition helps humans learn new languages in language learning software like that offered by Rosetta Stone.

Jim Glass, co-author of the paper detailing the research, said, “The goal of this work is to try to get the machine to learn language more like the way humans do.”

Achieving this kind of unsupervised learning could make training AI much more cost- and time-effective, as well as more useful to society at large. Clearly, though, many more advancements have to happen before that's possible.