Making images useful
Having your images automatically captioned and labelled will be useful in plenty of ways, especially if you're a keen photographer trying to stay on top of your image library or a news site hunting for the right photograph.
"Indexing your photos by who's in them is a very natural way to way to think about organising photos," Platt points out. With more powerful labelling, you can search for objects in images (a picture of a cat) or actions (a picture of a cat drinking) or the relation between different objects in an image. "If I remember that I had a picture of a boy and a horse, I'd like to be able to index that – both the objects of the boy and the horse, and the relation between them – and put them in an index so I can go and search for them later."
If you're putting together a catalogue of products, having an automatically generated caption might be useful, but Platt doesn't see much demand for something that specific. There is a lot of interest from different product teams at Microsoft, he says, but instead of creating captions for you he expects that "the pieces will be used in various products; behind the scenes, these bits will be running."
Dealing with videos will mean making the recognition faster, and working out how to spot what's interesting (because not every frame will be). But what's important here is not just the speed, but the way the kind of understanding that underlies captioning complex images could transform search.
The deep learning neural networks and machine learning systems this image recognition uses are the same technologies that have revolutionised speech recognition and translation in the last few years (powering Microsoft's upcoming Skype Translator). "Every time you talk to the Bing search engine on your phone you're talking to a deep network," says Platt. Microsoft's video search system, MAVIS, uses a deep network.
The next step is to do more than recognise, and actually understand what things mean.
"Even for text there's a fair amount of work and that's where there's a lot of interesting value, if we can truly understand text as opposed to just doing keyword search. Just doing keyword search gets you a long way, that's how all of our search engines work today. But imagine if you had a system that could truly understand what your documents were about and truly be an assistant to you."
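The gap between keyword search and "truly understanding" a document can be shown with a toy example. Here a hand-written synonym table stands in for a learned semantic model (an assumption for illustration only): keyword matching misses a document that uses a different surface form, while even this crude notion of meaning finds it.

```python
# Toy contrast between surface-form keyword search and a crude
# "semantic" search. The synonym table is a hand-written stand-in
# for what a learned model of meaning would provide.
docs = {
    "d1": "the automobile would not start this morning",
    "d2": "recipe for a simple tomato soup",
}

synonyms = {"car": {"car", "automobile", "vehicle"}}  # illustrative only

def keyword_search(query):
    """Match only the literal query word."""
    return {doc_id for doc_id, text in docs.items() if query in text.split()}

def semantic_search(query):
    """Match the query word or any of its known synonyms."""
    terms = synonyms.get(query, {query})
    return {doc_id for doc_id, text in docs.items() if terms & set(text.split())}

keyword_search("car")   # set() -- the surface form "car" never appears
semantic_search("car")  # {'d1'} -- "automobile" is recognised as the same thing
```

A real semantic search system would replace the synonym table with learned representations, but the contrast is the same one Platt draws: surface forms versus what the words are about.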
The goal, he says, is to "try to truly understand the semantics of objects like video or speech or image or text, as opposed to the surface forms like just the words or just the colours."