The HowOld Robot doesn't always guess your age correctly when you show it your photograph, but it's certainly caught everyone's attention. And that's just one of the four REST APIs Microsoft Research is making available through Project Oxford.
Ryan Gaglon from Microsoft Research (MSR) explained to TechRadar Pro what the services can actually do – and where developers might recognise them from. "The speech APIs are about [your apps] being able to hear and to speak back. This is the same backend that powers Cortana," he told us.
The services can turn speech into text or synthesise speech from text in a variety of synthetic voices; they cover 17 languages in the initial beta. Recognition works over a WebSocket connection, and as you watch you can see the API figuring out individual words and then going back to turn them into phrases and sentences, complete with punctuation and capital letters.
If what it's trying to recognise is a short phrase that the API isn't certain about – it's very easy to mix up 450 6th Street and 456th Street, for example – it will send back up to five alternatives (and it's up to the developer to decide if it's useful to show those).
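As a rough illustration of that developer-side decision, here is a minimal Python sketch. It does not use the real Project Oxford response format: the `Hypothesis` type, its `confidence` score and the 0.8 threshold are all assumptions made up for the example. The idea is simply to accept the top result when it is confident enough and otherwise offer up to five alternatives.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hypothesis:
    """One candidate transcription (hypothetical shape, not the real API)."""
    text: str
    confidence: float  # assumed to be a score between 0 and 1


def choose_display(hypotheses: List[Hypothesis],
                   accept_threshold: float = 0.8,
                   max_alternatives: int = 5) -> List[str]:
    """Return one text if the best guess is confident, else a short list."""
    if not hypotheses:
        return []
    # Rank candidates from most to least confident.
    ranked = sorted(hypotheses, key=lambda h: h.confidence, reverse=True)
    if ranked[0].confidence >= accept_threshold:
        return [ranked[0].text]
    # Not confident enough: let the user pick from the top few candidates.
    return [h.text for h in ranked[:max_alternatives]]


if __name__ == "__main__":
    guesses = [Hypothesis("456th Street", 0.55),
               Hypothesis("450 6th Street", 0.45)]
    print(choose_display(guesses))
```

With the two street addresses scored this closely, the sketch returns both, leaving the app to show them as choices; a single high-confidence result would be returned on its own.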
The face service is what HowOld Robot is using. "It's about being able to detect, describe and recognise a human face," says Gaglon, "and it does both detection and verification. Detection tells you how many faces there are in a photo and where they are, plus it can give you landmarks on the face – like the tip of the nose or the left and right side of the mouth.
"Then there are the experimental features, like predicting the age and gender. Verification says if you have two photos and there's a face in both photos, what is the likelihood it belongs to the same individual? Then there's grouping – given a collection of photos, which sets have the same people in."
Some of the face recognition services are the same as the ones used by Kinect.
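To make the link between verification and grouping concrete, here is a small Python sketch of one way grouping could be built on top of pairwise verification. This is an assumption for illustration, not a description of how Microsoft's service works internally: the `same_person_score` callback, the 0.5 threshold and the face IDs are all hypothetical. Faces whose pairwise likelihood clears the threshold are merged into the same group using a simple union-find.

```python
from itertools import combinations
from typing import Callable, Dict, List


def group_faces(face_ids: List[str],
                same_person_score: Callable[[str, str], float],
                threshold: float = 0.5) -> List[List[str]]:
    """Group faces by merging every pair that scores at or above threshold."""
    parent: Dict[str, str] = {f: f for f in face_ids}

    def find(x: str) -> str:
        # Follow parent links to the group's representative, compressing paths.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(face_ids, 2):
        if same_person_score(a, b) >= threshold:
            parent[find(a)] = find(b)  # merge the two groups

    groups: Dict[str, List[str]] = {}
    for f in face_ids:
        groups.setdefault(find(f), []).append(f)
    return list(groups.values())


if __name__ == "__main__":
    # Made-up verification scores for four detected faces.
    scores = {frozenset(("a", "b")): 0.9, frozenset(("c", "d")): 0.8}
    score = lambda x, y: scores.get(frozenset((x, y)), 0.1)
    print(group_faces(["a", "b", "c", "d"], score))
```

A threshold-and-merge approach like this is transitive, so one mistaken match can join two groups; a real service would presumably be more careful.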
The vision APIs include a wide mix of tools "to help describe the content within an image," Gaglon explains. "You can manipulate and work with images, recognise words in a photo. It can scale and crop photos more intelligently, so you can have it crop a photo in different dimensions but keep the most important content of the photo in the frame."
That would come in handy for automatically resizing images so they work on a phone or tablet screen as well as on a larger desktop screen – in action, it looks very like the way Microsoft's Sway authoring service picks which part of a photo to show.
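The arithmetic behind that kind of crop can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it takes the "most important content" as a rectangle that has already been found (the `roi` argument, which is hypothetical), and says nothing about how the vision API actually locates it. The function picks the largest window with the requested aspect ratio, centres it on the region of interest, and then slides it back inside the image if it would overhang an edge.

```python
from typing import Tuple


def crop_window(img_w: int, img_h: int,
                roi: Tuple[int, int, int, int],
                target_w: int, target_h: int) -> Tuple[int, int, int, int]:
    """Return (left, top, width, height) of a crop with the target aspect ratio.

    roi is (x, y, width, height) of the content to keep in frame.
    """
    target_ratio = target_w / target_h
    # Largest window of the target shape that still fits inside the image.
    if img_w / img_h > target_ratio:
        crop_h = img_h
        crop_w = round(img_h * target_ratio)
    else:
        crop_w = img_w
        crop_h = round(img_w / target_ratio)
    # Centre the window on the middle of the region of interest...
    cx = roi[0] + roi[2] / 2
    cy = roi[1] + roi[3] / 2
    # ...then clamp it so it never extends past the image edges.
    left = min(max(round(cx - crop_w / 2), 0), img_w - crop_w)
    top = min(max(round(cy - crop_h / 2), 0), img_h - crop_h)
    return left, top, crop_w, crop_h


if __name__ == "__main__":
    # A 1600x900 photo with the subject off to the right, cropped square.
    print(crop_window(1600, 900, (1200, 300, 200, 200), 500, 500))
```

For the landscape photo in the usage example, the square crop ends up pushed against the right-hand edge so the subject stays in frame rather than being cut off by a naive centre crop.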
"The image analysing service helps you describe an image; whether it's clipart or not, whether it's a colour photo or not, whether it's adult content or not." The vision services can also categorise images, stating whether you're looking at a building or a flower or someone swimming – if a picture shows buildings or streets, the service will say the most likely category is a cityscape. "That's some of the same technology that's used in the OneDrive photo tagging," says Gaglon, noting that many of the vision APIs are services Bing uses for image search.
Getting that to work involves some ground-breaking machine learning research. "One of the things the vision APIs make use of is whole image categorisation, and MSR recently published some results where the team was the first to surpass human image recognition performance on the ImageNet benchmark," he mentions.