Modern computer algorithms have been able to 'see' the world for some time. Google's Chauffeur software in its self-driving cars uses a 64-beam laser to map the surrounding terrain and combines the data with a library of high-resolution maps.
Roomba robotic vacuum cleaners use IR and mechanical sensors to avoid obstacles in your home; Microsoft's Kinect sensor uses facial recognition to automatically identify users and activate their profiles.
But few visual recognition algorithms are capable of actively learning about the world around them or understanding the relationships between people, places and objects.
How, for example, does a computer know what a car looks like? We humans just know. We've built up that knowledge over time by observing lots of cars. Consequently, we know that not all cars look the same. We know that they come in different shapes, sizes and colours. But we can generally recognise a car because cars share consistent and definable elements: wheels, tyres, an engine, a windscreen and wing mirrors. They travel on roads, and so on.
Could a computer learn all this information in the same way? A team working at Carnegie Mellon University in the United States believes so. It has developed a system called NEIL (Never Ending Image Learner), an ambitious computer program that can decipher the content of photos and make visual connections between them without being taught. Just like a human would.
According to Xinlei Chen, a PhD student who works with NEIL, the software "uses a semi-supervised learning algorithm that jointly discovers common sense relationships - e.g 'Corolla is a kind of/looks similar to Car', 'Wheel is part of Car' - and labels instances of the given visual categories… The input is a large collection of images and the desired output is extracting significant or interesting patterns in visual data - e.g. car is detected frequently in raceways. These patterns help us to extract common sense relationships."
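To make the 'car is detected frequently in raceways' idea concrete, here is a minimal Python sketch of how a frequent scene/object pairing could be counted. It is an illustration only, not NEIL's actual algorithm: the sample data, the function name frequent_scene_objects and the 60 percent threshold are all assumptions made up for this example.

```python
from collections import Counter

# Hypothetical labelled detections: (scene label, objects detected in that image).
# Made-up illustration data, not NEIL output.
detections = [
    ("raceway", {"car", "wheel"}),
    ("raceway", {"car"}),
    ("raceway", {"car", "person"}),
    ("bus_depot", {"bus"}),
    ("bus_depot", {"bus", "car"}),
]


def frequent_scene_objects(detections, min_fraction=0.6):
    """Return sorted (object, scene, fraction) triples for objects that appear
    in at least min_fraction of the images of a scene.

    min_fraction is an arbitrary threshold chosen for this sketch.
    """
    scene_counts = Counter()  # number of images per scene
    pair_counts = Counter()   # number of images per (scene, object) pair
    for scene, objects in detections:
        scene_counts[scene] += 1
        for obj in objects:
            pair_counts[(scene, obj)] += 1

    results = []
    for (scene, obj), count in pair_counts.items():
        fraction = count / scene_counts[scene]
        if fraction >= min_fraction:
            results.append((obj, scene, fraction))
    return sorted(results)


relationships = frequent_scene_objects(detections)
for obj, scene, fraction in relationships:
    print(f"{obj} is found in {scene} ({fraction:.0%} of images)")
```

A real system would also have to cope with noisy detections and far larger data, which this toy count ignores.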
As the 'never ending' part of its name suggests, NEIL is being run continuously, and it works by plundering Google Image Search data to amass a library of objects, scenes and attributes. The current array of information includes everything from aircraft carriers to zebras, basilicas to hospitals, speckled textures to distinctive tartan patterns.
Starting with an image of a desktop computer, for example, NEIL will reference existing images of computers in its database plus any images that have been specified as belonging to a desktop computer, such as monitors, keyboards and mice.
Consequently, it can learn that 'Monitors is part of Desktop Computer' and 'Keyboard is part of Desktop Computer'. In fact, by analysing images in this way, NEIL can form four different types of visual relationship - object to object ('BMW 320 is a kind of Car'), object to attribute ('Sheep is/has White'), scene to object ('Bus is found in Bus depot') and scene to attribute ('Ocean is blue'). You can see the ongoing results of NEIL's image cataloguing progress on the project's website.
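The four relationship types can be pictured as a simple data structure. The Python sketch below is one hypothetical way to represent them; the class and field names (RelationType, Relationship, subject, predicate, target) are invented for illustration and say nothing about how NEIL stores its knowledge base internally.

```python
from dataclasses import dataclass
from enum import Enum


class RelationType(Enum):
    """The four relationship types described in the article."""
    OBJECT_OBJECT = "object to object"
    OBJECT_ATTRIBUTE = "object to attribute"
    SCENE_OBJECT = "scene to object"
    SCENE_ATTRIBUTE = "scene to attribute"


@dataclass(frozen=True)
class Relationship:
    """One common sense relationship (hypothetical representation)."""
    kind: RelationType
    subject: str
    predicate: str
    target: str

    def sentence(self) -> str:
        # Render the relationship in the style of the quoted examples.
        return f"{self.subject} {self.predicate} {self.target}"


# The four examples quoted in the article, one per relationship type.
examples = [
    Relationship(RelationType.OBJECT_OBJECT, "BMW 320", "is a kind of", "Car"),
    Relationship(RelationType.OBJECT_ATTRIBUTE, "Sheep", "is/has", "White"),
    Relationship(RelationType.SCENE_OBJECT, "Bus", "is found in", "Bus depot"),
    Relationship(RelationType.SCENE_ATTRIBUTE, "Ocean", "is", "blue"),
]

for rel in examples:
    print(f"[{rel.kind.value}] {rel.sentence()}")
```

Keeping the relationship type separate from the wording makes it easy to filter, say, only the scene to object facts.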
For the first two and a half months of its operational life, the team at Carnegie Mellon let NEIL loose on 200 processing cores. Since July 15, it has analysed over five million images, labelled 500,000 images and formed over 3,000 common sense relationships. These include the following correct assumptions: 'Agra can have Taj_mahal', 'Mudflat can have Seagull', 'Sydney can be / can have Sunny_weather' and 'Tent_indoor can be / can have Cone_shape'.
Of course, NEIL's approach isn't perfect and, depending on the nature of the source images, it can often make incorrect statements. These have included: 'Windmill can have Helicopter' (the sails of a windmill do look like rotor blades...) and 'Radiator can be a part of Accordion' (the pleated bellows of an accordion can appear similar to the corrugated design of a typical radiator). So the image learning process isn't entirely autonomous. There's a degree of corrective human moderation involved to purify the semantic data.
That said, NEIL's success rate is surprisingly good. In a random sample, 79 percent of the relationships formed by NEIL were deemed to be correct, while 98 percent of the visual data extracted from Google images was also correctly labelled.
What's the point of it all? There are already established visual databases such as ImageNet, which has over 14 million images, while Caltech's Visipedia project styles itself a crowdsourced 'visual encyclopaedia'.
According to Chen, NEIL is "an attempt to develop the world's largest visual structured knowledge base with minimum human labelling effort - one that reflects the factual content of the images on the internet, and that would be useful to many computer vision and AI efforts."
The NEIL project joins the existing NELL (Never Ending Language Learner) research initiative at Carnegie Mellon. This attempts to develop a system that learns to 'read the web' and to extract a set of true, structured facts from the pages that it analyses.
NELL has been running since 2010 and has amassed a knowledge base of 2,069,313 things that it believes to be true. These include 'scrap_booking is form of visual art' and 'Gujarat is a state or province located in the country India'.
Scrapbooking trivia and car parts might not sound like technological breakthroughs, but these advances in computer vision and machine learning (albeit human-assisted) will inform research into the smart search algorithms and artificial intelligences of the future.