Google DeepMind researchers have developed a new technique to help AI vision models perceive and organize the visual world more like humans do.
The method can be used to align AI vision models more closely with human knowledge, addressing long-standing blind spots such as their inability to see the connection between two objects (say, a car and an airplane) when they belong to different categories, Google DeepMind said in a blog post on Tuesday, November 11.
Using this new technique, the researchers say they were able not only to align AI vision models with human judgments but also to improve the models’ performance across a range of visual tasks, such as learning a new category from a single image (“few-shot learning”) or making reliable decisions even when the type of images being tested changes (“distribution shift”).
The aligned AI vision models also exhibited a more “human-like” form of uncertainty, they said. The results were published as a technical paper in the scientific journal Nature.
By enabling AI systems to interpret visual information more like humans do, the new Google DeepMind research could make AI-powered facial recognition systems more accurate and less biased. This is critical as these systems are increasingly used in security, law enforcement, and everyday applications. However, aligning AI vision models more closely with human vision could also end up reinforcing our own biases and blind spots, such as Crow Syndrome.
Google DeepMind said: “Many existing vision models fail to capture the high-level structure of human cognition. This research offers a potential way to address this problem, and shows that models can be better aligned with human judgments and perform more reliably on various standard AI tasks.” “While we still have more alignment work to do, our work demonstrates a step toward more robust and reliable AI systems,” the company added.
Why don’t AI models see like humans?
According to Google DeepMind, AI vision models produce representations by mapping images to points in a high-dimensional space, so that similar items (such as two sheep) are placed close to each other and dissimilar items (a sheep and a cake) are placed far apart.
However, these models still fail to capture the commonalities between objects such as a car and an airplane, both of which are large vehicles made primarily of metal.
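For illustration, here is a minimal sketch of what that embedding step looks like in practice, assuming a SigLIP checkpoint available through the Hugging Face transformers library. The checkpoint name and image file names are assumptions for illustration, not DeepMind’s own code.

```python
# A minimal sketch of mapping images to points in a high-dimensional space and
# comparing them by cosine similarity. The SigLIP-SO400M checkpoint name and the
# image file names are assumptions for illustration, not DeepMind's own code.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModel

MODEL_ID = "google/siglip-so400m-patch14-384"  # assumed Hugging Face checkpoint
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)

def embed(path: str) -> torch.Tensor:
    """Map one image to a single normalized point (vector) in embedding space."""
    image = Image.open(path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        features = model.get_image_features(**inputs)
    return torch.nn.functional.normalize(features, dim=-1).squeeze(0)

# Hypothetical files: the two sheep should land close together in the space,
# while the cake should land farther away from both.
sheep_a, sheep_b, cake = embed("sheep_a.jpg"), embed("sheep_b.jpg"), embed("cake.jpg")
print("sheep vs sheep:", torch.dot(sheep_a, sheep_b).item())
print("sheep vs cake: ", torch.dot(sheep_a, cake).item())
```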
In the past, cognitive scientists have sought to align AI models by training them on the THINGS dataset, which includes millions of individual human judgments. However, this dataset contains too few images to directly fine-tune powerful AI vision models, according to the AI research lab.
Google DeepMind’s proposed 3-step method
To understand the differences in how humans and models perceive images, Google DeepMind first ran odd-one-out tests, in which both humans and AI models were asked to pick the image that didn’t belong among the rest. “Interestingly, we found many cases where humans strongly agree on an answer, but AI models get it wrong,” the company said. To close this gap, the researchers followed a three-step process.
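The odd-one-out test itself is simple to express in code. Below is a sketch of how a model’s answer can be computed from three image embeddings: the item with the lowest total similarity to the other two is treated as the odd one out, and that choice can then be compared against the human consensus. The random stand-in embeddings are placeholders for real model outputs.

```python
# A sketch of the "odd one out" triplet test: given three image embeddings, the
# item whose pairwise similarity to the other two is lowest is picked as the odd
# one out. Comparing these picks against human choices on THINGS-style triplets
# is one way to expose human/model disagreement. Embeddings here are placeholders.
import torch

def odd_one_out(e1, e2, e3) -> int:
    """Return the index (0, 1, 2) of the embedding least similar to the other two."""
    embs = torch.stack([e1, e2, e3])
    embs = torch.nn.functional.normalize(embs, dim=-1)
    sim = embs @ embs.T                    # 3x3 cosine-similarity matrix
    support = sim.sum(dim=1) - sim.diag()  # each item's similarity to the others
    return int(torch.argmin(support))      # least supported item is the odd one

# Toy example with random stand-in embeddings.
torch.manual_seed(0)
sheep_a, sheep_b, cake = torch.randn(3, 16)
print(odd_one_out(sheep_a, sheep_b, cake))
```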
First, they used the THINGS dataset to fine-tune a pre-trained AI vision model called SigLIP-SO400M. “By freezing the original model and carefully structuring the adapter training, we created a teacher model that doesn’t forget its previous training,” Google DeepMind said.
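The blog post does not publish the training code, but the frozen-backbone-plus-adapter idea can be sketched roughly as follows. The Adapter class, the assumed 1152-dimensional SigLIP feature size, and the THINGS_TRIPLETS loader are illustrative stand-ins, not DeepMind’s actual implementation.

```python
# A minimal sketch, under stated assumptions, of "freeze the backbone, train an
# adapter": the pretrained encoder stays frozen so it cannot forget what it knows,
# and only a small adapter is trained against human odd-one-out judgments.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Small trainable head mapping frozen features into a human-aligned space."""
    def __init__(self, dim: int = 1152, hidden: int = 256):  # 1152 is an assumed feature size
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x):
        return nn.functional.normalize(self.net(x), dim=-1)

def triplet_support(adapter, feats):          # feats: (batch, 3, dim) frozen features
    z = adapter(feats)                        # project into the aligned space
    sim = torch.einsum("bid,bjd->bij", z, z)  # pairwise similarities per triplet
    # Each item's similarity to the other two; the lowest score should match
    # the human-chosen odd one out.
    return sim.sum(-1) - sim.diagonal(dim1=-2, dim2=-1)

adapter = Adapter()
optim = torch.optim.AdamW(adapter.parameters(), lr=1e-4)  # only the adapter updates
# for feats, odd_idx in THINGS_TRIPLETS:                  # hypothetical data loader
#     loss = nn.functional.cross_entropy(-triplet_support(adapter, feats), odd_idx)
#     optim.zero_grad(); loss.backward(); optim.step()
```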
This teacher model was then used to create a huge new dataset called AligNet, which includes millions of human-like similarity judgments over millions of images. The AligNet dataset was then used to fine-tune other AI vision models and align them with human-like image perception.
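Conceptually, the teacher’s role in building AligNet is to stamp out human-like judgments at scale. A hedged sketch of that pseudo-labelling step, with teacher_embed (the teacher’s image encoder) and the image pool as hypothetical stand-ins, might look like this:

```python
# A hedged sketch of pseudo-labelling: the teacher scores image triplets drawn
# from a large pool, and its odd-one-out choices become "human-like" labels that
# other vision models can be fine-tuned on. `teacher_embed` and `image_pool` are
# hypothetical stand-ins, not the actual AligNet pipeline.
import torch

def pseudo_label_triplet(teacher_embed, paths):
    """Use the teacher to decide which of three images is the odd one out."""
    embs = torch.stack([teacher_embed(p) for p in paths])
    embs = torch.nn.functional.normalize(embs, dim=-1)
    sim = embs @ embs.T
    support = sim.sum(dim=1) - sim.diag()
    return paths, int(torch.argmin(support))  # (triplet, teacher's choice)

# Repeating this over millions of sampled triplets yields an AligNet-style
# dataset of human-like judgments for fine-tuning other models:
# alignet = [pseudo_label_triplet(teacher_embed, sample_three(image_pool))
#            for _ in range(NUM_TRIPLETS)]
```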
The human-aligned AI vision models were then tested on several tasks, such as ranking images based on their similarity. “In each case, our aligned models showed a significant improvement, often agreeing closely with human judgments across a range of visual tasks,” Google DeepMind said.
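As a rough illustration of how agreement with human judgments can be reported, the fraction of triplets on which a model’s odd-one-out pick matches the human consensus serves as a simple score; the index lists below are made-up examples, not results from the paper.

```python
# A rough sketch of one way to report agreement with human judgments: the share
# of triplets on which the model's odd-one-out pick matches the human consensus.
def agreement(model_choices, human_choices) -> float:
    matches = sum(m == h for m, h in zip(model_choices, human_choices))
    return matches / len(human_choices)

print(agreement([2, 0, 1, 2], [2, 0, 2, 2]))  # -> 0.75
```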





