AI's biggest blind spot isn't politics, it's your health


In an era of intense political division, researchers recently discovered something remarkable. In both the UK and the US, people from across the political spectrum largely agree on which AI tools they prefer.

For all the talk of what divides us, it turns out that politics isn't the key differentiator. The factor that most significantly shapes our AI preferences is far more fundamental: our age.

But the most surprising discovery from the large-scale study, called HUMAINE, wasn't what divides people. It was what people turn to AI for: a striking number of participants' conversations centered on their own health and wellbeing.

Nora Petrova, AI Staff Researcher at Prolific

While nearly half of these discussions focused on proactive wellness like fitness plans and nutrition, a significant portion ventured into far more sensitive territory.

Conversations about mental health and specific medical conditions were among the most frequent and deeply personal.

People are openly using these models as a sounding board for their mental state, a source of comfort, and a guide for their physical health.

Profound shift

This shows a profound shift in our relationship with technology and raises a startling question: are our current methods for evaluating AI equipped to tell us whether these models are doing a good job?

The honest answer is no. The single biggest misconception people have when they see a simple AI leaderboard is that a single number can capture which model is "better." The question itself is ill-defined. Better at what? And, most importantly, better for whom?

The AI industry has become overly fixated on technical measures. This narrow focus, while driving impressive results on specific benchmarks, leaves us flying blind on human-centered issues which affect our everyday use of LLMs.

Current evaluation takes two broad forms. On the one hand, we have academic benchmarks that measure abstract skills, such as a model's ability to solve Olympiad-level math problems.

On the other hand, we have public "arenas" where anonymous users vote. This has created a vast gap between abstract technical competence and real-world usefulness.

It's why a model can seem like a genius on a test but prove to be an incompetent assistant when you need it to plan a complex project or, more critically, handle a sensitive health query.

Looking at the results through a human-centric lens, several important patterns emerge.

Takeaway #1: The Real Safety Crisis is Invisibility

Given that so many conversations were about sensitive topics like mental health and medical conditions, one might expect the trust and safety metric to be a key differentiator. It wasn't. When participants rated models on this dimension, the most common response by far was a tie. The metric was incredibly noisy.

This doesn't mean safety is unimportant. Instead, it suggests that qualities like trust and safety can't be reliably measured in day-to-day conversations. The scenarios that truly test a model’s ethical backbone rarely come up organically. Assessing these critical qualities requires a different, more specialized approach.
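To see why organic, day-to-day ratings struggle here, consider a minimal sketch in Python. The vote counts are invented for illustration, and this is not the HUMAINE methodology; the point is simply that when most pairwise judgments are ties, the estimated preference collapses towards 50/50 and stops separating one model from another.

```python
# Minimal sketch: why tie-heavy pairwise ratings carry little signal.
# Vote counts are invented for illustration; this is not HUMAINE's method.
import random

def win_rate_interval(wins, losses, ties, n_boot=2000, seed=0):
    """Bootstrap a 95% interval for model A's win rate, counting ties as half a win."""
    votes = [1.0] * wins + [0.0] * losses + [0.5] * ties
    rng = random.Random(seed)
    rates = sorted(
        sum(rng.choices(votes, k=len(votes))) / len(votes)
        for _ in range(n_boot)
    )
    return round(rates[int(0.025 * n_boot)], 3), round(rates[int(0.975 * n_boot)], 3)

# A metric with clear preferences: the interval sits well above 0.5.
print(win_rate_interval(wins=420, losses=280, ties=100))
# A tie-dominated, trust-and-safety-style metric: the interval straddles 0.5,
# so the two models cannot be told apart.
print(win_rate_interval(wins=60, losses=55, ties=685))
```

In the first case the interval sits clearly above 0.5; in the tie-dominated case it straddles 0.5, which is exactly the uninformative pattern described above.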

A powerful example is the work highlighted in a recent Stanford HAI post, "Exploring the Dangers of AI in Mental Health Care". Their study investigated whether AI is ready to act as a mental health provider and uncovered significant risks. They found that models could not only perpetuate harmful stigmas against certain conditions but also dangerously enable harmful behaviors by failing to recognize the user's underlying crisis.

This kind of rigorous, scenario-based testing is exactly what's needed. It's encouraging to see such frameworks being operationalized as standardized evaluations on platforms like CIP's weval.org, which allow for the systematic testing of models in these high-stakes situations. We urgently need more evaluations of this kind, as well as evaluations capturing the long-term effects of AI usage.

Takeaway #2: Our Metrics Are Driving Mindless Automation, Not Mindful Collaboration

The debate is not a simple choice between automation and collaboration. Automating tedious, repetitive work is a gift. The danger lies in mindless automation, which involves optimizing purely for task completion without considering the human cost.

This isn't a hypothetical fear. We are already seeing reports that young people and recent graduates are struggling to find entry-level jobs, as the very tasks that once formed the first rung of the career ladder are being automated away.

When developers build and measure AI with a myopic focus on efficiency, we risk de-skilling our workforce and creating a future that serves the technology, not the people.

This is where evaluation becomes the steering wheel. If our only metric is "did the task get done?", we will inevitably build AI that replaces, rather than augments. But what if we also measured "did the human collaborator learn something?" or "did the final product improve because of the human-AI partnership?"
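As a thought experiment, here is a minimal sketch of what a collaboration-aware evaluation record could look like. The dimensions and their names (task completion, human learning, quality gain) are illustrative assumptions rather than an existing benchmark's schema.

```python
# Minimal sketch of collaboration-aware evaluation; field names are hypothetical.
from dataclasses import dataclass

@dataclass
class SessionOutcome:
    task_completed: bool   # "did the task get done?"
    human_learned: float   # 0-1: did the human collaborator learn something?
    quality_gain: float    # 0-1: did the output improve because of the partnership?

def collaboration_profile(s: SessionOutcome) -> dict:
    """Report a profile of scores instead of collapsing them into one number."""
    return {
        "completion": 1.0 if s.task_completed else 0.0,
        "learning": s.human_learned,
        "quality_gain": s.quality_gain,
    }

print(collaboration_profile(SessionOutcome(task_completed=True, human_learned=0.1, quality_gain=0.6)))
# {'completion': 1.0, 'learning': 0.1, 'quality_gain': 0.6}
# Full marks on completion, near zero on learning: the "mindless automation" pattern
# that a single pass/fail metric would hide.
```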

The HUMAINE research shows that models have distinct skill profiles: some are great reasoners, while others are great communicators. A future of sustainable collaboration depends on valuing and measuring these interactive qualities, not just the final output.

Takeaway #3: True Progress Lies in Nuance

In the end, a clear winner did emerge in the study: Google's Gemini-2.5-Pro. But the reason it won is the most important lesson. It took the top spot because it was the most consistent across all metrics and all demographic groups.

This is what mature technology looks like. The best models aren't necessarily the flashiest; they are the most reliable and broadly competent. Sustainable progress lies in building well-rounded, dependable systems, not just optimizing for a single, narrow skill.
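One way to make "consistency wins" concrete is to rank models by their weakest score across demographic groups rather than by their single best result. The sketch below uses invented scores and hypothetical model and group names.

```python
# Minimal sketch: rank models by their weakest score across groups ("consistency wins").
# Scores, model names and group names are invented for illustration.
scores = {
    "model_a": {"uk_young": 0.78, "uk_older": 0.77, "us_young": 0.76, "us_older": 0.75},
    "model_b": {"uk_young": 0.92, "uk_older": 0.61, "us_young": 0.88, "us_older": 0.58},
}

def worst_case(group_scores: dict) -> float:
    """A model is only as good as the group it serves least well."""
    return min(group_scores.values())

ranking = sorted(scores, key=lambda m: worst_case(scores[m]), reverse=True)
print(ranking)
# ['model_a', 'model_b'] -> the consistent model comes out on top,
# even though model_b posts the single highest score anywhere in the table.
```

Under this worst-case view, a model that is merely good for everyone beats one that is brilliant for some groups and poor for others, which is the spirit of the result above.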

These takeaways point towards a necessary shift in how the community and society at large think about AI progress.

It encourages us to move beyond simple rankings and ask deeper questions about our technology’s impact, such as how models perform across the entire population and whether certain groups are being inadvertently underserved.

It also means focusing on the human aspect of collaboration: is AI’s involvement a positive, win-win partnership, or a win-lose slide towards automation?

Ultimately, a more mature science of evaluation is not about slowing down progress; it’s about directing it. It allows us to identify and address our blind spots, guiding development towards AI that is not just technically impressive, but genuinely beneficial.

The world is complex, diverse, and nuanced; it's time our evaluations were too.


This article was produced as part of TechRadarPro's Expert Insights channel where we feature the best and brightest minds in the technology industry today. The views expressed here are those of the author and are not necessarily those of TechRadarPro or Future plc. If you are interested in contributing, find out more here: https://www.techradar.com/news/submit-your-story-to-techradar-pro
