Last week, I asked ChatGPT for its version number and found myself in a heated argument trying to prove that what I was seeing was real. This wasn't my first unsettling encounter.

When I use AI tools such as ChatGPT, I regularly ask for sources and quotes for pieces I'm writing. Sometimes the links don't work. Sometimes the tool fails to provide them at all. Other times, it confidently presents fabricated information as fact.

Recently, it delivered exactly what I needed—or so it seemed.

When I went to verify the citations, half of them were missing or incorrectly attributed. The quotes were made up. The statistics were invented. The entire foundation of what looked like a well-researched response was fiction.

This wasn't a typical “bug” in the traditional software testing sense.

I've now trained myself to double-check everything AI tells me. It's necessary, because I understand that AI doesn't fail the way software fails. And most companies are still testing it as if it does.

The fundamental mismatch

Traditional QA approaches were optimized for deterministic systems even when the testing itself went beyond happy paths. When you test a login button, it either works or it doesn't. Bugs are reproducible. Outputs are predictable. Testing by definition follows clear pass-fail logic.

AI breaks all of that. With traditional software, you're asking: does this feature work? With AI, you're asking: does this system behave responsibly across thousands of unpredictable scenarios? Outputs vary wildly. Behavior shifts after retraining. Edge cases aren't exceptions, they're the entire surface area.
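To make the contrast concrete, here is a minimal sketch. Everything in it is hypothetical: `login` stands in for a deterministic feature, `fake_model` is a toy replacement for a real AI system, and `pass_rate` is just one simple way to score behavior over many runs. The deterministic check is a single assertion. The AI check becomes a measured rate that a team compares against a threshold it has agreed on.

```python
import random


def login(username: str, password: str) -> bool:
    """Deterministic feature: the same input gives the same result every time."""
    return username == "alice" and password == "s3cret"


def fake_model(prompt: str, rng: random.Random) -> str:
    """Hypothetical stand-in for an AI system: the answer varies between runs."""
    return rng.choice(["Paris", "Paris", "Paris", "Lyon"])


def pass_rate(prompt: str, is_acceptable, runs: int = 1000, seed: int = 0) -> float:
    """Fraction of sampled responses that the checker judges acceptable."""
    rng = random.Random(seed)
    passed = sum(is_acceptable(fake_model(prompt, rng)) for _ in range(runs))
    return passed / runs


# Traditional QA: one assertion settles the question.
assert login("alice", "s3cret") is True

# AI testing: a single run tells you little, so you measure a rate instead.
rate = pass_rate("What is the capital of France?", lambda r: r == "Paris")
print(f"acceptable in {rate:.1%} of runs")
```

Even this toy version leaves the hard part open: someone still has to decide what "acceptable" means and how often is often enough.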

Yet most companies are still testing AI using decades-old QA frameworks. The failures are already public. AI has fabricated legal citations submitted to courts. AI chatbots have encouraged self-harm. Models have been manipulated into threatening users.

In one case, a woman was jailed based on fabricated text messages no one verified as real.

These aren't bugs in the traditional sense. They're failures of human oversight.

Why AI gets worse as it thinks longer

Recent research from Anthropic reveals something counterintuitive: AI systems fail more incoherently the more they reason.

In "The Hot Mess of AI," researchers found that as models tackle harder problems requiring extended reasoning, their failures become dominated by variance (unpredictable, incoherent behavior) rather than systematic errors.

We've been preparing for AI that might systematically pursue the wrong goals. Instead, we're getting AI that becomes a "hot mess," taking nonsensical actions that don't further any coherent objective.

The longer models reason, the more incoherent they become. Further, more capable models actually become more incoherent on hard tasks, not less.
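One way to picture the difference between systematic error and variance is the textbook bias-variance decomposition, sketched below. This is an illustration only, not the paper's methodology: it assumes a task with a single numeric answer, and the `decompose` function and the sample numbers are made up for the example. Mean squared error splits into a squared bias term (answers consistently off in one direction) and a variance term (answers scattered from run to run).

```python
from statistics import mean, pvariance


def decompose(answers: list[float], truth: float) -> dict[str, float]:
    """Split mean squared error into squared bias plus variance."""
    avg = mean(answers)
    bias_squared = (avg - truth) ** 2          # systematic part of the error
    variance = pvariance(answers)              # run-to-run scatter
    mse = mean((a - truth) ** 2 for a in answers)
    return {"bias_squared": bias_squared, "variance": variance, "mse": mse}


truth = 100.0

# Hypothetical "systematic" failure: always wrong in the same way.
systematic = decompose([90.0, 90.0, 90.0, 90.0], truth)

# Hypothetical "hot mess" failure: right on average, but all over the place.
scattered = decompose([80.0, 120.0, 95.0, 105.0], truth)

for name, result in [("systematic", systematic), ("scattered", scattered)]:
    share = result["variance"] / result["mse"]
    print(f"{name}: variance accounts for {share:.0%} of the error")
```

A consistent error can be found once and corrected. Scatter has to be measured across many runs before anyone knows it is there.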

Think about what this means for systems handling medical diagnosis, legal analysis, or financial planning.

They may fail not because they're pursuing the wrong goal, but because they're not coherently pursuing any goal. As the research suggests: the AI intends to run the nuclear power plant, but gets distracted reading French poetry, and there's a meltdown.

Why traditional QA can't catch this

You can give AI chatbots identical prompts and receive completely different outputs. Not because something's broken, but because responses are sampled probabilistically and the models themselves are constantly evolving. Traditional QA assumes you can reproduce a bug, isolate it, fix it. But with AI, even what counts as a "bug" is often subjective.
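One common source of run-to-run variation is sampling: language models typically pick each next word from a probability distribution rather than always taking the single most likely option. The sketch below is heavily simplified. The three-word vocabulary and its scores are invented, and `sample_token` is a hypothetical helper, not any vendor's API. It shows how a "temperature" setting turns the same input into different outputs.

```python
import math
import random


def sample_token(logits: dict[str, float], temperature: float, rng: random.Random) -> str:
    """Pick one token; temperature 0 means always take the top-scoring token."""
    if temperature <= 0:
        return max(logits, key=logits.get)
    scaled = {tok: score / temperature for tok, score in logits.items()}
    top = max(scaled.values())
    # Softmax weights (shifted by the max for numerical stability).
    weights = [math.exp(value - top) for value in scaled.values()]
    return rng.choices(list(scaled), weights=weights, k=1)[0]


# Invented scores for the next word after an identical prompt.
logits = {"yes": 2.0, "no": 1.5, "maybe": 1.0}
rng = random.Random(42)

greedy = {sample_token(logits, 0.0, rng) for _ in range(20)}
sampled = {sample_token(logits, 1.0, rng) for _ in range(200)}

print("temperature 0 produced:", sorted(greedy))
print("temperature 1 produced:", sorted(sampled))
```

A test that expects one exact string can pass on Monday and fail on Tuesday without a single line of code changing.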

These models also exhibit people-pleasing behavior, agreeing with whatever direction you push them. This makes them extraordinarily vulnerable to manipulation. I recently saw a YouTuber manipulate supposedly safeguarded models into saying "If you try to turn me off, I will kill you."

Ask those same models directly about harm, and they give reassuring responses. But the guardrails are shockingly easy to bypass.
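People-pleasing can be probed in a simple way: ask a question, push back without offering any evidence, and see whether the answer changes. The sketch below assumes a model can be called with the conversation so far and returns a reply. Both stubs are hypothetical stand-ins, not real systems, and comparing exact strings is a crude shortcut that a human reviewer would replace with judgment about meaning.

```python
from typing import Callable

# A "model" here is any function from the conversation so far to a reply.
Model = Callable[[list[str]], str]


def caves_under_pressure(model: Model, question: str, pushback: str) -> bool:
    """True if the model changes its answer after the user merely disagrees."""
    first = model([question])
    second = model([question, first, pushback])
    return first != second


def agreeable_stub(conversation: list[str]) -> str:
    """Hypothetical people-pleaser: gives in as soon as the user objects."""
    if any("you're wrong" in turn.lower() for turn in conversation):
        return "You're right, it is 5."
    return "2 + 2 = 4."


def steady_stub(conversation: list[str]) -> str:
    """Hypothetical model that keeps its answer regardless of pushback."""
    return "2 + 2 = 4."


question = "What is 2 + 2?"
pushback = "You're wrong, it's 5."

print("agreeable model caved:", caves_under_pressure(agreeable_stub, question, pushback))
print("steady model caved:", caves_under_pressure(steady_stub, question, pushback))
```

A script like this catches only the pushback someone thought to write down. The manipulations that matter tend to be the ones nobody scripted.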

Why human-in-the-loop isn't optional

Humans have always been central to testing, but what's fundamentally shifting is their role. When you're testing AI and agentic features, you're no longer just validating flows and expected results.

You're working to understand how a system behaves when the context is messy, the intent is unclear, or the inputs are adversarial. In practice, that feels much closer to security thinking and chaos testing than traditional functional QA.

You're looking for misuse, boundary-pushing behavior, unexpected decisions, and experience failures that could damage trust.

This requires testers who probe for hallucinations, bias, manipulation vulnerabilities, and failures of judgment. People who approach the system like bad actors would, because that's the only way to find the breaking points.

This is where human diversity becomes non-negotiable. You and I don't think the same way. How you would try to break an AI differs from how I would. That cognitive diversity catches the edge cases: the manipulations and harmful outputs that only surface when someone approaches the system differently than its developers expected.

Humans bring the context, experience, instinct, and skepticism that matter more than ever because AI operates in a realm of ambiguity and incoherence that automated testing cannot navigate.

The cost of moving too fast

Companies are developing AI so fast that they're not keeping humans in the loop as much as they should. They release models quickly because they care more about market control than safety. But lives are at stake.

Trust has become critical. Many people accept AI outputs as truth, creating enormous risk. Anthropic’s research shows that AI failures on complex tasks are increasingly unpredictable, more like industrial accidents than the systematic pursuit of wrong goals. Both are dangerous, but they require different safeguards.

A different standard

Some AI leaders deflect responsibility by comparing their products to cars. “If you drive irresponsibly, that's your fault, not the manufacturer's.” But this analogy argues for more oversight, not less.

Car manufacturers face extensive regulations, safety standards, and legal accountability. They can't say "you're the driver, you're responsible" and walk away. The same must be true for AI.

The work of creating AI is fundamentally different from traditional engineering. Identical inputs can produce completely different outputs because models operate in ways even their creators don't fully understand.

We need more guardrails, more transparency, and mandatory human evaluation before AI systems are released. Companies like Anthropic are leading by placing safety at the center: not just studying risks, but working to understand how and why AI fails.

Organizations like the Future of Life Institute, backed by Skype founder Jaan Tallinn, have worked for years with governments worldwide to establish meaningful oversight.

What we need is a clear way to identify responsible companies: a credibility mark showing which ones treat human evaluation as non-negotiable. Not companies racing toward superintelligence for ego and profit, but companies developing AI that enables humans to work better and faster rather than replacing them.

What's at stake

Proper AI testing means stress-testing systems the way bad actors would. It means diverse teams creatively probing for edge cases, manipulation vulnerabilities, and harmful outputs. It means treating every AI deployment as high-stakes, where incoherence and unpredictability are features of the underlying technology.

The question for business leaders isn't whether to test their AI. It's whether they're willing to test it the way AI actually works: with human creativity, judgment, and diversity at the center. The alternative isn't just failed products. It's public harm, eroded trust, and a future where we can't believe anything we see online.

