'People don’t trust bad AI voices': listeners rated a Chinese startup's synthetic voices higher for trust and realism than those from Microsoft, Google, and Amazon


A new global study suggests people stop trusting AI voices the moment they realize the voice isn't human, which creates a big problem for companies that use synthetic voices in customer service and other public-facing systems.

The study involved more than 10,000 participants over the course of a month. They listened to a series of voices and were asked to react to each one, with the option to like, dislike, skip, or rate it. Listeners were not told in advance that the speech was AI-generated.


Rejecting AI voices

The study measured how people responded to voices across 18 characteristics, including whether they sounded warm, clear, or monotonous. Researchers also tracked how long people listened and how they reacted, rather than relying just on direct feedback.

One of the clearest results was that people tend to reject AI voices once they recognize them. The researchers found a strong negative link between detecting an AI voice and liking it.

The results also showed big differences in quality between voice models. The best-performing system was rated three times higher than the lowest-ranked model.

Smaller AI companies performed especially well in the rankings, with Chinese startup MiniMax ranked as the best voice model by both UK and US listeners. Big Tech giants like Google, Amazon, and Microsoft lagged significantly behind.

The study also found differences between countries. UK listeners were 13% more likely to recognize AI-generated voices than Americans. However, European listeners were generally more willing to accept AI voices overall.

“While switching to a specialized TTS takes resources, choosing the wrong provider is becoming a critical brand liability — especially for products built on trust,” said Nick Lahoika, CEO and founder of Vocal Image. “The reality is simple: people still don't trust bad AI voices.”

  • Chinese startup MiniMax came top of your audio perceptual study. Can you tell us why this is such a big deal?

We ran this research in January with 10,000 users comparing MiniMax against 19 voice models. The goal was simple: identify which voice people actually trust.

Given the recent viral attention around their videos, our study confirms that MiniMax’s voice, even without the visual avatar, is perceived as the most authentic.

Instead of standard A/B benchmarking like you see on Hugging Face, we focused on human perception. Participants evaluated voices the way they evaluate other people, based on trust, attractiveness, and authority, without knowing they were listening to AI.

In my opinion, this kind of data is far more valuable than the orchestrated upvoting you get on ProductHunt.

The results were interesting. 86% of native UK and US speakers rated MiniMax as the highest-quality voice. British listeners specifically described it as the most confident.

Our research also shows that British listeners are the best at detecting AI voices, which makes that result even more significant. If the hardest audience to fool perceives it as authentic, the model is clearly operating at a very high level.

  • You also noted that big tech giants are "lagging". Why do you think that is the case and what could they do to close the gap (e.g. through acquisitions)?

Big Tech wins on scale but loses on precision. In my opinion, their voice models are built for millions of horizontal use cases where “good enough” is acceptable. That works for something like a weather update. But in high-stakes contexts, such as communication coaching or speech therapy, intonation and rhythm are the product. If the voice feels synthetic, the experience breaks immediately.

In sectors where AI is used for sales, education, or managing sensitive inquiries, the voice must project confidence, clarity, and trustworthiness to earn and sustain user trust.

We saw this firsthand when our team built a high-fidelity Estonian synthesis model as a tribute to Estonia, where our company is now based after relocating from Belarus in 2020. At the time, the only alternative was Microsoft’s system, and it could not pronounce Estonian numerals correctly.

Imagine a business news broadcast where the numbers are wrong. That’s the “last mile” of quality that large horizontal platforms often overlook.

To close this gap, I expect Big Tech to rely increasingly on acquisitions. It’s difficult to specialize deeply across every vertical while maintaining their scale economics.

For startups, the opportunity lies in building systems optimized for specific, high-value contexts where quality matters more than scale.

  • The report also noted that a lot of listeners preferred AI-generated voices. Why do you think that is the case? Are we seeing user fatigue (i.e. there's so much AI-produced audio out there that people are tired of fighting against it and may as well embrace it)?

We only tested AI voices for this report, so we can't claim that people like AI voices more than real ones.

What we can say is that 66% of listeners couldn't tell the voices were AI-generated, which shows how far the technology has come.

I don't think people are tired of human voices. I think they are just getting used to AI voices. Many people speed up videos (1.5x or 2x). When they do this, they stop listening for feelings and just want to get the facts fast.

An AI voice is perfect for facts because it's clean, clear, and has no mistakes or pauses. People are starting to choose AI because it's faster and clearer, not because they have been forced to adapt to it.

In our study, we measured 18 voice features. The key finding was that voices that sounded clear and confident consistently outperformed voices that merely sounded human.

This was especially true for ElevenLabs and Descript. Their AI voices seem to sound more "professional" than many human voice actors who record in cheap studios.

  • You mentioned 3 broad categories of text-to-speech models in the research: AI platforms, specialized TTS entities and the big guns. How do their respective approaches differ from each other and which one do you think will become the mainstream one in the future?

Our research shows that AI platforms and highly specialized startups are the two categories most likely to dominate the next stage of voice technology.

The industry is moving beyond simply generating sound. The real challenge is aligning voices with human perception, which involves emotion, humor, authority, and subtle nuance.

Creating synthetic speech is rapidly becoming a commodity. Evaluating and tuning voices to how humans actually perceive them is the real bottleneck.

Specialized startups often move faster here because they build systems for specific outcomes instead of optimizing for general capabilities. Sure, large tech companies still have enormous resources, and for them, acquisitions will likely remain the main strategy for closing the quality gap.

  • You told me that in the future, you’d want to develop a single unified way that guides the user holistically, capturing mannerisms, for example, and the entire spectrum of nonverbal communication, like actors do when portraying celebrities in biopics. The cynic in me posits that this goes too far and could be used to create almost perfect deepfakes. Any thoughts on that?

Even today, one photo or video from your Instagram is enough to create a highly realistic deepfake. Voice cloning takes only a few seconds. There is no good or bad technology; there are only people who use it in different ways.

We use data to train our software in soft skills and to provide users with actionable feedback. While this data could potentially be used for fraud detection, our primary focus remains on providing feedback to help users improve. We do not aim to create clones of people. Our goal is the opposite: we want to help people improve their communication skills.

Today, investors already analyze a founder’s written communication. In the future, they will also evaluate how confidently someone speaks, how they present themselves, and how clearly they express ideas.

AI can help train those skills objectively, without the social pressure people often feel in coaching environments.

Speaking anxiety is a massive global problem. More than 200 million people struggle with it. Traditional coaching is expensive and inaccessible to most people.

AI coaching can be up to 280 times more cost-effective than traditional executive training. Instead of hiring multiple specialists, such as a speaking coach, an acting teacher, and a communication trainer, users get structured feedback and daily practice in one system. Traditional executive coaching programs can cost between $7,000 and $25,000 per employee annually, while an annual subscription to our app costs just $89.99 in the U.S.

In short, we are not looking to replace human growth. Our mission is to make personal development accessible to anyone.



Desire Athow
Managing Editor, TechRadar Pro

Désiré has been musing and writing about technology during a career spanning four decades. He dabbled in website builders and web hosting when DHTML and frames were in vogue and started narrating about the impact of technology on society just before the start of the Y2K hysteria at the turn of the last millennium.

