Worried about AI taking your job? Samsung's new tool will let your boss track just how well it's doing
Human annotators refine criteria while AI reviews for contradictions

- Samsung TRUEBench subjects AI chatbots to strict rules with no partial credit
- Samsung uses 2,485 tests across languages to mimic office workloads
- Inputs range from short prompts to documents over twenty thousand characters
The adoption of AI tools in workplaces has grown rapidly, raising concerns not only about automation but also about how these systems are judged.
Until now, most benchmarks have been narrow in scope, testing AI writers and AI chatbot systems with simple prompts that rarely resemble office life.
Samsung has stepped into this debate with TRUEBench, a new framework it says is designed to track whether AI models can handle tasks that resemble actual work.
Testing AI in the workplace
TRUEBench, short for Trustworthy Real-world Usage Evaluation Benchmark, contains 2,485 test sets spread across ten categories and twelve languages.
Unlike conventional benchmarks, which focus on one-off questions in English, it introduces longer, more complex tasks such as multi-step document summarization and translation across multiple languages.
Samsung says inputs vary from a handful of characters to over twenty thousand, an attempt to reflect both quick requests and long reports.
The company argues these test sets expose the limits of AI chatbot platforms when they face real-world conditions rather than classroom-style queries.
Each test has strict requirements: unless all specified conditions are met, the model fails. This makes the results more demanding and less forgiving than those of many existing benchmarks, which often give credit for partial answers.
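The gap between all-or-nothing and partial-credit scoring can be shown with a minimal Python sketch. The conditions, the sample response and the function names are hypothetical, chosen only for illustration; this is not Samsung's implementation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Condition:
    """One requirement a response must satisfy (hypothetical structure)."""
    name: str
    check: Callable[[str], bool]


def strict_score(response: str, conditions: List[Condition]) -> int:
    """All-or-nothing: 1 only if every condition holds, otherwise 0."""
    return int(all(c.check(response) for c in conditions))


def partial_score(response: str, conditions: List[Condition]) -> float:
    """Partial credit: the fraction of conditions that hold."""
    return sum(c.check(response) for c in conditions) / len(conditions)


# Illustrative conditions for a made-up "write a short reminder" task.
conditions = [
    Condition("mentions the deadline", lambda r: "Friday" in r),
    Condition("under 100 characters", lambda r: len(r) < 100),
    Condition("no exclamation marks", lambda r: "!" not in r),
]

response = "The report is due Friday!"

print(strict_score(response, conditions))   # fails: one condition is violated
print(partial_score(response, conditions))  # two of three conditions are met
```

Under the strict rule, a response that meets two of three conditions scores the same as one that meets none, which is why such a benchmark tends to report lower numbers than one that averages partial results.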
“Samsung Research brings deep expertise and a competitive edge through its real-world AI experience,” said Paul (Kyungwhoon) Cheun, CTO of the DX Division at Samsung Electronics and Head of Samsung Research.
“We expect TRUEBench to establish evaluation standards for productivity and solidify Samsung’s technological leadership.”
Samsung Research outlines a process where humans and AI cooperate in designing the evaluation criteria.
Human annotators first set the conditions, then AI reviews them to detect contradictions or unnecessary constraints.
The criteria are refined repeatedly until they are consistent and precise.
Automatic scoring is then applied to AI models, minimizing subjective judgments and making comparisons more transparent.
One of the unusual aspects of TRUEBench is its publication on Hugging Face, where leaderboards allow direct comparison of up to five models.
In addition to performance scores, Samsung also discloses the average response length, a metric that helps weigh efficiency alongside accuracy.
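As a rough illustration of how a score and an average response length might sit side by side, here is a small Python sketch. The model names, the results and the tie-breaking rule (shorter answers rank higher when pass rates are equal) are all assumptions made for this example, not details Samsung has described.

```python
from statistics import mean

MAX_MODELS = 5  # the leaderboard is described as comparing up to five models


def summarize(results):
    """results: list of (passed, response_text) pairs for one model."""
    passed = [1.0 if ok else 0.0 for ok, _ in results]
    lengths = [len(text) for _, text in results]
    return {"pass_rate": mean(passed), "avg_length": mean(lengths)}


def compare(runs):
    """Rank models by pass rate; break ties by shorter average response (assumed rule)."""
    if len(runs) > MAX_MODELS:
        raise ValueError(f"compare at most {MAX_MODELS} models at a time")
    rows = [(name, summarize(results)) for name, results in runs.items()]
    return sorted(rows, key=lambda row: (-row[1]["pass_rate"], row[1]["avg_length"]))


# Made-up results for two hypothetical models.
runs = {
    "model_a": [(True, "Done."), (False, "I am not sure what you mean by that.")],
    "model_b": [(True, "Done, see attached."), (True, "Yes.")],
}

for name, stats in compare(runs):
    print(name, stats)
```

Reporting length next to the score lets a reader spot a model that passes tests only by producing much longer answers.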
The decision to open parts of the system suggests a push for credibility, although it also exposes Samsung’s approach to scrutiny.
As AI systems take on responsibilities similar to their own, many workers are already wondering how productivity will be measured.
With TRUEBench, managers may have a way to judge whether an AI chatbot can replace or supplement staff.
Yet despite TRUEBench's ambitions, benchmarks, however broad, remain synthetic measures and cannot fully capture the messiness of workplace communication or decision-making.
TRUEBench may set higher standards for evaluation, but whether it can resolve fears of job displacement, or simply sharpen them, remains an open question.

Efosa has been writing about technology for over 7 years, initially driven by curiosity but now fueled by a strong passion for the field. He holds both a Master's and a PhD in sciences, which provided him with a solid foundation in analytical thinking. Efosa developed a keen interest in technology policy, specifically exploring the intersection of privacy, security, and politics. His research delves into how technological advancements influence regulatory frameworks and societal norms, particularly concerning data protection and cybersecurity. Upon joining TechRadar Pro, in addition to privacy and technology policy, he is also focused on B2B security products. Efosa can be contacted at this email: udinmwenefosa@gmail.com