Surprisingly enough, it seems some AI agents aren't quite up to scratch on some basic business tests


  • Salesforce research finds single-turn tasks see only 58% success, while multi-turn effectiveness drops to 35%
  • Reasoning models like gemini-2.5-pro tend to outperform lighter models
  • CRMArena-Pro has proven to be a challenging benchmark

Researchers from Salesforce AI Research have introduced a new benchmark – CRMArena-Pro – which uses synthetic enterprise data to assess LLM agent performance in different CRM scenarios.

It found LLM agents achieved around 58% success on tasks which can be completed in a single step, with tasks that require multiple interactions dropping in effectiveness to just 35% – barely more than one in three.

Although models like gemini-2.5-pro achieved over 83% success in workflow execution, the Salesforce researchers still highlighted some concerns with AI agents, suggesting they might not quite be up to scratch after all.

Are AI agents actually that good?

The paper, entitled 'Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions', explained that LLM agents displayed near-zero inherent confidentiality awareness, noting that their handling of sensitive information improved only with explicit prompting (which often came at the expense of task success).

The researchers also criticized previous and existing benchmarks for failing to capture multi-turn interactions, address B2B scenarios or confidentiality, or reflect realistic data environments. CRMArena-Pro is built on synthetic data validated by CRM experts, covering B2B and B2C settings.

In terms of analysis results, reasoning models like gemini-2.5-pro and o1 outperformed lighter models most of the time – Salesforce's researchers concluded that models that seek more clarifications generally perform better, especially in multi-turn tasks.

For example, the nine models tested (three each from OpenAI, Google and Meta) averaged a score of 35.1%, while gemini-2.5-pro scored 54.5%.

"These findings suggest a significant gap between current LLM capabilities and the multifaceted demands of real-world enterprise scenarios, positioning CRMArena-Pro as a challenging testbed for guiding future advancements in developing more sophisticated, reliable, and confidentiality-aware LLM agents for professional use," the researchers concluded.

Looking ahead, Salesforce CEO Marc Benioff views AI agents as a high-margin opportunity, with major corporate clients, including governments, betting on AI agents for boosted efficiency and further cost savings.

With several years’ experience freelancing in tech and automotive circles, Craig’s specific interests lie in technology that is designed to better our lives, including AI and ML, productivity aids, and smart fitness. He is also passionate about cars and the decarbonisation of personal transportation. As an avid bargain-hunter, you can be sure that any deal Craig finds is top value!
