Sierra, the customer experience AI startup created by OpenAI board member Bret Taylor and Google AR/VR veteran Clay Bavor, has set a new benchmark for evaluating the performance of conversational AI agents. Called TAU-bench, it tests agents on completing complex tasks while conducting multiple exchanges with LLM-simulated users to collect the required information. Early results indicate that AI agents built with simple LLM constructs such as function calling or ReAct struggle even on “relatively simple tasks,” reinforcing the company’s belief that enterprises need more sophisticated agent architectures.
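For readers unfamiliar with the “simple” architectures mentioned above, here is a minimal sketch, not Sierra’s code, of a function-calling-style agent loop in which an agent alternates between calling tools and exchanging messages with an LLM-simulated user. The helpers passed in (`agent_step`, `simulate_user`) stand in for real chat-completion calls and are assumptions for illustration only.

```python
# A hypothetical function-calling agent loop of the kind TAU-bench stresses:
# the agent either calls a domain tool or messages the (simulated) user.
import json
from typing import Any, Callable, Dict, List

Message = Dict[str, str]

def run_episode(
    agent_step: Callable[[List[Message]], Dict[str, Any]],  # returns a tool call or a user-facing reply
    simulate_user: Callable[[List[Message]], str],          # LLM-based user simulator
    tools: Dict[str, Callable[..., Any]],                   # domain tool APIs
    system_prompt: str,
    max_turns: int = 20,
) -> List[Message]:
    """Run one agent/user episode and return the full transcript."""
    messages: List[Message] = [{"role": "system", "content": system_prompt}]
    for _ in range(max_turns):
        step = agent_step(messages)
        if "tool_call" in step:
            # Agent chose to call a tool; execute it and append the observation.
            name, args = step["tool_call"]["name"], step["tool_call"]["args"]
            result = tools[name](**args)
            messages.append({"role": "tool", "content": json.dumps(result)})
        else:
            # Agent replied to the user; let the simulated user respond.
            messages.append({"role": "assistant", "content": step["text"]})
            if step.get("done"):
                break
            messages.append({"role": "user", "content": simulate_user(messages)})
    return messages
```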
Interested developers can download the TAU-bench code from Sierra’s GitHub repository.
TAU-bench: what you need to know
“At Sierra, our experience deploying user-facing conversational agents in the real world has made one thing abundantly clear: a robust measurement of agent performance and reliability is critical to their successful deployment,” says Karthik Narasimhan, Sierra’s head of research. “Before companies deploy an AI agent, they need to measure how well it works in as realistic a scenario as possible.”
He argues that existing benchmarks, such as WebArena, SWE-bench and AgentBench, fall short in several key areas. While they can reveal an agent’s high-level capabilities, they evaluate only a single round of human-agent interaction, like the following:
User: “What’s the weather like in New York today?”
AI: “Today in New York, it’s sunny, with a high of 75°F (24°C) and a low of 60°F (16°C).”
This is limiting because, in real-world scenarios, agents need to gather this kind of information through multiple dynamic exchanges:
User: “I want to book a flight.”
AI: “Sure! Where do you want to travel to?”
User: “From Chicago to Miami.”
AI: “I see. When do you want to travel?”
User: “Next Friday.”
AI: “Okay. Do you have a departure time preference?”
…(conversation continues)
These benchmarks also focus on first-order statistics such as average performance, Narasimhan says, without providing any measurement of reliability or adaptability.
To address these gaps, Sierra set three requirements for TAU-bench. First, most real-world settings require agents to interact seamlessly with both humans and programmatic APIs over long horizons to gather information and solve complex problems. Second, agents must be able to follow complex, task-specific policies or rules precisely. Finally, agents must be consistent and reliable at scale so companies can have confidence in how they will behave.
TAU-bench assigns a range of tasks for agents to complete, combining realistic databases and tool APIs, domain-specific policy documents that dictate the required agent behavior, and an LLM-based user simulator that follows instructions for diverse scenarios to generate realistic conversations with the agent. Each task assesses an agent’s ability to follow rules, reason, retain information over long and complex contexts, and communicate as it would in a real-life conversation.
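As a rough illustration of the components described above, a TAU-bench-style task could be assembled along the following lines. The field names and example values here are assumptions for illustration, not Sierra’s actual schema.

```python
# A hypothetical sketch of a benchmark task built from a domain database,
# tool APIs, a policy document, and a scenario for the LLM user simulator.
from dataclasses import dataclass
from typing import Any, Callable, Dict

@dataclass
class Task:
    domain_policy: str                      # rules the agent must follow
    database: Dict[str, Any]                # mutable domain state (e.g. bookings)
    tools: Dict[str, Callable[..., Any]]    # APIs the agent may call
    user_scenario: str                      # instructions for the user simulator
    expected_final_state: Dict[str, Any]    # goal state used for outcome-based grading

example_task = Task(
    domain_policy="Date changes are free within 24 hours of booking; otherwise a change fee applies.",
    database={"reservations": {"R100": {"route": "ORD-MIA", "date": "2024-07-12"}}},
    tools={"get_reservation": lambda reservation_id: {"route": "ORD-MIA", "date": "2024-07-12"}},  # stubbed tool
    user_scenario="You want to move reservation R100 to next Friday.",
    expected_final_state={"reservations": {"R100": {"route": "ORD-MIA", "date": "2024-07-19"}}},
)
```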
Key features of TAU-bench
Narasimhan identifies four key features of the new Sierra standard:
- Realistic dialogue and tool use: Through generative modeling of language, TAU-bench features complex user scenarios produced in natural language rather than relying on hand-written, complex rules.
- Open-ended and diverse tasks: TAU-bench features rich, detailed structures, interfaces and sets of rules, allowing tasks to be created without simple, predefined solutions. This challenges AI agents to handle the diverse situations they may encounter in the real world.
- Faithful objective evaluation: The benchmark doesn’t look at the quality of the conversation. Instead, it evaluates the outcome, meaning the final state after the task has been completed. This gives an objective measure of whether the AI agent successfully achieved the goal of the task, eliminating the need for human judges or additional evaluators (see the sketch after this list).
- Modular framework: Because TAU-bench is built like a set of building blocks, it’s easy to add new elements such as domains, database entries, rules, APIs, tasks and evaluation metrics.
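A minimal sketch of the outcome-based grading idea, assuming the goal is expressed as an expected final database state (a simplification, not necessarily how TAU-bench actually scores tasks):

```python
# Grade an episode by its end state rather than the conversation transcript.
# The deep-equality check here is an assumption made for illustration.
def grade_episode(final_db_state: dict, expected_final_state: dict) -> bool:
    """Return True iff the task's goal state was reached, regardless of how
    the conversation that produced it reads."""
    return final_db_state == expected_final_state
```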
How do models fare on this benchmark?
Sierra tested TAU-bench against 12 popular LLMs from OpenAI, Anthropic (Claude 3.5 Sonnet was not included), Google and Mistral. It found that all of them struggled to solve the tasks. In fact, the top-performing agent, built on OpenAI’s GPT-4o, had a success rate of less than 50 percent across the two domains tested.
In addition, all the agents tested performed “very poorly” on reliability and were “unable to consistently solve the same task when the episode is re-run.”
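One way to make that reliability notion concrete, assuming each task can simply be re-attempted from scratch, is to re-run every task several times and count only tasks solved on every attempt. This is an illustration, not necessarily Sierra’s metric.

```python
# Hypothetical reliability estimate: the fraction of tasks an agent solves
# on all k independent re-runs, rather than its average single-run success.
from typing import Callable, List

def reliability(run_task: Callable[[str], bool], task_ids: List[str], k: int = 4) -> float:
    reliably_solved = sum(
        all(run_task(task_id) for _ in range(k)) for task_id in task_ids
    )
    return reliably_solved / len(task_ids)
```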
All of this leads Narasimhan to conclude that more advanced LLMs are needed to improve reasoning and planning, along with more complex scenarios. He also calls for new methods that make the annotation process easier through automated tools, and for finer-grained evaluation metrics that test other aspects of an agent’s behavior, such as its tone and style.