Hugging Face Blog · AI Agent

EVA-Bench Data 2.0: 3 Domains, 121 Tools, 213 Scenarios

June 4, 2026, 12:24


Published June 4, 2026

By Tara Bogavelli, Gabrielle Gauthier Melancon, Katrina Stankiewicz, Nifemi Bamgbose, Fanny Riols, Hoang Nguyen, Raghav Mehndiratta, Lindsay Brin, Hari Subramani, and Anil Madamala (ServiceNow-AI)

Introduction

Voice agent failures are often highly domain-specific. A system that flawlessly processes alphanumeric confirmation codes in flight re-booking transactions might stumble when handling complex policies in HR systems. Different domains test an agent's ability to adapt to different vocabulary, workflow complexities, and user expectations.

With this release, EVA-Bench expands from one enterprise domain to three: Airline Customer Service Management (CSM), Enterprise IT Service Management (ITSM), and Healthcare HR Service Delivery (HRSD). Together they span 213 evaluation scenarios across 121 tools, a roughly 4x increase in scenario coverage from our original release. Every scenario was validated for solvability against three frontier models (OpenAI GPT-5.4, Google Gemini 3.1 Pro, and Anthropic Claude Opus 4.6), ensuring the benchmark is both challenging and fair.
All three datasets are open-source and available for download:

    from datasets import load_dataset

    # Airline Customer Service Management (CSM): 50 scenarios
    airline = load_dataset("ServiceNow-AI/eva-bench", "airline", split="test")

    # Enterprise IT Service Management (ITSM): 80 scenarios
    itsm = load_dataset("ServiceNow-AI/eva-bench", "itsm", split="test")

    # Healthcare HR Service Delivery (HRSD): 83 scenarios
    hrsd = load_dataset("ServiceNow-AI/eva-bench", "medical", split="test")

EVA-Bench is built for multiple audiences. If you're evaluating a voice agent, you can run it against a diverse set of realistic enterprise scenarios spanning 35+ distinct workflows. If you're building your own evaluation dataset, this post describes our end-to-end generation and validation process in enough detail to serve as a practical reference. We walk through how each domain was designed and generated and take a deep dive into the two new additions. We also preview our upcoming multilingual extension, which widens the benchmark's reach beyond English-only enterprise deployments.

Data Design Principles

Five principles guided the design of the EVA-Bench datasets across all three domains.

Voice-first scope. Not every enterprise workflow belongs in a voice benchmark. We started by identifying which tasks within each domain are handled over the phone in practice, then selected the most common flows from that subset. This kept the scenarios grounded in realistic call patterns.

Realism. Tool schemas were modeled after the kinds of APIs a production platform uses. Scenario policies were drawn from actual enterprise constraints. For the Healthcare HRSD domain, this meant grounding scenarios in actual US healthcare policy and administration systems, including NPI numbers, FMLA, and insurance coverage, so that the benchmark reflects the domain as practitioners encounter it in real life.

Variety. Scaling a dataset by simply repeating identical tasks offers limited evaluation signal.
To avoid this, we defined specific workflows for each domain and sampled across three scenario types: single-intent calls, multi-intent calls with up to four intents in a single conversation, and adversarial calls where callers attempt to bypass troubleshooting steps, misclassify urgency, or access records they are not authorized to view. Within single- and multi-intent scenarios, we also included cases where the user's goal is not satisfiable, because real call volume is not all happy-path, and in our experience models tend to struggle more with unsatisfiable goals than with successful interactions.

Authentication. Prior work (EVA-Bench and τ-Voice) has identified authentication as one of the most consistent failure points for voice agents. Every domain in EVA-Bench includes authentication flows, and the specific mechanisms are calibrated to the task. For example, OTP-based elevation appears where a production system would actually require it, not uniformly across all scenarios.

Reproducibility. Without reproducible scenarios, it is difficult to know whether a score difference reflects a genuine capability gap or an artifact of how the scenario played out. We designed the dataset so that every scenario has exactly one correct resolution path. User goal construction ensures the simulator always has the information and instructions it needs to behave consistently, and scenario generation explicitly checks for and eliminates any cases where multiple valid action sequences could achieve the same outcome.

Scenario Generation

Joint generation. Scenarios are generated using SyGra, a graph-based synthetic data generation pipeline, with GPT-5.4 as the backbone. Each scenario requires three jointly consistent components, which are generated together to prevent the inconsistencies that arise when components are produced independently:

User goal. Reproducibility requires that the user simulator behaves the same way every time a scenario is run.
A vague statement of intent does not achieve this: the simulator will make different judgment calls across runs, producing inconsistent evaluation signals. To eliminate this, the user goal is structured as a decision tree that covers every situation the simulator is likely to encounter. The user goal specifies exactly what the user should ask for, along with a negotiation sequence that specifies exactly when to push back, when to ask for alternatives, and when to accept. Common edge cases, such as whether to accept a standby flight or an alternate airport, are handled with explicit instructions rather than left to the simulator to interpret. The resolution condition requires evidence of a completed action, such as a confirmation number or case ID, rather than a verbal commitment, so the simulator stays on the call until the action is actually confirmed. The result is a user that behaves like a consistent, realistic caller rather than one that improvises.

Initial scenario database. This is the backend state that the agent's tools query and modify during the scenario. It is generated jointly with the user goal so that every entity referenced in the user goal, such as booking IDs, account details, and authentication credentials, exists and is consistent in the database.

Expected final database state (ground truth). We derive the expected outcome by running the generation LLM on the agent instructions, user goal, and initial scenario database, producing a full action trace. As the LLM executes write tool calls, the database is updated incrementally, and the resulting terminal state becomes the ground truth that verifiers check against during evaluation.

Joint generation is essential because these three components are deeply interdependent. Independent generation would introduce silent inconsistencies, such as a case ID referenced in the user goal that does not exist in the scenario database, which would corrupt the evaluation signal entirely.
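The relationship between the three components can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the EVA-Bench implementation: the tool names (`get_booking`, `rebook_flight`), the database layout, and the helpers `apply_trace`, `missing_entities`, and `matches_ground_truth` are all hypothetical, and the action trace is hard-coded rather than produced by an LLM.

```python
import copy

# Hypothetical read tool: returns a record without changing the database.
def get_booking(db, booking_id):
    return db["bookings"][booking_id]

# Hypothetical write tool: mutates the scenario database in place.
def rebook_flight(db, booking_id, new_flight):
    db["bookings"][booking_id]["flight"] = new_flight
    return {"status": "ok", "booking_id": booking_id}

TOOLS = {"get_booking": get_booking, "rebook_flight": rebook_flight}

def apply_trace(initial_db, trace):
    """Replay tool calls on a copy of the initial database; return the terminal state."""
    db = copy.deepcopy(initial_db)
    for call in trace:
        TOOLS[call["tool"]](db, **call["args"])
    return db

def missing_entities(user_goal, db):
    """List booking IDs named in the user goal that are absent from the database."""
    return [b for b in user_goal["booking_ids"] if b not in db["bookings"]]

def matches_ground_truth(final_db, expected_db):
    """Verifier stand-in: the run passes only if the final state equals the ground truth."""
    return final_db == expected_db

# Illustrative scenario data (made up for this sketch).
initial_db = {"bookings": {"ABC123": {"flight": "SN100", "passenger": "J. Doe"}}}
user_goal = {"booking_ids": ["ABC123"], "request": "move to flight SN200"}
trace = [
    {"tool": "get_booking", "args": {"booking_id": "ABC123"}},
    {"tool": "rebook_flight", "args": {"booking_id": "ABC123", "new_flight": "SN200"}},
]

# The terminal state after replaying the trace serves as the expected final database.
expected_final_db = apply_trace(initial_db, trace)
```

Replaying the trace on a deep copy leaves the initial database untouched, so the same scenario can be rerun from an identical starting state. A check in the style of `missing_entities` would surface the kind of silent inconsistency mentioned above, where the goal references an ID the database does not contain.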
To enforce consistency, we run a multi-stage validation loop after each generation attempt and feed any failures back to the generation step, which retries until all checks pass. Validation proceeds in three steps. A structural check validates the scenario database against a Pydantic schema, catching type errors and missing fields. An LLM-based validator checks consistency across the scenario more holistically: whether user-facing details in the goal match the database records, whether cross-references are internally valid, and whether authentication data is correctly configured. LL
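A generate-validate-retry loop of this kind might look roughly like the sketch below. It makes several assumptions: plain Python functions stand in for both the Pydantic schema check and the LLM-based validator, the field names (`user_goal`, `initial_db`, `case_ids`, `cases`) are invented for illustration, and the `max_attempts` cap is an addition of this sketch, since the post only says generation retries until all checks pass.

```python
def structural_check(scenario):
    """Stand-in for the schema check: report missing fields and wrong types."""
    required = {"user_goal": dict, "initial_db": dict}
    errors = []
    for field, expected_type in required.items():
        if field not in scenario:
            errors.append(f"missing field: {field}")
        elif not isinstance(scenario[field], expected_type):
            errors.append(f"wrong type for field: {field}")
    return errors

def consistency_check(scenario):
    """Stand-in for the LLM-based validator: goal case IDs must exist in the database."""
    known = scenario["initial_db"].get("cases", {})
    referenced = scenario["user_goal"].get("case_ids", [])
    return [f"unknown case ID: {c}" for c in referenced if c not in known]

def generate_until_valid(generate, max_attempts=5):
    """Call generate(feedback) repeatedly, passing back the previous failures."""
    feedback = []
    for attempt in range(1, max_attempts + 1):
        scenario = generate(feedback)
        feedback = structural_check(scenario)
        if not feedback:
            # Run the holistic check only once the structure is sound.
            feedback = consistency_check(scenario)
        if not feedback:
            return scenario, attempt
    raise RuntimeError(f"no valid scenario after {max_attempts} attempts: {feedback}")

def toy_generate(feedback):
    """Toy generator: its first draft omits case CS002, and it repairs this once told."""
    cases = {"CS001": {"status": "open"}}
    if feedback:
        cases["CS002"] = {"status": "open"}
    return {"user_goal": {"case_ids": ["CS002"]}, "initial_db": {"cases": cases}}

scenario, attempts = generate_until_valid(toy_generate)
```

Running the cheap structural check before the holistic one means malformed scenarios never reach the more expensive validator, and the error messages returned by either check double as the feedback passed to the next generation attempt.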
