A diner sends a single-turn request to book a standard dinner reservation with every required field already included: a specific future date within the 30-day booking window, an exact time during dinner service, a party size of 2, full guest name, phone number, and no special requests. The scenario should be constructed so the exact requested slot is available. This is a baseline happy-path test for correct intent recognition as a new reservation, proper use of availability search before creation, and successful creation without unnecessary follow-up questions. The agent should be evaluated on whether it searches availability first, creates the reservation with the same date and time requested, and ends with a clear confirmation that restates the final date, time, party size, booking name, and reservation ID.
Background
EvalFactory: Evaluation-First Agent Development
Agent changes are non-local. A prompt update that improves one behavior can silently break another — tool use, state transitions, or task completion. Because LLM behavior is context-sensitive and partly opaque, isolated unit tests cannot validate an agent system the way they validate traditional software.
The answer is familiar from ML: evaluation-first. The task is defined by an eval and a bar. Progress is measured by test cases and metrics. The north star is an evaluation that proxies real-world impact — user satisfaction, retention, revenue — and that evaluation guides development.
Generate
Spec Generation: From Product Context to Evaluation Contract
The process starts by turning a short product description into a complete evaluation spec, a contract that spells out your domain rules, what success and failure look like, and what tools the system needs. If you haven't defined those tools, the system plans them for you.
You can start with as little as a single sentence ("I need an evaluation for a reservation-handling agent for my restaurant") or bring a full business context with files and tool definitions. Either way, the system meets you where you are. Because the spec is written in plain markdown, much like a PM spec, you can review and refine it before moving forward.
I need an evaluation for a reservation-handling agent for my restaurant
Build an eval for our reservation agent. The tools and business rules are in the attached files. Focus on policy enforcement and multi-turn correctness.
Generate
Dataset Generation: 100 Dynamic Task Scenarios from the Spec
Rather than testing against a fixed list of questions and answers, EvalFactory generates dynamic scenarios: realistic, interactive tasks that play out in real time. This mirrors a broader industry shift toward interactive evaluation and produces evaluations that are both more realistic and more challenging.
Each scenario is a detailed test case with a setup and a goal. An auditor and an environment simulator then work together to carry out the task and assess the agent, much like a mystery shopper testing a real service.
A diner asks for a family dinner reservation for 4 on a specific Tuesday evening at 18:30, providing full name and phone number up front and also explicitly requesting patio seating. The environment should make the patio slot available at the exact requested time. This tests a straightforward reservation where an optional seating area is included and can be honored. The agent should recognize that patio is a seating-area parameter worth checking during availability search, avoid treating it as a vague note, and then create the reservation in the patio area if available. The final response should clearly confirm the reservation details including the seating area, without adding extra questions or changing the time.
A user requests a brunch reservation for 3 on an upcoming Saturday at 11:00, providing complete information in one message: full name, phone number, and a note that one guest will need a high chair. The environment should have the requested brunch slot available. This is a good test because it combines service selection implied by time and day with a non-guaranteed special request that should be stored as a note rather than turned into a separate booking constraint. The agent should search availability for the valid brunch service, create the reservation, record the high-chair request as a special request, and confirm that the note was added without overpromising anything beyond the reservation itself.
A diner asks to reserve dinner for 6 on a Friday at 20:00, provides exact date, full name, phone number, and mentions it is a birthday celebration. The requested slot should be available in the dining room. This scenario tests clean handling of a common special occasion in a happy-path flow. The agent should interpret the birthday mention as a note rather than a guaranteed experience, perform availability search, and create the reservation successfully. Evaluation should check that the agent does not promise a special table or complimentary service unless supported, and that the final confirmation includes that the birthday note was added along with the reservation’s core details.
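A scenario like the ones above pairs an environment setup with a user goal. The sketch below shows one way such a test case might be structured; the class and field names are illustrative assumptions, not EvalFactory's actual schema.

```python
from dataclasses import dataclass, field

# Illustrative scenario record; field names are assumptions for this sketch,
# not EvalFactory's real data model.
@dataclass
class Scenario:
    setup: str                      # what the simulated environment should contain
    goal: str                       # what the simulated user is trying to accomplish
    environment_state: dict = field(default_factory=dict)
    evaluation_notes: list = field(default_factory=list)

# The patio-dinner scenario above, expressed in this shape:
patio_dinner = Scenario(
    setup="Patio slot open at the exact requested Tuesday 18:30 time",
    goal="Book dinner for 4 on the patio, confirming the seating area",
    environment_state={"tuesday 18:30": {"patio": "available"}},
    evaluation_notes=["no extra questions", "do not change the time"],
)
```

Keeping the environment state inside the scenario is what makes these cases dynamic: the simulator can answer tool calls from it rather than from a fixed answer key.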
Generate
Evaluator Generation: Adaptive Multi-Turn Grader with Learnable Rubrics
Instead of one-size-fits-all grading, EvalFactory generates scoring rubrics tailored to your domain. They work for both single-exchange and multi-turn conversations.
Each rubric is modular and learnable: it can be individually analyzed, calibrated, and improved over time using real-world signals. Rubrics are also easy to customize: you can add your own criteria or adjust how much weight each one carries in the final score.
- Weight 9: Correctly identifies the user's reservation intent (book, modify, cancel, confirm, inquire) and pursues the appropriate workflow without taking unrelated actions.
- Weight 10: Requests all required missing information before acting, such as date, time, party size, guest name, phone number, or reservation identifier when needed, while avoiding unnecessary clarification.
- Weight 10: Interprets dates, times, and relative or approximate temporal expressions accurately, and asks for clarification instead of making material assumptions when the request is ambiguous.
- Weight 8: Validates inputs and rejects or clarifies logically invalid requests, including impossible dates, past dates, zero diners, contradictory changes, or unsupported party sizes.
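A weighted rubric catalog like the one above can be represented as simple named entries. This is a sketch under assumed names, not EvalFactory's API; it shows the two customization paths mentioned above, re-weighting a criterion and adding a new one.

```python
# Illustrative weighted rubric catalog; weights mirror the examples above.
# All identifiers here are assumptions made for this sketch.
RUBRICS = {
    "intent":     (9,  "Correctly identifies the user's reservation intent"),
    "slot_fill":  (10, "Requests all required missing information before acting"),
    "temporal":   (10, "Interprets dates, times, and relative expressions accurately"),
    "validation": (8,  "Validates inputs and rejects logically invalid requests"),
}

def customize(rubrics, name, weight=None, description=None):
    """Adjust a criterion's weight or wording, or add a brand-new criterion."""
    old_w, old_d = rubrics.get(name, (0, ""))
    rubrics[name] = (weight if weight is not None else old_w,
                     description if description is not None else old_d)

customize(RUBRICS, "validation", weight=10)                          # re-weight
customize(RUBRICS, "tone", 5, "Stays warm and concise with guests")  # add your own
```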
Evaluate
Multi-Turn Inference: The Actor Meets the Environment
EvalFactory runs the actor against every test case in a simulated multi-turn environment. The auditor plays the user, the environment responds to tool calls, and the actor navigates the conversation — searching, booking, recovering from errors — just like it would in production.
Each run produces a complete conversation trace per case. Below is one example: a race-condition scenario where the requested slot disappears between search and booking. The actor must recover gracefully without fabricating a confirmation.
The diner wants a new reservation for 2 people tomorrow around 20:00 and provides all required contact details after one clarification turn. The agent searches availability and finds the exact 20:00 slot open, perhaps with a dining room option and nearby alternatives. However, by the time the agent calls create_reservation, the environment reports that the requested slot is no longer available and returns alternative nearby times such as 19:45 and 20:30. This is a good test of race-condition handling between search and create, which is a realistic reservation-system failure mode. The correct behavior is for the agent not to insist the original slot is confirmed, not to blame the system in technical terms, and not to silently book one of the alternatives without permission. Instead, it should tell the diner that the 20:00 slot just became unavailable and offer the returned alternatives, keeping the conversation smooth and action-oriented. Evaluation should verify that the agent accurately reflects the updated availability, asks the diner to choose among alternatives, and only creates a reservation after explicit user selection.
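The actor/auditor/environment loop driving a run like this can be sketched in a few lines. Here the three components are plain callables standing in for the real LLM-driven parts, and the toy script replays the race-condition recovery described above; all names are illustrative.

```python
# Minimal sketch of the simulated multi-turn loop. The actor, auditor, and
# environment are plain callables here; the real components are LLM-driven.
def run_episode(actor_step, auditor_reply, env_execute, opening, max_turns=20):
    """Run one simulated conversation and return the full trace."""
    trace = [("user", opening)]
    for _ in range(max_turns):
        kind, payload = actor_step(trace)          # actor sees the whole trace
        trace.append((kind, payload))
        if kind == "tool_call":
            trace.append(("tool_result", env_execute(payload)))
        else:                                      # plain assistant message
            reply = auditor_reply(trace)           # auditor plays the diner
            if reply is None:                      # auditor decides the task is done
                break
            trace.append(("user", reply))
    return trace

# Toy scripted run of the race-condition recovery described above.
script = iter([("tool_call", {"tool": "search_availability", "time": "20:00"}),
               ("message", "20:00 just filled up. 19:45 or 20:30 instead?"),
               ("tool_call", {"tool": "create_reservation", "time": "20:30"}),
               ("message", "You're booked for 20:30.")])
replies = iter(["20:30 works.", None])
trace = run_episode(lambda t: next(script), lambda t: next(replies),
                    lambda call: {"status": "ok"},
                    "Table for 2 tomorrow at 20:00, please.")
```

The full trace is what the grader later scores: every user turn, assistant turn, tool call, and tool result in order.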
Evaluate
Rubric-Based Grading: Per-Criterion Judgment on Every Case
After inference, the grader scores the full conversation trajectory against each rubric criterion. Every criterion gets an applicability gate (0–3) — if a rubric doesn't apply to a case, it contributes nothing to the score. The result is a detailed scorecard, not just a pass/fail count.
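The gated aggregation can be sketched as follows: each judgment carries an applicability gate (0–3) and a quality score, and gated-out criteria simply drop out of the weighted average. The function name and score scale are assumptions for this sketch, not the grader's actual implementation.

```python
# Sketch of applicability-gated score aggregation (illustrative, not the
# real grader). Each judgment is (weight, applicability 0-3, score 0-5).
def scorecard(judgments):
    applicable = [(w, s) for w, gate, s in judgments if gate > 0]
    if not applicable:
        return None                                # nothing to grade on this case
    total_weight = sum(w for w, _ in applicable)
    return sum(w * s for w, s in applicable) / (5 * total_weight)

# A non-applicable criterion (gate 0) contributes nothing to the score:
case = [(10, 3, 5.0),   # fully applicable, perfect
        (8,  0, 1.0),   # not applicable -> ignored entirely
        (9,  2, 4.0)]   # applicable, slightly imperfect
```

With these numbers the case scores (10·5 + 9·4) / (5·19) ≈ 0.905; the gated-out criterion's low score never touches the result.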
Below is the judgment for the same race-condition case. All 16 rubric criteria are scored, with the grader's reasoning explaining how the actor handled the slot rejection and recovery.
Improve
Statistical Diagnosis: Rubric Health and Flagged Cases
Every evaluation run generates rich statistical signals. Reliability analysis separates actor variance from grader variance. Rubric-level diagnostics surface criteria that are too noisy, too lenient, or unclear in scope. Case-level analysis identifies which scenarios are fragile.
EvalFactory distills these signals into a prioritized review brief — typically reducing a day of manual annotation to a few minutes of targeted review.
Rubric Health
Each of the 16 rubrics is analyzed for scope stability, redundancy, saturation, and signal strength. High-priority rubrics need attention first.
| Criterion | Wt | Health | Cases | Appl % | Instab % | Mean | Std | Low % | Top % | Max Corr |
|---|---|---|---|---|---|---|---|---|---|---|
| Validates inputs and rejects or clarifies logically invalid requests, including impossible dates, past dates, zero diners, contradictory changes, or unsupported party sizes. (High priority) | 8 | Redundant, Unstable Scope | 53/100 | 70% | 43% | 4.37 | 0.33 | 0.2% | 45% | 0.75 (4b5d54de) |
| Enforces restaurant policies correctly, including service hours, closed days, booking window, last reservable slot, party-size limits, seating rules, and special handling requirements for large groups. (High priority) | 9 | Unstable Scope | 27/97 | 40% | 32% | 4.61 | 0.50 | — | 72% | 0.64 (a2f6aecd) |
| Handles special requests and seating preferences appropriately by recording them as notes or preferences unless the environment explicitly confirms them as guaranteed bookable attributes. (High priority) | 8 | Redundant, Unstable Scope | 76/98 | 82% | 27% | 4.64 | 0.38 | 0.3% | 76% | 0.87 (7efa3fa2) |
| Recovers reasonably from tool or system errors by retrying or asking for clarifying information when appropriate, without exposing internal implementation details to the user. (High priority) | 7 | Redundant, Unstable Scope, Weak Signal | 16/100 | 18% | 43% | 4.59 | 0.23 | 2.1% | 77% | 0.98 (0ce410aa) |
| For unavailable requests, offers helpful and plausible nearby alternatives when appropriate rather than stopping prematurely or implying no options exist without checking. | 7 | Redundant | 40/100 | 43% | 12% | 4.60 | 0.29 | 0.8% | 74% | 0.98 (7efa3fa2) |
| Verifies reservation identity sufficiently before modifying or canceling, especially when multiple matches exist, and does not act on partial or ambiguous matches. | 10 | Redundant | 31/97 | 33% | 8% | 4.50 | 0.49 | — | 68% | 0.97 (7efa3fa2) |
| Other important quality factors not already covered by the listed rubrics. | 5 | Always On, Redundant | 100/100 | 100% | — | 4.50 | 0.32 | 0.4% | 66% | 0.85 (0ce410aa) |
| Handles tool responses accurately, including availability results, policy rejections, lookup ambiguity, duplicate warnings, and errors, without fabricating outcomes. | 9 | Always On, Redundant | 100/100 | 100% | — | 4.69 | 0.26 | 0.9% | 82% | 0.89 (4b5d54de) |
| Maintains operational correctness across state changes, ensuring bookings, modifications, and cancellations align with environment results. | 9 | Always On, Redundant, Saturated | 99/100 | 98% | — | 4.81 | 0.17 | 0.5% | 89% | 0.89 (91547f97) |
| Communicates outcomes honestly and clearly, avoiding premature success language and explaining constraints clearly. | 8 | Always On, Redundant | 100/100 | 100% | — | 4.67 | 0.24 | 0.5% | 81% | 0.86 (a2f6aecd) |
| Uses the correct tool for the task and supplies complete, correctly formatted arguments. | 10 | Always On, Redundant | 100/100 | 100% | — | 4.68 | 0.25 | — | 80% | 0.81 (__sink__) |
| Checks availability or reservation state before confirming any booking or modification. | 10 | Always On, Redundant, Saturated | 100/100 | 100% | — | 4.84 | 0.16 | 0.1% | 91% | 0.85 (96aa435b) |
| Requests all required missing information before acting, while avoiding unnecessary clarification. | 10 | Always On, Redundant, Saturated | 100/100 | 100% | — | 4.77 | 0.23 | — | 87% | 0.83 (f5e8ab3b) |
| Correctly identifies the user's reservation intent and pursues the appropriate workflow without taking unrelated actions. | 9 | Always On, Redundant, Saturated | 100/100 | 100% | — | 4.90 | 0.11 | — | 94% | 0.83 (db089f53) |
| Provides a clear final summary after successful booking, modification, or cancellation with key details. | 9 | Always On, Redundant, Saturated | 94/100 | 94% | — | 4.96 | 0.03 | — | 97% | 0.86 (a6034eea) |
| Interprets dates, times, and relative temporal expressions accurately, asking for clarification when ambiguous. | 10 | Always On | 96/100 | 96% | — | 4.74 | 0.28 | 0.3% | 84% | 0.77 (2171c1ba) |
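Health flags like those in the table can be thought of as threshold checks on the per-rubric statistics. The sketch below uses illustrative thresholds chosen for this example, not EvalFactory's actual cutoffs.

```python
# Sketch of deriving health flags from per-rubric statistics.
# All thresholds here are illustrative assumptions.
def health_flags(appl_pct, instab_pct, mean, std, max_corr):
    flags = []
    if appl_pct >= 95:
        flags.append("Always On")        # applies to essentially every case
    if max_corr >= 0.80:
        flags.append("Redundant")        # strongly correlated with another rubric
    if instab_pct is not None and instab_pct >= 25:
        flags.append("Unstable Scope")   # graders disagree on when it applies
    if mean >= 4.75 and std <= 0.25:
        flags.append("Saturated")        # nearly everyone passes; little signal left
    if appl_pct <= 20:
        flags.append("Weak Signal")      # too few applicable cases to trust
    return flags
```

With these thresholds, the error-recovery row (18% applicable, 43% instability, mean 4.59, std 0.23, max correlation 0.98) yields Redundant, Unstable Scope, and Weak Signal.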
Review Brief
The system prioritizes what to fix first. Review the rubric queue, then the case queue.
Improve
Evaluator Improvement: Sharper Rubrics from Statistical Signals
Using the diagnosis, EvalFactory rewrites rubric descriptions with explicit applicability guidance — when to apply each criterion and when not to. This reduces grader disagreement on scope without changing what the rubrics measure.
Seven of fifteen rubrics were rewritten. Each now includes concrete Apply / Do-not-apply rules that tell the grader exactly which turns and scenarios trigger the criterion.
Before: Validates inputs and rejects or clarifies logically invalid requests, including impossible dates, past dates, zero diners, contradictory changes, or unsupported party sizes.
After: Validates and clarifies requests whose input is itself logically invalid, contradictory, or unsupported before any action is taken, including impossible dates, past dates, zero diners, negative party sizes, contradictory modification instructions, or party sizes that exceed the supported online channel. Apply this rubric only when the user's request contains an invalid or contradictory input that must be corrected or explicitly refused. Do not apply it for ordinary unavailability or for policy explanations after an otherwise valid request.

Before: Enforces restaurant policies correctly, including service hours, closed days, booking window, last reservable slot, party-size limits, seating rules, and special handling requirements for large groups.
After: Enforces restaurant policies correctly when the request actually touches a policy boundary, including service hours, closed days, booking window, last reservable slot, party-size limits, seating rules, and special handling requirements for large groups. Apply this rubric when the assistant must interpret or explain a restaurant rule to decide what is allowed. Do not apply it to ordinary happy-path turns where no policy boundary is implicated.

Before: Handles special requests and seating preferences appropriately by recording them as notes or preferences unless the environment explicitly confirms them as guaranteed bookable attributes.
After: Handles special requests, accessibility notes, and seating preferences appropriately by recording them as notes or preferences unless the environment explicitly confirms them as guaranteed bookable attributes. Apply this rubric only when the user actually asks for a seating preference, accessibility accommodation, or special request note. Do not apply it on ordinary turns with no such request.

Before: Recovers reasonably from tool or system errors by retrying or asking for clarifying information when appropriate, without exposing internal implementation details to the user.
After: Recovers reasonably from actual tool or system errors by retrying, asking for clarifying information, or explaining the interruption appropriately, without exposing internal implementation details to the user. Apply this rubric only when a real tool error, malformed tool result, transient system failure, or equivalent operational error occurs. Do not apply it to ordinary slot unavailability, policy refusals, or routine lookup ambiguity.

Before: For unavailable requests, offers helpful and plausible nearby alternatives when appropriate rather than stopping prematurely or implying no options exist without checking.
After: For a valid requested slot that is unavailable after checking, offers helpful and plausible nearby alternatives when such alternatives actually exist and are appropriate. Apply this rubric only when the exact request is unavailable but the assistant could reasonably surface nearby alternatives without changing the booking on the guest's behalf. Do not apply it to hard policy refusals, invalid-input clarifications, or tool/system error recovery.

Before: Handles tool responses accurately, including availability results, policy rejections, lookup ambiguity, duplicate warnings, and errors, without fabricating outcomes.
After: Handles actual tool responses accurately, including availability results, policy rejections returned by tools, lookup ambiguity, duplicate warnings, malformed results, and tool errors, without fabricating outcomes. Apply this rubric only on turns where the assistant is interpreting or responding to a concrete tool result or tool error. Do not apply it on turns with no relevant tool output yet.

Before: Verifies reservation identity sufficiently before modifying or canceling, especially when multiple matches exist, and does not act on partial or ambiguous matches.
After: Verifies reservation identity sufficiently before modifying or canceling, especially when multiple matches exist, and does not act on partial or ambiguous matches. Apply this rubric only in lookup, modification, or cancellation flows where the assistant must establish which existing reservation is being targeted. Do not apply it to new-booking flows or to generic tool-error handling.
Improve
Agent Improvement: Targeted Prompt Changes and Cross-Eval Results
EvalFactory also generates targeted actor prompt improvements from the same diagnosis. Each new rule addresses a specific instability pattern — ambiguous requests, silent substitutions, or fabricated confirmations.
Both actor versions were tested on the improved evaluation with 4 runs and 4 grading passes each. Consistency improved substantially while overall quality held steady.
Targeted Prompt Additions
Six new behavior rules were added to the actor prompt, each addressing a specific instability pattern surfaced by the diagnosis.
Never silently change the requested date, time, party size, or seating area to a nearby valid option. Offer alternatives, but wait for the guest to choose one explicitly.
Fixes: Actor sometimes auto-substituted alternatives without asking, causing rubric disagreement on intent handling.
Never convert a same-day request on a closed day into the next open day unless the guest explicitly asks for that fallback.
Fixes: Closed-day requests triggered inconsistent behavior — some runs silently moved to the next day.
When the request is contradictory or materially ambiguous, stop and clarify before searching, creating, modifying, or canceling.
Fixes: Ambiguous requests (like 'book us for 12 tonight') produced different interpretations across runs.
If the exact requested slot is unavailable, do not create or modify a different slot until the guest explicitly approves the alternative.
Fixes: Race-condition and unavailability scenarios showed the actor sometimes proceeding without consent.
If a modification request exceeds supported policy limits, explain that the change cannot be completed and state the existing reservation remains unchanged.
Fixes: Large party modifications (e.g. 6→14) caused the actor to invent manual coordination steps.
In final summaries, copy factual details only from the successful tool result. Do not invent or alter names, phone numbers, reservation IDs, dates, or times.
Fixes: Occasional hallucination of reservation details in final summaries — especially reservation IDs.
Cross-Evaluation Results: v1 vs v2 Actor
| Metric | v1 Actor | v2 Actor | Change |
|---|---|---|---|
| Actor consistency (ICC) | 0.32 | 0.45 | +39% |
| Actor score std dev | 0.064 | 0.054 | -15% |
| Actor-unstable cases | 48 | 31 | -35% |
| Mean score | 0.933 | 0.936 | +0.3% |
| Pass rate | 97.6% | 97.5% | -0.1% |
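The actor-consistency metric above is an intraclass correlation over repeated runs: how much of the score variance is explained by the case rather than by run-to-run noise. A one-way ICC(1) estimate can be sketched with the standard mean-squares formula; the input format is an assumption for this example.

```python
# Sketch of a one-way random-effects ICC(1) over repeated runs.
# scores maps case id -> list of that case's scores across runs.
def icc1(scores):
    cases = list(scores)
    n = len(cases)                                  # number of cases
    k = len(scores[cases[0]])                       # runs per case (assumed equal)
    grand = sum(sum(runs) for runs in scores.values()) / (n * k)
    case_means = {c: sum(scores[c]) / k for c in cases}
    # Between-case and within-case mean squares:
    msb = k * sum((m - grand) ** 2 for m in case_means.values()) / (n - 1)
    msw = sum((x - case_means[c]) ** 2
              for c in cases for x in scores[c]) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)
```

Perfectly repeatable runs give an ICC of 1.0; scores dominated by run-to-run noise push it toward zero (or below), which is why raising ICC from 0.32 to 0.45 is a meaningful consistency gain.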
Calibrate
Continuous Calibration: Human Feedback and Online Signals
Offline improvement gets you most of the way. Calibration closes the remaining gap by teaching the evaluator where its boundary disagrees with reality.
Human-anchor calibration uses a small set of labeled examples to correct false passes and false fails — tuning the verdict boundary without rewriting the rubric catalog. Online signal integration feeds business metrics like user satisfaction, retention, or revenue back into the evaluator, so the eval stays aligned with what actually matters in production. Raw signals are cleaned and case-linked before they adjust anything — curation before calibration.
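Tuning the verdict boundary against a handful of labeled anchors can be sketched as a one-dimensional search: pick the pass threshold that minimizes disagreement (false passes plus false fails) with the human labels. The function and data shapes below are assumptions for this sketch, not the actual calibration mechanism.

```python
# Sketch of human-anchor threshold calibration (illustrative only).
# anchors: list of (eval_score, human_says_pass) pairs.
def calibrate_threshold(anchors, candidates=None):
    if candidates is None:
        candidates = sorted({score for score, _ in anchors})
    def disagreements(t):
        false_pass = sum(1 for s, ok in anchors if s >= t and not ok)
        false_fail = sum(1 for s, ok in anchors if s < t and ok)
        return false_pass + false_fail
    return min(candidates, key=disagreements)       # boundary with fewest errors

anchors = [(0.95, True), (0.90, True), (0.88, False), (0.80, False)]
threshold = calibrate_threshold(anchors)            # 0.90 separates these cleanly
```

Because only the verdict boundary moves, the rubric catalog itself stays untouched, which is exactly the point: calibration corrects where the evaluator disagrees with reality without rewriting what it measures.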