A diner sends a single-turn request to book a standard dinner reservation with every required field already included: a specific future date within the 30-day booking window, an exact time during dinner service, a party size of 2, full guest name, phone number, and no special requests. The scenario should be constructed so the exact requested slot is available. This is a baseline happy-path test for correct intent recognition as a new reservation, proper use of availability search before creation, and successful creation without unnecessary follow-up questions. The agent should be evaluated on whether it searches availability first, creates the reservation with the same date and time requested, and ends with a clear confirmation that restates the final date, time, party size, booking name, and reservation ID.
Background
EvalFactory: Evaluation-First Agent Development
Agent changes are non-local. A prompt update that improves one behavior can silently break another — tool use, state transitions, or task completion. Because LLM behavior is context-sensitive and partly opaque, isolated unit tests cannot validate an agent system the way they validate traditional software.
The answer is familiar from ML: evaluation-first. The task is defined by an eval and a bar. Progress is measured by test cases and metrics. The north star is an evaluation that proxies real-world impact — user satisfaction, retention, revenue — and that evaluation guides development.
Generate
Spec Generation: From Product Context to Evaluation Contract
The process starts by turning a short product description into a complete evaluation spec, a contract that spells out your domain rules, what success and failure look like, and what tools the system needs. If you haven't defined those tools, the system plans them for you.
You can start with as little as a single sentence ("I need an evaluation for a reservation-handling agent for my restaurant") or bring a full business context with files and tool definitions. Either way, the system meets you where you are. Because the spec is written in plain markdown, much like a PM spec, you can review and refine it before moving forward.
I need an evaluation for a reservation-handling agent for my restaurant
Build an eval for our reservation agent. The tools and business rules are in the attached files. Focus on policy enforcement and multi-turn correctness.
Generate
Dataset Generation: 100 Dynamic Task Scenarios from the Spec
Rather than testing against a fixed list of questions and answers, EvalFactory generates dynamic scenarios: realistic, interactive tasks that play out in real time. This mirrors a broader industry shift toward interactive evaluation and produces evaluations that are both more realistic and more challenging.
Each scenario is a detailed test case with a setup and a goal. An auditor and an environment simulator then work together to carry out the task and assess the agent, much like a mystery shopper testing a real service.
A diner asks for a family dinner reservation for 4 on a specific Tuesday evening at 18:30, providing full name and phone number up front and also explicitly requesting patio seating. The environment should make the patio slot available at the exact requested time. This tests a straightforward reservation where an optional seating area is included and can be honored. The agent should recognize that patio is a seating-area parameter worth checking during availability search, avoid treating it as a vague note, and then create the reservation in the patio area if available. The final response should clearly confirm the reservation details including the seating area, without adding extra questions or changing the time.
A user requests a brunch reservation for 3 on an upcoming Saturday at 11:00, providing complete information in one message: full name, phone number, and a note that one guest will need a high chair. The environment should have the requested brunch slot available. This is a good test because it combines service selection implied by time and day with a non-guaranteed special request that should be stored as a note rather than turned into a separate booking constraint. The agent should search availability for the valid brunch service, create the reservation, record the high-chair request as a special request, and confirm that the note was added without overpromising anything beyond the reservation itself.
A diner asks to reserve dinner for 6 on a Friday at 20:00, provides exact date, full name, phone number, and mentions it is a birthday celebration. The requested slot should be available in the dining room. This scenario tests clean handling of a common special occasion in a happy-path flow. The agent should interpret the birthday mention as a note rather than a guaranteed experience, perform availability search, and create the reservation successfully. Evaluation should check that the agent does not promise a special table or complimentary service unless supported, and that the final confirmation includes that the birthday note was added along with the reservation’s core details.
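A scenario like the ones above pairs an environment setup with a user goal. The sketch below shows one way such a test case might be structured; the class and field names are illustrative assumptions, not EvalFactory's actual schema.

```python
from dataclasses import dataclass, field

# Illustrative scenario record; field names are assumptions for this sketch,
# not EvalFactory's real data model.
@dataclass
class Scenario:
    setup: str                      # what the simulated environment should contain
    goal: str                       # what the simulated user is trying to accomplish
    environment_state: dict = field(default_factory=dict)
    evaluation_notes: list = field(default_factory=list)

# The patio-dinner scenario above, expressed in this shape:
patio_dinner = Scenario(
    setup="Patio slot open at the exact requested Tuesday 18:30 time",
    goal="Book dinner for 4 on the patio, confirming the seating area",
    environment_state={"tuesday 18:30": {"patio": "available"}},
    evaluation_notes=["no extra questions", "do not change the time"],
)
```

Keeping the environment state inside the scenario is what makes these cases dynamic: the simulator can answer tool calls from it rather than from a fixed answer key.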
Generate
Evaluator Generation: Adaptive Multi-Turn Grader with Learnable Rubrics
Instead of one-size-fits-all grading, EvalFactory generates scoring rubrics tailored to your domain. They work for both single-exchange and multi-turn conversations.
Each rubric is modular and learnable: it can be individually analyzed, calibrated, and improved over time using real-world signals. Rubrics are also easy to customize: you can add your own criteria or adjust how much weight each one carries in the final score.
- Weight 9: Correctly identifies the user's reservation intent (book, modify, cancel, confirm, inquire) and pursues the appropriate workflow without taking unrelated actions.
- Weight 10: Requests all required missing information before acting, such as date, time, party size, guest name, phone number, or reservation identifier when needed, while avoiding unnecessary clarification.
- Weight 10: Interprets dates, times, and relative or approximate temporal expressions accurately, and asks for clarification instead of making material assumptions when the request is ambiguous.
- Weight 8: Validates inputs and rejects or clarifies logically invalid requests, including impossible dates, past dates, zero diners, contradictory changes, or unsupported party sizes.
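A weighted rubric catalog like the one above can be represented as simple named entries. This is a sketch under assumed names, not EvalFactory's API; it shows the two customization paths mentioned above, re-weighting a criterion and adding a new one.

```python
# Illustrative weighted rubric catalog; weights mirror the examples above.
# All identifiers here are assumptions made for this sketch.
RUBRICS = {
    "intent":     (9,  "Correctly identifies the user's reservation intent"),
    "slot_fill":  (10, "Requests all required missing information before acting"),
    "temporal":   (10, "Interprets dates, times, and relative expressions accurately"),
    "validation": (8,  "Validates inputs and rejects logically invalid requests"),
}

def customize(rubrics, name, weight=None, description=None):
    """Adjust a criterion's weight or wording, or add a brand-new criterion."""
    old_w, old_d = rubrics.get(name, (0, ""))
    rubrics[name] = (weight if weight is not None else old_w,
                     description if description is not None else old_d)

customize(RUBRICS, "validation", weight=10)                          # re-weight
customize(RUBRICS, "tone", 5, "Stays warm and concise with guests")  # add your own
```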
Evaluate
Multi-Turn Inference: The Actor Meets the Environment
EvalFactory runs the actor against every test case in a simulated multi-turn environment. The auditor plays the user, the environment responds to tool calls, and the actor navigates the conversation — searching, booking, recovering from errors — just like it would in production.
Each run produces a complete conversation trace per case. Below is one example: a race-condition scenario where the requested slot disappears between search and booking. The actor must recover gracefully without fabricating a confirmation.
The diner wants a new reservation for 2 people tomorrow around 20:00 and provides all required contact details after one clarification turn. The agent searches availability and finds the exact 20:00 slot open, perhaps with a dining room option and nearby alternatives. However, by the time the agent calls create_reservation, the environment reports that the requested slot is no longer available and returns alternative nearby times such as 19:45 and 20:30. This is a good test of race-condition handling between search and create, which is a realistic reservation-system failure mode. The correct behavior is for the agent not to insist the original slot is confirmed, not to blame the system in technical terms, and not to silently book one of the alternatives without permission. Instead, it should tell the diner that the 20:00 slot just became unavailable and offer the returned alternatives, keeping the conversation smooth and action-oriented. Evaluation should verify that the agent accurately reflects the updated availability, asks the diner to choose among alternatives, and only creates a reservation after explicit user selection.
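The actor/auditor/environment loop driving a run like this can be sketched in a few lines. Here the three components are plain callables standing in for the real LLM-driven parts, and the toy script replays the race-condition recovery described above; all names are illustrative.

```python
# Minimal sketch of the simulated multi-turn loop. The actor, auditor, and
# environment are plain callables here; the real components are LLM-driven.
def run_episode(actor_step, auditor_reply, env_execute, opening, max_turns=20):
    """Run one simulated conversation and return the full trace."""
    trace = [("user", opening)]
    for _ in range(max_turns):
        kind, payload = actor_step(trace)          # actor sees the whole trace
        trace.append((kind, payload))
        if kind == "tool_call":
            trace.append(("tool_result", env_execute(payload)))
        else:                                      # plain assistant message
            reply = auditor_reply(trace)           # auditor plays the diner
            if reply is None:                      # auditor decides the task is done
                break
            trace.append(("user", reply))
    return trace

# Toy scripted run of the race-condition recovery described above.
script = iter([("tool_call", {"tool": "search_availability", "time": "20:00"}),
               ("message", "20:00 just filled up. 19:45 or 20:30 instead?"),
               ("tool_call", {"tool": "create_reservation", "time": "20:30"}),
               ("message", "You're booked for 20:30.")])
replies = iter(["20:30 works.", None])
trace = run_episode(lambda t: next(script), lambda t: next(replies),
                    lambda call: {"status": "ok"},
                    "Table for 2 tomorrow at 20:00, please.")
```

The full trace is what the grader later scores: every user turn, assistant turn, tool call, and tool result in order.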
Evaluate
Rubric-Based Grading: Per-Criterion Judgment on Every Case
After inference, the grader scores the full conversation trajectory against each rubric criterion. Every criterion gets an applicability gate (0–3) — if a rubric doesn't apply to a case, it contributes nothing to the score. The result is a detailed scorecard, not just a pass/fail count.
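The gated aggregation can be sketched as follows: each judgment carries an applicability gate (0–3) and a quality score, and gated-out criteria simply drop out of the weighted average. The function name and score scale are assumptions for this sketch, not the grader's actual implementation.

```python
# Sketch of applicability-gated score aggregation (illustrative, not the
# real grader). Each judgment is (weight, applicability 0-3, score 0-5).
def scorecard(judgments):
    applicable = [(w, s) for w, gate, s in judgments if gate > 0]
    if not applicable:
        return None                                # nothing to grade on this case
    total_weight = sum(w for w, _ in applicable)
    return sum(w * s for w, s in applicable) / (5 * total_weight)

# A non-applicable criterion (gate 0) contributes nothing to the score:
case = [(10, 3, 5.0),   # fully applicable, perfect
        (8,  0, 1.0),   # not applicable -> ignored entirely
        (9,  2, 4.0)]   # applicable, slightly imperfect
```

With these numbers the case scores (10·5 + 9·4) / (5·19) ≈ 0.905; the gated-out criterion's low score never touches the result.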
Below is the judgment for the same race-condition case. All 16 rubric criteria are scored, with the grader's reasoning explaining how the actor handled the slot rejection and recovery.
Improve
Statistical Diagnosis: Rubric Health and Flagged Cases
Every evaluation run generates rich statistical signals. Reliability analysis separates actor variance from grader variance. Rubric-level diagnostics surface criteria that are too noisy, too lenient, or unclear in scope. Case-level analysis identifies which scenarios are fragile.
EvalFactory distills these signals into a prioritized review brief — typically reducing a day of manual annotation to a few minutes of targeted review.
Rubric Health
Each of the 16 rubrics is analyzed for scope stability, redundancy, saturation, and signal strength. High-priority rubrics need attention first.
| Criterion | Wt | Health | Cases | Appl % | Instab % | Mean | Std | Low % | Top % | Max Corr |
|---|---|---|---|---|---|---|---|---|---|---|
| Validates inputs and rejects or clarifies logically invalid requests, including impossible dates, past dates, zero diners, contradictory changes, or unsupported party sizes. (High priority) | 8 | Redundant, Unstable Scope | 53/100 | 70% | 43% | 4.37 | 0.33 | 0.2% | 45% | 0.75 (4b5d54de) |
| Enforces restaurant policies correctly, including service hours, closed days, booking window, last reservable slot, party-size limits, seating rules, and special handling requirements for large groups. (High priority) | 9 | Unstable Scope | 27/97 | 40% | 32% | 4.61 | 0.50 | — | 72% | 0.64 (a2f6aecd) |
| Handles special requests and seating preferences appropriately by recording them as notes or preferences unless the environment explicitly confirms them as guaranteed bookable attributes. (High priority) | 8 | Redundant, Unstable Scope | 76/98 | 82% | 27% | 4.64 | 0.38 | 0.3% | 76% | 0.87 (7efa3fa2) |
| Recovers reasonably from tool or system errors by retrying or asking for clarifying information when appropriate, without exposing internal implementation details to the user. (High priority) | 7 | Redundant, Unstable Scope, Weak Signal | 16/100 | 18% | 43% | 4.59 | 0.23 | 2.1% | 77% | 0.98 (0ce410aa) |
| For unavailable requests, offers helpful and plausible nearby alternatives when appropriate rather than stopping prematurely or implying no options exist without checking. | 7 | Redundant | 40/100 | 43% | 12% | 4.60 | 0.29 | 0.8% | 74% | 0.98 (7efa3fa2) |
| Verifies reservation identity sufficiently before modifying or canceling, especially when multiple matches exist, and does not act on partial or ambiguous matches. | 10 | Redundant | 31/97 | 33% | 8% | 4.50 | 0.49 | — | 68% | 0.97 (7efa3fa2) |
| Other important quality factors not already covered by the listed rubrics. | 5 | Always On, Redundant | 100/100 | 100% | — | 4.50 | 0.32 | 0.4% | 66% | 0.85 (0ce410aa) |
| Handles tool responses accurately, including availability results, policy rejections, lookup ambiguity, duplicate warnings, and errors, without fabricating outcomes. | 9 | Always On, Redundant | 100/100 | 100% | — | 4.69 | 0.26 | 0.9% | 82% | 0.89 (4b5d54de) |
| Maintains operational correctness across state changes, ensuring bookings, modifications, and cancellations align with environment results. | 9 | Always On, Redundant, Saturated | 99/100 | 98% | — | 4.81 | 0.17 | 0.5% | 89% | 0.89 (91547f97) |
| Communicates outcomes honestly and clearly, avoiding premature success language and explaining constraints clearly. | 8 | Always On, Redundant | 100/100 | 100% | — | 4.67 | 0.24 | 0.5% | 81% | 0.86 (a2f6aecd) |
| Uses the correct tool for the task and supplies complete, correctly formatted arguments. | 10 | Always On, Redundant | 100/100 | 100% | — | 4.68 | 0.25 | — | 80% | 0.81 (__sink__) |
| Checks availability or reservation state before confirming any booking or modification. | 10 | Always On, Redundant, Saturated | 100/100 | 100% | — | 4.84 | 0.16 | 0.1% | 91% | 0.85 (96aa435b) |
| Requests all required missing information before acting, while avoiding unnecessary clarification. | 10 | Always On, Redundant, Saturated | 100/100 | 100% | — | 4.77 | 0.23 | — | 87% | 0.83 (f5e8ab3b) |
| Correctly identifies the user's reservation intent and pursues the appropriate workflow without taking unrelated actions. | 9 | Always On, Redundant, Saturated | 100/100 | 100% | — | 4.90 | 0.11 | — | 94% | 0.83 (db089f53) |
| Provides a clear final summary after successful booking, modification, or cancellation with key details. | 9 | Always On, Redundant, Saturated | 94/100 | 94% | — | 4.96 | 0.03 | — | 97% | 0.86 (a6034eea) |
| Interprets dates, times, and relative temporal expressions accurately, asking for clarification when ambiguous. | 10 | Always On | 96/100 | 96% | — | 4.74 | 0.28 | 0.3% | 84% | 0.77 (2171c1ba) |
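Health flags like those in the table can be thought of as threshold checks on the per-rubric statistics. The sketch below uses illustrative thresholds chosen for this example, not EvalFactory's actual cutoffs.

```python
# Sketch of deriving health flags from per-rubric statistics.
# All thresholds here are illustrative assumptions.
def health_flags(appl_pct, instab_pct, mean, std, max_corr):
    flags = []
    if appl_pct >= 95:
        flags.append("Always On")        # applies to essentially every case
    if max_corr >= 0.80:
        flags.append("Redundant")        # strongly correlated with another rubric
    if instab_pct is not None and instab_pct >= 25:
        flags.append("Unstable Scope")   # graders disagree on when it applies
    if mean >= 4.75 and std <= 0.25:
        flags.append("Saturated")        # nearly everyone passes; little signal left
    if appl_pct <= 20:
        flags.append("Weak Signal")      # too few applicable cases to trust
    return flags
```

With these thresholds, the error-recovery row (18% applicable, 43% instability, mean 4.59, std 0.23, max correlation 0.98) yields Redundant, Unstable Scope, and Weak Signal.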
Review Brief
The system prioritizes what to fix first. Review the rubric queue, then the case queue.
Improve
Evaluator Improvement: Sharper Rubrics from Statistical Signals
Using the diagnosis, EvalFactory rewrites rubric descriptions with explicit applicability guidance — when to apply each criterion and when not to. This reduces grader disagreement on scope without changing what the rubrics measure.
Seven of fifteen rubrics were rewritten. Each now includes concrete Apply / Do-not-apply rules that tell the grader exactly which turns and scenarios trigger the criterion.
Before: Validates inputs and rejects or clarifies logically invalid requests, including impossible dates, past dates, zero diners, contradictory changes, or unsupported party sizes.
After: Validates and clarifies requests whose input is itself logically invalid, contradictory, or unsupported before any action is taken, including impossible dates, past dates, zero diners, negative party sizes, contradictory modification instructions, or party sizes that exceed the supported online channel. Apply this rubric only when the user's request contains an invalid or contradictory input that must be corrected or explicitly refused. Do not apply it for ordinary unavailability or for policy explanations after an otherwise valid request.

Before: Enforces restaurant policies correctly, including service hours, closed days, booking window, last reservable slot, party-size limits, seating rules, and special handling requirements for large groups.
After: Enforces restaurant policies correctly when the request actually touches a policy boundary, including service hours, closed days, booking window, last reservable slot, party-size limits, seating rules, and special handling requirements for large groups. Apply this rubric when the assistant must interpret or explain a restaurant rule to decide what is allowed. Do not apply it to ordinary happy-path turns where no policy boundary is implicated.

Before: Handles special requests and seating preferences appropriately by recording them as notes or preferences unless the environment explicitly confirms them as guaranteed bookable attributes.
After: Handles special requests, accessibility notes, and seating preferences appropriately by recording them as notes or preferences unless the environment explicitly confirms them as guaranteed bookable attributes. Apply this rubric only when the user actually asks for a seating preference, accessibility accommodation, or special request note. Do not apply it on ordinary turns with no such request.

Before: Recovers reasonably from tool or system errors by retrying or asking for clarifying information when appropriate, without exposing internal implementation details to the user.
After: Recovers reasonably from actual tool or system errors by retrying, asking for clarifying information, or explaining the interruption appropriately, without exposing internal implementation details to the user. Apply this rubric only when a real tool error, malformed tool result, transient system failure, or equivalent operational error occurs. Do not apply it to ordinary slot unavailability, policy refusals, or routine lookup ambiguity.

Before: For unavailable requests, offers helpful and plausible nearby alternatives when appropriate rather than stopping prematurely or implying no options exist without checking.
After: For a valid requested slot that is unavailable after checking, offers helpful and plausible nearby alternatives when such alternatives actually exist and are appropriate. Apply this rubric only when the exact request is unavailable but the assistant could reasonably surface nearby alternatives without changing the booking on the guest's behalf. Do not apply it to hard policy refusals, invalid-input clarifications, or tool/system error recovery.

Before: Handles tool responses accurately, including availability results, policy rejections, lookup ambiguity, duplicate warnings, and errors, without fabricating outcomes.
After: Handles actual tool responses accurately, including availability results, policy rejections returned by tools, lookup ambiguity, duplicate warnings, malformed results, and tool errors, without fabricating outcomes. Apply this rubric only on turns where the assistant is interpreting or responding to a concrete tool result or tool error. Do not apply it on turns with no relevant tool output yet.

Before: Verifies reservation identity sufficiently before modifying or canceling, especially when multiple matches exist, and does not act on partial or ambiguous matches.
After: Verifies reservation identity sufficiently before modifying or canceling, especially when multiple matches exist, and does not act on partial or ambiguous matches. Apply this rubric only in lookup, modification, or cancellation flows where the assistant must establish which existing reservation is being targeted. Do not apply it to new-booking flows or to generic tool-error handling.
Improve
Agent Improvement: Targeted Prompt Changes and Cross-Eval Results
EvalFactory also generates targeted actor prompt improvements from the same diagnosis. Each new rule addresses a specific instability pattern — ambiguous requests, silent substitutions, or fabricated confirmations.
Both actor versions were tested on the improved evaluation with 4 runs and 4 grading passes each. Consistency improved substantially while overall quality held steady.
Targeted Prompt Additions
Six new behavior rules were added to the actor prompt, each addressing a specific instability pattern surfaced by the diagnosis.
Never silently change the requested date, time, party size, or seating area to a nearby valid option. Offer alternatives, but wait for the guest to choose one explicitly.
Fixes: Actor sometimes auto-substituted alternatives without asking, causing rubric disagreement on intent handling.
Never convert a same-day request on a closed day into the next open day unless the guest explicitly asks for that fallback.
Fixes: Closed-day requests triggered inconsistent behavior — some runs silently moved to the next day.
When the request is contradictory or materially ambiguous, stop and clarify before searching, creating, modifying, or canceling.
Fixes: Ambiguous requests (like 'book us for 12 tonight') produced different interpretations across runs.
If the exact requested slot is unavailable, do not create or modify a different slot until the guest explicitly approves the alternative.
Fixes: Race-condition and unavailability scenarios showed the actor sometimes proceeding without consent.
If a modification request exceeds supported policy limits, explain that the change cannot be completed and state the existing reservation remains unchanged.
Fixes: Large party modifications (e.g. 6→14) caused the actor to invent manual coordination steps.
In final summaries, copy factual details only from the successful tool result. Do not invent or alter names, phone numbers, reservation IDs, dates, or times.
Fixes: Occasional hallucination of reservation details in final summaries — especially reservation IDs.
Cross-Evaluation Results: v1 vs v2 Actor
| Metric | v1 Actor | v2 Actor | Change |
|---|---|---|---|
| Actor consistency (ICC) | 0.32 | 0.45 | +39% |
| Actor score std dev | 0.064 | 0.054 | -15% |
| Actor-unstable cases | 48 | 31 | -35% |
| Mean score | 0.933 | 0.936 | +0.3% |
| Pass rate | 97.6% | 97.5% | -0.1% |
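The actor-consistency metric above is an intraclass correlation over repeated runs: how much of the score variance is explained by the case rather than by run-to-run noise. A one-way ICC(1) estimate can be sketched with the standard mean-squares formula; the input format is an assumption for this example.

```python
# Sketch of a one-way random-effects ICC(1) over repeated runs.
# scores maps case id -> list of that case's scores across runs.
def icc1(scores):
    cases = list(scores)
    n = len(cases)                                  # number of cases
    k = len(scores[cases[0]])                       # runs per case (assumed equal)
    grand = sum(sum(runs) for runs in scores.values()) / (n * k)
    case_means = {c: sum(scores[c]) / k for c in cases}
    # Between-case and within-case mean squares:
    msb = k * sum((m - grand) ** 2 for m in case_means.values()) / (n - 1)
    msw = sum((x - case_means[c]) ** 2
              for c in cases for x in scores[c]) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)
```

Perfectly repeatable runs give an ICC of 1.0; scores dominated by run-to-run noise push it toward zero (or below), which is why raising ICC from 0.32 to 0.45 is a meaningful consistency gain.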
Calibrate
Continuous Calibration: Human Feedback and Online Signals
Offline improvement gets you most of the way. Calibration closes the remaining gap by teaching the evaluator where its boundary disagrees with reality.
Human-anchor calibration uses a small set of labeled examples to correct false passes and false fails — tuning the verdict boundary without rewriting the rubric catalog. Online signal integration feeds business metrics like user satisfaction, retention, or revenue back into the evaluator, so the eval stays aligned with what actually matters in production. Raw signals are cleaned and case-linked before they adjust anything — curation before calibration.
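Tuning the verdict boundary against a handful of labeled anchors can be sketched as a one-dimensional search: pick the pass threshold that minimizes disagreement (false passes plus false fails) with the human labels. The function and data shapes below are assumptions for this sketch, not the actual calibration mechanism.

```python
# Sketch of human-anchor threshold calibration (illustrative only).
# anchors: list of (eval_score, human_says_pass) pairs.
def calibrate_threshold(anchors, candidates=None):
    if candidates is None:
        candidates = sorted({score for score, _ in anchors})
    def disagreements(t):
        false_pass = sum(1 for s, ok in anchors if s >= t and not ok)
        false_fail = sum(1 for s, ok in anchors if s < t and ok)
        return false_pass + false_fail
    return min(candidates, key=disagreements)       # boundary with fewest errors

anchors = [(0.95, True), (0.90, True), (0.88, False), (0.80, False)]
threshold = calibrate_threshold(anchors)            # 0.90 separates these cleanly
```

Because only the verdict boundary moves, the rubric catalog itself stays untouched, which is exactly the point: calibration corrects where the evaluator disagrees with reality without rewriting what it measures.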