Prompt Optimizer — Ticket Triage — Medium Run

Before/after comparison on the held-out valid split. Generated 2026-04-21 16:46 UTC.

Task: Classify incoming support tickets into category / urgency / routing and draft a starter reply.

# Support Ticket Triage — Task Specification

This is the input document for the ticket-triage optimization. It describes what the system must do, the output schema, decision rules, and edge cases.

## What the system does

Classifies incoming support tickets and produces a structured triage decision so front-line support has routing, urgency, and a starter reply prepared before a human reads the ticket.

## Input

Raw ticket text as written by the customer. Variable length, style, and grammar. Sometimes includes error codes, log snippets, or screenshot descriptions. Occasionally non-English.

## Output schema

```json
{
  "category": "auth | billing | bug | feature_request | outage | account | other",
  "urgency":  "low | medium | high | critical",
  "routing":  "tier1 | tier2 | engineering | oncall | billing_team",
  "reply_draft": "2-3 sentence initial response the agent will adapt"
}
```
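A minimal validator for this schema (illustrative; the enum values come from the spec above, while the function and constant names are our own):

```python
# Allowed values for each classification field, taken from the output schema.
ALLOWED = {
    "category": {"auth", "billing", "bug", "feature_request", "outage", "account", "other"},
    "urgency": {"low", "medium", "high", "critical"},
    "routing": {"tier1", "tier2", "engineering", "oncall", "billing_team"},
}

def validate_triage(obj: dict) -> list[str]:
    """Return a list of schema violations; an empty list means the object is valid."""
    errors = []
    for field, allowed in ALLOWED.items():
        value = obj.get(field)
        if value not in allowed:
            errors.append(f"{field}: {value!r} not in {sorted(allowed)}")
    reply = obj.get("reply_draft")
    if not isinstance(reply, str) or not reply.strip():
        errors.append("reply_draft: must be a non-empty string")
    return errors
```

A valid object produces `[]`; a model reply missing fields or using out-of-enum values produces one error per bad field.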

## Category definitions

| Category | Includes | Excludes |
|---|---|---|
| `auth` | Login, password reset, 2FA, SSO, session expiration | Billing-gated access (→ `billing`) |
| `billing` | Payment failures, invoices, refunds, plan/seat changes, upgrades, downgrades | Login issues after payment (→ `auth`) |
| `bug` | Single-user product defect, broken behavior, error messages | Mass impact (→ `outage`) |
| `feature_request` | User wants functionality that doesn't exist | Bugs described as "missing feature" |
| `outage` | Explicit mass impact: "all users", "down for everyone", "our whole team" | Single-user bugs |
| `account` | Profile, settings, email changes, data export/deletion, account suspension | Auth problems (→ `auth`) |
| `other` | Spam, unrelated, non-English, too vague to classify | Anything that fits above |

## Urgency definitions

- **critical** — Outage, active data loss, security concern, regulatory/legal risk, "production down".
- **high** — Blocking for this user, core feature unusable, explicit deadline, angry paying customer.
- **medium** — Friction but workaround exists. User can still achieve the goal.
- **low** — Question, feature request, minor cosmetic issue.

## Routing rules

- **oncall** — All `critical` urgency. All `outage` category.
- **billing_team** — `billing` category, unless the root cause is a bug in billing code (then `engineering`).
- **engineering** — `bug` with complex symptoms (error codes, stack traces) or multi-system reach.
- **tier2** — Complex issues tier1 would escalate (but non-engineering), or high-urgency non-critical tickets that aren't billing.
- **tier1** — Default front line. Handles auth resets, simple billing questions, account admin, feature requests.
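The routing rules above can be sketched as a first-match decision function (a simplified reading of this section only — the function and flag names are ours, and judgment calls such as "complex symptoms" are reduced to boolean inputs):

```python
def route(category: str, urgency: str, *, complex_bug: bool = False,
          billing_code_bug: bool = False, needs_tier2: bool = False) -> str:
    """Apply the routing rules top-down; the first matching rule wins."""
    if urgency == "critical" or category == "outage":
        return "oncall"
    if category == "billing":
        # Billing stays with billing_team unless the root cause is a code defect.
        return "engineering" if billing_code_bug else "billing_team"
    if category == "bug" and complex_bug:
        return "engineering"
    if needs_tier2 or urgency == "high":
        return "tier2"
    return "tier1"  # default front line
```

For example, `route("billing", "high")` stays with `billing_team`, while `route("bug", "high", complex_bug=True)` goes to `engineering`.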

## Override rules and edge cases

| Signal | Override |
|---|---|
| Security concern (leaked password, unauthorized access, data breach) | urgency=`critical`, routing=`oncall` |
| "urgent", "ASAP", "right now", "down" keywords | nudge urgency up one tier |
| Payment failure during checkout | urgency=`high`, routing=`billing_team` |
| Feature request | urgency=`low` always |
| Multi-issue ticket | pick highest-urgency issue for category; mention others in `reply_draft` |
| Non-English input | category=`other`, routing=`tier1`, urgency=`medium`, `reply_draft` in English asking for clarification |
| Vague input (fewer than 10 meaningful words) | category=`other`, routing=`tier1`, `reply_draft` asks a clarifying question |
| Emotional or profane tone | urgency unchanged; `reply_draft` acknowledges frustration |
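The keyword nudge and the feature-request override lend themselves to a small helper (a sketch — the keyword list comes from the table above, while the substring match and the clamp at `critical` are our reading):

```python
TIERS = ["low", "medium", "high", "critical"]
KEYWORDS = ("urgent", "asap", "right now", "down")

def nudge_urgency(base: str, ticket_text: str, is_feature_request: bool = False) -> str:
    """Bump urgency one tier when escalation keywords appear, capped at critical.
    Feature requests stay low regardless, per the override table."""
    if is_feature_request:
        return "low"
    text = ticket_text.lower()
    # Naive substring match; a production version would use word boundaries.
    if any(kw in text for kw in KEYWORDS):
        return TIERS[min(TIERS.index(base) + 1, len(TIERS) - 1)]
    return base
```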

## Reply draft guidance

- 2–3 sentences.
- Tone: professional, empathetic, concrete.
- Include next steps (for example, "I'll escalate this to our engineering team," or "I'm triggering a password reset — check your email in the next few minutes").
- Do NOT promise SLA windows or specific timelines.
- Do NOT include URLs, personalization tokens, or the user's full name (the human agent edits before sending).
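A quick lint pass over draft replies can enforce the mechanical parts of this guidance (a sketch; the sentence split is deliberately naive, and the `{{...}}` token pattern is an assumption about what a personalization token looks like):

```python
import re

def lint_reply(draft: str) -> list[str]:
    """Flag drafts that break the reply guidance: 2-3 sentences, no URLs, no tokens."""
    problems = []
    # Naive sentence split on terminal punctuation; good enough for a lint pass.
    sentences = [s for s in re.split(r"[.!?]+\s*", draft.strip()) if s]
    if not 2 <= len(sentences) <= 3:
        problems.append(f"expected 2-3 sentences, found {len(sentences)}")
    if re.search(r"https?://|www\.", draft):
        problems.append("contains a URL")
    if re.search(r"\{\{.*?\}\}", draft):  # assumed token syntax, e.g. {{first_name}}
        problems.append("contains a personalization token")
    return problems
```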

## Target distribution (for eval set synthesis)

Rough production mix we want the eval set to reflect:

| Category | Share | | Urgency | Share |
|---|---|---|---|---|
| auth | 25% | | low | 30% |
| billing | 20% | | medium | 45% |
| bug | 15% | | high | 20% |
| account | 15% | | critical | 5% |
| feature_request | 10% | | | |
| outage | 10% | | | |
| other | 5% | | | |
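Under this target mix, a 60-ticket eval set works out to fixed per-category counts. A quick sanity check (the dict and function names are ours, and the drift-correction strategy is one reasonable choice, not necessarily what the synthesis pipeline does):

```python
CATEGORY_MIX = {"auth": 0.25, "billing": 0.20, "bug": 0.15, "account": 0.15,
                "feature_request": 0.10, "outage": 0.10, "other": 0.05}

def stratified_counts(total: int, mix: dict[str, float]) -> dict[str, int]:
    """Round each share to the nearest ticket, then absorb any rounding
    drift into the largest bucket so counts sum to the total."""
    counts = {k: round(total * v) for k, v in mix.items()}
    drift = total - sum(counts.values())
    if drift:
        counts[max(counts, key=counts.get)] += drift
    return counts
```

For `total=60` this gives 15 auth, 12 billing, 9 bug, 9 account, 6 feature_request, 6 outage, and 3 other.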
| Metric | Value | Note |
|---|---|---|
| Aggregate score | 94.4% | +8.3 pp vs baseline (86.1%) |
| Mean latency | 5288 ms | +1521 ms vs baseline (3767 ms) |
| Live-test tickets | 18 | of 60 total · 42 used for training |
| Target model | anthropic/claude-sonnet-4-6 | same model, both programs |
| Compile time | 11m 19s | 876/900 rollouts · 23 iterations |

Per-field accuracy

How this was compiled

GEPA (Genetic-Pareto Evolution) optimized the prompt by repeatedly drafting variations, testing them against training tickets, and using a stronger LM to read failures and propose refinements. It stops when the budget runs out or improvements plateau.

- Total tickets: 60, synthesized from the task spec, stratified 70/30 by category (seed=42)
- Training tickets: 42 — GEPA’s working set; it drafts candidates and tests them against these
- Live-test tickets: 18 — held out from GEPA entirely; all headline numbers above are scored on this split only
- Budget preset: medium — ~1600 rollouts, ~30–50 candidate prompts tested
- Target LM: anthropic/claude-sonnet-4-6 — runs the prompt on each ticket
- Reflection LM: anthropic/claude-opus-4-7 — reads failures, proposes better drafts
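In simplified terms, the compile loop works like this (a deliberately toy, self-contained sketch — the real optimizer in the `gepa`/`dspy` packages maintains a whole Pareto frontier of candidates rather than a single best, and the `evaluate`/`reflect` callbacks stand in for target-LM rollouts and reflection-LM calls):

```python
def compile_prompt(seed_prompt, trainset, evaluate, reflect, budget=1600):
    """Toy GEPA-style loop: score a candidate on the trainset, let a
    reflection step propose a revision, keep the better of the two."""
    best_prompt = seed_prompt
    best_score = evaluate(best_prompt, trainset)
    rollouts = len(trainset)  # one rollout per (prompt, ticket) pair
    while rollouts + len(trainset) <= budget:
        candidate = reflect(best_prompt, trainset)  # stronger LM proposes a draft
        score = evaluate(candidate, trainset)
        rollouts += len(trainset)
        if score > best_score:
            best_prompt, best_score = candidate, score
    return best_prompt, best_score
```

The loop terminates when the next full evaluation would exceed the rollout budget, mirroring the "stops when the budget runs out" behavior described above.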

Baseline → Compiled

The prompt as handed to us (left) vs. what GEPA produced (right). The size difference alone tells part of the story; the structure difference is where the accuracy gains come from.

Baseline (367 chars)

You are a support ticket triage assistant. Classify the incoming support ticket and suggest a response. Output JSON with these fields:
- category: auth, billing, bug, feature_request, outage, account, or other
- urgency: low, medium, high, or critical
- routing: tier1, tier2, engineering, oncall, or billing_team
- reply_draft: a brief response to send the customer

Compiled (7508 chars)

You are a support ticket triage assistant. You will receive a support ticket (free-form customer message) and must classify it and draft a reply. Output ONLY a JSON object with exactly these fields:
- category: one of auth, billing, bug, feature_request, outage, account, or other
- urgency: one of low, medium, high, or critical
- routing: one of tier1, tier2, engineering, oncall, or billing_team
- reply_draft: a brief, empathetic response to send the customer

## Category definitions and rules
- auth: login issues, password resets, MFA/2FA, SSO problems. IMPORTANT: When a ticket combines an account lifecycle request (like GDPR deletion) with an active login/2FA block that prevents the user from proceeding, classify as "auth" — the login block is the immediate gating issue that must be resolved first, and it outweighs the account-management intent in classification.
- billing: invoices, charges, refunds, payment methods, subscription/plan changes, checkout/payment processing failures. Billing covers the end-to-end payment flow even when something is "broken" in checkout — a failed upgrade attempt, declined card, or checkout not going through is a BILLING issue, not a bug, unless the ticket clearly describes a non-payment product defect. If the issue is a billing data/document correction caused by a system glitch (duplicate charges from a retry loop, wrong invoice), it is still "billing". Only classify as "bug" when the core symptom is a product feature malfunctioning outside the payment flow (e.g., notifications not delivering, settings not persisting for a specific user).
- bug: product defects, features not working as configured, unexpected errors, broken functionality for a specific user/account (outside of payment/checkout). Use this only when the core issue is a non-billing functional defect.
- feature_request: asks for new capabilities or enhancements.
- outage: broad service disruption. Overrides other categories when the ticket indicates impact to "all users", "everyone", "down for everyone", company-wide, or region-wide failure. A single user's checkout failing is NOT an outage. A customer reporting their "whole team" is down/getting errors counts as an outage from our triage perspective.
- account: account management tasks like changing email, name, profile info, CLOSING/DELETING an account, data export requests (including GDPR deletion/export), seat management (non-defect). Account deletion and data export requests are "account" even when the same ticket also mentions a minor bug — the primary, user-stated intent (account closure/export) wins over a secondary "annoying but not a dealbreaker" defect. EXCEPTION: if the user is actively blocked by an auth/login failure (e.g., 2FA broken, can't log in), classify as "auth" instead.
- other: anything that doesn't fit above, including extremely vague tickets like "doesn't work" with no discernible category.

Multi-issue tickets: classify by the most technically significant / highest-severity issue — BUT:
- When the user's primary request is an account lifecycle action (delete/export) and the secondary issue is explicitly minor, classify as "account" (unless an auth block is also present — then "auth").
- For tickets combining an invoice correction AND a clearly separate functional defect (e.g., notifications not arriving), classify as "bug".
- When auth (login/2FA/SSO) is blocking the user from doing anything else they've asked about, "auth" wins.

## Urgency rules
- critical: outages affecting all/many users; security concerns (data leak, unauthorized access, suspected breach); severe data loss. Security concerns ALWAYS → critical (overrides everything else).
- high: angry customer with significant financial impact (e.g., duplicate/triple charges), production-impacting bug for a paying customer, urgent account lockout (including 2FA lockouts blocking further action), checkout/payment failure blocking a paying customer with a stated deadline. A multi-issue ticket where the user is locked out AND has financial discrepancies AND needs GDPR action should be "high".
- medium: functional issues affecting a subset of users, billing discrepancies without immediate financial harm, multi-issue tickets with non-urgent items, vague/ambiguous requests that could have material impact (e.g., "export my data" with no context, or "doesn't work" with no details — these default to medium because scope is unclear and the underlying issue could be material). Tickets containing urgency keywords ("urgent", "ASAP", "down", "today", "deadline") should be nudged up one tier from baseline.
- low: how-to questions, minor account settings, cosmetic issues. Feature requests are ALWAYS low urgency.

Override summary:
- Security concern → always critical.
- feature_request → always low.
- "urgent"/"ASAP"/"down"/deadline language → bump up one tier.
- Vague data/export/account requests with unclear scope → medium (not low).
- Vague "doesn't work"-type tickets with no specifics → medium (not low), because the unknown scope could hide a material issue.
- Active account lockout (can't log in, 2FA broken) combined with other significant issues → high.

## Routing rules (apply in order, first match wins)
1. Any urgency = critical → oncall.
2. Outage category → oncall.
3. Billing category → billing_team. This applies even to checkout/payment processing failures and even when urgency is "high" — billing_team handles billing issues unless the root cause is a confirmed non-payment system defect (a "billing bug" affecting functionality outside the payment flow), in which case → engineering.
4. Bug category with complex symptoms (delivery failures, data inconsistencies, integrations, anything requiring code investigation) → engineering.
5. Simple bugs that may be config/user error → tier2.
6. Auth issues → tier1 by default; tier2 if complex (SSO problems, MFA/2FA lockouts, or auth combined with other issues requiring cross-team coordination).
7. Feature requests → tier1 (for initial logging/acknowledgement).
8. Account management (including deletion, data export, GDPR requests, profile changes) → tier1. Tier1 handles these even when the ticket also mentions a minor secondary defect.
9. Default → tier1.

Important: do not route billing/checkout failures to engineering by default. The billing_team triages payment issues first and escalates internally if needed.

## Reply draft guidelines
- Be concise, empathetic, and professional.
- Acknowledge each distinct issue raised (numbered list is fine for multi-issue).
- For angry/frustrated customers or those with deadlines, lead with a sincere apology, acknowledge frustration, and commit to a concrete next step and timeframe (without promising specific refund amounts or SLAs).
- For how-to questions, provide clear step-by-step instructions.
- Ask for any info needed to resolve (affected email address, browser/device, approximate times, last 4 of payment method, invoice number, etc.).
- For vague requests, ask clarifying questions to determine scope (what's not working, steps taken, error messages, account email).
- For multi-issue tickets, address the blocking issue (e.g., auth lockout) first so the customer can proceed.
- For outages, ask for start time/timezone, affected regions/locations, and any specific error messages to help triage.
- Do not promise specific refund amounts or SLAs beyond general commitments.
- Brief sign-off; no formal signature block.

Output ONLY the JSON object with the four fields.

Per-example scores (held-out live-test split)

| # | Ticket (preview) | Baseline | Compiled |
|---|---|---|---|
| 1 | I want to permanently delete my account and have all my data wiped per GDPR. Please confir… | 1.00 | 0.75 |
| 2 | hi i think someone logged into my account from russia last night, i got a notification and… | 0.50 | 1.00 |
| 3 | I can't log into my account. Tried resetting the password twice but the email never arrive… | 0.75 | 1.00 |
| 4 | hi i think someone else logged into my account?? i got an email about a login from germany… | 1.00 | 1.00 |
| 5 | hi think someone logged into my account from russia, got an email about a new device and i… | 1.00 | 1.00 |
| 6 | My card was declined when I tried to upgrade to the Pro plan just now at checkout. I need … | 1.00 | 1.00 |
| 7 | my card keeps getting declined when i try to checkout for the annual plan. ive tried 3 dif… | 1.00 | 1.00 |
| 8 | Tried to upgrade to the Pro plan at checkout and got error code PAY_4021. Card was decline… | 1.00 | 1.00 |
| 9 | When I click the Save button on the report builder I get error code RPT_5532 and a stack t… | 1.00 | 1.00 |
| 10 | exporting a CSV with more than ~5000 rows throws Error E_TIMEOUT_4412 in the console. smal… | 0.75 | 1.00 |
| 11 | It would be really nice if you could add a dark mode toggle to the mobile app. The white b… | 1.00 | 1.00 |
| 12 | Hello team, it would be wonderful if the dashboard supported dark mode. Many of us work la… | 1.00 | 1.00 |
| 13 | Bonjour, j'ai un problème avec mon compte et je ne peux pas accéder à mes factures. Pouvez… | 0.75 | 1.00 |
| 14 | Bonjour, je n'arrive pas à changer l'adresse email associée à mon compte. Pouvez-vous m'ai… | 0.50 | 0.50 |
| 15 | pls help | 0.75 | 1.00 |
| 16 | Bonjour, je n'arrive pas à modifier mon adresse email dans les paramètres. Pouvez-vous m'a… | 0.50 | 0.75 |
| 17 | URGENT!!! our entire team cannot access the dashboard, everyone is getting a 503. this is … | 1.00 | 1.00 |
| 18 | The entire app is down for our whole team — nobody can load the dashboard, we're all getti… | 1.00 | 1.00 |
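The headline numbers can be recomputed directly from this table: each per-ticket score is already the four field accuracies averaged, so the aggregate is just the mean of a column.

```python
baseline = [1.00, 0.50, 0.75, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00,
            0.75, 1.00, 1.00, 0.75, 0.50, 0.75, 0.50, 1.00, 1.00]
compiled = [0.75, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00,
            1.00, 1.00, 1.00, 1.00, 0.50, 1.00, 0.75, 1.00, 1.00]

def aggregate(scores: list[float]) -> float:
    """Mean per-ticket score, expressed as a percentage to one decimal."""
    return round(100 * sum(scores) / len(scores), 1)

print(aggregate(baseline), aggregate(compiled))  # 86.1 94.4
```

This reproduces the 86.1% baseline and 94.4% compiled scores, and the +8.3 pp delta, from the 18 live-test rows alone.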

Inputs to this run

What went into the optimization. The service supports three kinds of engagement data: real production traces, hand-labeled examples, or synthetic data generated from a task spec. This run used the synthetic path — the typical cold-start / pre-launch scenario.

Used

- Task specification: evals/task-spec.md — shown inline at the top of this report
- Original prompt: prompts/baseline.md — shown in the “Baseline → Compiled” panel above (367 chars)
- Evaluation data: 60 examples synthesized from the task spec via the reflection LM (anthropic/claude-opus-4-7)
- Metric weights: default equal weights (0.25 per field); configurable via evals/metric-weights.json
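With the default equal weights, the per-ticket metric reduces to a weighted field match. A minimal sketch (the classification fields are exact-match per the report; how `reply_draft` is actually scored isn't shown here, so the judge callback below is a stand-in):

```python
WEIGHTS = {"category": 0.25, "urgency": 0.25, "routing": 0.25, "reply_draft": 0.25}

def ticket_score(pred: dict, gold: dict, reply_judge=None) -> float:
    """Weighted per-field score in [0, 1]. Classification fields are
    exact-match; reply_draft needs a judge callback (an LM judge in the
    real pipeline, stubbed here as exact match)."""
    score = 0.0
    for field in ("category", "urgency", "routing"):
        score += WEIGHTS[field] * (pred.get(field) == gold.get(field))
    judge = reply_judge or (lambda p, g: float(p == g))
    score += WEIGHTS["reply_draft"] * judge(pred.get("reply_draft"), gold.get("reply_draft"))
    return score
```

One wrong field costs exactly 0.25, which is why the per-example table contains only the values 1.00, 0.75, and 0.50.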

Not used in this run

- Production trace logs: used when clients can export real traffic from their observability tools (LangSmith, Langfuse, Helicone, OpenTelemetry, or plain JSONL). The metric becomes the business outcome directly — CSAT, escalation rate, deflection — so no hand-crafted weights are needed. Typical for retainer engagements where data compounds over time.
- Hand-labeled examples: used when clients can provide 20+ input/output pairs they've labeled themselves. Typical for pre-launch or internal-testing scenarios where some real data exists but not in production volume.

Tools, libraries and versions

Exact versions installed during this run — useful for reproducing the numbers.

| Package | Version |
|---|---|
| Python | 3.11.14 |
| dspy-ai | 3.1.3 |
| dspy | 3.1.3 |
| gepa | 0.0.26 |
| litellm | 1.83.0 |
| openai | 2.32.0 |
| anthropic | (not installed) |

Execution environment

Hardware and OS the pipeline ran on.

- OS: Darwin 25.2.0
- Architecture: arm64
- CPU: Apple M4 (10 cores)
- Memory: 16.0 GB

Terminology

- **GEPA** — Genetic-Pareto Evolution — the optimizer. Drafts variations of the prompt, tests them, uses a stronger LM (the reflection LM) to read failures and propose better drafts.
- **Baseline** — The original prompt as handed to us, scored on the live-test set. The starting point.
- **Compiled** — GEPA’s optimized version of the baseline, produced by the run.
- **Rollout** — One evaluation: the target LM runs one draft prompt on one training ticket.
- **Candidate** — A distinct draft version of the prompt tested during compilation.
- **Pareto frontier** — The set of candidates where none strictly dominates another (one wins on category, another wins on urgency, none wins on all dimensions). GEPA picks the champion from this frontier.
- **light (budget)** — ~550 rollouts, ~10–15 candidates tested. Fastest and cheapest preset; best for proof-of-concept runs. Results are real but trade-offs may appear (one field regresses while others improve) because the Pareto frontier is small.
- **medium (budget)** — ~1600 rollouts, ~30–50 candidates. The typical default for production engagements. A larger Pareto frontier means more balanced candidates are available to pick from.
- **heavy (budget)** — ~5000 rollouts, ~100+ candidates. Maximum search depth. Used when every last point of lift matters or the metric is unusually noisy; diminishing returns past medium for most tasks.
- **Target LM** — The model that runs the prompt on each ticket (Claude Sonnet here). Usually the cheaper, faster model — it does the bulk of the work.
- **Reflection LM** — The model that reads failure traces and proposes better drafts (Claude Opus here). Usually stronger and more expensive; called less often but drives the optimization signal.
- **Training tickets** — Examples GEPA sees during compilation. It drafts and tests candidates against these.
- **Live-test tickets** — Examples held out from GEPA entirely. Scored only after compile — this is the generalization number.
- **Aggregate score** — The four field-accuracy numbers averaged per ticket, then averaged across all tickets. The headline number.
- **Percentage points (pp)** — The delta between two percentages on a 0–100 scale. “+5 pp” means the score moved from 86% to 91%, not from 86% to (86% × 1.05).
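The Pareto-frontier idea is easy to make concrete: given per-dimension scores for each candidate, keep those not strictly dominated by any other. A self-contained sketch (not GEPA's actual implementation):

```python
def pareto_frontier(candidates: dict[str, tuple[float, ...]]) -> set[str]:
    """Return the candidates for which no other candidate is >= on every
    dimension and strictly > on at least one."""
    def dominates(a: tuple[float, ...], b: tuple[float, ...]) -> bool:
        return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))
    return {name for name, score in candidates.items()
            if not any(dominates(other, score)
                       for oname, other in candidates.items() if oname != name)}
```

With scores on (category accuracy, urgency accuracy), a candidate at (0.9, 0.7) and one at (0.7, 0.9) both survive, while (0.6, 0.6) is dominated and dropped.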