Prompt Optimizer — Ticket Triage — Medium Run

Before/after comparison on the held-out valid split. Generated 2026-04-21 16:46 UTC.

Task: Classify incoming support tickets into category / urgency / routing and draft a starter reply.

# Support Ticket Triage — Task Specification

This is the input document for the ticket-triage optimization. It describes what the system must do, the output schema, decision rules, and edge cases.

## What the system does

Classifies incoming support tickets and produces a structured triage decision so front-line support has routing, urgency, and a starter reply prepared before a human reads the ticket.

## Input

Raw ticket text as written by the customer. Variable length, style, and grammar. Sometimes includes error codes, log snippets, or screenshot descriptions. Occasionally non-English.

## Output schema

```json
{
  "category": "auth | billing | bug | feature_request | outage | account | other",
  "urgency":  "low | medium | high | critical",
  "routing":  "tier1 | tier2 | engineering | oncall | billing_team",
  "reply_draft": "2-3 sentence initial response the agent will adapt"
}
```
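A minimal validator for this schema (illustrative; the enum values come from the spec above, while the function and constant names are our own):

```python
# Allowed values for each classification field, taken from the output schema.
ALLOWED = {
    "category": {"auth", "billing", "bug", "feature_request", "outage", "account", "other"},
    "urgency": {"low", "medium", "high", "critical"},
    "routing": {"tier1", "tier2", "engineering", "oncall", "billing_team"},
}

def validate_triage(obj: dict) -> list[str]:
    """Return a list of schema violations; an empty list means the object is valid."""
    errors = []
    for field, allowed in ALLOWED.items():
        value = obj.get(field)
        if value not in allowed:
            errors.append(f"{field}: {value!r} not in {sorted(allowed)}")
    reply = obj.get("reply_draft")
    if not isinstance(reply, str) or not reply.strip():
        errors.append("reply_draft: must be a non-empty string")
    return errors
```

A valid object produces `[]`; a model reply missing fields or using out-of-enum values produces one error per bad field.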

## Category definitions

| Category | Includes | Excludes |
|---|---|---|
| `auth` | Login, password reset, 2FA, SSO, session expiration | Billing-gated access (→ `billing`) |
| `billing` | Payment failures, invoices, refunds, plan/seat changes, upgrades, downgrades | Login issues after payment (→ `auth`) |
| `bug` | Single-user product defect, broken behavior, error messages | Mass impact (→ `outage`) |
| `feature_request` | User wants functionality that doesn't exist | Bugs described as "missing feature" |
| `outage` | Explicit mass impact: "all users", "down for everyone", "our whole team" | Single-user bugs |
| `account` | Profile, settings, email changes, data export/deletion, account suspension | Auth problems (→ `auth`) |
| `other` | Spam, unrelated, non-English, too vague to classify | Anything that fits above |

## Urgency definitions

- **critical** — Outage, active data loss, security concern, regulatory/legal risk, "production down".
- **high** — Blocking for this user, core feature unusable, explicit deadline, angry paying customer.
- **medium** — Friction but workaround exists. User can still achieve the goal.
- **low** — Question, feature request, minor cosmetic issue.

## Routing rules

- **oncall** — All `critical` urgency. All `outage` category.
- **billing_team** — `billing` category, unless the root cause is a bug in billing code (then `engineering`).
- **engineering** — `bug` with complex symptoms (error codes, stack traces) or multi-system reach.
- **tier2** — Complex issues tier1 would escalate (but non-engineering), or high-urgency non-critical tickets that aren't billing.
- **tier1** — Default front line. Handles auth resets, simple billing questions, account admin, feature requests.
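The routing rules above can be sketched as a first-match decision function (a simplified reading of this section only — the function and flag names are ours, and judgment calls such as "complex symptoms" are reduced to boolean inputs):

```python
def route(category: str, urgency: str, *, complex_bug: bool = False,
          billing_code_bug: bool = False, needs_tier2: bool = False) -> str:
    """Apply the routing rules top-down; the first matching rule wins."""
    if urgency == "critical" or category == "outage":
        return "oncall"
    if category == "billing":
        # Billing stays with billing_team unless the root cause is a code defect.
        return "engineering" if billing_code_bug else "billing_team"
    if category == "bug" and complex_bug:
        return "engineering"
    if needs_tier2 or urgency == "high":
        return "tier2"
    return "tier1"  # default front line
```

For example, `route("billing", "high")` stays with `billing_team`, while `route("bug", "high", complex_bug=True)` goes to `engineering`.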

## Override rules and edge cases

| Signal | Override |
|---|---|
| Security concern (leaked password, unauthorized access, data breach) | urgency=`critical`, routing=`oncall` |
| "urgent", "ASAP", "right now", "down" keywords | nudge urgency up one tier |
| Payment failure during checkout | urgency=`high`, routing=`billing_team` |
| Feature request | urgency=`low` always |
| Multi-issue ticket | pick highest-urgency issue for category; mention others in `reply_draft` |
| Non-English input | category=`other`, routing=`tier1`, urgency=`medium`, `reply_draft` in English asking for clarification |
| Vague input (fewer than 10 meaningful words) | category=`other`, routing=`tier1`, `reply_draft` asks a clarifying question |
| Emotional or profane tone | urgency unchanged; `reply_draft` acknowledges frustration |
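The keyword nudge and the feature-request override lend themselves to a small helper (a sketch — the keyword list comes from the table above, while the substring match and the clamp at `critical` are our reading):

```python
TIERS = ["low", "medium", "high", "critical"]
KEYWORDS = ("urgent", "asap", "right now", "down")

def nudge_urgency(base: str, ticket_text: str, is_feature_request: bool = False) -> str:
    """Bump urgency one tier when escalation keywords appear, capped at critical.
    Feature requests stay low regardless, per the override table."""
    if is_feature_request:
        return "low"
    text = ticket_text.lower()
    # Naive substring match; a production version would use word boundaries.
    if any(kw in text for kw in KEYWORDS):
        return TIERS[min(TIERS.index(base) + 1, len(TIERS) - 1)]
    return base
```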

## Reply draft guidance

- 2–3 sentences.
- Tone: professional, empathetic, concrete.
- Include next steps (for example, "I'll escalate this to our engineering team," or "I'm triggering a password reset — check your email in the next few minutes").
- Do NOT promise SLA windows or specific timelines.
- Do NOT include URLs, personalization tokens, or the user's full name (the human agent edits before sending).
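A quick lint pass over draft replies can enforce the mechanical parts of this guidance (a sketch; the sentence split is deliberately naive, and the `{{...}}` token pattern is an assumption about what a personalization token looks like):

```python
import re

def lint_reply(draft: str) -> list[str]:
    """Flag drafts that break the reply guidance: 2-3 sentences, no URLs, no tokens."""
    problems = []
    # Naive sentence split on terminal punctuation; good enough for a lint pass.
    sentences = [s for s in re.split(r"[.!?]+\s*", draft.strip()) if s]
    if not 2 <= len(sentences) <= 3:
        problems.append(f"expected 2-3 sentences, found {len(sentences)}")
    if re.search(r"https?://|www\.", draft):
        problems.append("contains a URL")
    if re.search(r"\{\{.*?\}\}", draft):  # assumed token syntax, e.g. {{first_name}}
        problems.append("contains a personalization token")
    return problems
```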

## Target distribution (for eval set synthesis)

Rough production mix we want the eval set to reflect:

| Category | Share | | Urgency | Share |
|---|---|---|---|---|
| auth | 25% | | low | 30% |
| billing | 20% | | medium | 45% |
| bug | 15% | | high | 20% |
| account | 15% | | critical | 5% |
| feature_request | 10% | | | |
| outage | 10% | | | |
| other | 5% | | | |
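Under this target mix, a 60-ticket eval set works out to fixed per-category counts. A quick sanity check (the dict and function names are ours, and the drift-correction strategy is one reasonable choice, not necessarily what the synthesis pipeline does):

```python
CATEGORY_MIX = {"auth": 0.25, "billing": 0.20, "bug": 0.15, "account": 0.15,
                "feature_request": 0.10, "outage": 0.10, "other": 0.05}

def stratified_counts(total: int, mix: dict[str, float]) -> dict[str, int]:
    """Round each share to the nearest ticket, then absorb any rounding
    drift into the largest bucket so counts sum to the total."""
    counts = {k: round(total * v) for k, v in mix.items()}
    drift = total - sum(counts.values())
    if drift:
        counts[max(counts, key=counts.get)] += drift
    return counts
```

For `total=60` this gives 15 auth, 12 billing, 9 bug, 9 account, 6 feature_request, 6 outage, and 3 other.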
| Metric | Value | Note |
|---|---|---|
| Aggregate score | 94.4% | +8.3 pp vs baseline (86.1%) |
| Mean latency | 5288 ms | +1521 ms vs baseline (3767 ms) |
| Live-test tickets | 18 | of 60 total · 42 used for training |
| Target model | anthropic/claude-sonnet-4-6 | same model, both programs |
| Compile time | 11m 19s | 876/900 rollouts · 23 iterations |

Per-field accuracy

How this was compiled

GEPA (Genetic-Pareto Evolution) optimized the prompt by repeatedly drafting variations, testing them against training tickets, and using a stronger LM to read failures and propose refinements. It stops when the budget runs out or improvements plateau.

- Total tickets: 60, synthesized from the task spec, stratified 70/30 by category (seed=42)
- Training tickets: 42 — GEPA’s working set; it drafts candidates and tests them against these
- Live-test tickets: 18 — held out from GEPA entirely; all headline numbers above are scored on this split only
- Budget preset: medium — ~1600 rollouts, ~30–50 candidate prompts tested
- Target LM: anthropic/claude-sonnet-4-6 — runs the prompt on each ticket
- Reflection LM: anthropic/claude-opus-4-7 — reads failures, proposes better drafts
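In simplified terms, the compile loop works like this (a deliberately toy, self-contained sketch — the real optimizer in the `gepa`/`dspy` packages maintains a whole Pareto frontier of candidates rather than a single best, and the `evaluate`/`reflect` callbacks stand in for target-LM rollouts and reflection-LM calls):

```python
def compile_prompt(seed_prompt, trainset, evaluate, reflect, budget=1600):
    """Toy GEPA-style loop: score a candidate on the trainset, let a
    reflection step propose a revision, keep the better of the two."""
    best_prompt = seed_prompt
    best_score = evaluate(best_prompt, trainset)
    rollouts = len(trainset)  # one rollout per (prompt, ticket) pair
    while rollouts + len(trainset) <= budget:
        candidate = reflect(best_prompt, trainset)  # stronger LM proposes a draft
        score = evaluate(candidate, trainset)
        rollouts += len(trainset)
        if score > best_score:
            best_prompt, best_score = candidate, score
    return best_prompt, best_score
```

The loop terminates when the next full evaluation would exceed the rollout budget, mirroring the "stops when the budget runs out" behavior described above.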

Baseline → Compiled

The prompt as handed to us (left) vs. what GEPA produced (right). The size difference alone tells part of the story; the structure difference is where the accuracy gains come from.

Baseline (367 chars)

You are a support ticket triage assistant. Classify the incoming support ticket and suggest a response. Output JSON with these fields:
- category: auth, billing, bug, feature_request, outage, account, or other
- urgency: low, medium, high, or critical
- routing: tier1, tier2, engineering, oncall, or billing_team
- reply_draft: a brief response to send the customer

Compiled (7508 chars)

You are a support ticket triage assistant. You will receive a support ticket (free-form customer message) and must classify it and draft a reply. Output ONLY a JSON object with exactly these fields:
- category: one of auth, billing, bug, feature_request, outage, account, or other
- urgency: one of low, medium, high, or critical
- routing: one of tier1, tier2, engineering, oncall, or billing_team
- reply_draft: a brief, empathetic response to send the customer

## Category definitions and rules
- auth: login issues, password resets, MFA/2FA, SSO problems. IMPORTANT: When a ticket combines an account lifecycle request (like GDPR deletion) with an active login/2FA block that prevents the user from proceeding, classify as "auth" — the login block is the immediate gating issue that must be resolved first, and it outweighs the account-management intent in classification.
- billing: invoices, charges, refunds, payment methods, subscription/plan changes, checkout/payment processing failures. Billing covers the end-to-end payment flow even when something is "broken" in checkout — a failed upgrade attempt, declined card, or checkout not going through is a BILLING issue, not a bug, unless the ticket clearly describes a non-payment product defect. If the issue is a billing data/document correction caused by a system glitch (duplicate charges from a retry loop, wrong invoice), it is still "billing". Only classify as "bug" when the core symptom is a product feature malfunctioning outside the payment flow (e.g., notifications not delivering, settings not persisting for a specific user).
- bug: product defects, features not working as configured, unexpected errors, broken functionality for a specific user/account (outside of payment/checkout). Use this only when the core issue is a non-billing functional defect.
- feature_request: asks for new capabilities or enhancements.
- outage: broad service disruption. Overrides other categories when the ticket indicates impact to "all users", "everyone", "down for everyone", company-wide, or region-wide failure. A single user's checkout failing is NOT an outage. A customer reporting their "whole team" is down/getting errors counts as an outage from our triage perspective.
- account: account management tasks like changing email, name, profile info, CLOSING/DELETING an account, data export requests (including GDPR deletion/export), seat management (non-defect). Account deletion and data export requests are "account" even when the same ticket also mentions a minor bug — the primary, user-stated intent (account closure/export) wins over a secondary "annoying but not a dealbreaker" defect. EXCEPTION: if the user is actively blocked by an auth/login failure (e.g., 2FA broken, can't log in), classify as "auth" instead.
- other: anything that doesn't fit above, including extremely vague tickets like "doesn't work" with no discernible category.

Multi-issue tickets: classify by the most technically significant / highest-severity issue — BUT:
- When the user's primary request is an account lifecycle action (delete/export) and the secondary issue is explicitly minor, classify as "account" (unless an auth block is also present — then "auth").
- For tickets combining an invoice correction AND a clearly separate functional defect (e.g., notifications not arriving), classify as "bug".
- When auth (login/2FA/SSO) is blocking the user from doing anything else they've asked about, "auth" wins.

## Urgency rules
- critical: outages affecting all/many users; security concerns (data leak, unauthorized access, suspected breach); severe data loss. Security concerns ALWAYS → critical (overrides everything else).
- high: angry customer with significant financial impact (e.g., duplicate/triple charges), production-impacting bug for a paying customer, urgent account lockout (including 2FA lockouts blocking further action), checkout/payment failure blocking a paying customer with a stated deadline. A multi-issue ticket where the user is locked out AND has financial discrepancies AND needs GDPR action should be "high".
- medium: functional issues affecting a subset of users, billing discrepancies without immediate financial harm, multi-issue tickets with non-urgent items, vague/ambiguous requests that could have material impact (e.g., "export my data" with no context, or "doesn't work" with no details — these default to medium because scope is unclear and the underlying issue could be material). Tickets containing urgency keywords ("urgent", "ASAP", "down", "today", "deadline") should be nudged up one tier from baseline.
- low: how-to questions, minor account settings, cosmetic issues. Feature requests are ALWAYS low urgency.

Override summary:
- Security concern → always critical.
- feature_request → always low.
- "urgent"/"ASAP"/"down"/deadline language → bump up one tier.
- Vague data/export/account requests with unclear scope → medium (not low).
- Vague "doesn't work"-type tickets with no specifics → medium (not low), because the unknown scope could hide a material issue.
- Active account lockout (can't log in, 2FA broken) combined with other significant issues → high.

## Routing rules (apply in order, first match wins)
1. Any urgency = critical → oncall.
2. Outage category → oncall.
3. Billing category → billing_team. This applies even to checkout/payment processing failures and even when urgency is "high" — billing_team handles billing issues unless the root cause is a confirmed non-payment system defect (a "billing bug" affecting functionality outside the payment flow), in which case → engineering.
4. Bug category with complex symptoms (delivery failures, data inconsistencies, integrations, anything requiring code investigation) → engineering.
5. Simple bugs that may be config/user error → tier2.
6. Auth issues → tier1 by default; tier2 if complex (SSO problems, MFA/2FA lockouts, or auth combined with other issues requiring cross-team coordination).
7. Feature requests → tier1 (for initial logging/acknowledgement).
8. Account management (including deletion, data export, GDPR requests, profile changes) → tier1. Tier1 handles these even when the ticket also mentions a minor secondary defect.
9. Default → tier1.

Important: do not route billing/checkout failures to engineering by default. The billing_team triages payment issues first and escalates internally if needed.

## Reply draft guidelines
- Be concise, empathetic, and professional.
- Acknowledge each distinct issue raised (numbered list is fine for multi-issue).
- For angry/frustrated customers or those with deadlines, lead with a sincere apology, acknowledge frustration, and commit to a concrete next step and timeframe (without promising specific refund amounts or SLAs).
- For how-to questions, provide clear step-by-step instructions.
- Ask for any info needed to resolve (affected email address, browser/device, approximate times, last 4 of payment method, invoice number, etc.).
- For vague requests, ask clarifying questions to determine scope (what's not working, steps taken, error messages, account email).
- For multi-issue tickets, address the blocking issue (e.g., auth lockout) first so the customer can proceed.
- For outages, ask for start time/timezone, affected regions/locations, and any specific error messages to help triage.
- Do not promise specific refund amounts or SLAs beyond general commitments.
- Brief sign-off; no formal signature block.

Output ONLY the JSON object with the four fields.

Per-example scores (held-out live-test split)

| # | Ticket (preview) | Baseline | Compiled |
|---|---|---|---|
| 1 | I want to permanently delete my account and have all my data wiped per GDPR. Please confir… | 1.00 | 0.75 |
| 2 | hi i think someone logged into my account from russia last night, i got a notification and… | 0.50 | 1.00 |
| 3 | I can't log into my account. Tried resetting the password twice but the email never arrive… | 0.75 | 1.00 |
| 4 | hi i think someone else logged into my account?? i got an email about a login from germany… | 1.00 | 1.00 |
| 5 | hi think someone logged into my account from russia, got an email about a new device and i… | 1.00 | 1.00 |
| 6 | My card was declined when I tried to upgrade to the Pro plan just now at checkout. I need … | 1.00 | 1.00 |
| 7 | my card keeps getting declined when i try to checkout for the annual plan. ive tried 3 dif… | 1.00 | 1.00 |
| 8 | Tried to upgrade to the Pro plan at checkout and got error code PAY_4021. Card was decline… | 1.00 | 1.00 |
| 9 | When I click the Save button on the report builder I get error code RPT_5532 and a stack t… | 1.00 | 1.00 |
| 10 | exporting a CSV with more than ~5000 rows throws Error E_TIMEOUT_4412 in the console. smal… | 0.75 | 1.00 |
| 11 | It would be really nice if you could add a dark mode toggle to the mobile app. The white b… | 1.00 | 1.00 |
| 12 | Hello team, it would be wonderful if the dashboard supported dark mode. Many of us work la… | 1.00 | 1.00 |
| 13 | Bonjour, j'ai un problème avec mon compte et je ne peux pas accéder à mes factures. Pouvez… | 0.75 | 1.00 |
| 14 | Bonjour, je n'arrive pas à changer l'adresse email associée à mon compte. Pouvez-vous m'ai… | 0.50 | 0.50 |
| 15 | pls help | 0.75 | 1.00 |
| 16 | Bonjour, je n'arrive pas à modifier mon adresse email dans les paramètres. Pouvez-vous m'a… | 0.50 | 0.75 |
| 17 | URGENT!!! our entire team cannot access the dashboard, everyone is getting a 503. this is … | 1.00 | 1.00 |
| 18 | The entire app is down for our whole team — nobody can load the dashboard, we're all getti… | 1.00 | 1.00 |
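The headline numbers can be recomputed directly from this table: each per-ticket score is already the four field accuracies averaged, so the aggregate is just the mean of a column.

```python
baseline = [1.00, 0.50, 0.75, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00,
            0.75, 1.00, 1.00, 0.75, 0.50, 0.75, 0.50, 1.00, 1.00]
compiled = [0.75, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00,
            1.00, 1.00, 1.00, 1.00, 0.50, 1.00, 0.75, 1.00, 1.00]

def aggregate(scores: list[float]) -> float:
    """Mean per-ticket score, expressed as a percentage to one decimal."""
    return round(100 * sum(scores) / len(scores), 1)

print(aggregate(baseline), aggregate(compiled))  # 86.1 94.4
```

This reproduces the 86.1% baseline and 94.4% compiled scores, and the +8.3 pp delta, from the 18 live-test rows alone.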

Inputs to this run

What went into the optimization. The service supports three kinds of engagement data: real production traces, hand-labeled examples, or synthetic data generated from a task spec. This run used the synthetic path — the typical cold-start / pre-launch scenario.

Used

- Task specification: evals/task-spec.md — shown inline at the top of this report
- Original prompt: prompts/baseline.md — shown in the “Baseline → Compiled” panel above (367 chars)
- Evaluation data: 60 examples synthesized from the task spec via the reflection LM (anthropic/claude-opus-4-7)
- Metric weights: default equal weights (0.25 per field); configurable via evals/metric-weights.json
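With the default equal weights, the per-ticket metric reduces to a weighted field match. A minimal sketch (the classification fields are exact-match per the report; how `reply_draft` is actually scored isn't shown here, so the judge callback below is a stand-in):

```python
WEIGHTS = {"category": 0.25, "urgency": 0.25, "routing": 0.25, "reply_draft": 0.25}

def ticket_score(pred: dict, gold: dict, reply_judge=None) -> float:
    """Weighted per-field score in [0, 1]. Classification fields are
    exact-match; reply_draft needs a judge callback (an LM judge in the
    real pipeline, stubbed here as exact match)."""
    score = 0.0
    for field in ("category", "urgency", "routing"):
        score += WEIGHTS[field] * (pred.get(field) == gold.get(field))
    judge = reply_judge or (lambda p, g: float(p == g))
    score += WEIGHTS["reply_draft"] * judge(pred.get("reply_draft"), gold.get("reply_draft"))
    return score
```

One wrong field costs exactly 0.25, which is why the per-example table contains only the values 1.00, 0.75, and 0.50.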

Not used in this run

- Production trace logs: used when clients can export real traffic from their observability tools (LangSmith, Langfuse, Helicone, OpenTelemetry, or plain JSONL). The metric becomes the business outcome directly — CSAT, escalation rate, deflection — so no hand-crafted weights are needed. Typical for retainer engagements where data compounds over time.
- Hand-labeled examples: used when clients can provide 20+ input/output pairs they've labeled themselves. Typical for pre-launch or internal-testing scenarios where some real data exists but not in production volume.

Tools, libraries and versions

Exact versions installed during this run — useful for reproducing the numbers.

| Package | Version |
|---|---|
| Python | 3.11.14 |
| dspy-ai | 3.1.3 |
| dspy | 3.1.3 |
| gepa | 0.0.26 |
| litellm | 1.83.0 |
| openai | 2.32.0 |
| anthropic | (not installed) |

Execution environment

Hardware and OS the pipeline ran on.

- OS: Darwin 25.2.0
- Architecture: arm64
- CPU: Apple M4 (10 cores)
- Memory: 16.0 GB

Terminology

- **GEPA** — Genetic-Pareto Evolution — the optimizer. Drafts variations of the prompt, tests them, uses a stronger LM (the reflection LM) to read failures and propose better drafts.
- **Baseline** — The original prompt as handed to us, scored on the live-test set. The starting point.
- **Compiled** — GEPA’s optimized version of the baseline, produced by the run.
- **Rollout** — One evaluation: the target LM runs one draft prompt on one training ticket.
- **Candidate** — A distinct draft version of the prompt tested during compilation.
- **Pareto frontier** — The set of candidates where none strictly dominates another (one wins on category, another wins on urgency, none wins on all dimensions). GEPA picks the champion from this frontier.
- **light (budget)** — ~550 rollouts, ~10–15 candidates tested. Fastest and cheapest preset; best for proof-of-concept runs. Results are real but trade-offs may appear (one field regresses while others improve) because the Pareto frontier is small.
- **medium (budget)** — ~1600 rollouts, ~30–50 candidates. The typical default for production engagements. A larger Pareto frontier means more balanced candidates are available to pick from.
- **heavy (budget)** — ~5000 rollouts, ~100+ candidates. Maximum search depth. Used when every last point of lift matters or the metric is unusually noisy; diminishing returns past medium for most tasks.
- **Target LM** — The model that runs the prompt on each ticket (Claude Sonnet here). Usually the cheaper, faster model — it does the bulk of the work.
- **Reflection LM** — The model that reads failure traces and proposes better drafts (Claude Opus here). Usually stronger and more expensive; called less often but drives the optimization signal.
- **Training tickets** — Examples GEPA sees during compilation. It drafts and tests candidates against these.
- **Live-test tickets** — Examples held out from GEPA entirely. Scored only after compile — this is the generalization number.
- **Aggregate score** — The four field-accuracy numbers averaged per ticket, then averaged across all tickets. The headline number.
- **Percentage points (pp)** — The delta between two percentages on a 0–100 scale. “+5 pp” means the score moved from 86% to 91%, not from 86% to (86% × 1.05).
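The Pareto-frontier idea is easy to make concrete: given per-dimension scores for each candidate, keep those not strictly dominated by any other. A self-contained sketch (not GEPA's actual implementation):

```python
def pareto_frontier(candidates: dict[str, tuple[float, ...]]) -> set[str]:
    """Return the candidates for which no other candidate is >= on every
    dimension and strictly > on at least one."""
    def dominates(a: tuple[float, ...], b: tuple[float, ...]) -> bool:
        return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))
    return {name for name, score in candidates.items()
            if not any(dominates(other, score)
                       for oname, other in candidates.items() if oname != name)}
```

With scores on (category accuracy, urgency accuracy), a candidate at (0.9, 0.7) and one at (0.7, 0.9) both survive, while (0.6, 0.6) is dominated and dropped.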