A fixed 2-week engagement. You hand us your production LLM prompt plus eval data (production traces, labeled examples, a code or content corpus, or a written task spec we synthesize from). We hand back a drop-in replacement and a quantified before/after report, compiled with GEPA (genetic-Pareto reflective prompt evolution). No fine-tuning, no framework lock-in, no open-ended consulting.
Every team running an LLM feature has a prompt somewhere, usually a string literal in a codebase, written quickly by whoever needed to ship and barely touched since. It mostly works. Sometimes it misclassifies. Sometimes it drifts after a model update. Sometimes it's quietly costing you more tokens than it should.
Improving it systematically is expensive: you would need to build an evaluation harness, curate test data, design a scoring function, and iterate through dozens of prompt drafts to find one that actually moves the metric. Most teams never get past "good enough."
We do the systematic version, against your real data.
Week 1 is intake and baseline. Week 2 is the GEPA compile, a review iteration, and handoff. Fixed scope from day one.
Your prompt can come in any form you have: a raw string, a DSPy program, a LangChain pipeline, or any mix of these. We wrap whichever form you bring into a DSPy module for optimization; your stack stays untouched.
Before the baseline can run, we also settle two things: the output schema (which fields the prompt produces) and the scoring function (how we compare output to ground truth). Both are agreed async during intake, prior to the compile.
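As a rough picture of what those two agreements look like in code, here is a minimal sketch for the ticket-triage case. The `TriageOutput` class, the `score_fields` function, and the comparison rules (case-insensitive exact match for labels, a non-empty check for the reply draft) are illustrative assumptions, not the scoring logic a real engagement would necessarily settle on.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class TriageOutput:
    """Hypothetical output schema: the fields the prompt must produce."""
    category: str
    urgency: str
    routing: str
    reply_draft: str


def score_fields(predicted: TriageOutput, expected: TriageOutput) -> dict:
    """Compare one prediction to ground truth, field by field (1.0 or 0.0)."""
    p, e = asdict(predicted), asdict(expected)
    scores = {}
    # Assumption: label fields are scored by case-insensitive exact match.
    for name in ("category", "urgency", "routing"):
        scores[name] = 1.0 if p[name].strip().lower() == e[name].strip().lower() else 0.0
    # Assumption: the free-text field only has to be non-empty in this sketch.
    scores["reply_draft"] = 1.0 if p["reply_draft"].strip() else 0.0
    return scores


expected = TriageOutput("billing", "high", "finance-team", "Thanks for reaching out.")
predicted = TriageOutput("Billing", "low", "finance-team", "We are looking into it.")
print(score_fields(predicted, expected))
```

In this example the prediction matches on category and routing, misses on urgency, and passes the reply-draft check.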
We then run your current prompt against a held-out split the optimizer never sees, so the before/after number is honest. You pick one of four ways to supply the data:
Production traces. Imported from LangSmith, Langfuse, Helicone, OpenTelemetry, or plain JSONL. 100 to 500 records is ideal. This is the highest-signal source because it matches your real distribution.
Labeled examples. 20+ records as JSONL or CSV, one record per line. Each pairs an input with the expected output fields (classification, extraction, structured generation, etc.).
Code or content corpus. A code repository, document collection, or other file-level dataset the prompt runs over. We sample a held-out set and draft expected outputs with your team at intake.
Written task spec. You fill out a 10-section intake template (30 to 60 min). We synthesize 30 to 50 examples from it, draft expected outputs, and hand them back for your review before the compile.
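Whichever source you pick, the records end up in roughly the same place: a list of input/expected pairs, split so that part of it is never shown to the optimizer. A minimal sketch follows; the `input` and `expected` keys, the 30% held-out fraction, and the helper names are assumptions for illustration, not a required format.

```python
import json
import random


def load_jsonl(text: str) -> list:
    """Parse JSONL: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def split_held_out(records: list, held_out_fraction: float = 0.3, seed: int = 0):
    """Shuffle deterministically, then reserve a slice the optimizer never sees."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_held = max(1, round(len(shuffled) * held_out_fraction))
    return shuffled[n_held:], shuffled[:n_held]  # (train, held_out)


# Hypothetical labeled examples, 20 records in JSONL form.
raw = "\n".join(
    json.dumps({"input": f"ticket {i}", "expected": {"urgency": "high" if i % 2 else "low"}})
    for i in range(20)
)
train, held_out = split_held_out(load_jsonl(raw))
print(len(train), len(held_out))  # 14 6
```

Fixing the seed keeps the split reproducible, so a later recompile scores against the same held-out records.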
GEPA drafts hundreds of prompt variations, scores each on your training split, and uses a stronger reflection model (Claude Opus by default) to read failure traces and propose improvements. The loop is iterative and feedback-driven, loosely comparable to reinforcement learning, but the only thing that changes is the prompt text. No model weights are touched.
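The "Pareto" part of the name is easy to picture with a toy filter: keep every candidate that no other candidate beats across the board. The sketch below is an illustration of that idea only, not GEPA's implementation; the candidate names and score vectors are made up, and what each axis represents (an example, a field) is left as an assumption.

```python
def dominates(a: tuple, b: tuple) -> bool:
    """True if a is at least as good as b on every axis and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(candidates: dict) -> dict:
    """Keep candidates that are not dominated by any other candidate."""
    return {
        name: scores
        for name, scores in candidates.items()
        if not any(
            dominates(other, scores)
            for other_name, other in candidates.items()
            if other_name != name
        )
    }


# Hypothetical prompt candidates with one score per axis.
candidates = {
    "v0": (0.83, 0.72, 0.89),
    "v1": (0.89, 0.70, 0.89),
    "v2": (0.80, 0.70, 0.85),
    "v3": (0.86, 0.94, 0.89),
}
print(sorted(pareto_front(candidates)))  # ['v1', 'v3']
```

Here `v1` and `v3` both survive because each is best on a different axis, which is what gives a later rebalancing step real options to choose between.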
We review the before/after with you. If a priority field regressed, we rebalance how much each output field counts toward the score and run one more compile. One iteration is included in fixed scope.
Swap the compiled prompt string into your codebase. You also get the evaluation set, the scoring function, the before/after HTML report, and a recompile script, so you can re-run the compile yourself from scratch whenever a new model version ships.
New to GEPA, field weights, or held-out splits? Quick definitions in the terminology section at the bottom.
We ran the full pipeline on a canonical ticket-triage task, the kind of LLM feature a typical SMB has in production. Here is what a single engagement produced.
| Field | Baseline | Compiled | Delta |
|---|---|---|---|
| category | 83.3% | 88.9% | +5.6pp |
| urgency | 72.2% | 94.4% | +22.2pp |
| routing | 88.9% | 94.4% | +5.6pp |
| reply_draft | 100% | 100% | held |
| aggregate | 86.1% | 94.4% | +8.3pp |
Held-out validation split of 18 examples. GEPA never saw them during optimization.
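The aggregate row can be checked by hand. Assuming it is the equal-weight mean of the four field scores (the starting point for a v1 compile), the arithmetic works out as follows:

```python
# Per-field accuracies copied from the table above, in percent.
baseline = {"category": 83.3, "urgency": 72.2, "routing": 88.9, "reply_draft": 100.0}
compiled = {"category": 88.9, "urgency": 94.4, "routing": 94.4, "reply_draft": 100.0}


def aggregate(per_field: dict) -> float:
    """Equal-weight mean across fields (an assumption about how the row is built)."""
    return sum(per_field.values()) / len(per_field)


before = aggregate(baseline)
after = aggregate(compiled)
print(f"{before:.1f}% -> {after:.1f}% ({after - before:+.1f}pp)")
# 86.1% -> 94.4% (+8.3pp)
```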
We ran this on medium budget. Light is faster and cheaper for smaller eval sets; heavy runs longer and explores more candidates when the eval set is large or the score is a production outcome signal (CSAT, escalation, deflection). We pick the tier at intake.
Everything ships as code, data, and documentation your team can own, re-run, and audit.
A plain-text replacement for your current prompt literal. Works with any API-accessible model (Claude, GPT, etc.) and any framework, or no framework at all.
The optimized version as a DSPy module, for teams adopting DSPy or building a DSPy-native pipeline. Optional; most clients use the string.
The data we scored against and the scoring logic, as your IP. You can re-run the evaluation yourself whenever you want.
Self-contained, committable, renderable as PDF. Includes methodology, per-field breakdowns, trade-off analysis, and the full compiled prompt.
If the target model (Claude, GPT) ships a new version within 30 days and behavior shifts, we re-run the compile on your eval set for free.
One shape only. Fixed scope, fixed fee, two weeks kickoff to handoff. Out-of-scope items are explicit so there is no scope creep to negotiate.
The questions we get on almost every intro call.
No. We wrap your raw prompt in DSPy internally so the optimizer can evolve it, then extract the result back as a plain string at handoff. You can integrate that string into whatever you already use: plain API calls, LangChain, your own wrapper.
We agree on a metric at intake, then score both your original prompt and the compiled version on a held-out evaluation set (examples the optimizer never saw during compilation). All artifacts, numbers, and the scoring function ship in the delivered repo. You can reproduce the numbers yourself on your own hardware.
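In outline, reproducing a before/after number means running both prompts over the same held-out examples with the same metric. The sketch below uses two stub functions as stand-ins for "model plus original prompt" and "model plus compiled prompt"; the stubs, the four examples, and the exact-match metric are all hypothetical.

```python
def accuracy(predict, examples: list) -> float:
    """Fraction of held-out examples where the prediction equals the label."""
    hits = sum(1 for ex in examples if predict(ex["input"]) == ex["expected"])
    return hits / len(examples)


# Stand-ins for real LLM calls; a real run would call your model with each prompt.
def baseline_predict(text: str) -> str:
    return "urgent" if "down" in text.lower() else "normal"


def compiled_predict(text: str) -> str:
    return "urgent" if any(k in text.lower() for k in ("down", "outage")) else "normal"


# Hypothetical held-out examples the optimizer never saw.
held_out = [
    {"input": "Site is down", "expected": "urgent"},
    {"input": "Question about invoice", "expected": "normal"},
    {"input": "Outage in EU region", "expected": "urgent"},
    {"input": "Cannot log in since morning", "expected": "urgent"},
]

before = accuracy(baseline_predict, held_out)
after = accuracy(compiled_predict, held_out)
print(f"{before:.1%} -> {after:.1%} ({(after - before) * 100:+.1f}pp)")
# 50.0% -> 75.0% (+25.0pp)
```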
We assign each output field a weight that answers one question: what share of the total score does this field count for? For the v1 compile we start with equal weights (say, 25% each across four fields). We walk the per-field breakdown with you, and if a field that matters regressed, we rebalance for a v2 compile. For example, shifting from 25/25/25/25 to 10/10/70/10 tells GEPA to prefer candidates that preserve routing accuracy, even at some cost to other fields. No ML expertise required on your end; we draft the weights with you at intake.
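To see why rebalancing changes which candidate wins, here is a small sketch of a weighted score. The two candidates and their per-field accuracies are invented for illustration; only the field names and the 25/25/25/25 and 10/10/70/10 weightings come from the text above.

```python
FIELDS = ("category", "urgency", "routing", "reply_draft")


def weighted_score(field_scores: dict, weights: dict) -> float:
    """Weighted mean of per-field scores; weights are normalized by their sum."""
    total_weight = sum(weights[f] for f in FIELDS)
    return sum(field_scores[f] * weights[f] for f in FIELDS) / total_weight


equal = dict(zip(FIELDS, (25, 25, 25, 25)))
routing_heavy = dict(zip(FIELDS, (10, 10, 70, 10)))

# Hypothetical candidates: A is stronger overall, B is stronger on routing.
candidate_a = {"category": 0.95, "urgency": 0.95, "routing": 0.85, "reply_draft": 1.0}
candidate_b = {"category": 0.88, "urgency": 0.90, "routing": 0.95, "reply_draft": 1.0}

for label, weights in (("equal", equal), ("routing-heavy", routing_heavy)):
    a = weighted_score(candidate_a, weights)
    b = weighted_score(candidate_b, weights)
    print(f"{label}: A={a:.4f} B={b:.4f} -> prefer {'A' if a > b else 'B'}")
```

Under equal weights candidate A scores higher; under the routing-heavy weights candidate B does, which is the behavior the v2 compile relies on.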
Two options: bump up that field's weight so the optimizer prefers candidates that get it right, or increase the compile budget so GEPA evaluates more prompt variations and has more balanced options to choose from. One iteration of this kind is included in fixed scope.
Budget controls how many prompt variations GEPA drafts and evaluates during compilation. Light is faster and cheaper, good for smaller eval sets or pre-launch iteration. Medium is our default and is what the reference case uses. Heavy runs longer and explores more candidates, useful when the eval set is large or the metric is a production-scale outcome (CSAT, escalation, deflection). We pick the tier at intake based on your data size and goals.
Yes. DSPy supports multi-module programs and GEPA can optimize all components jointly. Multi-step engagements are larger than the standard fixed-scope shape; we scope them separately at intake.
For engagements with production traces, we scan for PII before the optimization run and redact as needed. We can operate under your API keys (so tokens never hit our infrastructure) or under a dedicated account we manage. Data residency (US, EU, on-prem) is configurable at intake.
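For a sense of what redaction means at the simplest level, the sketch below masks email addresses and US-style phone numbers with regular expressions. It is a deliberately naive illustration, not the scanner used in an engagement; the patterns and placeholder tokens are assumptions, and real PII detection covers far more than these two cases.

```python
import re

# Naive patterns for illustration only; they will miss many real-world formats.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def redact(text: str) -> str:
    """Replace matched emails and phone numbers with placeholder tokens."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


print(redact("Reach jane.doe@example.com or 555-123-4567."))
# Reach [EMAIL] or [PHONE].
```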
A short glossary for the terms used above and in the reference report. Ordered by concept flow, not alphabetically.
category, urgency, routing, reply_draft. These are the rows of the per-field table.
A 20-minute intro call covers your current setup, the data you already have, and what "better" means for your team. No prep needed. If the engagement is a fit, we schedule the 2-week window from there.