We optimize production prompts against your real evaluation data using GEPA (Genetic-Pareto reflective prompt evolution). Fixed-scope engagements. Drop-in replacement. No fine-tuning, no framework adoption.
Every team running an LLM feature has a prompt somewhere, usually a string literal in a codebase, written quickly by whoever needed to ship and barely touched since. It mostly works. Sometimes it misclassifies. Sometimes it drifts after a model update. Sometimes it's quietly costing you more tokens than it should.
Improving it systematically is expensive: you would need to build an evaluation harness, curate test data, design a scoring function, and iterate through dozens of prompt drafts to find one that actually moves the metric. Most teams never get past "good enough."
We do the systematic version, against your real data.
Simple intake, a GEPA compile against your data, and a drop-in replacement you can ship.
Your prompt can come in any form you have: a raw string, a DSPy program, a LangChain pipeline, or all three if that is how your stack is wired.
We use your data for two things: GEPA scores prompt variants against it during compilation, and we hold out a split the optimizer never sees, so the before/after score is honest. Accepted inputs are flexible: production traces from your observability tool (LangSmith, Langfuse, Helicone, OpenTelemetry, or plain JSONL), 20+ hand-labeled examples, or a written task specification we synthesize examples from.
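One way the holdout can be carved off deterministically (a sketch, not our actual tooling; names are illustrative): hash a stable example ID instead of shuffling, so the same traces land in the held-out set on every recompile.

```python
import hashlib

def split_examples(examples, holdout_pct=20, key="id"):
    """Deterministic split: hashing a stable ID (rather than random
    shuffling) keeps the held-out set identical across recompiles."""
    optimize, holdout = [], []
    for ex in examples:
        bucket = int(hashlib.sha256(str(ex[key]).encode()).hexdigest(), 16) % 100
        (holdout if bucket < holdout_pct else optimize).append(ex)
    return optimize, holdout

traces = [{"id": i, "ticket": f"ticket {i}"} for i in range(100)]
opt_set, held_out = split_examples(traces)
```

Because the split is a pure function of each example's ID, re-running the recompile script months later scores against the exact same holdout.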
GEPA drafts hundreds of prompt variations, tests each on your data, and uses a stronger LM to read failure traces and propose improvements. Think of it as RLHF for prompts: the same iterative feedback loop, but the only thing that changes is the prompt text. No model weights are touched.
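A drastically simplified, pure-Python sketch of that loop's shape: a single-candidate hill climb with a stubbed-out scorer and reflection step. Real GEPA maintains a Pareto frontier of candidates and uses a stronger LM for the reflective rewrite; everything named here is illustrative only.

```python
import random

def score(prompt, example):
    # Stand-in scorer: rewards prompts that mention the cue words this
    # example needs. A real metric scores actual model outputs.
    return sum(cue in prompt for cue in example["cues"]) / len(example["cues"])

def propose_variant(prompt, failures):
    # Stand-in for the reflective step: real GEPA has a stronger LM read
    # failure traces and rewrite the prompt. Here we just graft in a cue
    # taken from one failing example.
    ex = random.choice(failures)
    missing = [c for c in ex["cues"] if c not in prompt]
    return prompt + " Consider " + missing[0] + "." if missing else prompt

def compile_prompt(seed_prompt, evalset, rounds=20):
    random.seed(0)  # deterministic for the demo
    best = seed_prompt
    best_score = sum(score(best, ex) for ex in evalset) / len(evalset)
    for _ in range(rounds):
        failures = [ex for ex in evalset if score(best, ex) < 1.0]
        if not failures:
            break  # every example passes; nothing left to learn from
        candidate = propose_variant(best, failures)
        cand_score = sum(score(candidate, ex) for ex in evalset) / len(evalset)
        if cand_score > best_score:  # keep only strict improvements
            best, best_score = candidate, cand_score
    return best, best_score

evalset = [{"cues": ["urgency", "category"]}, {"cues": ["routing"]}]
compiled, final = compile_prompt("Triage the support ticket.", evalset)
```

The point of the sketch is the control flow: score on the eval set, read failures, propose a targeted rewrite, keep it only if the metric moves.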
Swap the compiled prompt string into your codebase. You also get the evaluation set, metric function, before/after HTML report, and a recompile script so you can re-run yourself whenever a new model version ships.
We ran the full pipeline on a canonical ticket-triage task, the kind of LLM feature a typical SMB has in production. Here is what a single engagement produced.
| Field | Baseline | Compiled | Delta |
|---|---|---|---|
| category | 83.3% | 88.9% | +5.6pp |
| urgency | 72.2% | 94.4% | +22.2pp |
| routing | 88.9% | 94.4% | +5.6pp |
| reply_draft | 100% | 100% | held |
| aggregate | 86.1% | 94.4% | +8.3pp |
Held-out validation split of 18 examples. GEPA never saw them during optimization.
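The aggregate row is the equal-weighted mean of the four per-field scores. Assuming each percentage is a correct count out of the 18 held-out examples (83.3% ≈ 15/18, and so on), the table can be reproduced in a few lines:

```python
# Per-field correct counts implied by the percentages (n out of 18).
baseline = {"category": 15/18, "urgency": 13/18, "routing": 16/18, "reply_draft": 18/18}
compiled = {"category": 16/18, "urgency": 17/18, "routing": 17/18, "reply_draft": 18/18}

def aggregate(scores):
    # Equal-weighted mean across fields.
    return sum(scores.values()) / len(scores)

print(f"baseline: {aggregate(baseline):.1%}")  # 86.1%
print(f"compiled: {aggregate(compiled):.1%}")  # 94.4%
```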
We ran this at the medium budget tier. Light is faster and cheaper for smaller eval sets; heavy runs longer and explores more candidates for production-scale metrics. We pick the tier at intake.
Everything ships as code, data, and documentation your team can own, re-run, and audit.
A plain-text replacement for your current prompt literal. Works with any API-accessible model (Claude, GPT, etc.) and any framework, or no framework at all.
The optimized version as a DSPy module, for teams adopting DSPy or building a DSPy-native pipeline. Optional; most clients use the string.
The data we scored against and the scoring logic, as your IP. You can re-run the evaluation yourself whenever you want.
Self-contained, committable, renderable as PDF. Includes methodology, per-field breakdowns, trade-off analysis, and the full compiled prompt.
If the target model (Claude, GPT) ships a new version within 30 days and behavior shifts, we re-run the compile on your eval set for free.
This page pre-qualifies you. If the "not a fit" column describes your situation, we will say so on the intro call and, where we can, point you somewhere better.
Fixed-fee either way. Pick one-time for a pre-launch or post-model-change sprint. Pick retainer when production traffic makes continuous compounding worth it.
Both tiers are fixed-fee. Email hello@agenticstudiolabs.com for current pricing.
No telemetry in place yet? We can instrument your LLM feature with trace collection (LangSmith, Langfuse, Helicone, or OpenTelemetry) as a short precursor engagement, so the retainer has real data to work with.
Six questions we get on almost every intro call.
No. We wrap your raw prompt in DSPy internally so the optimizer can evolve it, then extract the result back as a plain string at handoff. You can integrate that string into whatever you already use: plain API calls, LangChain, your own wrapper.
We agree on a metric at intake, then score both your original prompt and the compiled version on a held-out evaluation set (examples the optimizer never saw during compilation). All artifacts, numbers, and the scoring function ship in the delivered repo. You can reproduce the numbers yourself on your own hardware.
We express priorities as metric weights. For engagements with real production outcome signals (CSAT, escalation, deflection), the metric is the business outcome directly. No hand-weighting needed. For synthetic or pre-launch engagements, we start equal-weighted across output fields and rebalance after you review the first compile.
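A minimal sketch of what "priorities as metric weights" can look like, assuming simple exact-match scoring per output field (field names and helpers here are hypothetical):

```python
def field_scores(gold, pred):
    # Exact-match per field; a real metric may use softer comparisons.
    return {f: float(gold[f] == pred[f]) for f in gold}

def weighted_score(gold, pred, weights):
    scores = field_scores(gold, pred)
    return sum(weights[f] * scores[f] for f in scores) / sum(weights.values())

gold = {"category": "billing", "urgency": "high", "routing": "tier2"}
pred = {"category": "billing", "urgency": "low",  "routing": "tier2"}

weighted_score(gold, pred, {"category": 1, "urgency": 1, "routing": 1})  # equal weighting
weighted_score(gold, pred, {"category": 1, "urgency": 3, "routing": 1})  # urgency-heavy
```

With equal weights the missed urgency field costs one third of the score; tripling its weight makes the same miss cost three fifths, which is exactly the lever we pull when rebalancing after your review.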
Two options: rebalance the metric weights to emphasize the field that matters and re-run, or increase the optimization budget so GEPA has a larger Pareto frontier and more balanced candidates. We include one such iteration in fixed scope.
Budget controls how many prompt variations GEPA drafts and evaluates during compilation. Light is faster and cheaper, good for smaller eval sets or pre-launch iteration. Medium is our default and is what the reference case uses. Heavy runs longer and explores a wider Pareto frontier, useful when the eval set is large or the metric is a production-scale outcome (CSAT, escalation, deflection). We pick the tier at intake based on your data size and goals.
Yes. DSPy supports multi-module programs and GEPA can optimize all components jointly. Multi-step engagements are scoped differently: typically 3 to 4 weeks with a larger iteration allowance.
For engagements with production traces, we scan for PII before the optimization run and redact as needed. We can operate under your API keys (so tokens never hit our infrastructure) or under a dedicated account we manage. Data residency (US, EU, on-prem) is configurable at intake.
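Illustrative only — a real engagement uses dedicated PII tooling, but the shape of a pre-run redaction pass looks like this (the patterns are deliberately simple stand-ins):

```python
import re

# Hypothetical patterns for demonstration; production scanning covers far
# more PII classes (names, addresses, account numbers, ...).
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def redact(text):
    # Replace each match with a typed placeholder so traces stay useful
    # for optimization without carrying the original PII.
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

redact("Contact jane.doe@example.com or +1 (555) 010-1234.")
# → "Contact [EMAIL] or [PHONE]."
```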
A 20-minute intro call covers your current setup, the data you have, what "better" means for your team, and whether a one-time or retainer engagement fits. No prep needed.