Prompt Optimizer · Agentic Studio Labs

Your LLM prompt,
measurably better.

We optimize production prompts against your real evaluation data using GEPA (genetic-Pareto reflective prompt evolution). Fixed-scope engagements. Drop-in replacement. No fine-tuning, no framework adoption.

Reference case: support ticket triage improved from 86.1% → 94.4% aggregate accuracy on a held-out test set, with 0 per-field regressions.
eval.run :: ticket-triage
$ prompt-optimizer --baseline
→ scoring 18 held-out examples...
aggregate: 86.1%
$ prompt-optimizer --compiled
→ compiling with GEPA (medium budget)...
✓ compile complete in 11m 19s
aggregate: 94.4% (+8.3pp)
regressions: 0
$

Your prompt is probably underperforming. Let's check it.

Every team running an LLM feature has a prompt somewhere, usually a string literal in a codebase, written quickly by whoever needed to ship and barely touched since. It mostly works. Sometimes it misclassifies. Sometimes it drifts after a model update. Sometimes it's quietly costing you more tokens than it should.

Improving it systematically is expensive: you would need to build an evaluation harness, curate test data, design a scoring function, and iterate through dozens of prompt drafts to find one that actually moves the metric. Most teams never get past "good enough."

We do the systematic version, against your real data.


Three steps, start to handoff

Simple intake, a GEPA compile against your data, and a drop-in replacement you can ship.

01

You hand us your prompt and your data.

Your prompt can come in any form you have: a raw string, a DSPy program, a LangChain pipeline, or all three if that is how your stack is wired.

We use your data for two things: GEPA scores prompt variants against it during compilation, and we hold out a split the optimizer never sees (your test data), so the before/after score is honest. Options are flexible: production traces from your observability tool (LangSmith, Langfuse, Helicone, OpenTelemetry, or plain JSONL), 20+ hand-labeled examples, or a written task specification we synthesize examples from.
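As a sketch of what that split looks like in practice (the file format and seed here are illustrative, not a required layout):

```python
import json
import random

def split_examples(jsonl_path, holdout_frac=0.25, seed=13):
    """Split labeled examples into an optimization set (GEPA scores against it)
    and a held-out set the optimizer never sees, used only for the honest
    before/after comparison."""
    with open(jsonl_path) as f:
        examples = [json.loads(line) for line in f]
    random.Random(seed).shuffle(examples)  # deterministic shuffle
    n_holdout = max(1, int(len(examples) * holdout_frac))
    return examples[n_holdout:], examples[:n_holdout]  # (optimize, holdout)
```

Fixing the seed keeps the split reproducible, so re-running the evaluation later scores the same held-out examples.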

02

We compile an optimized version using GEPA.

GEPA drafts hundreds of prompt variations, tests each on your data, and uses a stronger LM to read failure traces and propose improvements. It borrows the iterative feedback loop of RLHF, but nothing changes except the prompt text: no model weights are touched.
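For the curious, here is the shape of that loop as a stripped-down sketch. The mutate and score functions are stand-ins for GEPA's reflective prompt edits and your task metric; the real optimizer is considerably more sophisticated.

```python
import random

def dominated(a, b):
    """True if score vector a is dominated by b (b is at least as good on
    every example and strictly better on at least one)."""
    return all(x <= y for x, y in zip(a, b)) and a != b

def gepa_sketch(seed_prompt, mutate, score, examples, budget=50, seed=7):
    """Toy genetic-Pareto loop: draft prompt variants, score each one
    example-by-example, and keep only candidates no other candidate dominates."""
    rng = random.Random(seed)
    frontier = [(seed_prompt, [score(seed_prompt, ex) for ex in examples])]
    for _ in range(budget):
        parent, _ = rng.choice(frontier)  # pick a surviving candidate
        child = mutate(parent, rng)       # reflective edit stands in here
        child_scores = [score(child, ex) for ex in examples]
        # drop frontier members the child dominates; keep child unless dominated
        frontier = [(p, s) for p, s in frontier if not dominated(s, child_scores)]
        if not any(dominated(child_scores, s) for _, s in frontier):
            frontier.append((child, child_scores))
    return max(frontier, key=lambda cand: sum(cand[1]))[0]  # best aggregate
```

Keeping a Pareto frontier rather than a single best candidate is what lets the optimizer improve one field without silently sacrificing another.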

03

You get a drop-in replacement and a reproducible report.

Swap the compiled prompt string into your codebase. You also get the evaluation set, metric function, before/after HTML report, and a recompile script so you can re-run yourself whenever a new model version ships.
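The swap itself is minimal. A hypothetical sketch, where the file path and the client function are placeholders for whatever your codebase already uses:

```python
from pathlib import Path

def load_prompt(path="artifacts/compiled_prompt.txt"):
    """Read the compiled prompt artifact instead of a hardcoded string literal.
    The path is an assumed layout, not a fixed deliverable structure."""
    return Path(path).read_text()

def triage(ticket_text, prompt, call_llm):
    # call_llm is whatever client you already have: a plain API call,
    # a LangChain chain, your own wrapper.
    return call_llm(prompt + "\n\nTicket:\n" + ticket_text)
```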


Support ticket triage: a representative run

We ran the full pipeline on a canonical ticket-triage task, the kind of LLM feature a typical SMB has in production. Here is what a single engagement produced.

Per-field regressions: 0 (every field held or improved)
Compile time: 11m 19s (medium budget)
Field        Baseline   Compiled   Delta
category     83.3%      88.9%      +5.6pp
urgency      72.2%      94.4%      +22.2pp
routing      88.9%      94.4%      +5.6pp
reply_draft  100%       100%       held
aggregate    86.1%      94.4%      +8.3pp

Held-out split of 18 examples. GEPA never saw them during optimization.

We ran this on medium budget. Light is faster and cheaper for smaller eval sets; heavy runs longer and explores more candidates for production-scale metrics. We pick the tier at intake.
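The scoring behind the table above has this general shape: each field is scored independently, and the aggregate is their mean. An illustrative equal-weight version, not the exact production metric:

```python
def per_field_accuracy(predictions, labels, fields):
    """Score each output field independently across the held-out set,
    then average the per-field accuracies into the aggregate."""
    per_field = {
        f: sum(p[f] == l[f] for p, l in zip(predictions, labels)) / len(labels)
        for f in fields
    }
    per_field["aggregate"] = sum(per_field.values()) / len(fields)
    return per_field
```

Scoring per field is what makes the "0 regressions" claim checkable: a higher aggregate cannot hide a field that got worse.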


Five artifacts, all yours

Everything ships as code, data, and documentation your team can own, re-run, and audit.

◆ drop-in prompt string

A plain-text replacement for your current prompt literal. Works with any API-accessible model (Claude, GPT, etc.) and any framework, or no framework at all.

◆ compiled DSPy program

The optimized version as a DSPy module, for teams adopting DSPy or building a DSPy-native pipeline. Optional; most clients use the string.

◆ evaluation set and metric

The data we scored against and the scoring logic, as your IP. You can re-run the evaluation yourself whenever you want.

◆ before/after HTML report

Self-contained, committable, renderable as PDF. Includes methodology, per-field breakdowns, trade-off analysis, and the full compiled prompt.

◆ 30-day recompile guarantee

If the target model (Claude, GPT) ships a new version within 30 days and behavior shifts, we re-run the compile on your eval set for free.


Good fit vs. not a fit

This page pre-qualifies you. If the "not a fit" column describes your situation, we will say so on the intro call and, where we can, point you somewhere better.

◆ good fit

  • Startups and SMBs with an LLM feature in production or pre-launch.
  • Teams whose prompt "mostly works" but suspect it could be better.
  • Engineering teams without the bandwidth to build a systematic evaluation harness from scratch.
  • Anyone who just switched models (GPT-4 to GPT-4.1, Sonnet 3.5 to Sonnet 4, etc.) and noticed behavior shifts.

◆ not a fit

  • Teams that need the model itself fine-tuned (that is a different service; happy to refer).
  • Teams where the prompt is fine but the LLM feature itself is the problem (requires product design, not optimization).
  • Compliance-heavy orgs where outbound API calls to Anthropic or OpenAI are not permissible (we can discuss on-prem arrangements but that is a larger engagement).

Two engagement shapes

Fixed-fee either way. Pick one-time for a pre-launch or post-model-change sprint. Pick retainer when production traffic makes continuous compounding worth it.

One-time engagement

timeline
Fixed scope, typically 1 to 3 weeks from kickoff to handoff.
scope
Single compile plus one iteration based on your review.
best for
Pre-launch optimization, one-off improvement after a model change, or a reference case for internal stakeholders.
delivered
Everything in "Five artifacts, all yours" above.

Both tiers are fixed-fee. Email hello@agenticstudiolabs.com for current pricing.

No telemetry in place yet? We can instrument your LLM feature with trace collection (LangSmith, Langfuse, Helicone, or OpenTelemetry) as a short precursor engagement, so the retainer has real data to work with.


FAQ

Six questions we get on almost every intro call.

Do we need to adopt DSPy or change our stack?

No. We wrap your raw prompt in DSPy internally so the optimizer can evolve it, then extract the result back as a plain string at handoff. You can integrate that string into whatever you already use: plain API calls, LangChain, your own wrapper.

How do you measure "better"? Can we trust the numbers?

We agree on a metric at intake, then score both your original prompt and the compiled version on a held-out evaluation set (examples the optimizer never saw during compilation). All artifacts, numbers, and the scoring function ship in the delivered repo. You can reproduce the numbers yourself on your own hardware.

How are priorities set? Can the optimizer figure out what we care about?

We express priorities as metric weights. For engagements with real production outcome signals (CSAT, escalation, deflection), the metric is the business outcome directly. No hand-weighting needed. For synthetic or pre-launch engagements, we start equal-weighted across output fields and rebalance after you review the first compile.
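A minimal sketch of what a weighted metric looks like (field names and weights are illustrative):

```python
def weighted_metric(prediction, label, weights):
    """Weighted per-field score. The weights encode the priorities agreed
    at intake; equal weights recover plain aggregate accuracy."""
    total = sum(weights.values())
    return sum(w * (prediction[f] == label[f]) for f, w in weights.items()) / total
```

The rebalancing step after your first review is just a change to that dictionary, followed by a re-run.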

What if the optimization regresses on a field we care about?

Two options: rebalance the metric weights to emphasize the field that matters and re-run, or increase the optimization budget so GEPA explores a larger Pareto frontier and surfaces more balanced candidates. We include one such iteration in the fixed scope.

What do light, medium, and heavy budgets mean?

Budget controls how many prompt variations GEPA drafts and evaluates during compilation. Light is faster and cheaper, good for smaller eval sets or pre-launch iteration. Medium is our default and is what the reference case uses. Heavy runs longer and explores a wider Pareto frontier, useful when the eval set is large or the metric is a production-scale outcome (CSAT, escalation, deflection). We pick the tier at intake based on your data size and goals.

Can you optimize multi-step agents, not just single prompts?

Yes. DSPy supports multi-module programs and GEPA can optimize all components jointly. Multi-step engagements are scoped differently: typically 3 to 4 weeks with a larger iteration allowance.

What about data privacy and PII?

For engagements with production traces, we scan for PII before the optimization run and redact as needed. We can operate under your API keys (so tokens never hit our infrastructure) or under a dedicated account we manage. Data residency (US, EU, on-prem) is configurable at intake.
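To give a flavor of the pre-run scrub, here is a simplified sketch; the patterns and placeholder tokens are assumptions, not the production redaction rules.

```python
import re

# Illustrative PII patterns only: the real scrub covers more categories
# and is tuned per engagement at intake.
PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<EMAIL>"),
    (re.compile(r"\+?\d[\d\s().-]{7,}\d"), "<PHONE>"),
]

def redact(text):
    """Replace matched PII spans with placeholder tokens before any
    example reaches the optimization run."""
    for pattern, placeholder in PII_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text
```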

Ready to see what your prompt could do?

A 20-minute intro call covers your current setup, the data you have, what "better" means for your team, and whether a one-time or retainer engagement fits. No prep needed.