Prompt Optimizer · Agentic Studio Labs

Your LLM prompt,
measurably better.

A fixed 2-week engagement. You hand us your production LLM prompt plus eval data (production traces, labeled examples, a code or content corpus, or a written task spec we synthesize from). We hand back a drop-in replacement and a quantified before/after report, compiled with GEPA (genetic-Pareto reflective prompt evolution). No fine-tuning, no framework lock-in, no open-ended consulting.

Reference case: support ticket triage improved from 86.1% → 94.4% aggregate accuracy on a held-out test set, with 0 per-field regressions. See the full report →

Book a 20-min intro call · See the reference case ↓

eval.run :: ticket-triage

$ prompt-optimizer --baseline

→ scoring 18 held-out examples...

aggregate: 86.1%

$ prompt-optimizer --compiled

→ compiling with GEPA (medium budget)...

✓ compile complete in 11m 19s

aggregate: 94.4% (+8.3pp)

regressions: 0

the problem

Your prompt is probably underperforming. Let's check it.

Every team running an LLM feature has a prompt somewhere, usually a string literal in a codebase, written quickly by whoever needed to ship and barely touched since. It mostly works. Sometimes it misclassifies. Sometimes it drifts after a model update. Sometimes it's quietly costing you more tokens than it should.

Improving it systematically is expensive: you would need to build an evaluation harness, curate test data, design a scoring function, and iterate through dozens of prompt drafts to find one that actually moves the metric. Most teams never get past "good enough."

We do the systematic version, against your real data.

how it works

Two weeks, kickoff to handoff

Week 1 is intake and baseline. Week 2 is the GEPA compile, a review iteration, and handoff. Fixed scope from day one.

Week 1 · Intake + baseline

You hand us your prompt plus evaluation data.

Your prompt can come in whatever form you have: a raw string, a DSPy program, a LangChain pipeline, or any combination. We wrap whichever form you provide in a DSPy module for optimization; your stack stays untouched.

Before the baseline can run, we also settle two things: the output schema (which fields the prompt produces) and the scoring function (how we compare output to ground truth). Both are agreed async during intake, prior to the compile.

We then run your current prompt against a held-out split the optimizer never sees, so the before/after number is honest. You pick one of four ways to supply the data:

preferred

Production traces

Imported from LangSmith, Langfuse, Helicone, OpenTelemetry, or plain JSONL. 100 to 500 records is ideal. Highest-signal source because it matches your real distribution.

common

Labeled examples

20+ records as JSONL or CSV, one per line. Each pairs an input with the expected output fields (classification, extraction, structured generation, etc.).

corpus

Code or content corpus

A code repository, document collection, or other file-level dataset the prompt runs over. We sample a held-out set and draft expected outputs with your team at intake.

pre-launch

Written spec

You fill out a 10-section intake template (30 to 60 min). We synthesize 30 to 50 examples from it, draft expected outputs, and hand them back for your review before the compile.
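Whichever source you pick, the data ends up as records that are split into a train portion and a held-out portion. A minimal sketch of a seeded split, assuming one JSON record per line; the record fields, the 30% held-out fraction, and the seed value are illustrative assumptions, not the pipeline's actual settings:

```python
import json
import random
from io import StringIO

def load_jsonl(handle):
    """Parse one JSON record per non-empty line."""
    return [json.loads(line) for line in handle if line.strip()]

def split_records(records, held_out_fraction=0.3, seed=0):
    """Shuffle with a fixed seed, then carve off the held-out slice."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_held_out = round(len(shuffled) * held_out_fraction)
    return shuffled[n_held_out:], shuffled[:n_held_out]  # (train, held_out)

# Hypothetical labeled examples; real records would come from your own file.
sample = StringIO("\n".join(
    json.dumps({"input": f"ticket {i}", "category": "billing"}) for i in range(10)
))
records = load_jsonl(sample)
train, held_out = split_records(records)
print(len(train), len(held_out))  # 7 3
```

Because the seed is fixed, re-running the split yields the same held-out records, which is what keeps the before and after numbers comparable.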

Week 2 · Compile + review

We compile with GEPA and walk you through the results.

GEPA drafts hundreds of prompt variations, scores each on your training split, and uses a stronger reflection model (Claude Opus by default) to read failure traces and propose improvements. The loop is iterative and feedback-driven, much like reinforcement learning, but the only thing that changes is the prompt text. No model weights are touched.
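The "Pareto" in the name refers to comparing candidates on several scores at once instead of a single average, so a prompt that is uniquely strong somewhere is not discarded too early. A toy sketch of that selection idea, not GEPA's actual implementation; the candidate names and score tuples are made up for illustration:

```python
def dominates(a, b):
    """True if a is at least as good as b on every score and better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(candidates):
    """Keep the candidates that no other candidate dominates."""
    return {
        name: scores
        for name, scores in candidates.items()
        if not any(
            dominates(other, scores)
            for other_name, other in candidates.items()
            if other_name != name
        )
    }

# Hypothetical candidates, each with three scores (for example, one per field).
candidates = {
    "prompt_a": (0.83, 0.72, 0.89),
    "prompt_b": (0.89, 0.94, 0.94),
    "prompt_c": (0.94, 0.70, 0.89),
    "prompt_d": (0.83, 0.70, 0.85),
}
front = pareto_front(candidates)
print(sorted(front))  # ['prompt_b', 'prompt_c']
```

Here prompt_c survives despite a lower average than prompt_b, because it is the best candidate on the first score.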

We review the before/after with you. If a priority field regressed, we rebalance how much each output field counts toward the score and run one more compile. One iteration is included in fixed scope.

Handoff

You get a drop-in replacement and a reproducible report.

Swap the compiled prompt string into your codebase. You also get the evaluation set, the scoring function, the before/after HTML report, and a recompile script, so you can re-run the compile yourself from scratch whenever a new model version ships.

New to GEPA, field weights, or held-out splits? Quick definitions in the terminology section at the bottom.

reference case

Support ticket triage: a representative run

We ran the full pipeline on a canonical ticket-triage task, the kind of LLM feature a typical SMB has in production. Here is what a single engagement produced.

Aggregate accuracy

86.1% → 94.4%

+8.3 percentage points

Per-field regressions

0

every field held or improved

Compile time

11m 19s

medium budget

Field	Baseline	Compiled	Delta
category	83.3%	88.9%	+5.6pp
urgency	72.2%	94.4%	+22.2pp
routing	88.9%	94.4%	+5.6pp
reply_draft	100%	100%	held
aggregate	86.1%	94.4%	+8.3pp

Held-out valid split of 18 examples. GEPA never saw them during optimization.
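The aggregate row is consistent with a plain average of the four field rows, which is what equal field weights would produce. A quick check of that arithmetic, assuming equal weights (an inference from the numbers, not something the table states):

```python
# Per-field accuracy (%) copied from the table above.
baseline = {"category": 83.3, "urgency": 72.2, "routing": 88.9, "reply_draft": 100.0}
compiled = {"category": 88.9, "urgency": 94.4, "routing": 94.4, "reply_draft": 100.0}

def aggregate(per_field):
    """Unweighted mean across fields (assumes equal field weights)."""
    return sum(per_field.values()) / len(per_field)

before = round(aggregate(baseline), 1)
after = round(aggregate(compiled), 1)
regressed = [field for field in baseline if compiled[field] < baseline[field]]

print(before, after, round(after - before, 1), regressed)  # 86.1 94.4 8.3 []
```

Per-field deltas recomputed from the rounded percentages can differ from the table by 0.1pp (routing comes out at 5.5 rather than 5.6), presumably because the table was computed from unrounded values.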

We ran this on medium budget. Light is faster and cheaper for smaller eval sets; heavy runs longer and explores more candidates when the eval set is large or the score is a production outcome signal (CSAT, escalation, deflection). We pick the tier at intake.

See the full before/after report → · Case study repo on GitHub →

what's delivered

Five artifacts, all yours

Everything ships as code, data, and documentation your team can own, re-run, and audit.

◆ drop-in prompt string

A plain-text replacement for your current prompt literal. Works with any API-accessible model (Claude, GPT, etc.) and any framework, or no framework at all.

◆ compiled DSPy program

The optimized version as a DSPy module, for teams adopting DSPy or building a DSPy-native pipeline. Optional; most clients use the string.

◆ evaluation set and metric

The data we scored against and the scoring logic, as your IP. You can re-run the evaluation yourself whenever you want.

◆ before/after HTML report

Self-contained, committable, renderable as PDF. Includes methodology, per-field breakdowns, trade-off analysis, and the full compiled prompt.

◆ 30-day recompile guarantee

If the target model (Claude, GPT) ships a new version within 30 days of handoff and behavior shifts, we re-run the compile on your eval set for free.

engagement

Engagement at a glance

One shape only. Fixed scope, fixed fee, two weeks kickoff to handoff. Out-of-scope items are explicit so there is no scope creep to negotiate.

timeline: Two weeks, kickoff to handoff. Week 1 is intake and baseline, week 2 is the GEPA compile, review, and delivery.
scope: Single compile plus one review iteration if a priority field regressed. Further compiles are optional scope adjustments.
pricing: Fixed fee. Email hello@agenticstudiolabs.com for current pricing. We prefer to run the compile under your infrastructure and API keys so tokens never touch our account; if that's not convenient, we can operate under ours.
out of scope: Production deployment and infrastructure. Ongoing monitoring or continuous optimization. Model fine-tuning or weight training. Changes to upstream or downstream application code.
guarantee: If the target model (Claude, GPT) ships a new version within 30 days of handoff and behavior shifts, we re-run the compile on your eval set for free.

common questions

FAQ

Seven questions we get on almost every intro call.

Do we need to adopt DSPy or change our stack?

No. We wrap your raw prompt in DSPy internally so the optimizer can evolve it, then extract the result back as a plain string at handoff. You can integrate that string into whatever you already use: plain API calls, LangChain, your own wrapper.

How do you measure "better"? Can we trust the numbers?

We agree on a metric at intake, then score both your original prompt and the compiled version on a held-out evaluation set (examples the optimizer never saw during compilation). All artifacts, numbers, and the scoring function ship in the delivered repo. You can reproduce the numbers yourself on your own hardware.
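As a rough illustration of what reproducing the numbers involves, here is a minimal sketch of per-field scoring on a held-out set. It assumes exact-match comparison on two fields; the records, field names, and helper functions are hypothetical, and the real scoring function is whatever was agreed at intake:

```python
def per_field_accuracy(golds, preds, fields):
    """Share of records where each predicted field exactly matches the gold label."""
    return {
        field: sum(g[field] == p[field] for g, p in zip(golds, preds)) / len(golds)
        for field in fields
    }

def regressions(baseline, compiled):
    """Fields whose accuracy dropped from baseline to compiled."""
    return [field for field in baseline if compiled[field] < baseline[field]]

# Hypothetical held-out records and model outputs.
golds = [
    {"category": "billing", "urgency": "high"},
    {"category": "bug", "urgency": "low"},
    {"category": "how-to", "urgency": "low"},
    {"category": "billing", "urgency": "medium"},
]
baseline_preds = [
    {"category": "billing", "urgency": "low"},   # urgency wrong
    {"category": "bug", "urgency": "low"},
    {"category": "bug", "urgency": "low"},       # category wrong
    {"category": "billing", "urgency": "high"},  # urgency wrong
]
compiled_preds = [
    {"category": "billing", "urgency": "high"},
    {"category": "bug", "urgency": "low"},
    {"category": "bug", "urgency": "low"},       # category still wrong
    {"category": "billing", "urgency": "medium"},
]

fields = ("category", "urgency")
base = per_field_accuracy(golds, baseline_preds, fields)
comp = per_field_accuracy(golds, compiled_preds, fields)
print(base)                     # {'category': 0.75, 'urgency': 0.5}
print(comp)                     # {'category': 0.75, 'urgency': 1.0}
print(regressions(base, comp))  # []
```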

How are priorities set? Can the optimizer figure out what we care about?

We assign each output field a weight that answers one question: what share of the total score does this field count for? For the v1 compile we start with equal weights (say, 25% each across four fields). We walk the per-field breakdown with you, and if a field that matters regressed, we rebalance for a v2 compile. For example, shifting from 25/25/25/25 to 10/10/70/10 tells GEPA to prefer candidates that preserve routing accuracy, even at some cost to other fields. No ML expertise required on your end; we draft the weights with you at intake.
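To make the effect of rebalancing concrete, here is a minimal sketch of a weighted score for a single record under both weightings. The function and the two candidate outcomes are hypothetical; only the field names and the 25/25/25/25 and 10/10/70/10 weightings come from the text above:

```python
FIELDS = ("category", "urgency", "routing", "reply_draft")

def weighted_score(correct, weights):
    """Weighted share of fields judged correct, normalized to the 0..1 range."""
    total = sum(weights.values())
    return sum(weights[field] * correct[field] for field in FIELDS) / total

equal = dict(zip(FIELDS, (25, 25, 25, 25)))
routing_heavy = dict(zip(FIELDS, (10, 10, 70, 10)))

# Hypothetical outcomes on one record: 1.0 = field judged correct, 0.0 = not.
candidate_x = {"category": 1.0, "urgency": 1.0, "routing": 0.0, "reply_draft": 1.0}
candidate_y = {"category": 0.0, "urgency": 1.0, "routing": 1.0, "reply_draft": 1.0}

print(weighted_score(candidate_x, equal), weighted_score(candidate_y, equal))
# 0.75 0.75
print(weighted_score(candidate_x, routing_heavy), weighted_score(candidate_y, routing_heavy))
# 0.3 0.9
```

Under equal weights the two candidates tie; under the routing-heavy weights the one that gets routing right scores far higher, which is the preference the rebalanced compile is meant to express.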

What if the optimization regresses on a field we care about?

Two options: bump up that field's weight so the optimizer prefers candidates that get it right, or increase the compile budget so GEPA evaluates more prompt variations and has more balanced options to choose from. One iteration of this kind is included in fixed scope.

What do light, medium, and heavy budgets mean?

Budget controls how many prompt variations GEPA drafts and evaluates during compilation. Light is faster and cheaper, good for smaller eval sets or pre-launch iteration. Medium is our default and is what the reference case uses. Heavy runs longer and explores more candidates, useful when the eval set is large or the metric is a production-scale outcome (CSAT, escalation, deflection). We pick the tier at intake based on your data size and goals.

Can you optimize multi-step agents, not just single prompts?

Yes. DSPy supports multi-module programs and GEPA can optimize all components jointly. Multi-step engagements are larger than the standard fixed-scope shape; we scope them separately at intake.

What about data privacy and PII?

For engagements with production traces, we scan for PII before the optimization run and redact as needed. We can operate under your API keys (so tokens never hit our infrastructure) or under a dedicated account we manage. Data residency (US, EU, on-prem) is configurable at intake.
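For a sense of what redaction looks like on a trace, here is a deliberately simple sketch that masks email addresses and phone-like numbers with regular expressions. The patterns and placeholder tokens are illustrative assumptions, not the scanner used in an engagement, and real PII detection covers many more categories (names, addresses, account numbers):

```python
import re

# Illustrative patterns only; they will miss some formats and over-match others.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact(text):
    """Replace email addresses and phone-like digit runs with placeholder tokens."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)

print(redact("Reach me at jane.doe@example.com or +1 (555) 010-9999."))
# Reach me at [EMAIL] or [PHONE].
```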

terminology

Terminology

A short glossary for the terms used above and in the reference report. Ordered by concept flow, not alphabetically.

GEPA: Genetic-Pareto reflective prompt evolution. The compile algorithm. Drafts hundreds of prompt variations, scores each against the train split of your eval set, and uses a stronger reflection model to read the failures of poor candidates and propose improvements.
Compile: A single GEPA run. Takes the current prompt plus the eval set and produces an optimized prompt artifact. A typical compile runs 10 to 20 minutes of wall time.
Eval set: The records we score prompts against. Your evaluation data once converted to a common structured format, whether you supplied production traces, labeled examples, a code or content corpus, or a written spec.
Held-out split: The slice of the eval set (typically 30%, selected at intake with a fixed random seed) that GEPA never sees during compilation. The before/after headline number is always scored on this split so nothing is memorized. Also called the "valid split."
Train split: The remaining 70% of the eval set. GEPA uses this to score candidate prompts during the compile. Scores on the train split are reported as diagnostics only, never as the headline result.
Fields: The structured output components the LLM produces for each record. In the reference case: category, urgency, routing, reply_draft. These are the rows of the per-field table.
Score (aka the metric): A single number between 0 and 1 that says how well the prompt performed on a record, or averaged across the eval set. The scoring function combines per-field correctness into that single number. Defined jointly at intake.
Field weights: How much each output field counts toward the overall score. For example 25/25/25/25 for equal weighting across four fields, or 10/10/70/10 to prioritize routing accuracy. Changing weights is a one-line config edit.
Baseline: Your current prompt's score on the held-out split. Measured in week 1, once the output schema and scoring function are agreed and before any optimization.
Compiled: The GEPA-produced prompt's score on the same held-out split. This is the "after" number in the before/after report.
Target LM: The model the compiled prompt will actually run against in production. Default: Claude Sonnet 4.6. Client's choice at intake.
Reflection LM: The stronger model GEPA uses during the compile to read failure traces and propose prompt improvements. Default: Claude Opus 4.7. Most of GEPA's effectiveness comes from a strong reflection model.
Budget (light / medium / heavy): How many prompt variations GEPA drafts and evaluates during one compile. Light is fastest and cheapest (good for smaller eval sets or pre-launch iteration); medium is our default; heavy runs longer (good for large eval sets or when every point of lift matters).
Recompile: Running the GEPA compile again from the delivered artifacts, typically after a new target-model version ships. The recompile script is part of handoff; a free re-run is guaranteed if a new target-model version ships within 30 days of handoff and behavior shifts.

Ready to see what your prompt could do?

A 20-minute intro call covers your current setup, the data you already have, and what "better" means for your team. No prep needed. If the engagement is a fit, we schedule the 2-week window from there.

Book a 20-min intro call or email hello@agenticstudiolabs.com
