Real-API evals as the deploy gate
Most agent CI pipelines look identical. Lint, type-check, run a few unit tests where the LLM is mocked. Pipeline goes green. Deploy proceeds.
That covers the wiring. It does not cover whether the agent gave the right answer.
The pipeline I'm about to describe has a step the others don't: a small eval suite that runs every PR through real model calls, grades the answers, and only lets the deploy proceed if every case passes.
Why mocked tests aren't enough
The mocked unit tests are good for what they're for. They prove the wiring works. Edges connect. Conditionals route. State flows where it should.
Honest tests, all of them, each running in tens of milliseconds because there's no network involved. They run on every push, on every machine, without API keys.
What they don't prove: that the agent gives a correct answer.
The mocked LLM returns whatever you tell it to. If you decide the classifier should route "What is LangGraph?" down the retrieval branch, you can prove your code does that. You can't prove that the real model, given the same question, would also pick that branch. The mock is loyal to your test, not to the model.
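A minimal sketch of that loyalty problem, assuming a router that takes its classifier as a plain callable. The names `route_question`, `mocked_classifier`, and the branch labels are hypothetical, chosen for illustration rather than taken from the real agent.

```python
from typing import Callable


def route_question(question: str, classify: Callable[[str], str]) -> str:
    """Pick a graph branch from the classifier's label (hypothetical labels)."""
    label = classify(question)
    return "retrieval_branch" if label == "retrieve" else "direct_answer_branch"


def mocked_classifier(question: str) -> str:
    # The mock ignores its input and returns what the test author decided.
    return "retrieve"


def test_routes_to_retrieval() -> None:
    branch = route_question("What is LangGraph?", mocked_classifier)
    # This proves the conditional routes correctly for the label "retrieve".
    # It says nothing about which label a real model would produce.
    assert branch == "retrieval_branch"


test_routes_to_retrieval()

# The same mock sends an unrelated question down the same branch.
print(route_question("What is the capital of France?", mocked_classifier))
```

The test is still worth having. It just answers a different question than "would the real model pick this branch?"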
Two failure modes mocked tests can't catch:
- Model drift. The provider ships a new model checkpoint. Your prompt now produces slightly different output. Your unit tests still pass. They mocked the model.
- Prompt drift. Somebody tweaks the prompt to be tighter. Tighter to the point that citations get dropped. Unit tests still pass. They never read the model's output.
Pipeline goes green. Production goes red.
What the 10-case eval suite actually checks
The eval suite is a runner script plus a JSON file of test cases (e.g. evals/run.py and evals/test_set.json). Ten questions, each paired with a list of expected keywords. The rule: the final answer must contain every expected keyword, case-insensitive substring match. All-or-nothing.
{
  "id": "q05",
  "question": "What does the interrupt() function do in a LangGraph node?",
  "expected_keywords": ["interrupt", "pause"]
}
Three things to notice about this:
- The grading is dumb on purpose. No second LLM call to score the first. No cosine similarity, no rubric, no LLM-as-judge. Just substring match.
- The questions are written for the specific corpus the agent indexes. Off-corpus questions belong in a different test set.
- When a case fails, the failure data gets printed: which keywords were missing, plus the first 200 characters of the actual answer. Enough to read the run log and understand why it failed.
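The grading rule described above can be sketched in a few lines. This is an illustration under stated assumptions, not the actual contents of evals/run.py: the function names `grade_case` and `format_failure` and the shape of the result dict are made up for the example.

```python
def grade_case(answer: str, expected_keywords: list[str]) -> dict:
    """Case-insensitive substring match; every keyword must appear."""
    lowered = answer.lower()
    missing = [kw for kw in expected_keywords if kw.lower() not in lowered]
    return {
        "passed": not missing,
        "missing": missing,
        "preview": answer[:200],  # first 200 characters, for the run log
    }


def format_failure(case_id: str, result: dict) -> str:
    """Render the failure data that lands in the CI log."""
    return (
        f"FAIL {case_id}\n"
        f"  missing keywords: {result['missing']}\n"
        f"  answer preview:   {result['preview']!r}"
    )


# A paraphrased answer that never uses the word "pause" fails the case.
answer = "Calling interrupt() halts the node until a human responds."
result = grade_case(answer, ["interrupt", "pause"])
if not result["passed"]:
    print(format_failure("q05", result))
```

The example answer is arguably correct, and it still fails. That is the trade the dumb grader makes: it occasionally rejects a good paraphrase in exchange for never needing a second model to judge the first.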
What it catches:
- Agent doesn't retrieve. The answer doesn't mention the topic at all.
- Agent hallucinates the wrong concept. Mentions adjacent terms but misses the ones the question is about.
- Prompt drift below a quality threshold. Answer used to contain the right terms; new prompt produces output that doesn't.
- Model regression. A new model checkpoint paraphrases too aggressively and the keyword stops appearing.
What it doesn't catch:
- Subtle correctness. The answer can mention every required keyword and still explain the concept wrong. Keyword-match is a floor, not a ceiling.
- Tone, formatting, citation style. None of those are graded.
The grade is binary at the case level and binary at the suite level. Either every case passes, or the suite fails. There's no "9 out of 10 is fine for this PR." If a case fails, the deploy doesn't happen. That's the only behavior that survives contact with a busy week.
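As a sketch of the all-or-nothing rule, assuming the agent is exposed as a callable that takes a question and returns an answer string. `run_suite` and `stub_agent` are hypothetical names; the real runner may be organized differently.

```python
from typing import Callable


def run_suite(cases: list[dict], ask_agent: Callable[[str], str]) -> int:
    """Return a process exit code: 0 only if every case passes."""
    failed = 0
    for case in cases:
        answer = ask_agent(case["question"])
        lowered = answer.lower()
        missing = [kw for kw in case["expected_keywords"] if kw.lower() not in lowered]
        if missing:
            failed += 1
            print(f"FAIL {case['id']}: missing {missing} | {answer[:200]!r}")
    print(f"{len(cases) - failed}/{len(cases)} cases passed")
    # No threshold and no "good enough" ratio: one failure fails the suite.
    return 0 if failed == 0 else 1


def stub_agent(question: str) -> str:
    # Stand-in for the real agent so the sketch runs without API keys.
    return "interrupt() will pause the graph and wait for input."


cases = [
    {"id": "q05", "question": "What does interrupt() do?",
     "expected_keywords": ["interrupt", "pause"]},
    {"id": "q06", "question": "How is state persisted?",
     "expected_keywords": ["checkpointer"]},
]
exit_code = run_suite(cases, stub_agent)
print("exit code:", exit_code)
# A real runner would end with sys.exit(exit_code) so CI sees the failure.
```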
Cost and latency of running real evals on every PR
Each case runs through the agent's full graph: classify the question, fetch the relevant chunks from a local vector store, compose the answer with the context, score whether the answer is grounded. Three LLM calls per case in the happy path (classify, write, grade), mixing a small model for the cheap classification and grading work and a larger one for the writing. If the grading step rejects, the writer runs again with the rejection reason in hand. The loop is capped, so the worst case is roughly twice the call count of the happy path.
Ten cases at three-to-six calls each. Thirty to sixty LLM calls per PR. Sequential, not parallel. Wall-clock: roughly two minutes added to the CI run, dominated by the writer model's response time.
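The call counts can be made concrete with a sketch of the capped loop. Everything here is an assumption for illustration: the function signatures, the cap of two write attempts, and the premise that retrieval makes no LLM call. With that cap the sketch lands at three calls in the happy path and five in the worst case, inside the three-to-six range above.

```python
from typing import Callable, Optional

MAX_WRITE_ATTEMPTS = 2  # assumed cap; the real agent's limit may differ


def answer_question(
    question: str,
    classify: Callable[[str], str],
    retrieve: Callable[[str], list[str]],
    write: Callable[[str, list[str], Optional[str]], str],
    grade: Callable[[str, list[str]], tuple[bool, Optional[str]]],
) -> tuple[str, int]:
    """Run one case through the graph and count LLM calls along the way."""
    llm_calls = 0
    classify(question)  # small model; routing decision unused in this sketch
    llm_calls += 1
    chunks = retrieve(question)  # local vector store lookup, not counted
    feedback: Optional[str] = None
    answer = ""
    for _ in range(MAX_WRITE_ATTEMPTS):
        answer = write(question, chunks, feedback)  # larger model
        llm_calls += 1
        grounded, feedback = grade(answer, chunks)  # small model
        llm_calls += 1
        if grounded:
            break
    return answer, llm_calls


# Stubs so the sketch runs offline.
stubs = dict(
    classify=lambda q: "retrieve",
    retrieve=lambda q: ["chunk about interrupt()"],
    write=lambda q, chunks, feedback: "interrupt() pauses the graph.",
)
_, happy = answer_question("q", grade=lambda a, c: (True, None), **stubs)
_, worst = answer_question("q", grade=lambda a, c: (False, "not grounded"), **stubs)
print(f"per case: {happy} to {worst} calls; per 10-case PR: {10 * happy} to {10 * worst}")
```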
Cost works out to a few cents per PR. The math doesn't matter much. The alternative is shipping a broken agent on a Friday.
Why sequential? Because the agent keeps state per request. Parallelizing would mean multiple state stores running side by side, plus careful logging to keep results legible. Two minutes is fine. Premature optimization here would buy nothing.
The gating step in GitHub Actions
Here's the relevant slice of .github/workflows/ci.yml:
- name: Unit tests (mocked, no API keys needed)
  run: <your unit test suite>
- name: Eval gate (real API, all-or-nothing)
  env:
    PROVIDER_API_KEY: ${{ secrets.PROVIDER_API_KEY }}
  run: <your eval suite; exit non-zero on any failure>
# everything below runs only if every step above passed
- name: Deploy
  run: <push image, update service, whatever your deploy does>
- The eval step has secrets the mocked tests don't get to see. That's by design. If a unit test could pull the provider's API key, it could call the real model and stop being a unit test.
- The eval script (e.g. python evals/run.py) exits zero only if every case passes. Non-zero halts the workflow. GitHub Actions doesn't even try the deploy step.
- The deploy steps all run sequentially after the gate. They run because every preceding step succeeded, not because they were chosen.
The structure is a literal gate. There is no "deploy anyway" branch. There is no manual override to skip the evals on a code change. If a model regression breaks the suite at 11pm on a Tuesday, the deploy doesn't happen. The previous version keeps serving traffic.
The workflow does have one escape hatch: a paths-ignore clause for documentation and asset-only changes. README typo? Skip the gate. Anything that touches code, build config (e.g. the Dockerfile), or the workflow itself? Run it.
What this gets you, and what it doesn't
What it gets:
- Model regressions don't reach production. If a provider ships a checkpoint that breaks the answer shape, the next deploy fails at the gate. Nobody finds out at 3am.
- The eval suite stays honest. If you can't write a question with reliable keywords, you don't understand what the agent should do. Writing the suite is a forcing function.
- The CI log is enough to debug a regression. Failed case, missing keywords, answer preview. All in plaintext.
What it doesn't get:
- Slow drift below the keyword floor. Answers that subtly degrade in tone, structure, or correctness while still hitting every required keyword.
- Coverage of edge cases not in the suite. Ten questions is a small sample. New questions get added when a real user query exposes a gap; old ones get retired when the corpus shifts and they no longer probe what they were written to probe.
- A stamp of "the agent works." The gate doesn't say the agent is good. It says the agent didn't get worse than the floor we wrote.
Mocked tests cover the wiring. Real-API evals cover the answer. Both are needed because they fail differently, and a deploy that costs real money should pass both.