Retry, escalate, resume

January 22, 2026

When somebody says their agent has retry logic, they almost always mean the SDK is configured to retry on 429s and 5xx. A try / except somewhere down the call stack, maybe with exponential backoff. They didn't write it; the client library wrote it. And it works fine when the failure mode is the network.

It does not work when the failure mode is "the agent gave a wrong answer."

A wrong answer is not a transient error. Sleeping doesn't help. Re-running the same prompt with the same retrieved chunks gets you the same answer with different filler words. If you want the next attempt to be better, the next attempt has to know why the last one was bad. And the only way to know that is to score it.

Here's how retry actually works in the agent. It isn't a wrapper around a function. It's a back-edge in the graph.

The wrong way to retry

The first instinct is to wrap the whole agent call in the same retry pattern, one level up:

for attempt in range(MAX_RETRIES):
    try:
        answer = agent.invoke(query)
        return answer
    except Exception:
        time.sleep(2 ** attempt)

This handles the same transient failures the SDK already retries: hiccups, rate limits, ChromaDB briefly off. Just at a different layer. It does not handle anything new.

But the failure mode I care about isn't a transient error. It's the agent confidently producing an answer that doesn't follow from the retrieved docs. Hallucination. Drift. The HTTP layer is happy. The agent is wrong.

If you wrap that in try / except, the exception never fires. Your retry budget gets spent on a problem you don't have.

What's missing is the part where someone, or something, looks at the answer and says no, that's not it, try again. And on the retry, the agent should know what was wrong with the first attempt.

Retry as graph state

In LangGraph the agent is a graph of nodes, and the graph has its own state. The state isn't a function-local variable; it's a typed dict that flows through every node. That gives us somewhere to put the retry counter.

The state for my agent looks like this:

class AgentState(TypedDict):
    query: str
    category: str
    retrieved_docs: list[dict]
    answer: str
    iteration_count: int
    verifier_passes: bool
    verifier_feedback: str

Two fields make the retry possible:

iteration_count: the budget. Bumped by the verifier on every cycle.
verifier_feedback: the why from the previous attempt. The planner reads this on retry, so it isn't trying the same thing twice.

The verifier is a separate node. Its only job is one structured-output call: was the answer grounded in the retrieved docs? Yes or no, with a one-sentence reason. I run it on Claude Haiku because the question is small and the answer is a boolean. Sonnet does the planning; Haiku does the scoring.

The retry happens at the graph edge:

graph.add_edge("planner", "verifier")
graph.add_conditional_edges(
    "verifier",
    _route_after_verifier,
    {"end": END, "retry": "planner"},
)

The dotted edge is the back-edge. It points at the planner, not the retriever, because we don't want a fresh fetch; we want a fresh attempt against the same evidence with the verifier's feedback in hand.

Three things to notice about this:

The edge is conditional. The graph doesn't know in advance whether it will loop. It looks at state, picks an edge.
The retry branch points back at planner, not retrieval. The retrieved docs are still in state. Same evidence, fresh attempt. We're not re-fetching from ChromaDB.
There's a hard ceiling. MAX_ITERATIONS = 3. If the verifier still says no after the third planner attempt, we exit with whatever the planner produced. Wrong is better than spinning forever.

The planner reads the previous feedback when it runs the second time, so each cycle is informed. The planner isn't generating from scratch; it's responding to a critique. That's the part that distinguishes "retry" from "re-run": the Reflexion shape, applied to RAG grounding instead of code generation.

The escalation point

Three iterations isn't always enough. Sometimes the planner is stuck in a basin: the retrieved chunks happen to be misleading, the verifier is being pedantic, or the query itself was ambiguous. The graph can detect that something is wrong; what it can't do is decide on its own to redirect.

So the verifier escalates. The graph stops mid-execution and writes its entire state to a checkpointer: every field, every retrieved doc, the failed answer, the verifier's reason for failing it. And then nothing happens. No background timer ticks. No retry queue. The execution sits there, with all of its context intact, waiting for somebody to come tell it what to do.

The mechanism is one call. The verifier fires it at iteration 2, one cycle before the budget runs out:

if not result.passes and new_iter == HITL_TRIGGER and hitl_enabled:
    decision = interrupt(
        {
            "type": "verifier_failed",
            "iteration": new_iter,
            "max_iterations": MAX_ITERATIONS,
            "verifier_reason": result.reason,
            "answer": state["answer"],
        }
    )

interrupt() is LangGraph's pause primitive: the call that does the stopping and the writing-to-disk. The checkpointer is a SqliteSaver in my case, which writes to a local SQLite file. When a human comes in, they see what the verifier saw: the question, the docs that were retrieved, the answer that was rejected, and the verifier's stated reason. They have three actions available:

approve: override the verifier. The answer was actually fine. Mark verifier_passes=True and exit.
rewrite: replace the answer with a better one the human typed themselves. Exit with the new answer.
Anything else (including reject): agree with the verifier and give up. Exit with verifier_passes=False.

The resume call looks like this:

graph.invoke(
    Command(resume={"action": "approve"}),
    config={"configurable": {"thread_id": session_id}},
)

The thread_id tells SqliteSaver which paused execution to resume. State is loaded from disk, the verifier node re-runs from the top, sees the human's decision, and routes accordingly. Execution continues from exactly where it stopped. No restart, no replay-from-scratch.

One quirk: because the verifier re-runs from the top on resume, its LLM call fires twice. Haiku at temperature zero, so the duplicate is deterministic and basically free. If the verifier did anything expensive, I'd split the LLM into its own node so resume wouldn't repeat it. Not worth it here.

What this buys you

That's the picture: the agent retries itself when it's wrong, escalates when retry isn't enough, and resumes exactly where it stopped once a human has decided. None of that needed a wrapper, a queue, or a callback. It needed a graph state, a conditional edge, and a checkpointer.

Three things fall out of this approach that I didn't plan for, but I keep coming back to.

It makes the retry budget visible. When retry lives in calling code, the budget gets wrapped around invocations and tends to drift: different call sites, different counts, no shared view. When it's in graph state, every node sees the iteration counter and can react to it. The verifier knows when it's at iter=2 and pulls the human in. The router knows when iter=3 and exits cleanly. One number, one place.

It makes the agent pausable. The HITL escalation isn't a special path; it's interrupt() plus the same checkpointer that already persisted state between nodes. Anywhere a node could call interrupt(), the agent can pause. Want a budget approval before an expensive tool call? Same primitive. Want to send the user a clarifying question and wait for the response? Same primitive.

It makes the run replayable. Every step writes to the checkpointer. If something later in the graph blows up, SqliteSaver still has the state from before the explosion. You can inspect, fix, resume. That isn't unique to RAG agents (it's how durable workflows work in general), but it falls out for free once your retry already lives in state.

The wrapper-around-a-function model gives you none of that. You retry, it works or it doesn't, you log the result, and the rest of the system never finds out what happened in the loop. The first time something breaks in production, the difference between "I have a stack trace" and "I have the entire run on disk" is the difference between a debugging session you can finish and one you can't.

The right place for retry is inside the agent's state, not around it. Once it lives there, escalation and resume stop being features you build. They're already there, waiting.