When Agents Are Actually Worth Shipping
Systems Notes | 2026-03-07
Take: Ship agents only when uncertainty is the core problem.
I like agent systems, but I do not romanticize them. Most teams adopt agents because the demo looks impressive, not because the problem truly needs agentic behavior. That is backward. Agents are expensive in latency, cost, and operational complexity. I ship them only when they solve a real uncertainty problem that deterministic pipelines cannot handle cleanly. If the workflow is mostly predictable, I default to boring automation every time.
My first question: what uncertainty are we buying down?
Before I design an agent architecture, I write one sentence:
- What decision cannot be encoded reliably with rules today?
If I cannot answer that, I do not build an agent.
Good uncertainty candidates:
- Unstructured, noisy inputs where intent extraction matters.
- Dynamic tool selection based on context quality.
- Multi-step planning where path choices vary per request.
Bad candidates:
- Fixed ETL flows.
- Static API orchestration.
- Strictly deterministic validation tasks.
I have seen too many “agentic” systems that were just brittle wrappers around simple scripts.
A decision framework I use before committing
I score candidate workflows on five axes:
1. Input ambiguity: Is source data messy enough to require semantic reasoning?
2. Decision branching: Does the path genuinely vary from request to request?
3. Cost tolerance: Can we afford higher latency and token burn?
4. Failure impact: What happens if the agent is wrong once?
5. Operator control: Can humans review, override, and replay safely?
If a workflow scores low on ambiguity and branching, I reject agent design and build deterministic automation.
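The five axes above can be sketched as a small scoring function. This is a minimal illustration, not a calibrated rubric; the thresholds and the 0-5 scale are assumptions.

```python
from dataclasses import dataclass

@dataclass
class WorkflowScore:
    input_ambiguity: int      # 0-5: how messy is the source data?
    decision_branching: int   # 0-5: does the path vary per request?
    cost_tolerance: int       # 0-5: can we afford latency and token burn?
    failure_impact: int       # 0-5: blast radius of one wrong action (higher = worse)
    operator_control: int     # 0-5: can humans review, override, and replay?

def recommend(score: WorkflowScore) -> str:
    # Low ambiguity and low branching: deterministic automation wins.
    if score.input_ambiguity <= 2 and score.decision_branching <= 2:
        return "deterministic"
    # Real uncertainty, but weak control or no cost headroom: not ready.
    if score.operator_control <= 2 or score.cost_tolerance <= 1:
        return "not-ready"
    return "agent-candidate"
```

The point of encoding it is that the rejection path ("deterministic") is checked first, so an agent is the fallback, not the default.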
The architecture pattern that keeps agents sane
When I do ship agents, I do not let them run wild. I use a constrained architecture:
- Planner produces candidate actions.
- Policy layer approves or rejects actions.
- Tool executor runs approved actions only.
- State store tracks plan, action, and outcome transitions.
- Human review gate for risky actions.
The key rule: planner output is a proposal, never an automatic permission slip.
This protects the system from model hallucination and from subtle prompt-injection paths.
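The proposal-then-policy flow can be sketched in a few lines. All names here (`Action`, `PolicyLayer`, `run_step`) are illustrative, not a real framework; the invariant to notice is that the executor only ever sees actions the policy approved.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Action:
    tool: str
    args: dict
    risky: bool = False

class PolicyLayer:
    def __init__(self, allowed_tools: set):
        self.allowed_tools = allowed_tools

    def review(self, action: Action) -> str:
        if action.tool not in self.allowed_tools:
            return "reject"
        if action.risky:
            return "needs-human-review"    # human gate for risky actions
        return "approve"

def run_step(action: Action, policy: PolicyLayer,
             execute: Callable[[Action], str], state: list) -> str:
    verdict = policy.review(action)        # planner output is only a proposal
    state.append((action.tool, verdict))   # state store records every transition
    if verdict == "approve":
        return execute(action)             # executor runs approved actions only
    return verdict
```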
Tooling boundaries matter more than model cleverness
I enforce strict tool contracts:
- Every tool has a typed schema.
- Every tool call has explicit allow/deny rules.
- Side-effecting tools require extra checks.
- Tool responses are normalized before feeding back to the model.
Without boundaries, agents become unpredictable quickly. With boundaries, they become debuggable components in a broader system.
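A tool contract in this spirit can be as simple as a schema check plus a denylist plus response normalization. The schema shape below is an assumption (plain type checks rather than a validation library), and `lookup_order` is a hypothetical tool.

```python
TOOL_SCHEMAS = {
    "lookup_order": {"order_id": str},
}
DENYLIST = {"drop_table"}

def validate_call(tool: str, args: dict) -> bool:
    if tool in DENYLIST or tool not in TOOL_SCHEMAS:
        return False
    schema = TOOL_SCHEMAS[tool]
    # Every declared field present, nothing extra, types match.
    return set(args) == set(schema) and all(
        isinstance(args[k], t) for k, t in schema.items()
    )

def normalize_response(raw: dict) -> dict:
    # Strip unknown keys and clip size before feeding back to the model.
    allowed = {"status", "data"}
    return {k: str(raw.get(k, ""))[:500] for k in allowed}
```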
What I optimize for in production agent systems
1) Bounded autonomy
I optimize for constrained autonomy, not full autonomy. The agent should handle repetitive ambiguity, then escalate edge cases predictably.
2) Replayability
I need to replay a run with the same context and inspect every decision.
- Store prompt/tool trace per step.
- Persist intermediate state.
- Attach run IDs across services.
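The three replayability bullets can be sketched as a per-step trace record keyed by run ID. The record shape is an illustrative assumption; the essential property is that replay is a deterministic re-read of every decision, in order.

```python
import uuid

def new_run_id() -> str:
    return uuid.uuid4().hex

def record_step(trace: list, run_id: str, step: int,
                prompt: str, tool_call: dict, outcome: str) -> None:
    trace.append({
        "run_id": run_id,        # attached across services
        "step": step,
        "prompt": prompt,        # exact prompt/tool trace per step
        "tool_call": tool_call,
        "outcome": outcome,      # persisted intermediate state
    })

def replay(trace: list, run_id: str) -> list:
    # Re-read every decision in a run, in step order.
    steps = [r for r in trace if r["run_id"] == run_id]
    return sorted(steps, key=lambda r: r["step"])
```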
3) Safe degradation
If one capability fails, the system should degrade gracefully:
- Fall back to simpler deterministic path.
- Queue for human review.
- Return partial result with clear status.
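The degradation ladder above, as a sketch: try the agent path, fall back to the deterministic path, then queue for humans. Function names are illustrative; the statuses mirror the bullets.

```python
def handle_request(req: dict, agent_path, deterministic_path,
                   review_queue: list) -> dict:
    try:
        return {"status": "complete", "result": agent_path(req)}
    except Exception:
        pass  # agent capability failed; degrade, do not crash
    try:
        return {"status": "degraded", "result": deterministic_path(req)}
    except Exception:
        review_queue.append(req)  # queue for human review
        return {"status": "partial", "result": None}  # clear status, no silent failure
```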
4) Cost discipline
Agents can quietly become expensive. I cap max steps, token budgets, and retry counts per run.
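Per-run caps can live in one small budget object checked on every step. The limits below are illustrative assumptions, not recommended values; this also enforces the forced-termination rule from the next section, since the loop must stop when `charge` returns False.

```python
class RunBudget:
    def __init__(self, max_steps: int = 8, max_tokens: int = 20_000,
                 max_retries: int = 2):
        self.max_steps = max_steps
        self.max_tokens = max_tokens
        self.max_retries = max_retries
        self.steps = 0
        self.tokens = 0
        self.retries = 0

    def charge(self, tokens: int, retry: bool = False) -> bool:
        """Record one step; return False when the run must terminate."""
        self.steps += 1
        self.tokens += tokens
        if retry:
            self.retries += 1
        return (self.steps <= self.max_steps
                and self.tokens <= self.max_tokens
                and self.retries <= self.max_retries)
```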
What I avoid
1) Recursive free-form tool loops
Unlimited “think-act-think” cycles are a reliability hazard. I cap steps and force termination states.
2) Hidden prompts and implicit behavior
If behavior depends on undocumented prompt magic, no one can maintain it. I version prompts and policies like code.
3) Side effects in planning stage
Planner should not mutate external systems. Planning and execution are separate for a reason.
4) Blind trust in “agent confidence” language
Natural-language confidence claims are not metrics. I trust measurable outcomes and validation gates.
A practical launch path
I use this rollout sequence for agent features:
1. Dry-run mode: agent proposes actions, no execution.
2. Shadow compare: compare proposals against the deterministic baseline or human decisions.
3. Scoped execution: enable only low-risk tool actions.
4. Progressive unlock: increase capability by policy tier.
5. Continuous audit: monitor policy rejects, retries, and correction rate.
This lets me quantify value before exposing high-impact operations.
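The rollout tiers can be encoded as policy data rather than code branches, so unlocking capability is a config change with an audit trail. Tier names and tool lists here are illustrative assumptions.

```python
ROLLOUT_TIERS = {
    "dry-run":  {"execute": False, "tools": set()},
    "shadow":   {"execute": False, "tools": set()},
    "scoped":   {"execute": True,  "tools": {"read_only_lookup"}},
    "unlocked": {"execute": True,  "tools": {"read_only_lookup", "create_ticket"}},
}

def may_execute(tier: str, tool: str) -> bool:
    cfg = ROLLOUT_TIERS[tier]
    return cfg["execute"] and tool in cfg["tools"]
```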
How I measure whether an agent deserves to stay
I keep the success criteria concrete:
- Reduced manual triage time.
- Lower error rate versus existing workflow.
- Stable latency within SLA range.
- Acceptable cost per completed task.
- Low rate of policy-violating proposals.
If the agent fails these metrics over time, I simplify or remove it. Deleting complexity is a valid optimization.
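The success criteria above can be collapsed into a single keep-or-simplify gate. The thresholds and metric names are illustrative assumptions; what matters is that every criterion is a measured quantity, not a vibe.

```python
def agent_earns_its_keep(m: dict) -> bool:
    return (m["triage_minutes_saved"] > 0
            and m["error_rate"] < m["baseline_error_rate"]
            and m["p95_latency_s"] <= m["sla_latency_s"]
            and m["cost_per_task"] <= m["cost_budget"]
            and m["policy_violation_rate"] < 0.01)
```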
Agent vs workflow automation: my default table
I keep this mental model:
- Mostly deterministic + high side-effect risk -> workflow automation.
- Ambiguous input + low side-effect risk -> agent assist.
- Ambiguous input + high side-effect risk -> hybrid with strict review gates.
Most enterprise tasks fall into the hybrid category, not fully autonomous systems.
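The mental model above is literally a two-axis lookup, which is worth writing down so the default is explicit:

```python
def choose_design(ambiguous_input: bool, high_side_effect_risk: bool) -> str:
    if not ambiguous_input:
        return "workflow-automation"          # mostly deterministic
    if high_side_effect_risk:
        return "hybrid-with-review-gates"     # most enterprise tasks land here
    return "agent-assist"
```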
Related projects
The clearest production example for this approach is OpenClaw Local Operator System, where local-first execution, policy middleware, and tool constraints were required from day one.
For contrast, Semantic Career Workflow Automation shows where deterministic pipelines plus semantic scoring can solve most of the value without full agent complexity.
Final take
Agents are worth shipping when uncertainty is the core bottleneck and you can bound risk with policy, observability, and human control. They are not worth shipping when a deterministic workflow can solve the same outcome faster and safer. My rule is simple: ship the least complex system that produces reliable outcomes, then add agentic behavior only where it proves measurable leverage.
If a team cannot explain the exact handoff between model judgment and deterministic enforcement, the architecture is not ready. Clear boundaries are what turn agent ideas into production systems you can trust.