The Best Agent Stack Is the One On-Call Can Run
Daily Brief | 2026-02-22
Take: The winning stack is the one your team can operate under pressure.
I am not optimizing for impressive screenshots. I am optimizing for fewer incidents, faster recovery, and cleaner handoffs between teams. This edition is built around that mindset.
Top Stories
Permission boundaries are core architecture
- Agents should not run with broad credentials when scoped tokens can satisfy the same task.
- I separate read, write, and privileged tool permissions by workflow intent.
- Runtime policy checks catch unsafe tool requests before execution reaches sensitive systems.
Why it matters: Permission sprawl turns minor prompt mistakes into high-impact incidents.
My take:
- Least privilege is not optional when agents can execute tools against production systems.
- I would rather approve one more permission request than debug one preventable security incident.
Reality check: Security reviews after launch rarely remove risk as effectively as scoped design upfront.
Builder move: Issue short-lived scoped credentials per workflow and enforce tool allowlists at the runtime policy layer.
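A minimal sketch of what a runtime allowlist check can look like. The workflow names and tool names here are hypothetical, and a real policy layer would also scope credentials, but the shape is the same: map each workflow intent to the tools it may call, and fail closed before execution.

```python
# Hypothetical allowlists: each workflow intent maps to the only tools it may call.
WORKFLOW_ALLOWLISTS = {
    "report_generation": {"read_db", "render_pdf"},
    "ticket_triage": {"read_db", "update_ticket"},
}

class PolicyViolation(Exception):
    """Raised when an agent requests a tool outside its workflow's allowlist."""

def check_tool_request(workflow: str, tool: str) -> None:
    """Reject unsafe tool calls before execution reaches sensitive systems."""
    allowed = WORKFLOW_ALLOWLISTS.get(workflow, set())  # unknown workflows get no tools
    if tool not in allowed:
        raise PolicyViolation(f"workflow {workflow!r} may not call tool {tool!r}")
```

Placing this check in the tool-dispatch path, rather than in the prompt, is the point: a prompt mistake then produces a rejected request instead of an incident.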
Dependency supply chain risk is underrated
- AI workflows often pull fast-moving dependencies with weak provenance checks.
- I pin versions, scan lockfiles, and audit transitive packages tied to tool execution.
- Build reproducibility matters because incident rollback depends on known artifact state.
Why it matters: Supply chain drift can introduce security or reliability regressions without any app code changes.
My take:
- I treat dependency governance as production safety work, not as compliance paperwork.
- Unpinned transitive dependencies are silent risk multipliers in automation stacks.
Reality check: A passing build today does not guarantee the same dependency behavior tomorrow.
Builder move: Pin critical dependencies, verify checksums in CI, and schedule weekly lockfile audit reviews.
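A sketch of the checksum side of that builder move, assuming a hypothetical pin file mapping artifact names to expected SHA-256 digests. Unpinned artifacts fail closed, which is the property that makes the CI gate useful.

```python
import hashlib

# Hypothetical pin data: artifact filename -> expected SHA-256 hex digest.
# (This example pins the digest of empty content so it is self-checking.)
PINNED_HASHES = {
    "example-0.1.0.tar.gz": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
}

def verify_artifact(name: str, data: bytes) -> bool:
    """True only if the artifact is pinned AND its digest matches the pin."""
    expected = PINNED_HASHES.get(name)
    if expected is None:
        return False  # unpinned artifacts are rejected, not waved through
    return hashlib.sha256(data).hexdigest() == expected
```

In practice tools like pip's hash-checking mode do this for you; the sketch just shows why a passing build today says nothing about tomorrow unless the digests are pinned.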
Evaluation gates belong in CI/CD
- Prompt edits, model routing changes, and tool updates should trigger automated eval checks.
- I keep a regression suite that reflects real user intents, not idealized sandbox prompts.
- Passing unit tests is not enough when semantic behavior is part of the product.
Why it matters: Without eval gates, quality drifts silently until customer trust is already damaged.
My take:
- Prompt engineering is useful, but without eval gates it is still guesswork with better wording.
- I push back on any release plan that skips semantic regression checks for speed.
Reality check: A green pipeline with no eval coverage can still ship broken behavior.
Builder move: Add a mandatory semantic eval stage in CI and block deployment when key task scores regress.
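The gate itself can be very small. This is a sketch under assumed inputs: per-task scores from a baseline run and a candidate run, with a tolerance for noise. Task names and the 0.02 tolerance are illustrative, not a recommendation.

```python
def gate_release(
    baseline: dict[str, float],
    candidate: dict[str, float],
    max_regression: float = 0.02,  # illustrative noise tolerance
) -> list[str]:
    """Return tasks whose scores regressed beyond tolerance; empty list means ship."""
    failures = []
    for task, base_score in baseline.items():
        new_score = candidate.get(task, 0.0)  # a missing task counts as a full regression
        if base_score - new_score > max_regression:
            failures.append(task)
    return failures
```

CI then blocks the deploy whenever the returned list is non-empty, which is exactly the "green pipeline can still ship broken behavior" hole this closes.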
Tooling / Shipping Notes
Regression datasets need refresh cadence
- Static eval sets decay as product behavior and user expectations evolve.
- I refresh regression datasets on a schedule tied to major feature changes.
- Each refresh keeps legacy high-impact cases so quality history is preserved.
Why it matters: Stale eval data gives false confidence and misses emerging failure modes.
My take:
- I would rather maintain eval data aggressively than debug avoidable regressions in production.
- Dataset ownership is a core engineering responsibility in AI products.
Reality check: Old benchmarks flatter new models when user behavior has already shifted.
Builder move: Schedule monthly eval dataset reviews and add new failure examples from support incidents.
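A sketch of the merge step in such a refresh, assuming eval cases are dicts keyed by a `prompt` field (the field name is my assumption). The invariant it encodes is the one from the bullets above: new failure examples get added, legacy high-impact cases are never dropped.

```python
def refresh_dataset(current: list[dict], new_failures: list[dict]) -> list[dict]:
    """Merge incident-derived cases into the eval set without losing legacy cases."""
    seen = {case["prompt"] for case in current}
    merged = list(current)  # legacy cases always survive, preserving quality history
    for case in new_failures:
        if case["prompt"] not in seen:  # skip duplicates already covered
            merged.append(case)
            seen.add(case["prompt"])
    return merged
```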
Runbooks and incident drills for AI workflows
- Incidents move faster when on-call engineers have task-specific runbooks ready.
- Drills reveal missing ownership paths and weak monitoring assumptions early.
- I keep rollback, communication, and validation steps in one shared incident template.
Why it matters: Prepared response paths reduce downtime and decision paralysis during production failures.
My take:
- If the team has never practiced an incident, response quality will be inconsistent.
- Runbooks are living assets that should evolve with architecture changes.
Reality check: The worst time to define process is during a live outage.
Builder move: Schedule quarterly AI incident drills and update runbooks with concrete lessons after each exercise.
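One way to keep rollback, communication, and validation in a single shared template is to make the template a typed object and let a drill fail loudly when a section is empty. This is a sketch of that idea; the structure and field names are mine, not a standard.

```python
from dataclasses import dataclass

@dataclass
class Runbook:
    """Shared incident template: all three sections must be filled before a drill passes."""
    rollback_steps: list[str]
    communication_steps: list[str]
    validation_steps: list[str]

    def missing_sections(self) -> list[str]:
        """Names of empty sections, so drills can flag gaps before a real outage does."""
        sections = [
            ("rollback", self.rollback_steps),
            ("communication", self.communication_steps),
            ("validation", self.validation_steps),
        ]
        return [name for name, steps in sections if not steps]
```

Running `missing_sections()` as part of the quarterly drill turns "the worst time to define process is during a live outage" into a checkable assertion.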
CLI-first workflows keep AI delivery reproducible
- I keep generation, evaluation, and release actions in scripts so anyone can run the same steps.
- Task runners reduce tribal knowledge and remove manual sequencing errors.
- CLI interfaces are easier to validate in CI than ad-hoc notebook workflows.
Why it matters: Repeatable command paths reduce operational drift between individual developers and CI systems.
My take:
- If a workflow cannot be run from the terminal, I do not consider it production ready.
- Convenience clicks are fine for exploration but fragile for delivery.
Reality check: Manually sequenced, click-driven workflows fail fastest during incidents.
Builder move: Wrap core AI workflows in scripted commands and gate releases through those commands in CI.
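A minimal sketch of that entry point using argparse. The `aiops` program name and the generate/evaluate/release subcommands are hypothetical; the point is that the same parser drives a developer's terminal and the CI job, so there is no manual sequencing to drift.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Single scripted entry point for generation, evaluation, and release."""
    parser = argparse.ArgumentParser(prog="aiops")
    sub = parser.add_subparsers(dest="command", required=True)
    sub.add_parser("generate", help="run generation against the current config")
    eval_cmd = sub.add_parser("evaluate", help="run the regression eval suite")
    eval_cmd.add_argument("--suite", default="regression", help="which eval suite to run")
    sub.add_parser("release", help="gate and publish a release")
    return parser
```

CI then invokes the same commands (`aiops evaluate`, `aiops release`) it validates, which is what makes the path reproducible between developers and pipelines.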
Action items
- Ship one production-hardening improvement from "Permission boundaries are core architecture" in the next sprint and measure its reliability impact.
- Add a CI quality gate inspired by "Dependency supply chain risk is underrated" so regressions fail before deployment.
- Operationalize "Regression datasets need refresh cadence" with a written runbook and ownership assigned to one engineer this week.
I build pragmatic, Python-driven automation systems. If your team is serious about shipping AI reliably, let's talk.