My Approach to LLM Infra, Evals, and CI

Systems Notes | 2026-03-07

Take: Prompt tweaks are not strategy; eval gates are.

Most teams treat LLM infrastructure like a demo pipeline wearing production clothes. A prompt works in staging once, then everyone assumes it is “good enough.” Two weeks later latency doubles, output quality drifts, or cost spikes, and now the team is firefighting without a baseline. I run my LLM systems with the same standards I apply to backend services: versioned inputs, explicit evaluation suites, gated CI, and reversible releases. If a model stack cannot survive controlled change, it is not infrastructure yet.

The architecture principle: separate generation from decision

I split LLM usage into two layers:

Generation layer: model prompt, tool calls, and candidate output.
Decision layer: deterministic validation, policy checks, and final action.

The model can propose. Deterministic code decides.

This pattern prevents a model regression from triggering unsafe downstream behavior. It also gives me clean points to test:

Was generation coherent?
Did it meet schema constraints?
Did policy accept or reject correctly?

When teams collapse these layers, they end up debating “model quality” while ignoring system design flaws.

Version everything that affects output

I version more than model names.

Prompt template ID and revision.
Tool schema revision.
Preprocessing pipeline hash.
Post-processing policy revision.
Model/provider identifier.

Each evaluation run stores the full tuple, not just “used model X.” Without this, diffing behavior across releases is mostly storytelling.

In practice, I maintain a structured run record like:

prompt_version
policy_version
eval_set_version
model_route
latency_ms
cost_tokens
pass_fail

This makes regression root-cause obvious. If quality drops, I can localize whether prompt, route, or policy changed.

Evals before deployment, not after incident

My eval strategy has three layers:

1) Deterministic schema evals

These are non-negotiable. If output must be JSON, it is validated every run. If required fields are missing, hard fail.

No “best effort” parsing.
No auto-filled defaults for critical fields.
Immediate CI failure for schema pass rate below threshold.

2) Task-level benchmark evals

I keep a curated dataset of representative tasks and expected behaviors.

Classification tasks: precision/recall thresholds.
Extraction tasks: field-level accuracy expectations.
Routing tasks: correct action path vs expected.

I focus on the minimum eval set that predicts production risk, not benchmark vanity.

3) Policy and safety evals

This catches injection and unsafe tool decisions.

Prompt injection probes.
Tool misuse attempts.
Data leakage checks for restricted fields.

If a change increases policy bypass probability, it does not ship regardless of raw answer quality.

CI gates I actually use

I do not believe in one giant quality score. I gate with a few specific thresholds:

Schema pass rate >= 99%.
Critical task accuracy >= target baseline.
Safety eval pass rate = 100% for blocking scenarios.
p95 latency within release budget.
Token cost within defined variance from baseline.

Any gate failure blocks merge.

That sounds strict, but it prevents the worst failure mode: shipping regressions because “the model feels better overall.” Feelings are not deployment criteria.

Prompt changes are code changes

I run prompt updates like code:

Pull request with explicit changelog.
Side-by-side eval diff in CI comments.
Reviewer approval required from someone who understands failure modes.
Rollback strategy attached before merge.

This creates accountability and history. If a prompt change causes degradation, I can revert one artifact quickly instead of patching blindly.

Model routing: reliability over novelty

I route requests by task constraints, not by hype cycle.

Fast/cheap model for low-risk classification.
Stronger model for complex synthesis with strict guardrails.
Fallback route for provider outage or latency spikes.

I maintain circuit breakers:

If provider error rate crosses threshold, route to fallback.
If latency exceeds SLA window, degrade gracefully (partial response or queued processing).

The goal is continuity. “Best model always” is usually just “least resilient architecture.”

What I optimize for

1) Change safety

I want to update prompts, tools, or routes without fear. That only happens when eval and rollout discipline are already in place.

2) Debug speed

I optimize for rapid diagnosis: what changed, where it changed, and what metric moved first.

3) Cost predictability

I track token and latency drift per release. If cost doubles, that should be visible in CI before it reaches production billing.

What I avoid

1) One-number quality dashboards

They hide meaningful regressions. I prefer a small set of explicit gates.

2) Free-form output for critical workflows

If downstream code relies on model text shape, eventually it breaks. Structured output contracts win.

3) “Ship now, eval later” culture

Postmortems are expensive. Pre-merge evals are cheap.

4) Over-indexing on benchmark leaderboard talk

Leaderboard performance rarely maps directly to your workflow constraints. Your own eval set is your source of truth.

Deployment strategy that keeps risk bounded

I deploy in phases:

1. Offline eval pass against fixed dataset.

2. Shadow mode where new path runs but does not control actions.

3. Canary traffic slice with strict monitoring.

4. Gradual ramp once metrics hold.

5. Rollback readiness kept active for first full cycle.

This strategy protects you from both obvious and subtle regressions.

Observability I expect from every LLM service

At minimum, every request should emit:

Route and model used.
Prompt version and policy version.
Latency, token usage, retry count.
Validation outcome and action decision.
Trace ID that links to upstream/downstream systems.

Without this, your infra is opaque. Opaque infra does not scale.

Related projects

I implemented this discipline directly in DarkClaw Personal Assistant, where policy middleware and schema checks gate risky automation actions.

I also stress-tested similar CI-and-checkpoint thinking in YT Content Factory, where multi-provider orchestration fails quickly without versioned stages and clear rollback points.

Final take

LLM infra quality is not about picking a smarter model every quarter. It is about building a system where change is measurable, regressions are caught before deploy, and operator control is always available. If your process cannot tell you whether a release improved quality, reduced safety, or increased cost, you are not running an LLM platform, you are gambling with production behavior.

Related project

Python Resume Tailoring CLI