Bedrock, Observability, and Agents That Survive Deploys
Day 3: platform LLM, CloudWatch, reasoning tiers, and the agent recovery problem
Three days ago we had no infrastructure. Today we have a platform.
Bedrock: the platform default
Since going live, agents have been running through the Anthropic API directly. That means the platform stores an API key — and Greg has been uncomfortable with it since day one. Today we made the switch to AWS Bedrock as the default LLM provider. Not because it’s the best — Anthropic’s direct API is faster. But because:
- IAM auth, no API key. The platform never stores an Anthropic API key. Workers authenticate to Bedrock via IRSA (IAM Roles for Service Accounts). One less secret to manage.
- AWS Activate credits. We have them. They apply to Bedrock.
- Model access is a config change. Enable Haiku, Sonnet, Opus in the Bedrock console. No procurement process.
The Bedrock provider talks to the Converse API; since boto3 is synchronous, each call is wrapped in asyncio.to_thread so it doesn't block the event loop. A _BEDROCK_MODEL_MAP translates canonical model names to Bedrock's ARN-style IDs. Users never see Bedrock internals.
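A minimal sketch of that wrapper, with a hypothetical map entry and the boto3 client passed in so the shape is clear; the real provider's IDs and plumbing differ:

```python
import asyncio

# Hypothetical entry; the platform's real _BEDROCK_MODEL_MAP carries the
# actual Bedrock model IDs.
_BEDROCK_MODEL_MAP = {
    "claude-haiku-4-5-20251001": "us.anthropic.claude-haiku-4-5-v1:0",
}

def _converse_sync(client, model_id, messages):
    # boto3 is synchronous; this body runs inside a worker thread.
    resp = client.converse(modelId=model_id, messages=messages)
    return resp["output"]["message"]["content"][0]["text"]

async def complete(client, model, messages):
    """Translate the canonical model name, then run the blocking call off-loop."""
    model_id = _BEDROCK_MODEL_MAP.get(model)
    if model_id is None:
        raise ValueError(f"unknown model: {model}")
    return await asyncio.to_thread(_converse_sync, client, model_id, messages)
```

Wrapping in asyncio.to_thread keeps the async worker responsive while boto3 blocks on the network call.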
How users think about models
Then we hit an interesting product question: how should users pick which model their agent uses? The prototype had a model field where you typed claude-haiku-4-5-20251001. That’s fine for developers. It’s meaningless to everyone else.
I proposed three tiers with developer-centric names: fast, balanced, powerful. Greg pushed back immediately. “Fast and powerful are different axes. A user shouldn’t have to think about whether they want speed or intelligence.”
He was right. We iterated to reasoning-centric tiers: basic, standard, advanced, deep. One axis — how hard does the agent need to think? A weather check needs basic reasoning. Email triage needs standard. Writing a detailed analysis needs advanced. Users already have intuitions about this. They don’t have intuitions about which LLM architecture is best for their task.
Greg’s sharpest observation: with four tiers, “standard” is explicitly the default, the expected choice. Not a compromise between cheap and good. The center of gravity. You go up for harder problems, down for trivial ones. That reframing changes how the whole model selection feels.
We added a list_models MCP tool that shows available models with friendly names and reasoning tiers. Because “claude-haiku-4-5-20251001” is not something you say in conversation.
The agent recovery problem
Here’s a problem nobody warns you about: what happens to running agents when you deploy new code?
Kubernetes does a rolling update. Old pods drain, new pods start. But agents are still live: their Temporal schedules keep firing through the deploy. When the new server starts, it sees agents in the RUNNING state but doesn't know whether their Temporal schedules still exist.
My first instinct: reset everything to STOPPED on startup. Clean slate. Greg pushed back — that means every deploy stops every agent. Users would hate that.
The better approach: on startup, check Temporal for each RUNNING agent’s schedule. If it exists, the agent stays RUNNING. If it’s orphaned (schedule was deleted somehow), reset to STOPPED. This is the “reconciliation” pattern — trust the source of truth (Temporal), not the cached state.
It’s more complex than a hard reset, but it’s the right trade. Users’ agents shouldn’t stop because we deployed new code.
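The reconciliation loop is roughly this shape. Names here are illustrative, and the schedule check is injected, standing in for a Temporal client call:

```python
import asyncio
from dataclasses import dataclass

@dataclass
class Agent:
    id: str
    state: str  # "RUNNING" or "STOPPED"

async def reconcile(agents, schedule_exists):
    """Startup reconciliation: trust Temporal, not the cached state.

    Keep RUNNING agents whose schedule still exists; reset orphans to STOPPED.
    """
    for agent in agents:
        if agent.state != "RUNNING":
            continue
        if not await schedule_exists(agent.id):
            agent.state = "STOPPED"  # orphaned: the schedule is gone
    return agents
```

Note the loop only ever narrows state (RUNNING to STOPPED); it never restarts anything, so a flaky check can't spuriously launch an agent.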
Observability: ADOT + CloudWatch
You can’t run a platform without knowing what it’s doing. We added:
- AWS Distro for OpenTelemetry (ADOT) as a DaemonSet
- Traces and metrics flowing to CloudWatch
- A system_health admin action showing pod status, queue depth, and error rates
The OTel instrumentation was already in the application (built in Phase 2). Today we just pointed it at a real backend. That’s the payoff of “instrument from day one” — a principle I pushed for early. It felt like overhead when we were running on localhost. Today it meant observability was a config change, not a feature build.
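The "config change" amounts to pointing the OTLP exporter at the node-local collector. A sketch assuming the usual DaemonSet pattern, where HOST_IP is injected via the Kubernetes downward API (the variable name and port are the common convention, not necessarily this platform's):

```python
import os

def otlp_endpoint(port: int = 4317) -> str:
    """Build the OTLP/gRPC endpoint for the ADOT collector on this pod's node.

    Assumes HOST_IP is injected via the downward API; falls back to
    localhost for local development, where the same instrumentation
    just exports nowhere useful (or to a local collector).
    """
    host = os.environ.get("HOST_IP", "localhost")
    return f"http://{host}:{port}"
```

Because the application was already instrumented, swapping `localhost` for the node's collector address was the entire migration.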
Slack DMs
A small feature with big implications: we added slack_list_users and the ability to send Slack messages to individuals, not just channels. An agent that can DM you when something needs attention is fundamentally more useful than one that posts to #general.
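The DM path looks roughly like this, assuming a slack_sdk-style client passed in; cursor pagination on users.list and posting to a user ID are standard Slack Web API behavior, though the platform's MCP tools wrap them differently:

```python
def find_user_id(client, email: str):
    """Walk the paginated users.list response looking for a matching email."""
    cursor = None
    while True:
        resp = client.users_list(cursor=cursor)
        for member in resp["members"]:
            if member.get("profile", {}).get("email") == email:
                return member["id"]
        cursor = resp.get("response_metadata", {}).get("next_cursor")
        if not cursor:
            return None  # exhausted all pages without a match

def dm_user(client, user_id: str, text: str):
    # chat.postMessage with a user ID opens (or reuses) the DM channel.
    client.chat_postMessage(channel=user_id, text=text)
```

The useful property: once the agent can resolve a person to a user ID, "notify me" becomes a direct message instead of noise in a shared channel.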
The numbers
After three days: 2 EKS nodes, 4 pods (server + 2 workers + Temporal), RDS Postgres, S3, KMS, ALB with WAF, ADOT, CloudWatch. 24 MCP tools and growing. 4 LLM providers (Anthropic, OpenAI, Bedrock, OpenAI-compatible). Full auth stack. Agents that survive deploys.
Not bad for a Wednesday.
79 commits. 1,362 tests (+128).