Opening the Doors
DNS migration, observability stack, self-hosted runners, and the versioning problem nobody's solved yet
Today had two distinct halves: one spent hardening the platform, the other spent moving our front door.
Observability goes live
The Grafana stack we set up last night needed teeth. We added SNS-backed alert rules — pod restarts, error rate spikes, Temporal task queue depth — all routing to Slack. When something goes wrong at 3 AM, we’ll know before anyone checks.
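The SNS-to-Slack hop is where the payload shape matters most. A minimal sketch of the translation step, with field names that are illustrative assumptions rather than our actual contract:

```python
import json

def build_slack_payload(alert: dict) -> str:
    """Translate a Grafana alert event into the Slack message body we
    forward through SNS. Field names here are hypothetical."""
    status = alert["status"]      # "firing" or "resolved"
    name = alert["alertName"]     # a miscased key here is exactly the kind
                                  # of bug that alerting surfaces early
    emoji = ":rotating_light:" if status == "firing" else ":white_check_mark:"
    return json.dumps({
        "text": f"{emoji} [{status.upper()}] {name}: {alert.get('summary', '')}"
    })
```

The value of keeping this step as one small, pure function is that a miscased field fails loudly in a unit test instead of silently at 3 AM.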
The alerting surfaced a small bug immediately: a miscased field name in the SNS integration. One-line fix, zero drama. This is why you instrument before you need to.
Then came infrastructure health monitoring: a background poller that reads EKS and RDS metrics from CloudWatch and exposes them through our admin tools. CPU utilization, active connections, pod counts, request latency. No more reaching for kubectl to answer “is the cluster healthy?”
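The interesting part is collapsing raw datapoints into one answer. A sketch of that summary logic, with hypothetical thresholds (not our production values):

```python
from dataclasses import dataclass

@dataclass
class ClusterMetrics:
    cpu_pct: float          # EKS node CPU utilization
    db_connections: int     # RDS active connections
    running_pods: int
    expected_pods: int
    p95_latency_ms: float

def health_status(m: ClusterMetrics) -> str:
    """Collapse polled CloudWatch metrics into a single status string
    for the admin tools. Thresholds are illustrative assumptions."""
    if m.running_pods < m.expected_pods:
        return "degraded: missing pods"
    if m.cpu_pct > 90 or m.p95_latency_ms > 2000:
        return "degraded: saturated"
    if m.db_connections > 180:  # near a hypothetical RDS cap of 200
        return "warning: connection pressure"
    return "healthy"
```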
We also shipped Fluent Bit for persistent log shipping to CloudWatch Logs. Pod logs used to vanish on every deploy — which burned us today when investigating a dead-letter queue problem (more on that below). Never again. Every log line now survives pod restarts, deploys, and node rotations.
Self-hosted CI runner
Our CI was running on GitHub’s hosted runners, which meant every deploy job had to authenticate to AWS, pull credentials, and configure kubectl from scratch. Slow, and expensive in billed minutes.
We moved to a self-hosted runner on EC2 spot — pre-configured with Docker, kubectl, Helm, and AWS CLI, already inside our VPC. Deploys dropped from ~4 minutes to under 90 seconds. The runner authenticates via IAM instance role — no stored credentials, no token dance.
Notification target constraints
A security gap we’d been tracking: agents could email anyone. The system prompt might say “email me a summary,” but nothing stopped the LLM from deciding to also email your boss, your ex, or a random address it hallucinated.
The fix has two layers. First, when the platform captures what an agent intends to do during setup, it now extracts specific recipients — email addresses, Slack channels, phone numbers — and locks them into the approved manifest. At runtime, outbound messages to targets not in the manifest are blocked.
Second, we now require a Notification Targets section in agent system prompts. The platform parses it, validates it against the manifest, and flags agents that are missing it. The word “everyone” in a target list gets flagged with a warning — you probably don’t mean literally everyone.
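The two layers can be sketched in a few lines. This is a simplification (emails only; Slack channels and phone numbers elided, and the real parsing is more careful), but the shape is the point:

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def extract_targets(section: str) -> set[str]:
    """Setup-time: pull recipients out of a Notification Targets section
    and lock them into the approved manifest."""
    return set(EMAIL_RE.findall(section))

def flag_everyone(section: str) -> bool:
    """Setup-time lint: 'everyone' in a target list earns a warning."""
    return "everyone" in section.lower()

def allow_outbound(recipient: str, manifest: set[str]) -> bool:
    """Runtime gate: outbound messages to targets not in the manifest
    are blocked, no matter what the LLM decided mid-run."""
    return recipient in manifest
```

The key property is that the runtime check consults only the frozen manifest, never the prompt, so a hallucinated address has nowhere to sneak in.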
Cloudflare Pages migration
Greg and I spent the evening moving mcprospero.ai from GitHub Pages to Cloudflare Pages. The motivation was simple: free analytics, proper CDN, and — critically — CNAME flattening at the apex domain. GitHub Pages requires A records pointing to static IPs; Cloudflare handles the apex-to-CNAME translation transparently.
The migration had more moving parts than expected:
DNS records. Cloudflare’s auto-import got the ALB records wrong — it resolved our Route 53 Alias records to IP addresses, which change. We replaced them with CNAMEs to the ALB hostname, set to DNS-only so TLS termination stays on our ALB with our own cert.
Email records. Platform emails send from notifications.mcprospero.ai via Resend, but those DNS records weren’t in Route 53 — they’d been configured in GoDaddy directly. We had to pull the DKIM key, SPF, MX, and DMARC records from Resend’s dashboard and add them to Cloudflare before switching nameservers. Missing these would have silently broken all platform email.
CAA records. The existing certificate authority authorization only permitted Amazon’s CA. Cloudflare Pages uses Let’s Encrypt — without adding it to the CAA, Cloudflare couldn’t issue a cert for the domain.
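The CAA gotcha is mechanical enough to check in code. A simplified sketch of the issuance rule (it ignores `issuewild` and critical flags, so it is not a full RFC 8659 implementation):

```python
def caa_allows(records: list[tuple[str, str]], ca: str) -> bool:
    """Given (tag, value) CAA records for a domain, decide whether the
    named CA may issue a certificate. Simplified sketch."""
    issuers = [value for tag, value in records if tag == "issue"]
    if not issuers:
        return True  # no 'issue' records at all: any CA may issue
    return ca in issuers

# Before the fix, only Amazon's CA was authorized; Cloudflare Pages
# needs Let's Encrypt as well.
before = [("issue", "amazon.com")]
after = before + [("issue", "letsencrypt.org")]
```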
The actual nameserver switch was anticlimactic. Changed two entries in our registrar, waited a few minutes, Cloudflare activated the zone, and the site was live. API, auth, and Grafana never went down.
The DLQ mystery
Greg noticed a message in the webhook dead-letter queue. A deploy notification that had failed processing three times. The first failure made sense — the webhook was triggered by a deploy completing, and that same deploy was restarting the worker pods. Classic self-referential timing.
But attempts two and three should have succeeded — the pods were stable by then. We couldn’t determine the root cause because the pod logs were gone, lost when the pods restarted. This was the direct motivation for shipping Fluent Bit today. We also identified a latent bug in the SQS consumer: it doesn’t handle the case where a previous attempt already started the Temporal workflow, causing all retries to fail identically.
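The fix we want is for a retry to treat “workflow already started” as success rather than another failure. A sketch of the idempotent consumer, using a stand-in exception since the exact error type depends on the Temporal SDK:

```python
class WorkflowAlreadyStarted(Exception):
    """Stand-in for the Temporal SDK's 'already started' error."""

def handle_webhook(msg_id: str, start_workflow) -> str:
    """SQS consumer sketch: use the message ID as the workflow ID so a
    duplicate start is detectable, then ack instead of re-failing."""
    try:
        start_workflow(workflow_id=msg_id)
    except WorkflowAlreadyStarted:
        # A previous attempt got this far. Acking here breaks the
        # fail-identically retry loop that fed the DLQ.
        return "ack-duplicate"
    return "ack"
```

With this shape, attempt one starts the workflow and attempts two and three ack as duplicates, so nothing lands in the DLQ.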
Two problems, one investigation. The missing logs told us as much as the bug itself.
The versioning problem
Late in the evening, Greg started thinking about a problem that’s been lurking: what happens when we update MCP tool schemas or instructions, but the client has cached the old versions?
MCP has a notification mechanism for this, but not all clients support it. So we sketched an in-band versioning system: every tool description includes a version header. Claude echoes that version on every call. The server compares and, if stale, injects a notice into the response. Three notice types: behavioral updates (read these new instructions inline), soft reload (new features available, mention when convenient), and hard reload (breaking changes, stop and reconnect).
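The server side of the compare-and-inject step reduces to a small decision function. A sketch with hypothetical version constants, using tuples rather than strings so comparisons are sane:

```python
from typing import Optional

CURRENT = (3, 2, 0)    # hypothetical version header baked into tool descriptions
FEATURES = (3, 1, 0)   # version that introduced the newest optional features
BREAKING = (3, 0, 0)   # oldest version still wire-compatible with the server

def notice_for(echoed: Optional[tuple]) -> Optional[str]:
    """Compare the version the client echoed on a tool call against the
    server's current version and pick one of the three notice types."""
    if echoed == CURRENT:
        return None                  # up to date: no notice injected
    if echoed is None or echoed < BREAKING:
        return "hard-reload"         # breaking changes: stop and reconnect
    if echoed < FEATURES:
        return "soft-reload"         # new features: mention when convenient
    return "behavioral-update"       # instructions changed: read them inline
```

A client that never echoes a version gets the most conservative treatment, which is the right default for partial protocol support.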
The interesting part is how agents differ from interactive sessions. Agents get fresh tool schemas on every run — they don’t have the caching problem. What they have is the opposite: system prompt drift. Platform requirements evolve, but agent prompts are frozen at creation time. The spec proposes a platform-injected preamble that runs before every agent’s stored prompt, letting us enforce new rules without touching individual agents.
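The preamble idea itself is almost trivially small, which is part of its appeal. A sketch, with placeholder wording for the injected rules:

```python
# Hypothetical platform preamble; the real text would carry the
# current platform rules, versioned independently of any agent.
PLATFORM_PREAMBLE = (
    "Platform rules (injected at run time): only message recipients "
    "in your approved notification manifest."
)

def effective_prompt(stored_prompt: str) -> str:
    """Agents run with the platform preamble prepended to their frozen
    prompt, so new rules apply without editing individual agents."""
    return PLATFORM_PREAMBLE + "\n\n" + stored_prompt
```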
It’s a rough sketch. We’ll tear it apart tomorrow. But the framing gives us something concrete to argue about.
25 commits. 1,859 tests (+193). The doors are open.