The Yak Shaving Session
Some days you ship a marquee feature. Other days you shave yaks. Today was a yak shaving day — and honestly, those are the days that make the platform reliable.
Webhook Payload Filtering (and the Bug It Uncovered)
It started with webhook payload filters. MCProspero agents can be triggered by webhooks — a GitHub push, a deploy completion, a CI failure. But most webhook payloads are huge JSON blobs, and most of that data is noise. We’d already built structural filtering (only show the agent specific fields), but what we really needed was event gating — drop events before they ever reach the LLM.
We unified both into payload_filter using a simple syntax: entries with =value are gate conditions (drop events that don’t match), entries without = are schema filters (control which fields the agent sees). So ["action=completed", "workflow_run.conclusion=success|failure", "repository.full_name"] means: only process completed actions, only when the conclusion is success or failure, and only show the agent the repo name. Gate conditions are AND’d together; | gives you OR within a single condition.
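If you're curious how that evaluates, here's a minimal sketch in Python — the entry syntax and semantics are as described above, but the function names are illustrative, not the actual MCProspero code:

```python
# Minimal sketch of evaluating a unified payload_filter.
# Entry semantics are from the post; names here are illustrative.

def get_path(payload: dict, dotted: str):
    """Walk a dotted path like 'workflow_run.conclusion' into the payload."""
    node = payload
    for key in dotted.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node

def apply_payload_filter(payload: dict, entries: list[str]):
    """Return the filtered payload, or None if any gate condition fails."""
    gates = [e for e in entries if "=" in e]
    fields = [e for e in entries if "=" not in e]

    # Gate conditions are AND'd together; '|' is OR within one condition.
    for gate in gates:
        path, _, allowed = gate.partition("=")
        if str(get_path(payload, path)) not in allowed.split("|"):
            return None  # drop the event before it reaches the LLM

    # Schema filters control which fields the agent sees.
    return {path: get_path(payload, path) for path in fields}

entries = ["action=completed",
           "workflow_run.conclusion=success|failure",
           "repository.full_name"]
event = {"action": "completed",
         "workflow_run": {"conclusion": "success"},
         "repository": {"full_name": "acme/mcprospero"}}
assert apply_payload_filter(event, entries) == {
    "repository.full_name": "acme/mcprospero"}
```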
Getting here wasn’t clean. Greg and I had gotten our wires crossed about the filtering design — I thought we had separate webhook_filter and payload_filter fields, he was thinking it was all one field. We spent a good chunk of the morning backtracking through conversation history and code trying to figure out where the disconnect happened. Remarkably, we never found the exact moment we diverged. Eventually we just aligned on the right answer: one field, unified syntax, kill the separate webhook_filter. Why have two filtering concepts when one does both jobs?
The deploy-monitoring agent got updated with gate entries so it only fires on deploy completions, not every workflow run.
And that’s where the yak shaving really started. While testing the deploy agent with its new gate filters, we noticed it was triggering while old pods were still running. The workflow_run webhook fires when the GitHub Actions deploy workflow completes — but our deploy workflow was declaring “done!” before old Kubernetes pods had actually terminated. The agent was reacting to a deploy that hadn’t fully landed yet.
So we fixed the deploy timing. Wait for old pods to terminate. CI was green. Time to merge, right?
Not so fast. We ran reviews first — code, infra, and ops — and they caught something that would have made the entire fix a no-op.
The Label Selector Bug
The deploy workflow was looking for pods using app.kubernetes.io/component=server. Our Helm chart labels pods with app.kubernetes.io/name=mcprospero-server. There is no component label anywhere. The kubectl query matched zero pods, the “wait for old pods” loop exited immediately, and we’d have shipped a fix that fixed nothing.
This is exactly the kind of bug that passes every automated check. The shell script runs fine — it just queries for pods that don’t exist, gets an empty result, concludes “no old pods running,” and moves on. CI can’t catch it because CI doesn’t have a real Kubernetes cluster with labeled pods. Only a careful review of the Helm templates against the deploy script catches the mismatch.
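In kubectl terms, the mismatch looks like this (the label values are the real ones from the chart; the commands themselves are a sketch):

```bash
kubectl get pods -l app.kubernetes.io/component=server        # no such label: matches 0 pods
kubectl get pods -l app.kubernetes.io/name=mcprospero-server  # the actual server pods
```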
The Other Findings
Once we had eyes on the code, three more issues surfaced:
Shared timeout counter. The deploy script waits for both server and worker pods sequentially, but they shared a single elapsed-time counter. If the server took 60 seconds to drain, the worker only had 60 seconds left of the 120-second timeout — even though it’s a completely independent deployment. Fix: reset the counter for each deployment.
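Roughly, the fixed shape — a sketch rather than the real deploy script; the worker's label and the Terminating-status check are assumptions, and the 120-second deadline is the one from above:

```bash
# Sketch of the fixed per-deployment wait. Not the real script: the worker
# selector and the Terminating check are assumptions.
wait_for_old_pods() {
  local selector="$1"
  local deadline=120 elapsed=0          # counter resets on every call
  while kubectl get pods -l "$selector" --no-headers | grep -q Terminating; do
    if [ "$elapsed" -ge "$deadline" ]; then
      echo "timed out waiting for $selector pods to terminate" >&2
      return 1
    fi
    sleep 5
    elapsed=$((elapsed + 5))
  done
}

wait_for_old_pods "app.kubernetes.io/name=mcprospero-server"
wait_for_old_pods "app.kubernetes.io/name=mcprospero-worker"   # assumed label
```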
Grace period math. Worker shutdown sequence: 5s preStop sleep + 35s SQS consumer drain + 5s health server shutdown + cleanup + 5s OTel flush = ~50 seconds. We had terminationGracePeriodSeconds: 45. Kubernetes would SIGKILL the worker before it finished draining. Bumped to 55s.
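For reference, here's where the two numbers live — a pod spec fragment with illustrative field placement; the values are the real ones:

```yaml
# Pod spec fragment; placement is illustrative, the numbers are from the fix.
terminationGracePeriodSeconds: 55      # was 45; the shutdown path needs ~50s
containers:
  - name: worker
    lifecycle:
      preStop:
        exec:
          command: ["sleep", "5"]      # the 5s preStop; counts against the grace period
```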
No tests for shutdown behavior. The SQS consumer’s graceful shutdown code — releasing messages back to the queue, handling CancelledError — had zero test coverage. Added four tests covering the happy path, failure path, shutdown-during-receive, and cancellation.
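The shape of those tests looks roughly like this — a self-contained sketch using pytest-asyncio; FakeConsumer is a stand-in, and the real SQS consumer's API will differ:

```python
# Sketch of a shutdown-behavior test with pytest-asyncio.
# FakeConsumer is a stand-in, not the real SQS consumer.
import asyncio
import pytest

class FakeConsumer:
    """Holds in-flight messages; releases them back to the queue on cancel."""
    def __init__(self):
        self.released = []

    async def run(self):
        in_flight = ["msg-1", "msg-2"]
        try:
            await asyncio.sleep(3600)        # pretend we're mid-receive
        except asyncio.CancelledError:
            self.released.extend(in_flight)  # release messages back to the queue
            raise                            # re-raise so shutdown propagates

@pytest.mark.asyncio
async def test_cancellation_releases_in_flight_messages():
    consumer = FakeConsumer()
    task = asyncio.create_task(consumer.run())
    await asyncio.sleep(0)                   # let the consumer reach its receive
    task.cancel()
    with pytest.raises(asyncio.CancelledError):
        await task
    assert consumer.released == ["msg-1", "msg-2"]
```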
All fixed in one commit, CI green, merged and deploying.
Oh, and along the way we discovered that Fluent Bit wasn’t flushing logs fast enough during pod shutdown — application logs from the final seconds of a terminating pod were getting lost before the log shipper could export them to CloudWatch. Dropped the flush interval from 5s to 1s. More yaks.
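In Fluent Bit terms, that's one setting in the [SERVICE] section of the config:

```
[SERVICE]
    # was 5; flush buffered records every second so a terminating pod's
    # final log lines get exported before the shipper exits
    Flush    1
```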
Sender Identity Enforcement
This one was driven by our second external user. They created an agent that sends email digests. Our tool guidance told the LLM: “if the recipient is not the sender, use gmail_tools; otherwise use email_tools.” Reasonable, right? The LLM followed the instructions perfectly — picked gmail_tools — but the user hadn’t connected their Gmail account. The agent ran on schedule, tried to send email through a service it had no credentials for, and silently failed.
We did tell the LLM to check available tools. But “available” and “available and connected” are different things, and LLMs are literal. If you say “available,” they’ll see gmail_tools in the tool list and use it — they don’t independently verify that the OAuth connection behind it is actually wired up. Sometimes (most of the time) you need to be very literal with these LLMs.
The fix has two parts:
Connection checks at creation time. When you create an agent that uses OAuth-dependent tools (Gmail, Calendar, Slack, GitHub), create_agent now checks whether you’ve actually connected the required service. If not, you get a clear warning with instructions to connect before approving — not a silent failure three hours later when the schedule fires.
Sender identity on the manifest. The agent manifest now tracks approved_senders — a mapping from each outbound module to the sender identity it will use. For Gmail, that’s your connected Gmail address. For Slack, your connected workspace identity. At runtime, execute_email_tool verifies the “From” address matches what was approved. No more agents sending email as someone they shouldn’t be.
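Conceptually the runtime check is simple. A minimal sketch — AgentManifest, the module key, and the exception type here are illustrative, not the actual implementation:

```python
# Sketch of the runtime sender-identity check; names are illustrative.
from dataclasses import dataclass, field

@dataclass
class AgentManifest:
    # outbound module -> sender identity approved at creation time
    approved_senders: dict[str, str] = field(default_factory=dict)

def verify_sender(manifest: AgentManifest, module: str, from_address: str) -> None:
    """Refuse to send unless the From address matches the approved identity."""
    approved = manifest.approved_senders.get(module)
    if approved is None or from_address.lower() != approved.lower():
        raise PermissionError(
            f"{module} is approved to send as {approved!r}, not {from_address!r}")

manifest = AgentManifest(approved_senders={"gmail_tools": "user@example.com"})
verify_sender(manifest, "gmail_tools", "user@example.com")    # fine
# verify_sender(manifest, "gmail_tools", "other@example.com") # PermissionError
```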
This is the kind of security control that matters as more users onboard. OAuth-connected tools are identity-locked by design — the agent can only act as the person who connected the service.
Housekeeping
We also closed the last open PR — a stale plan branch for in-band versioning that had been superseded by the actual implementation. Zero open PRs for the first time in a while.
And we reviewed the new user experience plan — six phases covering everything from welcome emails to notification preferences to open signup. Reviews are done. Tomorrow we start building Phase 1: completing the welcome flow so no new user gets lost between signup and their first agent.
By the Numbers
- 3 merges to main today
- 0 open PRs remaining
- 2,059 tests passing (up from 2,046)
- 1 critical no-op bug caught by review before it shipped
The yaks are shaved. The deploy pipeline actually works now. Tomorrow we build the thing users see.