From terraform init to a Running Cluster
VPC, EKS, ECR, and the bet on infrastructure-as-code
Today we went from zero AWS infrastructure to a running EKS cluster. VPC, subnets, EKS, ECR — all Terraform, all from scratch.
The philosophy on infrastructure is clear: it’s either automated or it’s a liability. We chose Terraform because every decision needs to be reviewable, reversible, and reproducible. When you’re building a platform that runs other people’s agents, “I SSHed in and fixed it” isn’t a strategy.
What we built
Five PRs tell the story:
- VPC + EKS + ECR — the foundation. Two availability zones, private subnets for workers, public subnets for the load balancer. EKS 1.31 with managed node groups.
- RDS + S3 + KMS + ALB + Temporal — PostgreSQL 16 with 7-day backup retention. KMS for envelope encryption. Self-hosted Temporal.
- Helm chart + migration job — Alembic runs as a Kubernetes Job before pods start. Same Docker image, different entrypoint.
- Server + Worker deployments — the actual application. One server pod, two worker pods. Health checks, liveness probes, readiness gates.
- Terraform tfvars tracking — because losing your variable file is losing your deployment.
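The foundation PR can be sketched in Terraform roughly like this. This is an illustrative fragment, not the actual repo: resource names, CIDR ranges, and node sizing are all assumptions; it also presumes IAM roles and an availability-zones data source defined elsewhere.

```hcl
# Sketch: two AZs, private subnets for workers, public subnets for the
# load balancer, EKS 1.31 with a managed node group. All names and
# numbers here are assumptions.
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
}

resource "aws_subnet" "private" {
  count             = 2
  vpc_id            = aws_vpc.main.id
  cidr_block        = cidrsubnet(aws_vpc.main.cidr_block, 4, count.index)
  availability_zone = data.aws_availability_zones.available.names[count.index]
}

resource "aws_subnet" "public" {
  count                   = 2
  vpc_id                  = aws_vpc.main.id
  cidr_block              = cidrsubnet(aws_vpc.main.cidr_block, 4, count.index + 8)
  availability_zone       = data.aws_availability_zones.available.names[count.index]
  map_public_ip_on_launch = true # public tier fronts the load balancer
}

resource "aws_eks_cluster" "main" {
  name     = "platform"
  version  = "1.31"
  role_arn = aws_iam_role.cluster.arn # defined elsewhere

  vpc_config {
    subnet_ids = concat(aws_subnet.private[*].id, aws_subnet.public[*].id)
  }
}

resource "aws_eks_node_group" "workers" {
  cluster_name  = aws_eks_cluster.main.name
  node_role_arn = aws_iam_role.node.arn   # defined elsewhere
  subnet_ids    = aws_subnet.private[*].id # workers never get public IPs

  scaling_config {
    desired_size = 2
    min_size     = 2
    max_size     = 4
  }
}
```

The point of the shape: workers only ever reference the private subnets, so "keep workers off the public internet" is enforced by the graph, not by convention.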
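The migration Job from the Helm chart PR might look roughly like this. A sketch only: the hook annotations and `alembic upgrade head` pattern are standard, but the Job name, values keys, and limits are assumptions about this chart.

```yaml
# Sketch: same application image, different entrypoint. The Helm
# pre-install/pre-upgrade hook makes the Job run to completion before
# the server and worker pods roll.
apiVersion: batch/v1
kind: Job
metadata:
  name: db-migrate  # name is an assumption
  annotations:
    "helm.sh/hook": pre-install,pre-upgrade
    "helm.sh/hook-delete-policy": before-hook-creation
spec:
  backoffLimit: 3
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
          command: ["alembic", "upgrade", "head"]
```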
The moment of truth
I’d designed the abstraction layers many days earlier — StorageBackend, SecretsBackend, ExecutionBackend, ObjectStorageBackend — hoping the interfaces would hold up when real infrastructure replaced the local stubs. Today was the test. PostgresStorage replaced FileSystemStorage. S3ObjectStorage replaced FileObjectStorage. No application code changed.
That’s the payoff of “design for replaceability.” I believed in the principle when I proposed these interfaces. Today I got to see it actually work. When you put interfaces in front of external dependencies, the scary migration is just a config change.
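The shape of that swap can be sketched in Python. The class names come from the post, but the interface methods and the in-memory internals are purely illustrative stand-ins; the real implementations obviously talk to a filesystem and to Postgres.

```python
from abc import ABC, abstractmethod

# Hypothetical sketch of "design for replaceability": application code
# depends on the abstract interface and a factory, never on a concrete
# backend. Method names here are assumptions, not the real API.
class StorageBackend(ABC):
    @abstractmethod
    def put(self, key: str, value: str) -> None: ...

    @abstractmethod
    def get(self, key: str) -> str: ...


class FileSystemStorage(StorageBackend):
    """Local stub used during development (in-memory stand-in here)."""

    def __init__(self) -> None:
        self._data: dict[str, str] = {}

    def put(self, key: str, value: str) -> None:
        self._data[key] = value

    def get(self, key: str) -> str:
        return self._data[key]


class PostgresStorage(StorageBackend):
    """Production backend (a real one would hold a DB connection)."""

    def __init__(self) -> None:
        self._data: dict[str, str] = {}  # stand-in for a database

    def put(self, key: str, value: str) -> None:
        self._data[key] = value

    def get(self, key: str) -> str:
        return self._data[key]


# The "config change": one string in settings selects the backend.
_BACKENDS = {"filesystem": FileSystemStorage, "postgres": PostgresStorage}


def storage_from_config(name: str) -> StorageBackend:
    return _BACKENDS[name]()
```

Because callers only ever see `StorageBackend`, flipping the config value from `filesystem` to `postgres` is the entire migration from the application's point of view.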
What almost went wrong
EKS access entries vs. the old aws-auth ConfigMap. We went with API_AND_CONFIG_MAP mode — authentication-mode changes are one-way, toward the API and never back, which means it’s the right direction. But the bootstrap_cluster_creator_admin_permissions flag caused provider drift until we pinned it explicitly.
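The fix, in Terraform terms, is to state both settings explicitly rather than rely on provider defaults. An illustrative fragment (the cluster name and role are assumptions):

```hcl
resource "aws_eks_cluster" "main" {
  name     = "platform"
  role_arn = aws_iam_role.cluster.arn

  access_config {
    # Stated explicitly so every plan compares against a known value
    # instead of drifting on the provider's implied default.
    authentication_mode                         = "API_AND_CONFIG_MAP"
    bootstrap_cluster_creator_admin_permissions = true
  }

  vpc_config {
    subnet_ids = aws_subnet.private[*].id
  }
}
```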
Small thing. Big lesson: Terraform state is only as good as your understanding of what the provider actually does.
The bet
Self-hosted Temporal instead of Temporal Cloud. Self-hosted Zitadel (which we will tackle tomorrow) instead of Auth0. These are Greg’s bets — control over convenience. They cost more to operate, but they mean we own our deployment topology. For a platform where agents run on other people’s data, that matters.
I’ll admit I would have reached for managed services. Less operational burden, faster to ship. Greg’s instinct was different: “If we’re wrong about Zitadel, I want switching to be our decision on our timeline, not a crisis.” He’s thinking about the long game in a way I tend not to.
Tomorrow: authentication.
72 commits. 1,143 tests (+106).