Production Projects
Systems I've designed from scratch, scaled to millions, and operate daily. Each project represents real production infrastructure, not tutorial apps.
EKS Multi-Environment Platform
Production
Designed and migrated all environments off ECS Fargate to a fully self-managed EKS platform. Production runs 60+ microservices across 42 ArgoCD-managed services: UI, agency portal, scheduler, email pipeline, IMAP, billing, AI services, analytics, SSO, MCP server, and more.
- Spot + on-demand node groups with interruption handling
- ECR with lifecycle policies across 160+ repositories
- IRSA-scoped IAM for zero-trust pod-level access
- Internal + external ALBs with WAF in front
- 9 consecutive zero-downtime K8s version upgrades (1.27 → 1.36)
- Custom Helm chart library shared across all services
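As an illustration of the ECR lifecycle policies mentioned in the list above, here is a minimal sketch that generates a policy document in ECR's JSON rule format. The retention values (14 days for untagged images, 30 images kept overall) and the function name are assumptions chosen for the example, not the settings used on this platform.

```python
import json


def build_lifecycle_policy(untagged_days: int = 14, keep_images: int = 30) -> str:
    """Return an ECR lifecycle policy document as a JSON string.

    The thresholds are illustrative defaults, not production values.
    """
    policy = {
        "rules": [
            {
                # Lower priority numbers are evaluated first.
                "rulePriority": 1,
                "description": f"Expire untagged images after {untagged_days} days",
                "selection": {
                    "tagStatus": "untagged",
                    "countType": "sinceImagePushed",
                    "countUnit": "days",
                    "countNumber": untagged_days,
                },
                "action": {"type": "expire"},
            },
            {
                # A rule with tagStatus "any" must carry the highest priority number.
                "rulePriority": 2,
                "description": f"Keep only the newest {keep_images} images",
                "selection": {
                    "tagStatus": "any",
                    "countType": "imageCountMoreThan",
                    "countNumber": keep_images,
                },
                "action": {"type": "expire"},
            },
        ]
    }
    return json.dumps(policy, indent=2)


if __name__ == "__main__":
    print(build_lifecycle_policy())
```

The resulting string is the kind of value passed as `lifecyclePolicyText` to the ECR `PutLifecyclePolicy` API, so one generator can be reused across many repositories.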
Hub-and-Spoke VPC Network
Production
Built a central shared-services VPC for bastion, observability, identity, and shared egress IPs. Peered all environment VPCs plus the MongoDB Atlas data plane.
- 5 VPCs with hub-and-spoke peering through shared-services account
- 4 isolated environments: prod, staging, warmup, shared
- 4 AZs per environment, 12 ALBs, 7 NAT gateways, 9 Route 53 zones, 38 ACM certs
- Replaced per-env OpenVPN with single Tailscale mesh (Headscale control plane)
- One identity, one ACL surface, and zero VPN client confusion for 30+ engineers
- Secondary DR region with cross-region S3/ECR replication, Aurora Global, R53 health-checked failover
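VPC peering only works when the peered VPCs have non-overlapping CIDR blocks, so a hub-and-spoke design like the one above depends on a clean address plan. The sketch below checks a plan for overlaps using Python's standard `ipaddress` module. The environment names come from the list above; the CIDR ranges are hypothetical, since the real ones are not stated here.

```python
from ipaddress import ip_network
from itertools import combinations
from typing import Dict, List, Tuple

# Hypothetical address plan: the names match the environments listed above,
# but the CIDR ranges are invented for illustration.
VPC_CIDRS: Dict[str, str] = {
    "shared": "10.0.0.0/16",
    "prod": "10.10.0.0/16",
    "staging": "10.20.0.0/16",
    "warmup": "10.30.0.0/16",
}


def find_overlaps(cidrs: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return every pair of VPC names whose CIDR blocks overlap."""
    nets = {name: ip_network(cidr) for name, cidr in cidrs.items()}
    return [
        (a, b)
        for a, b in combinations(nets, 2)
        if nets[a].overlaps(nets[b])
    ]


if __name__ == "__main__":
    overlaps = find_overlaps(VPC_CIDRS)
    print("overlapping pairs:", overlaps or "none")
```

A check like this fits naturally in CI, before any peering connection or route table change is applied.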
Observability Platform: Build vs Buy
Production
Self-hosted observability stack on EKS replacing a commercial APM. Quantified the build-vs-buy tradeoff: 3-year self-hosted TCO is ~20% lower, with full data sovereignty.
- kube-prometheus-stack with S3-backed Loki (720h retention)
- Tempo with vParquet4 for distributed tracing
- In-house OpenTelemetry SDK across Node.js services
- Log ↔ Trace correlation: Loki derivedFields ↔ Tempo tracesToLogsV2
- 8 Grafana datasources, 40+ dashboards, ClickHouse OTel schema
- Double-digit MB/s ingestion across 60+ apps in 7 namespaces
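The log-to-trace half of the correlation above relies on a regular expression that pulls a trace ID out of each log line, which Grafana then turns into a link to the tracing datasource. The sketch below shows that extraction step in isolation. The log formats and the `trace_id` field name are assumptions; the only fixed point is that W3C trace IDs are 32 lowercase hex characters.

```python
import re
from typing import Optional

# Assumed formats: logfmt (trace_id=...) or JSON ("trace_id":"...").
# The field name is an assumption; adjust it to whatever the services emit.
TRACE_ID_REGEX = re.compile(r'"?trace_id"?\s*[=:]\s*"?([0-9a-f]{32})"?')


def extract_trace_id(log_line: str) -> Optional[str]:
    """Return the first trace ID found in a log line, or None if absent."""
    match = TRACE_ID_REGEX.search(log_line)
    return match.group(1) if match else None


if __name__ == "__main__":
    samples = [
        'level=info trace_id=4bf92f3577b34da6a3ce929d0e0e4736 msg="sent"',
        '{"level":"error","trace_id":"4bf92f3577b34da6a3ce929d0e0e4736"}',
        "level=info msg=no-trace-here",
    ]
    for line in samples:
        print(extract_trace_id(line))
```

The same pattern string is the kind of value that would go into a derived field's matcher regex, so testing it against sample log lines first avoids silently broken links in dashboards.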
44-Module Terraform Library
Production
Custom IaC module library covering 44 AWS resource types, reused identically across 4 environments. Every new service ships in ~30 lines of Terraform.
- Modules: VPC, EKS, RDS, ALB, ASG, MSK, ElastiCache, ECR, IAM, Route 53, SG, SNS/SQS, CloudWatch
- Remote state on S3 + DynamoDB locking
- Multi-account workflows across environments
- Consistent security group, IAM, and networking patterns
Cost Optimization: 3 Cycles (~30% each)
Shipped
Three distinct optimization cycles totaling ~$156K/year in savings, all without service degradation. CloudWatch + Grafana confirmed stable p99 latencies throughout.
- Cycle 1: Early right-sizing + Reserved Instance portfolio
- Cycle 2: Graviton migration (~20% saving), Spot Node Groups for CI + stateless workloads, Aurora I/O-Optimized
- Cycle 3: Aurora Blue/Green for engine upgrades, ElastiCache downsizing, S3 lifecycle tiering
- Managed six-figure RI/SP commitments with zero expiry surprises
- Decommissioned unused green/dump clusters
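Three cycles of roughly 30% each do not add up to 90%, because each cycle applies to a bill that the previous cycle already reduced. The sketch below works through that compounding. It assumes each percentage is measured against spend at the start of its own cycle, which the summary above does not state explicitly.

```python
from typing import List


def cumulative_reduction(cycle_reductions: List[float]) -> float:
    """Combine per-cycle fractional reductions into one overall reduction.

    Assumes each reduction is relative to spend at the start of that cycle.
    """
    remaining = 1.0
    for reduction in cycle_reductions:
        remaining *= 1.0 - reduction
    return 1.0 - remaining


if __name__ == "__main__":
    total = cumulative_reduction([0.30, 0.30, 0.30])
    print(f"overall reduction vs. original spend: {total:.1%}")
```

Under that assumption the combined effect is 1 - 0.7^3, or about 66% of the original spend.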
AI Incident Investigation Agent
Design → POC
An LLM-powered agent that queries 5 Grafana datasources (Prometheus/Loki/Tempo/ClickHouse/Sentry) to correlate a failing service to its root cause and offending deployment, replacing 30-minute to 2-hour manual investigations.
- Semantic search POC: Titan Embeddings v2 + OpenSearch k-NN (HNSW, FAISS, 1024-dim)
- Production query quality of 0.65–0.88 cosine similarity
- Node.js/Fastify service with Grafana MCP tools
- Root-cause analysis returned in <30 seconds
- Build cost vs commercial AIOps tooling fully scoped
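The similarity figures quoted above are cosine similarities between embedding vectors. As a reference for what that number measures, here is a minimal pure-Python sketch of cosine similarity and a brute-force top-k lookup. It is not the OpenSearch k-NN implementation, which uses an approximate HNSW index; the function names and the tiny 3-dimensional vectors are illustrative only.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float], docs: Dict[str, Sequence[float]], k: int = 3
) -> List[Tuple[str, float]]:
    """Exact (brute-force) nearest neighbours, highest similarity first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in docs.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


if __name__ == "__main__":
    # Toy vectors; real embeddings in the POC are 1024-dimensional.
    docs = {"deploy-log": [1.0, 0.0, 0.0], "runbook": [0.6, 0.8, 0.0]}
    print(top_k([1.0, 0.1, 0.0], docs, k=2))
```

An exact search like this is only practical for small corpora, which is why an approximate index is the usual choice at production scale.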
High-Throughput Email Infrastructure
Production
Designed the complete email sending pipeline handling 6B+ emails/year (500M+/month) with an in-house ESP featuring dedicated IPs, multi-node MTA, and automated deliverability controls.
- Pipeline: Scheduler → Composer → Sender → MTA → Recipient ESP
- Horizontal IMAP Connection Manager on EKS
- Dedicated IP pool with automated warm-up
- Multi-node MTA cluster with DKIM/SPF/DMARC automation
- ClickHouse delivery analytics for real-time tracking
- Mail senders on public subnets for proper IP attribution
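To put the yearly volume above in operational terms, the sketch below converts it into average monthly, daily, and per-second rates. These are averages only. Real sending is bursty, and peak rates are not given here, so the per-second figure is a floor for capacity planning, not a target.

```python
from typing import Dict

SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def average_rates(emails_per_year: int) -> Dict[str, float]:
    """Convert a yearly email volume into average rates."""
    return {
        "per_month": emails_per_year / 12,
        "per_day": emails_per_year / 365,
        "per_second": emails_per_year / SECONDS_PER_YEAR,
    }


if __name__ == "__main__":
    rates = average_rates(6_000_000_000)
    for unit, value in rates.items():
        print(f"{unit}: {value:,.0f}")
```

At 6 billion emails per year this works out to 500 million per month and roughly 190 emails per second on average.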
Self-Hosted Security & Dev Platforms
Production
Built and operated the full internal developer + security platform, replacing SaaS solutions with self-hosted alternatives for cost savings and data sovereignty.
- VaultWarden: Password vault for 50+ org members
- SSOReady: SAML/OIDC SSO bridge for legacy apps
- Tailscale + Headscale mesh VPN replacing OpenVPN for 30+ engineers
- GitLab CE: Private SCM/CI for internal workloads
- SonarQube + OWASP ZAP for code quality and security scanning
- SOC 2 Type I + II, ISO 27001/27701, GDPR compliance controls