Architecture Portfolio

Production Projects

Systems I've designed from scratch, scaled to millions, and operate daily. Each project represents real production infrastructure, not tutorial apps.

⎈

EKS Multi-Environment Platform

Production

Designed and migrated all environments off ECS Fargate to a fully self-managed EKS platform. Production runs 60+ microservices across 42 ArgoCD-managed services: UI, agency portal, scheduler, email pipeline, IMAP, billing, AI services, analytics, SSO, MCP server, and more.

  • Spot + on-demand node groups with interruption handling
  • ECR with lifecycle policies across 160+ repositories
  • IRSA-scoped IAM for zero-trust pod-level access
  • Internal + external ALBs with WAF in front
  • 9 consecutive zero-downtime K8s version upgrades (1.27 → 1.36)
  • Custom Helm chart library shared across all services
EKS 1.36 · ArgoCD 3 · Helm 4 · Karpenter · IRSA · MSK · Aurora · OpenSearch · Redis
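As a rough check on the upgrade count above: EKS moves the control plane one minor version at a time, so going from 1.27 to 1.36 implies nine separate hops. The sketch below only enumerates those hops; the function name is illustrative and is not part of the platform's actual tooling.

```python
def upgrade_hops(start: str, end: str) -> list[tuple[str, str]]:
    """List each single-minor-version hop between two Kubernetes versions."""
    start_major, start_minor = (int(part) for part in start.split("."))
    end_major, end_minor = (int(part) for part in end.split("."))
    if start_major != end_major or end_minor < start_minor:
        raise ValueError("expected two versions in the same major line, start <= end")
    # One hop per minor version: 1.27 -> 1.28, 1.28 -> 1.29, and so on.
    return [
        (f"{start_major}.{minor}", f"{start_major}.{minor + 1}")
        for minor in range(start_minor, end_minor)
    ]


hops = upgrade_hops("1.27", "1.36")
print(len(hops))          # 9
print(hops[0], hops[-1])  # ('1.27', '1.28') ('1.35', '1.36')
```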
🌐

Hub-and-Spoke VPC Network

Production

Built a central shared-services VPC for bastion, observability, identity, and shared egress IPs. Peered all environment VPCs plus the MongoDB Atlas data plane.

  • 5 VPCs with hub-and-spoke peering through shared-services account
  • 4 isolated environments: prod, staging, warmup, shared
  • 4 AZs per environment, 12 ALBs, 7 NAT gateways, 9 Route 53 zones, 38 ACM certs
  • Replaced per-env OpenVPN with single Tailscale mesh (Headscale control plane)
  • One identity, one ACL surface: zero VPN client confusion for 30+ engineers
  • Secondary DR region with cross-region S3/ECR replication, Aurora Global, R53 health-checked failover
VPC Peering · NAT GW · Route 53 PHZ · Cloudflare · Tailscale · Headscale
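One way to see why hub-and-spoke keeps the peering surface small: with five VPCs and one of them acting as the hub, only four peering connections are needed, against ten for a full mesh. The sketch counts VPC-to-VPC links only and leaves out the MongoDB Atlas peering; the function names are illustrative.

```python
def hub_and_spoke_links(vpc_count: int) -> int:
    """Every spoke peers with the hub only, so links = VPCs - 1."""
    return max(vpc_count - 1, 0)


def full_mesh_links(vpc_count: int) -> int:
    """Every VPC peers with every other VPC: n * (n - 1) / 2 links."""
    return vpc_count * (vpc_count - 1) // 2


print(hub_and_spoke_links(5))  # 4
print(full_mesh_links(5))      # 10
```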
📊

Observability Platform: Build vs Buy

Production

Self-hosted observability stack on EKS replacing a commercial APM. Quantified the build-vs-buy tradeoff: 3-year self-hosted TCO is ~20% lower, with full data sovereignty.

  • kube-prometheus-stack with S3-backed Loki (720h retention)
  • Tempo with vParquet4 for distributed tracing
  • In-house OpenTelemetry SDK across Node.js services
  • Log↔Trace correlation: Loki derivedFields ↔ Tempo tracesToLogsV2
  • 8 Grafana datasources, 40+ dashboards, ClickHouse OTel schema
  • Double-digit MB/s ingestion across 60+ apps in 7 namespaces
Prometheus · Loki · Tempo · Grafana · ClickHouse · Sentry · OpenTelemetry
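The log-to-trace half of the correlation above relies on a regular expression that pulls a trace ID out of each log line, which Grafana then turns into a link to Tempo. A minimal sketch of that matching step follows. The JSON log shape and the `trace_id` field name are assumptions, since the actual derivedFields regex is not shown here; the 32-hex-character length matches OpenTelemetry trace IDs.

```python
import re
from typing import Optional

# Assumed log format: JSON lines with a "trace_id" field (hypothetical).
TRACE_ID_RE = re.compile(r'"trace_id"\s*:\s*"([0-9a-f]{32})"')


def extract_trace_id(log_line: str) -> Optional[str]:
    """Return the trace ID captured from a log line, or None if absent."""
    match = TRACE_ID_RE.search(log_line)
    return match.group(1) if match else None


line = '{"level":"error","msg":"timeout","trace_id":"4bf92f3577b34da6a3ce929d0e0e4736"}'
print(extract_trace_id(line))               # 4bf92f3577b34da6a3ce929d0e0e4736
print(extract_trace_id("plain text line"))  # None
```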
🔧

44-Module Terraform Library

Production

Custom IaC module library covering 44 AWS resource types, reused identically across 4 environments. Every new service ships in ~30 lines of Terraform.

  • Modules: VPC, EKS, RDS, ALB, ASG, MSK, ElastiCache, ECR, IAM, Route 53, SG, SNS/SQS, CloudWatch
  • Remote state on S3 + DynamoDB locking
  • Multi-account workflows across environments
  • Consistent security group, IAM, and networking patterns
Terraform · S3 · DynamoDB · Multi-Account · HCL
💰

Cost Optimization: 3 Cycles (~30% each)

Shipped

Three distinct optimization cycles totaling ~$156K/year in savings, all without service degradation. CloudWatch + Grafana confirmed stable p99 latencies throughout.

  • Cycle 1: Early right-sizing + Reserved Instance portfolio
  • Cycle 2: Graviton migration (~20% saving), Spot Node Groups for CI + stateless workloads, Aurora I/O-Optimized
  • Cycle 3: Aurora Blue/Green for engine upgrades, ElastiCache downsizing, S3 lifecycle tiering
  • Six-figure RI/SP commitments managed with zero expiry surprises
  • Decommissioned unused green/dump clusters
Graviton · Spot · Aurora I/O-Opt · RI/SP · S3 Lifecycle
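For a sense of how three cycles of roughly 30% add up, the sketch below compounds them. It assumes each cycle's percentage applies to the bill that remained after the previous cycle, which may not be how the figures were actually measured; under that assumption the cumulative reduction comes to about 66%, not 90%.

```python
def cumulative_reduction(cycle_savings: list[float]) -> float:
    """Compound per-cycle savings, each applied to the remaining spend."""
    remaining = 1.0
    for saving in cycle_savings:
        remaining *= 1.0 - saving
    return 1.0 - remaining


# Three cycles at ~30% each (assumed to compound on the remaining bill).
print(round(cumulative_reduction([0.30, 0.30, 0.30]), 3))  # 0.657
```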
🤖

AI Incident Investigation Agent

Design → POC

An LLM-powered agent that queries 5 Grafana datasources (Prometheus/Loki/Tempo/ClickHouse/Sentry) to correlate a failing service to its root cause and offending deployment, replacing 30-min to 2-hr manual investigations.

  • Semantic search POC: Titan Embeddings v2 + OpenSearch k-NN (HNSW, FAISS, 1024-dim)
  • Cosine similarity of 0.65–0.88 on production queries
  • Node.js/Fastify service with Grafana MCP tools
  • Root-cause analysis returned in <30 seconds
  • Build cost vs commercial AIOps tooling fully scoped
Bedrock · Titan Embed v2 · OpenSearch k-NN · Grafana MCP · Fastify
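The similarity scores quoted above are cosine similarities between a query embedding and a document embedding. A minimal pure-Python sketch of that metric is shown below, using 3-dimensional toy vectors in place of the 1024-dimensional embeddings; in the POC the scoring would be done by OpenSearch k-NN, not by application code like this.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real embeddings.
query = [1.0, 0.0, 1.0]
document = [1.0, 1.0, 1.0]
print(round(cosine_similarity(query, document), 3))  # 0.816
```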
📧

High-Throughput Email Infrastructure

Production

Designed the complete email sending pipeline handling 6B+ emails/year (500M+/month) with an in-house ESP featuring dedicated IPs, multi-node MTA, and automated deliverability controls.
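To put those volumes in per-second terms, the sketch below converts the yearly and monthly figures into an average send rate. It assumes a 365-day year and a 30-day month, and it gives averages only; real traffic is bursty, so peak rates would be higher.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_per_second(total_messages: int, days: int) -> float:
    """Average messages per second over a period of the given length."""
    return total_messages / (days * SECONDS_PER_DAY)


yearly_rate = average_per_second(6_000_000_000, 365)
monthly_rate = average_per_second(500_000_000, 30)
print(f"{yearly_rate:.0f} msg/s from the yearly figure")    # 190 msg/s
print(f"{monthly_rate:.0f} msg/s from the monthly figure")  # 193 msg/s
```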

  • Pipeline: Scheduler → Composer → Sender → MTA → Recipient ESP
  • Horizontal IMAP Connection Manager on EKS
  • Dedicated IP pool with automated warm-up
  • Multi-node MTA cluster with DKIM/SPF/DMARC automation
  • ClickHouse delivery analytics for real-time tracking
  • Mail senders on public subnets for proper IP attribution
EKS · ClickHouse · DKIM/SPF/DMARC · MSK · S3
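Part of the DKIM/SPF/DMARC automation mentioned above comes down to generating DNS TXT record values per sending domain. The sketch below builds SPF and DMARC values in their standard syntax. The helper names, the documentation-range IPs, and the example.com report address are all hypothetical, and DKIM key generation is left out.

```python
ALLOWED_DMARC_POLICIES = {"none", "quarantine", "reject"}


def spf_record(sending_ips: list[str]) -> str:
    """Build an SPF TXT value authorizing only the given IPv4 addresses."""
    parts = ["v=spf1", *(f"ip4:{ip}" for ip in sending_ips), "-all"]
    return " ".join(parts)


def dmarc_record(policy: str, report_address: str) -> str:
    """Build a DMARC TXT value with a policy and an aggregate-report address."""
    if policy not in ALLOWED_DMARC_POLICIES:
        raise ValueError(f"unsupported DMARC policy: {policy}")
    return f"v=DMARC1; p={policy}; rua=mailto:{report_address}"


# Placeholder values only (192.0.2.0/24 is reserved for documentation).
print(spf_record(["192.0.2.10", "192.0.2.11"]))
print(dmarc_record("reject", "dmarc-reports@example.com"))
```

The SPF value would be published on the sending domain itself and the DMARC value on its `_dmarc` subdomain.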
🛡️

Self-Hosted Security & Dev Platforms

Production

Built and operated the full internal developer + security platform, replacing SaaS solutions with self-hosted alternatives for cost savings and data sovereignty.

  • VaultWarden: Password vault for 50+ org members
  • SSOReady: SAML/OIDC SSO bridge for legacy apps
  • Tailscale + Headscale mesh VPN replacing OpenVPN for 30+ engineers
  • GitLab CE: Private SCM/CI for internal workloads
  • SonarQube + OWASP ZAP for code quality and security scanning
  • SOC 2 Type I + II, ISO 27001/27701, and GDPR compliance controls
VaultWarden · SSOReady · Headscale · GitLab CE · SonarQube