Everything engineering needs to understand Gravity Labs platform architecture - each layer, each decision, and how services, agents, workflow, and infrastructure connect in production.
Single-page application served from AKS. Communicates exclusively through the Nginx API gateway. Handles auth token storage, silent refresh, and role-based UI rendering.
Single ingress point. Routes /api/v1/auth/* → auth-service, /api/v1/labs/* → lab-service, etc. Terminates TLS. No business logic here.
Each business domain = one FastAPI microservice, independently deployable in AKS. All services use async SQLAlchemy + asyncpg to talk to their own database within the shared PostgreSQL instance.
Each AI agent is its own microservice/container in AKS. This enables independent horizontal scaling during events — Scoring agent scales to hundreds of pods while Lab Architect stays at 2.
Temporal orchestrates all long-running, multi-step lab lifecycle workflows. If a server crashes mid-provisioning, Temporal replays from the last checkpoint — no orphaned VMs, no missed license assignments.
RabbitMQ carries async events between services (event.started, lab.provisioned, score.ready). Valkey handles all stateful ephemeral data: JWT tokens, rate-limit counters, SSO state, session data.
One PostgreSQL 16 instance running in AKS. Each microservice owns its own database inside that instance (gravitylabs_auth, gravitylabs_tenant, gravitylabs_labs, etc.) — hard logical isolation with no cross-service queries allowed. Self-managed means this is portable to AWS/GCP as-is.
Terraform provisions the AKS cluster, ACR, Key Vault, networking. Helm packages each service. GitHub Actions builds images and pushes to ACR. Argo CD watches the Helm chart repo and syncs AKS — GitOps, cluster always matches Git.
Prometheus scrapes metrics from every pod. Grafana dashboards show lab provisioning queue depth, token burn rate, VM health, active sessions. Loki aggregates logs from all containers. All running inside AKS.
Services outside the core app layer. OSS-first to stay cloud-agnostic. SendGrid for transactional email (SMTP-compatible, swap to SES/Mailgun on any cloud). MinIO for object storage (S3-compatible API — runs in AKS, swap to Azure Blob/S3/GCS with config only). Cloudflare for CDN and DDoS protection.
React 18 + TypeScript, Tailwind CSS, Zustand (state), Axios (HTTP), React Hook Form + Zod (validation), Framer Motion (UI animation). Built with Vite. Served as static files from an Nginx pod inside AKS.
When a participant launches a lab, the React UI polls lab-service every 3 seconds via Axios. The Zustand useLabStore updates the status chip (Analyzing → Designing → Provisioning → Ready) without re-rendering the whole page — React 18's automatic batching keeps these updates cheap.
Nginx 1.25 running as a Kubernetes Deployment. Acts as the single entry point for all API traffic from the frontend. All service-to-service internal calls bypass it and go direct over the internal Kubernetes network.
Azure API Management / AWS API Gateway are costly at scale and create cloud lock-in. Nginx is free, runs in AKS, and the config is version-controlled in the same repo. Migrating clouds = copy one nginx.conf.
A participant clicks "Launch Lab." The React app calls POST /api/v1/labs/provision. Nginx routes it to lab-service. Lab-service validates the JWT by calling auth-service internally — directly, not through Nginx.
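For reference, a minimal sketch of that internal validation call as a FastAPI dependency. The cluster-internal hostname, the /internal/verify response shape, and the timeout are assumptions, not the real implementation:

```python
# Hypothetical sketch: lab-service asking auth-service to validate a JWT.
# Hostname, payload, and response fields are illustrative assumptions.
import httpx
from fastapi import Header, HTTPException

AUTH_SERVICE_URL = "http://auth-service.gl-prod.svc.cluster.local:8000"  # assumed internal DNS name


async def require_user(authorization: str = Header(...)) -> dict:
    """FastAPI dependency: forward the bearer token to auth-service for validation."""
    async with httpx.AsyncClient(base_url=AUTH_SERVICE_URL, timeout=2.0) as client:
        resp = await client.post(
            "/internal/verify",
            headers={"Authorization": authorization},  # pass the JWT through unchanged
        )
    if resp.status_code != 200:
        raise HTTPException(status_code=401, detail="Invalid or expired token")
    return resp.json()  # assumed shape, e.g. {"user_id": ..., "roles": [...]}
```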
Every business domain is its own FastAPI microservice. The list below is not exhaustive — new services will be added each sprint. The pattern stays consistent: one service, one database, one Helm chart, one GitHub Actions pipeline.
Services call /internal/verify on auth-service to validate tokens.

| Service | Domain | Phase |
|---|---|---|
| auth-service | Identity, JWT, SSO | ✅ Phase 1 |
| tenant-service | Orgs, seats, SAML | Phase 2 |
| event-service | Hackathons, cohorts | Phase 2 |
| lab-service | Lab lifecycle + Temporal | Phase 3 |
| catalog-service | Lab templates, paths | Phase 3 |
| scoring-service | 5-dim grading | Phase 3 |
| billing-service | Tokens, subscriptions | Phase 4 |
| notification-service | Email, in-app alerts | Phase 4 |
| analytics-service | Reports, dashboards | Phase 4 |
| … more each sprint | | |
At 25,000 concurrent users, the scoring engine will be under maximum load while the billing service is idle. A monolith scales everything together and wastes compute. Microservices let us scale each component to exactly what the load demands — scoring-service at 200 pods, auth-service at 10, billing at 2.
Single AKS cluster, namespace-separated by environment:
- gl-dev — developer sandbox
- gl-staging — pre-prod, full stack
- gl-prod — production

For world-record events: a second prod cluster in a second Azure region, fronted by Azure Traffic Manager.
Terraform modules are written provider-agnostic where possible. Switching to AWS = swap the Azure provider for the AWS provider. The app layer (Helm) doesn't change at all.
Each microservice and agent has its own Helm chart, all living in a monorepo charts/ directory (e.g. charts/auth-service/values.yaml).

GitHub Actions does the build — it shouldn't also hold cluster credentials and run kubectl commands. Argo CD owns the deploy half. This separates concerns: if the cluster drifts (someone ran kubectl apply directly), Argo CD detects it and re-converges. With Actions alone, drift is invisible.
Each AI agent is an independently deployed microservice in AKS. They communicate with lab-service and each other via RabbitMQ (async) or direct REST calls (sync, internal network only). Agents are orchestrated by Temporal workflows — Temporal calls the agent services as Activities.
| Agent | Scale Trigger | Max Replicas |
|---|---|---|
| lab-architect | Lab request queue | 10 |
| dataset-forge | Dataset queue depth | 50 |
| provisioning | Provision queue | 100 |
| qa-validator | Validation queue | 50 |
| live-observer | Active lab count | 200 |
| hint-dispenser | Hint request queue | 100 |
| scoring | Score queue depth | 300 |
| cross-validator | Submission queue | 100 |
At T-30min, 20,000 participants submit simultaneously. The scoring queue in RabbitMQ fills with 20,000 messages. KEDA sees the queue depth and scales scoring-agent from 5 → 280 pods in under 90 seconds. Each pod picks jobs from the queue, calls the Claude API for commentary, and writes results to the PostgreSQL scoring DB. The queue drains in ~8 minutes, and KEDA scales back down to 5 pods once it is empty.
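As an illustration of the consumer side, a hedged sketch of a scoring-agent worker draining that queue. The queue name, message shape, and both helper functions are placeholders standing in for the real grading and persistence code:

```python
# Hypothetical scoring-agent worker: pull a submission off RabbitMQ, grade it,
# persist the result, then ack. Queue name and helpers are illustrative only.
import json

import pika


def grade_submission(submission: dict) -> dict:
    # Placeholder for the real 5-dimension grading + Claude commentary call.
    return {"submission_id": submission.get("id"), "score": 0}


def save_score(result: dict) -> None:
    # Placeholder for the write into the PostgreSQL scoring DB.
    print("scored", result)


def handle(ch, method, properties, body):
    submission = json.loads(body)
    save_score(grade_submission(submission))
    ch.basic_ack(delivery_tag=method.delivery_tag)  # ack only after the score is saved


connection = pika.BlockingConnection(pika.ConnectionParameters(host="rabbitmq"))
channel = connection.channel()
channel.queue_declare(queue="submission.received", durable=True)
channel.basic_qos(prefetch_count=1)  # one in-flight job per pod keeps KEDA scaling predictable
channel.basic_consume(queue="submission.received", on_message_callback=handle)
channel.start_consuming()
```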
Temporal solves the hardest problem in lab provisioning: multi-step, long-running processes that must complete correctly even when infrastructure fails. It gives us durable execution — write the workflow as normal Python code, and Temporal guarantees it runs to completion.
A workflow orchestration engine. You write the workflow as ordinary Python (@workflow.defn) with each step as an activity (@activity.defn). Temporal persists every step's result. If the server crashes between step 3 and 4, Temporal replays from step 3's saved result — transparently, automatically.
Without Temporal: you write a background job calling Terraform → Graph API → DB update → notification. The server crashes after Terraform but before the Graph API call. You have a live VM and no license — leaked cost, broken state. You need manual retry logic, dead-letter queues, compensating transactions.
With Temporal: The workflow function replays from the last checkpoint. Terraform step already completed? Skip it. Resume at Graph API. Zero data loss, zero manual recovery.
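To make that concrete, here is a hedged sketch of what the provisioning workflow could look like with the Temporal Python SDK. Activity names, arguments, and timeouts are illustrative assumptions, not the actual lab-service code:

```python
# Hypothetical provisioning workflow: each completed activity is checkpointed,
# so a crash mid-way resumes at the next step instead of re-running Terraform.
from datetime import timedelta

from temporalio import activity, workflow


@activity.defn
async def run_terraform(lab_id: str) -> str:
    ...  # provision the lab VM, return its resource id


@activity.defn
async def assign_license(lab_id: str) -> None:
    ...  # Graph API license assignment


@activity.defn
async def record_vm(lab_id: str, vm_id: str) -> None:
    ...  # DB update with the provisioned VM id


@activity.defn
async def notify_ready(lab_id: str) -> None:
    ...  # publish lab.ready


@workflow.defn
class ProvisionLabWorkflow:
    @workflow.run
    async def run(self, lab_id: str) -> None:
        vm_id = await workflow.execute_activity(
            run_terraform, lab_id, start_to_close_timeout=timedelta(minutes=15)
        )
        await workflow.execute_activity(
            assign_license, lab_id, start_to_close_timeout=timedelta(minutes=2)
        )
        await workflow.execute_activity(
            record_vm, args=[lab_id, vm_id], start_to_close_timeout=timedelta(minutes=1)
        )
        await workflow.execute_activity(
            notify_ready, lab_id, start_to_close_timeout=timedelta(minutes=1)
        )
```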
Carries events between services and agents. Services don't call each other directly for async operations — they publish to a queue and move on.
| Event | Publisher → Consumer |
|---|---|
| event.started | event-service → lab-service |
| lab.provision.requested | lab-service → provisioning-agent |
| lab.ready | lab-service → notification-service |
| submission.received | lab-service → scoring-agent |
| score.ready | scoring-agent → analytics-service |
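The publish side is equally thin. A minimal sketch of lab-service emitting lab.ready, with the exchange name and message body as assumptions:

```python
# Hypothetical publisher: lab-service announces a lab is ready and moves on;
# notification-service consumes it whenever it gets to it.
import json

import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host="rabbitmq"))
channel = connection.channel()
channel.exchange_declare(exchange="gravitylabs.events", exchange_type="topic", durable=True)

channel.basic_publish(
    exchange="gravitylabs.events",
    routing_key="lab.ready",
    body=json.dumps({"lab_id": "lab-123", "participant_id": "user-456"}),  # illustrative payload
    properties=pika.BasicProperties(delivery_mode=2),  # persist the message to disk
)
connection.close()
```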
Valkey is Redis-compatible and BSD-licensed. All ephemeral stateful data lives here — not in PostgreSQL.
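Because Valkey speaks the Redis protocol, the standard redis client works unchanged. A sketch of one such use from the cache layer, a fixed-window rate-limit counter, with key format and limits as assumptions:

```python
# Hypothetical fixed-window rate limiter backed by Valkey (Redis-compatible).
import redis.asyncio as redis

valkey = redis.Redis(host="valkey", port=6379, decode_responses=True)

RATE_LIMIT = 100      # requests allowed...
WINDOW_SECONDS = 60   # ...per 60-second window


async def allow_request(user_id: str) -> bool:
    key = f"ratelimit:{user_id}"
    count = await valkey.incr(key)                 # atomic increment
    if count == 1:
        await valkey.expire(key, WINDOW_SECONDS)   # first hit starts the window's TTL
    return count <= RATE_LIMIT
```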
Azure Service Bus and Azure Cache for Redis are excellent managed services — but they tie you to Azure. RabbitMQ and Valkey run identically on AWS EKS, GCP GKE, or bare metal.
Moving Gravity Labs to AWS = update Terraform provider. The entire messaging and cache layer runs in the new AKS-equivalent cluster with zero code changes.
Production note: In Phase 4, RabbitMQ gets clustered (3 nodes) for HA. Valkey gets a primary + replica pair. Both managed via Helm charts in AKS.
Single PostgreSQL 16 instance in AKS, with one database per microservice. Not five separate Postgres pods — one pod, multiple logical databases inside it.
Managed Postgres services are fine but cloud-specific. Running Postgres in AKS means it stays portable, and backups are just pg_dump → MinIO.

lab-service connects to gravitylabs_labs only, using its own DB user. It physically cannot query gravitylabs_auth — wrong credentials. This is enforced at the Postgres level, not in application code.
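A minimal sketch of what that per-service wiring might look like; the env var name, pool sizes, and session helper are assumptions:

```python
# Hypothetical lab-service database setup: async SQLAlchemy + asyncpg against
# the gravitylabs_labs database only, using lab-service's own credentials.
import os

from sqlalchemy.ext.asyncio import async_sessionmaker, create_async_engine

# e.g. postgresql+asyncpg://lab_service:***@postgres.gl-prod:5432/gravitylabs_labs
DATABASE_URL = os.environ["LAB_DATABASE_URL"]  # assumed env var name

engine = create_async_engine(DATABASE_URL, pool_size=10, max_overflow=20)
SessionLocal = async_sessionmaker(engine, expire_on_commit=False)


async def get_session():
    """FastAPI dependency yielding one async session per request."""
    async with SessionLocal() as session:
        yield session
```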
Deployed inside AKS. Scrapes metrics from every pod via ServiceMonitor CRDs. All FastAPI services expose /metrics (prometheus-fastapi-instrumentator) with key per-service metrics.
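The instrumentation itself is a short per-service setup, following prometheus-fastapi-instrumentator's documented pattern (the service name here is illustrative):

```python
# Expose Prometheus metrics at /metrics for the ServiceMonitor to scrape.
from fastapi import FastAPI
from prometheus_fastapi_instrumentator import Instrumentator

app = FastAPI(title="lab-service")

Instrumentator().instrument(app).expose(app)
```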
Also in AKS. Pre-built dashboards cover lab provisioning queue depth, token burn rate, VM health, active sessions, and event operations.
The event ops team watches the Event Operations dashboard on a big screen. Queue depth rises → KEDA auto-scales → queue drains. If queue doesn't drain, Grafana alert fires → PagerDuty page.
All containers write structured JSON logs via structlog. Loki (in AKS) aggregates logs from all pods. Grafana has a Loki datasource — query logs alongside metrics in the same dashboard.
Example: a participant reports their lab is stuck. Ops queries Loki for lab_id=xyz across all services to see the exact failure point in under 10 seconds.
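A sketch of the logging convention that makes that query work; processor choices and field names are assumptions beyond what is stated above:

```python
# Hypothetical structlog setup: JSON lines with a bound lab_id, so Loki can
# filter the same lab across every service's logs.
import structlog

structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.add_log_level,
        structlog.processors.JSONRenderer(),
    ]
)

log = structlog.get_logger(service="lab-service")

lab_log = log.bind(lab_id="xyz")           # every line below carries lab_id
lab_log.info("provisioning_started")
lab_log.error("provisioning_failed", step="terraform_apply")
```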
Services outside the core application layer. OSS-first and cloud-agnostic — every choice here can be swapped for an equivalent on a different cloud with only a config change, not a code change.
| Concern | Self-managed / OSS (chosen) | Azure equivalent | AWS equivalent | Rationale |
|---|---|---|---|---|
| Transactional email | SendGrid (SMTP API) | Azure Communication Services Email | Amazon SES | SMTP-compatible — swap by changing SMTP host config only |
| Object storage | MinIO in AKS (S3-compatible) | Azure Blob Storage | Amazon S3 | S3 API is the industry standard. MinIO in AKS = zero cloud dependency. Move to S3/Blob = change endpoint + keys only (sketch below this table) |
| CDN / DDoS | Cloudflare (free tier → Pro) | Azure Front Door + WAF | CloudFront + Shield | Cloudflare is provider-agnostic — it sits in front of any cloud origin |
| Secret management | HashiCorp Vault in AKS (on-prem) / Azure Key Vault (cloud) | Azure Key Vault | AWS Secrets Manager | Vault for on-prem deployments. Key Vault for Azure production — AKS has native integration |
| Lab streaming (RDP/SSH) | Apache Guacamole in AKS | Azure Virtual Desktop | Amazon AppStream | Guacamole is open-source, runs anywhere, browser-based — no client install |
| GPU labs / AI workloads | Neko WebRTC + AKS GPU node pool | Azure NC-series nodes | AWS P-series | Neko provides browser-based desktop access to GPU workloads |
| SMS / Push (future) | TBD — Twilio or self-hosted | Azure Communication Services | Amazon SNS / Pinpoint | Not in Phase 1–3 scope. Evaluate in Phase 4. |
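To illustrate the config-only swap in the object-storage row above, a sketch using boto3 against an in-cluster MinIO endpoint. The endpoint, bucket, and credentials are hypothetical:

```python
# Hypothetical object-storage client: the same boto3 code talks to MinIO in AKS
# today; pointing it at another S3-compatible endpoint is a config change.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://minio.gl-prod.svc.cluster.local:9000",  # drop/replace when moving off MinIO
    aws_access_key_id="<minio-access-key>",
    aws_secret_access_key="<minio-secret-key>",
)

s3.upload_file("submission.zip", "gravitylabs-submissions", "event-42/team-7.zip")
```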
Enables independent scaling during events. KEDA scales scoring-agent independently from lab-architect-agent.
Self-managed in AKS. Cloud-agnostic. Logical isolation enforced at the DB-user level. Individual DBs can be split out later if needed.
Actions = build + push to ACR. Argo CD = GitOps sync to AKS. Drift detection included.
gl-dev / gl-staging / gl-prod namespaces. Second cluster added for world-record multi-region events.
BSD-licensed Redis fork. Production: consider Azure Cache for Redis for managed HA. Code unchanged either way.
S3 API compatibility means swapping to actual S3 / Azure Blob is a config-only change.
Current Phase 1 uses HS256 (symmetric). Phase 2 should migrate to RS256 so services only hold public keys, not the signing secret.
Twilio vs self-hosted vs Azure Communication Services. Evaluate in Phase 4 when notification-service is scoped.
🔴 CRITICAL — CSP procurement is blocked until enrollment completes. 2–4 week timeline. Blocks Phase 3 lab provisioning.
"Report" service referenced in early diagrams. Formally tracked as analytics-service in Phase 4. Scope TBD.