Gravity Labs — Architecture Reference
v1.0 · May 2026

Real cloud labs.
Clear system design.

Everything engineering needs to understand the Gravity Labs platform architecture: each layer, each decision, and how services, agents, workflows, and infrastructure connect in production.

  • AI agent-based lab lifecycle orchestration.
  • Cloud-agnostic architecture patterns.
  • Scale-ready design from 500 to 25,000 participants.
01

Full Architecture Stack

Each layer is summarized here; full detail follows in the numbered sections below.

Client Layer
React 18 · TypeScript · Tailwind CSS · Zustand

Single-page application served from AKS. Communicates exclusively through the Nginx API gateway. Handles auth token storage, silent refresh, and role-based UI rendering.

↓ see Section 02 for full detail
API Gateway
Nginx · Routing · TLS termination

Single ingress point. Routes /api/v1/auth/* → auth-service, /api/v1/labs/* → lab-service, etc. Terminates TLS. No business logic here.

↓ see Section 03 for full detail
Application Services
FastAPI · asyncpg · Python 3.11 · many microservices

Each business domain = one FastAPI microservice, independently deployable in AKS. All services use async SQLAlchemy + asyncpg to talk to their own database within the shared PostgreSQL instance.

auth-service ✅ · tenant-service · event-service · lab-service · scoring-service · billing-service · notification-service · catalog-service · + more per sprint
↓ see Section 04 for full detail
AI Agents Layer
8 agents · LangGraph · Claude API · independent pods

Each AI agent is its own microservice/container in AKS. This enables independent horizontal scaling during events — Scoring agent scales to hundreds of pods while Lab Architect stays at 2.

lab-architect-agent · dataset-forge-agent · provisioning-agent · qa-validator-agent · live-observer-agent · hint-dispenser-agent · scoring-agent · cross-validator-agent
↓ see Section 06 for full detail
Workflow Orchestration
Temporal · durable execution · Terraform/Pulumi workers

Temporal orchestrates all long-running, multi-step lab lifecycle workflows. If a server crashes mid-provisioning, Temporal replays from the last checkpoint — no orphaned VMs, no missed license assignments.

↓ see Section 07 for full detail
Messaging & Cache
RabbitMQ · Valkey (Redis-compatible)

RabbitMQ carries async events between services (event.started, lab.provisioned, score.ready). Valkey handles all stateful ephemeral data: JWT tokens, rate-limit counters, SSO state, session data.

↓ see Section 08 for full detail
Data Layer
PostgreSQL 16 · single instance · multiple databases · self-managed in AKS

One PostgreSQL 16 instance running in AKS. Each microservice owns its own database inside that instance (gravitylabs_auth, gravitylabs_tenant, gravitylabs_labs, etc.) — hard logical isolation with no cross-service queries allowed. Self-managed means this is portable to AWS/GCP as-is.

↓ see Section 09 for full detail
Infrastructure & CI/CD
Terraform · Helm Charts · ACR · GitHub Actions · Argo CD

Terraform provisions the AKS cluster, ACR, Key Vault, networking. Helm packages each service. GitHub Actions builds images and pushes to ACR. Argo CD watches the Helm chart repo and syncs AKS — GitOps, cluster always matches Git.

↓ see Section 05 for full detail
Monitoring & Observability
Prometheus · Grafana · Loki · structlog

Prometheus scrapes metrics from every pod. Grafana dashboards show lab provisioning queue depth, token burn rate, VM health, active sessions. Loki aggregates logs from all containers. All running inside AKS.

↓ see Section 10 for full detail
Peripheral Services
Email · Object Storage · CDN · self-managed OSS

Services outside the core app layer. OSS-first where practical to stay cloud-agnostic. SendGrid for transactional email (SMTP-compatible, swap to SES/Mailgun on any cloud). MinIO for object storage (S3-compatible API; runs in AKS, swap to Azure Blob/S3/GCS with config only). Cloudflare for CDN and DDoS protection.

↓ see Section 11 for full detail

02

Client Layer

Technology

React 18 + TypeScript, Tailwind CSS, Zustand (state), Axios (HTTP), React Hook Form + Zod (validation), Framer Motion (UI animation). Built with Vite. Served as static files from an Nginx pod inside AKS.

Why This Stack
  • React 18 concurrent rendering handles real-time lab status updates without UI freezing
  • Zustand is lightweight vs Redux — no boilerplate for auth + lab state
  • Tailwind keeps styles co-located with components — critical for a large team
  • Vite gives sub-second hot reload, keeping dev velocity high
Gravity Labs example
Live lab status flow

When a participant launches a lab, the React UI polls lab-service every 3 seconds via Axios. The Zustand useLabStore updates the status chip (Analyzing → Designing → Provisioning → Ready) without re-rendering the whole page; React 18's automatic batching keeps these updates cheap.
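On the service side, that polling endpoint can stay cheap by answering from the Valkey lab_status:{lab_id} key (see Messaging & Cache) instead of hitting PostgreSQL. A minimal sketch, with illustrative route and hostnames:

    # lab-service: status endpoint the UI polls every 3 seconds (sketch, names are illustrative)
    import json
    import redis.asyncio as redis
    from fastapi import FastAPI, HTTPException

    app = FastAPI()
    valkey = redis.Redis(host="valkey", port=6379, decode_responses=True)

    @app.get("/api/v1/labs/{lab_id}/status")
    async def lab_status(lab_id: str):
        cached = await valkey.get(f"lab_status:{lab_id}")   # TTL 1h, written by the provisioning workflow
        if cached is None:
            raise HTTPException(status_code=404, detail="unknown or expired lab")
        return json.loads(cached)                           # e.g. {"status": "Provisioning", ...}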


03

API Gateway — Nginx

Technology

Nginx 1.25 running as a Kubernetes deployment. Acts as the single entry point for all API traffic from the frontend. Service-to-service calls bypass it and go directly over the internal Kubernetes network.

Routing Rules
  • /api/v1/auth/* → auth-service:8001
  • /api/v1/tenant/* → tenant-service:8002
  • /api/v1/events/* → event-service:8003
  • /api/v1/labs/* → lab-service:8004
  • /api/v1/agents/* → agent-gateway:8010
  • /api/v1/billing/* → billing-service:8006
  • /api/v1/catalog/* → catalog-service:8007
Why Not a Cloud API Gateway

Azure API Management / AWS API Gateway are costly at scale and create cloud lock-in. Nginx is free, runs in AKS, and the config is version-controlled in the same repo. Migrating clouds = copy one nginx.conf.

Gravity Labs example

A participant clicks "Launch Lab." The React app calls POST /api/v1/labs/provision. Nginx routes it to lab-service. Lab-service validates the JWT by calling auth-service internally — directly, not through Nginx.
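A minimal sketch of that internal hop using httpx; the /internal/verify path comes from the Application Services section, everything else here is illustrative:

    # lab-service: verify the caller's JWT against auth-service, bypassing the gateway (sketch)
    import httpx

    AUTH_URL = "http://auth-service:8001/internal/verify"    # cluster-internal DNS, not via Nginx

    async def verify_token(token: str) -> dict:
        async with httpx.AsyncClient(timeout=2.0) as client:
            resp = await client.post(AUTH_URL, headers={"Authorization": f"Bearer {token}"})
        resp.raise_for_status()                               # 401/403 bubbles up to the route handler
        return resp.json()                                    # e.g. {"sub": "...", "roles": [...]}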


04

Application Services Layer

Every business domain is its own FastAPI microservice. The list below is not exhaustive — new services will be added each sprint. The pattern stays consistent: one service, one database, one Helm chart, one GitHub Actions pipeline.

Technology & Pattern
  • FastAPI — async, OpenAPI auto-docs, Pydantic validation
  • SQLAlchemy 2.x async + asyncpg — non-blocking DB calls
  • Alembic — schema migrations per service
  • structlog — structured JSON logs (Loki-ready)
  • python-jose — JWT validation via auth-service
  • Each service calls /internal/verify on auth-service to validate tokens
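A minimal sketch of this shared pattern, with illustrative names (each real service differs in models and routes):

    # service skeleton: FastAPI + async SQLAlchemy + structlog (illustrative names)
    import os

    import structlog
    from fastapi import Depends, FastAPI
    from sqlalchemy.ext.asyncio import AsyncSession, async_sessionmaker, create_async_engine
    from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column

    log = structlog.get_logger()
    # e.g. postgresql+asyncpg://event_svc:<secret>@postgres:5432/gravitylabs_events
    engine = create_async_engine(os.environ["DATABASE_URL"])
    SessionLocal = async_sessionmaker(engine, expire_on_commit=False)

    class Base(DeclarativeBase):
        pass

    class Event(Base):
        __tablename__ = "events"
        id: Mapped[int] = mapped_column(primary_key=True)
        name: Mapped[str]

    app = FastAPI(title="event-service")

    async def get_db():
        async with SessionLocal() as session:        # one session per request
            yield session

    @app.get("/api/v1/events/{event_id}")
    async def get_event(event_id: int, db: AsyncSession = Depends(get_db)):
        log.info("event.fetch", event_id=event_id)   # structured JSON log, Loki-ready
        event = await db.get(Event, event_id)
        return {"id": event.id, "name": event.name} if event else {"detail": "not found"}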
Service Inventory (current + planned)
Service · Domain · Phase
auth-service · Identity, JWT, SSO · ✅ Phase 1
tenant-service · Orgs, seats, SAML · Phase 2
event-service · Hackathons, cohorts · Phase 2
lab-service · Lab lifecycle + Temporal · Phase 3
catalog-service · Lab templates, paths · Phase 3
scoring-service · 5-dim grading · Phase 3
billing-service · Tokens, subscriptions · Phase 4
notification-service · Email, in-app alerts · Phase 4
analytics-service · Reports, dashboards · Phase 4
… more each sprint
Why microservices, not a monolith

At 25,000 concurrent users, the scoring engine will be under maximum load while the billing service is idle. A monolith scales everything together and wastes compute. Microservices let us scale each component to exactly what the load demands — scoring-service at 200 pods, auth-service at 10, billing at 2.


05

Infrastructure & CI/CD

AKS Cluster Topology

Single AKS cluster, namespace-separated by environment:

  • gl-dev — developer sandbox
  • gl-staging — pre-prod, full stack
  • gl-prod — production

For world-record events, a second prod cluster runs in a second Azure region, fronted by Azure Traffic Manager.

Terraform — What It Provisions
  • AKS cluster (node pools, autoscaler config)
  • ACR (Azure Container Registry)
  • Azure Key Vault (all secrets)
  • Virtual Network + subnets
  • Load Balancer + public IP
  • Storage accounts (PostgreSQL PVC backing)
Multi-cloud principle

Terraform modules are written provider-agnostic where possible. Switching to AWS = swap the Azure provider for AWS provider. App layer (Helm) doesn't change at all.

Helm — What It Packages

Each microservice and agent has its own Helm chart, kept in a monorepo charts/ directory. Each chart contains:

  • Deployment (image, replicas, resource limits)
  • Service (ClusterIP / LoadBalancer)
  • HPA (Horizontal Pod Autoscaler)
  • ConfigMap + Secret references to Key Vault
  • ServiceMonitor (Prometheus scrape config)
CI/CD Pipeline — GitHub Actions + Argo CD
1. Dev pushes a PR: feature branch → pull request opened on GitHub.
2. GitHub Actions (CI): run tests → lint → build Docker image → push to ACR with a git SHA tag.
3. Update Helm values: GitHub Actions commits the new image tag to charts/auth-service/values.yaml.
4. Argo CD detects the change: it watches the Helm chart repo, sees the new image tag, and marks the app OutOfSync.
5. Argo CD syncs AKS: rolling deploy to gl-staging (auto) or gl-prod (manual approval gate). Cluster always matches Git.
Why GitHub Actions + Argo CD, not Actions alone

GitHub Actions does the build half; it shouldn't also hold cluster credentials and run kubectl against the cluster. Argo CD owns the deploy half. This separates concerns: if the cluster drifts (someone ran kubectl apply directly), Argo CD detects it and re-converges. With Actions alone, drift is invisible.


06

AI Agents Layer

Each AI agent is an independently deployed microservice in AKS. They communicate with lab-service and each other via RabbitMQ (async) or direct REST calls (sync, internal network only). Agents are orchestrated by Temporal workflows — Temporal calls the agent services as Activities.

Agent Runtime Stack
  • LangGraph — agent state machine and tool-call orchestration
  • Claude API (Anthropic) — primary LLM for lab design, analysis, commentary
  • Azure OpenAI GPT-4o — fallback / cost routing for high-volume scoring
  • FastAPI — same framework as app services, consistent tooling
  • Temporal Activity Workers — each agent registers as a Temporal activity worker, so Temporal workflows can call them reliably with retries
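A minimal sketch of the activity-worker side of that contract, with illustrative names; the real provisioning-agent wraps Terraform/Bicep and Graph API calls:

    # provisioning-agent: exposes its work to Temporal as an activity on its own task queue (sketch)
    import asyncio
    from temporalio import activity
    from temporalio.client import Client
    from temporalio.worker import Worker

    @activity.defn(name="provision_lab")
    async def provision_lab(blueprint: dict) -> str:
        # run Terraform/Bicep, assign licenses via Graph API (not shown)
        return "https://labs.example/lab-123"

    async def main():
        client = await Client.connect("temporal:7233")
        worker = Worker(client, task_queue="provisioning", activities=[provision_lab])
        await worker.run()          # Temporal retries failed activities automatically

    if __name__ == "__main__":
        asyncio.run(main())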
Why Each Agent = Its Own Container
  • During a 25K-user event, scoring-agent needs 300+ replicas. Lab-architect needs 3. Bundled = you scale both to 300.
  • An LLM timeout in hint-dispenser-agent doesn't crash provisioning-agent
  • Independent versioning — you can hotfix scoring logic without redeploying the whole agent fleet
  • KEDA autoscaling per agent based on RabbitMQ queue depth
Agent Communication Pattern
Temporal ProvisionLabWorkflow
  → lab-architect-agent
  → dataset-forge-agent
  → (via RabbitMQ) provisioning-agent
  → qa-validator-agent
Agent Scaling (KEDA)
Agent · Scale Trigger · Max Replicas
lab-architect · Lab request queue · 10
dataset-forge · Dataset queue depth · 50
provisioning · Provision queue · 100
qa-validator · Validation queue · 50
live-observer · Active lab count · 200
hint-dispenser · Hint request queue · 100
scoring · Score queue depth · 300
cross-validator · Submission queue · 100
Gravity Labs example — scoring at event scale
25,000-user event — final submission wave

At T-30min, 20,000 participants submit simultaneously. The scoring queue in RabbitMQ fills with 20,000 messages. KEDA sees the queue depth and scales scoring-agent from 5 → 280 pods in under 90 seconds. Each pod picks jobs from the queue, calls the Claude API for commentary, and writes results to the PostgreSQL scoring DB. The queue drains in ~8 minutes. KEDA scales back down to 5 pods once the queue is empty.
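A minimal sketch of the consumer loop each scoring-agent pod runs, assuming an illustrative scoring.requests queue name and pika for RabbitMQ access:

    # scoring-agent worker loop: one submission at a time from RabbitMQ, ack on success (sketch)
    import json
    import pika

    def score_submission(submission: dict) -> None:
        # run scoring scripts, call the Claude API for commentary, write to gravitylabs_scoring (not shown)
        ...

    conn = pika.BlockingConnection(pika.ConnectionParameters(host="rabbitmq"))
    channel = conn.channel()
    channel.basic_qos(prefetch_count=1)             # one in-flight job per pod, so queue depth stays honest for KEDA

    def handle(ch, method, properties, body):
        score_submission(json.loads(body))
        ch.basic_ack(delivery_tag=method.delivery_tag)

    channel.basic_consume(queue="scoring.requests", on_message_callback=handle)
    channel.start_consuming()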


07

Workflow Orchestration — Temporal

Temporal solves the hardest problem in lab provisioning: multi-step, long-running processes that must complete correctly even when infrastructure fails. It gives us durable execution — write the workflow as normal Python code, and Temporal guarantees it runs to completion.

What is Temporal

A workflow orchestration engine. You write a Python function (@workflow.defn) with steps (@activity.defn). Temporal persists every step's result. If the server crashes between step 3 and 4, Temporal replays from step 3's saved result — transparently, automatically.

Without vs With Temporal

Without: You write a background job calling Terraform → Graph API → DB update → notification. Server crashes after Terraform but before Graph API. You have a live VM and no license — leaked cost, broken state. You need manual retry logic, dead-letter queues, compensating transactions.

With Temporal: The workflow function replays from the last checkpoint. Terraform step already completed? Skip it. Resume at Graph API. Zero data loss, zero manual recovery.

Gravity Labs — 3 Core Workflows
  • ProvisionLabWorkflow — AI design → dataset → VM/container spin-up → license assign → dataset inject → URL return. 2–8 min.
  • GradeLabWorkflow — run scoring scripts against live cloud → AI commentary → 5-dimension score → store results.
  • DestroyLabWorkflow — revoke licenses → delete Entra users → Terraform destroy → log cost. Atomic — half-teardown impossible.
ProvisionLabWorkflow — step by step (Gravity Labs example)
1. lab-architect-agent: receives the NL description → produces IaC template + scoring rubric (LangGraph + Claude API).
2. HITL Gate 1: admin reviews the blueprint, approve/reject. Temporal waits indefinitely on a signal.
3. dataset-forge-agent: generates N unique synthetic datasets (Faker + Pandas), one per participant.
4. HITL Gate 2: admin reviews the procurement list + cost. Temporal waits for the approval signal.
5. provisioning-agent: runs Terraform/Bicep, creates VM/AKS/container, assigns Microsoft licenses via Graph API.
6. qa-validator-agent: dry-run checks of connectivity, subscriptions, dataset, network isolation. Pass/fail report.
7. HITL Gate 3: admin reviews the QA report + final cost. One click → lab goes live.
Lab Ready: Guacamole URL + credentials returned to the participant. live-observer-agent starts watching.
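A minimal sketch of how one of these gates looks in Temporal's Python SDK. The activity names (design_lab, provision_lab) and task queue are illustrative; the real workflow has more steps and error handling:

    # ProvisionLabWorkflow, abridged: durable activities plus a human-approval signal gate (sketch)
    from datetime import timedelta
    from temporalio import workflow

    @workflow.defn
    class ProvisionLabWorkflow:
        def __init__(self) -> None:
            self.blueprint_approved = False

        @workflow.signal
        def approve_blueprint(self) -> None:         # sent by the admin UI at HITL Gate 1
            self.blueprint_approved = True

        @workflow.run
        async def run(self, lab_request: dict) -> str:
            blueprint = await workflow.execute_activity(
                "design_lab", lab_request,            # handled by lab-architect-agent workers
                start_to_close_timeout=timedelta(minutes=10),
            )
            # HITL Gate 1: waits indefinitely and survives worker or server restarts
            await workflow.wait_condition(lambda: self.blueprint_approved)
            return await workflow.execute_activity(
                "provision_lab", blueprint,           # handled by provisioning-agent workers
                task_queue="provisioning",
                start_to_close_timeout=timedelta(minutes=30),
            )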

08

Messaging & Cache

RabbitMQ — Async Messaging

Carries events between services and agents. Services don't call each other directly for async operations — they publish to a queue and move on.

Event · Publisher → Consumer
event.started · event-service → lab-service
lab.provision.requested · lab-service → provisioning-agent
lab.ready · lab-service → notification-service
submission.received · lab-service → scoring-agent
score.ready · scoring-agent → analytics-service
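A minimal sketch of the publish side, assuming an illustrative topic exchange named gravitylabs.events:

    # lab-service: publish lab.ready and move on; the consumer picks it up asynchronously (sketch)
    import json
    import pika

    conn = pika.BlockingConnection(pika.ConnectionParameters(host="rabbitmq"))
    channel = conn.channel()
    channel.exchange_declare(exchange="gravitylabs.events", exchange_type="topic", durable=True)

    channel.basic_publish(
        exchange="gravitylabs.events",
        routing_key="lab.ready",
        body=json.dumps({"lab_id": "lab-123", "url": "https://guac.example/..."}),
        properties=pika.BasicProperties(delivery_mode=2),   # persistent message
    )
    conn.close()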
Valkey — Cache & Token Store

Redis-compatible, BSD-licensed. All ephemeral stateful data lives here — not in PostgreSQL.

refresh:{user_id}:{jti} → "1" · TTL 7d
refresh_tokens:{user_id} → Set{jti}
login_attempts:{ip} → count · TTL 15m
blocklist:{jti} → "1" · TTL = token TTL
sso_state:{state} → data · TTL 10m
sso_code:{code} → user_id · TTL 5m
lab_status:{lab_id} → JSON · TTL 1h
rate_limit:{service}:{ip} → count · TTL 1m
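A minimal sketch of two of these patterns with redis-py (Valkey is wire-compatible); key names follow the table above, limits are illustrative:

    # auth-service: token blocklist + login rate limiting against Valkey (sketch)
    import redis.asyncio as redis

    valkey = redis.Redis(host="valkey", port=6379, decode_responses=True)

    async def revoke_token(jti: str, ttl_seconds: int) -> None:
        await valkey.set(f"blocklist:{jti}", "1", ex=ttl_seconds)   # expires together with the token

    async def is_revoked(jti: str) -> bool:
        return await valkey.exists(f"blocklist:{jti}") == 1

    async def too_many_logins(ip: str, limit: int = 10) -> bool:
        key = f"login_attempts:{ip}"
        attempts = await valkey.incr(key)
        if attempts == 1:
            await valkey.expire(key, 900)                           # 15-minute window
        return attempts > limit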
Why Self-Managed (not managed MQ / cache)

Azure Service Bus and Azure Cache for Redis are excellent managed services — but they tie you to Azure. RabbitMQ and Valkey run identically on AWS EKS, GCP GKE, or bare metal.

Multi-cloud portability

Moving Gravity Labs to AWS = update the Terraform provider. The entire messaging and cache layer runs in the new cluster (EKS, GKE, or on-prem) with zero code changes.

Production note: In Phase 4, RabbitMQ gets clustered (3 nodes) for HA. Valkey gets a primary + replica pair. Both managed via Helm charts in AKS.


09

Data Layer — PostgreSQL

Architecture Decision

Single PostgreSQL 16 instance in AKS, with one database per microservice. Not one Postgres pod per service: one pod, with multiple logical databases inside it.

gravitylabs_auth ← auth-service only
gravitylabs_tenant ← tenant-service only
gravitylabs_events ← event-service only
gravitylabs_labs ← lab-service only
gravitylabs_scoring ← scoring-service only
gravitylabs_catalog ← catalog-service only
gravitylabs_billing ← billing-service only
gravitylabs_temporal ← Temporal internal
Why Single Instance
  • Simpler ops — one backup, one upgrade, one monitoring target
  • Cheaper: one pod hosting every service's logical DB instead of one pod per service
  • Logical isolation is still strict — no service can query another's DB (enforced by separate DB users + passwords per service)
  • Future split is easy if one service's DB needs to scale independently — just migrate that one DB to its own instance
Self-Managed — Why Not RDS / Azure Flexible

Managed Postgres services are fine but cloud-specific. Running Postgres in AKS means:

  • Identical setup on Azure, AWS, GCP, on-prem
  • No extra monthly cost per managed instance
  • Full control over Postgres config (connection limits, extensions, WAL)
  • Backup is handled by a scheduled Kubernetes CronJob running pg_dump → MinIO
Gravity Labs example

lab-service connects to gravitylabs_labs only, using its own DB user. It physically cannot query gravitylabs_auth — wrong credentials. This is enforced at the Postgres level, not application code.


10

Monitoring & Observability

Prometheus — Metrics

Deployed inside AKS. Scrapes metrics from every pod via ServiceMonitor CRDs. All FastAPI services expose /metrics (prometheus-fastapi-instrumentator; wiring sketched after the list). Key metrics per service:

  • Request rate, latency (p50/p95/p99), error rate
  • Active lab count, provisioning queue depth
  • RabbitMQ queue depth per queue
  • Valkey hit/miss rate, memory usage
  • Agent LLM token usage + latency
  • PostgreSQL connections, query time
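The /metrics endpoint itself comes from prometheus-fastapi-instrumentator; a minimal sketch of wiring it into a service:

    # any FastAPI service: expose Prometheus metrics at /metrics (sketch)
    from fastapi import FastAPI
    from prometheus_fastapi_instrumentator import Instrumentator

    app = FastAPI(title="lab-service")
    Instrumentator().instrument(app).expose(app)   # default request rate, latency, and error metrics

    # The ServiceMonitor in each Helm chart points Prometheus at this endpoint.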
Grafana — Dashboards

Also in AKS. Pre-built dashboards for:

  • Event Operations — active participants, labs in each status, provisioning queue, time-to-ready
  • Agent Fleet — pod count per agent, LLM latency, queue drain rate
  • Infrastructure — node CPU/memory, pod restarts, PVC usage
  • Business — token burn rate, revenue snapshot, tenant activity
During a live event

The event ops team watches the Event Operations dashboard on a big screen. Queue depth rises → KEDA auto-scales → queue drains. If queue doesn't drain, Grafana alert fires → PagerDuty page.

Loki — Log Aggregation

All containers write structured JSON logs via structlog. Loki (in AKS) aggregates logs from all pods. Grafana has a Loki datasource — query logs alongside metrics in the same dashboard.

Example: a participant reports their lab is stuck. Ops queries Loki for lab_id=xyz across all services to see the exact failure point in under 10 seconds.
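A minimal sketch of the structlog configuration that makes those logs queryable, assuming every service binds lab_id wherever it has one:

    # shared logging setup: structured JSON logs that Loki can filter by field (sketch)
    import structlog

    structlog.configure(
        processors=[
            structlog.processors.add_log_level,
            structlog.processors.TimeStamper(fmt="iso"),
            structlog.processors.JSONRenderer(),
        ]
    )

    log = structlog.get_logger(service="lab-service")
    log.info("lab.provision.failed", lab_id="xyz", step="qa-validation")
    # emits roughly: {"event": "lab.provision.failed", "lab_id": "xyz", "step": "qa-validation", "level": "info", ...}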


11

Peripheral Services

Services outside the core application layer. OSS-first and cloud-agnostic — every choice here can be swapped for an equivalent on a different cloud with only a config change, not a code change.

Decision Table — Peripheral Services
Concern · Self-managed / OSS (chosen) · Azure equivalent · AWS equivalent · Rationale
Transactional email · SendGrid (SMTP API) · Azure Communication Services Email · Amazon SES · SMTP-compatible; swap by changing SMTP host config only
Object storage · MinIO in AKS (S3-compatible) · Azure Blob Storage · Amazon S3 · S3 API is the industry standard; MinIO in AKS = zero cloud dependency; move to S3/Blob = change endpoint + keys only
CDN / DDoS · Cloudflare (free tier → Pro) · Azure Front Door + WAF · CloudFront + Shield · Cloudflare is provider-agnostic; it sits in front of any cloud origin
Secret management · HashiCorp Vault in AKS (on-prem) / Azure Key Vault (cloud) · Azure Key Vault · AWS Secrets Manager · Vault for on-prem deployments; Key Vault for Azure production (AKS has native integration)
Lab streaming (RDP/SSH) · Apache Guacamole in AKS · Azure Virtual Desktop · Amazon AppStream · Guacamole is open-source, runs anywhere, browser-based; no client install
GPU labs / AI workloads · Neko WebRTC + AKS GPU node pool · Azure NC-series nodes · AWS P-series · Neko provides browser-based desktop access to GPU workloads
SMS / Push (future) · TBD (Twilio or self-hosted) · Azure Communication Services · Amazon SNS / Pinpoint · Not in Phase 1–3 scope; evaluate in Phase 4
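A minimal sketch of the object-storage portability claim, using boto3 against MinIO's S3-compatible API; the endpoint, bucket, and key names are illustrative:

    # object storage: the same code works against MinIO in AKS or any S3-compatible endpoint (sketch)
    import boto3

    s3 = boto3.client(
        "s3",
        endpoint_url="http://minio:9000",      # remove or repoint this line to target real S3
        aws_access_key_id="gravitylabs",
        aws_secret_access_key="<from Key Vault>",
    )
    s3.upload_file("dataset_0421.csv", "lab-datasets", "events/1234/dataset_0421.csv")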

12

Scale Strategy

500 · Standard event · single region, default node pool
5,000 · Large event · node autoscale + quota pre-request
15,000 · Multi-region · East US 2 + West US 2, Traffic Manager load balancing
25,000 · World record attempt · 4 regions, subscription spread, Microsoft-sponsored licenses
Scale mechanisms in AKS
  • HPA — scales pods based on CPU/memory
  • KEDA — scales agent pods based on RabbitMQ queue depth
  • Cluster Autoscaler — adds/removes VM nodes in AKS node pool
  • Temporal worker pools — provisioning workers scale with workflow load
  • PostgreSQL connection pooling — PgBouncer in front of Postgres for 10K+ concurrent connections
  • Azure quota tickets — submitted 6–8 weeks before large events for vCPU headroom

13

Open Decisions (ADRs)

DECIDED

Agent deployment: each agent = own container

Enables independent scaling during events. KEDA scales scoring-agent independently from lab-architect-agent.

DECIDED

Single PostgreSQL instance, multiple databases

Self-managed in AKS. Cloud-agnostic. Logical isolation enforced at DB user level. Split individual DBs later if needed.

DECIDED

CI/CD: GitHub Actions + Argo CD

Actions = build + push to ACR. Argo CD = GitOps sync to AKS. Drift detection included.

DECIDED

AKS topology: single cluster, namespace-separated

gl-dev / gl-staging / gl-prod namespaces. Second cluster added for world-record multi-region events.

DECIDED

Cache/token store: Valkey (Redis-compatible)

BSD-licensed Redis fork. Production: consider Azure Cache for Redis for managed HA. Code unchanged either way.

DECIDED

Object storage: MinIO in AKS (S3-compatible)

S3 API compatibility means swapping to actual S3 / Azure Blob is a config-only change.

PENDING

JWT algorithm: HS256 → RS256 upgrade

Current Phase 1 uses HS256 (symmetric). Phase 2 should migrate to RS256 so services only hold public keys, not the signing secret.
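A hedged sketch of what the proposed RS256 path could look like in downstream services with python-jose, assuming auth-service distributes only its public key (key handling is illustrative):

    # any service: verify a token with the auth-service public key only (sketch of the proposed RS256 flow)
    from jose import jwt

    PUBLIC_KEY = open("/secrets/auth_jwt_pub.pem").read()   # distributed via Key Vault; the private key never leaves auth-service

    def verify(token: str) -> dict:
        return jwt.decode(token, PUBLIC_KEY, algorithms=["RS256"], audience="gravitylabs-api")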

PENDING

SMS / push notifications (Phase 4)

Twilio vs self-hosted vs Azure Communication Services. Evaluate in Phase 4 when notification-service is scoped.

PENDING

Microsoft Partner enrollment

🔴 CRITICAL — CSP procurement is blocked until enrollment completes. 2–4 week timeline. Blocks Phase 3 lab provisioning.

PENDING

Analytics service scope (Phase 4)

"Report" service referenced in early diagrams. Formally tracked as analytics-service in Phase 4. Scope TBD.

GRAVITY LABS · ARCHITECTURE REFERENCE · v1.0 · MAY 2026 · CONSTEL GLOBAL