Everything engineering needs to understand Gravity Labs platform architecture - each layer, each decision, and how services, agents, workflow, and infrastructure connect in production.
Single-page application served from AKS. Communicates exclusively through the Nginx API gateway. Handles auth token storage, silent refresh, and role-based UI rendering.
Single ingress point. Routes /api/v1/auth/* → auth-service, /api/v1/labs/* → lab-service, etc. Terminates TLS. No business logic here.
Each business domain = one FastAPI microservice, independently deployable in AKS. All services use async SQLAlchemy + asyncpg to talk to their own database within the shared PostgreSQL instance.
Each AI agent is its own microservice/container in AKS. This enables independent horizontal scaling during events — Scoring agent scales to hundreds of pods while Lab Architect stays at 2.
Temporal orchestrates all long-running, multi-step lab lifecycle workflows. If a server crashes mid-provisioning, Temporal replays from the last checkpoint — no orphaned VMs, no missed license assignments.
RabbitMQ carries async events between services (event.started, lab.provisioned, score.ready). Valkey handles all stateful ephemeral data: JWT tokens, rate-limit counters, SSO state, session data.
One PostgreSQL 16 instance running in AKS. Each microservice owns its own database inside that instance (gravitylabs_auth, gravitylabs_tenant, gravitylabs_labs, etc.) — hard logical isolation with no cross-service queries allowed. Self-managed means this is portable to AWS/GCP as-is.
Terraform provisions the AKS cluster, ACR, Key Vault, networking. Helm packages each service. GitHub Actions builds images and pushes to ACR. Argo CD watches the Helm chart repo and syncs AKS — GitOps, cluster always matches Git.
Prometheus scrapes metrics from every pod. Grafana dashboards show lab provisioning queue depth, token burn rate, VM health, active sessions. Loki aggregates logs from all containers. All running inside AKS.
Services outside the core app layer. OSS-first to stay cloud-agnostic. SendGrid for transactional email (SMTP-compatible, swap to SES/Mailgun on any cloud). MinIO for object storage (S3-compatible API — runs in AKS, swap to Azure Blob/S3/GCS with config only). Cloudflare for CDN and DDoS protection.
React 18 + TypeScript, Tailwind CSS, Zustand (state), Axios (HTTP), React Hook Form + Zod (validation), Framer Motion (UI animation). Built with Vite. Served as static files from an Nginx pod inside AKS.
When a participant launches a lab, the React UI polls lab-service every 3 seconds via Axios. The Zustand useLabStore updates the status chip (Analyzing → Designing → Provisioning → Ready) without re-rendering the whole page — React 18's automatic batching keeps these updates cheap.
Nginx 1.25 running as a Kubernetes Deployment. Acts as the single entry point for all API traffic from the frontend. All service-to-service internal calls bypass it and go direct over the internal Kubernetes network.
Azure API Management / AWS API Gateway are costly at scale and create cloud lock-in. Nginx is free, runs in AKS, and the config is version-controlled in the same repo. Migrating clouds = copy one nginx.conf.
A participant clicks "Launch Lab." The React app calls POST /api/v1/labs/provision. Nginx routes it to lab-service. Lab-service validates the JWT by calling auth-service internally — directly, not through Nginx.
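For reference, a minimal sketch of that internal validation call as a FastAPI dependency. The cluster-internal hostname, the /internal/verify response shape, and the timeout are assumptions, not the real implementation:

```python
# Hypothetical sketch: lab-service asking auth-service to validate a JWT.
# Hostname, payload, and response fields are illustrative assumptions.
import httpx
from fastapi import Header, HTTPException

AUTH_SERVICE_URL = "http://auth-service.gl-prod.svc.cluster.local:8000"  # assumed internal DNS name


async def require_user(authorization: str = Header(...)) -> dict:
    """FastAPI dependency: forward the bearer token to auth-service for validation."""
    async with httpx.AsyncClient(base_url=AUTH_SERVICE_URL, timeout=2.0) as client:
        resp = await client.post(
            "/internal/verify",
            headers={"Authorization": authorization},  # pass the JWT through unchanged
        )
    if resp.status_code != 200:
        raise HTTPException(status_code=401, detail="Invalid or expired token")
    return resp.json()  # assumed shape, e.g. {"user_id": ..., "roles": [...]}
```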
Every business domain is its own FastAPI microservice. The list below is not exhaustive — new services will be added each sprint. The pattern stays consistent: one service, one database, one Helm chart, one GitHub Actions pipeline.
Services call /internal/verify on auth-service to validate tokens.

| Service | Domain | Phase |
|---|---|---|
| auth-service | Identity, JWT, SSO | ✅ Phase 1 |
| tenant-service | Orgs, seats, SAML | Phase 2 |
| event-service | Hackathons, cohorts | Phase 2 |
| lab-service | Lab lifecycle + Temporal | Phase 3 |
| catalog-service | Lab templates, paths | Phase 3 |
| scoring-service | 5-dim grading | Phase 3 |
| billing-service | Tokens, subscriptions | Phase 4 |
| notification-service | Email, in-app alerts | Phase 4 |
| analytics-service | Reports, dashboards | Phase 4 |
| … more each sprint | | |
At 25,000 concurrent users, the scoring engine will be under maximum load while the billing service is idle. A monolith scales everything together and wastes compute. Microservices let us scale each component to exactly what the load demands — scoring-service at 200 pods, auth-service at 10, billing at 2.
Single AKS cluster, namespace-separated by environment:
- gl-dev — developer sandbox
- gl-staging — pre-prod, full stack
- gl-prod — production

For world-record events: a second prod cluster in a second Azure region, fronted by Azure Traffic Manager.
Terraform modules are written provider-agnostic where possible. Switching to AWS = swap the Azure provider for the AWS provider. The app layer (Helm) doesn't change at all.
Each microservice and agent has its own Helm chart, all living in a monorepo charts/ directory (e.g. charts/auth-service/values.yaml).

GitHub Actions does the build — it shouldn't also hold cluster credentials and run kubectl commands. Argo CD owns the deploy half. This separates concerns: if the cluster drifts (someone ran kubectl apply directly), Argo CD detects it and re-converges. With Actions alone, drift is invisible.
Each AI agent is an independently deployed microservice in AKS. They communicate with lab-service and each other via RabbitMQ (async) or direct REST calls (sync, internal network only). Agents are orchestrated by Temporal workflows — Temporal calls the agent services as Activities.
| Agent | Scale Trigger | Max Replicas |
|---|---|---|
| lab-architect | Lab request queue | 10 |
| dataset-forge | Dataset queue depth | 50 |
| provisioning | Provision queue | 100 |
| qa-validator | Validation queue | 50 |
| live-observer | Active lab count | 200 |
| hint-dispenser | Hint request queue | 100 |
| scoring | Score queue depth | 300 |
| cross-validator | Submission queue | 100 |
At T-30min, 20,000 participants submit simultaneously. The scoring queue in RabbitMQ fills with 20,000 messages. KEDA sees the queue depth and scales scoring-agent from 5 → 280 pods in under 90 seconds. Each pod picks jobs from the queue, calls the Claude API for commentary, and writes results to the PostgreSQL scoring DB. The queue drains in ~8 minutes, and KEDA scales back down to 5 pods once it is empty.
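As an illustration of the consumer side, a hedged sketch of a scoring-agent worker draining that queue. The queue name, message shape, and both helper functions are placeholders standing in for the real grading and persistence code:

```python
# Hypothetical scoring-agent worker: pull a submission off RabbitMQ, grade it,
# persist the result, then ack. Queue name and helpers are illustrative only.
import json

import pika


def grade_submission(submission: dict) -> dict:
    # Placeholder for the real 5-dimension grading + Claude commentary call.
    return {"submission_id": submission.get("id"), "score": 0}


def save_score(result: dict) -> None:
    # Placeholder for the write into the PostgreSQL scoring DB.
    print("scored", result)


def handle(ch, method, properties, body):
    submission = json.loads(body)
    save_score(grade_submission(submission))
    ch.basic_ack(delivery_tag=method.delivery_tag)  # ack only after the score is saved


connection = pika.BlockingConnection(pika.ConnectionParameters(host="rabbitmq"))
channel = connection.channel()
channel.queue_declare(queue="submission.received", durable=True)
channel.basic_qos(prefetch_count=1)  # one in-flight job per pod keeps KEDA scaling predictable
channel.basic_consume(queue="submission.received", on_message_callback=handle)
channel.start_consuming()
```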
Temporal solves the hardest problem in lab provisioning: multi-step, long-running processes that must complete correctly even when infrastructure fails. It gives us durable execution — write the workflow as normal Python code, and Temporal guarantees it runs to completion.
A workflow orchestration engine. You write the workflow as ordinary Python (@workflow.defn) with each step as an activity (@activity.defn). Temporal persists every step's result. If the server crashes between step 3 and 4, Temporal replays from step 3's saved result — transparently, automatically.
Without Temporal: you write a background job calling Terraform → Graph API → DB update → notification. The server crashes after Terraform but before the Graph API call. You have a live VM and no license — leaked cost, broken state. You need manual retry logic, dead-letter queues, compensating transactions.
With Temporal: The workflow function replays from the last checkpoint. Terraform step already completed? Skip it. Resume at Graph API. Zero data loss, zero manual recovery.
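To make that concrete, here is a hedged sketch of what the provisioning workflow could look like with the Temporal Python SDK. Activity names, arguments, and timeouts are illustrative assumptions, not the actual lab-service code:

```python
# Hypothetical provisioning workflow: each completed activity is checkpointed,
# so a crash mid-way resumes at the next step instead of re-running Terraform.
from datetime import timedelta

from temporalio import activity, workflow


@activity.defn
async def run_terraform(lab_id: str) -> str:
    ...  # provision the lab VM, return its resource id


@activity.defn
async def assign_license(lab_id: str) -> None:
    ...  # Graph API license assignment


@activity.defn
async def record_vm(lab_id: str, vm_id: str) -> None:
    ...  # DB update with the provisioned VM id


@activity.defn
async def notify_ready(lab_id: str) -> None:
    ...  # publish lab.ready


@workflow.defn
class ProvisionLabWorkflow:
    @workflow.run
    async def run(self, lab_id: str) -> None:
        vm_id = await workflow.execute_activity(
            run_terraform, lab_id, start_to_close_timeout=timedelta(minutes=15)
        )
        await workflow.execute_activity(
            assign_license, lab_id, start_to_close_timeout=timedelta(minutes=2)
        )
        await workflow.execute_activity(
            record_vm, args=[lab_id, vm_id], start_to_close_timeout=timedelta(minutes=1)
        )
        await workflow.execute_activity(
            notify_ready, lab_id, start_to_close_timeout=timedelta(minutes=1)
        )
```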
Carries events between services and agents. Services don't call each other directly for async operations — they publish to a queue and move on.
| Event | Publisher → Consumer |
|---|---|
| event.started | event-service → lab-service |
| lab.provision.requested | lab-service → provisioning-agent |
| lab.ready | lab-service → notification-service |
| submission.received | lab-service → scoring-agent |
| score.ready | scoring-agent → analytics-service |
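The publish side is equally thin. A minimal sketch of lab-service emitting lab.ready, with the exchange name and message body as assumptions:

```python
# Hypothetical publisher: lab-service announces a lab is ready and moves on;
# notification-service consumes it whenever it gets to it.
import json

import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host="rabbitmq"))
channel = connection.channel()
channel.exchange_declare(exchange="gravitylabs.events", exchange_type="topic", durable=True)

channel.basic_publish(
    exchange="gravitylabs.events",
    routing_key="lab.ready",
    body=json.dumps({"lab_id": "lab-123", "participant_id": "user-456"}),  # illustrative payload
    properties=pika.BasicProperties(delivery_mode=2),  # persist the message to disk
)
connection.close()
```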
Valkey is Redis-compatible and BSD-licensed. All ephemeral stateful data lives here — not in PostgreSQL.
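Because Valkey speaks the Redis protocol, the standard redis client works unchanged. A sketch of one such use from the cache layer, a fixed-window rate-limit counter, with key format and limits as assumptions:

```python
# Hypothetical fixed-window rate limiter backed by Valkey (Redis-compatible).
import redis.asyncio as redis

valkey = redis.Redis(host="valkey", port=6379, decode_responses=True)

RATE_LIMIT = 100      # requests allowed...
WINDOW_SECONDS = 60   # ...per 60-second window


async def allow_request(user_id: str) -> bool:
    key = f"ratelimit:{user_id}"
    count = await valkey.incr(key)                 # atomic increment
    if count == 1:
        await valkey.expire(key, WINDOW_SECONDS)   # first hit starts the window's TTL
    return count <= RATE_LIMIT
```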
Azure Service Bus and Azure Cache for Redis are excellent managed services — but they tie you to Azure. RabbitMQ and Valkey run identically on AWS EKS, GCP GKE, or bare metal.
Moving Gravity Labs to AWS = update Terraform provider. The entire messaging and cache layer runs in the new AKS-equivalent cluster with zero code changes.
Production note: In Phase 4, RabbitMQ gets clustered (3 nodes) for HA. Valkey gets a primary + replica pair. Both managed via Helm charts in AKS.
Single PostgreSQL 16 instance in AKS, with one database per microservice. Not five separate Postgres pods — one pod, multiple logical databases inside it.
Managed Postgres services are fine but cloud-specific. Running Postgres in AKS means it stays portable, and backups are just pg_dump → MinIO.

lab-service connects to gravitylabs_labs only, using its own DB user. It physically cannot query gravitylabs_auth — wrong credentials. This is enforced at the Postgres level, not in application code.
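A minimal sketch of what that per-service wiring might look like; the env var name, pool sizes, and session helper are assumptions:

```python
# Hypothetical lab-service database setup: async SQLAlchemy + asyncpg against
# the gravitylabs_labs database only, using lab-service's own credentials.
import os

from sqlalchemy.ext.asyncio import async_sessionmaker, create_async_engine

# e.g. postgresql+asyncpg://lab_service:***@postgres.gl-prod:5432/gravitylabs_labs
DATABASE_URL = os.environ["LAB_DATABASE_URL"]  # assumed env var name

engine = create_async_engine(DATABASE_URL, pool_size=10, max_overflow=20)
SessionLocal = async_sessionmaker(engine, expire_on_commit=False)


async def get_session():
    """FastAPI dependency yielding one async session per request."""
    async with SessionLocal() as session:
        yield session
```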
Deployed inside AKS. Scrapes metrics from every pod via ServiceMonitor CRDs. All FastAPI services expose /metrics (prometheus-fastapi-instrumentator) with key per-service metrics.
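The instrumentation itself is a short per-service setup, following prometheus-fastapi-instrumentator's documented pattern (the service name here is illustrative):

```python
# Expose Prometheus metrics at /metrics for the ServiceMonitor to scrape.
from fastapi import FastAPI
from prometheus_fastapi_instrumentator import Instrumentator

app = FastAPI(title="lab-service")

Instrumentator().instrument(app).expose(app)
```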
Also in AKS. Pre-built dashboards cover lab provisioning queue depth, token burn rate, VM health, active sessions, and event operations.
The event ops team watches the Event Operations dashboard on a big screen. Queue depth rises → KEDA auto-scales → queue drains. If queue doesn't drain, Grafana alert fires → PagerDuty page.
All containers write structured JSON logs via structlog. Loki (in AKS) aggregates logs from all pods. Grafana has a Loki datasource — query logs alongside metrics in the same dashboard.
Example: a participant reports their lab is stuck. Ops queries Loki for lab_id=xyz across all services to see the exact failure point in under 10 seconds.
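A sketch of the logging convention that makes that query work; processor choices and field names are assumptions beyond what is stated above:

```python
# Hypothetical structlog setup: JSON lines with a bound lab_id, so Loki can
# filter the same lab across every service's logs.
import structlog

structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.add_log_level,
        structlog.processors.JSONRenderer(),
    ]
)

log = structlog.get_logger(service="lab-service")

lab_log = log.bind(lab_id="xyz")           # every line below carries lab_id
lab_log.info("provisioning_started")
lab_log.error("provisioning_failed", step="terraform_apply")
```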
Services outside the core application layer. OSS-first and cloud-agnostic — every choice here can be swapped for an equivalent on a different cloud with only a config change, not a code change.
| Concern | Self-managed / OSS (chosen) | Azure equivalent | AWS equivalent | Rationale |
|---|---|---|---|---|
| Transactional email | SendGrid (SMTP API) | Azure Communication Services Email | Amazon SES | SMTP-compatible — swap by changing SMTP host config only |
| Object storage | MinIO in AKS (S3-compatible) | Azure Blob Storage | Amazon S3 | S3 API is the industry standard. MinIO in AKS = zero cloud dependency. Move to S3/Blob = change endpoint + keys only (sketch below this table) |
| CDN / DDoS | Cloudflare (free tier → Pro) | Azure Front Door + WAF | CloudFront + Shield | Cloudflare is provider-agnostic — it sits in front of any cloud origin |
| Secret management | HashiCorp Vault in AKS (on-prem) / Azure Key Vault (cloud) | Azure Key Vault | AWS Secrets Manager | Vault for on-prem deployments. Key Vault for Azure production — AKS has native integration |
| Lab streaming (RDP/SSH) | Apache Guacamole in AKS | Azure Virtual Desktop | Amazon AppStream | Guacamole is open-source, runs anywhere, browser-based — no client install |
| GPU labs / AI workloads | Neko WebRTC + AKS GPU node pool | Azure NC-series nodes | AWS P-series | Neko provides browser-based desktop access to GPU workloads |
| SMS / Push (future) | TBD — Twilio or self-hosted | Azure Communication Services | Amazon SNS / Pinpoint | Not in Phase 1–3 scope. Evaluate in Phase 4. |
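To illustrate the config-only swap in the object-storage row above, a sketch using boto3 against an in-cluster MinIO endpoint. The endpoint, bucket, and credentials are hypothetical:

```python
# Hypothetical object-storage client: the same boto3 code talks to MinIO in AKS
# today; pointing it at another S3-compatible endpoint is a config change.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://minio.gl-prod.svc.cluster.local:9000",  # drop/replace when moving off MinIO
    aws_access_key_id="<minio-access-key>",
    aws_secret_access_key="<minio-secret-key>",
)

s3.upload_file("submission.zip", "gravitylabs-submissions", "event-42/team-7.zip")
```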
Enables independent scaling during events. KEDA scales scoring-agent independently from lab-architect-agent.
Self-managed in AKS. Cloud-agnostic. Logical isolation enforced at the DB-user level. Individual DBs can be split out later if needed.
Actions = build + push to ACR. Argo CD = GitOps sync to AKS. Drift detection included.
gl-dev / gl-staging / gl-prod namespaces. Second cluster added for world-record multi-region events.
BSD-licensed Redis fork. Production: consider Azure Cache for Redis for managed HA. Code unchanged either way.
S3 API compatibility means swapping to actual S3 / Azure Blob is a config-only change.
Current Phase 1 uses HS256 (symmetric). Phase 2 should migrate to RS256 so services only hold public keys, not the signing secret.
Twilio vs self-hosted vs Azure Communication Services. Evaluate in Phase 4 when notification-service is scoped.
🔴 CRITICAL — CSP procurement is blocked until enrollment completes. 2–4 week timeline. Blocks Phase 3 lab provisioning.
"Report" service referenced in early diagrams. Formally tracked as analytics-service in Phase 4. Scope TBD.