GitHub

This document describes the system architecture, data flow, components, and security model of Metrx.

System Overview

Metrx is a distributed system for real-time cost tracking and outcome management of AI agents:

┌─────────────────────────────────────────────────────────────────┐
│ Client Applications (Your Code)                                  │
├─────────────────────────────────────────────────────────────────┤
│                                                                   │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │ Metrx SDK / HTTP Client                           │   │
│  │ - Routes requests through Gateway                        │   │
│  │ - Adds authentication & tracking headers                 │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                                                   │
└──────────────────────────────┬──────────────────────────────────┘
                               │
                        HTTP/JSON
                               │
┌──────────────────────────────▼──────────────────────────────────┐
│ Metrx Gateway (Cloudflare Worker)                         │
│ Runs on edge, sub-millisecond latency                            │
├──────────────────────────────────────────────────────────────────┤
│ 1. Authenticate API key (KV cache → Supabase)                    │
│ 2. Check usage limits against monthly quota                      │
│ 3. Extract custom headers (X-Agent-ID, X-Session-ID, etc.)       │
│ 4. Route to correct LLM provider (OpenAI, Anthropic, etc.)       │
│ 5. Proxy request with provider auth                              │
│ 6. Stream response back to client                                │
│ 7. Calculate cost from token counts                              │
│ 8. Queue event to Redis (fire-and-forget)                        │
└──────────────────────────────┬──────────────────────────────────┘
                               │
         ┌─────────────────────┴──────────────┬──────────────┐
         │                                    │              │
    HTTP/JSON                          HTTPS                │
         │                                    │              │
    ┌────▼────────────────────────┐  ┌───────▼──────┐  ┌────▼──────┐
    │ LLM Providers               │  │ Cloudflare KV│  │ Upstash   │
    ├─────────────────────────────┤  │ (API Key     │  │ Redis     │
    │ - OpenAI                    │  │  Cache)      │  │ (Event    │
    │ - Anthropic                 │  └──────────────┘  │  Queue)   │
    │ - Google (Gemini)           │                    └───────┬────┘
    │ - xAI (Grok)                │                           │
    │ - Others                    │                   Event Stream
    └─────────────────────────────┘                           │
                                                              │
         ┌────────────────────────────────────────────────────▼─────┐
         │ Worker Processes (BullMQ)                                 │
         ├──────────────────────────────────────────────────────────┤
         │ 1. Consume events from Redis queue                       │
         │ 2. Enrich event data (org lookup, pricing)               │
         │ 3. Write to Supabase (events, usage counters)            │
         │ 4. Trigger webhooks                                      │
         │ 5. Update real-time metrics                              │
         │ 6. Process inferred outcomes                             │
         └──────────────────┬───────────────────────────────────────┘
                            │
                   PostgreSQL/HTTPS
                            │
         ┌──────────────────▼──────────────────┐
         │ Supabase (PostgreSQL + Auth)         │
         ├───────────────────────────────────────┤
         │ Tables:                               │
         │ - organizations                       │
         │ - api_keys                            │
         │ - events (LLM calls)                  │
         │ - sessions                            │
         │ - outcomes                            │
         │ - monthly_usage_counters              │
         │ - webhooks                            │
         └──────────────────┬──────────────────┘
                            │
┌───────────────────────────▼──────────────────────────┐
│ Web Dashboard (Next.js)                              │
├───────────────────────────────────────────────────────┤
│ - Real-time cost dashboards                          │
│ - Agent & session metrics                            │
│ - Team management                                    │
│ - Webhook configuration                              │
│ - Billing integration (Stripe)                       │
└───────────────────────────────────────────────────────┘

Component Details

1. Gateway (Cloudflare Worker)

Purpose: Transparent proxy for LLM API calls with sub-millisecond overhead.

Location: /apps/gateway

Key Features:

Runs globally on Cloudflare edge (low latency)
API key authentication with KV caching
Transparent request forwarding (OpenAI API compatible)
Token usage extraction & cost calculation
Real-time event queueing to Redis

Latency Budget: <100ms p95 added latency

Critical Dependencies:

Cloudflare KV (API key cache)
Upstash Redis (event queue)
Supabase (org lookup on cache miss)
LLM provider APIs (OpenAI, Anthropic, Google, xAI)

Failure Modes:

Auth cache miss + Redis down: Returns 503 (won’t process requests)
Provider timeout: Returns 504 with retry information
Request size > 10MB: Returns 413
Rate limit exceeded: Returns 429

2. Web Dashboard (Next.js)

Purpose: User interface for cost tracking, team management, and outcome tracking.

Location: /apps/web

Key Features:

Real-time cost dashboards (powered by Supabase subscriptions)
Agent and session drill-down
Team/user management (via Clerk)
Outcome tracking and business ROI calculation
Billing (Stripe integration)
Webhook management

Authentication: Clerk (OAuth + JWT)

Database Access: Supabase client (RLS enforced)

3. Worker Processes (BullMQ)

Purpose: Asynchronous processing of events from the event queue.

Location: /workers (or separate service)

Key Features:

Event dequeuing from Redis
Database writes (Supabase)
Webhook dispatching
Monthly usage counter updates
Session aggregation

Job Queue: BullMQ (Redis-backed)

Scaling: Horizontally scalable (multiple worker instances)

Processing Latency: P99 <5 seconds from event creation to database

4. Database (Supabase / PostgreSQL)

Purpose: Durable storage for organizations, events, sessions, outcomes, and configuration.

Key Tables: organizations, api_keys, events, sessions, outcomes, monthly_usage_counters, webhooks

Row-Level Security (RLS): All tables use RLS policies to ensure users only access their organization’s data.

See the API Reference for field-level details on request/response formats.

Data Flow

Request Path (Real-Time, Synchronous)

Client Request
    ↓
Gateway receives request
    ↓
Authenticate API key
  ├─ Check KV cache (fast path, 90% hit rate)
  └─ Miss? Query Supabase + cache result
    ↓
Check usage limits
  ├─ Query monthly_usage_counters for org
  └─ Reject if over limit (return 429)
    ↓
Extract custom headers (X-Agent-ID, X-Session-ID, etc.)
    ↓
Parse request & resolve provider
    ↓
Forward to LLM provider API
  ├─ Stream response back to client
  └─ Capture token usage from final chunk
    ↓
Calculate cost (from model pricing + tokens)
    ↓
Queue event to Redis (fire-and-forget, `<1ms`)
    ↓
Return response to client

Total Latency to Client: Provider latency + <100ms gateway overhead

Event Processing Path (Async, Background)

Event in Redis queue
    ↓
BullMQ worker dequeues event
    ↓
Enrich event metadata
  └─ Org lookup, pricing verification
    ↓
Write to Supabase
  ├─ Insert into events table
  └─ Increment monthly_usage_counters
    ↓
Update session aggregates (if session_id present)
    ↓
Trigger webhooks (HTTP POST to registered URLs)
    ↓
Process inferred outcomes (ML-based)
    ↓
Update real-time metrics for dashboard

Total Latency: P99 <5 seconds

Outcome Tracking Path

Business outcome occurs (e.g., customer satisfied)
    ↓
Dashboard user confirms outcome (or via API)
    ↓
Write to outcomes table
    ↓
Trigger outcome.confirmed webhook
    ↓
Calculate session ROI
  ├─ Total cost for session
  └─ Assign value to outcome
    ↓
Update dashboards & reports

Security Model

API Key Security

Generation: Client generates random 32-byte key
Storage: Only SHA-256 hash stored in database
Transmission: Over TLS 1.3 only
Caching: Hashed key cached in Cloudflare KV for 1 hour
Rotation: Users can rotate keys anytime; old key invalidated immediately

Network Security

TLS 1.3: All external communication encrypted
HTTPS Only: Gateway rejects HTTP requests
CORS: Restricted to configured origins (default: open in dev, restricted in prod)
Rate Limiting: Per-org limits + per-IP burst limits
Request Size: Max 10MB per request

Data Isolation

Row-Level Security: Supabase RLS policies enforce org isolation
API Keys: Scoped to single org
Webhooks: Only receive events from their org
Dashboard: Users only see their org’s data

Authentication & Authorization

Gateway: API key + org_id lookup

Web Dashboard: Clerk OAuth + JWT session

Webhooks: HMAC-SHA256 signature verification

Performance Characteristics

The Gateway adds minimal overhead to LLM provider latency. See the API Reference for rate limits by tier.

Scalability

Gateway: Horizontally scalable across Cloudflare edge locations
Workers: Horizontally scalable (add more worker instances)
Database: Supabase handles auto-scaling; events table can store years of data
Redis: Upstash Redis auto-scales; queue typically empty (sub-second processing)

Integration Points

LLM Providers

The Gateway proxies to:

OpenAI: GPT-4, GPT-4 Turbo, GPT-3.5 Turbo, etc.
Anthropic: Claude 3 (Opus, Sonnet, Haiku)
Google: Gemini Pro
xAI: Grok

Each provider has its own authentication & API format, abstracted by the gateway.

Billing Integration

Stripe: Monthly subscription billing + usage-based overage charges
Webhook: Triggered when usage exceeds plan limit

Observability

Sentry: Error tracking for gateway and workers
OpenTelemetry: Traces exported to your observability backend
Webhooks: Real-time event stream for custom logging

Deployment Architecture

Production

Cloudflare (Global CDN)
  ↓
  ├─ Gateway (Cloudflare Workers)
  ├─ KV (API key cache)
  └─ Pages (Static assets for docs)

AWS / Vercel
  ├─ Web Dashboard (Next.js)
  └─ Worker processes (EC2 / Vercel Functions)

Supabase (Managed PostgreSQL)
  └─ All persistent data

Upstash (Managed Redis)
  └─ Event queue

Stripe
  └─ Billing & payments

Clerk
  └─ Authentication & user management

Self-Hosting

See Self-Hosting Guide for details on running on your infrastructure.

Monitoring & Observability

Key Metrics

Real-Time (Dashboard):

Cost per agent (last 24h)
Call volume by model
Avg latency
Error rate

Historical (Reports):

Cost trends (daily, weekly, monthly)
Model usage distribution
Customer chargeback calculation
Outcome success rate & ROI

Alerts

Cost spike detected (> threshold)
Rate limit approached (> 80% of quota)
Provider error rate spike
Webhook delivery failures

Next Steps: See Self-Hosting Guide for deployment details or API Reference for integration.

Integration Guide Self-Hosting