HyperRoute

Observability

HyperRoute ships with comprehensive observability built in — not as a paid plugin. Health checks, Prometheus metrics, OpenTelemetry tracing, and slow query detection are all available out of the box.


Health Checks

Three-tier health checks for Kubernetes and other orchestration platforms:

| Endpoint | Purpose | Use For |
|---|---|---|
| `/health/live` | Process is running | Kubernetes `livenessProbe` |
| `/health/ready` | Ready to serve traffic | Kubernetes `readinessProbe` |
| `/health/startup` | Initial startup complete | Kubernetes `startupProbe` |

Response Format

```json
{
  "status": "healthy",
  "checks": {
    "schema_loaded": true,
    "upstreams_reachable": true,
    "cache_connected": true
  },
  "uptime_seconds": 3600,
  "version": "0.1.0"
}
```

Configuration

```yaml
observability:
  health:
    enabled: true
    live_path: /health/live
    ready_path: /health/ready
    startup_path: /health/startup
    check_interval: 30s
```
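These endpoints map directly onto Kubernetes probes. A minimal pod-spec fragment might look like the following (the port and timing values are illustrative; 4000 matches the gateway port used in the examples elsewhere on this page):

```yaml
# Illustrative container probe config — adjust port and periods to your deployment.
livenessProbe:
  httpGet:
    path: /health/live
    port: 4000
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /health/ready
    port: 4000
  periodSeconds: 5
startupProbe:
  httpGet:
    path: /health/startup
    port: 4000
  failureThreshold: 30   # allow up to 60s (30 × 2s) for initial startup
  periodSeconds: 2
```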

Prometheus Metrics

Available at http://localhost:9091/metrics (separate port, configurable):

Request Metrics

| Metric | Type | Description |
|---|---|---|
| `hyperroute_requests_total` | Counter | Total requests by operation and status |
| `hyperroute_request_duration_seconds` | Histogram | Request latency distribution |
| `hyperroute_active_connections` | Gauge | Current active connections |
| `hyperroute_errors_total` | Counter | Errors by type and subgraph |
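The duration histogram can be turned into latency percentiles with `histogram_quantile()`, using the `_bucket` series Prometheus generates for every histogram. For example, a gateway-wide p95 over a 5-minute window:

```promql
histogram_quantile(
  0.95,
  sum by (le) (rate(hyperroute_request_duration_seconds_bucket[5m]))
)
```

Aggregating with `sum by (le)` before taking the quantile combines all label combinations into a single distribution; drop it (or extend the `by` clause) to get per-operation percentiles.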

Subgraph Metrics

| Metric | Type | Description |
|---|---|---|
| `hyperroute_subgraph_requests_total` | Counter | Per-subgraph request counts |
| `hyperroute_subgraph_duration_seconds` | Histogram | Per-subgraph latency |
| `hyperroute_upstream_errors_total` | Counter | Upstream service errors |

Cache Metrics

| Metric | Type | Description |
|---|---|---|
| `hyperroute_cache_hits_total` | Counter | Plan/response cache hits |
| `hyperroute_cache_misses_total` | Counter | Plan/response cache misses |
| `hyperroute_inflight_dedup_hits_total` | Counter | In-flight deduplication hits |
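The hit and miss counters combine into a cache hit ratio. A typical PromQL expression for the ratio over the last 5 minutes:

```promql
sum(rate(hyperroute_cache_hits_total[5m]))
/
(
  sum(rate(hyperroute_cache_hits_total[5m]))
  + sum(rate(hyperroute_cache_misses_total[5m]))
)
```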

Security Metrics

| Metric | Type | Description |
|---|---|---|
| `hyperroute_graphql_errors_total` | Counter | GraphQL-layer errors |
| `hyperroute_requests_blocked_total` | Counter | Security-blocked requests |
| `hyperroute_rate_limited_total` | Counter | Rate-limited requests |

Configuration

```yaml
observability:
  metrics:
    enabled: true
    path: /metrics
    include_runtime_metrics: true
    histogram_buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
```
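A matching Prometheus scrape job could look like the following sketch (the hostname is illustrative; the port is the default 9091 mentioned above):

```yaml
# Illustrative prometheus.yml fragment — replace the target with your
# HyperRoute host; 9091 is the default metrics port.
scrape_configs:
  - job_name: hyperroute
    metrics_path: /metrics
    static_configs:
      - targets: ["hyperroute:9091"]
```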

Distributed Tracing (OpenTelemetry)

HyperRoute exports traces via OTLP (gRPC) to any compatible backend: Jaeger, Tempo, Zipkin, Datadog APM, and more.

Span Hierarchy

```text
graphql.request (total)
├── graphql.parse
├── graphql.validate
├── graphql.plan
│   └── plan.cache.lookup [cache: hit/miss]
└── graphql.execute
    ├── subgraph.fetch [users] (25ms)
    │   └── http.request
    ├── subgraph.fetch [products] (30ms)
    │   └── http.request
    └── subgraph.fetch [inventory] (20ms)
        └── http.request
```

Span Attributes

| Attribute | Description |
|---|---|
| `graphql.operation.name` | Query/mutation name |
| `graphql.operation.type` | `query`, `mutation`, or `subscription` |
| `subgraph.name` | Target subgraph |
| `cache.status` | `hit`, `miss`, or `skip` |

Configuration

Via environment variables:

```shell
export OTEL_EXPORTER_OTLP_ENDPOINT="http://tempo:4317"
export OTEL_SERVICE_NAME="hyperroute-production"
routerd serve --config router.yaml
```

Or in router.yaml:

```yaml
observability:
  tracing:
    enabled: true
    otlp_endpoint: "http://tempo:4317"
    service_name: "hyperroute-production"
    sampling_rate: 0.1       # 10% of requests
    propagation: "w3c"       # or b3, jaeger
```
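If an OpenTelemetry Collector sits between HyperRoute and your tracing backend, a minimal pipeline that accepts the router's OTLP gRPC traffic on 4317 could look like this sketch (the exporter endpoint is a placeholder for your backend):

```yaml
# Minimal OpenTelemetry Collector config — the exporter endpoint is illustrative.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  batch: {}
exporters:
  otlp:
    endpoint: tempo:4317   # replace with your backend's OTLP address
    tls:
      insecure: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```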

Slow Query Detection

HyperRoute automatically identifies slow queries and stores them for inspection:

```shell
curl http://localhost:4000/__hyperroute/slow-queries | jq
```

Response:

```json
[
  {
    "query": "query GetDashboard { ... }",
    "operation_name": "GetDashboard",
    "duration_ms": 1250,
    "timestamp": "2026-02-20T10:15:30Z",
    "plan": { "steps": ["..."] }
  }
]
```

Use this alongside the query plan explain endpoint for debugging:

```shell
# Explain a query plan
curl -X POST http://localhost:4000/.well-known/hyperroute/plan \
  -H "Content-Type: application/json" \
  -d '{"query": "{ user(id: \"1\") { name orders { total } } }"}'
```

Alerting Rules (Prometheus)

Recommended alerting rules for production:

```yaml
groups:
  - name: hyperroute
    rules:
      - alert: HighErrorRate
        expr: rate(hyperroute_errors_total[5m]) / rate(hyperroute_requests_total[5m]) > 0.05
        for: 5m
        annotations:
          summary: "HyperRoute error rate above 5%"

      - alert: SlowSubgraph
        expr: histogram_quantile(0.95, rate(hyperroute_subgraph_duration_seconds_bucket[5m])) > 2
        for: 10m
        annotations:
          summary: "Subgraph p95 latency exceeds 2s"

      - alert: HighInflightDedup
        # Alert on the rate of dedup hits, not the raw counter — a counter only
        # ever increases, so comparing it to a threshold would fire permanently.
        expr: rate(hyperroute_inflight_dedup_hits_total[1m]) > 1000
        for: 1m
        annotations:
          summary: "Thundering herd detected — dedup saving upstream load"
```

Complete Observability Config

```yaml
observability:
  health:
    enabled: true
    live_path: /health/live
    ready_path: /health/ready
    startup_path: /health/startup
    check_interval: 30s

  metrics:
    enabled: true
    path: /metrics
    include_runtime_metrics: true
    histogram_buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]

  tracing:
    enabled: true
    service_name: hyperroute
    otlp_endpoint: ${OTEL_EXPORTER_OTLP_ENDPOINT}
    sampling_rate: 0.1
    propagation: w3c
```

Next Steps

  • Caching — Distributed caching for query plans and responses
  • Deployment — Production deployment with Prometheus and Grafana
  • Configuration — Full observability config reference