Observability
HyperRoute ships with observability built in rather than as a paid plugin. Health checks, Prometheus metrics, OpenTelemetry tracing, and slow query detection are available out of the box.
Health Checks
Three-tier health checks for Kubernetes and other orchestration platforms:
| Endpoint | Purpose | Use For |
|---|---|---|
| /health/live | Process is running | Kubernetes livenessProbe |
| /health/ready | Ready to serve traffic | Kubernetes readinessProbe |
| /health/startup | Initial startup complete | Kubernetes startupProbe |
Response Format
{
  "status": "healthy",
  "checks": {
    "schema_loaded": true,
    "upstreams_reachable": true,
    "cache_connected": true
  },
  "uptime_seconds": 3600,
  "version": "0.1.0"
}
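A client or deployment script can act on this payload by listing the checks that are not passing. The following is a minimal sketch that assumes only the field names shown above; the helper name `failed_checks` is hypothetical and not part of HyperRoute.

```python
import json


def failed_checks(payload: str) -> list[str]:
    """Return the names of checks that are not reported as passing."""
    body = json.loads(payload)
    checks = body.get("checks", {})
    # Treat anything other than a literal `true` as a failing check.
    return sorted(name for name, ok in checks.items() if ok is not True)


# Sample payload in the documented shape, with one check flipped to false.
sample = """{
  "status": "healthy",
  "checks": {
    "schema_loaded": true,
    "upstreams_reachable": false,
    "cache_connected": true
  },
  "uptime_seconds": 3600,
  "version": "0.1.0"
}"""

print(failed_checks(sample))
```

An empty list means every reported check passed; a non-empty list names the checks to investigate.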
Configuration
observability:
  health:
    enabled: true
    live_path: /health/live
    ready_path: /health/ready
    startup_path: /health/startup
    check_interval: 30s
Prometheus Metrics
Metrics are served at http://localhost:9091/metrics, on a separate, configurable port:
Request Metrics
| Metric | Type | Description |
|---|---|---|
| hyperroute_requests_total | Counter | Total requests by operation and status |
| hyperroute_request_duration_seconds | Histogram | Request latency distribution |
| hyperroute_active_connections | Gauge | Current active connections |
| hyperroute_errors_total | Counter | Errors by type and subgraph |
Subgraph Metrics
| Metric | Type | Description |
|---|---|---|
| hyperroute_subgraph_requests_total | Counter | Per-subgraph request counts |
| hyperroute_subgraph_duration_seconds | Histogram | Per-subgraph latency |
| hyperroute_upstream_errors_total | Counter | Upstream service errors |
Cache Metrics
| Metric | Type | Description |
|---|---|---|
| hyperroute_cache_hits_total | Counter | Plan/response cache hits |
| hyperroute_cache_misses_total | Counter | Plan/response cache misses |
| hyperroute_inflight_dedup_hits_total | Counter | In-flight deduplication hits |
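The hit and miss counters are most useful as a ratio. The sketch below sums both counters from Prometheus text exposition output and derives a hit ratio. The `cache="plan"` and `cache="response"` labels and all sample values are assumptions made for illustration; the labels HyperRoute actually attaches may differ, and the parser handles only simple lines.

```python
def sum_metric(exposition: str, name: str) -> float:
    """Sum the samples of one metric across all of its label sets."""
    total = 0.0
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "}" in line:
            # Labelled series: name{labels} value [timestamp]
            series, _, rest = line.rpartition("}")
            metric = series.split("{", 1)[0]
        else:
            # Unlabelled series: name value [timestamp]
            metric, _, rest = line.partition(" ")
        if metric == name:
            total += float(rest.split()[0])
    return total


def cache_hit_ratio(exposition: str) -> float:
    hits = sum_metric(exposition, "hyperroute_cache_hits_total")
    misses = sum_metric(exposition, "hyperroute_cache_misses_total")
    lookups = hits + misses
    return hits / lookups if lookups else 0.0


# Hypothetical scrape output; label names and values are illustrative only.
sample = """# TYPE hyperroute_cache_hits_total counter
hyperroute_cache_hits_total{cache="plan"} 900
hyperroute_cache_hits_total{cache="response"} 300
hyperroute_cache_misses_total{cache="plan"} 100
hyperroute_cache_misses_total{cache="response"} 300
"""

print(cache_hit_ratio(sample))
```

In a dashboard the same ratio is usually computed over `rate()` of both counters, so that it reflects recent traffic rather than totals since process start.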
Security Metrics
| Metric | Type | Description |
|---|---|---|
| hyperroute_graphql_errors_total | Counter | GraphQL-layer errors |
| hyperroute_requests_blocked_total | Counter | Security-blocked requests |
| hyperroute_rate_limited_total | Counter | Rate-limited requests |
Configuration
observability:
  metrics:
    enabled: true
    path: /metrics
    include_runtime_metrics: true
    histogram_buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
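The values in `histogram_buckets` are upper bounds in seconds. Prometheus histograms are cumulative: an observation is counted in every bucket whose `le` bound is greater than or equal to it, plus the implicit `+Inf` bucket. The sketch below finds the smallest bucket that counts a given latency; the function name is hypothetical and the bounds are copied from the config above.

```python
from bisect import bisect_left

# Upper bounds (seconds) copied from the histogram_buckets setting.
BUCKETS = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]


def smallest_bucket(seconds: float) -> str:
    """Return the `le` label of the smallest bucket that counts this latency."""
    # bisect_left keeps the bound inclusive: a 0.05s request lands in le="0.05".
    index = bisect_left(BUCKETS, seconds)
    return str(BUCKETS[index]) if index < len(BUCKETS) else "+Inf"


for latency in (0.03, 0.05, 0.7, 12):
    print(latency, "->", smallest_bucket(latency))
```

Requests slower than the largest bound fall only into `+Inf`, so quantile estimates above 10 seconds lose resolution with these buckets.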
Distributed Tracing (OpenTelemetry)
HyperRoute exports traces via OTLP (gRPC) to any OTLP-compatible backend (Jaeger, Tempo, Zipkin, Datadog APM, and others), either directly or through an OpenTelemetry Collector.
Span Hierarchy
graphql.request (total)
├── graphql.parse
├── graphql.validate
├── graphql.plan
│   └── plan.cache.lookup [cache: hit/miss]
└── graphql.execute
    ├── subgraph.fetch [users] (25ms)
    │   └── http.request
    ├── subgraph.fetch [products] (30ms)
    │   └── http.request
    └── subgraph.fetch [inventory] (20ms)
        └── http.request
Span Attributes
| Attribute | Description |
|---|---|
| graphql.operation.name | Query/mutation name |
| graphql.operation.type | query, mutation, subscription |
| subgraph.name | Target subgraph |
| cache.status | hit, miss, skip |
Configuration
Via environment variables:
export OTEL_EXPORTER_OTLP_ENDPOINT="http://tempo:4317"
export OTEL_SERVICE_NAME="hyperroute-production"
routerd serve --config router.yaml
Or in router.yaml:
observability:
  tracing:
    enabled: true
    otlp_endpoint: "http://tempo:4317"
    service_name: "hyperroute-production"
    sampling_rate: 0.1 # 10% of requests
    propagation: "w3c" # or b3, jaeger
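With `propagation: "w3c"`, trace context presumably travels in the `traceparent` header defined by the W3C Trace Context specification. The sketch below parses the version `00` layout of that header, which can help when checking whether a subgraph received the context the router sent. The function name is hypothetical, and this illustrates the header format rather than HyperRoute's own implementation.

```python
import re

# Version 00 layout: version-traceid-parentid-flags, all lowercase hex.
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header: str):
    """Parse a W3C `traceparent` header; return None if it is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the specification.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        # Bit 0 of the flags byte is the "sampled" flag.
        "sampled": bool(int(flags, 16) & 0x01),
    }


print(parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"))
```

A `sampled` value of `False` on an incoming request means the caller did not sample the trace; how that interacts with `sampling_rate` depends on the router's sampler and is not specified here.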
Slow Query Detection
HyperRoute automatically identifies slow queries and stores them for inspection:
curl http://localhost:4000/__hyperroute/slow-queries | jq
Response:
[
  {
    "query": "query GetDashboard { ... }",
    "operation_name": "GetDashboard",
    "duration_ms": 1250,
    "timestamp": "2026-02-20T10:15:30Z",
    "plan": { "steps": ["..."] }
  }
]
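Once the list grows, it helps to rank entries by duration before digging into individual plans. The sketch below assumes only the `operation_name` and `duration_ms` fields shown above; the helper name and the two extra sample entries are hypothetical.

```python
import json


def slowest(payload: str, limit: int = 3) -> list[tuple[str, int]]:
    """Return (operation_name, duration_ms) pairs, slowest first."""
    entries = json.loads(payload)
    ranked = sorted(entries, key=lambda entry: entry["duration_ms"], reverse=True)
    return [(e["operation_name"], e["duration_ms"]) for e in ranked[:limit]]


# First entry mirrors the documented response; the others are made up.
sample = json.dumps([
    {"operation_name": "GetDashboard", "duration_ms": 1250},
    {"operation_name": "ListOrders", "duration_ms": 2100},
    {"operation_name": "GetProfile", "duration_ms": 640},
])

print(slowest(sample, limit=2))
```

The payload can come straight from the `curl` command above, for example by piping its output into a script that reads standard input.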
Use this alongside the query plan explain endpoint for debugging:
# Explain a query plan
curl -X POST http://localhost:4000/.well-known/hyperroute/plan \
  -H "Content-Type: application/json" \
  -d '{"query": "{ user(id: \"1\") { name orders { total } } }"}'
Alerting Rules (Prometheus)
Recommended alerting rules for production:
groups:
  - name: hyperroute
    rules:
      - alert: HighErrorRate
        # sum() on both sides so series with different labels divide cleanly
        expr: sum(rate(hyperroute_errors_total[5m])) / sum(rate(hyperroute_requests_total[5m])) > 0.05
        for: 5m
        annotations:
          summary: "HyperRoute error rate above 5%"
      - alert: SlowSubgraph
        expr: histogram_quantile(0.95, rate(hyperroute_subgraph_duration_seconds_bucket[5m])) > 2
        for: 10m
        annotations:
          summary: "Subgraph p95 latency exceeds 2s"
      - alert: HighInflightDedup
        # increase() over a window; a raw counter only grows and would never clear
        expr: sum(increase(hyperroute_inflight_dedup_hits_total[5m])) > 1000
        for: 1m
        annotations:
          summary: "Thundering herd detected: dedup saving upstream load"
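The HighErrorRate rule compares two per-second rates. As a worked illustration of the arithmetic, the sketch below computes the same ratio from counter readings taken at the start and end of a 5-minute window. It is a simplification: Prometheus's `rate()` also handles counter resets and extrapolates to the window edges, and the sample readings are made up.

```python
def per_second_rate(first: float, last: float, window_seconds: float) -> float:
    """Average per-second increase of a counter across a window (no reset handling)."""
    return (last - first) / window_seconds


def error_ratio(errors: tuple[float, float],
                requests: tuple[float, float],
                window_seconds: float = 300.0) -> float:
    """Ratio of the error rate to the request rate over the same window."""
    request_rate = per_second_rate(*requests, window_seconds)
    if request_rate == 0:
        return 0.0  # no traffic in the window, so nothing to alert on
    return per_second_rate(*errors, window_seconds) / request_rate


# Hypothetical readings: 30 new errors against 500 new requests in 5 minutes.
ratio = error_ratio(errors=(120, 150), requests=(10_000, 10_500))
print(f"{ratio:.2%}", "fires" if ratio > 0.05 else "ok")
```

Because the rule also carries `for: 5m`, the ratio has to stay above the threshold for five consecutive minutes before the alert fires.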
Complete Observability Config
observability:
  health:
    enabled: true
    live_path: /health/live
    ready_path: /health/ready
    startup_path: /health/startup
    check_interval: 30s
  metrics:
    enabled: true
    path: /metrics
    include_runtime_metrics: true
    histogram_buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
  tracing:
    enabled: true
    service_name: hyperroute
    otlp_endpoint: ${OTEL_EXPORTER_OTLP_ENDPOINT}
    sampling_rate: 0.1
    propagation: w3c
Next Steps
- Caching — Distributed caching for query plans and responses
- Deployment — Production deployment with Prometheus and Grafana
- Configuration — Full observability config reference