Monitoring Stack (Prometheus & Grafana)
What this covers
How to deploy, configure, and extend the optional Prometheus + Grafana monitoring stack that provides real-time observability into every Tessallite service. This page explains the architecture, what data is collected, how the dashboard is organised, and how to add your own metrics and panels.
Overview
Tessallite ships with an optional monitoring stack that runs as a completely separate Docker Compose project. It is not required to run the platform and can be started or stopped independently without affecting any Tessallite service.
The stack consists of three containers:
| Container | Image | Purpose |
|---|---|---|
| prometheus | prom/prometheus:v2.51.0 | Scrapes /metrics from every service every 15 seconds, stores time-series data for up to 15 days |
| grafana | grafana/grafana:10.4.0 | Visualises metrics through a pre-built dashboard with 21 panels across three sections |
| nginx-exporter | nginx/nginx-prometheus-exporter:1.1 | Translates the frontend's nginx stub_status into Prometheus-format metrics |
All three containers live in the monitoring/ directory at the workspace root, separate from the main tessallite/infra/ Docker Compose stack.
How it connects to Tessallite
Both stacks share a Docker network called tessallite_net. This allows Prometheus (running in the monitoring stack) to reach every Tessallite service by container name, even though they are managed by different Docker Compose projects.
The main stack creates the network automatically on docker compose up. The monitoring stack's deploy script also creates it if it does not exist, so either stack can be started first.
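The deploy script itself is not reproduced on this page, so the following is only a sketch of what "create the network if it does not exist" could look like. It relies on the standard `docker network inspect` and `docker network create` commands; the `ensure_network` helper and its injectable `run` parameter are hypothetical names used for illustration.

```python
import subprocess


def ensure_network(name: str = "tessallite_net", run=subprocess.run) -> bool:
    """Create the Docker network if it is missing.

    Returns True when the network had to be created, False when it
    already existed. `run` is injectable so the logic can be exercised
    without a Docker daemon (an assumption made for this sketch).
    """
    # `docker network inspect` exits non-zero when the network is unknown.
    inspect = run(["docker", "network", "inspect", name], capture_output=True)
    if inspect.returncode == 0:
        return False
    # check=True raises CalledProcessError if creation fails.
    run(["docker", "network", "create", name], check=True, capture_output=True)
    return True
```

Because the check is idempotent, calling it from either stack's startup path would give the "either stack can be started first" behaviour described above.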
Deploying the monitoring stack
Prerequisites
- Docker and Docker Compose v2 installed
- The main Tessallite stack running (or at least one prior run of docker compose up, which creates the shared network)
Steps
- Navigate to the monitoring directory: cd monitoring/
- Create the environment file: cp .env.example .env
- Set GRAFANA_ADMIN_PASSWORD in .env
- Deploy: bash deploy.sh (Linux/macOS/Git Bash) or deploy.bat (Windows)
- Open the dashboards:
  - Prometheus: http://127.0.0.1:9090
  - Grafana: http://127.0.0.1:3001 (username: admin, password: the value you set in .env)
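After deploying, a quick way to confirm that both UIs are reachable is to probe their health endpoints. The sketch below assumes the default addresses listed above and uses Prometheus's `/-/healthy` and Grafana's `/api/health` endpoints; the `is_healthy` and `check_all` helpers are illustrative names, not part of the Tessallite tooling.

```python
from urllib.error import URLError
from urllib.request import urlopen

# Addresses from the deployment steps; adjust if you remapped the ports.
HEALTH_ENDPOINTS = {
    "prometheus": "http://127.0.0.1:9090/-/healthy",
    "grafana": "http://127.0.0.1:3001/api/health",
}


def is_healthy(url: str, timeout: float = 3.0) -> bool:
    """Return True if the endpoint answers with HTTP 200 within the timeout."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False


def check_all(endpoints=HEALTH_ENDPOINTS, probe=is_healthy) -> dict:
    """Probe every endpoint and map its name to a healthy/unhealthy flag."""
    return {name: probe(url) for name, url in endpoints.items()}
```

Calling `check_all()` with the stack running should map both names to `True`; a `False` entry usually means the container is still starting or the port is mapped differently.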
Teardown
bash teardown.sh # stop and remove data
bash teardown.sh --keep-data # stop but preserve volumes
What data is collected
Scraped services
Prometheus scrapes metrics from all seven Tessallite services:
| Service | Port | Metrics source |
|---|---|---|
| model-service | 8001 | prometheus-client via FastAPI middleware |
| query-router | 8000 | prometheus-client via FastAPI middleware |
| optimizer | 8000 | prometheus-client via FastAPI middleware |
| scheduler | 8000 | prometheus-client via FastAPI middleware |
| agent-service | 8000 | prometheus-client via FastAPI middleware |
| gateway | 8080 | prometheus-client via FastAPI middleware |
| frontend | nginx-exporter (9113) | nginx stub_status translated by exporter sidecar |
Platform metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| tessallite_http_requests_total | Counter | service, method, path, status | Total HTTP requests handled |
| tessallite_http_request_duration_seconds | Histogram | service, method, path | Request latency in seconds |
Model-level usage metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| tessallite_model_queries_total | Counter | tenant, project, model_name, protocol, route_type | Query volume per model |
| tessallite_model_query_errors_total | Counter | tenant, project, model_name, error_type | Failed queries per model |
| tessallite_model_query_duration_seconds | Histogram | tenant, project, model_name | Query execution time per model |
| tessallite_model_bytes_processed_total | Counter | tenant, project, model_name | Bytes scanned per model |
| tessallite_model_rows_returned_total | Counter | tenant, project, model_name | Rows returned per model |
Aggregate refresh metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| tessallite_refresh_runs_total | Counter | status | Completed vs failed refreshes |
| tessallite_refresh_run_duration_seconds | Histogram | mode | Refresh duration (full or incremental) |
Dashboard sections
The Grafana dashboard is organised into three collapsible sections with four filter variables at the top: Service, Tenant, Project, and Model.
Service Health (7 panels)
Live service status tiles, uptime over time, scrape duration, per-service request rate, error rate (5xx), and latency percentiles (p95 and p50).
Query Routing and Aggregates (4 panels)
Query routing distribution (source/aggregate/pocket), HTTP error rate by service, refresh run duration, and refresh completion rate.
Model Health and Usage (7 panels)
Per-model query throughput, latency p95, protocol distribution (SQL vs DAX), route distribution, query errors by type, bytes processed, and rows returned.
Adding custom metrics
All Tessallite metrics are defined in tessallite/shared/metrics.py using the Python prometheus-client library.
- Define the metric in shared/metrics.py using Counter, Histogram, or Gauge.
- Import and instrument in the relevant service code (e.g., from shared.metrics import MY_COUNTER).
- Rebuild the service: docker compose build <service> && docker compose up -d <service>
The new metric appears automatically on /metrics. Prometheus begins scraping it on the next 15-second cycle. No Prometheus configuration changes are needed.
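The first two steps might look roughly like the sketch below, which defines a counter and a histogram with prometheus-client and records one observation. The metric names, labels, and the `record_export` helper are hypothetical. The private `CollectorRegistry` is used only so the snippet runs standalone; the real shared/metrics.py may well register on the library's default registry instead.

```python
from prometheus_client import CollectorRegistry, Counter, Histogram, generate_latest

# Private registry so the example is self-contained (assumption: the real
# shared/metrics.py may simply use the default registry).
registry = CollectorRegistry()

# Hypothetical metrics, named in the same style as the built-in ones.
EXPORTS_TOTAL = Counter(
    "tessallite_exports_total",
    "Total export jobs started",
    ["tenant", "format"],
    registry=registry,
)
EXPORT_DURATION = Histogram(
    "tessallite_export_duration_seconds",
    "Export job duration in seconds",
    ["tenant"],
    registry=registry,
)


def record_export(tenant: str, fmt: str, seconds: float) -> None:
    """Instrument one finished export job."""
    EXPORTS_TOTAL.labels(tenant=tenant, format=fmt).inc()
    EXPORT_DURATION.labels(tenant=tenant).observe(seconds)


record_export("acme", "csv", 0.42)

# Render the text exposition format that a /metrics endpoint would serve.
exposition = generate_latest(registry).decode()
print(exposition)
```

Keeping label sets small matters here: every distinct combination of label values becomes its own time series in Prometheus, so unbounded values such as raw query text are better left out of labels.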
Adding a Grafana panel
- Open Grafana and navigate to the Tessallite Platform Overview dashboard.
- Click Edit, then Add panel.
- Write a PromQL query referencing your metric.
- Save the dashboard.
To make the panel permanent, export the dashboard JSON and save it to monitoring/grafana/tessallite-dashboard.json.
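For the PromQL step, a p95 latency panel over one of the histograms listed earlier could use an expression like the one built below. This is a sketch: the helper names and the 5-minute rate window are assumptions, and the second helper only builds a URL for Prometheus's instant-query HTTP API so the expression can be tried out before it is pasted into Grafana.

```python
from urllib.parse import urlencode

PROMETHEUS_URL = "http://127.0.0.1:9090"  # address from the deployment steps


def quantile_query(histogram: str, quantile: float, by: str, window: str = "5m") -> str:
    """Build a PromQL histogram_quantile expression over a histogram's buckets."""
    return (
        f"histogram_quantile({quantile}, "
        f"sum by (le, {by}) (rate({histogram}_bucket[{window}])))"
    )


def instant_query_url(expr: str, base_url: str = PROMETHEUS_URL) -> str:
    """Build the URL for Prometheus's instant-query endpoint (/api/v1/query)."""
    return f"{base_url}/api/v1/query?{urlencode({'query': expr})}"


p95_by_service = quantile_query(
    "tessallite_http_request_duration_seconds", 0.95, "service"
)
print(p95_by_service)
print(instant_query_url(p95_by_service))
```

The `le` label has to stay in the `sum by (...)` clause, because `histogram_quantile` needs the per-bucket boundaries to interpolate the quantile.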
Adding a new scrape target
- Add PrometheusMiddleware and a /metrics endpoint to the new service.
- Add a scrape job to monitoring/prometheus.yml.
- Restart Prometheus: cd monitoring/ && docker compose restart prometheus
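Once Prometheus has restarted, it is worth confirming that the new job is actually being scraped. The sketch below reads Prometheus's `/api/v1/targets` endpoint and lists every active target that is not reporting as up. The helper names are illustrative, and `reporting-service` in the offline sample is a made-up job name.

```python
import json
from urllib.request import urlopen


def unhealthy_targets(targets_response: dict) -> list[tuple[str, str]]:
    """Return (job, last_error) for every active target whose health is not 'up'."""
    active = targets_response.get("data", {}).get("activeTargets", [])
    return [
        (t.get("labels", {}).get("job", "?"), t.get("lastError", ""))
        for t in active
        if t.get("health") != "up"
    ]


def fetch_targets(base_url: str = "http://127.0.0.1:9090") -> dict:
    """Call Prometheus's /api/v1/targets endpoint and decode the JSON body."""
    with urlopen(f"{base_url}/api/v1/targets", timeout=5) as resp:
        return json.load(resp)


# Offline sample in the shape of the targets API response, trimmed to the
# fields used above.
sample = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"job": "gateway"}, "health": "up", "lastError": ""},
            {
                "labels": {"job": "reporting-service"},
                "health": "down",
                "lastError": "connection refused",
            },
        ]
    },
}
print(unhealthy_targets(sample))
```

Against a live stack, `unhealthy_targets(fetch_targets())` returning an empty list suggests every configured job, including the new one, is being scraped successfully.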
Data retention
Prometheus retains time-series data for 15 days by default. To change this, edit the --storage.tsdb.retention.time argument in monitoring/docker-compose.yml and restart.
Frequently asked questions
Do I need the monitoring stack to run Tessallite?
No. It is entirely optional. Tessallite operates normally without it.
Will stopping the monitoring stack affect Tessallite?
No. The monitoring containers are independent. Stopping them has zero impact on the platform.
Where is monitoring data stored?
In Docker volumes: monitoring_prometheus_data and monitoring_grafana_data. Running teardown.sh without flags removes this data; pass --keep-data to preserve the volumes.
Can I use this in production?
Yes, for small to medium deployments. For high availability, consider a managed Prometheus service and point it at the same /metrics endpoints.