Updated 2026-05-23

Monitoring Stack (Prometheus & Grafana)

What this covers

How to deploy, configure, and extend the optional Prometheus + Grafana monitoring stack that provides real-time observability into every Tessallite service. This page explains the architecture, what data is collected, how the dashboard is organised, and how to add your own metrics and panels.

Overview

Tessallite ships with an optional monitoring stack that runs as a completely separate Docker Compose project. It is not required to run the platform and can be started or stopped independently without affecting any Tessallite service.

The stack consists of three containers:

Container	Image	Purpose
prometheus	`prom/prometheus:v2.51.0`	Scrapes `/metrics` from every service every 15 seconds, stores time-series data for up to 15 days
grafana	`grafana/grafana:10.4.0`	Visualises metrics through a pre-built dashboard with 18 panels across three sections
nginx-exporter	`nginx/nginx-prometheus-exporter:1.1`	Translates the frontend's nginx `stub_status` into Prometheus-format metrics

All three containers live in the monitoring/ directory at the workspace root, separate from the main tessallite/infra/ Docker Compose stack.

How it connects to Tessallite

Both stacks share a Docker network called tessallite_net. This allows Prometheus (running in the monitoring stack) to reach every Tessallite service by container name, even though they are managed by different Docker Compose projects.

The main stack creates the network automatically on docker compose up. The monitoring stack's deploy script also creates it if it does not exist, so either stack can be started first.

Deploying the monitoring stack

Prerequisites

Docker and Docker Compose v2 installed
The main Tessallite stack running, or started at least once with docker compose up so that the shared network exists

Steps

Navigate to the monitoring directory: cd monitoring/
Create the environment file: cp .env.example .env
Set GRAFANA_ADMIN_PASSWORD in .env
Deploy: bash deploy.sh (Linux/macOS/Git Bash) or deploy.bat (Windows)
Open the dashboards:
- Prometheus: http://127.0.0.1:9090
- Grafana: http://127.0.0.1:3001 (username: admin, password: your .env value)
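Before opening the UIs, a quick liveness probe can confirm that both containers respond. The following is a minimal sketch, assuming the default ports listed above and the stock health endpoints that Prometheus (/-/healthy) and Grafana (/api/health) expose. The helper names is_healthy and stack_status are illustrative and not part of Tessallite.

```python
from urllib.request import urlopen

# Ports come from the steps above; the paths are the stock health endpoints
# of Prometheus and Grafana.
HEALTH_URLS = {
    "prometheus": "http://127.0.0.1:9090/-/healthy",
    "grafana": "http://127.0.0.1:3001/api/health",
}


def is_healthy(url, timeout=2.0, opener=urlopen):
    """Return True if the endpoint answers with HTTP 200, False otherwise."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers refused connections, timeouts, and HTTP error statuses
        # (urllib's URLError and HTTPError are OSError subclasses).
        return False


def stack_status(urls=HEALTH_URLS):
    """Probe every URL and map each name to its health flag."""
    return {name: is_healthy(url) for name, url in urls.items()}
```

Calling print(stack_status()) after deploy.sh finishes shows which of the two services is answering. The opener parameter exists only so the probe can be exercised without a network.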

Teardown

bash teardown.sh              # stop and remove data
bash teardown.sh --keep-data  # stop but preserve volumes

What data is collected

Scraped services

Prometheus scrapes metrics from all seven Tessallite services:

Service	Port	Metrics source
model-service	8001	`prometheus-client` via FastAPI middleware
query-router	8000	`prometheus-client` via FastAPI middleware
optimizer	8000	`prometheus-client` via FastAPI middleware
scheduler	8000	`prometheus-client` via FastAPI middleware
agent-service	8000	`prometheus-client` via FastAPI middleware
gateway	8080	`prometheus-client` via FastAPI middleware
frontend	nginx-exporter (9113)	nginx `stub_status` translated by exporter sidecar

Platform metrics

Metric	Type	Labels	Description
`tessallite_http_requests_total`	Counter	service, method, path, status	Total HTTP requests handled
`tessallite_http_request_duration_seconds`	Histogram	service, method, path	Request latency in seconds
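To make the label structure concrete, the sketch below parses a small sample of the Prometheus text exposition format and totals tessallite_http_requests_total by status. The sample values are made up for illustration, and the parser is deliberately simplified (it ignores timestamps and exemplars); it is not how Prometheus itself reads the endpoint.

```python
import re
from collections import defaultdict

# Made-up sample of what a service's /metrics endpoint might return.
SAMPLE = """\
# HELP tessallite_http_requests_total Total HTTP requests handled
# TYPE tessallite_http_requests_total counter
tessallite_http_requests_total{service="gateway",method="GET",path="/health",status="200"} 120
tessallite_http_requests_total{service="gateway",method="POST",path="/query",status="200"} 45
tessallite_http_requests_total{service="gateway",method="POST",path="/query",status="500"} 3
"""

# metric_name, optional {labels}, then the sample value
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Yield (name, labels, value) for each sample line, skipping comments."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        yield name, labels, float(value)


def requests_by_status(text):
    """Sum tessallite_http_requests_total across all series, per status code."""
    totals = defaultdict(float)
    for name, labels, value in parse_samples(text):
        if name == "tessallite_http_requests_total":
            totals[labels.get("status", "unknown")] += value
    return dict(totals)
```

Each distinct combination of service, method, path, and status is its own time series, which is why the totals have to be summed across lines.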

Model-level usage metrics

Metric	Type	Labels	Description
`tessallite_model_queries_total`	Counter	tenant, project, model_name, protocol, route_type	Query volume per model
`tessallite_model_query_errors_total`	Counter	tenant, project, model_name, error_type	Failed queries per model
`tessallite_model_query_duration_seconds`	Histogram	tenant, project, model_name	Query execution time per model
`tessallite_model_bytes_processed_total`	Counter	tenant, project, model_name	Bytes scanned per model
`tessallite_model_rows_returned_total`	Counter	tenant, project, model_name	Rows returned per model

Aggregate refresh metrics

Metric	Type	Labels	Description
`tessallite_refresh_runs_total`	Counter	status	Completed vs failed refreshes
`tessallite_refresh_run_duration_seconds`	Histogram	mode	Refresh duration (full or incremental)

Dashboard sections

The Grafana dashboard is organised into three collapsible sections with four filter variables at the top: Service, Tenant, Project, and Model.

Service Health (7 panels)

Live service status tiles, uptime over time, scrape duration, per-service request rate, error rate (5xx), and latency percentiles (p95 and p50).
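Latency percentiles from a Prometheus histogram are estimates derived from cumulative bucket counts, not exact values. The sketch below shows the idea behind PromQL's histogram_quantile function: find the bucket containing the target rank, then interpolate linearly inside it. It is a simplified illustration that assumes 0 < q <= 1, positive bucket bounds, and at least one finite bucket; the bucket boundaries in the usage note are hypothetical.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    The list must include the +Inf bucket, whose count is the total number
    of observations.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan

    rank = q * total
    # First bucket whose cumulative count reaches the target rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[index]

    if math.isinf(upper):
        # The quantile falls beyond the last finite bound, so report that bound.
        return buckets[-2][0]

    # The lowest bucket is assumed to start at zero.
    lower, below = (0.0, 0) if index == 0 else buckets[index - 1]
    return lower + (upper - lower) * (rank - below) / (count - below)
```

With buckets [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)], the p95 estimate is 0.75 seconds: rank 95 lands in the 0.5 to 1.0 bucket, halfway through its 10 observations. The accuracy of a p95 panel therefore depends on how finely the histogram buckets are spaced around the latencies of interest.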

Query Routing and Aggregates (4 panels)

Query routing distribution (source/aggregate/pocket), HTTP error rate by service, refresh run duration, and refresh completion rate.

Model Health and Usage (7 panels)

Per-model query throughput, latency p95, protocol distribution (SQL vs DAX), route distribution, query errors by type, bytes processed, and rows returned.

Adding custom metrics

All Tessallite metrics are defined in tessallite/shared/metrics.py using the Python prometheus-client library.

Define the metric in shared/metrics.py using Counter, Histogram, or Gauge.
Import and instrument in the relevant service code (e.g., from shared.metrics import MY_COUNTER).
Rebuild the service: docker compose build <service> && docker compose up -d <service>

The new metric appears automatically on the service's /metrics endpoint, and Prometheus picks it up on the next 15-second scrape. No Prometheus configuration changes are needed.
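The steps above can be sketched as follows using the prometheus-client library. The metric names tessallite_exports_total and tessallite_export_duration_seconds, their labels, and the record_export helper are hypothetical and chosen only for illustration. The sketch uses a private CollectorRegistry so it can run in isolation; the real shared/metrics.py may well rely on the library's default registry instead.

```python
from prometheus_client import CollectorRegistry, Counter, Histogram

# Private registry: keeps this sketch self-contained (an assumption, not
# necessarily how shared/metrics.py registers its metrics).
registry = CollectorRegistry()

# Hypothetical counter: one increment per export job, split by tenant and format.
EXPORTS_TOTAL = Counter(
    "tessallite_exports_total",
    "Total export jobs handled (hypothetical metric)",
    ["tenant", "format"],
    registry=registry,
)

# Hypothetical histogram: duration of each export job, using default buckets.
EXPORT_DURATION = Histogram(
    "tessallite_export_duration_seconds",
    "Export job duration in seconds (hypothetical metric)",
    ["tenant"],
    registry=registry,
)


def record_export(tenant, fmt, seconds):
    """Instrument one finished export job."""
    EXPORTS_TOTAL.labels(tenant=tenant, format=fmt).inc()
    EXPORT_DURATION.labels(tenant=tenant).observe(seconds)


record_export("acme", "csv", 0.42)
```

Keeping label values low-cardinality (tenant and format rather than, say, a request ID) matters because every distinct label combination becomes a separate time series in Prometheus.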

Adding a Grafana panel

Open Grafana and navigate to the Tessallite Platform Overview dashboard.
Click Edit, then Add panel.
Write a PromQL query referencing your metric.
Save the dashboard.

To make the panel permanent, export the dashboard JSON and save it to monitoring/grafana/tessallite-dashboard.json.
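It can help to try a PromQL expression against Prometheus directly before pasting it into a panel. The sketch below builds requests for the Prometheus HTTP API's /api/v1/query endpoint, assuming the default address from the deployment steps. The two expressions are examples written against the metrics documented on this page, not the queries used by the shipped dashboard.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

PROMETHEUS = "http://127.0.0.1:9090"  # default address from the deploy steps

# Example expressions (assumptions, not copied from the shipped dashboard).
QUERIES = {
    "queries_per_model": (
        "sum by (model_name) (rate(tessallite_model_queries_total[5m]))"
    ),
    "http_p95_by_service": (
        "histogram_quantile(0.95, sum by (le, service) "
        "(rate(tessallite_http_request_duration_seconds_bucket[5m])))"
    ),
}


def query_url(expr, base=PROMETHEUS):
    """Build an instant-query URL with the expression safely URL-encoded."""
    return f"{base}/api/v1/query?{urlencode({'query': expr})}"


def run_query(expr, base=PROMETHEUS, timeout=5.0):
    """Run an instant query and return the list of result series."""
    with urlopen(query_url(expr, base), timeout=timeout) as response:
        payload = json.load(response)
    if payload.get("status") != "success":
        raise RuntimeError(f"Prometheus query failed: {payload}")
    return payload["data"]["result"]
```

For example, run_query(QUERIES["queries_per_model"]) returns one entry per model_name once the monitoring stack is running and queries have been served.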

Adding a new scrape target

Add PrometheusMiddleware and a /metrics endpoint to the new service.
Add a scrape job to monitoring/prometheus.yml.
Restart Prometheus: cd monitoring/ && docker compose restart prometheus
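After the restart, the new target should appear as healthy in Prometheus. One way to check is the HTTP API's /api/v1/targets endpoint; the sketch below filters its JSON response for a single job. The job name reporting-service and the sample payload are hypothetical, and only the fields used here are shown.

```python
def target_health(payload, job):
    """Return (instance, health, last_error) for each active target of a job."""
    results = []
    for target in payload["data"]["activeTargets"]:
        labels = target.get("labels", {})
        if labels.get("job") == job:
            results.append(
                (
                    labels.get("instance", ""),
                    target.get("health", "unknown"),
                    target.get("lastError", ""),
                )
            )
    return results


# Trimmed, made-up example of a /api/v1/targets response body.
EXAMPLE_PAYLOAD = {
    "status": "success",
    "data": {
        "activeTargets": [
            {
                "labels": {"job": "gateway", "instance": "gateway:8080"},
                "health": "up",
                "lastError": "",
            },
            {
                "labels": {
                    "job": "reporting-service",
                    "instance": "reporting-service:8000",
                },
                "health": "down",
                "lastError": "connection refused",
            },
        ]
    },
}
```

A health value of "down" together with a non-empty lastError usually points at a wrong host or port in the scrape job, or at a service that is not attached to the shared network.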

Data retention

Prometheus retains time-series data for 15 days by default. To change this, edit the --storage.tsdb.retention.time argument in monitoring/docker-compose.yml and restart Prometheus.
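When changing the retention period, a rough disk estimate is useful. The Prometheus storage documentation suggests approximating disk usage as retention time in seconds multiplied by ingested samples per second multiplied by bytes per sample, with roughly 1 to 2 bytes per sample. The sketch below applies that rule of thumb; the active series count is an assumption you would replace with a figure from your own deployment.

```python
def estimate_disk_bytes(
    active_series,
    retention_days=15,       # default retention described above
    scrape_interval_s=15,    # default scrape interval described above
    bytes_per_sample=2.0,    # upper end of the 1-2 byte rule of thumb
):
    """Rough TSDB size: retention seconds * samples per second * bytes per sample."""
    retention_seconds = retention_days * 24 * 60 * 60
    samples_per_second = active_series / scrape_interval_s
    return retention_seconds * samples_per_second * bytes_per_sample
```

For a hypothetical 10,000 active series at the defaults, this gives about 1.7 GB; doubling the retention to 30 days doubles the estimate.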

Frequently asked questions

Do I need the monitoring stack to run Tessallite?
No. It is entirely optional. Tessallite operates normally without it.

Will stopping the monitoring stack affect Tessallite?
No. The monitoring containers are independent. Stopping them has zero impact on the platform.

Where is monitoring data stored?
In two Docker volumes: monitoring_prometheus_data and monitoring_grafana_data. Running teardown.sh removes them unless you pass --keep-data.

Can I use this in production?
Yes, for small to medium deployments. For high availability, consider a managed Prometheus service and point it at the same /metrics endpoints.