system-admin, updated 2026-05-23

Monitoring Stack (Prometheus & Grafana)

What this covers

How to deploy, configure, and extend the optional Prometheus + Grafana monitoring stack that provides real-time observability into every Tessallite service. This page explains the architecture, what data is collected, how the dashboard is organised, and how to add your own metrics and panels.

Overview

Tessallite ships with an optional monitoring stack that runs as a completely separate Docker Compose project. It is not required to run the platform and can be started or stopped independently without affecting any Tessallite service.

The stack consists of three containers:

| Container | Image | Purpose |
|---|---|---|
| prometheus | prom/prometheus:v2.51.0 | Scrapes /metrics from every service every 15 seconds, stores time-series data for up to 15 days |
| grafana | grafana/grafana:10.4.0 | Visualises metrics through a pre-built dashboard with 21 panels across three sections |
| nginx-exporter | nginx/nginx-prometheus-exporter:1.1 | Translates the frontend's nginx stub_status into Prometheus-format metrics |

All three containers live in the monitoring/ directory at the workspace root, separate from the main tessallite/infra/ Docker Compose stack.

How it connects to Tessallite

Both stacks share a Docker network called tessallite_net. This allows Prometheus (running in the monitoring stack) to reach every Tessallite service by container name, even though they are managed by different Docker Compose projects.

The main stack creates the network automatically on docker compose up. The monitoring stack's deploy script also creates it if it does not exist, so either stack can be started first.
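The create-if-missing behaviour can be sketched as follows. This is a minimal illustration and not the contents of deploy.sh, which may be implemented differently. The function name ensure_network and the injectable run parameter are hypothetical; the only Docker commands used are docker network inspect and docker network create.

```python
import subprocess
from typing import Callable, Optional, Sequence

NETWORK = "tessallite_net"  # shared network name from this page


def _run(cmd: Sequence[str]) -> int:
    """Run a command quietly and return its exit code."""
    return subprocess.run(list(cmd), capture_output=True).returncode


def ensure_network(
    name: str = NETWORK,
    run: Optional[Callable[[Sequence[str]], int]] = None,
) -> bool:
    """Create the Docker network if it is missing.

    Returns True if the network was created, False if it already existed.
    `run` is injectable (hypothetical helper) so the logic can be
    exercised without a Docker daemon.
    """
    run = run or _run
    # `docker network inspect` exits non-zero when the network is absent.
    if run(["docker", "network", "inspect", name]) == 0:
        return False
    if run(["docker", "network", "create", name]) != 0:
        raise RuntimeError(f"could not create network {name!r}")
    return True
```

Because the check is idempotent, it does not matter which stack runs it first.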

Deploying the monitoring stack

Prerequisites

Steps

  1. Navigate to the monitoring directory: cd monitoring/
  2. Create the environment file: cp .env.example .env
  3. Set GRAFANA_ADMIN_PASSWORD in .env
  4. Deploy: bash deploy.sh (Linux/macOS/Git Bash) or deploy.bat (Windows)
  5. Open the dashboards:
    • Prometheus: http://127.0.0.1:9090
    • Grafana: http://127.0.0.1:3001 (username: admin, password: your .env value)
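After step 5, a quick scripted check can confirm that both UIs respond. The sketch below assumes the default addresses listed above and uses the standard health endpoints of Prometheus (/-/healthy) and Grafana (/api/health). The function names are hypothetical and the script is not part of the monitoring/ directory.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict

# Default addresses from the deploy steps, plus each tool's health path.
ENDPOINTS = {
    "prometheus": "http://127.0.0.1:9090/-/healthy",
    "grafana": "http://127.0.0.1:3001/api/health",
}


def http_status(url: str, timeout: float = 3.0) -> int:
    """Return the HTTP status code for url, or 0 if it is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code
    except OSError:
        # Covers connection refused, DNS failures, and timeouts.
        return 0


def check_stack(fetch: Callable[[str], int] = http_status) -> Dict[str, bool]:
    """Map each component name to True when its health endpoint returns 200."""
    return {name: fetch(url) == 200 for name, url in ENDPOINTS.items()}
```

Calling check_stack() with no arguments performs real HTTP requests against the local stack; passing a different fetch function allows the logic to be tested offline.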

Teardown

bash teardown.sh              # stop and remove data
bash teardown.sh --keep-data  # stop but preserve volumes

What data is collected

Scraped services

Prometheus scrapes metrics from all seven Tessallite services:

| Service | Port | Metrics source |
|---|---|---|
| model-service | 8001 | prometheus-client via FastAPI middleware |
| query-router | 8000 | prometheus-client via FastAPI middleware |
| optimizer | 8000 | prometheus-client via FastAPI middleware |
| scheduler | 8000 | prometheus-client via FastAPI middleware |
| agent-service | 8000 | prometheus-client via FastAPI middleware |
| gateway | 8080 | prometheus-client via FastAPI middleware |
| frontend | nginx-exporter (9113) | nginx stub_status translated by exporter sidecar |
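To illustrate what the exporter sidecar does for the frontend, the sketch below converts nginx stub_status text into Prometheus exposition lines. It is a simplified illustration, not the exporter's implementation: the sample input is made up, the function name is hypothetical, and the metric names are modelled on those the exporter publishes, so check the exporter's own /metrics output for the authoritative list.

```python
import re

# Made-up sample in the layout nginx uses for stub_status.
SAMPLE = """Active connections: 3
server accepts handled requests
 120 120 345
Reading: 0 Writing: 1 Waiting: 2
"""


def stub_status_to_prometheus(text: str) -> str:
    """Convert nginx stub_status text into Prometheus exposition lines."""
    active = int(re.search(r"Active connections:\s+(\d+)", text).group(1))
    # The third line holds three cumulative counters.
    accepted, handled, requests = map(
        int,
        re.search(
            r"^[ \t]*(\d+)[ \t]+(\d+)[ \t]+(\d+)[ \t]*$", text, re.MULTILINE
        ).groups(),
    )
    reading, writing, waiting = map(
        int,
        re.search(
            r"Reading:\s+(\d+)\s+Writing:\s+(\d+)\s+Waiting:\s+(\d+)", text
        ).groups(),
    )
    metrics = {
        "nginx_connections_active": active,
        "nginx_connections_accepted": accepted,
        "nginx_connections_handled": handled,
        "nginx_http_requests_total": requests,
        "nginx_connections_reading": reading,
        "nginx_connections_writing": writing,
        "nginx_connections_waiting": waiting,
    }
    return "\n".join(f"{name} {value}" for name, value in metrics.items()) + "\n"
```

Prometheus scrapes the exporter on port 9113 rather than nginx itself, which is why the frontend row lists the exporter as its port.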

Platform metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| tessallite_http_requests_total | Counter | service, method, path, status | Total HTTP requests handled |
| tessallite_http_request_duration_seconds | Histogram | service, method, path | Request latency in seconds |

Model-level usage metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| tessallite_model_queries_total | Counter | tenant, project, model_name, protocol, route_type | Query volume per model |
| tessallite_model_query_errors_total | Counter | tenant, project, model_name, error_type | Failed queries per model |
| tessallite_model_query_duration_seconds | Histogram | tenant, project, model_name | Query execution time per model |
| tessallite_model_bytes_processed_total | Counter | tenant, project, model_name | Bytes scanned per model |
| tessallite_model_rows_returned_total | Counter | tenant, project, model_name | Rows returned per model |

Aggregate refresh metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| tessallite_refresh_runs_total | Counter | status | Completed vs failed refreshes |
| tessallite_refresh_run_duration_seconds | Histogram | mode | Refresh duration (full or incremental) |

Dashboard sections

The Grafana dashboard is organised into three collapsible sections with four filter variables at the top: Service, Tenant, Project, and Model.

Service Health (7 panels)

Live service status tiles, uptime over time, scrape duration, per-service request rate, error rate (5xx), and latency percentiles (p95 and p50).
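The latency and error-rate panels can be approximated with PromQL over the platform metrics. The sketch below builds two such queries and the URL for Prometheus's instant-query API. It assumes the status label holds numeric codes such as 500 and a 5-minute rate window; the expressions used by the shipped dashboard may differ, and all function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

PROMETHEUS = "http://127.0.0.1:9090"  # default address from the deploy steps


def p95_latency_query(window: str = "5m") -> str:
    """PromQL for p95 request latency per service."""
    return (
        "histogram_quantile(0.95, sum by (le, service) ("
        f"rate(tessallite_http_request_duration_seconds_bucket[{window}])))"
    )


def error_rate_query(window: str = "5m") -> str:
    """PromQL for the per-service rate of 5xx responses.

    Assumes the status label contains numeric codes (an assumption).
    """
    return (
        "sum by (service) ("
        f'rate(tessallite_http_requests_total{{status=~"5.."}}[{window}]))'
    )


def query_url(promql: str, base: str = PROMETHEUS) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def run_query(promql: str, base: str = PROMETHEUS) -> list:
    """Execute an instant query and return the result vector."""
    with urllib.request.urlopen(query_url(promql, base), timeout=5) as resp:
        return json.load(resp)["data"]["result"]
```

For p50, the same latency query applies with 0.5 in place of 0.95.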

Query Routing and Aggregates (4 panels)

Query routing distribution (source/aggregate/pocket), HTTP error rate by service, refresh run duration, and refresh completion rate.

Model Health and Usage (7 panels)

Per-model query throughput, latency p95, protocol distribution (SQL vs DAX), route distribution, query errors by type, bytes processed, and rows returned.

Adding custom metrics

All Tessallite metrics are defined in tessallite/shared/metrics.py using the Python prometheus-client library.

  1. Define the metric in shared/metrics.py using Counter, Histogram, or Gauge.
  2. Import and instrument in the relevant service code (e.g., from shared.metrics import MY_COUNTER).
  3. Rebuild the service: docker compose build <service> && docker compose up -d <service>

The new metric appears automatically on the service's /metrics endpoint, and Prometheus picks it up on the next 15-second scrape. No Prometheus configuration changes are needed, because scrape jobs target endpoints rather than individual metrics.
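Steps 1 and 2 might look like the sketch below. The metric name, its label, and the record_cache_hit helper are hypothetical, and the real shared/metrics.py may register metrics on the library's default registry rather than a dedicated one. The prometheus-client calls themselves (Counter, CollectorRegistry, generate_latest) are the library's standard API.

```python
from prometheus_client import CollectorRegistry, Counter, generate_latest

# A dedicated registry keeps this sketch self-contained (assumption:
# the real module may use the default registry instead).
REGISTRY = CollectorRegistry()

# Step 1: define the metric. Name and label are illustrative only.
CACHE_HITS = Counter(
    "tessallite_cache_hits_total",
    "Cache hits served without touching the source",
    ["service"],
    registry=REGISTRY,
)


# Step 2: instrument the service code at the point of interest.
def record_cache_hit(service: str) -> None:
    """Increment the counter for one service."""
    CACHE_HITS.labels(service=service).inc()


def render_metrics() -> str:
    """Return the text a /metrics endpoint would serve for this registry."""
    return generate_latest(REGISTRY).decode()
```

Keeping label values to a small, fixed set (service names rather than request IDs, for example) avoids creating an unbounded number of time series.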

Adding a Grafana panel

  1. Open Grafana and navigate to the Tessallite Platform Overview dashboard.
  2. Click Edit, then Add panel.
  3. Write a PromQL query referencing your metric.
  4. Save the dashboard.

To make the panel permanent, export the dashboard JSON and save it to monitoring/grafana/tessallite-dashboard.json.
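The export can also be scripted against Grafana's HTTP API instead of the UI. The sketch below is a minimal illustration under several assumptions: the dashboard UID must be looked up in your instance (none is documented here), basic authentication with the admin account is enabled, and the function names are hypothetical. The endpoint used is Grafana's standard GET /api/dashboards/uid/<uid>.

```python
import base64
import json
import urllib.request

GRAFANA = "http://127.0.0.1:3001"  # default address from the deploy steps


def fetch_dashboard(uid: str, user: str, password: str, base: str = GRAFANA) -> dict:
    """Fetch a dashboard by UID through Grafana's HTTP API."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req = urllib.request.Request(
        f"{base}/api/dashboards/uid/{uid}",
        headers={"Authorization": f"Basic {token}"},
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.load(resp)


def to_dashboard_file_json(api_response: dict) -> str:
    """Extract the dashboard model and serialise it for the repository.

    The API wraps the model as {"meta": ..., "dashboard": ...}; only the
    "dashboard" part is kept. The numeric "id" is specific to one Grafana
    instance, so it is cleared here (a common convention, assumed to suit
    this repository's file).
    """
    dashboard = dict(api_response["dashboard"])
    dashboard["id"] = None
    return json.dumps(dashboard, indent=2, sort_keys=True) + "\n"
```

The returned string can then be written to monitoring/grafana/tessallite-dashboard.json and committed.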

Adding a new scrape target

  1. Add PrometheusMiddleware and a /metrics endpoint to the new service.
  2. Add a scrape job to monitoring/prometheus.yml.
  3. Restart Prometheus: cd monitoring/ && docker compose restart prometheus
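After the restart, the new job should show up as healthy in Prometheus. The sketch below reads the standard /api/v1/targets endpoint and reports whether every target of a given job is up. The function names are hypothetical, and the job name you pass in is whatever you chose in prometheus.yml.

```python
import json
import urllib.request
from typing import Dict, List

PROMETHEUS = "http://127.0.0.1:9090"  # default address from the deploy steps


def fetch_targets(base: str = PROMETHEUS) -> dict:
    """Download the current target list from the Prometheus HTTP API."""
    with urllib.request.urlopen(f"{base}/api/v1/targets", timeout=5) as resp:
        return json.load(resp)


def target_health(payload: dict) -> Dict[str, List[str]]:
    """Map each job name to the health states of its active targets."""
    health: Dict[str, List[str]] = {}
    for target in payload["data"]["activeTargets"]:
        job = target["labels"]["job"]
        health.setdefault(job, []).append(target["health"])
    return health


def job_is_up(payload: dict, job: str) -> bool:
    """True when the job has at least one target and all of them are up."""
    states = target_health(payload).get(job, [])
    return bool(states) and all(state == "up" for state in states)
```

A job that is missing from the result usually points to a typo in prometheus.yml or a restart that has not completed; a job reported as down usually means the service is not reachable on tessallite_net or does not expose /metrics yet.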

Data retention

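Prometheus retains time-series data for 15 days by default. To change this, edit the --storage.tsdb.retention.time argument in monitoring/docker-compose.yml, then recreate the container with docker compose up -d prometheus. A plain docker compose restart does not pick up changes to the Compose file.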

Frequently asked questions

Do I need the monitoring stack to run Tessallite?
No. It is entirely optional. Tessallite operates normally without it.

Will stopping the monitoring stack affect Tessallite?
No. The monitoring containers are independent. Stopping them has zero impact on the platform.

Where is monitoring data stored?
In Docker volumes: monitoring_prometheus_data and monitoring_grafana_data. Use teardown.sh --keep-data to preserve them.

Can I use this in production?
Yes, for small to medium deployments. For high availability, consider a managed Prometheus service and point it at the same /metrics endpoints.
