tenant_admin
Updated 2026-06-12

Refresh SLA Declarations

What this covers

A refresh SLA (Service Level Agreement) is a promise you write down for a model: "the data in this model must be refreshed and ready by this time every day." The SLA monitor checks that promise for you. If the data is not ready on time, it can retry the refresh automatically and send you an alert. This page explains the idea, how to set one up, what the monitor actually checks, and how to read a breach alert.

Why SLA declarations matter

Refresh jobs run on a schedule and are mostly unattended. Without a declared SLA, a refresh that fails quietly at 3 a.m. surfaces only when someone opens a dashboard and notices yesterday's numbers. By then, decisions may already have been made on stale data.

An SLA turns "the data is usually ready in the morning" into a checked rule: "a successful refresh must have finished by 07:00 UTC; if not, retry it and tell the on-call channel." Think of it like a school bus schedule — the question is not "did the bus drive around?" but "did it arrive at the stop by 8 o'clock?" Only an on-time arrival counts.

Defining an SLA

SLAs are managed per model in the Refresh SLA section (available in the Scheduler panel and in the model's Settings panel).

To create one:

  1. Open the model and locate the Refresh SLA section.
  2. Click Configure SLA.
  3. Fill in:
    • Target completion time (HH:MM UTC) — the time of day by which a successful refresh must have finished. Note this is UTC, not your local time.
    • Grace period (minutes) — extra slack added to the target before a breach is declared. Covers normal day-to-day jitter in refresh duration.
    • Max retries per day — how many automatic retry attempts the monitor may make per day when a breach is detected.
    • Send webhook alert on breach — whether to emit a refresh.sla_breach webhook event (and email/Slack notification, if notification routes are configured).
  4. Save.

Or via API:

POST /api/v1/projects/{project_id}/models/{model_id}/sla
{
  "target_completion_time": "07:00",
  "grace_period_minutes": 15,
  "max_retries": 1,
  "alert_on_breach": true
}

One SLA exists per model. Creating a second returns a conflict — use PATCH on the same path to change an existing SLA, and DELETE to remove it.
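The request above can be prepared from a script. The following is a minimal sketch, not an official client: BASE_URL, build_sla_body and build_sla_request are hypothetical names, the validation rules are assumptions, and authentication headers are left out because this page does not describe them.

```python
import json
import re
import urllib.request

# Hypothetical host; replace with the address of your own deployment.
BASE_URL = "https://example.invalid"


def build_sla_body(target_completion_time, grace_period_minutes,
                   max_retries, alert_on_breach):
    """Return the JSON body for an SLA declaration as a dict.

    The checks below are assumptions made for this sketch; the server
    may enforce different or additional rules.
    """
    if not re.fullmatch(r"([01]\d|2[0-3]):[0-5]\d", target_completion_time):
        raise ValueError("target_completion_time must be HH:MM (UTC)")
    if grace_period_minutes < 0 or max_retries < 0:
        raise ValueError("grace period and max retries must not be negative")
    return {
        "target_completion_time": target_completion_time,
        "grace_period_minutes": grace_period_minutes,
        "max_retries": max_retries,
        "alert_on_breach": bool(alert_on_breach),
    }


def build_sla_request(project_id, model_id, body, method="POST"):
    """Build (but do not send) the HTTP request for the SLA endpoint.

    Use method="POST" to create, "PATCH" to change an existing SLA.
    """
    url = f"{BASE_URL}/api/v1/projects/{project_id}/models/{model_id}/sla"
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method=method,
    )


body = build_sla_body("07:00", 15, 1, True)
request = build_sla_request("my-project", "my-model", body)
# Sending is left to the caller, for example urllib.request.urlopen(request).
```

Because only one SLA exists per model, a caller that receives a conflict on POST can rebuild the same body with method="PATCH" and send it to the same path.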

How the monitor works

The SLA monitor runs as an hourly sweep job. For each model with an SLA it:

  1. Works out today's deadline: target completion time plus the grace period (all in UTC).
  2. Does nothing until the deadline has passed — no early alarms.
  3. After the deadline, checks every active aggregate on the model one by one: did a refresh of that aggregate finish successfully between midnight UTC and the deadline? Only aggregates the scheduler actually keeps refreshed are checked — retired and invalid aggregates are not monitored. A retired aggregate is no longer kept fresh and nothing reads from it, so it can never breach the SLA. If every aggregate on a model has been retired, the model has nothing to monitor and is always reported as meeting its SLA.
  4. The SLA is met only when all active aggregates pass that check. If even one active aggregate has no on-time success, the model is in breach — one slow table cannot hide behind six fast ones. A later refresh the same day does not undo an on-time success.
  5. A refresh that failed before the deadline does not count as meeting the SLA — only a successful one does.
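The rules above can be sketched in a few lines. This is an illustration of the logic only, not the monitor's actual code: RefreshRun, deadline_for, has_on_time_success and sla_met are hypothetical names, and treating a refresh that finishes exactly at the deadline as on time is an assumption.

```python
from dataclasses import dataclass
from datetime import date, datetime, time, timedelta, timezone


@dataclass
class RefreshRun:
    finished_at: datetime  # timezone-aware, UTC
    succeeded: bool


def deadline_for(day, target_completion_time, grace_period_minutes):
    """Deadline = target completion time plus grace period, in UTC."""
    hours, minutes = map(int, target_completion_time.split(":"))
    target = datetime.combine(day, time(hours, minutes), tzinfo=timezone.utc)
    return target + timedelta(minutes=grace_period_minutes)


def has_on_time_success(runs, day, deadline):
    """True if any successful run finished between midnight UTC and the deadline.

    Failed runs never count. The inclusive upper bound is an assumption.
    """
    midnight = datetime.combine(day, time(0, 0), tzinfo=timezone.utc)
    return any(r.succeeded and midnight <= r.finished_at <= deadline for r in runs)


def sla_met(active_aggregates, day, deadline):
    """The SLA is met only when every active aggregate has an on-time success.

    active_aggregates maps aggregate name -> list of RefreshRun and must
    already exclude retired and invalid aggregates. An empty mapping
    yields True, matching the "nothing to monitor" case.
    """
    return all(has_on_time_success(runs, day, deadline)
               for runs in active_aggregates.values())


day = date(2026, 6, 12)
deadline = deadline_for(day, "07:00", 15)  # 07:15 UTC on that day
```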

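Each aggregate ends up in one of three buckets, and the alert reports all three:

  • On time: a successful refresh finished between midnight UTC and the deadline.
  • Late: a successful refresh finished, but only after the deadline.
  • Stale: no successful refresh has finished yet today.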

What happens on a breach

A breach opens a breach episode for that model and that day. Within one episode:

  1. Automatic retry, stale aggregates only. If the day's retry budget is not used up, the monitor immediately runs a real full refresh of every stale aggregate, using the same refresh machinery as the scheduler. Aggregates whose data already landed (on time or late) are never refreshed again: the monitor does not redo work that is already done. Each retry run appears in the refresh history (triggered by sla_monitor) and always finishes in a final state: completed or failed. If some other refresh of an aggregate is still running at that moment, the monitor does not start a second one on top of it; it skips that aggregate and reports the skip. One retry attempt uses one unit of the daily budget.
  2. One alert. If "Send webhook alert on breach" is on, a single refresh.sla_breach webhook event is emitted — once per episode, on the first sweep that detects the breach. Later sweeps of the same day stay quiet, even if the breach persists: they may still retry stale data (budget permitting), but they will not re-alert. An email/Slack notification goes out with the alert if you have notification routes configured for the sla_breach event type.
  3. Recovery. When every aggregate finally has a successful refresh for the day, the episode is marked recovered and a single refresh.sla_recovered webhook event is emitted. After that the monitor is silent for the rest of the day. The next day is a fresh start — a new breach is a new episode and alerts again.
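The episode rules above behave like a small state machine. The sketch below illustrates that behavior under stated assumptions and is not the monitor's implementation: Episode, sweep and the refresh callback are hypothetical names. It also assumes that recovery is reported by a sweep after the one that reported the breach, that a retry attempt uses budget only if at least one refresh actually starts, and that the recovery event does not depend on the alert toggle.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set


@dataclass
class Episode:
    """State of one breach episode: one model, one UTC day."""
    max_retries: int
    alert_on_breach: bool
    retries_used: int = 0
    breach_reported: bool = False
    recovered: bool = False


def sweep(ep: Episode,
          has_success: Dict[str, bool],
          running: Set[str],
          refresh: Callable[[str], bool]) -> List[str]:
    """Run one post-deadline sweep for a breached model.

    has_success maps each active aggregate to whether it has a successful
    refresh today (on time or late). running holds aggregates that some
    other refresh is still working on. Returns the webhook events emitted.
    """
    events: List[str] = []
    if ep.recovered:
        return events  # silent for the rest of the day

    # Retry stale aggregates only, and never on top of a running refresh.
    to_retry = [name for name, ok in has_success.items()
                if not ok and name not in running]
    if to_retry and ep.retries_used < ep.max_retries:
        ep.retries_used += 1  # one attempt uses one unit of the daily budget
        for name in to_retry:
            has_success[name] = refresh(name)

    # First sweep of the episode: report the breach once, then stay quiet.
    if not ep.breach_reported:
        ep.breach_reported = True
        if ep.alert_on_breach:
            events.append("refresh.sla_breach")
        return events

    # Later sweeps: no re-alert, but report recovery once everything landed.
    if all(has_success.values()):
        ep.recovered = True
        events.append("refresh.sla_recovered")
    return events
```

Calling sweep once per hourly run with the same Episode object gives one breach event at most, retries limited by max_retries, and one recovery event at most.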

Breach webhook payload

{
  "model_id": "<uuid>",
  "target_completion_time": "07:00",
  "grace_period_minutes": 15,
  "last_completed_at": "2026-06-12T08:24:59Z",
  "retried": true,
  "retries_today": 0,
  "retry_run_statuses": ["completed", "completed"],
  "sla_scope": "all_aggregates",
  "breach_kind": "partial",
  "aggregates_total": 7,
  "aggregates_on_time": 5,
  "aggregates_late": 0,
  "aggregates_stale": 2
}

How to read it:
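As a rough aid, a receiver can cross-check the counts and turn them into a one-line summary. This is a sketch only: summarize_breach is a hypothetical helper, and the check that the three bucket counts add up to aggregates_total is an assumption based on the sample payload (5 + 0 + 2 = 7).

```python
import json


def summarize_breach(payload):
    """Return a short human-readable summary of a breach payload."""
    total = payload["aggregates_total"]
    on_time = payload["aggregates_on_time"]
    late = payload["aggregates_late"]
    stale = payload["aggregates_stale"]
    # Assumption: every monitored aggregate falls into exactly one bucket.
    if on_time + late + stale != total:
        raise ValueError("bucket counts do not add up to aggregates_total")
    missed = total - on_time
    return (f"{missed} of {total} aggregates missed the deadline "
            f"({late} late, {stale} stale); "
            f"breach_kind={payload['breach_kind']}")


sample = json.loads("""
{
  "model_id": "<uuid>",
  "breach_kind": "partial",
  "aggregates_total": 7,
  "aggregates_on_time": 5,
  "aggregates_late": 0,
  "aggregates_stale": 2
}
""")
print(summarize_breach(sample))
```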

Recovery webhook payload

{
  "model_id": "<uuid>",
  "target_completion_time": "07:00",
  "grace_period_minutes": 15,
  "last_completed_at": "2026-06-12T08:03:11Z",
  "resolved_at": "2026-06-12T09:00:02Z",
  "aggregates_total": 7
}

Emitted once per episode (event type refresh.sla_recovered) when every aggregate has a successful refresh for the breached day. It is your "all clear": the data is whole again, even though the deadline was missed.

Worked example

Model "Sales" refreshes nightly at 03:00 UTC and takes about 20 minutes. You declare: target 07:00, grace 15 minutes, max retries 1, alerts on.
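The arithmetic behind this declaration can be checked directly. The sketch below only works out the deadline and the headroom of a normal night; the calendar date is arbitrary and headroom_for_sales is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone


def headroom_for_sales():
    """Return (typical_finish, deadline, headroom) for the example above."""
    nightly_start = datetime(2026, 6, 12, 3, 0, tzinfo=timezone.utc)
    typical_finish = nightly_start + timedelta(minutes=20)   # about 03:20 UTC
    target = datetime(2026, 6, 12, 7, 0, tzinfo=timezone.utc)
    deadline = target + timedelta(minutes=15)                # 07:15 UTC
    return typical_finish, deadline, deadline - typical_finish


typical_finish, deadline, headroom = headroom_for_sales()
print(deadline.strftime("%H:%M"), headroom)
```

On a normal night the refresh finishes 3 hours 55 minutes before the deadline, so a breach means the nightly run either failed or was delayed by nearly four hours.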

Best practices and pitfalls

Updating and deleting SLAs

SLAs can be updated via PATCH and deleted via DELETE (or with Edit / Remove in the Refresh SLA section). Deleting an SLA does not affect in-flight refreshes; breach checking simply stops from the next sweep cycle.

Limits

Related