Recommended Alerts¶
Recommended production alerts for Quine Enterprise, expressed against the metrics it emits — so they work with any dashboard or alerting backend. Each gives you what to watch, warning and critical levels, and why.
Read the values from the Metrics: GET /api/v2/system/metrics
endpoint, via JMX, or from your configured metrics reporter; metric names come from
Collected Metrics. The sections below follow the data path: ingest,
standing queries, persistor, graph, and host, then cluster.
Three rules for every alert below
- Alert on sustained conditions, not spikes. Use a few-minute "for"/pending window so a GC pause or latency blip doesn't page.
- Tune the numbers to your baseline. These are starting points — adjust after watching steady state for a few days.
- Poll volatile gauges directly. Fast gauges like shared.valve.ingest can flap between scrapes; read them from the endpoint or JMX, not a slow-poll reporter.
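The "sustained, not spikes" rule can be sketched as a small pending-window check. This is a minimal illustration, not part of Quine Enterprise: the class name `PendingAlert` and its fields are hypothetical, and most alerting backends already provide an equivalent "for"/pending setting.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PendingAlert:
    """Fires only once a condition has held continuously for `hold_seconds`.

    Hypothetical helper for illustration; timestamps are plain seconds.
    """
    hold_seconds: float
    breach_started: Optional[float] = None  # time of the first breaching sample

    def observe(self, timestamp: float, breaching: bool) -> bool:
        if not breaching:
            # Any healthy sample resets the window, so a brief blip never pages.
            self.breach_started = None
            return False
        if self.breach_started is None:
            self.breach_started = timestamp
        return timestamp - self.breach_started >= self.hold_seconds
```

Feeding it one sample per scrape, a condition that breaches, recovers, and breaches again restarts the window each time, so only an uninterrupted breach of `hold_seconds` returns `True`.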
When an alert fires and you need to find the underlying cause, see Diagnosing Bottlenecks, which explains how to read each metric and trace a symptom to its root cause.
Metric names
Names below are written as they appear at the metrics endpoint. Metrics that are scoped to
a graph are prefixed with its name (shown here as
{graph-name}). Your metrics reporter may rewrite the . and - separators
(for example to _). Some signals are emitted only as log messages rather than
metrics; those are called out in the relevant section and marked as logs in the summary.
Ingest Streams¶
Ingest rate per stream¶
What to watch: {graph-name}.ingest.{name}.count (one-minute rate).
Why it matters: This is the live records-per-second rate for each stream. Two independent failure modes are worth alerting on:
- Critical (stalled stream): rate == 0 for ≥ 5 minutes on a stream that should be active. This catches a dead or stuck ingest immediately and needs no baseline.
- Warning (degraded throughput): rate sustained below a chosen fraction (for example, 50%) of the stream's normal steady-state. Set this per stream from your observed historical rate; a stream that normally runs ~350/min dropping to ~150/min is worth a look even though it isn't zero.
The rate is an exponentially weighted moving average, so it is volatile at the start and end of a stream; allow ~10 minutes for it to settle before drawing conclusions.
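The two ingest-rate conditions above could be evaluated along these lines. This is a sketch under stated assumptions: the function name, its parameters, and the way the time spent at zero is tracked are all hypothetical, and the warning result should still be held for a pending window before it notifies anyone.

```python
from typing import Optional


def classify_ingest_rate(
    rate: float,
    baseline: float,
    seconds_at_zero: float,
    degraded_fraction: float = 0.5,   # assumed default, tune per stream
    stall_seconds: float = 300.0,     # 5 minutes, as in the critical rule
) -> Optional[str]:
    """Return "critical", "warning", or None for one stream's current rate.

    `rate` and `baseline` must use the same unit; `seconds_at_zero` is how
    long the rate has read exactly zero (tracked by the caller).
    """
    if rate == 0 and seconds_at_zero >= stall_seconds:
        return "critical"  # stalled stream
    if baseline > 0 and rate < degraded_fraction * baseline:
        return "warning"   # degraded throughput
    return None
```

With the figures from the text, a stream whose baseline is 350 and whose current rate is 150 falls below the 50% line of 175 and is classified as a warning.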
Rate alone does not catch every data problem
A mis-specified idFrom or an ingest race can silently produce missing or malformed data
with no error and no drop in ingest rate. Pair these rate alerts with a node-count or
output-volume baseline if data completeness matters.
A stalled stream can be expected during recovery
After a disaster recovery event, ingest streams resume from their last committed offsets and may sit at zero or run below baseline while they catch up. Account for recovery windows when tuning the "stalled stream" alert.
Standing Queries¶
Standing query backpressure¶
What to watch: shared.valve.ingest.{name} (gauge, one per ingest stream).
Why it matters: This gauge reports how many standing queries are currently pausing an
ingest because the standing query result queue is filling up faster than results can be
processed. 0 means the valve is open (healthy). A non-zero value means Quine Enterprise is
applying backpressure to protect itself — the ingest is waiting for standing query work to
catch up. That is the system working as designed, but a valve that stays closed means
something downstream of the match (the output query or its destination) can't keep up. Note
this is not data loss on its own — data is only lost once dropped results appear (see below).
- Warning: sustained non-zero on a stream that should be flowing freely.
Dropped standing query results¶
What to watch: {graph-name}.standing-queries.dropped.{name} (counter).
As a leading indicator, also watch {graph-name}.standing-queries.queue-time.{name}
(timer) — a rising queue time means an output is getting slow before it starts dropping.
Why it matters: This counter records standing query results that were irrecoverably dropped. Each drop is also accompanied by a WARN log explaining why. The backpressure valve normally prevents the result queue from overflowing, so any sustained increase here means real data loss.
- Critical: any sustained increase (the counter should stay flat).
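Because the signal is a counter that should stay flat, the check is on its change between scrapes rather than its absolute value. The sketch below assumes you keep a short list of recent counter samples, oldest first; the function names and the choice of two consecutive rising intervals as "sustained" are illustrative assumptions.

```python
from typing import List, Sequence


def counter_deltas(samples: Sequence[int]) -> List[int]:
    """Per-interval increases between successive counter samples.

    If a sample is lower than its predecessor, the counter is assumed to
    have restarted from zero, so the new value itself is the increase.
    """
    return [
        curr - prev if curr >= prev else curr
        for prev, curr in zip(samples, samples[1:])
    ]


def sustained_increase(samples: Sequence[int], min_rising: int = 2) -> bool:
    """True when each of the last `min_rising` intervals saw new drops."""
    recent = counter_deltas(samples)[-min_rising:]
    return len(recent) == min_rising and all(d > 0 for d in recent)
```

For example, samples of `[4, 4, 5, 7]` have risen in both of the last two intervals and would alert, while `[4, 5, 5]` would not.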
Persistor¶
Persistor latency¶
What to watch: the persistor timers, weighting the write path most heavily because that is where back-pressure first shows up and propagates back into ingest:
- persistor.persist-event, persistor.persist-snapshot: write path
- persistor.get-journal, persistor.get-latest-snapshot: read path
Why it matters: These measure how long persistence operations take. Watch both the average and the 95th percentile — single-digit-millisecond p95 is healthy. As latency climbs into the tens of milliseconds and beyond, the persistor becomes the bottleneck and back-pressure propagates back into the ingest streams.
- Warning: p95 sustained > 50 ms (well above a < 10 ms healthy baseline).
- Critical: p95 sustained > 100 ms.
The shape of the latency tells you why the persistor is slow: a high average points to a
general persistor bottleneck, while a high p95 with a low average points to occasional slow
operations, often a supernode. See
Diagnosing Bottlenecks for
more on reading these, and — if your persistor is Cassandra — for the driver-side latency
metric (s{n}.cql-requests).
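The thresholds and the "shape" reading above can be combined in one small classifier. This is only a sketch: the function name is hypothetical, and treating a mean above the 10 ms healthy baseline as a "high average" is an assumption you should replace with your own baseline.

```python
from typing import Optional, Tuple


def classify_persistor_latency(
    mean_ms: float,
    p95_ms: float,
    warn_ms: float = 50.0,
    crit_ms: float = 100.0,
    healthy_ms: float = 10.0,  # assumed cut-off for a "high" average
) -> Tuple[Optional[str], Optional[str]]:
    """Return (severity, likely_cause) for one persistor timer."""
    if p95_ms > crit_ms:
        severity = "critical"
    elif p95_ms > warn_ms:
        severity = "warning"
    else:
        return None, None

    # High average: everything is slow. Low average with a high p95:
    # a few slow operations, often a supernode.
    if mean_ms > healthy_ms:
        cause = "general bottleneck"
    else:
        cause = "occasional slow operations"
    return severity, cause
```

Apply it separately to each timer, giving the write-path results more weight as described above.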
Log signal — Query timed out after PT2S / DriverTimeoutException: if your persistor is
Cassandra, this means a request exceeded the driver's request timeout (default 2 s). If these correlate
with Cassandra GC events in its gc.log, the Cassandra JVM is pausing — consider ScyllaDB,
which has no GC pauses. Severity: critical on recurring timeouts (a lone timeout under heavy load can be transient).
If persistor latency is consistently high rather than an occasional supernode blip, the Cassandra tier may be under-provisioned — see Cluster Sizing.
Graph Health¶
Supernode edge counts¶
What to watch: the upper buckets of the edge-count histogram —
{graph-name}.node.edge-counts.2048-16383 and
{graph-name}.node.edge-counts.16384-infinity (counters).
Why it matters: This histogram counts how many in-memory nodes fall into each edge-count bucket — it is your supernode detector. Supernodes (nodes with very high edge counts) are expensive: they are slow to wake, sleep, and snapshot, they increase persistor load, and they serialize traversals through a single hot node.
- Warning: the 2048-16383 bucket becomes non-zero and stays populated, meaning nodes with thousands of edges are accumulating.
- Critical: the 16384-infinity bucket becomes non-zero, indicating a live supernode with tens of thousands of edges.
This metric only sees awake nodes
The edge-count histogram counts only nodes that are currently in memory — and some failures never increment a counter at all — so pair it with the log signal below for durable detection.
See Cluster Performance for supernode mitigation strategies.
Log signal — Node <id> has: <N> edges: emitted every 10,000 edges on a single node.
Because it fires regardless of whether the node is awake at scrape time, it catches a supernode
the histogram can miss. Severity: warning, escalating as N grows.
Critical-node mailboxes¶
What to watch: the upper buckets of node.mailbox-sizes (counters).
Why it matters: This histogram counts how many in-memory node mailboxes hold each number of queued messages. When the higher buckets populate, some nodes have become critical nodes — they are receiving more work than they can keep up with. This is often driven by supernodes, but can also point to a data-modeling or topology issue.
- Warning: the upper buckets become non-zero and stay populated.
Oversized properties¶
What to watch: the top bucket of {graph-name}.node.property-sizes (histogram).
Why it matters: This tracks the serialized size of node properties. If the largest bucket populates, some properties are extremely large and may approach the 1 MB single-value size guideline imposed by Cassandra — large values are slow to persist and can fail outright.
- Warning: the top bucket becomes non-zero.
Host Metrics¶
Heap memory usage¶
What to watch: the ratio of memory.heap.used to memory.heap.max, evaluated over a
moving average rather than instantaneously.
Why it matters: The JVM heap normally saw-tooths as garbage collection runs, so an instantaneous reading near the top is often just pre-collection. Alert on sustained pressure using a moving average to avoid false pages.
- Warning: heap used / max sustained > 80%.
- Critical: sustained > 90%.
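A moving-average evaluation of the heap ratio might look like the following. It is a minimal sketch: the class name and window length are hypothetical, and it deliberately returns nothing until the window is full so that a single pre-collection peak cannot trigger it.

```python
from collections import deque
from typing import Optional


class HeapPressure:
    """Moving average of used/max heap over the last `window` samples."""

    def __init__(self, window: int = 10, warn: float = 0.80, crit: float = 0.90):
        self.ratios = deque(maxlen=window)  # old samples fall off automatically
        self.warn = warn
        self.crit = crit

    def observe(self, heap_used: float, heap_max: float) -> Optional[str]:
        self.ratios.append(heap_used / heap_max)
        if len(self.ratios) < self.ratios.maxlen:
            return None  # not enough history to call the pressure sustained
        average = sum(self.ratios) / len(self.ratios)
        if average > self.crit:
            return "critical"
        if average > self.warn:
            return "warning"
        return None
```

Choose the window so that it spans several garbage-collection cycles at your scrape interval; a window shorter than one saw-tooth period defeats the purpose of averaging.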
Heap sizing context
A static heap of 12 GB is recommended (16 GB maximum). If the JVM logs frequent long GC pauses,
the heap is likely configured too large. Out-of-memory or OOM-killed errors indicate the
opposite — too little memory for the instance, or an in-memory-soft-node-limit set too
high. See Operational Considerations
for resource planning. For per-host memory and heap sizing across a
cluster, see Cluster Sizing.
Awake nodes versus capacity¶
What to watch: Quine Enterprise keeps "hot" nodes in memory and sleeps the rest. The number of nodes currently awake is not a single metric — it is derived from the per-shard sleep counters using a conservation identity:
awake ≈ Σ over shards ( woken − slept-success − slept-failure − removed )
using {graph-name}.shard.{shard}.sleep-counters.woken, .slept-success,
.slept-failure, and .removed. (All four exit paths must be subtracted — subtracting only
some of them makes the figure grow without bound.)
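As a worked example of the identity above, the sketch below sums the four counters across shards. It assumes you have already read the counters into one dictionary per shard, keyed by the counter suffixes named above; how you collect them (endpoint, JMX, or reporter) and the function name are left as assumptions.

```python
from typing import Dict, Iterable


def awake_nodes(shard_counters: Iterable[Dict[str, int]]) -> int:
    """Approximate awake-node count from per-shard sleep counters.

    Each dictionary holds one shard's counters. All four exit paths are
    subtracted from `woken`, as the identity requires.
    """
    return sum(
        c["woken"] - c["slept-success"] - c["slept-failure"] - c["removed"]
        for c in shard_counters
    )


shards = [
    {"woken": 1000, "slept-success": 700, "slept-failure": 10, "removed": 40},
    {"woken": 500, "slept-success": 300, "slept-failure": 0, "removed": 50},
]
print(awake_nodes(shards))  # 250 + 150 = 400
```

The result is an approximation because the counters are read at slightly different instants, which is why the identity is written with ≈.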
Why it matters: This tracks how full Quine Enterprise's in-memory capacity is. As awake nodes approach capacity, node-sleeping can no longer keep pace and memory pressure builds. Capacity is a function of your topology and configured node limits:
capacity_soft = members × shard-count × in-memory-soft-node-limit
capacity_hard = members × shard-count × in-memory-hard-node-limit
where:
- members is the number of cluster members (your configured quine.cluster.target-size).
- shard-count is the number of shards each member runs (defaults to 4).
- in-memory-soft-node-limit / in-memory-hard-node-limit are the per-shard cache limits (defaulting to 10,000 and 75,000).
All of these are described in the Configuration Reference.
- Warning: awake nodes sustained above capacity_soft, meaning node-sleeping is no longer keeping up and memory pressure is building.
- Critical: awake nodes approaching capacity_hard, near the point where Quine Enterprise applies back-pressure on waking new nodes.
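The capacity formulas and the two thresholds above can be worked through as follows. The defaults mirror the values quoted in this section; the function names and the 90% margin used to interpret "approaching" capacity_hard are assumptions for illustration.

```python
from typing import Optional, Tuple


def capacity(
    members: int,
    shard_count: int = 4,
    soft_limit: int = 10_000,
    hard_limit: int = 75_000,
) -> Tuple[int, int]:
    """Return (capacity_soft, capacity_hard) for the whole cluster."""
    shards = members * shard_count
    return shards * soft_limit, shards * hard_limit


def classify_awake(
    awake: int, soft: int, hard: int, hard_margin: float = 0.9
) -> Optional[str]:
    """Compare an awake-node estimate against the two capacities."""
    if awake >= hard_margin * hard:  # assumed meaning of "approaching"
        return "critical"
    if awake > soft:
        return "warning"
    return None


soft, hard = capacity(members=3)
print(soft, hard)  # 120000 900000
```

For a three-member cluster on default settings, an awake-node estimate of 150,000 is above capacity_soft and would warn, while 850,000 is within 10% of capacity_hard and would be critical under the assumed margin.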
A related early warning is rapid growth in
{graph-name}.shard.{shard}.unlikely.incomplete-shutdown: it means nodes are being
contacted just as they decide to sleep, wasting time serializing and persisting extra
snapshots (cache thrash).
Sustained pressure here means the workload's in-memory working set is outgrowing the cluster's capacity — the signal to raise node limits or add members. See Cluster Sizing for how to measure and estimate a target size.
Cluster Health¶
These conditions surface only as log messages from the cluster's cross-host messaging layer.
Log signal — Ask relayed by graph timed out: a cross-host request timed out — usually
node thread saturation, a supernode, or an unstable/slow network. Severity: warning on repeated occurrences.
Log signal — PhiAccrualFailureDetector ... heartbeat interval is growing too large: CPU
starvation, commonly from Kubernetes CPU limits set too close to requests. Setting limits 2–3×
the request often resolves it. Severity: warning if sustained.
Summary¶
| Signal | Type | What to watch | Warning | Critical |
|---|---|---|---|---|
| Ingest rate per stream | metric | {graph-name}.ingest.{name}.count | < ~50% of baseline | == 0 for ≥ 5 min (active stream) |
| Standing query backpressure | metric | shared.valve.ingest.{name} | sustained non-zero | n/a |
| Dropped SQ results | metric | {graph-name}.standing-queries.dropped.{name} | n/a | any sustained increase |
| Persistor latency (p95) | metric | persistor.persist-event / persist-snapshot | > 50 ms sustained | > 100 ms sustained |
| Cassandra timeout | log | Query timed out after PT2S | n/a | recurring |
| Supernode edge counts | metric | {graph-name}.node.edge-counts.* | 2048-16383 populated | 16384-infinity non-zero |
| Supernode (durable) | log | Node <id> has: <N> edges | populated | as N grows |
| Critical-node mailboxes | metric | node.mailbox-sizes.* | upper buckets populated | n/a |
| Oversized properties | metric | {graph-name}.node.property-sizes | top bucket populated | n/a |
| Heap memory usage | metric | memory.heap.used / memory.heap.max | > 80% sustained | > 90% sustained |
| Awake nodes vs capacity | metric | {graph-name}.shard.{shard}.sleep-counters.* | above capacity_soft | approaching capacity_hard |
| Cross-host request timeout | log | Ask relayed by graph timed out | recurring | n/a |
| Cluster heartbeat / CPU starvation | log | PhiAccrualFailureDetector ... | sustained | n/a |
Know your baseline
The numbers above are starting points, not universal truths. Calibrate them to your own deployment — its steady-state ingest rate, persistor latency, heap size, and node counts. For broader resource-planning guidance (heap sizing, persistor-to-host ratios, scaling ceilings), see Operational Considerations.