Recommended Alerts¶
Recommended production alerts for Novelty, expressed against the metrics it emits — so they work with any dashboard or alerting backend. Each gives you what to watch, warning and critical levels, and why.
Read the values from the metrics endpoint (GET /api/v2/system/metrics), via JMX, or from your configured metrics reporter; metric names come from Collected Metrics. The sections below follow the data path: ingest, persistor, graph, and host.
Three rules for every alert below
- Alert on sustained conditions, not spikes. Use a few-minute "for"/pending window so a GC pause or latency blip doesn't page.
- Tune the numbers to your baseline. These are starting points — adjust after watching steady state for a few days.
- Poll volatile gauges directly. Fast gauges like shared.valve.ingest can flap between scrapes; read them from the endpoint or JMX, not a slow-poll reporter.
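The first rule can be sketched as a small state holder that only fires once a breach has lasted for the whole pending window. This is a minimal illustration, not part of Novelty: the class name `SustainedCondition` and the `hold_seconds` parameter are hypothetical, and most alerting backends already provide an equivalent "for" setting.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SustainedCondition:
    """Fires only after a condition has held continuously for hold_seconds."""

    hold_seconds: float
    _since: Optional[float] = None  # time at which the current breach started

    def update(self, breached: bool, now: float) -> bool:
        """Record one sample; return True once the breach has lasted long enough."""
        if not breached:
            self._since = None  # any healthy sample resets the pending window
            return False
        if self._since is None:
            self._since = now
        return now - self._since >= self.hold_seconds
```

Feed it one sample per scrape, passing the scrape timestamp as `now`. A single healthy sample resets the window, so a short GC pause or latency blip never reaches the paging threshold.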
Metric names
Names below are written as they appear at the metrics endpoint. Metrics that are scoped to
a model are prefixed with its name (shown here as
{model-name}). Your metrics reporter may rewrite the . and - separators
(for example to _). Some signals are emitted only as log messages rather than
metrics; those are called out in the relevant section and marked as logs in the summary.
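As an illustration of the separator rewrite, the sketch below maps an endpoint-style name to the form a reporter might emit. It assumes the reporter replaces every . and - with a single character; the function name `reporter_name` and the model name `my-model` are hypothetical, and your reporter's actual rules may differ.

```python
def reporter_name(metric: str, separator: str = "_") -> str:
    """Replace '.' and '-' in an endpoint metric name with the given separator."""
    return metric.replace(".", separator).replace("-", separator)


# Hypothetical model name, used only for illustration.
endpoint_name = "my-model.node.edge-counts.16384-infinity"
rewritten = reporter_name(endpoint_name)
```

Check the rewritten names against what your backend actually stores before you build alert rules on them.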
Ingest Streams¶
Ingest rate per stream¶
What to watch: {model-name}.ingest.{name}.count (one-minute rate).
Why it matters: This is the live records-per-second rate for each stream. Two independent failure modes are worth alerting on:
- Critical (stalled stream): rate == 0 for ≥ 5 minutes on a stream that should be active. This catches a dead or stuck ingest immediately and needs no baseline.
- Warning (degraded throughput): rate sustained below a chosen fraction (for example, 50%) of the stream's normal steady-state rate. Set this per stream from your observed historical rate: a stream that normally runs ~350/min and drops to ~150/min is worth a look even though it isn't zero.
The rate is an exponentially weighted moving average, so it is volatile at the start and end of a stream; allow ~10 minutes for it to settle before drawing conclusions.
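The two ingest conditions and the settling period can be combined into one check, sketched below. The function and parameter names are hypothetical; the caller is assumed to track how long the rate has been zero and how long the stream has been running, and to express `rate` and `baseline` in the same unit.

```python
from typing import Optional


def classify_ingest_rate(
    rate: float,
    baseline: float,
    zero_for_seconds: float,
    stream_age_seconds: float,
    degraded_fraction: float = 0.5,
    stall_seconds: float = 300.0,
    settle_seconds: float = 600.0,
) -> Optional[str]:
    """Return "critical", "warning", or None for one active ingest stream."""
    if stream_age_seconds < settle_seconds:
        return None  # the moving average is still settling
    if rate == 0 and zero_for_seconds >= stall_seconds:
        return "critical"  # stalled stream, no baseline needed
    if rate < degraded_fraction * baseline:
        return "warning"  # degraded throughput relative to steady state
    return None
```

With a baseline of 350, a rate of 150 is below the 50% mark of 175 and yields a warning. Suppressing both conditions during the first ten minutes is a design choice in this sketch; drop the settle check for the stall condition if you would rather be paged about a stream that never starts.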
Rate alone does not catch every data problem
A mis-specified idFrom or an ingest race can silently produce missing or malformed data
with no error and no drop in ingest rate. Pair these rate alerts with a node-count or
output-volume baseline if data completeness matters.
Persistor¶
Persistor latency¶
What to watch: the persistor timers, weighting the write path most heavily because that is where back-pressure first shows up and propagates back into ingest:
- persistor.persist-event, persistor.persist-snapshot: write path
- persistor.get-journal, persistor.get-latest-snapshot: read path
Why it matters: These measure how long persistence operations take. Watch both the average and the 95th percentile — single-digit-millisecond p95 is healthy. As latency climbs into the tens of milliseconds and beyond, the persistor becomes the bottleneck and back-pressure propagates back into the ingest streams.
- Warning: p95 sustained > 50 ms (well above a < 10 ms healthy baseline).
- Critical: p95 sustained > 100 ms.
The shape of the latency tells you why the persistor is slow: a high average points to a general persistor bottleneck, while a high p95 with a low average points to occasional slow operations, often a supernode.
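The thresholds and the shape heuristic can be expressed together, as in this sketch. It assumes you already have the mean and p95 of one persistor timer in milliseconds. The function name is hypothetical, and treating any mean above the 10 ms healthy baseline as "high" is an assumption made here for illustration, not a documented cutoff.

```python
from typing import Optional, Tuple


def classify_persistor_latency(
    mean_ms: float,
    p95_ms: float,
    warn_ms: float = 50.0,
    crit_ms: float = 100.0,
    healthy_ms: float = 10.0,
) -> Tuple[Optional[str], Optional[str]]:
    """Return (severity, likely cause) for one persistor timer."""
    if p95_ms > crit_ms:
        severity = "critical"
    elif p95_ms > warn_ms:
        severity = "warning"
    else:
        return None, None
    # Assumption: a mean above the healthy baseline counts as a "high average".
    if mean_ms > healthy_ms:
        hint = "general persistor bottleneck"
    else:
        hint = "occasional slow operations, often a supernode"
    return severity, hint
```

Apply it to the write-path timers first, and wrap the result in a sustained-condition window so a single slow scrape does not page.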
Log signal: Query timed out after PT2S / DriverTimeoutException. If your persistor is Cassandra, this means a request exceeded the driver's request timeout (default 2 s). If these correlate with Cassandra GC events in its gc.log, the Cassandra JVM is pausing; consider ScyllaDB, which has no GC pauses. Severity: critical on recurring timeouts (a lone timeout under heavy load can be transient).
Graph Health¶
Supernode edge counts¶
What to watch: the upper buckets of the edge-count histogram —
{model-name}.node.edge-counts.2048-16383 and
{model-name}.node.edge-counts.16384-infinity (counters).
Why it matters: This histogram counts how many in-memory nodes fall into each edge-count bucket — it is your supernode detector. Supernodes (nodes with very high edge counts) are expensive: they are slow to wake, sleep, and snapshot, they increase persistor load, and they serialize traversals through a single hot node.
- Warning: the 2048-16383 bucket becomes non-zero and stays populated, meaning nodes with thousands of edges are accumulating.
- Critical: the 16384-infinity bucket becomes non-zero, meaning there is a live supernode with tens of thousands of edges.
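A sketch of the bucket check follows. It assumes the metrics have already been flattened into a plain name-to-value mapping; that layout, the function name, and the model name in the usage note are assumptions for illustration rather than the documented response format of the metrics endpoint.

```python
from typing import Mapping, Optional


def classify_edge_counts(metrics: Mapping[str, float], model: str) -> Optional[str]:
    """Return "critical", "warning", or None from the two upper edge-count buckets."""
    prefix = f"{model}.node.edge-counts."
    if metrics.get(prefix + "16384-infinity", 0) > 0:
        return "critical"  # a live supernode is awake
    if metrics.get(prefix + "2048-16383", 0) > 0:
        return "warning"  # nodes with thousands of edges are accumulating
    return None
```

For example, `classify_edge_counts({"my-model.node.edge-counts.2048-16383": 3}, "my-model")` returns "warning". Because the warning level asks for the bucket to stay populated, evaluate the result over a pending window rather than on a single scrape.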
This metric only sees awake nodes
The edge-count histogram counts only nodes that are currently in memory — and some failures never increment a counter at all — so pair it with the log signal below for durable detection.
Log signal: the message Node <id> has: <N> edges is emitted every 10,000 edges on a single node. Because it fires regardless of whether the node is awake at scrape time, it catches a supernode the histogram can miss. Severity: warning, escalating as N grows.
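To track this log signal, a log processor can extract the node id and edge count and keep the largest count seen per node. The sketch below assumes the message appears literally as written above, with a whitespace-free id and a plain integer count; the regular expression and function name are illustrative and should be checked against your real log lines.

```python
import re
from typing import Dict, Iterable

# Assumed message shape: "Node <id> has: <N> edges"
EDGE_LOG = re.compile(r"Node (?P<id>\S+) has: (?P<n>\d+) edges")


def largest_edge_counts(lines: Iterable[str]) -> Dict[str, int]:
    """Return the highest logged edge count for each node id."""
    worst: Dict[str, int] = {}
    for line in lines:
        match = EDGE_LOG.search(line)
        if match:
            node_id = match.group("id")
            count = int(match.group("n"))
            worst[node_id] = max(count, worst.get(node_id, 0))
    return worst
```

The per-node maximum gives you the N to escalate on; which values of N justify a page is left to your own baseline.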
Critical-node mailboxes¶
What to watch: the upper buckets of node.mailbox-sizes (counters).
Why it matters: This histogram counts how many in-memory node mailboxes hold each number of queued messages. When the higher buckets populate, some nodes have become critical nodes — they are receiving more work than they can keep up with. This is often driven by supernodes, but can also point to a data-modeling or topology issue.
- Warning: the upper buckets become non-zero and stay populated.
Oversized properties¶
What to watch: the top bucket of {model-name}.node.property-sizes (histogram).
Why it matters: This tracks the serialized size of node properties. If the largest bucket populates, some properties are extremely large and may approach the 1 MB single-value size guideline imposed by Cassandra — large values are slow to persist and can fail outright.
- Warning: the top bucket becomes non-zero.
Host Metrics¶
Heap memory usage¶
What to watch: the ratio of memory.heap.used to memory.heap.max, evaluated over a
moving average rather than instantaneously.
Why it matters: The JVM heap normally saw-tooths as garbage collection runs, so an instantaneous reading near the top is often just pre-collection. Alert on sustained pressure using a moving average to avoid false pages.
- Warning: heap used / max sustained > 80%.
- Critical: sustained > 90%.
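A moving average over recent samples is one way to smooth the saw-tooth, as sketched below. The class name and the window size (counted in samples) are hypothetical; pick a window that spans several GC cycles at your scrape interval.

```python
from collections import deque
from typing import Optional


class HeapPressure:
    """Classifies heap usage from a moving average of used / max ratios."""

    def __init__(self, window: int = 10) -> None:
        self._ratios = deque(maxlen=window)  # most recent ratios only

    def update(self, used: float, maximum: float) -> Optional[str]:
        """Add one sample; return "critical", "warning", or None."""
        self._ratios.append(used / maximum)
        if len(self._ratios) < self._ratios.maxlen:
            return None  # not enough samples for a full window yet
        average = sum(self._ratios) / len(self._ratios)
        if average > 0.90:
            return "critical"
        if average > 0.80:
            return "warning"
        return None
```

With a three-sample window, readings of 95%, 40%, and 95% average to about 77% and raise nothing, while three consecutive readings of 95% are classified as critical.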
Heap sizing context
A static heap of 12 GB is recommended (16 GB maximum). If the JVM logs frequent long GC pauses,
the heap is likely configured too large. Out-of-memory or OOM-killed errors indicate the
opposite — too little memory for the instance, or an in-memory-soft-node-limit set too
high. See Operational Considerations
for resource planning.
Awake nodes versus capacity¶
What to watch: Novelty keeps "hot" nodes in memory and sleeps the rest. The number of nodes currently awake is not a single metric — it is derived from the per-shard sleep counters using a conservation identity:
awake ≈ Σ over shards ( woken − slept-success − slept-failure − removed )
using {model-name}.shard.{shard}.sleep-counters.woken, .slept-success,
.slept-failure, and .removed. (All three exit paths must be subtracted; subtracting only
some of them makes the figure grow without bound.)
Why it matters: This tracks how full Novelty's in-memory capacity is. As awake nodes approach capacity, node-sleeping can no longer keep pace and memory pressure builds. Capacity is a function of your topology and configured node limits:
capacity_soft = shard-count × in-memory-soft-node-limit
capacity_hard = shard-count × in-memory-hard-node-limit
where shard-count is the number of shards (defaults to 4) and in-memory-soft-node-limit /
in-memory-hard-node-limit are the per-shard cache limits (defaulting to 10,000 and 75,000).
All of these are described in the Configuration Reference.
- Warning: awake nodes sustained above capacity_soft; node-sleeping is no longer keeping up and memory pressure is building.
- Critical: awake nodes approaching capacity_hard, near the point where Novelty applies back-pressure on waking new nodes.
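The awake-node identity and the two capacity levels can be worked through in code. This sketch assumes the per-shard counters have been collected into dictionaries keyed by counter name; the function names are hypothetical, and the 90% cutoff used for "approaching" capacity_hard is an assumption, not a documented value.

```python
from typing import Iterable, Mapping, Optional


def awake_nodes(shards: Iterable[Mapping[str, int]]) -> int:
    """Sum woken minus every exit path (slept-success, slept-failure, removed)."""
    return sum(
        s["woken"] - s["slept-success"] - s["slept-failure"] - s["removed"]
        for s in shards
    )


def classify_awake(
    awake: int,
    shard_count: int = 4,
    soft_limit: int = 10_000,
    hard_limit: int = 75_000,
    near_hard_fraction: float = 0.9,  # assumed meaning of "approaching"
) -> Optional[str]:
    """Return "critical", "warning", or None against soft and hard capacity."""
    capacity_soft = shard_count * soft_limit
    capacity_hard = shard_count * hard_limit
    if awake >= near_hard_fraction * capacity_hard:
        return "critical"
    if awake > capacity_soft:
        return "warning"
    return None
```

With the default settings, capacity_soft is 4 × 10,000 = 40,000 and capacity_hard is 4 × 75,000 = 300,000, so this sketch warns above 40,000 awake nodes and goes critical from 270,000.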
A related early warning is rapid growth in
{model-name}.shard.{shard}.unlikely.incomplete-shutdown: it means nodes are being
contacted just as they decide to sleep, wasting time serializing and persisting extra
snapshots (cache thrash).
Summary¶
| Signal | Type | What to watch | Warning | Critical |
|---|---|---|---|---|
| Ingest rate per stream | metric | {model-name}.ingest.{name}.count | < ~50% of baseline | == 0 for ≥ 5 min (active stream) |
| Persistor latency (p95) | metric | persistor.persist-event / persistor.persist-snapshot | > 50 ms sustained | > 100 ms sustained |
| Cassandra timeout | log | Query timed out after PT2S | n/a | recurring |
| Supernode edge counts | metric | {model-name}.node.edge-counts.* | 2048-16383 populated | 16384-infinity non-zero |
| Supernode (durable) | log | Node <id> has: <N> edges | message logged | escalates as N grows |
| Critical-node mailboxes | metric | node.mailbox-sizes.* | upper buckets populated | n/a |
| Oversized properties | metric | {model-name}.node.property-sizes | top bucket populated | n/a |
| Heap memory usage | metric | memory.heap.used / memory.heap.max | > 80% sustained | > 90% sustained |
| Awake nodes vs capacity | metric | {model-name}.shard.{shard}.sleep-counters.* | above capacity_soft | approaching capacity_hard |
Know your baseline
The numbers above are starting points, not universal truths. Calibrate them to your own deployment — its steady-state ingest rate, persistor latency, heap size, and node counts. For broader resource-planning guidance (heap sizing, persistor-to-host ratios, scaling ceilings), see Operational Considerations.