Recommended Alerts¶

Recommended production alerts for Novelty, expressed against the metrics it emits — so they work with any dashboard or alerting backend. Each gives you what to watch, warning and critical levels, and why.

Read the values from the metrics endpoint (GET /api/v2/system/metrics), via JMX, or from your configured metrics reporter; metric names are listed in Collected Metrics. The sections below follow the data path: ingest, persistor, graph, and host.

Three rules for every alert below

Alert on sustained conditions, not spikes. Use a few-minute "for"/pending window so a GC pause or latency blip doesn't page.
Tune the numbers to your baseline. These are starting points — adjust after watching steady state for a few days.
Poll volatile gauges directly. Fast gauges like shared.valve.ingest can flap between scrapes; read them from the endpoint or JMX, not a slow-poll reporter.
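The first rule can be sketched as a small pending-window tracker. This is only an illustration of the idea, not part of Novelty or of any alerting backend; the class name and the 300-second window are assumptions, and most backends provide an equivalent "for"/pending setting that should be preferred.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SustainedCondition:
    """Fires only after a condition has held continuously for `hold_seconds`."""

    hold_seconds: float
    _breach_started: Optional[float] = None  # time the current breach began

    def update(self, now: float, breaching: bool) -> bool:
        """Record one observation; return True if the alert should fire."""
        if not breaching:
            # Any healthy sample resets the pending window.
            self._breach_started = None
            return False
        if self._breach_started is None:
            self._breach_started = now
        return now - self._breach_started >= self.hold_seconds
```

Feed it one observation per scrape, for example `SustainedCondition(hold_seconds=300).update(now, p95_ms > 50)`; a single healthy sample clears the pending state, so a brief GC pause does not page.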

Metric names

Names below are written as they appear at the metrics endpoint. Metrics that are scoped to a model are prefixed with its name (shown here as {model-name}). Your metrics reporter may rewrite the . and - separators (for example to _). Some signals are emitted only as log messages rather than metrics; those are called out in the relevant section and marked as logs in the summary.
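As a rough illustration of the separator rewriting described above, the helper below fills in the {model-name} placeholder and replaces . and - with a single separator. The function name and the example model name are hypothetical, and your reporter's actual rewriting rules may differ, so check the names it really emits.

```python
def reporter_name(metric: str, model_name: str, separator: str = "_") -> str:
    """Fill in {model-name}, then rewrite '.' and '-' to `separator`.

    Assumption: the reporter replaces both characters with the same separator.
    """
    filled = metric.replace("{model-name}", model_name)
    return filled.replace(".", separator).replace("-", separator)


# "fraud" is a made-up model name used only for illustration.
print(reporter_name("{model-name}.node.edge-counts.2048-16383", "fraud"))
```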

Ingest Streams¶

Ingest rate per stream¶

What to watch: {model-name}.ingest.{name}.count (one-minute rate).

Why it matters: This is the live records-per-second rate for each stream. Two independent failure modes are worth alerting on:

Critical (stalled stream): rate == 0 for ≥ 5 minutes on a stream that should be active. This catches a dead or stuck ingest and needs no baseline.
Warning (degraded throughput): rate sustained below a chosen fraction (for example, 50%) of the stream's normal steady state. Set this per stream from your observed historical rate: a stream that normally runs ~350/min dropping to ~150/min is worth a look even though it isn't zero.

The rate is an exponentially weighted moving average, so it is volatile at the start and end of a stream; allow ~10 minutes for it to settle before drawing conclusions.
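The two ingest conditions can be evaluated together over a window of recent samples. The sketch below assumes you already collect (timestamp, rate) pairs for one stream and know its baseline; the function name, the sample format, and the choice to apply the same 5-minute window to the warning level are assumptions made for illustration.

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # (unix_seconds, observed one-minute rate)


def classify_ingest(samples: List[Sample], baseline: float,
                    window_seconds: float = 300.0,
                    degraded_fraction: float = 0.5) -> Optional[str]:
    """Return 'critical', 'warning', or None for one stream.

    `samples` must be sorted by time. Nothing is reported until the history
    spans at least `window_seconds`, so a single low reading never alerts.
    """
    if not samples:
        return None
    newest = samples[-1][0]
    if newest - samples[0][0] < window_seconds:
        return None  # not enough history to call anything sustained
    window = [rate for ts, rate in samples if newest - ts <= window_seconds]
    if all(rate == 0 for rate in window):
        return "critical"  # stalled for the whole window
    if all(rate < degraded_fraction * baseline for rate in window):
        return "warning"  # sustained below the chosen fraction of baseline
    return None
```

Only call this for streams that are expected to be active and past the settling period, since the moving-average rate is unreliable at the start and end of a stream.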

Rate alone does not catch every data problem

A mis-specified idFrom or an ingest race can silently produce missing or malformed data with no error and no drop in ingest rate. Pair these rate alerts with a node-count or output-volume baseline if data completeness matters.

Persistor¶

Persistor latency¶

What to watch: the persistor timers, weighting the write path most heavily because that is where back-pressure first shows up and propagates back into ingest:

persistor.persist-event, persistor.persist-snapshot — write path
persistor.get-journal, persistor.get-latest-snapshot — read path

Why it matters: These measure how long persistence operations take. Watch both the average and the 95th percentile — single-digit-millisecond p95 is healthy. As latency climbs into the tens of milliseconds and beyond, the persistor becomes the bottleneck and back-pressure propagates back into the ingest streams.

Warning: p95 sustained > 50 ms (well above a < 10 ms healthy baseline).
Critical: p95 sustained > 100 ms.

The shape of the latency tells you why the persistor is slow: a high average points to a general persistor bottleneck, while a high p95 with a low average points to occasional slow operations, often a supernode.
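The thresholds and the "shape" heuristic above can be combined in one check. This is a minimal sketch: the function name is hypothetical, and treating an average above the 10 ms healthy baseline as "high" is an assumption, not a documented cutoff.

```python
from typing import Optional, Tuple


def classify_persistor(mean_ms: float, p95_ms: float,
                       warn_ms: float = 50.0, crit_ms: float = 100.0,
                       high_mean_ms: float = 10.0
                       ) -> Tuple[Optional[str], Optional[str]]:
    """Return (severity, hint) for one persistor timer, or (None, None)."""
    if p95_ms > crit_ms:
        severity = "critical"
    elif p95_ms > warn_ms:
        severity = "warning"
    else:
        return None, None
    if mean_ms > high_mean_ms:
        # The average is elevated too: most operations are slow.
        hint = "general persistor bottleneck"
    else:
        # Only the tail is slow: a few operations dominate p95.
        hint = "occasional slow operations, possibly a supernode"
    return severity, hint
```

Apply it to sustained readings rather than single scrapes, and run it per timer so that write-path results (persist-event, persist-snapshot) can be weighted more heavily than the read path.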

Log signal: Query timed out after PT2S / DriverTimeoutException. If your persistor is Cassandra, this means a request exceeded the driver's request timeout (default 2 s) before Cassandra responded. If these correlate with Cassandra GC events in its gc.log, the Cassandra JVM is pausing; consider ScyllaDB, which does not run on a JVM and so has no GC pauses. Severity: critical on recurring timeouts (a lone timeout under heavy load can be transient).

Graph Health¶

Supernode edge counts¶

What to watch: the upper buckets of the edge-count histogram — {model-name}.node.edge-counts.2048-16383 and {model-name}.node.edge-counts.16384-infinity (counters).

Why it matters: This histogram counts how many in-memory nodes fall into each edge-count bucket — it is your supernode detector. Supernodes (nodes with very high edge counts) are expensive: they are slow to wake, sleep, and snapshot, they increase persistor load, and they serialize traversals through a single hot node.

Warning: the 2048-16383 bucket becomes non-zero and stays populated — nodes with thousands of edges are accumulating.
Critical: the 16384-infinity bucket becomes non-zero — a live supernode with tens of thousands of edges.

This metric only sees awake nodes

The edge-count histogram counts only nodes that are currently in memory, so a supernode that is asleep at scrape time does not appear in it, and some failures never increment a counter at all. Pair it with the log signal below for durable detection.

Log signal — Node <id> has: <N> edges: emitted every 10,000 edges on a single node. Because it fires regardless of whether the node is awake at scrape time, it catches a supernode the histogram can miss. Severity: warning, escalating as N grows.
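One way to consume this log signal is to track the largest edge count reported per node. The sketch below assumes the message text matches the shape quoted above and that node ids contain no whitespace; the regular expression, the function name, and the sample log prefix are illustrative assumptions, so verify them against your actual log output.

```python
import re
from typing import Dict, Iterable

# Assumed message shape, based on the text quoted above. The pattern searches
# within the line so that any timestamp or level prefix is ignored.
EDGE_LOG = re.compile(r"Node (?P<id>\S+) has: (?P<edges>\d[\d,]*) edges")


def max_edges_by_node(lines: Iterable[str]) -> Dict[str, int]:
    """Return the largest edge count seen in the logs for each node id."""
    seen: Dict[str, int] = {}
    for line in lines:
        match = EDGE_LOG.search(line)
        if match is None:
            continue  # not a supernode message
        edges = int(match.group("edges").replace(",", ""))
        node = match.group("id")
        seen[node] = max(edges, seen.get(node, 0))
    return seen
```

Escalating severity as N grows then becomes a comparison of each node's value against thresholds of your choosing.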

Critical-node mailboxes¶

What to watch: the upper buckets of node.mailbox-sizes (counters).

Why it matters: This histogram counts how many in-memory node mailboxes hold each number of queued messages. When the higher buckets populate, some nodes have become critical nodes — they are receiving more work than they can keep up with. This is often driven by supernodes, but can also point to a data-modeling or topology issue.

Warning: the upper buckets become non-zero and stay populated.

Oversized properties¶

What to watch: the top bucket of {model-name}.node.property-sizes (histogram).

Why it matters: This tracks the serialized size of node properties. If the largest bucket populates, some properties are extremely large and may approach the 1 MB single-value size guideline for Cassandra. Large values are slow to persist and can fail outright.

Warning: the top bucket becomes non-zero.

Host Metrics¶

Heap memory usage¶

What to watch: the ratio of memory.heap.used to memory.heap.max, evaluated over a moving average rather than instantaneously.

Why it matters: The JVM heap normally saw-tooths as garbage collection runs, so an instantaneous reading near the top is often just pre-collection. Alert on sustained pressure using a moving average to avoid false pages.

Warning: heap used / max sustained > 80%.
Critical: sustained > 90%.
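A moving average over the heap ratio might look like the sketch below. The class name and the window length (counted in samples, not seconds) are assumptions; size the window so that it spans several garbage-collection cycles at your scrape interval.

```python
from collections import deque
from typing import Deque, Optional


class HeapPressure:
    """Moving average of memory.heap.used / memory.heap.max."""

    def __init__(self, window: int = 30, warn: float = 0.80, crit: float = 0.90):
        self.window = window
        self.warn = warn
        self.crit = crit
        self.ratios: Deque[float] = deque(maxlen=window)

    def observe(self, used: float, maximum: float) -> Optional[str]:
        """Add one scrape; return 'critical', 'warning', or None."""
        if maximum <= 0:
            return None  # heap max not reported; nothing to compare against
        self.ratios.append(used / maximum)
        if len(self.ratios) < self.window:
            return None  # wait for a full window before judging
        average = sum(self.ratios) / len(self.ratios)
        if average > self.crit:
            return "critical"
        if average > self.warn:
            return "warning"
        return None
```

Because the saw-tooth peaks are averaged with the post-collection troughs, a heap that briefly touches 95% before each collection does not alert unless the troughs are also high.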

Heap sizing context

A static heap of 12 GB is recommended (16 GB maximum). If the JVM logs frequent long GC pauses, the heap is likely configured too large. Out-of-memory or OOM-killed errors indicate the opposite — too little memory for the instance, or an in-memory-soft-node-limit set too high. See Operational Considerations for resource planning.

Awake nodes versus capacity¶

What to watch: Novelty keeps "hot" nodes in memory and sleeps the rest. The number of nodes currently awake is not a single metric — it is derived from the per-shard sleep counters using a conservation identity:

awake ≈ Σ over shards ( woken − slept-success − slept-failure − removed )

using {model-name}.shard.{shard}.sleep-counters.woken, .slept-success, .slept-failure, and .removed. (All three exit counters must be subtracted from woken; subtracting only some of them makes the figure grow without bound.)

Why it matters: This tracks how full Novelty's in-memory capacity is. As awake nodes approach capacity, node-sleeping can no longer keep pace and memory pressure builds. Capacity is a function of your topology and configured node limits:

capacity_soft = shard-count × in-memory-soft-node-limit
capacity_hard = shard-count × in-memory-hard-node-limit

where shard-count is the number of shards (defaults to 4) and in-memory-soft-node-limit / in-memory-hard-node-limit are the per-shard cache limits (defaulting to 10,000 and 75,000).

All of these are described in the Configuration Reference.

Warning: awake nodes sustained above capacity_soft — node-sleeping is no longer keeping up and memory pressure is building.
Critical: awake nodes approaching capacity_hard — near the point where Novelty applies back-pressure on waking new nodes.
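The awake-node identity and the two capacity formulas can be worked through as follows. This is a sketch under stated assumptions: the function names and the input layout (one dict of counters per shard) are hypothetical, and treating 90% of capacity_hard as "approaching" is an arbitrary choice for illustration.

```python
from typing import Mapping, Optional, Tuple


def awake_nodes(shards: Mapping[str, Mapping[str, int]]) -> int:
    """Sum woken minus the three exit counters across all shards."""
    return sum(
        c["woken"] - c["slept-success"] - c["slept-failure"] - c["removed"]
        for c in shards.values()
    )


def capacity(shard_count: int = 4, soft_limit: int = 10_000,
             hard_limit: int = 75_000) -> Tuple[int, int]:
    """Return (capacity_soft, capacity_hard) using the documented defaults."""
    return shard_count * soft_limit, shard_count * hard_limit


def classify_awake(awake: int, capacity_soft: int, capacity_hard: int,
                   near_hard: float = 0.9) -> Optional[str]:
    """Return 'critical', 'warning', or None.

    `near_hard` is an assumed fraction of capacity_hard that counts as
    "approaching"; tune it to your deployment.
    """
    if awake >= near_hard * capacity_hard:
        return "critical"
    if awake > capacity_soft:
        return "warning"
    return None
```

With the defaults, capacity_soft is 4 × 10,000 = 40,000 and capacity_hard is 4 × 75,000 = 300,000, so 45,000 awake nodes sustained would be a warning.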

A related early warning is rapid growth in {model-name}.shard.{shard}.unlikely.incomplete-shutdown: it means nodes are being contacted just as they decide to sleep, wasting time serializing and persisting extra snapshots (cache thrash).

Summary¶

Signal	Type	What to watch	Warning	Critical
Ingest rate per stream	metric	`{model-name}.ingest.{name}.count`	< ~50% of baseline	`== 0` for ≥ 5 min (active stream)
Persistor latency (p95)	metric	`persistor.persist-event` / `persist-snapshot`	> 50 ms sustained	> 100 ms sustained
Cassandra timeout	log	`Query timed out after PT2S`	—	recurring
Supernode edge counts	metric	`{model-name}.node.edge-counts.*`	`2048-16383` populated	`16384-infinity` non-zero
Supernode (durable)	log	`Node <id> has: <N> edges`	any occurrence	escalate as `N` grows
Critical-node mailboxes	metric	`node.mailbox-sizes.*`	upper buckets populated	—
Oversized properties	metric	`{model-name}.node.property-sizes`	top bucket populated	—
Heap memory usage	metric	`memory.heap.used` / `memory.heap.max`	> 80% sustained	> 90% sustained
Awake nodes vs capacity	metric	`{model-name}.shard.{shard}.sleep-counters.*`	above `capacity_soft`	approaching `capacity_hard`

Know your baseline

The numbers above are starting points, not universal truths. Calibrate them to your own deployment — its steady-state ingest rate, persistor latency, heap size, and node counts. For broader resource-planning guidance (heap sizing, persistor-to-host ratios, scaling ceilings), see Operational Considerations.