Metrics
We expose a large number of JVM and application metrics via the Dropwizard Metrics library.
They can be exported by periodically writing CSV files, by logging (via SLF4J), to InfluxDB, and/or via JMX. By default, only the JMX reporter is enabled. See the comments on the metrics-reporter setting in the Config Ref Manual for how to enable and configure the others, i.e. the part on one of [jmx, csv, influxdb, slf4j]. Some metrics are also exposed as JSON on the HTTP endpoint /api/v1/admin/metrics.
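As a minimal sketch of reading the JSON endpoint mentioned above, the following Python snippet fetches /api/v1/admin/metrics and decodes the response. The base URL passed in (for example http://localhost:8080) is an assumption, so substitute the address of your own instance, and the helper names are hypothetical. The shape of the returned JSON is not described here, so the sketch only decodes the body without interpreting its fields.

```python
import json
import urllib.request

METRICS_PATH = "/api/v1/admin/metrics"  # endpoint path taken from the text above


def metrics_url(base_url: str) -> str:
    """Join a base URL (an assumed address) with the metrics endpoint path."""
    return base_url.rstrip("/") + METRICS_PATH


def fetch_metrics(base_url: str, timeout: float = 5.0):
    """GET the metrics endpoint and return the decoded JSON body unchanged."""
    with urllib.request.urlopen(metrics_url(base_url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `fetch_metrics("http://localhost:8080")` requires a running instance at that (assumed) address; `metrics_url` can be used on its own to build the address for other HTTP clients.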
The metrics that we explicitly measure in our code are as follows. Shard, node and standing-query metrics are prefixed with a namespace.
- shard
  - shard-{n}
    - sleep-counters: Counters that track the sleep cycle (in aggregate) of nodes on the shard
      - removed
      - slept-failure
      - slept-success
      - woken
    - unlikely: Counters that track occurrences of supposedly unlikely (and generally bad) code paths
      - wake-up-failed: Despite repeated attempts, the requested node could not be woken up.
      - wake-up-error: An unexpected error was encountered when attempting to wake up a node; will retry.
      - hard-limit-reached: A node was blocked from being woken up because the hard limit on the number of active nodes has been hit; will retry.
      - actor-name-reserved
- node: Bucketed counters
  - edge-counts: A counter for the number of edges on nodes, split into buckets
    - 1-7
    - 128-2047
    - 2048-16383
    - 16384-infinity
  - property-counts: A counter for the number of properties on nodes, split into buckets
    - 1-7
    - 128-2047
    - 2048-16383
    - 16384-infinity
  - mailbox-sizes: A counter for the sizes of message mailboxes on nodes, split into buckets
    - 1-7
    - 128-2047
    - 2048-16383
    - 16384-infinity
- persistor: All are timers, except snapshot-sizes, which is a histogram.
  - get-journal: Measures how long it takes to query a node’s journal from the persistor.
  - get-latest-snapshot: Measures how long it takes to retrieve a node’s snapshot from the persistor.
  - persist-event: Measures how long it takes to persist a change to a node’s state.
  - persist-snapshot: Measures how long it takes to persist a node’s snapshot.
  - set-standing-query-state: Measures how long it takes to persist standing query state.
  - get-standing-query-states: Measures how long it takes to retrieve standing query states.
  - snapshot-sizes: A histogram that measures the serialized size (in bytes) of a node’s persisted snapshot.
- ingest
  - {ingest-name}: Both are meters (count and rate)
    - count: Number of records ingested
    - bytes: Number of bytes ingested (aggregate data payload size)
- standing-queries
  - results: Meter of results that were produced for a named standing query on this member
    - {standing-query-name}
  - dropped: Counter of results that were dropped for a named standing query on this member because too many messages were already in flight when the standing query backpressured. This should be zero.
    - {standing-query-name}
  - states: Histogram of the size (in bytes) of persistent standing query states.
    - {standing-query-id}
- shared
  - valve.ingest: A gauge representing how many operations are currently pausing that ingest due to backpressure.
    - {ingest-name}
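To illustrate how the hierarchy above might map onto full metric names, here is a minimal Python sketch. It assumes each level is joined with a dot, as Dropwizard's `MetricRegistry.name` does, and it leaves out the namespace prefix because its value is not specified here. The function names and the flat name-to-count dictionary are hypothetical and not part of any API.

```python
def metric_name(*parts: str) -> str:
    """Join hierarchy levels with dots (assumed separator), skipping empty parts."""
    return ".".join(p for p in parts if p)


def shard_counter(shard: int, group: str, counter: str) -> str:
    """Name of a per-shard counter, e.g. group='sleep-counters', counter='woken'."""
    return metric_name("shard", f"shard-{shard}", group, counter)


def dropped_standing_query_results(counters: dict) -> dict:
    """Return the standing queries whose 'dropped' counter is non-zero.

    `counters` is a hypothetical flat mapping of full metric name -> count.
    """
    prefix = metric_name("standing-queries", "dropped") + "."
    return {
        name[len(prefix):]: value
        for name, value in counters.items()
        if name.startswith(prefix) and value != 0
    }
```

Under these assumptions, `shard_counter(0, "sleep-counters", "woken")` yields `shard.shard-0.sleep-counters.woken`. Because the dropped counter should be zero, any entry returned by `dropped_standing_query_results` points at a standing query that is worth investigating.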
Other libraries we use also export metrics via this mechanism. For example, the Cassandra client reports metrics relating to usage of the Cassandra server, which can optionally be enabled in your config file: https://docs.datastax.com/en/developer/java-driver/4.17/manual/core/metrics/#enabling-specific-driver-metrics.