Collected Metrics

We expose a large number of JVM and application metrics via the Dropwizard Metrics library.

Metrics can be exported by periodically writing CSV files, by logging, by sending them to InfluxDB, and/or via JMX. By default, only the JMX reporter is enabled. To enable or configure the others, see the comments on the metrics-reporter setting in the Config Ref Manual, specifically the part on the reporter types [jmx, csv, influxdb, slf4j]. Some metrics are also exposed as JSON at the HTTP endpoint /api/v1/admin/metrics.
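As a rough illustration of reading the JSON endpoint mentioned above, the sketch below fetches /api/v1/admin/metrics using only the Python standard library. The base URL (http://localhost:8080) and the helper names metrics_url and fetch_metrics are assumptions made for this example, not part of the product. The shape of the returned JSON is not described here, so the decoded value is returned unchanged.

```python
import json
import urllib.request

# Assumption: the service listens at this address; adjust host and port as needed.
DEFAULT_BASE_URL = "http://localhost:8080"
# Endpoint path as named in this document.
METRICS_PATH = "/api/v1/admin/metrics"


def metrics_url(base_url: str = DEFAULT_BASE_URL) -> str:
    """Join the base URL and the metrics path without doubling slashes."""
    return base_url.rstrip("/") + METRICS_PATH


def fetch_metrics(base_url: str = DEFAULT_BASE_URL, timeout: float = 5.0):
    """GET the metrics endpoint and decode the JSON body.

    The structure of the response is not assumed; whatever JSON value
    the server returns is handed back to the caller as-is.
    """
    request = urllib.request.Request(
        metrics_url(base_url),
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling fetch_metrics() against a running instance and pretty-printing the result with json.dumps(..., indent=2) is one quick way to see which metrics are available over HTTP.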

The metrics that we explicitly measure in our code are as follows. Shard, node and standing-query metrics are prefixed with a namespace.

  • shard
    • shard-{n}
      • sleep-counters: Counters that track the sleep cycle (in aggregate) of nodes on the shard
        • removed
        • slept-failure
        • slept-success
        • woken
      • unlikely: Counters that track occurrences of supposedly unlikely (and generally bad) code paths
        • wake-up-failed: Despite repeated attempts, the requested node could not be woken up.
        • wake-up-error: An unexpected error was encountered when attempting to wake up a node; will retry.
        • hard-limit-reached: A node was blocked from being woken up because the hard limit on the number of active nodes has been hit; will retry.
        • actor-name-reserved
  • node: Bucketed counters
    • edge-counts: A counter for the numbers of edges on nodes, split into buckets
      • 1-7
      • 128-2047
      • 2048-16383
      • 16384-infinity
    • property-counts: A counter for the numbers of properties on nodes, split into buckets
      • 1-7
      • 128-2047
      • 2048-16383
      • 16384-infinity
    • mailbox-sizes: A counter for the sizes of message mailboxes on nodes, split into buckets
      • 1-7
      • 128-2047
      • 2048-16383
      • 16384-infinity
  • persistor: All are timers, except snapshot-sizes, which is a histogram.
    • get-journal: Measures how long it takes to query a node’s journal from the persistor.
    • get-latest-snapshot: Measures how long it takes to retrieve a node’s snapshot from the persistor.
    • persist-event: Measures how long it takes to persist a change to a node’s state.
    • persist-snapshot: Measures how long it takes to persist a node’s snapshot.
    • set-standing-query-state: Measures how long it takes to persist standing query state.
    • get-standing-query-states: Measures how long it takes to retrieve standing query states.
    • snapshot-sizes: A histogram that measures the serialized size (in bytes) of a node’s persisted snapshot.
  • ingest
    • {ingest-name}: Both of the following are meters (count and rate)
      • count: Number of records ingested
      • bytes: Number of bytes ingested (aggregate data payload size)
  • standing-queries
    • results: Meter of results that were produced for a named standing query on this member
      • {standing-query-name}
    • dropped: Counter of results that were dropped for a named standing query on this member because too many messages were already in flight when the standing query applied backpressure. This should be zero.
      • {standing-query-name}
    • states: Histogram of the size (in bytes) of persistent standing query states.
      • {standing-query-id}
  • shared
    • valve.ingest: A gauge representing how many operations are currently pausing that ingest due to backpressure.
      • {ingest-name}
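
To make the bucketed node counters above more concrete, here is a minimal sketch of how a raw value (for example, a node's edge count) could be matched to one of the bucket labels listed in this document. The function name bucket_label and the reading of each label as an inclusive range are assumptions for illustration; this is not the implementation used by the product. Values that fall outside every label listed here return None.

```python
from typing import Optional

# Bucket labels exactly as listed in this document.
# Each entry is (label, inclusive lower bound, inclusive upper bound);
# an upper bound of None stands for "infinity".
# Assumption: the numbers in each label are inclusive bounds.
BUCKETS = [
    ("1-7", 1, 7),
    ("128-2047", 128, 2047),
    ("2048-16383", 2048, 16383),
    ("16384-infinity", 16384, None),
]


def bucket_label(value: int) -> Optional[str]:
    """Return the listed bucket label containing value, or None if no listed bucket matches."""
    for label, low, high in BUCKETS:
        if value >= low and (high is None or value <= high):
            return label
    return None
```

For example, bucket_label(5) yields "1-7" and bucket_label(20000) yields "16384-infinity".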

Other libraries we use also export metrics via this mechanism. For example, the Cassandra client reports metrics relating to usage of the Cassandra server; these can optionally be enabled in your config file: https://docs.datastax.com/en/developer/java-driver/4.17/manual/core/metrics/#enabling-specific-driver-metrics.