Collected Metrics
We expose a large number of JVM and application metrics via the DropWizard Metrics library.
They can be exported by periodically writing CSV files, by logging, by sending them to InfluxDB, and/or via JMX. By default, only the JMX reporter is enabled. See the comments on the metrics-reporters setting in the Config Ref Manual for how to enable and configure the other reporters, i.e. the part covering jmx, csv, influxdb, and slf4j. Some metrics are also exposed as JSON on the Metrics HTTP endpoint: GET /api/v2/admin/metrics.
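As a minimal sketch of reading the JSON endpoint mentioned above, the following Python snippet builds the request URL and fetches the payload using only the standard library. The helper names `metrics_url` and `fetch_metrics` are hypothetical, and the base URL is whatever address your instance listens on. The shape of the returned JSON is not described here, so the code only decodes it and does not interpret any fields.

```python
import json
from urllib.request import urlopen

METRICS_PATH = "/api/v2/admin/metrics"  # endpoint path taken from the text above


def metrics_url(base_url: str) -> str:
    """Join a base URL (an assumption, supplied by the caller) with the metrics path."""
    return base_url.rstrip("/") + METRICS_PATH


def fetch_metrics(base_url: str, timeout: float = 5.0):
    """GET the metrics endpoint and return the decoded JSON payload as-is."""
    with urlopen(metrics_url(base_url), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

For example, `fetch_metrics("http://localhost:8080")` would query an instance assumed to be listening locally on port 8080; substitute the address of your own deployment.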
Available Metrics
The metrics that we explicitly measure in our code are as follows. Shard, node and standing-query metrics are prefixed with a namespace.
- shard
    - shard-{n}
        - sleep-counters: Counters that track the sleep cycle (in aggregate) of nodes on the shard
            - removed
            - slept-failure
            - slept-success
            - woken
        - unlikely: Counters that track occurrences of supposedly unlikely (and generally bad) code paths
            - wake-up-failed: Despite repeated attempts, the requested node could not be woken up.
            - wake-up-error: An unexpected error was encountered when attempting to wake up a node; will retry.
            - hard-limit-reached: A node was blocked from being woken up because the hard limit for the number of active nodes has been hit; will retry.
            - actor-name-reserved
- node: Bucketed counters
    - edge-counts: A counter for the number of edges on nodes, split into buckets
        - 1-7
        - 8-127
        - 128-2047
        - 2048-16383
        - 16384-infinity
    - property-counts: A counter for the number of properties on nodes, split into buckets
        - 1-7
        - 8-127
        - 128-2047
        - 2048-16383
        - 16384-infinity
    - mailbox-sizes: A counter for the sizes of message mailboxes on nodes, split into buckets
        - 1-7
        - 8-127
        - 128-2047
        - 2048-16383
        - 16384-infinity
- persistor: All are timers, except snapshot-sizes, which is a histogram.
    - get-journal: Measures how long it takes to query a node's journal from the persistor.
    - get-latest-snapshot: Measures how long it takes to retrieve a node's snapshot from the persistor.
    - persist-event: Measures how long it takes to persist a change to a node's state.
    - persist-snapshot: Measures how long it takes to persist a node's snapshot.
    - set-standing-query-state: Measures how long it takes to persist standing query state.
    - get-standing-query-states: Measures how long it takes to retrieve standing query states.
    - snapshot-sizes: A histogram that measures the serialized size (in bytes) of a node's persisted snapshot.
- ingest
    - {ingest-name}: Both meters (count and rate)
        - count: Number of records ingested
        - bytes: Number of bytes ingested (aggregate data payload size)
- standing-queries
    - results: Meter of results that were produced for a named standing query on this member
        - {standing-query-name}
    - dropped: Counter of results that were dropped for a named standing query on this member because too many messages were already in flight when the standing query backpressured. This should be zero.
        - {standing-query-name}
    - states: Histogram of the size (in bytes) of persistent standing query states.
        - {standing-query-id}
- shared
    - valve.ingest: A gauge representing how many operations are currently pausing that ingest due to backpressure.
        - {ingest-name}
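To make the bucket boundaries used by edge-counts, property-counts, and mailbox-sizes concrete, here is a small Python sketch that maps a value to the bucket label it would fall under. The function name `bucket_label` is hypothetical, and the handling of values below 1 is an assumption: the list above has no bucket for zero, so the sketch returns `None` in that case.

```python
from typing import Optional

# Lower bound and label of each bucket listed above, largest first.
BUCKETS = [
    (16384, "16384-infinity"),
    (2048, "2048-16383"),
    (128, "128-2047"),
    (8, "8-127"),
    (1, "1-7"),
]


def bucket_label(value: int) -> Optional[str]:
    """Return the label of the bucket containing value, or None if below 1."""
    for lower_bound, label in BUCKETS:
        if value >= lower_bound:
            return label
    # Assumption: values below 1 are not counted in any listed bucket.
    return None
```

Each boundary is a power of two (8, 128, 2048, 16384), so a node with 127 edges lands in 8-127 and one with 128 edges lands in 128-2047.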
Other libraries we use also export metrics via this mechanism. For example, the Cassandra client reports metrics relating to usage of the Cassandra server, which can optionally be enabled in your config file: https://docs.datastax.com/en/developer/java-driver/4.17/manual/core/metrics/#enabling-specific-driver-metrics.
Namespaced Metrics
In Quine Enterprise, metrics that are scoped to a namespace include the namespace name as a prefix. For example, node.property-counts becomes default.node.property-counts when using the default namespace, or tenant1.node.property-counts for a namespace called tenant1.
The following metrics include the namespace prefix:
| Metric Category | Example Metric Name |
|---|---|
| Node properties | <namespace>.node.property-counts |
| Node edges | <namespace>.node.edge-counts |
| Property sizes | <namespace>.node.property-sizes |
| Standing query results | <namespace>.standing-queries.results.<name> |
| Standing query dropped | <namespace>.standing-queries.dropped.<name> |
| Standing query states | <namespace>.standing-queries.states.<id> |
| Ingest count | <namespace>.ingest.<name>.count |
| Ingest bytes | <namespace>.ingest.<name>.bytes |
| Shard sleep counters | <namespace>.shard.<n>.sleep-counters.* |
| Shard unlikely events | <namespace>.shard.<n>.unlikely.* |
The following metrics are global and do not include namespace prefixes:
- persistor.persist-event
- persistor.persist-snapshot
- persistor.get-journal
- persistor.get-latest-snapshot
- persistor.set-standing-query-state
- persistor.get-standing-query-states
- persistor.snapshot-sizes
- messaging.relayTell.*
- messaging.relayAsk.*
- dgn-reg.count
- shared.valve.<name>
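The naming rule can be sketched in Python as follows. This is an illustration only: the function `enterprise_name` and both prefix tuples are hypothetical names, derived from the namespaced categories in the table and the global metrics listed above. Names that match neither group are returned unchanged, since the text does not say how they are handled.

```python
# Categories that carry a namespace prefix, per the table above.
NAMESPACED_PREFIXES = ("node.", "standing-queries.", "ingest.", "shard.")

# Global metrics that never carry a namespace prefix, per the list above.
GLOBAL_PREFIXES = (
    "persistor.",
    "messaging.relayTell.",
    "messaging.relayAsk.",
    "dgn-reg.count",
    "shared.valve.",
)


def enterprise_name(metric: str, namespace: str = "default") -> str:
    """Return the namespaced form of a metric name, leaving global names as-is."""
    if metric.startswith(GLOBAL_PREFIXES):
        return metric
    if metric.startswith(NAMESPACED_PREFIXES):
        return f"{namespace}.{metric}"
    # Assumption: anything not covered by either group is left untouched.
    return metric
```

With these definitions, `enterprise_name("node.property-counts", "tenant1")` yields `tenant1.node.property-counts`, while `enterprise_name("persistor.persist-event")` is returned unchanged.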
Migrating from Quine
When migrating from Quine to Quine Enterprise, metric names change to support namespaces. This section helps you update your monitoring infrastructure (Grafana dashboards, InfluxDB queries, alerting rules) to work with Quine Enterprise.
Quine only supports the default namespace, so namespace prefixes are omitted from metric names. Quine Enterprise supports multiple namespaces, so the namespace is always included in metric names for consistency. For example, node.property-counts in Quine becomes default.node.property-counts in Quine Enterprise.
When migrating, add the default. prefix (or your target namespace) to namespaced metrics. Global metrics remain unchanged.
Updating Grafana Queries
When migrating Grafana dashboards, update your queries to include the namespace prefix. The exact syntax depends on your metrics reporter.
InfluxDB example:
    -- Quine query
    SELECT mean("value") FROM "node_property_counts" WHERE $timeFilter
    -- Quine Enterprise query
    SELECT mean("value") FROM "default_node_property_counts" WHERE $timeFilter
Prometheus example:
    # Quine query
    rate(node_property_counts[5m])
    # Quine Enterprise query
    rate(default_node_property_counts[5m])
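If you have many saved queries to update, the rewrite shown in the examples above can be scripted. The following Python sketch is one possible approach, not part of the product: the function `add_namespace` is hypothetical, and it assumes you supply the underscore-separated metric names to rewrite, as they appear in your existing queries.

```python
import re


def add_namespace(query: str, measurements, namespace: str = "default") -> str:
    """Prefix each listed metric name found in query with '<namespace>_'."""
    for name in measurements:
        # Match the name only when it is not part of a longer identifier,
        # so names that already carry a prefix are left alone.
        pattern = re.compile(
            r"(?<![A-Za-z0-9_])" + re.escape(name) + r"(?![A-Za-z0-9_])"
        )
        replacement = f"{namespace}_{name}"
        # A function replacement avoids interpreting backslashes in the text.
        query = pattern.sub(lambda match: replacement, query)
    return query
```

Because an already-prefixed name is preceded by an underscore, it no longer matches, so running the function twice over the same query does not add a second prefix. Pass only the namespaced metrics; global metrics keep their names.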
InfluxDB member_id Tag
When using InfluxDB as a metrics reporter, the member_id tag value differs between products:
| Product | member_id Value |
|---|---|
| Quine | quine |
| Quine Enterprise (single node) | Cluster member address (e.g., 127.0.0.1:25520) |
| Quine Enterprise (cluster) | Each member's address (e.g., 10.0.0.1:25520) |
Update any queries that filter by member_id to use the appropriate value for your deployment.
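As a small illustration of such a filter, the Python sketch below builds an InfluxQL tag condition for a given member_id value. The helper name `member_id_filter` is hypothetical, and the addresses used with it are the example values from the table above, not values your deployment will necessarily report.

```python
def member_id_filter(member_id: str) -> str:
    """Build an InfluxQL condition that matches one member_id tag value."""
    # InfluxQL tag values are single-quoted strings; escape backslashes
    # and single quotes so the value cannot break out of the literal.
    escaped = member_id.replace("\\", "\\\\").replace("'", "\\'")
    return f"\"member_id\" = '{escaped}'"
```

A Quine dashboard would use `member_id_filter("quine")`, whereas a Quine Enterprise dashboard would pass the member's address, for example `member_id_filter("127.0.0.1:25520")`, and append the result to the query's WHERE clause.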