Cluster Performance

Cluster Scaling

Vertical Scaling

The recommended size of a machine for an individual thatDot Streaming Graph host is between 8 and 32 CPU cores. Each machine should have around 16GB of memory when the Java heap is set to 12GB, or 20GB when the Java heap is set to 16GB. Streaming Graph sees diminishing returns from memory beyond these sizes, and larger heaps can also trigger long garbage collection pauses depending on the GC options chosen. When choosing machine size, consider cost versus performance: smaller hosts tend to be more performant per core, but may result in less-stable ingest rates. 16-core hosts are recommended for most applications.

Note

When using hot spares, it is strongly recommended that the Streaming Graph hosts are homogeneous. Any Streaming Graph machine can host a hot spare, and a hot spare must be capable of assuming the full workload of any other cluster member.

Horizontal Scaling

Adding additional thatDot Streaming Graph hosts to a cluster tends to improve processing speeds linearly with the number of hosts added. However, this can vary with data that is highly connected. In general, the less connected the data, the better the performance that can be achieved by scaling horizontally.

If the workload is well parallelizable, expect to see only a small drop in per-host efficiency when scaling horizontally. Testing has shown that, with a parallelizable workload, a cluster loses at most 2% efficiency when doubling in size. For example, a 100x cluster scale factor would only incur a 13.5% loss of efficiency.
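As a rough illustration of how that per-doubling loss accumulates: growing a cluster 100x takes about log2(100) ≈ 6.6 doublings, and 6.6 doublings at no more than 2% loss each works out to roughly 13%, consistent with the 13.5% figure above.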

The size of the persistor layer must be considered as well when determining the size of a Streaming Graph cluster. A good rule of thumb is a 4:1 ratio of Streaming Graph hosts to persistor hosts of roughly the same machine size. This provides an ideal balance of stability and cost. If cost efficiency is valued over stability, then an 8:1 ratio of Streaming Graph hosts to persistor hosts might be desirable. It is recommended to use high-memory machines for the persistor, especially when fewer persistor hosts are used, to ensure the persistor cluster as a whole has enough memory.

Throughput & Backpressure

thatDot Streaming Graph is a backpressured system. This means the system is deliberately designed to slow its data processing instead of crashing when one of its components is overwhelmed. While not every component of Streaming Graph can be accounted for when deciding when to slow down, several components are integrated into this process. For example, queries are broken down into individually runnable parts, with each part pulling partially processed results from the earlier parts of the same query only as fast as it is able to process them. Most ingest sources (for example, Kafka, Kinesis, and local files) natively backpressure, pulling in additional records for processing whenever Streaming Graph has capacity to do so. Standing query processing includes an internal results queue that all ingest streams monitor, so ingestion slows down if matches are being found faster than they can be processed and reported.

The backpressured design ensures system stability and responsiveness, but it can make performance tuning more challenging. The top-level metric for measuring the performance of a Streaming Graph cluster is the overall throughput of data, measured by summing the ingest rates of all ingest streams.

There are five categories of operational considerations that can affect throughput:

  1. CPU utilization
  2. Network latency
  3. Disk IO
  4. Available memory
  5. Data modeling choices

Tuning cluster performance is a matter of identifying which component(s) of the system is the current limiting factor, and then improving that component so that it is no longer the bottleneck. Repeat as necessary to achieve the desired throughput. The result of a successful adjustment is improved performance and a new bottleneck surfacing elsewhere in the system; there will always be a bottleneck somewhere, but each one can be improved in turn to support the desired cluster throughput. The ideal configuration varies according to the data source and use case, but the tuning process is complete when the configuration runs at the throughput needed for the given ingest sources.

The important performance factors to consider for each component of a Streaming Graph cluster are:

  • The overall CPU utilization of the Streaming Graph cluster members.
  • The overall CPU utilization of the persistor cluster.
  • The network throughput and latency into Streaming Graph, between Streaming Graph cluster members, and between the Streaming Graph and persistor clusters.
  • Disk IO on the persistor cluster.
  • Ensuring enough graph nodes are live in the cache to minimize unnecessary persistor reads.

Contributors to CPU usage:

  • Parsing of incoming data formats (e.g., JSON vs. Protobuf).
  • Number of ingest sources, and the parallelism setting for each.
  • Ingest query structure. One incoming event can have many effects, depending on the use case (see the sketch after this list).
  • The number of standing queries set on the cluster.
  • The number of standing query matches being produced from all standing queries.
  • The standing query output action configured for each standing query result.
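
To make the ingest-query point concrete, here is a minimal sketch of an ingest query in which a single incoming event touches multiple nodes. The record fields (userName, hostName, hostAddress) and the LOGGED_INTO edge are hypothetical, and the sketch assumes the usual convention of binding the ingested record to the $that parameter; idFrom derives a node ID from the given values.

    // Hypothetical ingest query: one login event updates two nodes and creates an edge,
    // so a single record results in several write operations against the graph.
    MATCH (user), (host)
    WHERE id(user) = idFrom('user', $that.userName)
      AND id(host) = idFrom('host', $that.hostName)
    SET user.name = $that.userName,
        host.address = $that.hostAddress
    CREATE (user)-[:LOGGED_INTO]->(host)

Each additional SET or CREATE in a query like this multiplies the CPU (and write) work done per ingested record.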

Contributors to network latency:

  • Cluster configuration that does not prioritize data locality, causing data and queries to be relayed to additional cluster members needlessly. Poor data locality also causes additional CPU load.
  • Large data payloads.

Contributors to disk IO:

  • Slow disk access (read and write) on persistor members.
  • Total writes being performed.
  • Persistor compaction.
  • Competing disk access from other applications or system processes.

Contributors to memory usage:

  • The number of nodes kept active in Streaming Graph’s cache (in Java heap space).
  • The existence of supernodes in the graph.
  • Network traffic can also use off-heap memory. This doesn’t affect performance, but it is worth noting for stability.

An ideal Streaming Graph deployment balances its components so that the limiting factor on the system is close to the capability of the other factors, and so that no type of resource (CPU, network, disk IO, RAM) is significantly underutilized because of the limits of another resource. Contact thatDot for assistance with throughput strategies, such as supernode mitigation.

Data Modeling

All data systems require a user-defined choice of how to represent and model the data in the system. Data modeling choices are generally not a consideration for the operating team, but they do affect operations. The typical approach is for the application architect to design the data model used by the ingest queries and standing queries, then scale the other factors (see the previous section) to support the chosen data model. The operating team can assist the application architect by reporting any surprising behavior or unexpected performance limitations that are not addressed by scaling CPU, network, and disk IO.

Contributors to data-modeling-induced throughput limitations:

  • Write amplification: for each element consumed from an ingest stream, how many operations are being performed with that element.
  • Highly focused access patterns: if some sections of the graph are highly trafficked, the workload on those sections can cause a delay in processing other parts of the graph, limiting throughput. This often shows up as lower CPU usage.
  • Not taking advantage of data locality when structuring the system configuration.

Data Locality

Note

For the best possible performance, it’s important to review your application’s needs and to determine a data locality scheme.

thatDot Streaming Graph distributes nodes across the cluster based on their IDs. Node IDs are customizable, and this enables performance tuning options when setting up a Streaming Graph cluster. If you know in advance that some types of nodes will share many connections in the graph model, and that common queries will traverse those connections, it makes sense to use the locIdFrom feature (a custom Cypher function) to ensure that those nodes are located on the same member. This reduces lookup and communication times because queries can limit network hops through the cluster.

locIdFrom is a position-aware analogue of the idFrom ID resolution function. Like idFrom, locIdFrom resolves an ID from an arbitrary sequence of query values. The difference is that locIdFrom takes an additional first argument (the position argument) which specifies which position in the cluster should be responsible for the returned ID. For example, locIdFrom(2, "hello", 7.5, "world") will always return the same ID: an ID owned by cluster position 2 whose value is derived from the query values “hello”, 7.5, and “world”. locIdFrom is particularly useful when combined with the kafkaHash function, which chooses a cluster position based on the same algorithm Apache Kafka uses by default for its partitioning. For example, kafkaHash("partition-key") returns the same integer as Kafka would use for a partition number when writing a record with the key “partition-key”.

Together, the locIdFrom / kafkaHash combination allows data pipeline engineers to guarantee that, so long as the number of Streaming Graph members is an even divisor or multiple of the number of Kafka partitions, all data from a given Kafka partition will be owned by the same Streaming Graph member. This is useful for ensuring that data is received in the same order as the Kafka topic, and for keeping related data together for efficient processing (i.e., minimal communication between cluster members).
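
As a minimal sketch of this combination (the record fields accountId and eventTime are hypothetical, and the query assumes the usual $that binding for the ingested record):

    // Hypothetical ingest query: pin each account node to the cluster position
    // that Kafka's default partitioner would derive from the record's key, so
    // records sharing a Kafka key are handled by the same member.
    MATCH (account)
    WHERE id(account) = locIdFrom(kafkaHash($that.accountId), 'account', $that.accountId)
    SET account.lastEventAt = $that.eventTime

With an ingest pipeline keyed this way, updates to a given account stay on one member, avoiding cross-member hops for the most common queries.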

A common implementation is to use Kafka as an ingestion source, aligning Kafka topics and partitions to Streaming Graph cluster members according to a plan for how data is likely to be distributed in the particular use case, while maximizing utilization of all Kafka brokers and Streaming Graph members across the partitions. Choosing a design that maximizes data locality can be complicated, especially when you are first getting started with Streaming Graph, so thatDot engineers are available to provide guidance and assistance.