Troubleshooting Ingest Queries

Missing Data

If the graph doesn’t appear to include data that you know was ingested, there are several potential causes.

1. Unstable or mismatched id selection

A common pattern in Quine is to create complex graph structures from multiple disparate sources of data. To ensure that desired data relationships are created, validate that the ids from the different data sources are being handled consistently. Here’s an example.

Example order ingest query

WITH $that AS orderData
MATCH (order)
WHERE id(order) = idFrom("order", orderData.orderId)
SET order:Order,
    order.qty = orderData.quantity,
    order.partNumber = orderData.part

Example customer data ingest query

WITH $that AS customerData
MATCH (customer), (order)
WHERE id(customer) = idFrom("customer", customerData.customerNumber)
      AND id(order) = idFrom(customerData.myOrder)
SET customer:Customer,
    customer.name = customerData.name
CREATE (customer)-[:ordered]->(order)

In the above example, the two ingest queries compute order ids differently. The order ingest query prefixes the id for orders with the string "order" to help ensure that there aren’t any collisions with ids for other types of nodes. The customer ingest query, however, calls idFrom on customerData.myOrder alone, without that prefix, so it resolves to a different node.

The outcome of this misconfiguration is that no order node has an associated customer node, and every customer node is connected to an otherwise empty order node.
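A minimal sketch of a corrected customer ingest query follows. It assumes the only change needed is to pass the same "order" prefix that the order ingest query uses, and that customerData.myOrder carries the same value, in the same type, as orderData.orderId.

```cypher
// Customer ingest query with the order id computed the same way
// as in the order ingest query.
WITH $that AS customerData
MATCH (customer), (order)
WHERE id(customer) = idFrom("customer", customerData.customerNumber)
      // Same "order" prefix as the order ingest query, so both
      // queries resolve to the same order node.
      AND id(order) = idFrom("order", customerData.myOrder)
SET customer:Customer,
    customer.name = customerData.name
CREATE (customer)-[:ordered]->(order)
```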

2. Ingest race conditions

Due to the distributed nature of Quine, ingest queries must be carefully crafted to avoid race conditions. Here’s an example.

Example customer data ingest query

WITH $that AS customerData
MATCH (customer)
WHERE id(customer) = idFrom("customer", customerData.customerNumber)
SET customer:Customer,
    customer.name = customerData.name

Example order ingest query

WITH $that AS orderData
MATCH (order), (customer)
WHERE id(order) = idFrom("order", orderData.orderId)
      AND customer.name = orderData.forName
SET order:Order,
    order.qty = orderData.quantity,
    order.partNumber = orderData.part
CREATE (customer)-[:ordered]->(order)

In the above example, the order ingest query is attempting to create relationships between orders and customers. However, this query assumes that a certain customer node exists when this order data gets ingested. Since ingests are run in parallel, this might not be the case.

The outcome of this misconfiguration is that some expected order data might not show up in the graph at all. If the intended customer node hasn’t yet been created when a specific order is ingested, the ingest query matches nothing and silently fails, so no data from that record is written to the graph.
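One way to remove the ordering dependency, sketched here under assumptions, is to anchor the customer by id instead of by a property. The sketch assumes that each order record carries the customer’s number in a field named customerNumber, which is a hypothetical field not shown in the example above. Because the id is computed from the record itself, the match shouldn’t depend on the customer record having been ingested first.

```cypher
// Order ingest query that anchors both nodes by id.
WITH $that AS orderData
MATCH (order), (customer)
WHERE id(order) = idFrom("order", orderData.orderId)
      // orderData.customerNumber is a hypothetical field, assumed
      // here for illustration only.
      AND id(customer) = idFrom("customer", orderData.customerNumber)
SET order:Order,
    order.qty = orderData.quantity,
    order.partNumber = orderData.part
CREATE (customer)-[:ordered]->(order)
```

With this shape, the edge can be written whichever record arrives first, and the customer ingest query later sets the label and name on the same customer node.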

Slow Ingests

1. Parallelism configuration

Quine uses a default parallelism value of 16 for ingest sources. This value can be adjusted to potentially increase throughput. The correct value depends on many factors that are outside the scope of this document (your cluster configuration, data complexity, query structure). However, experimenting with this value may result in improved ingest rates.

2. Poorly optimized ingest queries

Query optimization is a broad topic outside the scope of this document. However, here are some guidelines to consider when writing performant ingest queries.

Anchor by IDs when possible

Consider the following two ingest queries.

Query 1

WITH $that as orderData  
MATCH (order), (customer)  
WHERE id(order) = idFrom(orderData.orderId)  
      AND customer.name = orderData.forName  
…

Query 2

WITH $that as orderData  
MATCH (order), (customer)  
WHERE id(order) = idFrom(orderData.orderId)  
      AND id(customer) = idFrom(orderData.customerId)  
…

In the first query, Quine will have to perform an all-node scan (think full table scan in a relational database) to find matching customer data. This is because the query matches on an arbitrary property of the customer node (in this case, name). In the second query, Quine is able to anchor on specific node IDs (think index lookup in a relational database).

Consider supernodes

A supernode is any node that has a significant number of half-edges. These nodes require special consideration to ensure that Quine meets performance expectations.

Data locality issues

Ingest performance can be impacted by the number of messages that require coordination across more than one host in the cluster. thatDot recommends structuring ingest queries to minimize coordination between cluster hosts.

3. Standing query bottlenecks

Quine is a fully backpressured system. This means that if standing queries are doing a lot of work, ingest rates may slow (or even halt entirely). Refer to the standing query optimization guide to improve the performance of standing queries.

Other Ingest Errors

Look at the log output from Quine and follow the trail.
