Querying Infinite Data

How do you query an infinite amount of data? thatDot Connect is built for real-time operations on infinite datasets. That calls for a slightly different approach to data analysis.

Unless you are running in a debug environment that does not have much data, avoid queries that involve scanning the whole data store. The typical examples of queries that require this sort of scan are MATCH (n) RETURN n LIMIT 20 or MATCH (n) RETURN count(*). These queries are tempting to use for exploration on traditional databases, but the design of thatDot Connect calls for a different approach.

In a streaming system like thatDot Connect we have the advantage of having new incoming data in hand at the perfect moment. New data streaming in gives us the starting point needed to find other data efficiently. In a sense, the new data behaves like an index into the old data. However, when issuing queries to explore an existing dataset, you may not have that incoming new data to use as a starting point. So the first goal of an exploration is to efficiently find starting points for the data you want to explore. thatDot Connect can fetch data extremely efficiently when given the ID of the node in question.

Using IDs in a Query

The simplest way to avoid a scan is to make sure at least some part of a MATCH (n) … pattern has a constraint of the form id(n) = ... in the WHERE clause. This allows the query compiler to optimize the execution plan to begin with a hop to a node with a known ID instead of needing to consider every node ever seen as a potential starting point.

In the example query below, constraining id(person) is enough to avoid any sort of scan, since every other part of the pattern is connected to person.

// Get a particular person's name, and the names of their paternal grandparents
MATCH
  (person: Person)-[:has_father]->(dad: Person),
  (grandpa: Person)<-[:has_father]-(dad)-[:has_mother]->(grandma: Person)
WHERE id(person) = 7
RETURN
  person.name AS person,
  grandpa.name AS paternal_grandfather,
  grandma.name AS paternal_grandmother
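
The same anchoring can be applied to a handful of known IDs at once by binding each ID to a variable and constraining id(person) to it. The following is a minimal sketch, assuming numeric IDs like the 7 used above; the values 7, 8, and 9 are hypothetical, and whether the compiler plans this as direct hops is an assumption based on the id(n) = ... rule described earlier.

```cypher
// Hypothetical IDs: each one is intended to serve as a direct starting point
UNWIND [7, 8, 9] AS personId
MATCH (person: Person)-[:has_father]->(dad: Person)
WHERE id(person) = personId
RETURN
  person.name AS person,
  dad.name AS father
```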

Since the ID of a node is not usually known ahead of time, we recommend using the idFrom function to have thatDot Connect compute node IDs from data values. This is similar to a consistent-hashing approach, where a collection of values is hashed together to produce a deterministic result that can be used as an ID: the same values always yield the same ID. Since thatDot Connect supports many different kinds of IDs (numbers, UUIDs, strings, tuples of values, and more), idFrom produces consistent results appropriate for the dataset regardless of which ID type is used.

Consequently, the above example can be rewritten to find the ID using fields from the data:

// Get a particular person's name, and the names of their paternal grandparents
MATCH
  (person: Person)-[:has_father]->(dad: Person),
  (grandpa: Person)<-[:has_father]-(dad)-[:has_mother]->(grandma: Person)
WHERE id(person) = idFrom('person', $name) // $name: query parameter holding the person's name
RETURN
  person.name AS person,
  grandpa.name AS paternal_grandfather,
  grandma.name AS paternal_grandmother
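
Because idFrom is deterministic, the arguments used when a node is written can be reused later to find that node again. The two queries below are a sketch of that round trip, not a prescribed ingest pattern: the name 'Alice Example' is hypothetical, and it is an assumption that person nodes in your dataset are keyed by idFrom('person', name) and written with SET after an ID-anchored MATCH.

```cypher
// Write: address the node whose ID is derived from ('person', 'Alice Example')
MATCH (p)
WHERE id(p) = idFrom('person', 'Alice Example')
SET p:Person, p.name = 'Alice Example'
```

```cypher
// Read: the same arguments yield the same ID, so the lookup starts at that node
MATCH (p)
WHERE id(p) = idFrom('person', 'Alice Example')
RETURN p.name AS name
```

If the arguments differ in any way between the write and the read (a different prefix string, or a different spelling of the name), the computed ID differs as well and the read will not land on the node that was written.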

Finding Recently Accessed Nodes

Another common need is to pull out a small number of examples of a structure (for instance, to verify that data is being written in the desired graph structure). For those cases, the recentNodes procedure is a fast and efficient way to get back a sample of IDs of nodes that were recently modified or queried. The number passed to recentNodes indicates the desired number of elements to sample.

// Get 20 sample nodes
CALL recentNodes(20) YIELD node AS nodeId
MATCH (n) WHERE id(n) = nodeId
RETURN n

This approach can be used to anchor larger queries too, like the paternal grandparents query from above. This time, we don’t need to know the ID of any of the nodes in the pattern; we just need to constrain at least one of them to be from recentNodes.

// Sample the recent part of the graph looking for people and paternal grandparents
CALL recentNodes(1000) YIELD node AS personId
MATCH
  (person: Person)-[:has_father]->(dad: Person),
  (grandpa: Person)<-[:has_father]-(dad)-[:has_mother]->(grandma: Person)
WHERE id(person) = personId
RETURN
  person.name AS person,
  grandpa.name AS paternal_grandfather,
  grandma.name AS paternal_grandmother
LIMIT 20

This query takes a sample of up to 1000 recently accessed nodes and tries to match the pattern in the MATCH clause starting from each sampled ID. It returns results only for those IDs where the larger structure is actually present, stopping at 20 rows.

The same approach of sampling recently touched nodes can be used to quickly compute aggregate statistics over recent data. For instance, here is a query that samples the distribution of labels in recently created or accessed data:

// Count the number of each type of node label for 1000 recently accessed nodes
CALL recentNodes(1000) YIELD node AS nodeId
MATCH (n) WHERE id(n) = nodeId
RETURN labels(n), count(*)
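
To make the sampled distribution easier to read, the results can be sorted by frequency. This is a sketch using standard Cypher ORDER BY; the aliases nodeLabels and sampled are illustrative names. The counts describe only the sample returned by recentNodes, not the whole dataset.

```cypher
// Label distribution over a sample of recently accessed nodes, most common first
CALL recentNodes(1000) YIELD node AS nodeId
MATCH (n) WHERE id(n) = nodeId
RETURN labels(n) AS nodeLabels, count(*) AS sampled
ORDER BY sampled DESC
```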