Not all data is equally meaningful. Buried in a sea of mundane data often lies a record which would be very valuable to know about. These records differ from the rest of the data in meaningful ways, and as a result, they are very valuable to identify as quickly as possible.
Finding these “diamonds-in-the-rough” is challenging even when the data is static, because doing so requires using context to understand what is anomalous and what isn’t. After all, how novel is an observation that you’ve never seen before? That will depend on what you’ve seen previously and what the data means. Finding and scoring these novel data observations in a streaming context is even more challenging due to the incremental nature of streaming systems and the high-throughput requirements often demanded of stream processing systems.
This is the problem we created thatDot Novelty Scoring to solve. Novelty Scoring operates in an entirely streaming fashion and scores each item as it streams through, so that other systems can understand how novel each piece of data is. We do this by building stateful, compressed, graphical models of all incoming data, using thatDot Connect under the hood.
There are two ways to pass data into the Novelty Scoring subsystem and receive results: 1.) as a procedure call in a query, and 2.) through the REST API. The REST API is ideal when interacting from outside a thatDot Connect system using typical REST conventions. Procedure calls are useful when combining the novelty subsystem with other thatDot Connect capabilities like Standing Queries.
The Novelty Scoring subsystem is integrated with the Cypher language available with other aspects of thatDot Connect. To call the Novelty Scoring system in Cypher, use the procedures novelty.observe and novelty.read. For example, CALL novelty.read("my-context", ["aces", "diamonds", "spades"]) uses the Cypher procedure syntax to read the observation ["aces", "diamonds", "spades"] from the context my-context.
The observe procedure makes a single data observation. It updates the underlying model and computes novelty scores.
observe takes two arguments: 1.) the name of the novelty context, 2.) an observation consisting of a list of values which together represent a single observation.
The read procedure takes the same arguments as observe, but reads the scores without updating the underlying model.
The novelty context (1) is a user-chosen name of the model used to separate one model from another. thatDot Connect can support any number of models in the same system as long as their context names are distinct.
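To make the observe/read distinction concrete, here is a minimal Python sketch. It is a toy frequency counter, not thatDot's model or API: the class ToyNoveltyStore, its methods, and the idea of scoring against prior data before updating are all assumptions made for illustration. It only mirrors the contract described above: observe changes the state of its named context, read does not, and contexts with distinct names never share state.

```python
from collections import Counter


class ToyNoveltyStore:
    """Hypothetical stand-in for a set of named novelty contexts.

    Assumption: each context is a plain counter of exact observations.
    The real system builds a much richer model than this.
    """

    def __init__(self):
        # One independent counter per context name.
        self._contexts = {}

    def read(self, context, observation):
        """Return the fraction of prior observations equal to this one.

        Nothing is stored, so repeated reads give the same answer.
        """
        counts = self._contexts.get(context, Counter())
        total = sum(counts.values())
        return counts[tuple(observation)] / total if total else 0.0

    def observe(self, context, observation):
        """Score against prior data, then record the observation."""
        result = self.read(context, observation)
        self._contexts.setdefault(context, Counter())[tuple(observation)] += 1
        return result


store = ToyNoveltyStore()
hand = ["aces", "diamonds", "spades"]

first = store.observe("cards", hand)        # updates the "cards" context
before = store.read("cards", hand)          # does not update anything
again = store.read("cards", hand)           # same answer as the previous read
other = store.read("other-context", hand)   # a separate, still empty context
```

In this toy, the two reads return the same value because read never touches the counter, and the lookup in "other-context" sees none of the data observed under "cards".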
The observation (2) is passed in to the observe and read procedures as the second argument. The observation is a list of any length, and its values may be of any type. The values in this list are currently all treated as strings, so there is no distinction between the following two observations:
['foo', '5', 'bar'] == ['foo', 5, 'bar']
However, the equivalence of numbers and strings is likely to change in later releases.
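Because the number/string equivalence may change, one option is to convert values to strings on the client side before sending an observation, so that results do not depend on how a given release treats numbers. The following Python sketch shows the idea; stringify_observation is a hypothetical helper, not part of thatDot Connect, and Python's str() is only an assumption about formatting (for example, it renders 5.0 as "5.0", which may differ from how the system itself would render a floating-point value).

```python
def stringify_observation(values):
    """Convert every component of an observation to its string form.

    Hypothetical client-side helper; uses Python's str() for formatting.
    """
    return [str(v) for v in values]


mixed = ["foo", 5, "bar"]
strings = ["foo", "5", "bar"]

# After conversion the two observations are identical lists of strings.
same = stringify_observation(mixed) == stringify_observation(strings)
```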
The observe and read procedures return the following values:
observation- This is the same value passed in to produce the output. It is returned here only for reference.
score- The score is the total calculation of how novel the particular observation is. The value is always between 0 and 1, where 0 is entirely normal and not anomalous, and 1 is highly novel and clearly anomalous. The score is the result of a complex analysis of the observation and other contextual data. In contrast to the next field, this score is weighted primarily by the novelty of individual components of the observation. Depending on the dataset and corresponding observation structure (see Step 2), real-world datasets will often produce exponentially fewer results at higher scores. Practically, this often means that 0.99 is a reasonable threshold for finding only the most anomalous results, and 0.999 is likely to return half as many results. But to reiterate, the actual values and results will depend on the data and observation structure.
totalObsScore- While the score field is biased toward novel components, the totalObsScore field is a similar computation applied to all components of the entire observation. One of the practical uses of this field is when using thatDot Anomaly Detector for finding “anti-anomalies”: data which is very typical.
sequence- Each observation passed into thatDot Anomaly Detector is given a unique sequence number. This value represents a total order for all observations and can be used to explore the data visualization as it was at the time when this observation was observed.
probability- This field represents the probability of seeing this entire observation (exactly) given all previous data when the observation was made.
uniqueness- A value between 0 and 1 which indicates how surprising this entire observation is, given all previously observed data. A value of 1 means that this observation has never been seen before (in its entirety). Values approaching 0 indicate that this observation is incredibly common.
infoContent- The “Information Content”, “Shannon Information”, or “self-information” contained in this entire observation, given all prior observations. This value is measured in bits, and is an answer to the question: on average, how many “yes/no” questions would I need to ask to identify this observation, given this and all previous observations made to the system?
mostNovelComponent- An object describing which component of the observation was the most novel.
mostNovelComponent.index- Which component in the list from the observation field was the most novel. This value is the index into that list, and is zero-indexed.
mostNovelComponent.value- The string from the observation field which is the most novel component. This is the value you would find by extracting the component at the position given by mostNovelComponent.index.
mostNovelComponent.novelty- An abstract measure of how novel this one particular (most novel) component is. The maximum theoretical value of this field is equivalent to the value in the infoContent field. This field is not directly a measure of information content, however. Instead it is weighted by many additional factors. The ratio of mostNovelComponent.novelty to infoContent will always be between 0 and 1 and will explain how much of the total infoContent is attributable to this particular component.
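The fields above can be combined when post-processing results. The Python sketch below works on a hand-written record shaped like the fields just described; the numbers are made up for illustration and are not real system output, and the helper names are hypothetical. The self-information helper uses the textbook definition, -log2(p); whether the system's infoContent field equals exactly that function of its probability field is an assumption here.

```python
import math

# Illustrative record only: field names follow the list above,
# values are invented for this sketch.
result = {
    "observation": ["aces", "diamonds", "spades"],
    "score": 0.995,
    "probability": 0.125,
    "infoContent": 3.0,
    "mostNovelComponent": {"index": 2, "value": "spades", "novelty": 1.5},
}


def is_anomalous(record, threshold=0.99):
    """Keep only records whose score reaches the chosen threshold."""
    return record["score"] >= threshold


def self_information_bits(probability):
    """Textbook Shannon self-information, in bits, of an event with this probability."""
    return -math.log2(probability)


def novelty_share(record):
    """Fraction of infoContent attributed to the most novel component."""
    info = record["infoContent"]
    return record["mostNovelComponent"]["novelty"] / info if info else 0.0


component = result["mostNovelComponent"]
# The value field is the observation entry at the zero-based index.
looked_up = result["observation"][component["index"]]
```

With this record, the default threshold of 0.99 keeps the observation while a stricter 0.999 drops it, and the most novel component accounts for half of the observation's information content.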
Putting this all together, a few examples of a query issued to the Novelty Scoring system via a procedure call:
CALL novelty.observe("foo", ["a", "b", "c"])
CALL novelty.observe("any name you choose", ["one", 2, "3"]) YIELD mostNovelComponent RETURN mostNovelComponent.value
MATCH (this)-[:connected_to]->(that) CALL novelty.observe(this.contextName, that.dataList) YIELD score AS sc, uniqueness AS un RETURN sc / un
The REST API is conceptually the same as the procedure call; only the interface differs. Full documentation for using the novelty system via the REST API can be found alongside the rest of the REST API documentation shipped with each instance of thatDot Connect.