Novelty Scoring (beta)

Introduction

Not all data is equally meaningful. Buried in a sea of mundane data there often lies a record that would be very valuable to know about. Such records differ from the rest of the data in meaningful ways, and as a result they are worth identifying as quickly as possible.

Finding these “diamonds-in-the-rough” is challenging even when the data is static, because doing so requires using context to understand what is anomalous and what isn’t. After all, how novel is an observation that you’ve never seen before? That will depend on what you’ve seen previously and what the data means. Finding and scoring these novel data observations in a streaming context is even more challenging due to the incremental nature of streaming systems and the high-throughput requirements often demanded of stream processing systems.

This is the problem thatDot Novelty Scoring was created to solve. Novelty Scoring from thatDot operates in an entirely streaming fashion, scoring each item as it streams through so that other systems can understand how novel each piece of data is. We do this by building stateful, compressed, graphical models of all incoming data, using thatDot Connect under the hood.

Usage

There are two ways to pass data into the Novelty Scoring subsystem and receive results: 1.) as a procedure call in a query, and 2.) through the REST API. The REST API is ideal when interacting with a thatDot Connect system from the outside using typical REST conventions. Procedure calls are useful when combining the novelty subsystem with other thatDot Connect capabilities like Standing Queries.

Procedure Calls

The Novelty Scoring subsystem is integrated with the Cypher language available throughout thatDot Connect. To call the Novelty Scoring system from Cypher, use the procedures beta.novelty.observe and beta.novelty.read. For example, CALL beta.novelty.read("my-context", ["aces", "diamonds", "spades"]) uses the Cypher procedure syntax to read the observation ["aces", "diamonds", "spades"] from the context my-context.

The observe procedure makes a single data observation: it updates the underlying model and computes novelty scores. observe takes two arguments: 1.) the name of the novelty context, and 2.) an observation, expressed as a list of values which together represent a single observation.

The read procedure takes the same arguments as observe, but reads the scores without updating the underlying model.
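For instance, the sketch below (the context name and values are purely illustrative) first records an observation with observe and then re-scores the same observation later with read, leaving the model unchanged:

CALL beta.novelty.observe("login-demo", ["alice", "10.0.0.1", "success"])

CALL beta.novelty.read("login-demo", ["alice", "10.0.0.1", "success"])
YIELD score, uniqueness
RETURN score, uniqueness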

The novelty context (1) is a user-chosen name of the model used to separate one model from another. thatDot Connect can support any number of models in the same system as long as their context names are distinct.
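For example, two independent models can be maintained in the same system simply by using distinct context names (the context names and values below are illustrative):

CALL beta.novelty.observe("web-logins", ["alice", "10.0.0.1", "success"])
CALL beta.novelty.observe("dns-queries", ["laptop-7", "example.com", "A"])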

The observation (2) is passed to the observe or read procedures as the second argument. The observation is a list of values of any type and of any length. The values in this list are currently all treated as strings, so there is no distinction between the following two observations: ['foo', '5', 'bar'] == ['foo', 5, 'bar']. However, the equivalence of numbers and strings is likely to change in later releases.
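To illustrate the current string treatment, the two read calls below (against a hypothetical context named types-demo) produce identical results, because 5 and '5' are not distinguished and read does not update the model:

CALL beta.novelty.read("types-demo", ['foo', '5', 'bar']) YIELD score, probability RETURN score, probability
CALL beta.novelty.read("types-demo", ['foo', 5, 'bar']) YIELD score, probability RETURN score, probability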

The observe and read procedures return the following values:

  • observation - This is the same value passed in to produce the output. It is returned here only for reference.
  • score - The score is the total calculation of how novel the particular observation is. The value is always between 0 and 1, where zero is entirely normal and not anomalous, and one is highly novel and clearly anomalous. The score is the result of a complex analysis of the observation and other contextual data. In contrast to the next field, this score is weighted primarily by the novelty of individual components of the observation. Depending on the dataset and corresponding observation structure (see the description of the observation argument above), real-world datasets will often see this score distributed with exponentially fewer results at higher scores. Practically, this often means that 0.99 is a reasonable threshold for finding only the most anomalous results, and 0.999 is likely to return half as many results (see the example query after this list). But to reiterate, the actual values and results will depend on the data and observation structure.
  • totalObsScore - While the score field is biased toward novel components, the totalObsScore field is a similar computation applied to all components of the entire observation. One practical use of this field is finding “anti-anomalies”: data which is very typical.
  • sequence - Each observation passed into the Novelty Scoring system is given a unique sequence number. This value represents a total order over all observations and can be used to explore the data visualization as it was at the time when this observation was observed.
  • probability - This field represents the probability of seeing this entire observation (exactly) given all previous data when the observation was made.
  • uniqueness - A value between 0 and 1 which indicates how surprising this entire observation is, given all previously observed data. A value of 1 means that this observation has never been seen before (in its entirety). Values approaching 0 indicate that this observation is incredibly common.
  • infoContent - The “Information Content”, “Shannon Information”, or “self-information” contained in this entire observation, given all prior observations. This value is measured in bits, and is an answer to the question: on average, how many “yes/no” questions would I need to ask to identify this observation, given this and all previous observations made to the system?
  • mostNovelComponent - An object describing which component of the observation was the most novel.
    • mostNovelComponent.index - Which component in the list from the observation field was the most novel. This value is the index into that list, and is zero-indexed.
    • mostNovelComponent.value - The string from the observation field which is the most novel component. This is the value you would find by extracting the component at position index from the observation array.
    • mostNovelComponent.novelty - An abstract measure of how novel this one particular (most novel) component is. The maximum theoretical value of this field is equivalent to the value in the infoContent field. This field is not directly a measure of information content, however. Instead it is weighted by many additional factors. The ratio of novelty over infoContent will always be between 0 and 1 and will explain how much of the total infoContent is attributable to this particular component.
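Putting the return values together, a sketch like the following (the context name, observation values, and 0.99 threshold are illustrative) yields the documented fields, keeps only the most anomalous observations, and reports the share of infoContent attributable to the most novel component:

CALL beta.novelty.observe("my-context", ["alice", "login", "10.0.0.1"])
YIELD observation, score, totalObsScore, sequence, probability, uniqueness, infoContent, mostNovelComponent
WHERE score > 0.99
RETURN observation, score, mostNovelComponent.value AS novelValue, mostNovelComponent.novelty / infoContent AS noveltyShare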

Examples

Putting this all together, here are a few examples of queries issued to the Novelty Scoring system via a procedure call:

CALL beta.novelty.observe("foo", ["a", "b", "c"])
CALL beta.novelty.observe("any name you choose", ["one", 2, "3"]) 
YIELD mostNovelComponent 
RETURN mostNovelComponent.value
MATCH (this)-[:connected_to]->(that)
CALL beta.novelty.observe(this.contextName, that.dataList) 
YIELD score as sc, uniqueness as un
RETURN sc / su

Calling the REST API

The REST API is conceptually the same as the procedure call; only the interface differs. Full documentation for using the novelty system via the REST API can be found alongside the rest of the REST API documentation shipped with each instance of thatDot Connect.

Novelty Scoring In Beta

Use of the Novelty Scoring system during its Beta period is subject to some caveats:

  • Many other detailed computations and aspects of the system are available, but not necessarily documented or easily discovered. Contact us to learn more and discuss specifics.
  • Throughput per novelty context is limited and not entirely scalable, though some mitigations exist. The current implementation linearizes all observations fed into a single novelty context so that they have a total ordering. This requires that all observations to the same context go through a single bottleneck during processing. After this first step of novelty calculation, the rest of the computation occurs in parallel. In practice, this limits the throughput of a single novelty context to approximately 2,000 observations per second, depending on the underlying hardware. Short-term mitigations to support higher throughput exist, and future versions of the Novelty Scoring system are planned to avoid this limitation.
  • Rapid successive observations (including via the “bulk” endpoint) may produce rare errors in the computed scores.
  • To make good use of the novelty system during beta, we are exposing more of the internal state and calculations so that expert users can make their own interpretations of the intermediate stages of novelty calculation.
  • Visualization of and custom processing with the underlying model is available through thatDot Connect.