Main Concepts

Introduction

thatDot Novelty lets users stream in categorical data and immediately learn how unusual that data is, along with an explanation of why it is unusual. Built-in visualization tools let you see how a single observation relates to your entire dataset. All of this runs on your own system, and no data ever leaves the machines you control.

Unlike other anomaly detection systems, thatDot Novelty Detector works on categorical data: non-numeric data such as names, identifiers, email addresses, IP addresses, status codes, natural language text, and other strings.

No training data or data labeling is required. Simply start it up, feed it data, and get results back in real time. The system adapts to the data you feed it. The anomaly scores returned are based on a model of all the data seen so far, so once a representative amount of data has been fed in, the scores will highlight the observations that are unusual relative to that data.
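As a rough illustration of scoring against "the data seen so far" without a training phase, here is a toy sketch in Python. It is not thatDot's model or API: the class name `ToyStreamingScorer`, the frequency-based surprise score, and the smoothing constants are all assumptions made for illustration only.

```python
import math
from collections import Counter


class ToyStreamingScorer:
    """Hypothetical toy scorer, NOT the thatDot Novelty algorithm.

    Scores each observation by how surprising it is given only the
    observations that arrived before it, then learns from it.
    """

    def __init__(self):
        self.counts = Counter()  # how often each exact observation was seen
        self.total = 0           # number of observations seen so far

    def observe(self, observation):
        key = tuple(observation)
        # Smoothed probability from prior data (the +1 / +2 smoothing is an
        # arbitrary choice so unseen observations get a finite score).
        probability = (self.counts[key] + 1) / (self.total + 2)
        score = -math.log2(probability)  # higher means more unusual
        # Update the model after scoring; there is no separate training step.
        self.counts[key] += 1
        self.total += 1
        return score


scorer = ToyStreamingScorer()
for _ in range(8):
    scorer.observe(["GET", "200"])          # feed in typical traffic

rare_score = scorer.observe(["DELETE", "500"])   # never seen before
common_score = scorer.observe(["GET", "200"])    # seen many times
print(rare_score > common_score)
```

The only point of the sketch is that scores depend on what has already been streamed in, which is why early scores are less informative than scores produced after a representative amount of data.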

What’s wrong with existing anomaly detection methods?

Traditional unsupervised anomaly detection methods, such as clustering (e.g. K-Means), Random Forests, Isolation Forest, and others, require converting all data to a numeric representation. This works well when the data is naturally numeric and there is a small set of features, but it becomes impractical when there is categorical data with more than a handful of possible values, because a common encoding such as one-hot adds one numeric dimension for every distinct value.
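To make the encoding problem concrete, the following Python sketch counts how many numeric columns a one-hot encoding would need for a small set of categorical rows. One-hot encoding is used here as an assumed, typical conversion; the function name `one_hot_width` and the sample rows are hypothetical.

```python
def one_hot_width(rows):
    """Return the number of numeric columns a one-hot encoding would need.

    Each field contributes one column per distinct value seen in that field.
    """
    n_fields = len(rows[0])
    distinct = [set() for _ in range(n_fields)]
    for row in rows:
        for i, value in enumerate(row):
            distinct[i].add(value)
    return sum(len(values) for values in distinct)


# Hypothetical sample data: email address, IP address, status code.
rows = [
    ["alice@example.com", "10.0.0.1", "200"],
    ["bob@example.com", "10.0.0.2", "404"],
    ["carol@example.com", "10.0.0.1", "200"],
]
width = one_hot_width(rows)
print(width)  # 3 emails + 2 IPs + 2 status codes = 7 columns
```

With fields like email addresses or IP addresses, the number of distinct values grows roughly with the size of the dataset, so the encoded width keeps growing as more data arrives.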

Terminology

  • observation – An observation is a list of strings fed into Novelty, e.g. [“my”, “sample”, “observation”]. The list can be any length, but all observations made in the same context should have the same length.
  • component – One observation is made up of many components. Each string in the observation is one component, e.g. “sample”.
  • context – A context is a named group of observations. Each observation in the same context should have the same structure. One running instance of Novelty can have any number of contexts. Each context is entirely separate from the others.
  • novelty – A measure of how anomalous an observation or component is. Unlike the terms “anomaly” or “anomalous”, which each tend to be binary (“it is or isn’t an anomaly”), novelty comes in shades of gray. Data can be more or less novel. The most novel is the most anomalous.
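The terms above can be sketched as a small Python data structure. This is only an illustration of the vocabulary, not the product's API: the class `ContextStore` and its methods are hypothetical, and the sketch turns the "should have the same length" guidance into a hard check as an assumption.

```python
from collections import defaultdict


class ContextStore:
    """Hypothetical container illustrating observations, components, contexts."""

    def __init__(self):
        # context name -> list of observations; contexts never share data
        self._observations = defaultdict(list)

    def add(self, context, observation):
        # An observation is a list of strings; each string is one component.
        if not all(isinstance(component, str) for component in observation):
            raise TypeError("every component must be a string")
        existing = self._observations[context]
        # Assumption: enforce equal length within a context.
        if existing and len(existing[0]) != len(observation):
            raise ValueError("observations in one context must have the same length")
        existing.append(list(observation))

    def count(self, context):
        return len(self._observations.get(context, []))


store = ContextStore()
store.add("logins", ["alice", "10.0.0.1", "success"])
store.add("logins", ["bob", "10.0.0.2", "failure"])
store.add("http", ["GET", "200"])  # a separate context with its own structure
print(store.count("logins"), store.count("http"))
```

Keeping each context in its own list mirrors the statement that contexts are entirely separate: observations of different shapes can coexist as long as they go into different contexts.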