Choose Your Data Structure

We’ve worked hard to make thatDot Novelty Detector very simple to use. Pairing the power and simplicity of this system with the many varieties of data in the world requires one last decision: you, the data owner, must decide how to structure the data you feed into the system. In general, this means making an array of strings. But your choices about the content and order of that data affect how valuable the results are. Choosing the content and order is how you define which question you want to analyze and monitor with thatDot Novelty Detector.

Contextual Learning

Under the hood, thatDot Novelty Detector builds a rich graphical model based on the data you stream in. That model is contextually tailored to each component of an observation. Practically, that means the system learns what represents normal vs. novel behavior for each component in an observation, given the components that come before it in that observation. So the order in which you feed in data is relevant to answering different kinds of questions.

Choosing Your First Observation Structure

If you’re just getting started, a good “rule-of-thumb” approach is to choose the values from your data that you believe relate to your question. Then arrange those values in ascending order of their expected cardinality. For instance, if geographic location is relevant to your topic, then you would want to include country before city. Since there are only about 200 countries in the world, and tens of thousands of cities, the cardinality of country is much lower and should come first. The system would then learn a “fingerprint” for what is normal in Athens, Greece that is entirely separate from the fingerprint learned for Athens, Georgia in the United States.

If you choose more values than you need, it won’t harm the results, but it will probably require more data to produce useful explanations, though this will depend on the actual data itself.
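
The rule of thumb above can be sketched in a few lines. This is an illustrative example only, not part of the Novelty Detector API: the record fields and cardinality estimates are hypothetical, and the output is simply the array of strings you would feed into the system.

```python
# Hypothetical source record; field names are illustrative.
record = {
    "city": "Athens",
    "country": "Greece",
    "user_id": "u-1842",
}

# Rough expected cardinality per field (lower = fewer distinct values).
expected_cardinality = {
    "country": 200,        # ~200 countries in the world
    "city": 50_000,        # tens of thousands of cities
    "user_id": 1_000_000,  # many distinct users
}

# Build the observation: an array of strings, ordered by ascending
# expected cardinality, per the rule of thumb above.
observation = [
    str(record[field])
    for field in sorted(record, key=expected_cardinality.get)
]

print(observation)  # ['Greece', 'Athens', 'u-1842']
```

Note that country, the lowest-cardinality value, lands first, so everything after it is learned in the context of that country.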

Example observation structures for common use cases

  • Operational Security – [user_id, service_name, access_location, access_type_read_or_write, path_accessed, response_code]
  • Network Optimization – [country, region, status_code, cache_status, server_ip_address, client_subnet]
  • E-Commerce Intelligence – [web_property, region, demographic_profile, previously_viewed_product_id, product_id_purchased]
  • Log Analysis and Reduction – [application, hostname, function_call, status_code]
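
As a concrete illustration of the “Log Analysis and Reduction” structure above, the sketch below projects raw log records onto that structure. The log records and field names here are hypothetical, assumed only for the example; the point is that every observation is an array of strings in a fixed, deliberate order.

```python
# The chosen observation structure, in the chosen order.
LOG_STRUCTURE = ["application", "hostname", "function_call", "status_code"]

def to_observation(log_record: dict) -> list[str]:
    """Project a raw log record onto the structure, as an array of strings."""
    return [str(log_record[field]) for field in LOG_STRUCTURE]

# Hypothetical raw log records.
logs = [
    {"application": "billing", "hostname": "web-01",
     "function_call": "charge_card", "status_code": 200},
    {"application": "billing", "hostname": "web-02",
     "function_call": "charge_card", "status_code": 500},
]

observations = [to_observation(rec) for rec in logs]
print(observations[1])  # ['billing', 'web-02', 'charge_card', '500']
```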

Observation Order in Depth

The order of components within an observation is used to generate conditional probabilities under the hood. For each component, the order determines which values are treated as given.

For example: in the “Log Analysis and Reduction” example above, the observation structure given has four components: [application, hostname, function_call, status_code]. This is like asking the following four questions:

  1. Given all data observed so far, what is the chance of seeing this application name?
  2. Given all data observed so far, and given this particular application name, what is the chance of seeing this hostname value?
  3. Given all data observed so far, and given these particular application and hostname values, what is the chance of seeing this particular function_call value?
  4. Given all data observed so far, and given these particular application, hostname, and function_call values, what is the chance of seeing this particular status_code value?

These conditional probabilities are the first step in the underlying algorithm. Once you choose the order, the rest of the process requires no additional choices or interaction. If you’d like to experiment with different orderings to see how they affect the results, you can do so in the same novelty detector instance by feeding the reordered observations into a different novelty context as described in Step 2.
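
The four questions above can be illustrated with a toy counting sketch. This is not thatDot’s actual algorithm (the real model is far richer); it only shows how each component’s probability is conditioned on the components before it, using simple prefix counts over hypothetical observations.

```python
from collections import Counter

# Hypothetical observations in the [application, hostname,
# function_call, status_code] structure.
observations = [
    ["billing", "web-01", "charge_card", "200"],
    ["billing", "web-01", "charge_card", "200"],
    ["billing", "web-02", "charge_card", "500"],
    ["search",  "web-03", "query",       "200"],
]

# Count every prefix of every observation, including the empty prefix.
prefix_counts = Counter()
for obs in observations:
    for i in range(len(obs) + 1):
        prefix_counts[tuple(obs[:i])] += 1

def conditional_probs(obs):
    """P(component_i | components before it), one value per position."""
    return [
        prefix_counts[tuple(obs[: i + 1])] / prefix_counts[tuple(obs[:i])]
        for i in range(len(obs))
    ]

# The four questions, answered for one observation:
probs = conditional_probs(["billing", "web-02", "charge_card", "500"])
print(probs)  # [0.75, 0.333..., 1.0, 1.0]
```

Here the first value answers question 1 (3 of 4 observations start with "billing"), the second answers question 2 (only 1 of those 3 used "web-02"), and so on down the chain.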