Finding and analysing anomalies purely by observing data, without previously characterised knowledge (unsupervised learning), gives the scientific community results that are both fascinating and challenging. The unsupervised learning school is based on the assumption that abnormal observations are fundamentally different from the majority of the population under study, and that those differences, or abnormalities, can be discovered and categorised by studying their statistical properties.
However, the reality of the world is rarely pure and never simple. What truly constitutes abnormality usually depends on many circumstances, which presents captivating challenges to those who set out to build knowledge acquisition devices. In the IoT ecosystem, the volatility of the environment pushes those challenges to the next level.
So, can systems be built where the idea of "normal" is flexible enough to understand change as part of normality?
Finally, we submitted our paper “Classification of Device Behaviours in Industrial IoT Networks, towards distinguishing the abnormal from security threats.” Over the last four weeks, my good friend Dr. Paul Stacey and I have been working against the clock to consistently summarise our discussions, hypotheses, experiment results and conclusions. Thank you, Paul, for your time, talent and effort.
Our proposed diagnostic system attempts to identify and describe changes in individual features and in the dependencies between those features. Our work proposes a spatial-temporal method to characterise network behaviour, find anomalies, and calculate their similarity with previously identified behaviour. Our thesis is that by calculating the entropy and dispersion coefficient, we can generate 2D shapes that contain vital information describing the behaviour of each individual feature and the dependency between features. We propose that any possible sensor, smart thing, community of things or network flow behaviour can be represented as a 2D shape. The area of the generated shape, and its form and position in a 2D Euclidean plane, are defined by the dispersion values and the dependency between those values. The problem is then restricted to finding the time windows whose shapes evolve towards an abnormal figure compared with the rest, or with what is expected.
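To make the idea of a behaviour shape more concrete, here is a minimal sketch in Python. It is not the implementation from the paper. It assumes that each feature in a time window contributes one vertex, with Shannon entropy on one axis and the variance-to-mean ratio standing in for the dispersion coefficient on the other, and that the vertices are joined into a polygon whose area is computed with the shoelace formula. The function names and the choice of dispersion measure are assumptions made for illustration.

```python
import math
from collections import Counter
from statistics import mean, pvariance


def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of values."""
    n = len(values)
    probs = [count / n for count in Counter(values).values()]
    return sum(-p * math.log2(p) for p in probs)


def index_of_dispersion(values):
    """Variance-to-mean ratio, used here as an assumed dispersion coefficient."""
    m = mean(values)
    return pvariance(values) / m if m else 0.0


def window_shape(window):
    """Map each numeric feature of one time window to an (entropy, dispersion) vertex."""
    return [(shannon_entropy(v), index_of_dispersion(v)) for v in window.values()]


def polygon_area(points):
    """Area of the polygon through the points, ordered by angle around the centroid."""
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    ordered = sorted(points, key=lambda p: math.atan2(p[1] - cy, p[0] - cx))
    area = 0.0
    for (x1, y1), (x2, y2) in zip(ordered, ordered[1:] + ordered[:1]):
        area += x1 * y2 - x2 * y1  # shoelace term for one edge
    return abs(area) / 2.0
```

Calling `polygon_area(window_shape(window))` on successive time windows gives one number per window, so a window whose area departs sharply from the rest is a candidate for closer inspection. The form and position of the shape carry more information than the area alone.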
The central idea is the characterisation of any behaviour as a 2D shape. This idea is supported by the following conjectures. Firstly, many important kinds of traffic anomalies cause changes to the distribution of key features observed in traffic, and consequently to the graphical shape generated. Secondly, different types of attack cause different types of anomalies, and those anomalies can be studied and categorised based on the distances between the shapes generated. Each of the anomalies found in our experiment affects the distribution of certain traffic features. In some cases, feature distributions become more dispersed, as when the distribution of source addresses changes dramatically in denial-of-service (DoS) attacks, or when ports are scanned for vulnerabilities. In other cases, feature distributions become concentrated on a small set of values, such as when a single source sends a large number of packets to a single destination in an unusually high-volume flow.
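As an illustration of categorising by distance, the sketch below compares a new shape against shapes of previously labelled behaviours and returns the closest label. The distance metric here is the Hausdorff distance between vertex sets, which is an assumption chosen for the example and not necessarily the metric used in the paper. The labels and coordinates in the usage note are hypothetical toy values.

```python
import math


def hausdorff(a, b):
    """Symmetric Hausdorff distance between two sets of 2D points."""
    def directed(p, q):
        # Largest distance from a point in p to its nearest point in q.
        return max(min(math.dist(u, v) for v in q) for u in p)
    return max(directed(a, b), directed(b, a))


def closest_behaviour(shape, known):
    """Return the label of the known shape nearest to the given shape."""
    return min(known, key=lambda label: hausdorff(shape, known[label]))
```

For example, with `known = {"scan": [(0.9, 0.8), (1.0, 0.9)], "flood": [(0.1, 0.1), (0.2, 0.1)]}`, the call `closest_behaviour([(0.15, 0.1)], known)` picks `"flood"`, because every vertex of that shape lies close to the query.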
Unfortunately, identifying anomaly shapes for each specific situation, and classifying them accordingly, is challenging. The distribution of traffic features is a high-dimensional object, and so can be difficult to work with directly. However, in most cases it is possible to extract very useful information from the degree of dispersal or concentration of each distribution, and from which variables change their distribution at the same time compared with those that remain stable. In some cases, the fact that one group of features became dispersed while another group became concentrated is a strong signal, useful both for detecting the anomaly and for categorising it once it has been detected.
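The pattern of which features disperse and which concentrate can be sketched as a simple signature. The example below is a rough illustration, not the method from the paper: it assumes entropy as the measure of dispersal, compares each feature in a window against a baseline window, and uses a hypothetical `tolerance` threshold (in bits) to decide whether a feature counts as stable. The feature names in the usage note are also hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of values."""
    n = len(values)
    probs = [count / n for count in Counter(values).values()]
    return sum(-p * math.log2(p) for p in probs)


def dispersion_signature(baseline, window, tolerance=0.5):
    """Label each feature as dispersed, concentrated or stable against the baseline."""
    signature = {}
    for feature, values in window.items():
        delta = shannon_entropy(values) - shannon_entropy(baseline[feature])
        if delta > tolerance:
            signature[feature] = "dispersed"      # entropy rose
        elif delta < -tolerance:
            signature[feature] = "concentrated"   # entropy fell
        else:
            signature[feature] = "stable"
    return signature
```

With a baseline of `{"src_addr": ["a", "a", "b", "b"], "dst_port": [80, 443, 22, 8080]}` and a window of `{"src_addr": ["a", "b", "c", "d"], "dst_port": [80, 80, 80, 80]}`, the signature marks `src_addr` as dispersed and `dst_port` as concentrated, which is the kind of combined signal described above.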
Some things became apparent once the project was complete. Unsupervised approaches will play a key role in discovering and describing emerging behaviours in the IoT space. Supervised approaches risk applying obsolete knowledge to new realities as the size of the IoT ecosystem increases. Personally, I have always found acquiring new knowledge more fascinating than following the path of existing hypotheses that were only true in the past.
See how it works here
Access the full paper here