Anomaly detection is a data analysis task that identifies anomalies in a given dataset. It is important in many contexts and domains, such as medicine and health, fraud detection in finance, and computer systems and networks [1]. The same problem has also been termed:
- outlier detection
- novelty detection
- deviation detection
- exception mining
Definition of Anomaly
Anomalies are patterns that do not conform to expected behavior. Depending on the application domain, they have also been referred to as outliers, discordant observations, exceptions, aberrations, surprises, peculiarities, or contaminants [2]. Among the many definitions in the literature, one of the most widely accepted is that of Hawkins [3]:
An anomaly is an observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism.
Relationships to noise & novelty
The key distinction between noise and an anomaly is how interesting the pattern is to the analyst: noise is not of interest, while an anomaly is.
The distinction between novel patterns and anomalous ones is that novel patterns are typically incorporated into the normal model after being detected [2].
Types of Anomalies
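Anomalies are usually categorized into three main types (point, collective, and contextual), with pattern anomalies sometimes treated as a special case of collective anomalies [1][4]: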
- Point Anomalies: A point anomaly is any point that deviates from the range of expected values in a given dataset. For example, a memory usage value that is 3 standard deviations away from the mean can be considered a point anomaly. The majority of previous studies have focused on point anomalies.
- Collective Anomalies: A homogeneous group of data points that deviates from the normal regions of the rest of the data. For example, an unexpected streak of low throughput values may be considered a collective anomaly when compared with the higher values in the past observation window.
- Contextual Anomalies: Anomalies that manifest only under certain contexts. For example, high expenditure can be considered normal during a festive period such as Black Friday, whereas the same expenditure outside that period may be anomalous. Contextual anomalies require that the data has a set of contextual attributes (to define a context) and a set of behavior attributes (to detect anomalies within a context).
- Pattern Anomalies: The shapes of some performance metrics, when plotted, are known to exhibit specific patterns, and a violation of such a pattern can be considered a pattern anomaly. This can be seen as a special case of collective anomalies.
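To make the first two types concrete, here is a minimal sketch in Python. It assumes univariate data, uses the 3-standard-deviation rule from the memory usage example for point anomalies, and approximates a collective anomaly as a run of consecutive values below a fixed threshold. The function names, the fixed threshold, the minimum run length, and the sample data are all illustrative assumptions, not part of any cited method.

```python
from statistics import mean, stdev


def point_anomalies(values, k=3.0):
    """Return indices of values more than k sample standard deviations from the mean."""
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []  # all values identical, nothing deviates
    return [i for i, v in enumerate(values) if abs(v - mu) > k * sigma]


def collective_anomalies(values, threshold, min_run=3):
    """Return (start, end) index pairs, end exclusive, for runs of at least
    min_run consecutive values below threshold.

    Assumption: a fixed threshold stands in for "lower than the values in the
    past observation window"; a real detector would derive it from history.
    """
    runs = []
    start = None
    for i, v in enumerate(values):
        if v < threshold:
            if start is None:
                start = i  # a low run begins here
        else:
            if start is not None and i - start >= min_run:
                runs.append((start, i))
            start = None
    # Close a run that extends to the end of the series.
    if start is not None and len(values) - start >= min_run:
        runs.append((start, len(values)))
    return runs


# Hypothetical memory usage (%): one reading far from the rest.
memory_usage = [50, 51, 49, 50, 52, 48, 50, 51, 49, 50,
                51, 49, 50, 52, 48, 50, 51, 49, 50, 95]

# Hypothetical throughput: one isolated dip, then a streak of low values.
throughput = [100, 98, 102, 40, 101, 99, 35, 38, 36, 37, 100, 97]

print(point_anomalies(memory_usage))                   # [19]
print(collective_anomalies(throughput, threshold=50))  # [(6, 10)]
```

Note that the isolated dip at index 3 of `throughput` is not reported as a collective anomaly: only the streak of four consecutive low values qualifies under the assumed minimum run length.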
Input & Output
- Input: a set of data instances where each data instance can be described using a set of attributes. Each data instance can be univariate (one attribute) or multivariate (multiple attributes).
- Output: labels and/or scores. A scoring mechanism allows the analyst to use a domain-specific threshold to filter relevant anomalies.
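The relationship between scores, thresholds, and labels can be sketched as follows. This is only an illustration: the score used here (the Euclidean norm of per-attribute z-scores) is one simple assumed choice among many, and the function names and sample instances are hypothetical. A univariate instance would simply be a tuple with one attribute.

```python
from math import sqrt
from statistics import mean, pstdev


def anomaly_scores(instances):
    """Score each instance by the Euclidean norm of its per-attribute z-scores.

    instances: list of equal-length tuples, one value per attribute.
    """
    columns = list(zip(*instances))
    mus = [mean(c) for c in columns]
    # Fall back to 1.0 for constant attributes to avoid division by zero.
    sigmas = [pstdev(c) or 1.0 for c in columns]
    return [
        sqrt(sum(((x - m) / s) ** 2 for x, m, s in zip(row, mus, sigmas)))
        for row in instances
    ]


def to_labels(scores, threshold):
    """Turn scores into binary labels using a domain-specific threshold."""
    return [s > threshold for s in scores]


# Hypothetical multivariate instances: (cpu_percent, memory_percent).
instances = [(10, 20), (12, 22), (11, 21), (9, 19), (10, 20), (40, 80)]

scores = anomaly_scores(instances)
strict = to_labels(scores, threshold=2.0)  # only the most extreme instance
loose = to_labels(scores, threshold=0.6)   # a lower bar flags more instances
```

Keeping the scores and applying the threshold as a separate step lets the analyst tighten or loosen the filter without recomputing anything.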
Types of Solutions with respect to Label Conditions
Based on the availability of the labels associated with data instances, anomaly detection techniques can be categorized into three types [2]:
- Supervised Anomaly Detection: Assumes that a labeled training set is available, which is not always the case. In addition, even when one is available, the anomaly and non-anomaly labels are typically imbalanced.
- Semisupervised Anomaly Detection: Assumes a training set labeled with the normal class only. A typical approach in this stream is to build a model for the normal class and use it to identify anomalies in the test data.
- Unsupervised Anomaly Detection: Requires no training data and is thus widely applicable. It assumes that normal instances are far more frequent than anomalies in the test data.
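The difference between the semisupervised and unsupervised settings can be sketched with two small detectors. Both are assumptions chosen for brevity: the "model of the normal class" is just a mean and standard deviation, and the unsupervised variant uses the median and the scaled median absolute deviation (MAD) so that the rare anomalies in the data barely influence the estimate of normal behavior. The names and sample values are hypothetical.

```python
from statistics import mean, median, stdev


def fit_normal_model(normal_training_values):
    """Semisupervised step 1: model the normal class from normal-only data."""
    return mean(normal_training_values), stdev(normal_training_values)


def detect_semisupervised(model, test_values, k=3.0):
    """Semisupervised step 2: flag test values that the normal model does not explain."""
    mu, sigma = model
    return [abs(v - mu) > k * sigma for v in test_values]


def detect_unsupervised(values, k=3.0):
    """Unsupervised: no training data; estimate 'normal' from the data itself.

    Relies on normal instances being far more frequent than anomalies, so the
    median and MAD are dominated by normal behavior.
    """
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return [False] * len(values)
    scale = 1.4826 * mad  # MAD scaled to approximate a standard deviation
    return [abs(v - med) > k * scale for v in values]


# Hypothetical response times (ms): training data is known to be normal.
normal_train = [20.0, 21.0, 19.5, 20.5, 20.0, 19.0, 21.5, 20.0]
test = [20.5, 19.8, 35.0, 20.1]

model = fit_normal_model(normal_train)
print(detect_semisupervised(model, test))  # [False, False, True, False]
print(detect_unsupervised(test))           # [False, False, True, False]
```

A supervised detector is omitted here because it would require labeled anomalies and a classifier, which goes beyond what a short sketch can show.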
References
- [1] Ahmed, Mohiuddin, Abdun Naser Mahmood, and Jiankun Hu. "A survey of network anomaly detection techniques." Journal of Network and Computer Applications 60 (2016): 19-31.
- [2] Chandola, Varun, Arindam Banerjee, and Vipin Kumar. "Anomaly detection: A survey." ACM Computing Surveys (CSUR) 41.3 (2009): 15.
- [3] Hawkins, Douglas M. Identification of outliers. Vol. 11. London: Chapman and Hall, 1980.
- [4] Ibidunmoye, Olumuyiwa, Francisco Hernández-Rodriguez, and Erik Elmroth. "Performance anomaly detection and bottleneck identification." ACM Computing Surveys (CSUR) 48.1 (2015): 4.