Learning from Labeled Anomalies for Efficient Anomaly Detection using Python Machine Learning Client for SAP HANA

In several earlier blog posts, we discussed anomaly detection in datasets with multiple features using techniques such as one-class classification, clustering (DBSCAN), and statistical tests. However, these techniques become less applicable when the dataset of interest is high-dimensional (i.e. contains many features) or when the boundary between normal and anomalous points is complicated. In such cases, a better approach is to label the anomalies in the dataset manually and then train a supervised machine learning model that classifies points as normal or anomalous.

In this blog post, we analyze a specific dataset with labeled anomalies and use the decision tree algorithm in the SAP HANA Predictive Analysis Library (PAL), called through the Python Machine Learning Client for SAP HANA (hana_ml), to build a classification model for anomaly detection. Along the way, we apply several resampling techniques to improve different aspects of the trained model’s performance.

Introduction

Separating anomalies from normal points with a labeled dataset looks like a regular classification problem. However, the highly skewed distribution of normal and anomalous data points poses a big challenge for building an effective classification model, because anomalies are usually rare while normal points are overwhelming in number. For example, consider a disease with a prevalence rate of 0.1%. A naive model that predicts every person to be free of the disease has a “high” accuracy of 99.9%, which looks good. However, adopting this naive model for detecting the disease would be a disaster for all real patients, since it finds none of them. The imbalance between normal and anomalous labels is one typical characteristic of anomaly detection problems, and it is especially troublesome when normal points and anomalies are entangled in the feature space of the dataset.
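The arithmetic behind the disease example can be checked with a few lines of plain Python. The population size of 100,000 is an assumption made only for illustration; the 0.1% prevalence is the value used in the text.

```python
# Hypothetical population size; only the 0.1% prevalence comes from the text.
population = 100_000
patients = population // 1000          # 0.1% prevalence -> 100 patients
healthy = population - patients

# A naive model predicts "non-patient" for everyone.
true_positives = 0                     # no patient is ever flagged
true_negatives = healthy               # every healthy person is labeled correctly
false_negatives = patients             # every patient is missed

accuracy = (true_positives + true_negatives) / population
recall = true_positives / (true_positives + false_negatives)

print(f"accuracy = {accuracy:.1%}, recall = {recall:.0%}")
```

The accuracy looks excellent while the recall for patients is zero, which is why accuracy alone is a poor yardstick for imbalanced problems.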

At the same time, anomalous cases are usually much more valuable than normal ones. For example, if we fail to detect a fraudulent transaction between bank accounts, we could lose a great deal of money; if we suspect a transaction to be fraudulent and it turns out to be legitimate, we only pay for some manual verification procedures that are relatively cheap. The higher importance of anomalous cases compared to normal ones is another typical characteristic of anomaly detection.

In this blog post, we do a case study on anomaly detection using a labeled dataset. The following topics are covered:

Introduction & background knowledge on the dataset for our case study, with brief problem analysis
Anomaly detection from classification models with the help of various resampling techniques

Case Study: Thyroid Hyperfunctionality Detection by Classification

Dataset Description and Problem Analysis

The problem of interest in this blog post is thyroid disease recognition. The original full dataset is available in the UCI Machine Learning Repository[1]. It contains 21 attributes, of which 15 are categorical and 6 are numerical, and is divided into 3 classes: normal, subnormal and hyperfunction. Hyperfunction is the minority class in this dataset, but it is also the class we are most interested in, because thyroid hyperfunction may accelerate the body’s metabolism and bring along symptoms such as unintentional weight loss, rapid or irregular heartbeat, nervousness, anxiety and irritability.

Our task in this blog post is to distinguish hyperfunctional cases from non-hyperfunctional (i.e. normal and subnormal) ones, using only the 6 numerical attributes. A reduced version of the dataset for this purpose is available in the ODDS library[2], where all attribute values are scaled into the range [0, 1] using a Min-Max scaler. The label column in this dataset takes the values 0 and 1, where 1 stands for hyperfunction and 0 for non-hyperfunction.
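Min-Max scaling itself is a one-line formula, (x - min) / (max - min), applied per attribute. The following is a minimal sketch in plain Python with made-up values; it is not the code used to prepare the ODDS dataset.

```python
def min_max_scale(values):
    """Rescale a list of numbers linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to 0.0 by convention.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical raw values of one numerical attribute.
raw = [2.0, 4.0, 10.0, 6.0]
scaled = min_max_scale(raw)
print(scaled)
```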

Let us examine the dataset for further analysis. We assume that the data has already been stored in a table named ‘PAL_THYROID_DATA_TBL’ in an SAP HANA database.

Let thyroid_df be a hana_ml.DataFrame that refers to this table. A brief statistical description of the dataset, such as the mean value of each column, can then be obtained from it.

As revealed by the mean value of the TYPE column, hyperfunctional cases cover less than 3% of all cases in the dataset, so the dataset is highly skewed with respect to thyroid functionality types.
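The reason the mean of the TYPE column can be read as a class ratio is that, for a 0/1 column, the mean equals the fraction of 1s. A small check in plain Python, using a hypothetical label list rather than the actual thyroid data:

```python
from statistics import mean

# Hypothetical label column: 1 marks an anomaly, 0 a normal case.
labels = [1] * 3 + [0] * 117

# For a 0/1 column the mean equals the fraction of rows labeled 1.
anomaly_fraction = mean(labels)
print(f"{anomaly_fraction:.1%} of the rows are labeled 1")
```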

Now we inspect the hyperfunctional cases and non-hyperfunctional cases separately.

A closer examination of the data tells us that the distributions of hyperfunctional and non-hyperfunctional cases differ considerably (in their quantiles) for attributes like V1, V2, V3, and V5. These numerical attributes therefore have the potential to tell thyroid hyperfunction apart from non-hyperfunction.

Visualization of the dataset is another way to assess how difficult the classification problem is. Dimensionality reduction is indispensable here since the feature space has 6 dimensions. PCA is applied to transform the dataset from 6D to 2D so that it can be drawn as a scatter plot (the TYPE column is not involved in the transformation).

One can see from the scatter plot that hyperfunctional cases are distributed differently from non-hyperfunctional ones (in the reduced attribute space), yet the two classes are not well separated.
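A per-class quantile comparison of this kind can be sketched in plain Python as follows. The rows are invented for illustration, and the helper name quartiles_by_label is hypothetical; in the case study the same comparison is made on the real attributes.

```python
from statistics import quantiles

# Hypothetical rows: (value of one attribute, label); 1 = anomaly, 0 = normal.
rows = [(0.10, 0), (0.12, 0), (0.15, 0), (0.11, 0), (0.14, 0),
        (0.40, 1), (0.55, 1), (0.35, 1), (0.60, 1), (0.45, 1)]

def quartiles_by_label(rows):
    """Return {label: [Q1, median, Q3]} for the attribute values of each class."""
    groups = {}
    for value, label in rows:
        groups.setdefault(label, []).append(value)
    return {label: quantiles(vals, n=4) for label, vals in groups.items()}

summary = quartiles_by_label(rows)
for label, qs in sorted(summary.items()):
    print(label, [round(q, 3) for q in qs])
```

When the quartile ranges of the two classes barely overlap, as in this toy data, the attribute is a promising feature for the classifier.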
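To make the projection step concrete, here is a minimal PCA sketch in plain Python that uses power iteration with deflation on the covariance matrix. It is an illustration of the idea only, not the PCA implementation in PAL; the function name pca_2d and the 6-attribute sample data are assumptions.

```python
import random

def pca_2d(points, iterations=200, seed=0):
    """Project points onto their first two principal components."""
    n, d = len(points), len(points[0])
    means = [sum(p[j] for p in points) / n for j in range(d)]
    centered = [[p[j] - means[j] for j in range(d)] for p in points]
    # Sample covariance matrix (d x d).
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]

    rng = random.Random(seed)
    components = []
    for _ in range(2):
        v = [rng.uniform(-1.0, 1.0) for _ in range(d)]
        for _ in range(iterations):
            w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
            # Deflation: remove the directions that were already found.
            for c in components:
                dot = sum(wi * ci for wi, ci in zip(w, c))
                w = [wi - dot * ci for wi, ci in zip(w, c)]
            norm = sum(wi * wi for wi in w) ** 0.5
            if norm == 0.0:
                break  # degenerate data: no variance left, keep the current direction
            v = [wi / norm for wi in w]
        components.append(v)

    # Coordinates of every centered point along the two components.
    return [[sum(r[j] * c[j] for j in range(d)) for c in components]
            for r in centered]

# Hypothetical dataset with 6 numerical attributes.
data = [[i / 10, (i % 3) / 10, (i % 4) / 20, ((i * 7) % 5) / 50, 0.5, (i % 2) / 100]
        for i in range(12)]
projected = pca_2d(data)
print(projected[0])
```

The two output coordinates per row are what would be drawn on the x and y axes of the scatter plot, colored by the label afterwards.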

Dataset Partition

Before building any classification model for anomaly detection, we first divide the whole dataset into a training part and a testing part using the train_test_val_split() function in hana_ml, where the training percentage is set to 0.7 and the testing percentage is set to 0.3 (no validation data).

Now we are ready to build a classification model. A decision tree classifier is used for illustration in what follows; other classification algorithms can be applied with a similar workflow.
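With so few anomalies, it matters that both parts receive a fair share of them. The sketch below shows a 70/30 split that is stratified by label, written in plain Python. Whether the original split was stratified is not stated, so the stratification, the helper name stratified_split, and the sample data are all assumptions.

```python
import random

def stratified_split(ids, labels, train_fraction=0.7, seed=1):
    """Split ids into train/test lists so each class keeps roughly the same ratio."""
    rng = random.Random(seed)
    by_label = {}
    for row_id, label in zip(ids, labels):
        by_label.setdefault(label, []).append(row_id)
    train, test = [], []
    for label in sorted(by_label):
        members = by_label[label]
        rng.shuffle(members)
        cut = round(len(members) * train_fraction)
        train.extend(members[:cut])
        test.extend(members[cut:])
    return train, test

# Hypothetical data: 1000 rows, 30 of them anomalies.
ids = list(range(1000))
labels = [1 if i < 30 else 0 for i in ids]
train_ids, test_ids = stratified_split(ids, labels)
print(len(train_ids), len(test_ids))
```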

Direct Training

To begin with, we build a decision tree classifier directly on the training dataset without any modification. The classifier is called through the UnifiedClassification class in hana_ml, which lets us obtain many evaluation metrics of the trained model on the test dataset by calling its score() function.

The evaluation metrics of the trained model on the test dataset are available in the 2nd element of the result returned by the score() function, and they can be collected to the Python client.

A few key statistics are worth mentioning:

The overall accuracy is ~99.38%, higher than that of the naive ‘always non-hyperfunctional’ classifier.
~85.7% of the hyperfunctional cases in the test dataset are assigned the correct label by the decision tree classifier (recall).
~88.9% of the cases predicted as hyperfunctional by the decision tree classifier are real hyperfunctional cases (precision).

The performance of the trained model is already reasonably good. However, since the classes are imbalanced in the training dataset, resampling techniques that balance the class counts have the potential to further improve the performance of the classification model. We will check this claim in the subsequent subsections. Our first attempt is to oversample the minority class to achieve class balance in the training data.
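The three statistics above are accuracy, recall and precision, all of which follow from the four cells of a confusion matrix. The sketch below shows the formulas in plain Python. The counts are illustrative values picked so that they reproduce the rounded percentages; the actual confusion matrix is not given in the text, so they should be read as an assumption.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, recall and precision for the positive (anomalous) class."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "recall": tp / (tp + fn),        # share of real anomalies that are found
        "precision": tp / (tp + fp),     # share of predicted anomalies that are real
    }

# Illustrative counts (an assumption, not taken from the source).
metrics = classification_metrics(tp=24, fp=3, fn=4, tn=1101)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```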

Model Training with Minority-class Oversampling

We first multiply the number of hyperfunctional cases in the training dataset using the synthetic minority over-sampling technique (SMOTE), so that the number of hyperfunctional cases becomes comparable to that of non-hyperfunctional cases.

After oversampling the hyperfunctional cases in the training data to roughly the same size as the non-hyperfunctional cases and retraining the decision tree model, we see some improvement, as revealed by the following key statistics:

The overall accuracy increases from ~99.38% to ~99.73% (versus training without resampling).
~96.4% of the hyperfunctional cases in the test dataset are assigned the correct label by the decision tree classifier (versus 85.7% without resampling).
~93.1% of the cases predicted as hyperfunctional by the decision tree classifier are real hyperfunctional cases (versus 88.9% without resampling).

The gain in model performance on the test dataset is non-negligible, which shows the effectiveness of oversampling the minority class.

Another approach to oversampling the minority class is direct duplication, which is equivalent to bootstrapping in some sense.
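The core idea of SMOTE is to create each synthetic point at a random position on the line segment between a minority point and one of its nearest minority neighbors. The following plain-Python sketch illustrates that idea only; it is not the PAL implementation, and the function name smote, its parameters, and the sample points are assumptions.

```python
import random

def smote(minority, n_synthetic, k=3, seed=0):
    """Create n_synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself).
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: sq_dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position on the segment between the two points
        synthetic.append([b + gap * (q - b) for b, q in zip(base, neighbour)])
    return synthetic

# Hypothetical minority-class points in a 2-D feature space scaled to [0, 1].
minority = [[0.10, 0.20], [0.15, 0.25], [0.20, 0.10], [0.30, 0.30]]
new_points = smote(minority, n_synthetic=8)
print(len(new_points))
```

Because every synthetic point is a convex combination of two real minority points, the new points stay inside the region already occupied by the minority class.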
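Duplication-based oversampling amounts to drawing minority rows with replacement until the desired class size is reached. A minimal sketch in plain Python, with a hypothetical helper name and hypothetical row ids:

```python
import random

def oversample_by_duplication(minority, target_size, seed=0):
    """Grow the minority class to target_size by drawing rows with replacement."""
    rng = random.Random(seed)
    extra = rng.choices(minority, k=max(0, target_size - len(minority)))
    return list(minority) + extra

# Hypothetical ids of minority-class rows.
minority_ids = [3, 17, 42, 58]
oversampled = oversample_by_duplication(minority_ids, target_size=12)
print(len(oversampled))
```

Unlike SMOTE, this adds no new points to the feature space; it only increases the weight of the existing minority rows.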

Model Training with Majority-class Undersampling

Oversampling the minority class of the training data can usually increase the related precision and recall without much affecting the evaluation metrics of the majority class. However, it has some drawbacks: one is the additional memory and computational resources consumed by the oversampled training data; another is that it usually cannot achieve a very high recall for the minority class. When the cost of misclassifying a minority-class case becomes unacceptably high, we must find a smart way to increase the recall. Naively labeling all data as minority class is obviously not smart and should always be avoided.

This is where majority-class undersampling comes in: the data with the majority-class label is subsampled. As a consequence, the region covered by majority-class data points becomes smaller and less dense, which gives machine learning methods room to model the minority-class points with less ambiguity. This usually results in a direct increase of the coverage (i.e. recall) of minority-class points, yet some majority-class points will also be misclassified as minority class (the precision drops). However, as long as the price paid for the latter is smaller than the value saved by the former, we are happy to make the change.

Undersampling the majority class can be realized in many different ways. Our first attempt here is the TomekLinks algorithm, which is provided in the SAP HANA Predictive Analysis Library (PAL) and wrapped in the Python machine learning client for SAP HANA (hana_ml).

Compared to the classification result obtained without resampling the training data, we have:
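The trade-off in the last sentence can be written down as a simple cost comparison: each missed anomaly and each false alarm gets a unit cost, and the change pays off when the total cost goes down. All costs and error counts in this plain-Python sketch are hypothetical.

```python
def total_cost(false_negatives, false_positives, cost_fn, cost_fp):
    """Total misclassification cost for given error counts and unit costs."""
    return false_negatives * cost_fn + false_positives * cost_fp

# All numbers below are hypothetical and serve only as an illustration.
COST_MISSED_ANOMALY = 50   # cost of one undetected anomaly
COST_FALSE_ALARM = 1       # cost of one manual check of a normal case

# Fewer missed anomalies, more false alarms after undersampling.
before = total_cost(false_negatives=4, false_positives=3,
                    cost_fn=COST_MISSED_ANOMALY, cost_fp=COST_FALSE_ALARM)
after = total_cost(false_negatives=0, false_positives=24,
                   cost_fn=COST_MISSED_ANOMALY, cost_fp=COST_FALSE_ALARM)
print(before, after)
```

With these assumed costs the extra false alarms are cheap compared with the anomalies that are no longer missed; with a different cost ratio the conclusion could flip.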
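A Tomek link is a pair of points from different classes that are each other's nearest neighbor; removing the majority-class member of each link cleans up the class boundary. The sketch below is a brute-force illustration in plain Python, not the PAL implementation. Removing only the majority member is an assumption of this sketch, and the function names and sample points are hypothetical.

```python
def tomek_links(points, labels):
    """Return index pairs (i, j) that are mutual nearest neighbours with different labels."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    def nearest(i):
        return min((j for j in range(len(points)) if j != i),
                   key=lambda j: sq_dist(points[i], points[j]))

    nn = [nearest(i) for i in range(len(points))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and labels[i] != labels[j]]

def remove_majority_in_links(points, labels, majority_label=0):
    """Drop the majority-class member of every Tomek link."""
    drop = set()
    for i, j in tomek_links(points, labels):
        drop.add(i if labels[i] == majority_label else j)
    keep = [k for k in range(len(points)) if k not in drop]
    return [points[k] for k in keep], [labels[k] for k in keep]

# Hypothetical 1-D example: the majority point at 0.52 sits next to a minority point.
points = [[0.10], [0.25], [0.30], [0.52], [0.55], [0.90]]
labels = [0, 0, 0, 0, 1, 1]
kept_points, kept_labels = remove_majority_in_links(points, labels)
print(kept_points, kept_labels)
```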

The overall accuracy increases from 99.38% to 99.64% (versus training without resampling).
96.4% of the hyperfunctional cases are now correctly predicted (versus 85.7% without resampling).
90% of the predicted hyperfunctional cases are now correct (versus 88.9% without resampling).

For hyperfunctional cases, the recall improves considerably (to the same value achieved by minority-class oversampling), yet the gain in precision is much smaller, and the resulting value is lower than the one achieved by minority-class oversampling. The result is consistent with our previous analysis of majority-class undersampling.

Next we try random subsampling of the majority class, in which the majority-class data is subsampled heavily so that the two classes eventually have similar sizes in the training data.

Now the trained model achieves a perfect recall for the hyperfunctional cases in the test dataset, yet the corresponding precision is greatly reduced, indicating many falsely predicted hyperfunctional cases in the test dataset.
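Random undersampling is the simplest of the resampling steps: keep a random subset of the majority rows, drawn without replacement. A plain-Python sketch with a hypothetical helper name and hypothetical class sizes:

```python
import random

def undersample_majority(majority, target_size, seed=0):
    """Keep a random subset of the majority class, drawn without replacement."""
    rng = random.Random(seed)
    return rng.sample(majority, k=min(target_size, len(majority)))

# Hypothetical row ids: 970 majority rows reduced to the size of a 30-row minority class.
majority_ids = list(range(970))
kept = undersample_majority(majority_ids, target_size=30)
print(len(kept))
```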

Model Training with Hybrid Method

A high undersampling rate for the majority class can lead to a high recall for the minority class, yet it also carries the risk of underfitting the majority class, since many of its points are thrown away in the training phase. We have already seen the many falsely predicted hyperfunctional cases caused by undersampling the non-hyperfunctional cases in the training data. In comparison, combining minority-class oversampling with majority-class undersampling, i.e. a hybrid method, could be a smarter and more robust way of treating class imbalance problems.

SMOTETomek is an algorithm that combines both sampling strategies; it is also provided in the SAP HANA Predictive Analysis Library (PAL) and wrapped in the Python machine learning client for SAP HANA (hana_ml). In what follows we use SMOTETomek to resample the training data.

Overall, the evaluation metrics on the test dataset are similar to the case when SMOTE is applied to the training data. This is reasonable because usually only a few points are associated with TomekLinks in a dataset, so the undersampling effect is insignificant.

Next we undersample (randomly, without replacement) the non-hyperfunctional cases in the training data by half, and at the same time oversample the hyperfunctional cases using SMOTE so that the two classes eventually have similar sizes.

The trained model again achieves a perfect recall for hyperfunctional cases in the test data and, compared to the result of purely undersampling the non-hyperfunctional cases, the precision for hyperfunctional cases improves a lot (from ~53.85% to ~90.32%). Compared with the previous results, the result of this hybrid resampling method is the most satisfying one.
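The sizes involved in such a hybrid scheme follow from simple arithmetic: halve the majority class, then generate as many synthetic minority points as are needed to match it. The helper name and the class counts in this plain-Python sketch are hypothetical.

```python
def hybrid_resampling_plan(n_majority, n_minority, keep_fraction=0.5):
    """Return (majority rows to keep, synthetic minority rows to create)."""
    majority_kept = int(n_majority * keep_fraction)
    synthetic_needed = max(0, majority_kept - n_minority)
    return majority_kept, synthetic_needed

# Hypothetical class counts in a training set.
kept_majority, synthetic_minority = hybrid_resampling_plan(n_majority=2000, n_minority=50)
print(kept_majority, synthetic_minority)
```

Compared with pure oversampling, the halved majority class means far fewer synthetic points are required to reach balance.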
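One way to condense precision and recall into a single number for this comparison is the F1 score, their harmonic mean. The F1 score is not reported in the text; the sketch below simply applies the standard formula to the precision values quoted above, with recall at 100% in both cases.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision values quoted in the text; recall is 100% in both cases.
f1_undersampling = f1_score(precision=0.5385, recall=1.0)
f1_hybrid = f1_score(precision=0.9032, recall=1.0)
print(f"{f1_undersampling:.3f} vs {f1_hybrid:.3f}")
```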

Summary and Discussion

In this blog post, we have done a case study on anomaly detection by classification, where a training dataset with normal and anomalous labels is provided. The main difficulty of anomaly detection by classification is that the distribution of normal and anomalous labels is usually highly imbalanced, which makes it challenging to build an effective classification model. We have shown that appropriately resampling the training data can improve the classification model’s performance in the prediction phase, especially the precision and recall with respect to the anomalous data points. In our case study, a hybrid resampling method combining SMOTE and random subsampling (without replacement) gives the most satisfying result.

However, it should be emphasized that, even when the classes in the training data are highly imbalanced, resampling does not always lead to a better machine learning model than training without resampling (e.g. see [3]). One must be careful to validate the gain from resampling the training data, especially when the collected features already separate the classes well.