WO2020185101A9 - Hybrid machine learning system and method - Google Patents

Hybrid machine learning system and method

Info

Publication number
WO2020185101A9
WO2020185101A9 (PCT/NZ2020/050025)
Authority
WO
WIPO (PCT)
Prior art keywords
confidence level
data point
value
machine learning
classification
Prior art date
Application number
PCT/NZ2020/050025
Other languages
French (fr)
Other versions
WO2020185101A1 (en)
Inventor
Khoon Guan SEAH
Murugaraj ODIATHEVAR
Marcus Frean
Alvin VALERA
Original Assignee
Victoria Link Limited
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Priority claimed from AU2019900788A external-priority patent/AU2019900788A0/en
Application filed by Victoria Link Limited filed Critical Victoria Link Limited
Publication of WO2020185101A1 publication Critical patent/WO2020185101A1/en
Publication of WO2020185101A9 publication Critical patent/WO2020185101A9/en


Classifications

    • G06N 20/10 Machine learning using kernel methods, e.g. support vector machines [SVM]
    • G06F 18/22 Pattern recognition; matching criteria, e.g. proximity measures
    • G06F 18/2411 Classification based on the proximity to a decision surface, e.g. support vector machines
    • G06F 18/24143 Distances to neighbourhood prototypes, e.g. restricted Coulomb energy networks [RCEN]
    • G06F 18/24147 Distances to closest patterns, e.g. nearest neighbour classification
    • G06N 20/20 Ensemble learning
    • G06V 10/82 Image or video recognition or understanding using neural networks
    • H04L 63/1425 Network security; traffic logging, e.g. anomaly detection
    • G06N 3/045 Neural networks; combinations of networks
    • G06N 3/088 Neural network learning methods; non-supervised learning, e.g. competitive learning
    • G06T 2207/20081 Indexing scheme for image analysis; training, learning

Definitions

  • the invention relates to a hybrid machine learning system and method, particularly for use in anomaly detection in time series data.
  • time series data include network data, water & electricity consumption, industrial production data, chemical concentration readings and monthly sunspot numbers.
  • Detecting network anomalies is not a well-posed problem.
  • One of the main challenges in detecting anomalies is the changing normal network traffic. With the advancement in technology, new applications are being created almost daily around the world. There are many types of anomalies in the network ranging from attacks to faulty devices. But there are also many types of normal traffic. Also with the recent advances in the Internet of Things (IoT), normal traffic profiles display large variations and heterogeneity. Many network anomaly detection techniques miss this aspect of the network and focus on algorithms.
  • To address these challenges, hybrid models are used.
  • a hybrid model comprises an offline learning model and an online learning model.
  • Offline learning models are the traditional machine learning models. They can go deep and pick out patterns that are not easily recognisable by humans. However, they suffer from lengthy training times.
  • Online learning models take the form of incremental learners that learn in time windows as new data arrives. They are required to be fast and are thus unable to go deep enough to recognise intricate patterns in the data.
  • One of the main challenges in such hybrid approaches is enabling an online and offline model to work together in a logical and effective manner to detect anomalies.
  • Machine learning models need to fit the training data accurately and also be able to generalise to unseen data. In the network anomaly context, both false positives and false negatives can be costly.
  • a method of classifying at least one data point within a data set comprises maintaining, in a first machine learning model, a first plurality of data points associated to respective first labels; maintaining, in a second machine learning model, a second plurality of data points associated to respective second labels; classifying the at least one data point using the first machine learning model to a first confidence level to determine at least one first classification label;
  • responsive to the first confidence level having a value greater than or equal to a classification threshold value: classifying the at least one data point with the at least one first classification label to the first confidence level; and responsive to the first confidence level having a value of zero and/or the first confidence level having a value below the classification threshold value: classifying the at least one data point using the second machine learning model to a second confidence level to determine at least one second classification label, responsive to the first confidence level having a value less than the second confidence level: classifying the at least one data point with the at least one second classification label, and responsive to the first confidence level having a value greater than or equal to the second confidence level: classifying the at least one data point with the at least one first classification label to the second confidence level.
  • the classification threshold value is user defined. In an embodiment the classification threshold value is 80% or greater. In an embodiment the classification threshold is 90% or greater. In an embodiment the classification threshold lies in the range of 80% to 90%.
  • the method further comprises maintaining the at least one data point in a high confidence database responsive to the at least one data point being classified to a classification confidence level having a value greater than an upper threshold confidence level.
  • the upper threshold confidence level is 92.5% or greater. In an embodiment the upper threshold confidence level lies in the range 92.5% to 97.5%. In an embodiment the upper threshold confidence level is 95%.
  • the method further comprises maintaining the at least one data point in a low confidence database responsive to the at least one data point being classified to a classification confidence level having a value below a lower threshold confidence level.
  • the lower threshold confidence level is lower than the upper threshold confidence level. In an embodiment the lower threshold confidence level is less than 95%. In an embodiment the lower threshold confidence level lies in the range 70% to 95%. In an embodiment the lower threshold confidence level is less than 92.5%. In an embodiment the lower threshold confidence level is less than 80%.
  • classifying the at least one data point using the first machine learning model includes counting votes of all neighbouring points positioned within a specified radius r.
  • the method comprises determining the first confidence level as a ratio of k nearest like neighbours to k nearest unlike neighbours, k having a value in the range 10 to 15.
  • the second machine learning model comprises a Support Vector Machine (SVM) model having parameters C and γ.
  • the values of C and/or γ are user-defined.
  • the value of C is approximately 1,000.
  • the value of γ is approximately 0.1.
  • the method further comprises training the second machine learning model from at least one data point retrieved from the high confidence database.
  • the classification threshold value is user defined. In an embodiment the classification threshold value is 80% or greater. In an embodiment the classification threshold is 90% or greater. In an embodiment the classification threshold lies in the range of 80% to 90%.
  • the system further comprises a high confidence database in which is maintained the at least one data point responsive to the at least one data point being classified to a classification confidence level having a value greater than an upper threshold confidence level.
  • the upper threshold confidence level is 92.5% or greater. In an embodiment the upper threshold confidence level lies in the range 92.5% to 97.5%. In an embodiment the upper threshold confidence level is 95%.
  • the system further comprises a low confidence database in which is maintained the at least one data point responsive to the at least one data point being classified to a classification confidence level having a value below a lower threshold confidence level.
  • the lower threshold confidence level is lower than the upper threshold confidence level.
  • the lower threshold confidence level is less than 95%.
  • the lower threshold confidence level lies in the range 70% to 95%.
  • the lower threshold confidence level is less than 92.5%.
  • the lower threshold confidence level is less than 80%.
  • classifying the at least one data point using the first machine learning model includes counting votes of all neighbouring points positioned within a specified radius r.
  • the classifier is configured to determine the first confidence level as a ratio of k nearest like neighbours to k nearest unlike neighbours, k having a value in the range 10 to 15.
  • the second machine learning model comprises a Support Vector Machine (SVM) model having parameters C and γ.
  • the values of C and/or γ are user-defined.
  • the value of C is approximately 1,000.
  • the value of γ is approximately 0.1.
  • the system further comprises a model trainer configured to train the second machine learning model from at least one data point retrieved from the high confidence database.
  • a computer-readable medium has stored thereon computer-executable instructions that, when executed by a processor, cause the processor to perform a method of classifying at least one data point within a data set, the method comprising: maintaining, in a first machine learning model, a first plurality of data points associated to respective first labels; maintaining, in a second machine learning model, a second plurality of data points associated to respective second labels; classifying the at least one data point using the first machine learning model to a first confidence level to determine at least one first classification label;
  • responsive to the first confidence level having a value greater than or equal to a classification threshold value: classifying the at least one data point with the at least one first classification label to the first confidence level; and responsive to the first confidence level having a value of zero and/or the first confidence level having a value below the classification threshold value: classifying the at least one data point using the second machine learning model to a second confidence level to determine at least one second classification label, responsive to the first confidence level having a value less than the second confidence level: classifying the at least one data point with the at least one second classification label, and responsive to the first confidence level having a value greater than or equal to the second confidence level: classifying the at least one data point with the at least one first classification label to the second confidence level.
  • the classification threshold value is user defined. In an embodiment the classification threshold value is 80% or greater. In an embodiment the classification threshold is 90% or greater. In an embodiment the classification threshold lies in the range of 80% to 90%.
  • the method further comprises maintaining the at least one data point in a high confidence database responsive to the at least one data point being classified to a classification confidence level having a value greater than an upper threshold confidence level.
  • the upper threshold confidence level is 92.5% or greater. In an embodiment the upper threshold confidence level lies in the range 92.5% to 97.5%. In an embodiment the upper threshold confidence level is 95%.
  • the method further comprises maintaining the at least one data point in a low confidence database responsive to the at least one data point being classified to a classification confidence level having a value below a lower threshold confidence level.
  • the lower threshold confidence level is lower than the upper threshold confidence level. In an embodiment the lower threshold confidence level is less than 95%. In an embodiment the lower threshold confidence level lies in the range 70% to 95%. In an embodiment the lower threshold confidence level is less than 92.5%. In an embodiment the lower threshold confidence level is less than 80%.
  • classifying the at least one data point using the first machine learning model includes counting votes of all neighbouring points positioned within a specified radius r.
  • the method comprises determining the first confidence level as a ratio of k nearest like neighbours to k nearest unlike neighbours, k having a value in the range 10 to 15.
  • the second machine learning model comprises a Support Vector Machine (SVM) model having parameters C and γ.
  • the values of C and/or γ are user-defined.
  • the value of C is approximately 1,000.
  • the value of γ is approximately 0.1.
  • the method further comprises training the second machine learning model from at least one data point retrieved from the high confidence database.
  • a method of detecting anomalies within a set of data points comprises receiving a data point from the set of data points; determining a similarity score associated to the data point using a first machine learning model; responsive to the data point having a similarity score less than a threshold value: determining a first value for an anomaly score associated to the data point, and adding the data point to a first database; and responsive to the data point having a similarity score greater than the threshold value: determining a second value for the anomaly score associated to the data point, the second value not equal to the first value.
  • the method further comprises adding the similarity score associated to the data point to a first batch of similarity scores maintained in a second database; and responsive to determining that the first batch of similarity scores is significantly different to a second batch of similarity scores: adding the data points associated to the respective similarity scores in the first batch to a low confidence database.
  • the method further comprises receiving the first batch of similarity scores; and responsive to determining that the first batch of similarity scores is not significantly different to the second batch of similarity scores: updating the threshold value.
  • the method further comprises removing outliers from the first batch of similarity scores.
  • updating the threshold value comprises determining a median value of the first batch of similarity scores; and setting the threshold value to the median value.
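The anomaly-scoring and threshold-update steps above can be sketched as follows. The concrete anomaly-score values (1 and 0), the list-based database, and the 1.5×IQR outlier trim are illustrative assumptions, not details taken from the specification, which only requires two distinct score values, outlier removal, and a median-based threshold:

```python
import statistics

def score_and_route(similarity, threshold, anomalous_db):
    """Assign an anomaly score from a similarity threshold.

    Points whose similarity falls below the threshold receive a
    'first value' (here 1) and are added to the first database;
    other points receive a different 'second value' (here 0).
    """
    if similarity < threshold:
        anomalous_db.append(similarity)  # add the data point to the first database
        return 1  # first value for the anomaly score
    return 0      # second value, not equal to the first

def update_threshold(batch):
    """Set the threshold to the median of a batch of similarity scores,
    after removing outliers (here: scores outside 1.5*IQR, an assumption)."""
    q1, _, q3 = statistics.quantiles(batch, n=4)
    low, high = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
    trimmed = [s for s in batch if low <= s <= high] or batch
    return statistics.median(trimmed)
```

A batch of scores would typically be accumulated per time window before `update_threshold` is called.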
  • the processor is configured to: receive a data point from the set of data points; determine a similarity score associated to the data point using the first machine learning model; responsive to the data point having a similarity score less than a threshold value: determine a first value for an anomaly score associated to the data point, and add the data point to a first database; and responsive to the data point having a similarity score greater than the threshold value: determine a second value for the anomaly score associated to the data point, the second value not equal to the first value.
  • a computer-readable medium has stored thereon computer-executable instructions that, when executed by a processor, cause the processor to perform a method of detecting anomalies within a set of data points.
  • the method comprises receiving a data point from the set of data points;
  • determining a similarity score associated to the data point using a first machine learning model; responsive to the data point having a similarity score less than a threshold value: determining a first value for an anomaly score associated to the data point, and adding the data point to a first database; and responsive to the data point having a similarity score greater than the threshold value: determining a second value for the anomaly score associated to the data point, the second value not equal to the first value.
  • the invention in one aspect comprises several steps.
  • the relation of one or more of such steps with respect to each of the others, the apparatus embodying features of construction, and combinations of elements and arrangement of parts that are adapted to effect such steps, are all exemplified in the following detailed disclosure.
  • This invention may also be said broadly to consist in the parts, elements and features referred to or indicated in the specification of the application, individually or collectively, and any or all combinations of any two or more said parts, elements or features, and where specific integers are mentioned herein which have known equivalents in the art to which this invention relates, such known equivalents are deemed to be incorporated herein as if individually set forth.
  • '(s)' following a noun means the plural and/or singular forms of the noun.
  • 'and/or' means 'and' or 'or' or both.
  • the term 'computer-readable medium' should be taken to include a single medium or multiple media. Examples of multiple media include a centralised or distributed database and/or associated caches. These multiple media store the one or more sets of computer executable instructions.
  • the term 'computer readable medium' should also be taken to include any medium that is capable of storing, encoding or carrying a set of instructions for execution by a processor and that cause the processor to perform any one or more of the methods described above.
  • the computer-readable medium is also capable of storing, encoding or carrying data structures used by or associated with these sets of instructions.
  • the term 'computer-readable medium' includes solid-state memories, optical media and magnetic media.
  • a component may be, but is not limited to being, a process running on a processor, a processor, an object, an executable, a thread of execution, a program, and/or a computer.
  • an application running on a controller and the controller can be a component.
  • One or more components may reside within a process and/or thread of execution and a component may be localized on one computer and/or distributed between two or more computers.
  • the term 'connected to' in relation to data or signal transfer includes all direct or indirect types of communication, including wired and wireless, via a cellular network, via a data bus, or any other computer structure. It is envisaged that there may be intervening elements between the connected integers. Variants such as 'in
  • Figure 1 shows an example hybrid machine learning system.
  • Figure 2 shows an example method of classifying data points within a data set.
  • Figure 3 shows an example of learning performed by the online model of figure 1.
  • Figure 4 shows another example of a hybrid machine learning system.
  • Figures 5-7 show an example method of detecting anomalies using the system of figure 4.
  • Figure 8 shows an example of a suitable computing environment to implement embodiments of one or more of the systems and methods of figures 1 to 7.
  • Figure 1 shows an example hybrid machine learning system 100. Shown in figure 1 is an environment 102 from which a data extractor 104 obtains a data set 106.
  • the data set 106 includes at least one data point.
  • the data point(s) obtained by the data extractor 104 are passed to a pre-processor 108. Examples of the pre-processor operation are further described below.
  • System 100 includes a first machine learning model 110 and a second machine learning model 112.
  • the first machine learning model 110 includes a first plurality of data points associated to respective first labels.
  • the second machine learning model 112 includes a second plurality of data points associated to respective second labels.
  • the system 100 includes additional machine learning models over and above the first machine learning model 110 and the second machine learning model 112.
  • the first machine learning model 110 comprises an offline model.
  • the offline model 110 is typically trained on labelled and analysed data.
  • the offline model 110 contains knowledge of general characteristics of data and is optimised offline.
  • the training time and depth of the offline model can be high, and it remains static at application time.
  • a suitable offline model 110 is a Radius Nearest Neighbour (Rad-NN) model.
  • This model is one form of nearest neighbour model.
  • Nearest Neighbour models are also known as lazy learners because they do not learn a discriminative function but "memorize" the dataset.
  • a Nearest Neighbour model simply maintains a knowledge base of points in hyperspace and when tasked with classifying a new point, p, calculates the distance between p and its neighbours. It then counts the votes or classes of the neighbouring points and classifies p. This is an appropriate model to retain knowledge or useful points that encompass the general features of network traffic.
  • the Rad-NN model counts the votes of all neighbouring points within a specified radius.
  • the voting can be done based on equal weighting for all of the points, or distance weighting where closer points would have a higher weight on their vote for p.
  • the parameter k represents the number of like and unlike neighbours to consider when the offline model 110 calculates its confidence.
  • the value of k is in the range 10 to 15.
  • a confidence of the classification is determined by a ratio of nearest like neighbours to nearest unlike neighbours.
  • the complexity of the search is O(d × log₂(n)), where n is the number of points and d is the number of features, with an implementation of a Ball tree structure, which is efficient compared to many algorithms.
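The Rad-NN classification and confidence steps described above can be sketched in plain Python. The 1/distance vote weights and the exact form of the like/unlike-neighbour ratio are assumptions for illustration; the specification states only that distance weighting is used and that the confidence is a ratio over the k nearest like and unlike neighbours:

```python
import math

def radius_nn_classify(point, knowledge_base, radius):
    """Distance-weighted vote over all stored points within `radius`.

    `knowledge_base` is a list of (features, label) pairs. Closer
    neighbours get a larger vote (weight 1/distance, an assumption).
    Returns the winning label, or None if no neighbour is in range.
    """
    votes = {}
    for feats, label in knowledge_base:
        d = math.dist(point, feats)
        if d <= radius:
            votes[label] = votes.get(label, 0.0) + 1.0 / (d + 1e-9)
    return max(votes, key=votes.get) if votes else None

def radius_nn_confidence(point, knowledge_base, label, k=10):
    """Confidence built from distances to the k nearest like neighbours
    and the k nearest unlike neighbours; this particular ratio is an
    illustrative choice that also lies in the range (0, 1)."""
    like = sorted(math.dist(point, f) for f, l in knowledge_base if l == label)[:k]
    unlike = sorted(math.dist(point, f) for f, l in knowledge_base if l != label)[:k]
    return sum(unlike) / (sum(like) + sum(unlike))
```

With this form, confidence approaches 1 when unlike neighbours are far away relative to like neighbours.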
  • the second machine learning model 112 comprises an online model.
  • the online model 112 is required to have low training time. It also needs to be adaptive and continuously change as the ground truth changes.
  • the online model 112 retrains itself in intervals and with new data based on confidence scores.
  • One example of a suitable online model 112 is a Support Vector Machine (SVM) model.
  • a Support Vector Machine is a supervised machine learning algorithm that learns an optimal hyperplane or decision boundary to separate two classes of data. The hyperplane learnt by the SVM is described by a decision function. The SVM uses a kernel trick to obtain the hyperplane in a higher dimensional space. The kernel trick avoids the explicit mapping to the higher dimension needed to learn the hyperplane and reduces computation operations.
  • the kernel is based on a Radial Basis Function (RBF).
  • the RBF kernel is specified by regularisation parameters C and γ.
  • the C parameter trades off correct classification during training for a simpler hyperplane. A lower C value yields a simpler hyperplane and a higher C value yields a more complex hyperplane.
  • the γ parameter defines how far the influence of each training data point reaches, with a low value meaning 'far' and vice versa.
  • Optimal values for these parameters depend on the data and/or the user. These values are derived based on classification performance or validation using new data. Support Vectors within the SVM are the data points close to the hyperplane and they influence its position and orientation. After training, the Support Vectors completely define the hyperplane.
  • the online model 112 can be trained with limited data and is quick to make predictions.
  • the online model 112 can incrementally learn with the new data and shift its decision boundary. Furthermore, any bias and variance can be easily adjusted using the C and γ parameters forming part of the model. Training time is typically O(max(n,d) × min(n,d)²). For a small number of samples, this does not pose an issue. Furthermore, once the online model 112 is trained, the support vectors are enough to describe the current decision boundary.
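Because the support vectors fully describe the decision boundary after training, evaluating the RBF-kernel decision function needs only the support vectors, their signed dual coefficients and the bias. A minimal sketch (the vectors and coefficients in the test below are illustrative assumptions, not learned values):

```python
import math

def rbf_decision(x, support_vectors, dual_coefs, intercept, gamma=0.1):
    """Decision function of a trained RBF-kernel SVM:
        f(x) = sum_i a_i * exp(-gamma * ||x - s_i||^2) + b
    where s_i are the support vectors, a_i their signed dual
    coefficients and b the bias. The sign of f(x) gives the class.
    """
    total = intercept
    for s, a in zip(support_vectors, dual_coefs):
        sq = sum((xj - sj) ** 2 for xj, sj in zip(x, s))  # squared distance
        total += a * math.exp(-gamma * sq)
    return total
```

This is why retraining in intervals stays cheap: only the (typically small) set of support vectors needs to be kept between windows.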
  • the online model 112 contains the ground truth of recent data.
  • the classifications of the offline model 110 and/or the online model 112 are associated to respective confidence levels.
  • the labels assigned to new data based on the offline model 110 and the online model 112 are only as accurate as the confidence level of the classifications of the respective models.
  • the confidence level associated to classification of the offline model 110 is an indicator of accuracy of classification using the offline model 110.
  • the confidence level associated to classification of the online model 112 is an indicator of accuracy of classification using the online model 112.
  • Ft denotes a data point to be classified at time t.
  • a classifier 114 receives a data point Ft from the data set 106. The classifier 114 classifies the data point Ft by assigning to it a classification label. An example operation of the classifier 114 is described below. In an embodiment the classifier 114 assigns a classification label to the data point Ft from either the offline model 110 or the online model 112 depending on which of the models has a higher confidence.
  • the classifier 114 determines at least one classification label Lt for data point Ft to a confidence level Ct. If the value of Ct is above an upper threshold confidence level then the data point Ft is assumed to be classified to a high confidence. The data point Ft is stored in high confidence database 120.
  • the upper threshold confidence level is 92.5% or greater. In an embodiment the upper threshold confidence level lies in the range 92.5% to 97.5%. In an embodiment the upper threshold confidence level is 95%.
  • a model trainer 122 takes as input at least one data point from the high confidence database 120 and trains the online model 112 using the data point(s). For example, the model trainer 122 may train the online model 112 using data points with their labels in addition to data points extracted
  • the retraining criteria are met when the number of data points in the high confidence database 120 is greater than or equal to 1,000. In an embodiment the data point(s) used to train the online model 112 are removed from the high confidence database 120 following training.
  • the data point(s) in the online model 112 are grouped into time windows. The retraining criteria are met when a threshold number of time windows of data points has been added to the high confidence database 120.
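The retraining criterion and subsequent clean-up can be sketched as follows; the training callable and the list-based database are assumptions for illustration, and the 1,000-point criterion is the embodiment mentioned above:

```python
def maybe_retrain(train_online_model, high_db, min_points=1000):
    """Retrain the online model once enough high-confidence points
    have accumulated, then remove the consumed points from the
    high confidence database, as in the embodiment described above."""
    if len(high_db) >= min_points:
        train_online_model(list(high_db))  # assumed training callable
        high_db.clear()                    # points are removed after training
        return True
    return False
```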
  • the system 100 includes a low confidence database 130. If for example the value of Ct is below a lower threshold confidence level then the data point Ft is assumed to be classified to a low confidence. The data point Ft is stored in low confidence database 130.
  • the lower threshold confidence level is lower than the upper threshold confidence level. In an embodiment the lower threshold confidence level is less than 95%. In an embodiment the lower threshold confidence level lies in the range 70% to 95%. In an embodiment the lower threshold confidence level is less than 92.5%. In an embodiment the lower threshold confidence level is less than 80%.
  • the data points in the low confidence database 130 and/or the data points not eligible for storing in the high confidence database 120 can be used to understand any changes in network traffic from the environment 102. For example, identification of a large number of low confidence points in the low confidence database 130 suggests that the first machine learning model 110 is going out of scope and requires an update.
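The confidence-based routing into the two databases can be sketched as follows, using the 95% upper and 80% lower thresholds given above as example embodiments; the list-based databases are an assumption:

```python
def route_by_confidence(point, label, confidence,
                        high_db, low_db,
                        upper=0.95, lower=0.80):
    """Store a classified point according to its confidence level.

    High-confidence points later feed retraining of the online model;
    many low-confidence points suggest the offline model is going out
    of scope and requires an update.
    """
    if confidence > upper:
        high_db.append((point, label))
    elif confidence < lower:
        low_db.append((point, label))
    # points between the thresholds are not stored in either database
```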
  • Figure 2 shows an example method of classifying data points within a data set.
  • the classifier 114 receives a data point Ft to be classified at a time t.
  • the data point Ft is at least partly representative of an environment 102.
  • the data point Ft may have undergone pre-processing or other data manipulation via the pre-processor 108.
  • the data point Ft is classified 202 using the offline model 110.
  • the offline model 110 outputs for example a data tuple Ft, L*t, C*t which represents the data point, a label assigned to the data point and a first confidence level.
  • the confidence C* of classifying a data point F using offline model 110 is determined by the function :
  • D(a,b) represents the distance between points a and b.
  • NLN i (F) represents the i-th nearest like neighbour of F while NUN i (F) represents its i-th nearest unlike neighbour.
  • offline model 110 uses distance weighting for classifying, therefore this measure will always be in the range (0,1).
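The confidence formula itself is not reproduced in this text. One reconstruction consistent with the surrounding definitions (a ratio built from distances to the k nearest like and unlike neighbours, always in the range (0,1)) can be sketched as follows; the function name and exact form are assumptions:

```python
import numpy as np

def rad_nn_confidence(point, like_points, unlike_points, k=10):
    """Hypothetical reconstruction of the Rad-NN confidence C*: the summed
    distance to the k nearest unlike neighbours divided by the summed
    distance to the k nearest like and unlike neighbours. The ratio is
    close to 1 when unlike neighbours are far away relative to like
    neighbours, and lies in the range (0, 1)."""
    point = np.asarray(point, dtype=float)
    d_like = np.sort(np.linalg.norm(np.asarray(like_points) - point, axis=1))[:k]
    d_unlike = np.sort(np.linalg.norm(np.asarray(unlike_points) - point, axis=1))[:k]
    return float(d_unlike.sum() / (d_like.sum() + d_unlike.sum()))
```

A point sitting inside a tight cluster of like neighbours, far from all unlike neighbours, receives a confidence close to 1.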
  • the confidence of classifying a data point F using online model 112 is calculated for example by using Platt scaling on the scores of the decision function.
  • Platt scaling is an algorithm to give a probability estimate on the output of the decision function.
  • x is the data point
  • f(x) is the decision function
  • exp is the standard exponential function.
  • A and B are parameters determined using the data by solving a regularized maximum likelihood problem. Other methods include isotonic regression.
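The Platt scaling mapping described above can be sketched as follows; the parameters A and B are assumed to have been fitted beforehand (by regularized maximum likelihood, as stated), so only the mapping itself is shown:

```python
import math

def platt_probability(f_x, A, B):
    """Map a decision-function score f(x) to a probability estimate via
    P(y=1 | x) = 1 / (1 + exp(A * f(x) + B)). A and B are assumed to
    have been fitted beforehand on held-out decision scores."""
    return 1.0 / (1.0 + math.exp(A * f_x + B))
```

With A negative (the usual fitted sign), large positive decision scores map to probabilities near 1 and large negative scores to probabilities near 0.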
  • the confidence value C*t is checked 204 to determine whether or not it is above a classification threshold value thres.
  • the classification threshold value is user defined. In an embodiment the classification threshold value is 80% or greater. In an embodiment the classification threshold is 90% or greater.
  • the classifier 114 causes the data point F t to be classified 206 using the online model 112.
  • the online model 112 outputs for example a data tuple which represents the data point, a label assigned to the data point, and a second confidence level.
  • the classifier 114 causes the first classification label from the offline model 110 to be assigned 208 to the data point.
  • the first confidence level C*t is checked 210 against the second confidence level.
  • if the first confidence level is higher than the second confidence level then the data point is assigned 208 the first classification label.
  • if the second confidence level is higher than the first confidence level then the data point is assigned 212 the second classification label.
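The decision flow of figure 2 can be sketched as follows; `offline_classify` and `online_classify` are hypothetical stand-ins for the two models, each returning a (label, confidence) pair:

```python
def classify_point(F_t, offline_classify, online_classify, thres):
    """Sketch of the figure-2 flow: keep the offline label when its
    confidence clears the threshold; otherwise consult the online model
    and keep whichever label carries the higher confidence."""
    label1, conf1 = offline_classify(F_t)   # steps 202/204: L*_t, C*_t
    if conf1 >= thres:
        return label1, conf1                # step 208: offline label assigned
    label2, conf2 = online_classify(F_t)    # step 206: consult online model
    if conf1 >= conf2:
        return label1, conf1                # step 208: offline still wins
    return label2, conf2                    # step 212: online label assigned
```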
  • Figure 3 shows an example of learning performed by the online model 112.
  • the online model retains its support vectors and extracts new data from the high confidence database 120 to retrain itself and incrementally learn.
  • the decision boundary of the online model 112 will shift in line with normal traffic changes.
  • Figure 3 shows a graphical representation of a plurality of data points within an SVM model 300.
  • the model 300 includes a boundary line 305.
  • the data points 310 classified as normal are positioned on one side of the boundary line 305.
  • the data points 315 classified as anomalies are positioned on another side of the boundary line 305.
  • Some of the normal data points and some of the anomaly data points represent support vectors. These data points are indicated at 320, 322, 324, 326 within the normal data points 310 and indicated at 330, 332 and 334 within the anomaly data points 315.
  • Data points 320, 322, 324 and 326 are retained within a set of normal data points 350.
  • Data point 326 is no longer a support vector.
  • Data points 330, 332 and 334 are retained within a set of anomaly data points 360. There is a new data point 365 within the set of anomaly data points 360.
  • the boundary line 370 dividing the normal data points 350 from the anomaly data points 360 has shifted slightly to a different position within the model 340 than the boundary line 305 within the model 300.
  • boundary lines 305 and 370 within each of models 300 and 340 are defined by the support vectors within each model. Changes to the support vectors result in changes to the respective boundary lines. Boundary line 370 has shifted to fit the new data in model 340.
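One way to realise this incremental behaviour is to refit an SVM on its current support vectors plus the newly accumulated high-confidence points, so the boundary (305 becoming 370) shifts while old knowledge is retained in compressed form. The sketch below uses scikit-learn as an assumption; the patent does not name a library:

```python
import numpy as np
from sklearn.svm import SVC  # assumed library; any SVM implementation works

def retrain_with_support_vectors(model, X_old, y_old, X_new, y_new):
    """Refit an SVM on its retained support vectors plus new data, letting
    the decision boundary shift with normal traffic changes while the old
    knowledge is kept in compressed form (the support vectors)."""
    sv = model.support_                         # indices of current support vectors
    X = np.vstack([X_old[sv], X_new])
    y = np.concatenate([y_old[sv], y_new])
    fresh = SVC(C=model.C, gamma=model.gamma)   # same hyperparameters
    return fresh.fit(X, y)
```

Non-support-vector points (such as data point 326 in figure 3) are naturally dropped from the retained set, exactly as the figure describes.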
  • Figure 4 shows an example online offline system 400.
  • the system 400 is configured to obtain the data set 106 of figure 1 from the environment 102.
  • the data extractor 104 and/or the pre-processor 108 create the data set 106 from the environment 102.
  • the data set 106 includes at least one data point.
  • System 400 includes a first machine learning model 410 and a second machine learning model 412.
  • the first machine learning model 410 comprises an offline model and the second machine learning model 412 comprises an online model.
  • the offline model 410 is trained with normal data indicated at 414.
  • the offline model 410 outputs an anomaly score for each point. The anomaly score indicates how far a given data point is from a centre or peak of a distribution from which the data point is drawn.
  • the offline model 410 comprises an Autoencoder (AE).
  • An AE comprises an encoder (neural) network, a latent layer and a decoder network.
  • the encoder maps the input data into the latent layer of lower dimensionality, and the decoder reconstructs them again.
  • AEs are able to extract abstract features from high dimensional data in the form of the latent layer and they also give a form of anomaly score such as Reconstruction Error (RE).
  • RE Reconstruction Error
  • the offline model 410 in the form of an AE is trained with only normal data while minimising RE.
  • the assumption is that normal traffic data can be reconstructed easily while anomalous data will have higher RE.
  • This RE measure can be viewed as a vertical distance from the center/peak of the distribution which the AE has learnt. Points with a larger RE tend to mean that the data point is far from the centre.
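Reconstruction Error can be illustrated without any deep-learning framework. The toy linear "autoencoder" below is a hypothetical stand-in for the trained encoder and decoder networks; its latent layer is a one-dimensional projection:

```python
import numpy as np

def reconstruction_error(x, encode, decode):
    """RE = squared L2 distance between a point and its reconstruction.
    Points far from the learnt distribution reconstruct badly and so
    receive a large RE, i.e. a large anomaly score."""
    x = np.asarray(x, dtype=float)
    return float(np.sum((x - decode(encode(x))) ** 2))

# Toy stand-in: the latent layer is the projection onto u = (1, 1)/sqrt(2).
# Data lying along that direction reconstructs perfectly; data off it does not.
u = np.array([1.0, 1.0]) / np.sqrt(2.0)
encode = lambda x: np.array([x @ u])
decode = lambda z: z[0] * u
```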
  • the model 410 determines a similarity score, for example a Reconstruction Error for each point.
  • offline model 410 comprises a complex deep learning model that outputs a form of anomaly score such as RE.
  • the offline model 410 provides
  • the online model 412 is initialised with existing data obtained from the offline model 410.
  • a threshold thres is initialised as a median value of the dataset 414 used to initialise the online model 412.
  • the online model 412 includes an outlier detection process.
  • the parameters of the online model 412 can be updated with an incremental step with new data or completely retrained with new data. Where the online model 412 is retrained with new data it has a 0% knowledge retention rate.
  • the offline model 410 receives an incoming live stream of data to be scored. This live stream is referred to as Xi shown at 416.
  • the stream of data 416 is fed into the offline model 410.
  • the offline model outputs a latent layer representation of the stream of data 416.
  • the latent layer is referred to as z i .
  • the offline model 410 also outputs similarity scores, for example Reconstruction Error values, associated to respective values from the set of x i values.
  • the Reconstruction Error is referred to as
  • the data output from offline model 410 is shown at 418.
  • if the similarity score is below the threshold thres, the live stream of data values x i is classified as normal traffic.
  • Data points with small RE tend to be closer to the centre of the distribution of the training data. These can be safely classified as normal and the online model 412 can use them for training.
  • the live stream is assigned an anomaly score of 0 indicating that the live stream is normal traffic.
  • the live stream is stored in online database 420.
  • the online model 412 is retrained with this new data and its existing knowledge once a retraining criterion is met.
  • online model 412 provides an anomaly score for the live stream x i .
  • the live stream of data values x i and corresponding similarity scores are collated into batches and stored in Batch database 422 to maintain a set or batch of values.
  • the Mann-Whitney test includes a null hypothesis H 0 that the two sets of data are drawn from the same distribution.
  • if the null hypothesis H 0 is not rejected, the median value of the batch will be the new threshold thres.
  • the median value is one example of a value that can be assigned to the threshold thres. It will be appreciated that other percentile values can be assigned to thres. The value of thres is thereby allowed to drift.
  • if the null hypothesis H 0 is rejected, the thres value is not updated because more than q% of the data is significantly different from the normal training set 414.
  • the threshold thres needs to be able to drift, to allow the online model 412 to incorporate concept-drift and variants of existing normal traffic. However, the threshold should also not drift too far as this would cause it to train with anomalous traffic.
  • a network administrator provides knowledge of a suitable value for thres. For example, if it is expected that a certain percentage of the traffic in each batch will be legitimate during day-to-day operations, that can be used as the percentile value in determining a threshold.
  • an assumption is made that at least 50% of the traffic in the network is legitimate.
  • the median value of each batch is taken to be the new threshold thres.
  • the Mann-Whitney test is used to determine when to update the threshold.
  • the above statistical test is not specific to the Mann-Whitney test. It will be appreciated that any suitable test can be used to compare a similarity between two populations or between two sets of data.
  • One example is the Kolmogorov-Smirnov test.
  • Mann-Whitney test may be replaced with another statistical test or tests depending on scenario, data and choice of offline and online model.
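The comparison of a batch against the normal training set can be sketched with SciPy's implementation of the Mann-Whitney test (an assumed library choice; as noted above, any two-sample test can be substituted):

```python
from scipy.stats import mannwhitneyu  # assumed library; any two-sample test works

def batch_is_normal(batch_scores, training_scores, alpha=0.05):
    """Mann-Whitney U test with null hypothesis H0 that the batch's
    similarity scores and the training set's scores come from the same
    distribution. If H0 is not rejected the batch is treated as normal
    and the threshold may be allowed to drift; otherwise it is not."""
    _, p = mannwhitneyu(batch_scores, training_scores, alternative="two-sided")
    return p >= alpha
```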
  • Figure 5 shows an example method 500 of operation of the online offline system 400.
  • the method is a representation of the following algorithm :
  • the offline model 410 receives a data point from the live stream of data points.
  • the offline model 410 as a first machine learning model obtains/outputs 504 a similarity score, for example a reconstruction error value, associated to the data point.
  • the data point is determined 508 to be normal traffic.
  • the data point is associated to an anomaly score of zero.
  • the data point is added 510 to a first database, for example the online database 420.
  • otherwise the data point is classified using a second machine learning model, for example the online model 412.
  • the online model 412 determines an anomaly score that has a different value to the anomaly score determined in step 508 above.
  • an alternative comparison would involve determining the data point as normal traffic if the similarity score is less than or equal to the threshold value, and determining the data point as not normal if the similarity score is greater than the threshold value.
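The algorithm referred to above is not reproduced in this text. The scoring steps of method 500 can be sketched as follows, treating a reconstruction error below thres as normal (the alternative comparison direction just noted would use less-than-or-equal); function names are hypothetical stand-ins:

```python
def score_point(x, offline_re, online_score, thres, online_db):
    """Sketch of method 500: a similarity score below thres marks the
    point as normal (anomaly score 0) and banks it for online retraining;
    anything else is handed to the online model for an anomaly score."""
    re = offline_re(x)            # step 504: similarity score from offline model
    if re < thres:                # step 508: determined to be normal traffic
        online_db.append(x)       # step 510: stored for later retraining
        return 0.0
    return online_score(x)        # otherwise the online model scores it
```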
  • the method 500 optionally includes a check to determine whether the stream of data 416 is suspiciously different from the training set 414.
  • the check is shown at 516 in figure 5 and figure 6. It will be appreciated that the check can work in parallel with respect to the method shown in figure 5.
  • the data points and associated similarity scores are added 518 to a second database.
  • the second database is the batch database 422.
  • One example test is the Mann-Whitney test performed to check 520 whether or not the new data points indicate suspicious activity.
  • the test checks for data that is anomalous to and/or significantly different from previously observed data. If the data stream does not indicate suspicious activity then control is passed through 522 back to the method shown in figure 5. On the other hand, if the data stream does indicate suspicious activity then the data point is added 524 to a low confidence database. Control is then passed back through 526 to the method shown in figure 5.
  • the method 500 optionally includes a technique for dynamically determining the threshold value thres as an alternative to the user having to specify a value. An example of a technique for dynamically determining a value for thres is shown at 530 in figure 5 and figure 7. It will be appreciated that the dynamic adjustment can work in parallel with respect to the method shown in figure 5.
  • a batch of similarity values is received 532.
  • the method includes the optional step of removing 534 outliers from the received batch.
  • a test is performed 536 to determine whether the batch of similarity values is normal.
  • the test could include any statistical test or similarity checking method. If the batch is determined to be normal then in an embodiment the thres value is updated 538 to the median RE value of the batch. It will be appreciated that the median RE value is one example. For example, the thres value could be updated to a user defined percentile value of a similarity score of the batch.
  • Control is then passed back to the method shown in figure 5 through 540.
  • This technique allows the value of thres to drift with data sets that are evolving.
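The dynamic adjustment of step 530 can be sketched as follows; `batch_is_normal` is a pluggable similarity test (a hypothetical name for whatever statistical test is chosen) and the percentile defaults to the median:

```python
import numpy as np

def update_threshold(thres, batch, batch_is_normal, percentile=50.0):
    """Sketch of step 530: optionally trim outliers from the batch, test
    whether the batch looks like normal traffic, and if so let thres
    drift to a chosen percentile of the batch's similarity scores (the
    median by default). A suspicious batch leaves thres unchanged."""
    batch = np.asarray(batch, dtype=float)
    lo, hi = np.percentile(batch, [5, 95])
    trimmed = batch[(batch >= lo) & (batch <= hi)]        # step 534: outlier removal
    if batch_is_normal(trimmed):                          # step 536: similarity test
        return float(np.percentile(trimmed, percentile))  # step 538: thres drifts
    return thres
```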
  • the system 100 is trained by selecting the data set 106 from a training set, for example the NSL-KDD 2009 dataset.
  • the data set 106 includes four different types of attacks namely, Denial of Service (DoS), Probe, Remote to Local (R2L) and User to Root (U2R).
  • DoS Denial of Service
  • Probe Probe
  • R2L Remote to Local
  • U2R User to Root
  • the original data set is pre-processed to replace the labels with a binary value of either normal or anomaly.
  • the original data set includes 41 features, of which 3 are nominal, 6 are binary and the remaining features are numerical.
  • the pre-processor 108 performs one-hot encoding of the nominal features to make them into a total of 90 binary features. All binary features are decomposed into 8 components using Principal Component Analysis (PCA). These 8 components explain at least 80% of the variability of these binary features. The 'num outbound cmds' feature is removed from the data set as it only takes the value 0 in the entire dataset.
  • PCA Principal Component Analysis
  • One example of a one-hot encoding process transforms nominal or categorical variables such as 'PROTOCOL' (which takes either 'tcp', 'udp' or 'icmp'), into binary variables. For each value, it adds another variable.
  • the variables 'TCP', 'UDP' and 'ICMP' take the value 1 according to the data point's protocol and 0 on the other two variables.
  • Other methods include Label encoding that changes nominal variables into numerical values.
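The PROTOCOL example above can be made concrete with a minimal sketch:

```python
def one_hot(value, categories=("tcp", "udp", "icmp")):
    """Expand one nominal value into a binary indicator per category,
    as in the 'PROTOCOL' example: the matching category takes 1 and
    the other variables take 0."""
    return {c.upper(): int(value == c) for c in categories}
```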
  • the pre-processor 108 normalises the values to the range [0,1] so that they can be compared.
  • Min-Max scaling is used for the 8 PCA components. Min-Max scaling takes each value and subtracts the minimum value and divides over the feature range.
  • Min-Max scaling is not used for the numerical features due to the effect of outliers which can highly distort the data.
  • the following method is used for normalisation :
  • In the normalisation equation, f i is the observed value of the feature, e represents the Euler number (e ≈ 2.7182818) and c i is a constant. Each numerical feature has its own c i value. To facilitate adaptation to normal traffic, the constant c i is determined such that the average of feature i of the normal traffic instances is mapped to 0.5. If this is not possible, c i is chosen such that the average of feature i of all traffic instances in the training set is mapped to 0.5.
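The normalisation equation itself is not legible in this text. A reconstruction consistent with the stated constraints (a per-feature constant, outlier-robust, mapping the feature average to 0.5) is x' = 1 - e^(-f/c) with c = mean / ln 2, sketched here under that assumption:

```python
import math

def fit_constant(values):
    """Pick the per-feature constant so the feature's average maps to 0.5
    under x -> 1 - e^(-x/c): solving 1 - e^(-mean/c) = 0.5 gives
    c = mean / ln 2. (Reconstructed form; the patent's exact equation is
    not reproduced in this text.)"""
    mean = sum(values) / len(values)
    return mean / math.log(2.0)

def normalise(x, c):
    """Map a non-negative raw feature value into the range [0, 1)."""
    return 1.0 - math.exp(-x / c)
```

Unlike Min-Max scaling, this mapping compresses large outliers smoothly towards 1 instead of letting them distort the scale of the remaining values.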
  • the offline model 110 is trained using a weighted Euclidean distance as follows:
  • One advantage of using this metric is that some features of the data set 106 have more predictive power in relation to detection of anomalies.
  • without feature weighting, the offline model 110 does not learn that some features are more discriminative than others.
  • the feature weights are computed based on Chi Squared statistic and rescaled to the range [0, 1] using Min-Max scaling.
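The weighted Euclidean distance described above (the formula is not reproduced in this text) presumably takes the form D(a, b) = sqrt(sum_i w_i (a_i - b_i)^2), sketched here with precomputed weights:

```python
import math

def weighted_euclidean(a, b, w):
    """Distance in which feature i contributes in proportion to its
    weight w[i] (e.g. a chi-squared statistic rescaled to [0, 1]),
    so more discriminative features dominate the metric."""
    return math.sqrt(sum(wi * (ai - bi) ** 2 for wi, ai, bi in zip(w, a, b)))
```

Setting a feature's weight to 0 removes it from the metric entirely; uniform weights of 1 recover the ordinary Euclidean distance.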
  • 20% of the data set 106 is randomly selected to determine the value for the radius. For each data point in the selected set, the average of its distances from all other points is computed. The 50th percentile of these distances is selected as the radius because the distribution is fat-tailed.
  • the optimal radius does to a certain extent depend on the number of points in the offline model 110 and how far each one is from each other.
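The radius selection just described can be sketched as follows; the function name and random seed are illustrative:

```python
import numpy as np

def select_radius(points, sample_frac=0.2, percentile=50, seed=0):
    """Radius selection as described: randomly sample 20% of the data,
    compute each sampled point's average distance to all other points,
    and take the 50th percentile of those averages (the median, chosen
    because the distance distribution is fat-tailed)."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(points)
    idx = rng.choice(n, size=max(1, int(sample_frac * n)), replace=False)
    averages = [np.linalg.norm(points - points[i], axis=1).sum() / (n - 1)
                for i in idx]  # average distance from each sampled point
    return float(np.percentile(averages, percentile))
```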
  • k is given a value of greater than 1.
  • k is given a value of 10.
  • k is given a value in the range 10 to 15.
  • the online model 112 is trained using 1000 data points randomly selected from, or the first 1000 data points in, the data set 106.
  • the online model 112 must bring new knowledge and must be trained using different data points.
  • the system 100 is then tested on the remaining data points in the testing set.
  • the online model 112 is intended to grasp the network traffic as is at that point in time.
  • high values for parameters C and γ are used to obtain high variance and low bias.
  • FPR false positive rate
  • TPR true positive rate
  • Acc accuracy
  • the system 100 was trained with the following initial parameters:
  • the results show an average prediction time for each data point of 0.08s.
  • the system achieves 94.14% accuracy, 91.68% detection rate and 2.59% false positive rate.
  • the online model 112 was trained 13 times whilst predicting 21,544 data points in the testing set.
  • the online model 112 has been found to learn continuously.
  • the offline model 110 has been found to perform knowledge accumulation in the knowledge base in the form of data and its neighbourhood. This knowledge has been successfully leveraged using a confidence score for the online model 112 to select new points to learn from. All features from the training data set were used, which introduces noise.
  • the system 100 can be trained on specific features. For example, to detect R2L attacks, preferred features include duration of connection, service requested and number of failed login attempts instead of all features. For detecting DoS attacks, aggregating flows by source and destination instead of a 5 tuple and using number of sources and average packet sizes may be better suited.
  • the high confidence threshold u lies in the range 0.925 to 0.975.
  • the number of online models 112 trained decreases as u is increased because there are fewer points and the online model has to wait longer to obtain 1000 points for retraining.
  • the prediction threshold thres of the offline model has an initial parameter value of 80% or greater. In an embodiment the threshold thres has an initial parameter value of 90%. It is important for the system 100 to balance the contribution of the online model 112 to the prediction process.
  • the online model 112 is initialised with between 500 and 1,500 data points. Initialising the online model with more than 1,500 is not found to greatly increase TPR or Accuracy. FPR is not impacted greatly. It has been found that after 500 points, the number of online models 112 retrained does not vary much as well.
  • a value for k in the range 10 to 15 is suitable. This is the number of like and unlike neighbours to consider when the offline model 110 calculates its confidence. A higher number provides an upward trend in FPR. This is because the system is calculating confidence based on points that are further away. Furthermore,
  • the first machine learning model 110 and the second machine learning model 112 work together instead of working independently.
  • the second machine learning model 112 builds on/incorporates the knowledge of the first machine learning model 110 to perform classification.
  • the first machine learning model 110 builds
  • the classified data points and respective confidence levels are used to update the first machine learning model 110 in addition to updating the second machine learning model 112.
  • the second machine learning model 112 incorporates at least some knowledge of the first machine learning model 110 in its training phase.
  • the second machine learning model 112 includes an SVM
  • optimal values for C and γ can be determined by an algorithm in the first machine learning model 110.
  • Figure 8 shows an embodiment of a suitable computing environment to implement embodiments of one or more of the systems and methods disclosed above.
  • Example computing devices include, but are not limited to, personal computers, server computers, hand-held or laptop devices, mobile devices, multiprocessor systems, consumer electronics, mini computers, mainframe computers, and distributed computing environments that include any of the above systems or devices.
  • mobile devices include mobile phones, tablets, and Personal Digital Assistants (PDAs).
  • PDAs Personal Digital Assistants
  • computer readable instructions are implemented as program modules.
  • program modules include functions, objects, Application Programming Interfaces (APIs), and data structures.
  • Shown in figure 8 is a system 800 comprising a computing device 805 configured to implement one or more embodiments described above.
  • computing device 805 includes at least one processing unit 810 and memory 815.
  • memory 815 is volatile (such as RAM, for example), non-volatile (such as ROM, flash memory, etc., for example) or some combination of the two.
  • a server 820 is shown by a dashed line notionally grouping processing unit 810 and memory 815 together.
  • computing device 805 includes additional features and/or functionality, for example additional storage including, but not limited to, magnetic storage and optical storage.
  • additional storage is illustrated in Figure 8 as storage 825.
  • computer readable instructions to implement components provided herein are maintained in storage 825.
  • Examples of components implemented by such computer readable instructions include one or more of the data extractor 104, the pre-processor 108, the first machine learning model 110, the second machine learning model 112, the classifier 114, the high confidence database 120, the model trainer 122 and/or the low confidence database 130. Further examples include some or all of the components shown in figure 4.
  • storage 825 stores other computer readable instructions to implement an operating system and/or an application program.
  • Computer readable instructions are loaded into memory 815 for execution by processing unit 810, for example.
  • Memory 815 and storage 825 are examples of computer storage media.
  • Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVDs) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computing device 805. Any such computer storage media may be part of device 805.
  • computing device 805 includes at least one communication connection 840 that allows device 805 to communicate with other devices.
  • the at least one communication connection 840 includes one or more of a modem, a Network Interface Card (NIC), an integrated network interface, a radio frequency transmitter/receiver, an infrared port, a USB connection, or other interfaces for connecting computing device 805 to other computing devices.
  • communication connection(s) 840 facilitate a wired connection, a wireless connection, or a combination of wired and wireless connections.
  • Communication connection(s) 840 transmit and/or receive communication media.
  • Communication media typically embodies computer readable instructions or other data in a "modulated data signal” such as a carrier wave or other transport mechanism and includes any information delivery media.
  • modulated data signal includes a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal.
  • device 805 includes at least one input device 845 such as a keyboard, mouse, pen, voice input device, touch input device, infrared camera, video input device, and/or any other input device.
  • Device 805 also includes at least one output device 850 such as one or more displays, speakers, printers, and/or any other output device.
  • Input device(s) 845 and output device(s) 850 are connected to device 805 via a wired connection, wireless connection, or any combination thereof.
  • an input device or an output device from another computing device is/are used as input device(s) 845 or output device(s) 850 for computing device 805.
  • components of computing device 805 are connected by various interconnects, such as a bus.
  • interconnects include one or more of a Peripheral Component Interconnect (PCI), such as PCI Express, a Universal Serial Bus (USB), FireWire (IEEE 1394), and an optical bus structure.
  • PCI Peripheral Component Interconnect
  • USB Universal Serial Bus
  • IEEE 1394 FireWire
  • components of computing device 805 are interconnected by a network.
  • memory 815 in an embodiment comprises multiple physical memory units located in different physical locations interconnected by a network.
  • storage devices used to store computer readable instructions may be distributed across a network.
  • a computing device 855 accessible via a network 860 stores computer readable instructions to implement one or more embodiments provided herein.
  • Computing device 805 accesses computing device 855 in an embodiment and downloads a part or all of the computer readable instructions for execution. Alternatively, computing device 805 downloads portions of the computer readable instructions, as needed. In an embodiment, some instructions are executed at computing device 805 and some at computing device 855.
  • the client application 885 is provided as a thin client application configured to run within a web browser. In an embodiment the client application 885 is provided as an application on a user device. It will be appreciated that application 885 in an embodiment is associated to computing device 805 or another computing device.

Abstract

An aspect of the invention provides a classification system (100) configured to classify at least one data point within a data set (106). The system comprises a first machine learning model (110) in which is maintained a first plurality of data points associated to respective first labels; a second machine learning model (112) in which is maintained a second plurality of data points associated to respective second labels; and a classifier (114). The classifier (114) is configured to: classify the at least one data point using the first machine learning model (110) to a first confidence level to determine at least one first classification label; responsive to the first confidence level having a value greater than or equal to a classification threshold value: classifying the at least one data point with the at least one first classification label to the first confidence level; and responsive to the first confidence level having a value of zero and/or the first confidence level having a value below the classification threshold value: classifying the at least one data point using the second machine learning model to a second confidence level to determine at least one second classification label, responsive to the first confidence level having a value less than the second confidence level: classifying the at least one data point with the at least one second classification label, and responsive to the first confidence level having a value greater than or equal to the second confidence level: classifying the at least one data point with the at least one first classification label to the second confidence level.

Description

HYBRID MACHINE LEARNING SYSTEM AND METHOD
FIELD OF THE INVENTION
The invention relates to a hybrid machine learning system and method, particularly for use in anomaly detection in time series data. Examples of time series data include network data, water & electricity consumption, industrial production data, chemical concentration readings and monthly sunspot numbers.
BACKGROUND OF THE INVENTION
Detecting network anomalies is not a well posed problem. One of the main challenges in detecting anomalies is the changing normal network traffic. With the advancement in technology, new applications are being created almost daily around the world. There are many types of anomalies in the network ranging from attacks to faulty devices. But there are also many types of normal traffic. Also with the recent advances in the Internet of Things (IoT), normal traffic profiles display large variations and heterogeneity. Many network anomaly detection techniques miss this aspect of the network and focus on algorithms.
One solution to this problem is to use a machine learning model trained from a dataset and then deployed in the real-world. However, whenever the model's efficiency withers, it needs to be retrained with new data. In a supervised learning context, obtaining new data and labelling them manually is painful. Network traffic changes faster than one can label and retrain a model. In an unsupervised learning context, it is desirable to spot outliers. Unfortunately, new normal traffic may also be outliers at a different point in time such as Flash-Crowd traffic.
To combat the respective weaknesses of supervised learning models and unsupervised learning models, hybrid models are used. One example of a hybrid model comprises an offline learning model and an online learning model. Offline learning models are the traditional machine learning models. They can go deep and pick out patterns that are not easily recognisable by humans. However, they suffer from lengthy training times. Online learning models are in the form of incremental learners where they learn in time windows as new data arrives. They are required to be fast and thus unable to go too deep to recognise intricate patterns in the data. One of the main challenges in such hybrid approaches is enabling an online and offline model to work together in a logical and effective manner to detect anomalies.
Furthermore, another aspect of machine learning is that supervised, unsupervised and hybrid models tend to have a bias-variance trade-off. Machine learning models need to be accurate to training data and also be able to generalise to unseen data. In the network anomaly context, both false positives and false negatives can be costly.
It is an object of at least preferred embodiments of the present invention to address at least some of the aforementioned disadvantages. An additional or alternative object is to at least provide the public with a useful choice.
SUMMARY OF THE INVENTION
In accordance with an aspect of the invention, a method of classifying at least one data point within a data set comprises maintaining, in a first machine learning model, a first plurality of data points associated to respective first labels; maintaining, in a second machine learning model, a second plurality of data points associated to respective second labels; classifying the at least one data point using the first machine learning model to a first confidence level to determine at least one first classification label;
responsive to the first confidence level having a value greater than or equal to a classification threshold value: classifying the at least one data point with the at least one first classification label to the first confidence level; and responsive to the first confidence level having a value of zero and/or the first confidence level having a value below the classification threshold value: classifying the at least one data point using the second machine learning model to a second confidence level to determine at least one second classification label, responsive to the first confidence level having a value less than the second confidence level: classifying the at least one data point with the at least one second classification label, and responsive to the first confidence level having a value greater than or equal to the second confidence level: classifying the at least one data point with the at least one first classification label to the second confidence level.
The term 'comprising' as used in this specification means 'consisting at least in part of. When interpreting each statement in this specification that includes the term
'comprising', features other than that or those prefaced by the term may also be present. Related terms such as 'comprise' and 'comprises' are to be interpreted in the same manner.

In an embodiment the classification threshold value is user defined. In an embodiment the classification threshold value is 80% or greater. In an embodiment the classification threshold is 90% or greater. In an embodiment the classification threshold lies in the range of 80% to 90%.
In an embodiment the method further comprises maintaining the at least one data point in a high confidence database responsive to the at least one data point being classified to a classification confidence level having a value greater than an upper threshold confidence level.
In an embodiment the upper threshold confidence level is 92.5% or greater. In an embodiment the upper threshold confidence level lies in the range 92.5% to 97.5%. In an embodiment the upper threshold confidence level is 95%.
In an embodiment the method further comprises maintaining the at least one data point in a low confidence database responsive to the at least one data point being classified to a classification confidence level having a value below a lower threshold confidence level.
In an embodiment the lower threshold confidence level is lower than the upper threshold confidence level. In an embodiment the lower threshold confidence level is less than 95%. In an embodiment the lower threshold confidence level lies in the range 70% to 95%. In an embodiment the lower threshold confidence level is less than 92.5%. In an embodiment the lower threshold confidence level is less than 80%.
In an embodiment the first machine learning model comprises a Radius Nearest Neighbour (Rad-NN) model.
In an embodiment classifying the at least one data point using the first machine learning model includes counting votes of all neighbouring points positioned within a specified radius r.
In an embodiment the method comprises determining the first confidence level as a ratio of k nearest like neighbours to k nearest unlike neighbours, k having a value in the range 10 to 15.
In an embodiment the second machine learning model comprises a Support Vector Machine (SVM) model having parameters C and γ. In an embodiment the values of C and/or γ are user-defined. In an embodiment, the value of C is approximately 1,000. In an embodiment the value of γ is approximately 0.1. In an embodiment the method further comprises training the second machine learning model from at least one data point retrieved from the high confidence database.
In accordance with a further aspect of the invention, a classification system configured to classify at least one data point within a data set comprises a first machine learning model in which is maintained a first plurality of data points associated to respective first labels; a second machine learning model in which is maintained a second plurality of data points associated to respective second labels; and a classifier configured to: classify the at least one data point using the first machine learning model to a first confidence level to determine at least one first classification label; responsive to the first confidence level having a value greater than or equal to a classification threshold value: classify the at least one data point with the at least one first classification label to the first confidence level; and responsive to the first confidence level having a value of zero and/or the first confidence level having a value below the classification threshold value: classify the at least one data point using the second machine learning model to a second confidence level to determine at least one second classification label, responsive to the first confidence level having a value less than the second confidence level: classify the at least one data point with the at least one second classification label, and responsive to the first confidence level having a value greater than or equal to the second confidence level: classify the at least one data point with the at least one first classification label to the second confidence level.
In an embodiment the classification threshold value is user defined. In an embodiment the classification threshold value is 80% or greater. In an embodiment the classification threshold is 90% or greater. In an embodiment the classification threshold lies in the range of 80% to 90%.
In an embodiment the system further comprises a high confidence database in which is maintained the at least one data point responsive to the at least one data point being classified to a classification confidence level having a value greater than an upper threshold confidence level.
In an embodiment the upper threshold confidence level is 92.5% or greater. In an embodiment the upper threshold confidence level lies in the range 92.5% to 97.5%. In an embodiment the upper threshold confidence level is 95%.
In an embodiment the system further comprises a low confidence database in which is maintained the at least one data point responsive to the at least one data point being classified to a classification confidence level having a value below a lower threshold confidence level. In an embodiment the lower threshold confidence level is lower than the upper threshold confidence level. In an embodiment the lower threshold confidence level is less than 95%. In an embodiment the lower threshold confidence level lies in the range 70% to 95%. In an embodiment the lower threshold confidence level is less than 92.5%. In an embodiment the lower threshold confidence level is less than 80%.
In an embodiment the first machine learning model comprises a Radius Nearest Neighbour (Rad-NN) model.
In an embodiment classifying the at least one data point using the first machine learning model includes counting votes of all neighbouring points positioned within a specified radius r.
In an embodiment the classifier is configured to determine the first confidence level as a ratio of k nearest like neighbours to k nearest unlike neighbours, k having a value in the range 10 to 15.
In an embodiment the second machine learning model comprises a Support Vector Machine (SVM) model having parameters C and γ. In an embodiment the values of C and/or γ are user-defined. In an embodiment, the value of C is approximately 1,000. In an embodiment the value of γ is approximately 0.1.
In an embodiment the system further comprises a model trainer configured to train the second machine learning model from at least one data point retrieved from the high confidence database.
In accordance with a further aspect of the invention, a computer-readable medium has stored thereon computer-executable instructions that, when executed by a processor, cause the processor to perform a method of classifying at least one data point within a data set, the method comprising: maintaining, in a first machine learning model, a first plurality of data points associated to respective first labels; maintaining, in a second machine learning model, a second plurality of data points associated to respective second labels; classifying the at least one data point using the first machine learning model to a first confidence level to determine at least one first classification label;
responsive to the first confidence level having a value greater than or equal to a classification threshold value: classifying the at least one data point with the at least one first classification label to the first confidence level; and responsive to the first confidence level having a value of zero and/or the first confidence level having a value below the classification threshold value: classifying the at least one data point using the second machine learning model to a second confidence level to determine at least one second classification label, responsive to the first confidence level having a value less than the second confidence level: classifying the at least one data point with the at least one second classification label, and responsive to the first confidence level having a value greater than or equal to the second confidence level: classifying the at least one data point with the at least one first classification label to the second confidence level.
In an embodiment the classification threshold value is user defined. In an embodiment the classification threshold value is 80% or greater. In an embodiment the classification threshold is 90% or greater. In an embodiment the classification threshold lies in the range of 80% to 90%.
In an embodiment the method further comprises maintaining the at least one data point in a high confidence database responsive to the at least one data point being classified to a classification confidence level having a value greater than an upper threshold confidence level.
In an embodiment the upper threshold confidence level is 92.5% or greater. In an embodiment the upper threshold confidence level lies in the range 92.5% to 97.5%. In an embodiment the upper threshold confidence level is 95%.
In an embodiment the method further comprises maintaining the at least one data point in a low confidence database responsive to the at least one data point being classified to a classification confidence level having a value below a lower threshold confidence level.
In an embodiment the lower threshold confidence level is lower than the upper threshold confidence level. In an embodiment the lower threshold confidence level is less than 95%. In an embodiment the lower threshold confidence level lies in the range 70% to 95%. In an embodiment the lower threshold confidence level is less than 92.5%. In an embodiment the lower threshold confidence level is less than 80%.
In an embodiment the first machine learning model comprises a Radius Nearest Neighbour (Rad-NN) model.
In an embodiment classifying the at least one data point using the first machine learning model includes counting votes of all neighbouring points positioned within a specified radius r.
In an embodiment the method comprises determining the first confidence level as a ratio of k nearest like neighbours to k nearest unlike neighbours, k having a value in the range 10 to 15. In an embodiment the second machine learning model comprises a Support Vector Machine (SVM) model having parameters C and γ. In an embodiment the values of C and/or γ are user-defined. In an embodiment, the value of C is approximately 1,000. In an embodiment the value of γ is approximately 0.1.
In an embodiment the method further comprises training the second machine learning model from at least one data point retrieved from the high confidence database.
In accordance with a further aspect of the invention, a method of detecting anomalies within a set of data points comprises receiving a data point from the set of data points; determining a similarity score associated to the data point using a first machine learning model; responsive to the data point having a similarity score less than a threshold value: determining a first value for an anomaly score associated to the data point, and adding the data point to a first database; and responsive to the data point having a similarity score greater than the threshold value: determining a second value for the anomaly score associated to the data point, the second value not equal to the first value.
In an embodiment the method further comprises adding the similarity score associated to the data point to a first batch of similarity scores maintained in a second database; and responsive to determining that the first batch of similarity scores is significantly different to a second batch of similarity scores: adding the data points associated to the respective similarity scores in the first batch to a low confidence database.
In an embodiment the method further comprises receiving the first batch of similarity scores; and responsive to determining that the first batch of similarity scores is not significantly different to the second batch of similarity scores: updating the threshold value.
In an embodiment the method further comprises removing outliers from the first batch of similarity scores.
In an embodiment, updating the threshold value comprises determining a median value of the first batch of similarity scores; and setting the threshold value to the median value.
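A sketch of the threshold update described above. The median update follows the embodiment; the 1.5 × IQR outlier rule is an assumed choice, as the specification does not fix a particular outlier-removal method:

```python
import statistics

def update_threshold(similarity_scores):
    """Remove outliers from a batch of similarity scores using a
    1.5*IQR rule (an assumed rule; the specification leaves the
    method open), then return the median as the new threshold."""
    scores = sorted(similarity_scores)
    q1, _, q3 = statistics.quantiles(scores, n=4)
    iqr = q3 - q1
    kept = [s for s in scores
            if q1 - 1.5 * iqr <= s <= q3 + 1.5 * iqr]
    return statistics.median(kept)
```

For example, a batch of scores near 0.5 containing one stray high value would yield a new threshold of 0.5 once the stray value is discarded.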
In accordance with a further aspect of the invention, an anomaly detection system configured to detect anomalies within a set of data points comprises: a first machine learning model; a second machine learning model; and a processor. The processor is configured to: receive a data point from the set of data points; determine a similarity score associated to the data point using the first machine learning model; responsive to the data point having a similarity score less than a threshold value: determine a first value for an anomaly score associated to the data point, and add the data point to a first database; and responsive to the data point having a similarity score greater than the threshold value: determine a second value for the anomaly score associated to the data point, the second value not equal to the first value.
In accordance with a further aspect of the invention, a computer-readable medium has stored thereon computer-executable instructions that, when executed by a processor, cause the processor to perform a method of detecting anomalies within a set of data points. The method comprises receiving a data point from the set of data points;
determining a similarity score associated to the data point using a first machine learning model; responsive to the data point having a similarity score less than a threshold value: determining a first value for an anomaly score associated to the data point, and adding the data point to a first database; and responsive to the data point having a similarity score greater than the threshold value: determining a second value for the anomaly score associated to the data point, the second value not equal to the first value.
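The scoring rule of this aspect reduces to a comparison against the threshold. A minimal sketch; the concrete scores of 1 (anomalous) and 0 (normal) are assumptions, since the specification only requires the two values to differ:

```python
def score_data_point(similarity, threshold, first_db):
    """Assign an anomaly score: points whose similarity falls below
    the threshold receive the first (anomalous) score and are added
    to the first database; others receive the second (normal) score."""
    if similarity < threshold:
        first_db.append(similarity)
        return 1  # first value: treated as anomalous
    return 0      # second value: treated as normal
```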
The invention in one aspect comprises several steps. The relation of one or more of such steps with respect to each of the others, the apparatus embodying features of construction, and combinations of elements and arrangement of parts that are adapted to effect such steps, are all exemplified in the following detailed disclosure.
This invention may also be said broadly to consist in the parts, elements and features referred to or indicated in the specification of the application, individually or collectively, and any or all combinations of any two or more said parts, elements or features, and where specific integers are mentioned herein which have known equivalents in the art to which this invention relates, such known equivalents are deemed to be incorporated herein as if individually set forth.
In addition, where features or aspects of the invention are described in terms of Markush groups, those persons skilled in the art will appreciate that the invention is also thereby described in terms of any individual member or subgroup of members of the Markush group.
As used herein, '(s)' following a noun means the plural and/or singular forms of the noun.
As used herein, the term 'and/or' means 'and' or 'or' or both.
As used herein, the term 'computer-readable medium' should be taken to include a single medium or multiple media. Examples of multiple media include a centralised or distributed database and/or associated caches. These multiple media store the one or more sets of computer executable instructions. The term 'computer-readable medium' should also be taken to include any medium that is capable of storing, encoding or carrying a set of instructions for execution by a processor and that cause the processor to perform any one or more of the methods described above. The computer-readable medium is also capable of storing, encoding or carrying data structures used by or associated with these sets of instructions. The term 'computer-readable medium' includes solid-state memories, optical media and magnetic media.
As used herein, the terms 'component', 'module', 'system', 'interface', and/or the like in relation to a processor are generally intended to refer to a computer-related entity, either hardware, a combination of hardware and software, software, or software in execution. For example, a component may be, but is not limited to being, a process running on a processor, a processor, an object, an executable, a thread of execution, a program, and/or a computer. By way of illustration, both an application running on a controller and the controller can be a component. One or more components may reside within a process and/or thread of execution and a component may be localized on one computer and/or distributed between two or more computers.
As used herein, the term 'connected to' in relation to data or signal transfer includes all direct or indirect types of communication, including wired and wireless, via a cellular network, via a data bus, or any other computer structure. It is envisaged that there may be intervening elements between the connected integers. Variants such as 'in communication with', 'joined to', and 'attached to' are to be interpreted in a similar manner. Related terms such as 'connecting' and 'in connection with' are to be interpreted in the same manner.
It is intended that reference to a range of numbers disclosed herein (for example, 1 to 10) also incorporates reference to all rational numbers within that range (for example, 1, 1.1, 2, 3, 3.9, 4, 5, 6, 6.5, 7, 8, 9, and 10) and also any range of rational numbers within that range (for example, 2 to 8, 1.5 to 5.5, and 3.1 to 4.7) and, therefore, all sub-ranges of all ranges expressly disclosed herein are hereby expressly disclosed.
These are only examples of what is specifically intended and all possible combinations of numerical values between the lowest value and the highest value enumerated are to be considered to be expressly stated in this application in a similar manner.
In this specification where reference has been made to patent specifications, other external documents, or other sources of information, this is generally for the purpose of providing a context for discussing the features of the invention. Unless specifically stated otherwise, reference to such external documents or such sources of information is not to be construed as an admission that such documents or such sources of information, in any jurisdiction, are prior art or form part of the common general knowledge in the art.
Although the present invention is broadly as defined above, those persons skilled in the art will appreciate that the invention is not limited thereto and that the invention also includes embodiments of which the following description gives examples.
BRIEF DESCRIPTION OF THE DRAWINGS
Preferred forms of the hybrid machine learning system and method will now be described by way of example only with reference to the accompanying figures in which:
Figure 1 shows an example hybrid machine learning system;
Figure 2 shows an example method of classifying data points within a data set;
Figure 3 shows an example of learning performed by the online model of figure 1;
Figure 4 shows another example of a hybrid machine learning system;
Figures 5-7 show an example method of detecting anomalies using the system of figure 4; and
Figure 8 shows an example of a suitable computing environment to implement embodiments of one or more of the systems and methods of figures 1 to 7.
DETAILED DESCRIPTION
Figure 1 shows an example hybrid machine learning system 100. Shown in figure 1 is an environment 102 from which a data extractor 104 obtains a data set 106.
The data set 106 includes at least one data point. In an embodiment the data point(s) obtained by the data extractor 104 are passed to a pre-processor 108. Examples of the pre-processor operation are further described below.
System 100 includes a first machine learning model 110 and a second machine learning model 112. The first machine learning model 110 includes a first plurality of data points associated to respective first labels. The second machine learning model 112 includes a second plurality of data points associated to respective second labels. In an embodiment the system 100 includes additional machine learning models over and above the first machine learning model 110 and the second machine learning model 112.
In an embodiment the first machine learning model 110 comprises an offline model. The offline model 110 is typically trained on labelled and analysed data. The offline model 110 contains knowledge of general characteristics of data and is optimised offline. The training time and depth of the offline model can be high and it remains static during application time.
One example of a suitable offline model 110 is a Radius Nearest Neighbour (Rad-NN) model. This model is one form of nearest neighbour model. Nearest Neighbour models are also known as lazy learners because they do not learn a discriminative function but "memorize" the dataset. A Nearest Neighbour model simply maintains a knowledge base of points in hyperspace and when tasked with classifying a new point, p, calculates the distance between p and its neighbours. It then counts the votes or classes of the neighbouring points and classifies p. This is an appropriate model to retain knowledge or useful points that encompass the general features of network traffic.
The Rad-NN model counts the votes of all neighbouring points within a specified radius. The voting can be done based on equal weighting for all of the points, or distance weighting where closer points would have a higher weight on their vote for p.
The parameter k represents the number of like and unlike neighbours to consider when the offline model 110 calculates its confidence. In an embodiment the value of k is in the range 10 to 15. In an embodiment a confidence of the classification is determined by a ratio of nearest like neighbours to nearest unlike neighbours.
In this context, if p is too far away there will be no neighbouring points within the specified radius. The point p will be an outlier and the model will be unable to classify it. The complexity of the search is O(d x log2(n)), where n is the number of points and d is the number of features, with an implementation based on a Ball tree structure, which is efficient compared to many algorithms.
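The radius voting and the like/unlike-neighbour confidence described above can be sketched in Python. This is a minimal illustration rather than the patented implementation; the function names, the uniform-weight voting, and the use of Euclidean distance are assumptions:

```python
import math

def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rad_nn_classify(point, knowledge_base, radius):
    """Classify `point` by counting the votes of all neighbours
    within `radius`. Returns None if no neighbour is in range,
    i.e. the point is an outlier the model cannot classify."""
    votes = {}
    for features, label in knowledge_base:
        if distance(point, features) <= radius:
            votes[label] = votes.get(label, 0) + 1
    if not votes:
        return None
    return max(votes, key=votes.get)

def rad_nn_confidence(point, knowledge_base, label, k):
    """Distance-based confidence comparing the k nearest like
    neighbours against the k nearest unlike neighbours; close to 1
    when unlike neighbours are far relative to like neighbours."""
    like = sorted(distance(point, f)
                  for f, l in knowledge_base if l == label)[:k]
    unlike = sorted(distance(point, f)
                    for f, l in knowledge_base if l != label)[:k]
    if not unlike:
        return 1.0
    return sum(unlike) / (sum(like) + sum(unlike))
```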
In an embodiment the offline model 110 maintains a knowledge base with data, Fp*, where p = 1, ..., n, and their corresponding labels Lp*.
In an embodiment the second machine learning model 112 comprises an online model. The online model 112 is required to have low training time. It also needs to be adaptive and continuously change as the ground truth changes. The online model 112 retrains itself in intervals and with new data based on confidence scores. One example of a suitable online model 112 is a Support Vector Machine (SVM) model.
A Support Vector Machine (SVM) is a supervised machine learning algorithm that learns an optimal hyperplane or decision boundary to separate two classes of data. This hyperplane learnt by the SVM is described by a decision function. The SVM uses a kernel trick to obtain the hyperplane in a higher dimensional space. The kernel trick avoids the explicit mapping to the higher dimension needed to learn the hyperplane and reduces computation operations.
In an embodiment the kernel is based on a Radial Basis Function (RBF). The RBF kernel is specified by regularisation parameters C and γ. The C parameter trades off correct classification during training for a simpler hyperplane. A lower C value yields a simpler hyperplane and a higher C value yields a more complex hyperplane. The γ parameter defines how far the influence of each training data point reaches, with a low value meaning 'far' and vice versa.
Optimal values for these parameters depend on the data and/or the user. These values are derived based on classification performance or validation using new data. Support Vectors within the SVM are the data points close to the hyperplane and they influence its position and orientation. After training, the Support Vectors completely define the hyperplane.
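The reach described by γ follows directly from the RBF kernel definition, k(x, x') = exp(-γ‖x − x'‖²), which is standard. A minimal sketch (the helper name is illustrative, not from the patent):

```python
import math

def rbf_kernel(x, y, gamma):
    """RBF kernel value: exp(-gamma * squared Euclidean distance).
    A small gamma lets a training point influence far-away points;
    a large gamma confines its influence to close neighbours."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)
```

For two points at distance 5, γ = 0.1 still gives a kernel value of about 0.08 (far-reaching influence), while γ = 10 gives a value indistinguishable from zero.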
In an embodiment the online model 112 can be trained with limited data and is quick to make predictions.
As normal data changes, the online model 112 can incrementally learn with the new data and shift its decision boundary. Furthermore, any bias and variance can be easily adjusted using the C and γ parameters forming part of the model. Training time is typically O(max(n,d) x min(n,d)²). For a small number of samples, this does not pose an issue. Furthermore, once the online model 112 is trained, the support vectors are enough to describe the current decision boundary.
In an embodiment the online model 112 contains the ground truth of recent data points and their corresponding labels.
In an embodiment the classifications of the offline model 110 and/or the online model 112 are associated to respective confidence levels. The labels assigned to new data based on the offline model 110 and the online model 112 are only as accurate as the confidence level of the classifications of the respective models. The confidence level associated to classification of the offline model 110 is an indicator of accuracy of classification using the offline model 110.
Likewise, the confidence level associated to classification of the online model 112 is an indicator of accuracy of classification using the online model 112.
In an embodiment Ft denotes a data point to be classified at time t. A classifier 114 receives a data point Ft from the data set 106. The classifier 114 classifies the data point Ft by assigning to it a classification label. An example operation of the classifier 114 is described below. In an embodiment the classifier 114 assigns a classification label to the data point Ft from either the offline model 110 or the online model 112 depending on which of the models has a higher confidence.
In an embodiment the classifier 114 determines at least one classification label Lt for data point Ft to a confidence level Ct. If the value of Ct is above an upper threshold confidence level then the data point Ft is assumed to be classified to a high confidence. The data point Ft is stored in high confidence database 120.
In an embodiment the upper threshold confidence level is 92.5% or greater. In an embodiment the upper threshold confidence level lies in the range 92.5% to 97.5%. In an embodiment the upper threshold confidence level is 95%.
In an embodiment a model trainer 122 takes as input at least one data point from the high confidence database 120 and trains the online model 112 using the data point(s). For example, the model trainer 122 may train the online model 112 using its existing data points with their labels in addition to data points extracted from the high confidence database 120 when retraining criteria are met.
In an embodiment the retraining criteria are met when the number of data points in the high confidence database 120 is greater than or equal to 1,000. In an embodiment the data point(s) used to train the online model 112 are removed from the high confidence database 120 following training.
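The retraining trigger described above can be sketched as follows. The batch size of 1,000 and the removal of consumed points follow the embodiment; the callback-based structure is an assumption:

```python
def maybe_retrain(high_confidence_db, retrain, min_points=1000):
    """Retrain the online model once the high confidence database
    holds at least `min_points` data points, then remove the
    consumed points as described in the embodiment."""
    if len(high_confidence_db) < min_points:
        return False
    retrain(list(high_confidence_db))  # hand a copy to the trainer
    high_confidence_db.clear()         # consumed points are removed
    return True
```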
In an embodiment the data point(s) in the online model 112 are grouped into time windows. The retraining criteria are met when a threshold number of time windows of data points has been added to the high confidence database 120.

In an embodiment the system 100 includes a low confidence database 130. If for example the value of Ct is below a lower threshold confidence level then the data point Ft is assumed to be classified to a low confidence. The data point Ft is stored in the low confidence database 130.
In an embodiment the lower threshold confidence level is lower than the upper threshold confidence level. In an embodiment the lower threshold confidence level is less than 95%. In an embodiment the lower threshold confidence level lies in the range 70% to 95%. In an embodiment the lower threshold confidence level is less than 92.5%. In an embodiment the lower threshold confidence level is less than 80%.
In an embodiment the data points in the low confidence database 130 and/or the data points not eligible for storing in the high confidence database 120 can be used to understand any changes in network traffic from the environment 102. For example, identification of a large number of low confidence points in the low confidence database 130 suggests that the first machine learning model 110 is going out of scope and requires an update.
Figure 2 shows an example method of classifying data points within a data set. The classifier 114 receives a data point Ft to be classified at a time t. As described above, the data point Ft is at least partly representative of an environment 102. The data point Ft may have undergone pre-processing or other data manipulation via the pre-processor 108.
The data point Ft is classified 202 using the offline model 110. The offline model 110 outputs for example a data tuple Ft, L*t, C*t which represents the data point, a label assigned to the data point and a first confidence level.
In an embodiment the confidence C* of classifying a data point F using the offline model 110 is determined by the function:

C*(F) = Σi=1..k D(F, NUNi(F)) / [ Σi=1..k D(F, NLNi(F)) + Σi=1..k D(F, NUNi(F)) ]
D(a,b) represents the distance between points a and b. NLNi(F) represents the i-th nearest like neighbour of F, while NUNi(F) represents its i-th nearest unlike neighbour. In an embodiment, the offline model 110 uses distance weighting for classifying; therefore this measure will always be in the range (0,1). In an embodiment the confidence of classifying a data point F using the online model 112 is calculated for example by using Platt scaling on the scores of the decision function.
The sign of the output of the decision function, + or -, provides a classification for an input data point. Platt scaling is an algorithm that produces a probability estimate from the output of the decision function, rather than only a classification label.
In the formula below, x is the data point, f(x) is the decision function and exp is the standard exponential function. A and B are parameters determined using the data by solving a regularized maximum likelihood problem. Other methods include isotonic regression.
P(y = 1 | x) = 1 / (1 + exp(A·f(x) + B))
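In Python, the Platt-scaled confidence is a single sigmoid applied to the decision value. The formula is standard; the parameter values used in any call are arbitrary, since A and B are fitted to data:

```python
import math

def platt_confidence(decision_value, A, B):
    """Platt scaling: map an SVM decision value f(x) to a
    probability estimate P(y=1|x) = 1 / (1 + exp(A*f(x) + B))."""
    return 1.0 / (1.0 + math.exp(A * decision_value + B))
```

With A negative (the usual fitted sign), points further on the positive side of the hyperplane receive higher probability estimates.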
The confidence value C*t is checked 204 to determine whether or not it is above a classification threshold value thres. In an embodiment the classification threshold value is user defined. In an embodiment the classification threshold value is 80% or greater. In an embodiment the classification threshold is 90% or greater.
In an embodiment the confidence value C*t is deemed to be low if the value is lower than thres, or if C*t = 0. Where C*t = 0, this would indicate that the offline model 110 is unable to classify the data point Ft.
If the confidence value C*t is low then the classifier 114 causes the data point Ft to be classified 206 using the online model 112. The online model 112 outputs for example a data tuple which represents the data point, a label assigned to the data point based on the online model 112, and a second confidence level.
Alternatively if the confidence value C*t is high then the classifier 114 causes the first classification label from the offline model 110 to be assigned 208 to the data point.
Following step 206, the first confidence level C*t is checked 210 against the second confidence level. If the first confidence level is higher than the second confidence level then the data point is assigned 208 the first classification label. Alternatively, if the second confidence level is higher than the first confidence level then the data point is assigned 212 the second classification label.
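The decision flow of steps 202 to 212 can be sketched as follows. The function names are hypothetical, and each model is assumed to return a (label, confidence) pair:

```python
def classify(point, offline_model, online_model, thres):
    """Hybrid classification: trust the offline model when its
    confidence reaches `thres`; otherwise consult the online model
    and keep whichever label carries more confidence. When the
    offline confidence ties or wins, its label is kept but reported
    to the online (second) confidence level, as in the claims."""
    offline_label, offline_conf = offline_model(point)   # step 202
    if offline_conf >= thres:                            # step 204
        return offline_label, offline_conf               # step 208
    online_label, online_conf = online_model(point)      # step 206
    if offline_conf >= online_conf:                      # step 210
        return offline_label, online_conf                # step 208
    return online_label, online_conf                     # step 212
```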
Figure 3 shows an example of learning performed by the online model 112. In an embodiment the online model retains its support vectors and extracts new data from the high confidence database 120 to retrain itself and incrementally learn. The decision boundary of the online model 112 will shift in line with normal traffic changes.
Figure 3 shows a graphical representation of a plurality of data points within an SVM model 300. The model 300 includes a boundary line 305. The data points 310 classified as normal are positioned on one side of the boundary line 305. The data points 315 classified as anomalies are positioned on another side of the boundary line 305.
Some of the normal data points and some of the anomaly data points represent support vectors. These data points are indicated at 320, 322, 324, 326 within the normal data points 310 and indicated at 330, 332 and 334 within the anomaly data points 315.
Shown at 340 is the model 300 following retraining. Data points 320, 322, 324 and 326 are retained within a set of normal data points 350. Data point 326 is no longer a support vector. There is a new data point 355 within the set of normal data points 350.
Data points 330, 332 and 334 are retained within a set of anomaly data points 360. There is a new data point 365 within the set of anomaly data points 360.
The boundary line 370 dividing the normal data points 350 from the anomaly data points 360 has shifted slightly to a different position within the model 340 than the boundary line 305 within the model 300.
The boundary lines 305 and 370 within each of models 300 and 340 are defined by the support vectors within each model. Changes to the support vectors result in changes to the respective boundary lines. Boundary line 370 has shifted to fit the new data in model 340.
Figure 4 shows an example online offline system 400. In an embodiment the system 400 is configured to obtain the data set 106 of figure 1 from the environment 102. In an embodiment the data extractor 104 and/or the pre-processor 108 create the data set 106 from the environment 102. The data set 106 includes at least one data point.
System 400 includes a first machine learning model 410 and a second machine learning model 412. In an embodiment the first machine learning model 410 comprises an offline model and the second machine learning model 412 comprises an online model. At initialisation the offline model 410 is trained with normal data indicated at 414. The offline model 410 outputs an anomaly score for each point. The anomaly score indicates how far a given data point is from a centre or peak of a distribution from which the data point is drawn.
In an embodiment the offline model 410 comprises an Autoencoder (AE). An AE comprises an encoder (neural) network, a latent layer and a decoder network. The encoder maps the input data into the latent layer of lower dimensionality, and the decoder reconstructs the input from it. AEs are able to extract abstract features from high dimensional data in the form of the latent layer and they also give a form of anomaly score such as Reconstruction Error (RE).
The offline model 410 in the form of an AE is trained with only normal data while minimising RE. Thus, the assumption is that normal traffic data can be reconstructed easily while anomalous data will have higher RE. This RE measure can be viewed as a vertical distance from the center/peak of the distribution which the AE has learnt. Points with a larger RE tend to mean that the data point is far from the centre.
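The role of RE as an anomaly score can be illustrated with a toy compressor in place of a trained neural AE. The pairwise-mean "encoder" below is purely illustrative; a real embodiment would use trained encoder and decoder networks:

```python
def reconstruction_error(x, encode, decode):
    """Squared error between a point and its reconstruction."""
    x_hat = decode(encode(x))
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))

# Toy stand-in for a trained AE: the "latent layer" keeps pairwise means
# (4 dimensions -> 2), and the decoder repeats each mean.
encode = lambda x: [(x[0] + x[1]) / 2, (x[2] + x[3]) / 2]
decode = lambda z: [z[0], z[0], z[1], z[1]]

normal_re = reconstruction_error([1.0, 1.0, 2.0, 2.0], encode, decode)   # 0.0
anomaly_re = reconstruction_error([1.0, 5.0, 2.0, 2.0], encode, decode)  # 8.0
```

Data that fits the structure the compressor has learnt reconstructs with low RE; data that breaks that structure does not.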
Where the offline model 410 comprises an AE, the model 410 determines a similarity score, for example a Reconstruction Error, for each point. In an embodiment the offline model 410 comprises a complex deep learning model that outputs a form of anomaly score such as RE. In an embodiment the offline model 410 provides dimensionality reduction. Reducing the dimension of the data has the potential to help the online model 412 learn better.
In an embodiment the online model 412 is initialised with existing data obtained from the offline model 410. A threshold thres is initialised as a median value of the dataset 414 used to initialise the online model 412. In an embodiment the online model 412 includes an outlier detection process.
The parameters of the online model 412 can be updated with an incremental step with new data or completely retrained with new data. Where the online model 412 is retrained with new data it has a 0% knowledge retention rate.
In operation, the offline model 410 receives an incoming live stream of data to be scored. This live stream is referred to as xi, shown at 416. The stream of data 416 is fed into the offline model 410. In an embodiment the offline model outputs a latent layer representation of the stream of data 416. The latent layer is referred to as zi. The offline model 410 also outputs similarity scores, for example Reconstruction Error values, associated to respective values from the set of xi values. The data output from offline model 410 is shown at 418.
If the similarity score is less than thres then the live stream of data values xi is classified as normal traffic. Data points with small RE tend to be closer to the centre of the distribution of the training data. These can be safely classified as normal and the online model 412 can use them for training. The live stream is assigned an anomaly score of 0, indicating that the live stream is normal traffic. The live stream is stored in online database 420.
The online model 412 is retrained with this new data and its existing knowledge once a retraining criterion is met.
On the other hand, if the similarity score is greater than or equal to thres then the online model 412 provides an anomaly score for the live stream xi.
The live stream of data values xi and corresponding similarity scores are collated into batches and stored in Batch database 422 to maintain a set or batch of values.
An RE value less than a dth percentile of a batch is used as input to a one-sided Mann-Whitney test 424 against a set of values. The Mann-Whitney test includes a null hypothesis H0. The null hypothesis checks whether or not the mean ranks of two populations are equal and/or whether or not the populations are the same.
If the null hypothesis H0 is not rejected, the median value of the batch becomes the new threshold thres. The median value is one example of a value that can be assigned to the threshold thres; it will be appreciated that other percentile values can be assigned to thres. The value of thres is thereby allowed to drift. On the other hand, if the null hypothesis H0 is rejected, the thres value is not updated because more than q% of the data is significantly different from the normal training set 414.
The threshold thres needs to be able to drift, to allow the online model 412 to incorporate concept-drift and variants of existing normal traffic. However, the threshold should also not drift too far as this would cause it to train with anomalous traffic.
In an embodiment a network administrator provides knowledge of a suitable value for thres. For example, if it is expected that a certain percentage of the traffic in each batch will be legitimate during day-to-day operations, that can be used as the percentile value in determining a threshold.
In an embodiment, an assumption is made that at least 50% of the traffic in the network is legitimate. The median value of each batch is taken to be the new threshold thres.
Most of the time legitimate traffic forms the great bulk of data and one can take a higher percentile value. Furthermore, the percentile values are robust to mean-shifts and outliers. There may be legitimate traffic above this threshold that is not considered for online training. They can still be given a low anomaly score by the online model 412, albeit not 0.
It is expected that there will be cases when intrusions in streaming data arrive together, such as Denial of Service (DoS) attacks or Flash Crowd traffic. In such cases, the median value will increase and subsequent training of the online model would result in incorrect classification. Hence, it is desirable to check the REs of each batch to determine whether most of the points in this batch are anomalous or legitimate. In an embodiment the Mann-Whitney test is used to determine when to update the threshold.
The above statistical test is not specific to the Mann-Whitney test. It will be appreciated that any suitable test can be used to compare a similarity between two populations or between two sets of data. One example is the Kolmogorov Smirnoff test.
Furthermore, it will be appreciated that the Mann-Whitney test may be replaced with another statistical test or tests depending on scenario, data and choice of offline and online model.
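A minimal sketch of the batch check at 424, using a one-sided Mann-Whitney (rank-sum) test with the usual normal approximation. The implementation assumes distinct values (no tie correction) and a 5% significance level; both simplifications are assumptions for illustration:

```python
import math

def batch_is_normal(batch, reference, z_crit=1.645):
    """One-sided Mann-Whitney test via the normal approximation.

    Returns True when H0 (the batch does not stochastically exceed the
    reference REs) is not rejected, i.e. thres may be updated.
    Assumes distinct values; no tie correction is applied.
    """
    n1, n2 = len(batch), len(reference)
    pooled = sorted(batch + reference)
    r1 = sum(pooled.index(v) + 1 for v in batch)      # rank-sum of the batch
    u1 = r1 - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return (u1 - mean_u) / sd_u < z_crit

reference = [float(x) for x in range(50)]         # REs of normal training data
similar = [x + 0.25 for x in range(20)]           # batch like the reference
shifted = [x + 100.5 for x in range(20)]          # e.g. a DoS burst
may_update = batch_is_normal(similar, reference)  # True: thres may drift
flagged = not batch_is_normal(shifted, reference) # True: batch is suspicious
```

Returning True means the null hypothesis is not rejected and thres may drift to the batch median; False flags the batch as significantly different, as with a DoS or Flash Crowd burst.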
Figure 5 shows an example method 500 of operation of the online offline system 400. The method is a representation of the algorithm described in the following steps.
Referring to figures 4 and 5, the offline model 410 receives a data point from the live stream of data points. The offline model 410 as a first machine learning model obtains/outputs 504 a similarity score, for example a reconstruction error value, associated to the data point.
If 506 the similarity score is less than a threshold value then the data point is determined 508 to be normal traffic. The data point is associated to an anomaly score of zero. The data point is added 510 to a first database, for example the online database 420.
On the other hand if the similarity score is greater than or equal to the threshold value then a second machine learning model, for example the online model 412,
obtains/outputs 512 a non-zero anomaly score. In an embodiment the online model 412 determines an anomaly score that has a different value to the anomaly score determined in step 508 above.
It will be appreciated that an alternative comparison would involve determining the data point as normal traffic if the similarity score is less than or equal to the threshold value, and determining the data point as not normal if the similarity score is greater than the threshold value.
If 514 there are further data points in stream of data 416 then these further data points are received by the offline model 410.
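The main loop of steps 504 to 514 can be sketched as below; the similarity() and score() interfaces of the two models are assumed for illustration:

```python
def process_stream(stream, offline_model, online_model, thres, online_db):
    """Steps 504-514: score each point with the offline model first, and
    fall back to the online model for points at or above the threshold."""
    results = []
    for x in stream:                                    # receive point (514)
        re = offline_model.similarity(x)                # step 504
        if re < thres:                                  # step 506
            results.append((x, 0.0))                    # step 508: normal
            online_db.append(x)                         # step 510
        else:
            results.append((x, online_model.score(x)))  # step 512
    return results

class OfflineStub:
    def similarity(self, x):
        return abs(x - 5.0)   # stand-in RE: distance from a "normal" centre

class OnlineStub:
    def score(self, x):
        return 0.7            # stand-in non-zero anomaly score

online_db = []
results = process_stream([5.1, 9.0, 4.8], OfflineStub(), OnlineStub(),
                         thres=1.0, online_db=online_db)
```

Points 5.1 and 4.8 fall below the threshold and are stored for retraining; 9.0 is handed to the online model for scoring.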
In an embodiment the method 500 optionally includes a check to determine whether the stream of data 416 is suspiciously different from the training set 414. The check is shown at 516 in figure 5 and figure 6. It will be appreciated that the check can work in parallel with respect to the method shown in figure 5.
Referring to Figure 6, the data points and associated similarity scores are added 518 to a second database. One example of the second database is the batch database 422.
One example test is the Mann-Whitney test performed to check 520 whether or not the new data points indicate suspicious activity. In an embodiment the test checks whether or not the new data is anomalous and/or significantly different from previously observed data. If the data stream does not indicate suspicious activity then control is passed through 522 back to the method shown in figure 5. On the other hand, if the data stream does indicate suspicious activity then the data point is added 524 to a low confidence database. Control is then passed back through 526 to the method shown in figure 5.

In an embodiment the method 500 optionally includes a technique for dynamically determining the threshold value thres as an alternative to the user having to specify a value. An example of a technique for dynamically determining a value for thres is shown at 530 in figure 5 and figure 7. It will be appreciated that the dynamic adjustment can work in parallel with respect to the method shown in figure 5.
Referring to figure 7, a batch of similarity values is received 532. In an embodiment the method includes the optional step of removing 534 outliers from the received batch.
A test is performed 536 to determine whether the batch of similarity values is normal. The test could include any statistical test or similarity checking method. If the batch is determined to be normal then in an embodiment the thres value is updated 538 to the median RE value of the batch. It will be appreciated that the median RE value is one example. For example, the thres value could be updated to a user defined percentile value of a similarity score of the batch.
Control is then passed back to the method shown in figure 5 through 540.
This technique allows the value of thres to drift with data sets that are evolving.
Examples of training offline and online models
In an embodiment the system 100 is trained by selecting the data set 106 from a training set, for example the NSL-KDD 2009 dataset. The data set 106 includes four different types of attacks, namely Denial of Service (DoS), Probe, Remote to Local (R2L) and User to Root (U2R). The original data set is pre-processed to replace the labels with a binary value of either normal or anomaly. The original data set includes 41 features, of which 3 are nominal, 6 are binary and the remaining features are numerical.
The pre-processor 108 performs one-hot encoding of the nominal features to make them into a total of 90 binary features. All binary features are decomposed into 8 components using Principal Component Analysis (PCA). These 8 components explain at least 80% of the variability of these binary features. The 'num outbound cmds' feature is removed from the data set as it only takes the value 0 in the entire dataset.
One example of a one-hot encoding process transforms nominal or categorical variables such as 'PROTOCOL' (which takes either 'tcp', 'udp' or 'icmp') into binary variables. For each value, it adds another variable. The variables 'TCP', 'UDP' and 'ICMP' take the value 1 according to the data point's protocol and 0 on the other two variables. Other methods include label encoding, which changes nominal variables into numerical values.
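A minimal sketch of this one-hot step:

```python
def one_hot(values, categories):
    """Expand a nominal feature into one binary column per category."""
    return [[1 if v == c else 0 for c in categories] for v in values]

protocols = ["tcp", "udp", "tcp", "icmp"]
encoded = one_hot(protocols, categories=["tcp", "udp", "icmp"])
# encoded[0] -> [1, 0, 0] (tcp); encoded[3] -> [0, 0, 1] (icmp)
```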
The pre-processor 108 normalises the values to the range [0,1] so that they can be compared. Min-Max scaling is used for the 8 PCA components. Min-Max scaling takes each value and subtracts the minimum value and divides over the feature range.
Min-Max scaling is not used for the numerical features due to the effect of outliers, which can highly distort the data. In an embodiment an exponential mapping based on the Euler number e (e ≈ 2.7182818) is used for normalisation. In this mapping, fi is the observed value of feature i and each numerical feature has its own constant ci. To facilitate adaptation to normal traffic, the constant ci is determined such that the average of feature i of the normal traffic instances is mapped to 0.5. If this is not possible, ci is chosen such that the average of feature i of all traffic instances in the training set is mapped to 0.5.
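The patent's exact equation is not reproduced here; the sketch below assumes one mapping with the stated property, 1 − e^(−fi/ci) with ci = mean(fi)/ln 2, which sends each feature's average to exactly 0.5 while keeping every normalised value bounded below 1:

```python
import math

def exp_normalise(values):
    """Map non-negative feature values into [0, 1).

    Assumed form: 1 - e^(-f/c) with c = mean / ln 2, chosen so that the
    feature's average maps to 0.5. This is one function satisfying the
    property described in the text; the patented equation may differ.
    """
    c = (sum(values) / len(values)) / math.log(2)
    return [1.0 - math.exp(-f / c) for f in values]

norm = exp_normalise([1.0, 2.0, 3.0, 2.0])   # mean 2.0 maps to 0.5
```

Unlike Min-Max scaling, an extreme outlier saturates towards 1 instead of stretching the whole feature range.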
In an embodiment the offline model 110 is trained using a weighted Euclidean distance as follows:
d(x, y) = √( Σi wi (xi − yi)² ), where wi is the weight assigned to feature i.
One advantage of using this metric is that some features of the data set 106 have more predictive power in relation to detection of anomalies. Without the weights, the offline model 110 does not learn that some features are more discriminative than others. In an embodiment the feature weights are computed based on the Chi Squared statistic and rescaled to the range [0, 1] using Min-Max scaling.
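A sketch of such a weighted distance, with weights standing in for the rescaled Chi Squared scores (the weight values below are illustrative):

```python
import math

def weighted_euclidean(x, y, w):
    """Weighted Euclidean distance: higher-weight features dominate."""
    return math.sqrt(sum(wi * (xi - yi) ** 2
                         for xi, yi, wi in zip(x, y, w)))

# Illustrative weights: feature 0 counts four times as much as feature 1.
d = weighted_euclidean([0.2, 0.9], [0.4, 0.1], w=[1.0, 0.25])
```

Here the contributions are 1.0·0.04 + 0.25·0.64 = 0.2, so d = √0.2 ≈ 0.447.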
In one training example, 20% of the data set 106 is randomly selected to determine the value for the radius. For each data point in the selected set, the average of its distances from all other points is computed. The 50th percentile of these distances is selected as the radius because the distribution is fat-tailed.
The smaller the radius, the more responsibility is imposed on the online model for classification. A larger radius has a potential disadvantage of obtaining votes from points that are further away, which is inappropriate. The optimal radius does to a certain extent depend on the number of points in the offline model 110 and how far each point is from the others. In an embodiment k is given a value greater than 1. In an embodiment k is given a value of 10. In an embodiment k is given a value in the range 10 to 15.
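The radius-selection procedure above (sample 20% of the points, average each sampled point's distances to all other points, take the 50th percentile) can be sketched as follows; the toy one-dimensional point set is illustrative:

```python
import random

def pick_radius(points, distance, sample_frac=0.2, percentile=50):
    """Select a Rad-NN radius as a percentile of average pairwise distances."""
    n = len(points)
    sample = random.sample(points, max(1, int(sample_frac * n)))
    avgs = sorted(
        sum(distance(p, q) for q in points if q is not p) / (n - 1)
        for p in sample
    )
    return avgs[min(len(avgs) - 1, len(avgs) * percentile // 100)]

random.seed(0)                          # deterministic for the example
points = [float(i) for i in range(20)]  # evenly spaced toy points
radius = pick_radius(points, lambda a, b: abs(a - b))
```

On this evenly spaced toy set the selected radius falls between roughly 5 (for central points) and 10 (for end points).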
In an embodiment the online model 112 is trained using 1000 data points randomly selected from, or the first 1000 data points in, the data set 106. The online model 112 must bring new knowledge and must be trained using different data points. The system 100 is then tested on the remaining data points in the testing set.
The online model 112 is intended to grasp the network traffic as is at that point in time. In an embodiment, high values for parameters C and y are used to obtain high variance and low bias.
Experimental results
The results are evaluated based on false positive rate (FPR), true positive rate (TPR) also known as detection rate, and accuracy (Acc).
FPR = FP / (FP + TN)
TPR = TP / (TP + FN)
Acc = (TP + TN) / (TP + TN + FP + FN)
where TP, FP, TN and FN denote the numbers of true positives, false positives, true negatives and false negatives respectively.
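These three metrics follow directly from confusion-matrix counts. The counts below are illustrative, chosen to land near the reported figures; they are not the experiment's actual confusion matrix:

```python
def rates(tp, fp, tn, fn):
    """FPR, TPR (detection rate) and accuracy from confusion counts."""
    fpr = fp / (fp + tn)
    tpr = tp / (tp + fn)
    acc = (tp + tn) / (tp + tn + fp + fn)
    return fpr, tpr, acc

fpr, tpr, acc = rates(tp=917, fp=26, tn=974, fn=83)   # illustrative counts
```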
The system 100 was trained with the following initial parameters:
• Online model 112 parameters are C = 1000, y = 0.1
• Confidence threshold, thres, of offline model 110 is 80%
• k, to calculate confidence is 10
• Upper threshold, u for high confidence database is 95%
• Lower threshold, l, for low confidence database is 80%
• The online model 112 is retrained when the number of previous support vectors and high confidence points reaches 1000.
The results show an average prediction time for each data point of 0.08s. The system achieves 94.14% accuracy, 91.68% detection rate and 2.59% false positive rate. The online model 112 was trained 13 times whilst predicting 21,544 data points in the testing set.
The online model 112 has been found to learn continuously. The offline model 110 has been found to perform knowledge accumulation in the knowledge base in the form of data and its neighbourhood. This knowledge has been successfully leveraged using a confidence score for the online model 112 to select new points to learn from. All features from the training data set were used, which introduces noise.
If there is an awareness of the anomalies to detect, the system 100 can be trained on specific features. For example, to detect R2L attacks, preferred features include duration of connection, service requested and number of failed login attempts instead of all features. For detecting DoS attacks, aggregating flows by source and destination instead of a 5 tuple and using number of sources and average packet sizes may be better suited.
In an embodiment the high confidence threshold u lies in the range 0.925 to 0.975. The number of online models 112 trained decreases as u is increased, because there are fewer points and the online model has to wait longer to obtain 1000 points for retraining.
In an embodiment, the prediction threshold thres of the offline model, has an initial parameter value of 80% or greater. In an embodiment the threshold thres has an initial parameter value of 90%. It is important for the system 100 to balance the contribution of the online model 112 to the prediction process.
In an embodiment the online model 112 is initialised with between 500 and 1,500 data points. Initialising the online model with more than 1,500 data points is not found to greatly increase TPR or Accuracy, and FPR is not impacted greatly. It has been found that beyond 500 points, the number of online models 112 retrained does not vary much either.
In an embodiment a value for k in the range 10 to 15 is suitable. This is the number of like and unlike neighbours to consider when the offline model 110 calculates its confidence. A higher number produces an upward trend in FPR. This is because the system is calculating confidence based on points that are further away. Furthermore, TPR and Accuracy dip for k values greater than 15.

In an embodiment the first machine learning model 110 and the second machine learning model 112 work together instead of working independently.
For example in an embodiment the second machine learning model 112 builds on/incorporates the knowledge of the first machine learning model 110 to perform classification. In an embodiment the first machine learning model 110 builds
on/incorporates the knowledge of the second machine learning model 112 to perform classification.
In an embodiment the classified data points and respective confidence levels are used to update the first machine learning model 110 in addition to updating the second machine learning model 112.
In an embodiment the second machine learning model 112 incorporates at least some knowledge of the first machine learning model 110 in its training phase. For example, where the second machine learning model 112 includes an SVM, optimal values for C and y can be determined by an algorithm in the first machine learning model 110.
Figure 8 shows an embodiment of a suitable computing environment to implement embodiments of one or more of the systems and methods disclosed above.
The operating environment of Figure 8 is an example of a suitable operating
environment. It is not intended to suggest any limitation as to the scope of use or functionality of the operating environment. Example computing devices include, but are not limited to, personal computers, server computers, hand-held or laptop devices, mobile devices, multiprocessor systems, consumer electronics, mini computers, mainframe computers, and distributed computing environments that include any of the above systems or devices. Examples of mobile devices include mobile phones, tablets, and Personal Digital Assistants (PDAs).
Although not required, embodiments are described in the general context of 'computer readable instructions' being executed by one or more computing devices. In an embodiment, computer readable instructions are distributed via tangible computer readable media.
In an embodiment, computer readable instructions are implemented as program modules. Examples of program modules include functions, objects, Application
Programming Interfaces (APIs), and data structures that perform particular tasks or implement particular abstract data types. Typically, the functionality of the computer readable instructions is combined or distributed as desired in various environments.

Shown in figure 8 is a system 800 comprising a computing device 805 configured to implement one or more embodiments described above. In an embodiment, computing device 805 includes at least one processing unit 810 and memory 815. Depending on the exact configuration and type of computing device, memory 815 is volatile (such as RAM, for example), non-volatile (such as ROM, flash memory, etc., for example) or some combination of the two.
A server 820 is shown by a dashed line notionally grouping processing unit 810 and memory 815 together.
In an embodiment, computing device 805 includes additional features and/or
functionality. One example is removable and/or non-removable additional storage including, but not limited to, magnetic storage and optical storage. Such additional storage is illustrated in Figure 8 as storage 825.
In an embodiment, computer readable instructions to implement one or more
components provided herein are maintained in storage 825. Examples of components implemented by such computer readable instructions include one or more of the data extractor 104, the pre-processor 108, the first machine learning model 110, the second machine learning model 112, the classifier 114, the high confidence database 120, the model trainer 122 and/or the low confidence database 130. Further examples include some or all of the components shown in figure 4.
In an embodiment, storage 825 stores other computer readable instructions to implement an operating system and/or an application program. Computer readable instructions are loaded into memory 815 for execution by processing unit 810, for example.
Memory 815 and storage 825 are examples of computer storage media. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVDs) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computing device 805. Any such computer storage media may be part of device 805.
In an embodiment, computing device 805 includes at least one communication connection 840 that allows device 805 to communicate with other devices. The at least one communication connection 840 includes one or more of a modem, a Network Interface Card (NIC), an integrated network interface, a radio frequency transmitter/receiver, an infrared port, a USB connection, or other interfaces for connecting computing device 805 to other computing devices. In an embodiment, communication connection(s) 840 facilitate a wired connection, a wireless connection, or a combination of wired and wireless connections. Communication connection(s) 840 transmit and/or receive communication media.
Communication media typically embodies computer readable instructions or other data in a "modulated data signal" such as a carrier wave or other transport mechanism and includes any information delivery media. The term "modulated data signal" includes a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal.
In an embodiment, device 805 includes at least one input device 845 such as keyboard, mouse, pen, voice input device, touch input device, infrared cameras, video input devices, and/or any other input device. Device 805 also includes at least one output device 850 such as one or more displays, speakers, printers, and/or any other output device.
Input device(s) 845 and output device(s) 850 are connected to device 805 via a wired connection, wireless connection, or any combination thereof. In an embodiment, an input device or an output device from another computing device is/are used as input device(s) 845 or output device(s) 850 for computing device 805.
In an embodiment, components of computing device 805 are connected by various interconnects, such as a bus. Such interconnects include one or more of a Peripheral Component Interconnect (PCI), such as PCI Express, a Universal Serial Bus (USB), FireWire (IEEE 1394), and an optical bus structure. In an embodiment, components of computing device 805 are interconnected by a network. For example, memory 815 in an embodiment comprises multiple physical memory units located in different physical locations interconnected by a network.
It will be appreciated that storage devices used to store computer readable instructions may be distributed across a network. For example, in an embodiment, a computing device 855 accessible via a network 860 stores computer readable instructions to implement one or more embodiments provided herein.
Computing device 805 accesses computing device 855 in an embodiment and downloads a part or all of the computer readable instructions for execution. Alternatively, computing device 805 downloads portions of the computer readable instructions, as needed. In an embodiment, some instructions are executed at computing device 805 and some at computing device 855.
In an embodiment, the client application 885 is provided as a thin client application configured to run within a web browser. In an embodiment the client application 885 is provided as an application on a user device. It will be appreciated that application 885 in an embodiment is associated to computing device 805 or another computing device.
The foregoing description of the invention includes preferred forms thereof.
Modifications may be made thereto without departing from the scope of the invention, as defined by the accompanying claims.

Claims

1. A method of classifying at least one data point within a data set comprising:
maintaining, in a first machine learning model, a first plurality of data points associated to respective first labels;
maintaining, in a second machine learning model, a second plurality of data points associated to respective second labels;
classifying the at least one data point using the first machine learning model to a first confidence level to determine at least one first classification label;
responsive to the first confidence level having a value greater than or equal to a classification threshold value: classifying the at least one data point with the at least one first classification label to the first confidence level; and
responsive to the first confidence level having a value of zero and/or the first confidence level having a value below the classification threshold value: classifying the at least one data point using the second machine learning model to a second confidence level to determine at least one second classification label,
responsive to the first confidence level having a value less than the second confidence level: classifying the at least one data point with the at least one second classification label, and
responsive to the first confidence level having a value greater than or equal to the second confidence level: classifying the at least one data point with the at least one first classification label to the second confidence level.
2. The method of claim 1 wherein the classification threshold value is user defined.
3. The method of claim 1 or claim 2 wherein the classification threshold value is 80% or greater.
4. The method of any one of the preceding claims wherein the classification threshold is 90% or greater.
5. The method of any one of claims 1 to 3 wherein the classification threshold lies in the range of 80% to 90%.
6. The method of any one of the preceding claims further comprising maintaining the at least one data point in a high confidence database responsive to the at least one data point being classified to a classification confidence level having a value greater than an upper threshold confidence level.
7. The method of claim 6 wherein the upper threshold confidence level is 92.5% or greater.
8. The method of claim 6 or claim 7 wherein the upper threshold confidence level lies in the range 92.5% to 97.5%.
9. The method of claim 6 or claim 7 wherein the upper threshold confidence level is 95%.
10. The method of any one of the preceding claims further comprising maintaining the at least one data point in a low confidence database responsive to the at least one data point being classified to a classification confidence level having a value below a lower threshold confidence level.
11. The method of claim 10 wherein the lower threshold confidence level is lower than the upper threshold confidence level.
12. The method of claim 10 or claim 11 wherein the lower threshold confidence level is less than 95%.
13. The method of claim 10 or claim 11 wherein the lower threshold confidence level lies in the range 70% to 95%.
14. The method of any one of claims 10 to 13 wherein the lower threshold confidence level is less than 92.5%.
15. The method of any one of claims 10 to 14 wherein the lower threshold confidence level is less than 80%.
16. The method of any one of the preceding claims wherein the first machine learning model comprises a Radius Nearest Neighbour (Rad-NN) model.
17. The method of claim 16 wherein classifying the at least one data point using the first machine learning model includes counting votes of all neighbouring points positioned within a specified radius r.
18. The method of claim 16 or claim 17 further comprising determining the first confidence level as a ratio of k nearest like neighbours to k nearest unlike neighbours, k having a value in the range 10 to 15.
19. The method of any one of the preceding claims wherein the second machine learning model comprises a Support Vector Machine (SVM) model having parameters C and y.
20. The method of claim 19 wherein the values of C and/or γ are user-defined.
21. The method of claim 19 or claim 20 wherein the value of C is approximately 1,000.
22. The method of any one of claims 19 to 21 wherein the value of γ is approximately 0.1.
23. The method of any one of claims 6 to 9 further comprising training the second machine learning model from at least one data point retrieved from the high confidence database.
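Claims 19 to 23 make the second model an SVM with C of approximately 1,000 and γ of approximately 0.1, retrained from the high confidence database, and the independent claims combine the two models into a confidence cascade. A minimal sketch of that cascade follows; the models are stand-in callables returning (label, confidence), and the 0.80 / 0.95 / 0.80 threshold values are the example values recited in claims 2 to 15, not requirements of the invention.

```python
def hybrid_classify(point, first_model, second_model,
                    classification_threshold=0.80,  # claims 2-4 (example value)
                    upper_threshold=0.95,           # claims 6-9 (example value)
                    lower_threshold=0.80,           # claims 10-15 (example value)
                    high_conf_db=None, low_conf_db=None):
    """Confidence cascade of the independent claims: trust the first model
    when its confidence is non-zero and meets the classification threshold;
    otherwise consult the second model and keep whichever label carries the
    higher confidence, reporting the second confidence level either way.
    Classified points are then routed to the high or low confidence
    database by the final confidence level (claims 6 and 10)."""
    label, conf = first_model(point)
    if conf > 0 and conf >= classification_threshold:
        result = (label, conf)
    else:
        label2, conf2 = second_model(point)
        # Keep the second label only when it is strictly more confident.
        result = (label2, conf2) if conf < conf2 else (label, conf2)

    final_label, final_conf = result
    if high_conf_db is not None and final_conf > upper_threshold:
        high_conf_db.append((point, final_label))
    if low_conf_db is not None and final_conf < lower_threshold:
        low_conf_db.append((point, final_label))
    return result

# Stand-in models for illustration only.
unsure_first = lambda p: ("cat", 0.30)
confident_second = lambda p: ("dog", 0.90)
print(hybrid_classify((1.0,), unsure_first, confident_second))  # -> ('dog', 0.9)
```

In a scikit-learn realisation the second model could be `SVC(C=1000, gamma=0.1, probability=True)` fitted on points drawn from the high confidence database (claim 23); that mapping is an assumption of this sketch, not language from the claims.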
24. A classification system configured to classify at least one data point within a data set, comprising:
a first machine learning model in which is maintained a first plurality of data points associated to respective first labels;
a second machine learning model in which is maintained a second plurality of data points associated to respective second labels; and
a classifier configured to:
classify the at least one data point using the first machine learning model to a first confidence level to determine at least one first classification label;
responsive to the first confidence level having a value greater than or equal to a classification threshold value: classify the at least one data point with the at least one first classification label to the first confidence level; and
responsive to the first confidence level having a value of zero and/or the first confidence level having a value below the classification threshold value: classify the at least one data point using the second machine learning model to a second confidence level to determine at least one second classification label,
responsive to the first confidence level having a value less than the second confidence level: classify the at least one data point with the at least one second classification label, and
responsive to the first confidence level having a value greater than or equal to the second confidence level: classify the at least one data point with the at least one first classification label to the second confidence level.
25. The system of claim 24 wherein the classification threshold value is user-defined.
26. The system of claim 24 or claim 25 wherein the classification threshold value is 80% or greater.
27. The system of any one of claims 24 to 26 wherein the classification threshold is 90% or greater.
28. The system of any one of claims 24 to 26 wherein the classification threshold lies in the range of 80% to 90%.
29. The system of any one of claims 24 to 28 further comprising a high confidence database in which is maintained the at least one data point responsive to the at least one data point being classified to a classification confidence level having a value greater than an upper threshold confidence level.
30. The system of claim 29 wherein the upper threshold confidence level is 92.5% or greater.
31. The system of claim 29 or claim 30 wherein the upper threshold confidence level lies in the range 92.5% to 97.5%.
32. The system of any one of claims 29 to 31 wherein the upper threshold confidence level is 95%.
33. The system of any one of claims 24 to 32 further comprising a low confidence database in which is maintained the at least one data point responsive to the at least one data point being classified to a classification confidence level having a value below a lower threshold confidence level.
34. The system of claim 33 wherein the lower threshold confidence level is lower than the upper threshold confidence level.
35. The system of claim 33 or claim 34 wherein the lower threshold confidence level is less than 95%.
36. The system of claim 33 or claim 34 wherein the lower threshold confidence level lies in the range 70% to 95%.
37. The system of any one of claims 33 to 36 wherein the lower threshold confidence level is less than 92.5%.
38. The system of any one of claims 33 to 37 wherein the lower threshold confidence level is less than 80%.
39. The system of any one of claims 24 to 38 wherein the first machine learning model comprises a Radius Nearest Neighbour (Rad-NN) model.
40. The system of claim 39 wherein classifying the at least one data point using the first machine learning model includes counting votes of all neighbouring points positioned within a specified radius r.
41. The system of claim 39 or claim 40 wherein the classifier is further configured to determine the first confidence level as a ratio of k nearest like neighbours to k nearest unlike neighbours, k having a value in the range 10 to 15.
42. The system of any one of claims 24 to 41 wherein the second machine learning model comprises a Support Vector Machine (SVM) model having parameters C and γ.
43. The system of claim 42 wherein the values of C and/or γ are user-defined.
44. The system of claim 42 or claim 43 wherein the value of C is approximately 1,000.
45. The system of any one of claims 42 to 44 wherein the value of γ is approximately 0.1.
46. The system of any one of claims 29 to 32 further comprising a model trainer configured to train the second machine learning model from at least one data point retrieved from the high confidence database.
47. A computer-readable medium having stored thereon computer-executable instructions that, when executed by a processor, cause the processor to perform a method of classifying at least one data point within a data set, the method comprising:
maintaining, in a first machine learning model, a first plurality of data points associated to respective first labels;
maintaining, in a second machine learning model, a second plurality of data points associated to respective second labels;
classifying the at least one data point using the first machine learning model to a first confidence level to determine at least one first classification label;
responsive to the first confidence level having a value greater than or equal to a classification threshold value: classifying the at least one data point with the at least one first classification label to the first confidence level; and
responsive to the first confidence level having a value of zero and/or the first confidence level having a value below the classification threshold value: classifying the at least one data point using the second machine learning model to a second confidence level to determine at least one second classification label,
responsive to the first confidence level having a value less than the second confidence level: classifying the at least one data point with the at least one second classification label, and
responsive to the first confidence level having a value greater than or equal to the second confidence level: classifying the at least one data point with the at least one first classification label to the second confidence level.
48. The computer-readable medium of claim 47 wherein the classification threshold value is user-defined.
49. The computer-readable medium of claim 47 or claim 48 wherein the classification threshold value is 80% or greater.
50. The computer-readable medium of any one of claims 47 to 49 wherein the classification threshold is 90% or greater.
51. The computer-readable medium of any one of claims 47 to 49 wherein the classification threshold lies in the range of 80% to 90%.
52. The computer-readable medium of any one of claims 47 to 51 wherein the method further comprises maintaining the at least one data point in a high confidence database responsive to the at least one data point being classified to a classification confidence level having a value greater than an upper threshold confidence level.
53. The computer-readable medium of claim 52 wherein the upper threshold confidence level is 92.5% or greater.
54. The computer-readable medium of claim 52 or claim 53 wherein the upper threshold confidence level lies in the range 92.5% to 97.5%.
55. The computer-readable medium of claim 52 or claim 53 wherein the upper threshold confidence level is 95%.
56. The computer-readable medium of any one of claims 47 to 55, the method further comprising maintaining the at least one data point in a low confidence database responsive to the at least one data point being classified to a classification confidence level having a value below a lower threshold confidence level.
57. The computer-readable medium of claim 56 wherein the lower threshold confidence level is lower than the upper threshold confidence level.
58. The computer-readable medium of claim 56 or claim 57 wherein the lower threshold confidence level is less than 95%.
59. The computer-readable medium of claim 56 or claim 57 wherein the lower threshold confidence level lies in the range 70% to 95%.
60. The computer-readable medium of claim 56 or claim 57 wherein the lower threshold confidence level is less than 92.5%.
61. The computer-readable medium of claim 56 or claim 57 wherein the lower threshold confidence level is less than 80%.
62. The computer-readable medium of any one of claims 47 to 61 wherein the first machine learning model comprises a Radius Nearest Neighbour (Rad-NN) model.
63. The computer-readable medium of claim 62 wherein classifying the at least one data point using the first machine learning model includes counting votes of all neighbouring points positioned within a specified radius r.
64. The computer-readable medium of claim 62 or claim 63, the method further comprising determining the first confidence level as a ratio of k nearest like neighbours to k nearest unlike neighbours, k having a value in the range 10 to 15.
65. The computer-readable medium of any one of claims 47 to 64 wherein the second machine learning model comprises a Support Vector Machine (SVM) model having parameters C and γ.
66. The computer-readable medium of claim 65 wherein the values of C and/or γ are user-defined.
67. The computer-readable medium of claim 65 or claim 66 wherein the value of C is approximately 1,000.
68. The computer-readable medium of any one of claims 65 to 67 wherein the value of γ is approximately 0.1.
69. The computer-readable medium of any one of claims 52 to 55, the method further comprising training the second machine learning model from at least one data point retrieved from the high confidence database.
70. A method of detecting anomalies within a set of data points, the method comprising:
receiving a data point from the set of data points;
determining a similarity score associated to the data point using a first machine learning model;
responsive to the data point having a similarity score less than a threshold value: determining a first value for an anomaly score associated to the data point, and adding the data point to a first database; and
responsive to the data point having a similarity score greater than the threshold value: determining a second value for the anomaly score associated to the data point, the second value not equal to the first value.
71. The method of claim 70 further comprising:
adding the similarity score associated to the data point to a first batch of similarity scores maintained in a second database; and
responsive to determining that the first batch of similarity scores is significantly different to a second batch of similarity scores: adding the data points associated to the respective similarity scores in the first batch to a low confidence database.
72. The method of claim 71 further comprising:
receiving the first batch of similarity scores; and
responsive to determining that the first batch of similarity scores is not significantly different to the second batch of similarity scores: updating the threshold value.
73. The method of claim 72 further comprising:
removing outliers from the first batch of similarity scores.
74. The method of claim 72 or claim 73 wherein updating the threshold value comprises determining a median value of the first batch of similarity scores; and setting the threshold value to the median value.
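Claims 70 to 74 score each point against a similarity threshold, assigning one of two distinct anomaly-score values, and recentre that threshold on the median of a batch of similarity scores after removing outliers. The sketch below is illustrative only: the two score values, the 2-standard-deviation outlier rule, and the toy scores are assumptions the claims do not prescribe.

```python
import statistics

ANOMALOUS, NORMAL = 1.0, 0.0  # two distinct anomaly-score values (claim 70)

def score_point(similarity, threshold, first_db, batch):
    """Claim 70: a below-threshold similarity yields the anomalous score
    and stores the point in the first database; the score also joins the
    current batch for the later threshold update (claim 71)."""
    batch.append(similarity)
    if similarity < threshold:
        first_db.append(similarity)
        return ANOMALOUS
    return NORMAL

def updated_threshold(batch):
    """Claims 73-74: remove outliers, then set the threshold to the median
    of the remaining scores. The 2-standard-deviation outlier rule is an
    illustrative choice; the claims leave the rule open."""
    mean, sd = statistics.mean(batch), statistics.stdev(batch)
    kept = [s for s in batch if abs(s - mean) <= 2 * sd]
    return statistics.median(kept)

# Toy run: one clearly anomalous similarity among otherwise high scores.
first_db, batch = [], []
for s in [0.80, 0.82, 0.85, 0.90, 0.88, 0.83, 0.81, 0.86, 0.05]:
    score_point(s, threshold=0.5, first_db=first_db, batch=batch)
threshold = updated_threshold(batch)  # recentred near the typical score (~0.84)
```

Because the update discards the outlying 0.05 before taking the median, a single anomalous point cannot drag the threshold down, which is the practical point of claim 73.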
75. An anomaly detection system configured to detect anomalies within a set of data points, the system comprising:
a first machine learning model;
a second machine learning model; and
a processor configured to:
receive a data point from the set of data points;
determine a similarity score associated to the data point using the first machine learning model;
responsive to the data point having a similarity score less than a threshold value:
determine a first value for an anomaly score associated to the data point, and
add the data point to a first database; and
responsive to the data point having a similarity score greater than the threshold value:
determine a second value for the anomaly score associated to the data point, the second value not equal to the first value.
76. A computer-readable medium having stored thereon computer-executable instructions that, when executed by a processor, cause the processor to perform a method of detecting anomalies within a set of data points, the method comprising:
receiving a data point from the set of data points;
determining a similarity score associated to the data point using a first machine learning model;
responsive to the data point having a similarity score less than a threshold value: determining a first value for an anomaly score associated to the data point, and adding the data point to a first database; and
responsive to the data point having a similarity score greater than the threshold value: determining a second value for the anomaly score associated to the data point, the second value not equal to the first value.
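Claims 71 and 72 route a batch of similarity scores to the low confidence database when it is significantly different from a second batch, and update the threshold otherwise; the claims leave the significance test open. One plausible choice is a two-sample Kolmogorov-Smirnov comparison, sketched below with a hand-rolled statistic and a fixed critical value standing in for a proper significance level (both are assumptions, not claim language).

```python
def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical CDFs of the two batches of similarity scores."""
    def ecdf(xs, t):
        return sum(1 for x in xs if x <= t) / len(xs)
    points = sorted(set(a) | set(b))
    return max(abs(ecdf(a, t) - ecdf(b, t)) for t in points)

def batches_differ(current, reference, critical=0.5):
    """Claim 71: a significantly different batch sends its points to the
    low confidence database; claim 72: otherwise the threshold is updated.
    The fixed critical value is an illustrative stand-in."""
    return ks_statistic(current, reference) > critical

# Disjoint score ranges differ maximally; identical batches do not differ.
assert batches_differ([0.1, 0.2], [0.8, 0.9])
assert not batches_differ([0.1, 0.2, 0.3], [0.1, 0.2, 0.3])
```

A production system would replace the fixed critical value with a level drawn from the KS distribution for the batch sizes in use; the cascade structure of the claims is unaffected by that choice.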
PCT/NZ2020/050025 2019-03-11 2020-03-11 Hybrid machine learning system and method WO2020185101A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
AU2019900788 2019-03-11
AU2019900788A AU2019900788A0 (en) 2019-03-11 Hybrid machine learning system and method

Publications (2)

Publication Number Publication Date
WO2020185101A1 WO2020185101A1 (en) 2020-09-17
WO2020185101A9 true WO2020185101A9 (en) 2021-01-14

Family

ID=72426448

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/NZ2020/050025 WO2020185101A1 (en) 2019-03-11 2020-03-11 Hybrid machine learning system and method

Country Status (1)

Country Link
WO (1) WO2020185101A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220012625A1 * 2020-07-08 2022-01-13 VMware, Inc. Unsupervised anomaly detection via supervised methods
US11620578B2 * 2020-07-08 2023-04-04 VMware, Inc. Unsupervised anomaly detection via supervised methods
US20220188410A1 * 2020-12-15 2022-06-16 Oracle International Corporation Coping with feature error suppression: a mechanism to handle the concept drift
US20220188674A1 * 2020-12-15 2022-06-16 International Business Machines Corporation Machine learning classifiers prediction confidence and explanation

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116072244A * 2021-11-01 2023-05-05 博西华电器(江苏)有限公司 Odor detection method for refrigeration appliance and refrigeration appliance

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20770178

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20770178

Country of ref document: EP

Kind code of ref document: A1