CN111814908A - Abnormal data detection model updating method and device based on data flow

Info

Publication number: CN111814908A
Application number: CN202010751183.6A
Authority: CN
Inventors: 王腾江; 陈兆瑞
Original assignee: Inspur General Software Co Ltd
Current assignee: Inspur General Software Co Ltd
Priority date: 2020-07-30
Filing date: 2020-07-30
Publication date: 2020-10-23
Anticipated expiration: 2040-07-30
Also published as: CN111814908B

Abstract

The invention discloses a method and a device for updating an abnormal data detection model based on a data stream. The method comprises the following steps: training an initialization model of a clustering algorithm on an initial data set to obtain a clustering model comprising a normal data set and an abnormal data set; continuously receiving new data from the data stream and, for each new data in sequence, determining based on the clustering model whether the new data is normal data or abnormal data; in response to the new data being normal data, determining whether the clustering model has abnormal data within the neighborhood range of the new data; in response to such abnormal data existing, resetting each of the abnormal data to normal data, then taking the reset data as the new data and iteratively repeating the previous step until no abnormal data exists in the neighborhood range; and moving all normal data obtained by resetting abnormal data from the abnormal data set into the normal data set, and adding the new data to the normal data set. The invention can reduce computation cost, improve processing speed and accuracy, and is easy to implement.

Description

Abnormal data detection model updating method and device based on data flow

Technical Field

The present invention relates to the field of abnormal data detection, and more particularly, to a method and an apparatus for updating an abnormal data detection model based on data flow.

Background

At present, in most industrial big data analysis platforms, anomaly detection uses a model trained offline for online detection. Such platforms either lack a step that updates the model automatically, or build a mechanism that triggers offline training so that the model can be updated and redeployed. For example, Spark, a batch data computing tool widely used by many platforms, integrates the machine learning tool Spark ML, which helps a platform store and load anomaly detection models and implements the flow of offline training and online detection. This mode has several drawbacks: updates and iterations are slow, and the error rate is high when an old model is used to detect new data; each deployment is complicated and error-prone; and, because every iteration requires retraining, the computation cost is too high, so the approach is mostly used on batch computing platforms and is not suitable for real-time stream computing. More and more platforms therefore hope to use algorithms that update the model online in real time, so that the model updates itself, its accuracy improves in real time, and platform stability improves because a single deployment suffices. The FTRL (Follow-the-Regularized-Leader) algorithm is favored for this purpose: for each training sample it computes the class and loss of the sample, then uses that loss to compute a gradient, and back-propagates once to update the model. For example, the Flink streaming data processing tool can integrate the Alink algorithm tool and update the model by online learning, which gives good real-time performance. The FTRL algorithm works well for models based on logistic regression that can be guided by back-propagation, but the algorithm is difficult to understand, it requires that the model support back-propagation, it is not suitable for unsupervised algorithms, and the norm it introduces as a supplementary term in the loss function is complex and increases the computation cost.

For the problems of high computation cost, low speed, low accuracy and difficulty of implementation when an unsupervised algorithm processes streaming data in the prior art, no effective solution is currently available.

Disclosure of Invention

In view of this, an object of the embodiments of the present invention is to provide a method and an apparatus for updating an abnormal data detection model based on a data stream, which can reduce computation cost, increase processing speed, increase accuracy, and are easy to implement.

In view of the foregoing, a first aspect of the embodiments of the present invention provides a method for updating an abnormal data detection model based on data flow, including the following steps:

training an initialization model of a clustering algorithm based on an initial data set to obtain a clustering model comprising a normal data set and an abnormal data set;

continuously receiving new data from the data stream, processing the new data, and sequentially executing the following steps:

determining the new data to be normal data or abnormal data based on the clustering model;

determining whether abnormal data exists in the clustering model in the neighborhood range of the new data or not in response to the new data being normal data;

in response to the existence of abnormal data, resetting each abnormal data to be normal data, and further taking the reset data as new data to iterate and repeatedly execute the previous step until no abnormal data exists in the neighborhood range;

all normal data obtained from the abnormal data reset is shifted from the abnormal data set into the normal data set, and new data is added to the normal data set.

In some embodiments, the method further comprises: in response to the new data being abnormal data, performing abnormal data processing on the new data using the clustering model, and adding the new data to the abnormal data set.

In some embodiments, processing each new data includes: processing the previous new data and, after the normal data set and the abnormal data set have been updated as a result, processing the next new data using the clustering model with the updated normal data set and abnormal data set.

In some embodiments, training the initialization model of the clustering algorithm based on the initial data set comprises:

determining a neighborhood radius and a neighborhood density based on the initial data set, and marking all data in the initial data set as unvisited;

randomly selecting data marked as unvisited from the initial data set and executing the following steps in sequence, until the initial data set contains no data marked as unvisited:

changing the mark of the selected data from unvisited to visited;

determining the data located within the neighborhood radius of the selected data based on the distance from the selected data to other data and the neighborhood radius;

in response to the number of data within the neighborhood radius of the selected data being greater than or equal to the neighborhood density and the selected data not belonging to any existing cluster, creating a new cluster and incorporating the selected data into the new cluster;

and changing the mark of the data within the neighborhood radius of the selected data to visited, incorporating that data into the new cluster, and repeating the above steps with each data within the neighborhood radius of the selected data taken in turn as the selected data.

In some embodiments, the clustering algorithm is a DBSCAN algorithm; the distance from the selected data to other data is Euclidean distance or Mahalanobis distance;

the method further comprises the following steps: the selected data is labeled as noise in response to the number of data within the neighborhood radius of the selected data being less than the neighborhood density and there being at least one data belonging to any existing cluster.
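The two distance measures named above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function names, the `within_radius` helper and the choice to pass the inverse covariance matrix in as an argument are assumptions made for the example.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def euclidean(a: Vector, b: Vector) -> float:
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mahalanobis(a: Vector, b: Vector, inv_cov: Matrix) -> float:
    """Distance scaled by an inverse covariance matrix.

    Assumption: the caller supplies inv_cov (for example, estimated
    from the initial data set); it is not computed here.
    """
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    # Quadratic form d^T * inv_cov * d
    q = sum(d[i] * inv_cov[i][j] * d[j] for i in range(n) for j in range(n))
    return math.sqrt(q)


def within_radius(points: Sequence[Vector], idx: int, radius: float, dist) -> List[int]:
    """Indices of the other points whose distance to points[idx] is <= radius."""
    return [j for j, q in enumerate(points)
            if j != idx and dist(points[idx], q) <= radius]
```

With an identity matrix as `inv_cov`, the Mahalanobis distance reduces to the Euclidean distance, which is a convenient sanity check when swapping one measure for the other.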

A second aspect of the embodiments of the present invention provides an abnormal data detection model updating apparatus based on a data flow, including:

a processor; and

a memory storing program code executable by the processor, the program code when executed performing the steps of:

In some embodiments, the steps further comprise: in response to the new data being abnormal data, performing abnormal data processing on the new data using the clustering model, and adding the new data to the abnormal data set.

modifying the tag of the selected data from not accessed to accessed;

the method also comprises the following steps: the selected data is labeled as noise in response to the number of data within the neighborhood radius of the selected data being less than the neighborhood density and there being at least one data belonging to any existing cluster.

The invention has the following beneficial technical effects: in the method and device for updating an abnormal data detection model based on a data stream provided by the embodiments of the invention, a clustering model comprising a normal data set and an abnormal data set is obtained by training an initialization model of a clustering algorithm on an initial data set; new data is continuously received from the data stream and each new data is processed in sequence: it is determined, based on the clustering model, whether the new data is normal data or abnormal data; in response to the new data being normal data, it is determined whether the clustering model has abnormal data within the neighborhood range of the new data; in response to such abnormal data existing, each of the abnormal data is reset to normal data, and the reset data is then taken as the new data to iteratively repeat the previous step until no abnormal data exists in the neighborhood range; and all normal data obtained by resetting abnormal data is moved from the abnormal data set into the normal data set, and the new data is added to the normal data set.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic flow chart of an abnormal data detection model updating method based on data flow according to the present invention;

FIG. 2 is a detailed flowchart of an abnormal data detection model updating method based on data flow according to the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the following embodiments of the present invention are described in further detail with reference to the accompanying drawings.

It should be noted that all expressions using "first" and "second" in the embodiments of the present invention are used to distinguish two entities or parameters that share the same name but are not the same. "First" and "second" are therefore merely for convenience of description and should not be construed as limiting the embodiments of the present invention; this is not explained again in the following embodiments.

In view of the above, a first aspect of the embodiments of the present invention provides an embodiment of an abnormal data detection model updating method that can reduce the computation cost, increase the processing speed, increase the accuracy, and is easy to implement. Fig. 1 is a schematic flow chart of an abnormal data detection model updating method based on data flow according to the present invention.

The abnormal data detection model updating method based on data flow, as shown in fig. 1, includes the following steps:

step S101: training an initialization model of a clustering algorithm based on an initial data set to obtain a clustering model comprising a normal data set and an abnormal data set;

step S103: continuously receiving new data from the data stream, processing the new data, and sequentially executing the following steps:

step S1031: determining the new data to be normal data or abnormal data based on the clustering model;

step S1033: determining whether abnormal data exists in the clustering model in the neighborhood range of the new data or not in response to the new data being normal data;

step S1035: in response to the existence of abnormal data, resetting each abnormal data to be normal data, and further taking the reset data as new data to iterate and repeatedly execute the previous step until no abnormal data exists in the neighborhood range;

step S1037: all normal data obtained from the abnormal data reset is shifted from the abnormal data set into the normal data set, and new data is added to the normal data set.
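Steps S1031 to S1037 can be sketched as a small in-memory model. This is an illustration under stated assumptions, not the patented implementation: the class name `StreamingModel`, the parameters `eps` and `min_pts`, and in particular the rule used in `is_normal` (a point counts as normal when at least `min_pts` stored normal points lie within `eps`) are hypothetical, since the text does not fix how the clustering model classifies a new point.

```python
import math
from collections import deque
from typing import Iterable, List, Tuple

Point = Tuple[float, ...]


class StreamingModel:
    """Normal and abnormal data sets that are updated one datum at a time."""

    def __init__(self, normal: Iterable[Point], abnormal: Iterable[Point],
                 eps: float, min_pts: int) -> None:
        self.normal: List[Point] = list(normal)
        self.abnormal: List[Point] = list(abnormal)
        self.eps = eps
        self.min_pts = min_pts

    def is_normal(self, p: Point) -> bool:
        # S1031. Assumed rule: enough normal points inside the neighborhood.
        close = sum(1 for q in self.normal if math.dist(p, q) <= self.eps)
        return close >= self.min_pts

    def update(self, p: Point) -> bool:
        """Process one new datum; return True if it was judged normal."""
        if not self.is_normal(p):
            self.abnormal.append(p)          # abnormal data joins the abnormal set
            return False
        reset: List[Point] = []
        frontier = deque([p])
        while frontier:                      # S1033 and S1035, iterated
            center = frontier.popleft()
            near = [q for q in self.abnormal
                    if math.dist(center, q) <= self.eps]
            for q in near:
                self.abnormal.remove(q)      # reset abnormal data to normal
                reset.append(q)
                frontier.append(q)           # reset data becomes the next center
        self.normal.extend(reset)            # S1037: move the reset data
        self.normal.append(p)                # S1037: add the new datum
        return True
```

A short usage example: a model holding normal points `(0, 0)`, `(0, 1)`, `(1, 0)` and abnormal points `(2.5, 0)`, `(4, 0)`, `(20, 20)` with `eps=1.6` and `min_pts=2` receives `(1.2, 0)`. The new point is judged normal, `(2.5, 0)` lies in its neighborhood and is reset, `(4, 0)` lies in the neighborhood of `(2.5, 0)` and is reset in turn, and `(20, 20)` stays abnormal. Because only the neighborhoods of the affected points are examined, no retraining on the full data set takes place.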

Unsupervised learning algorithms are commonly used for anomaly detection on industrial data. Because an unsupervised learning algorithm needs to learn from batch data, most platforms at the present stage update the anomaly detection model offline: the model is trained offline and deployed online for prediction, and whenever the model needs to be updated (by a manual or automatic policy), it must be retrained on the collected data and then redeployed. With this strategy the old model cannot be updated in real time and cannot adapt to new data promptly, which affects accuracy, and retraining the model on a large amount of data is computationally expensive. On a streaming data platform, data arrives in real time, which provides the data needed to update a model online, and some real-time online updating algorithms are already in use. For example, the FTRL algorithm used by some platforms adapts well to algorithms based on logistic regression, but it is difficult to understand, requires that the model support back-propagation, and is not suitable for unsupervised algorithms. Therefore, drawing on the idea of the FTRL algorithm and on an unsupervised algorithm, the invention provides a corresponding online model updating algorithm for streaming data in an industrial big data analysis platform, which can improve the accuracy of the model in real time and has the characteristics of low computation cost, high speed and ease of use.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program to instruct relevant hardware to perform the processes, and the processes can be stored in a computer readable storage medium, and when executed, the processes can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a Random Access Memory (RAM), or the like. Embodiments of the computer program may achieve the same or similar effects as any of the preceding method embodiments to which it corresponds.

modifying the tag of the selected data from not accessed to accessed;

In some embodiments, the clustering algorithm is a DBSCAN algorithm; the distance from the selected data to other data is Euclidean distance or Mahalanobis distance; the method further comprises the following steps: the selected data is labeled as noise in response to the number of data within the neighborhood radius of the selected data being less than the neighborhood density and there being at least one data belonging to any existing cluster.

The method disclosed according to an embodiment of the present invention may also be implemented as a computer program executed by a CPU, which may be stored in a computer-readable storage medium. The computer program, when executed by the CPU, performs the above-described functions defined in the method disclosed in the embodiments of the present invention. The above-described method steps and system elements may also be implemented using a controller and a computer-readable storage medium for storing a computer program for causing the controller to implement the functions of the above-described steps or elements.

The following further illustrates embodiments of the invention in terms of specific examples.

The DBSCAN algorithm is a density-based clustering algorithm that is widely applied to anomaly detection and that marks clusters of data objects. Its steps are as follows:

1. input a training data set D, a neighborhood radius parameter and a neighborhood density value MinPts;

2. mark all points as unvisited;

3. randomly select an unvisited point p and mark it as visited;

4. calculate the distance from point p to the other points (the distance may be the Euclidean distance, the Mahalanobis distance, or the like); if at least MinPts objects lie in the neighborhood of point p and point p does not yet belong to any set, create a new set and add point p to it; otherwise, mark point p as noise;

5. mark all objects in the neighborhood of point p as visited and add them to the set of point p;

6. repeat steps 4 and 5 with each object in the neighborhood of point p taken in turn as point p, until no point in the neighborhood remains marked as unvisited;

7. repeat steps 3, 4 and 5 until all data have been traversed.
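The steps above can be sketched as a conventional DBSCAN pass. This follows the textbook form of the algorithm rather than reproducing the listed steps word for word, and several details are assumptions made for the example: points are visited in index order instead of at random so that the result is reproducible, the neighborhood count includes the point itself, Euclidean distance is used, and `NOISE`, `region_query` and `dbscan` are illustrative names.

```python
import math
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, ...]
NOISE = -1


def region_query(points: Sequence[Point], i: int, eps: float) -> List[int]:
    """Indices of all points within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points: Sequence[Point], eps: float, min_pts: int) -> List[int]:
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    labels: List[Optional[int]] = [None] * len(points)   # None means unvisited
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue                          # already visited
        seeds = region_query(points, i, eps)
        if len(seeds) < min_pts:
            labels[i] = NOISE                 # too sparse: mark as noise
            continue
        cluster_id += 1                       # dense enough: start a new cluster
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id        # earlier noise becomes a border point
                continue
            if labels[j] is not None:
                continue                      # already assigned to a cluster
            labels[j] = cluster_id
            nbrs = region_query(points, j, eps)
            if len(nbrs) >= min_pts:
                queue.extend(nbrs)            # dense neighbor: keep expanding
    return [NOISE if label is None else label for label in labels]
```

In this sketch the points labeled `NOISE` would form the abnormal data set and all other points the normal data set. Each `region_query` scans every point, so one full pass is quadratic in the size of the data set, which is the cost that repeated offline retraining would incur on every update.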

The DBSCAN algorithm requires many iterations and a large amount of computation; if offline updating is used, a large amount of data is recalculated every time, so the computation cost is high. Based on this training process, the invention provides a method for updating the model online, which saves a large amount of computation cost, updates the model in real time and improves the accuracy with which the model detects new data. The specific flow is shown in FIG. 2.

First, a portion of the data is collected in real time as the training set of the initialization model, and initialization training of the model is carried out. For each subsequent piece of industrial data, the model is used to judge whether it is an abnormal value. If it is, the abnormal value processing flow is entered; if it is not, the neighborhood range of the data is searched for existing abnormal values, and any that are found are marked as normal values and assigned to the normal value cluster. Those reset values are then used to iteratively update other abnormal values in the same way until the remaining abnormal values have been traversed, so that some abnormal values are corrected to normal values and the accuracy of the model is improved in real time.

In addition, update iterations of the model can also be realized by combining online detection with offline updating. For example, the model is deployed on a platform for online anomaly detection; when the volume of detected data reaches a certain amount, an offline update of the model is triggered, and the updated model is then redeployed on the platform (the strategy that triggers the offline update may differ between platforms).
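The volume-triggered alternative can be sketched as a small counter around a retraining callback. The names `OfflineRetrainTrigger`, `threshold` and `retrain` are hypothetical, and retraining on only the buffered batch (rather than on all data collected so far) is an assumption made to keep the example short.

```python
from typing import Any, Callable, List


class OfflineRetrainTrigger:
    """Buffer detected data and retrain once a volume threshold is reached."""

    def __init__(self, threshold: int, retrain: Callable[[List[Any]], Any]) -> None:
        self.threshold = threshold
        self.retrain = retrain               # assumed offline training routine
        self.buffer: List[Any] = []
        self.model: Any = None

    def observe(self, item: Any) -> bool:
        """Record one detected datum; return True if a retrain was triggered."""
        self.buffer.append(item)
        if len(self.buffer) >= self.threshold:
            self.model = self.retrain(self.buffer)   # offline update, then redeploy
            self.buffer = []
            return True
        return False
```

Unlike the per-datum online update, every trigger here pays the full training cost again, which is the trade-off the preceding paragraphs describe.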

It can be seen from the foregoing embodiments that, in the abnormal data detection model updating method based on data flow provided in the embodiments of the present invention, a clustering model including a normal data set and an abnormal data set is obtained by training an initialization model of a clustering algorithm based on an initial data set; continuously receiving new data from the data stream, processing each new data, determining whether the new data is normal data or abnormal data based on the clustering model in sequence, responding to the fact that the new data is normal data, determining whether the clustering model has abnormal data in a neighborhood range of the new data, responding to the fact that the abnormal data exists, resetting each abnormal data to be normal data, further taking the reset data as the new data to iterate and repeatedly execute the previous step until no abnormal data exists in the neighborhood range, shifting all the normal data obtained by resetting the abnormal data from the abnormal data set into the normal data set, and adding the new data into the normal data set.

It should be particularly noted that, the steps in the embodiments of the data flow-based abnormal data detection model updating method may be mutually intersected, replaced, added, or deleted, and therefore, these reasonable permutation and combination transformations should also belong to the scope of the present invention, and should not limit the scope of the present invention to the described embodiments.

In view of the above-mentioned objects, a second aspect of the embodiments of the present invention provides an embodiment of an abnormal data detection model update apparatus, which can reduce the calculation cost, increase the processing speed, increase the accuracy, and is easy to implement. The abnormal data detection model updating device based on the data flow comprises:

a processor; and

modifying the tag of the selected data from not accessed to accessed;

In some embodiments, the clustering algorithm is a DBSCAN algorithm; the distance from the selected data to other data is Euclidean distance or Mahalanobis distance; the method also comprises the following steps: the selected data is labeled as noise in response to the number of data within the neighborhood radius of the selected data being less than the neighborhood density and there being at least one data belonging to any existing cluster.

It can be seen from the foregoing embodiments that, in the abnormal data detection model updating apparatus based on data stream provided in the embodiments of the present invention, the cluster model including the normal data set and the abnormal data set is obtained by training the initialization model of the clustering algorithm based on the initial data set; continuously receiving new data from the data stream, processing each new data, determining whether the new data is normal data or abnormal data based on the clustering model in sequence, responding to the fact that the new data is normal data, determining whether the clustering model has abnormal data in a neighborhood range of the new data, responding to the fact that the abnormal data exists, resetting each abnormal data to be normal data, further taking the reset data as the new data to iterate and repeatedly execute the previous step until no abnormal data exists in the neighborhood range, shifting all the normal data obtained by resetting the abnormal data from the abnormal data set into the normal data set, and adding the new data into the normal data set.

It should be particularly noted that the above embodiment of the data flow-based abnormal data detection model updating apparatus employs the embodiment of the data flow-based abnormal data detection model updating method to specifically describe the working process of each module, and those skilled in the art can easily think that these modules are applied to other embodiments of the data flow-based abnormal data detection model updating method. Of course, since the steps in the embodiment of the data flow-based abnormal data detection model updating method may be intersected, replaced, added, or deleted, these reasonable permutation, combination and transformation should also belong to the scope of the present invention, and should not limit the scope of the present invention to the embodiment.

The foregoing is an exemplary embodiment of the present disclosure, but it should be noted that various changes and modifications could be made herein without departing from the scope of the present disclosure as defined by the appended claims. The functions, steps and/or actions of the method claims in accordance with the disclosed embodiments described herein need not be performed in any particular order. Furthermore, although elements of the disclosed embodiments of the invention may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated.

It should be understood that, as used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly supports the exception. It should also be understood that "and/or" as used herein is meant to include any and all possible combinations of one or more of the associated listed items. The numbers of the embodiments disclosed in the embodiments of the present invention are merely for description, and do not represent the merits of the embodiments.

It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.

Those of ordinary skill in the art will understand that the discussion of any embodiment above is exemplary only and is not intended to imply that the scope of the disclosure of the embodiments of the invention, including the claims, is limited to these examples. Within the idea of the embodiments of the invention, technical features in the above embodiment or in different embodiments may also be combined, and there are many other variations of the different aspects of the embodiments of the invention as described above, which are not provided in detail for the sake of brevity. Therefore, any omissions, modifications, substitutions, improvements, and the like that may be made without departing from the spirit and principles of the embodiments of the present invention are intended to be included within the scope of the embodiments of the present invention.

Claims

1. An abnormal data detection model updating method based on data flow is characterized by comprising the following steps:

continuously receiving new data from a data stream, processing the new data, and sequentially executing the following steps:

in response to the new data being normal data, determining whether abnormal data exists in a neighborhood of the new data by the clustering model;

in response to the abnormal data, respectively resetting each abnormal data to be normal data, and further taking the reset data as the new data to iteratively and repeatedly execute the previous step until no abnormal data exists in the neighborhood range;

shifting all of the normal data obtained from the abnormal data reset from the abnormal data set into the normal data set, and adding the new data into the normal data set.

2. The method of claim 1, further comprising: in response to the new data being abnormal data, performing abnormal data processing on the new data using the clustering model, and adding the new data to the abnormal data set.

3. The method of claim 1, wherein processing each new data comprises: processing the previous new data and, after the normal data set and the abnormal data set have been updated as a result, processing the next new data using the clustering model with the updated normal data set and abnormal data set.

4. The method of claim 1, wherein training an initialization model of a clustering algorithm based on an initial data set comprises:

repeatedly selecting at random data marked as unvisited from the initial data set and executing the following steps in sequence, until the initial data set contains no data marked as unvisited:

changing the mark of the selected data from unvisited to visited;

and changing the mark of the data within the neighborhood radius of the selected data to visited, incorporating that data into the new cluster, and repeating the above steps with each data within the neighborhood radius of the selected data taken in turn as the selected data.

5. The method of claim 4, wherein the clustering algorithm is a DBSCAN algorithm; the distance from the selected data to other data is Euclidean distance or Mahalanobis distance;

the method further comprises the following steps: in response to the number of data within a neighborhood radius of the selected data being less than the neighborhood density and there being at least one data belonging to any existing cluster, marking the selected data as noise.

6. An abnormal data detection model updating device based on data flow is characterized by comprising:

a processor; and

7. The apparatus of claim 6, wherein the steps further comprise: in response to the new data being abnormal data, performing abnormal data processing on the new data using the clustering model, and adding the new data to the abnormal data set.

8. The apparatus of claim 6, wherein processing each new data comprises: processing the previous new data and, after the normal data set and the abnormal data set have been updated as a result, processing the next new data using the clustering model with the updated normal data set and abnormal data set.

9. The apparatus of claim 6, wherein training an initialization model of a clustering algorithm based on an initial data set comprises:

modifying the tag of the selected data from not accessed to accessed;

10. The apparatus of claim 9, wherein the clustering algorithm is a DBSCAN algorithm; the distance from the selected data to other data is Euclidean distance or Mahalanobis distance;

the steps further include: in response to the number of data within a neighborhood radius of the selected data being less than the neighborhood density and there being at least one data belonging to any existing cluster, marking the selected data as noise.