CN111814908B - Abnormal data detection model updating method and device based on data flow

Info

Publication number: CN111814908B
Application number: CN202010751183.6A
Authority: CN
Inventors: 王腾江; 陈兆瑞
Original assignee: Inspur General Software Co Ltd
Current assignee: Inspur General Software Co Ltd
Priority date: 2020-07-30
Filing date: 2020-07-30
Publication date: 2023-06-27
Anticipated expiration: 2040-07-30
Also published as: CN111814908A

Abstract

The invention discloses a method and device for updating an abnormal data detection model based on a data stream. The method comprises: training an initialization model of a clustering algorithm on an initial data set to obtain a clustering model comprising a normal data set and an abnormal data set; continuously receiving new data from the data stream and, for each new data in turn, determining whether the new data is normal data or abnormal data based on the clustering model; in response to the new data being normal data, determining whether the clustering model has abnormal data within a neighborhood range of the new data; in response to such abnormal data existing, resetting each abnormal data as normal data, and iteratively repeating the previous step with the reset data as the new data until no abnormal data remains within the neighborhood range; and moving all normal data obtained by resetting abnormal data from the abnormal data set into the normal data set, and adding the new data to the normal data set. The invention can reduce computational cost, increase processing speed, improve accuracy, and is easy to implement.

Description

Abnormal data detection model updating method and device based on data flow

Technical Field

The present invention relates to the field of abnormal data detection, and in particular, to a method and apparatus for updating an abnormal data detection model based on a data stream.

Background

At present, in most industrial big data analysis platforms, the anomaly detection workflow trains a model offline and uses it for online detection. It either lacks a step for automatically updating the model, or relies on a mechanism that triggers offline retraining and redeployment. For example, Spark, a batch data computing tool widely used by many platforms, integrates the Spark ML machine learning library, which helps a platform save and load an anomaly detection model and implement an offline-training, online-detection workflow. With this approach, update iterations are slow, and the error rate is high when an old model is used to detect new data; each deployment is tedious and error-prone; and the approach is mostly used on batch computing platforms, where retraining at every iteration is too expensive, so it is not suitable for real-time stream computing. More and more platforms therefore hope to use an algorithm that updates the model online in real time, so that the model updates itself, its accuracy improves in real time, a single deployment suffices, and platform stability improves. The FTRL (Follow-the-Regularized-Leader) algorithm is popular for this purpose. For each training sample, this model update algorithm first computes the predicted class and the loss of the sample, then uses that loss to compute the gradient and updates the model once by backpropagation.

For example, the Flink stream data processing tool integrates the Alink algorithm library, which uses online learning to update the model and offers good real-time performance. The FTRL algorithm works well for models guided by backpropagation, such as logistic regression, but the algorithm is difficult to understand, requires the model to support backpropagation, is not suitable for unsupervised algorithms, and introduces norms as supplementary loss terms, which is complex and increases computational cost.
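The per-sample cycle described above (predict, compute the loss, compute the gradient, update the model once) can be sketched as follows. This is a minimal illustration using plain online gradient descent on logistic regression, not FTRL itself, which additionally uses per-coordinate learning rates and norm-based regularization. The helper name `online_step`, the learning rate and the sample stream are assumptions made for the example.

```python
import math

def online_step(weights, x, y, lr=0.1):
    """One online update for logistic regression (hypothetical helper).

    Predicts with the current weights, computes the log loss of the
    sample, then takes a single gradient step.
    """
    z = sum(w * xi for w, xi in zip(weights, x))
    p = 1.0 / (1.0 + math.exp(-z))            # predicted probability of class 1
    eps = 1e-12                                # guards against log(0)
    loss = -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))
    grad = [(p - y) * xi for xi in x]          # gradient of the log loss
    new_weights = [w - lr * g for w, g in zip(weights, grad)]
    return new_weights, p, loss

# Illustrative stream of (features, label) pairs; the first feature is a bias term.
weights = [0.0, 0.0]
stream = [([1.0, 2.0], 1), ([1.0, -1.5], 0), ([1.0, 1.0], 1)]
losses = []
for x, y in stream:
    weights, p, loss = online_step(weights, x, y)
    losses.append(loss)
```

The sketch also shows why this style of update presupposes a differentiable, supervised model: every step needs a label `y` and a gradient, neither of which a clustering model provides.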

For the problems in the prior art that processing stream data with an unsupervised algorithm is computationally expensive, slow, inaccurate and difficult to implement, no effective solution currently exists.

Disclosure of Invention

In view of the above, an object of the embodiments of the present invention is to provide a method and apparatus for updating an abnormal data detection model based on a data stream, which can reduce the calculation cost, increase the processing speed, increase the accuracy and be easy to implement.

Based on the above object, a first aspect of the embodiments of the present invention provides a method for updating an abnormal data detection model based on a data stream, including performing the following steps:

training an initialization model of a clustering algorithm based on the initial data set to obtain a clustering model comprising a normal data set and an abnormal data set;

continuously receiving new data from the data stream, processing each new data, and sequentially executing the following steps:

determining new data as normal data or abnormal data based on the clustering model;

determining whether the clustering model has abnormal data in a neighborhood range of the new data in response to the new data being the normal data;

in response to the existence of the abnormal data, resetting each abnormal data as normal data, and further iteratively and repeatedly executing the previous step by taking the reset data as new data until no abnormal data exist in the neighborhood range;

all normal data obtained from the resetting of the abnormal data is moved from the abnormal data set into the normal data set, and new data is added to the normal data set.

In some embodiments, the method further comprises: performing abnormal data processing on the new data using a clustering model in response to the new data being abnormal data, and adding the new data to the abnormal data set.

In some embodiments, processing each new data includes: processing the previous new data and, after the normal data set and the abnormal data set have both been updated as a result of that processing, processing the next new data using the clustering model with the updated normal data set and the updated abnormal data set.

In some embodiments, training an initialization model of a clustering algorithm based on an initial data set includes:

determining a neighborhood radius and a neighborhood density based on the initial dataset, and marking all data in the initial dataset as unvisited;

continuing to randomly select data marked as unvisited from the initial dataset and sequentially performing the following steps, until the initial dataset no longer includes data marked as unvisited:

changing the mark of the selected data from unvisited to visited;

determining the data located within the neighborhood radius of the selected data based on the distance between the selected data and other data and the neighborhood radius;

in response to the number of data within the neighborhood radius of the selected data being greater than or equal to the neighborhood density and not belonging to any existing cluster, creating a new cluster and incorporating the selected data into the new cluster;

changing the marks of the data within the neighborhood radius of the selected data to visited and incorporating those data into the new cluster, and repeating the above steps with each of the data within the neighborhood radius of the selected data taken as the selected data.

In some embodiments, the clustering algorithm is the DBSCAN algorithm; the distance from the selected data to other data is the Euclidean distance or the Mahalanobis distance;

the method further comprises: marking the selected data as noise in response to the number of data within the neighborhood radius of the selected data being less than the neighborhood density and at least one data belonging to any existing cluster.

A second aspect of an embodiment of the present invention provides an abnormal data detection model updating apparatus based on a data stream, including:

a processor; and

a memory storing program code executable by a processor, the program code when executed performing the steps of:

In some embodiments, the steps further comprise: performing abnormal data processing on the new data using a clustering model in response to the new data being abnormal data, and adding the new data to the abnormal data set.

The invention has the following beneficial technical effects. The method and device for updating an abnormal data detection model based on a data stream provided by the embodiments of the invention adopt the technical solution of: training an initialization model of a clustering algorithm on an initial data set to obtain a clustering model comprising a normal data set and an abnormal data set; continuously receiving new data from the data stream and, for each new data in turn, determining whether the new data is normal data or abnormal data based on the clustering model; in response to the new data being normal data, determining whether the clustering model has abnormal data within the neighborhood range of the new data; in response to such abnormal data existing, resetting each abnormal data as normal data, and iteratively repeating the previous step with the reset data as the new data until no abnormal data remains within the neighborhood range; and moving all normal data obtained by resetting abnormal data from the abnormal data set into the normal data set, and adding the new data to the normal data set. This technical solution can reduce computational cost, increase processing speed, improve accuracy, and is easy to implement.

Drawings

To illustrate the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the invention, and a person skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a flowchart of the method for updating an abnormal data detection model based on a data stream provided by the invention;

FIG. 2 is a detailed flowchart of the method for updating an abnormal data detection model based on a data stream provided by the invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the following embodiments of the present invention will be described in further detail with reference to the accompanying drawings.

It should be noted that all uses of "first" and "second" in the embodiments of the present invention serve to distinguish two entities or parameters that share a name but are not the same. "First" and "second" are used only for convenience of expression and should not be construed as limiting the embodiments of the present invention; this note is not repeated in the following embodiments.

Based on the above object, a first aspect of the embodiments of the present invention proposes an embodiment of an abnormal data detection model updating method capable of reducing the calculation cost, improving the processing speed, improving the accuracy, and being easy to implement. Fig. 1 is a schematic flow chart of an updating method of an abnormal data detection model based on data flow.

The method for updating the abnormal data detection model based on the data flow, as shown in fig. 1, comprises the following steps:

step S101: training an initialization model of a clustering algorithm based on the initial data set to obtain a clustering model comprising a normal data set and an abnormal data set;

step S103: continuously receiving new data from the data stream, processing each new data, and sequentially executing the following steps:

step S1031: determining new data as normal data or abnormal data based on the clustering model;

step S1033: determining whether the clustering model has abnormal data in a neighborhood range of the new data in response to the new data being the normal data;

step S1035: in response to the existence of the abnormal data, resetting each abnormal data as normal data, and further iteratively and repeatedly executing the previous step by taking the reset data as new data until no abnormal data exist in the neighborhood range;

step S1037: all normal data obtained from the resetting of the abnormal data is moved from the abnormal data set into the normal data set, and new data is added to the normal data set.
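Steps S1031 to S1037 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the model is represented simply as two lists of points, and the classification rule in S1031 (a new point is normal if it lies within `eps` of some normal point) is an assumption, since the text does not fix that rule here. The function name `process`, the parameter `eps` and the sample points are hypothetical.

```python
import math
from collections import deque

def process(point, normal, abnormal, eps):
    """Process one new datum and update both sets in place.

    Returns "normal" or "abnormal".
    """
    # S1031: classify (assumed rule: normal if within eps of a normal point).
    if not any(math.dist(point, q) <= eps for q in normal):
        abnormal.append(point)
        return "abnormal"

    # S1033/S1035: reset abnormal points found in the neighborhood, then
    # repeat the search around each reset point until none remain.
    reset = []
    queue = deque([point])
    while queue:
        center = queue.popleft()
        near = [a for a in abnormal if math.dist(center, a) <= eps]
        for a in near:
            abnormal.remove(a)      # taken out now so it is not found twice
            reset.append(a)
            queue.append(a)

    # S1037: move the reset points into the normal set, then add the new datum.
    normal.extend(reset)
    normal.append(point)
    return "normal"

normal = [(0.0, 0.0), (0.5, 0.0)]
abnormal = [(2.0, 0.0), (3.0, 0.0), (10.0, 10.0)]
first = process((1.2, 0.0), normal, abnormal, eps=1.0)
second = process((20.0, 20.0), normal, abnormal, eps=1.0)
```

In this example the first datum is normal and pulls `(2.0, 0.0)` into the normal set, which in turn pulls in `(3.0, 0.0)`; the isolated point `(10.0, 10.0)` stays abnormal. Only points near the new datum are examined, which is where the saving over full retraining comes from.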

Unsupervised learning algorithms of the kind used in the present invention are commonly used for anomaly detection on industrial data. Because such algorithms need to learn from a large amount of data, most platforms currently update the anomaly detection model offline: the model is first trained offline and then deployed online for prediction, and when the model needs to be updated (by a manual or automatic strategy), it is retrained on the collected data and deployed again. Under this strategy the old model is not updated in real time and cannot adapt to new data promptly, which affects accuracy, and retraining on a large amount of data is computationally expensive. On a streaming data platform, data arrives in real time, which provides the data support needed to update a model online, and real-time online update algorithms are already in use. For example, the FTRL algorithm used by some platforms suits algorithms based on logistic regression well, but it is difficult to understand, requires the model to support backpropagation, and is not suitable for unsupervised algorithms. Therefore, drawing on the idea of the FTRL algorithm, the invention provides an online model update algorithm for unsupervised algorithms operating on stream data in industrial big data analysis platforms. It can improve the accuracy of the model in real time and is characterized by low computational cost, high speed and ease of use.

Those skilled in the art will appreciate that implementing all or part of the above-described methods in the embodiments may be accomplished by a computer program to instruct related hardware, and the program may be stored in a computer readable storage medium, and the program may include the steps of the embodiments of the above-described methods when executed. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a Random Access Memory (RAM), or the like. Embodiments of the computer program may achieve the same or similar effects as any of the previously described method embodiments corresponding thereto.

In some embodiments, the clustering algorithm is the DBSCAN algorithm; the distance from the selected data to other data is the Euclidean distance or the Mahalanobis distance; the method further comprises: marking the selected data as noise in response to the number of data within the neighborhood radius of the selected data being less than the neighborhood density and at least one data belonging to any existing cluster.
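The two distance measures named above can be sketched as follows, using only the standard library. The function names are hypothetical, and the Mahalanobis version assumes the caller supplies the inverse covariance matrix `inv_cov` of the data as nested lists; estimating that matrix is outside the scope of this sketch.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mahalanobis(a, b, inv_cov):
    """Mahalanobis distance, given the inverse covariance matrix inv_cov."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    # Quadratic form d^T * inv_cov * d
    q = sum(d[i] * inv_cov[i][j] * d[j] for i in range(n) for j in range(n))
    return math.sqrt(q)

a, b = (3.0, 4.0), (0.0, 0.0)
identity = [[1.0, 0.0], [0.0, 1.0]]
# Independent features with variances 9 and 16 (standard deviations 3 and 4).
scaled = [[1.0 / 9.0, 0.0], [0.0, 1.0 / 16.0]]

d_euclid = euclidean(a, b)
d_same = mahalanobis(a, b, identity)
d_scaled = mahalanobis(a, b, scaled)
```

With the identity matrix the two measures coincide. With the scaled matrix each coordinate is measured in units of its standard deviation, so features with very different ranges contribute comparably to the neighborhood test.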

The method disclosed according to the embodiments of the present invention may also be implemented as a computer program executed by a CPU, and the computer program may be stored in a computer-readable storage medium. When executed by the CPU, the computer program performs the functions defined in the methods disclosed in the embodiments of the present invention. The above method steps and system units may also be implemented with a controller and a computer-readable storage medium storing a computer program that causes the controller to implement the above steps or unit functions.

Specific embodiments of the present invention are further described below with reference to specific examples.

The DBSCAN algorithm is a density-based clustering algorithm that is widely used in anomaly detection. Its model is a set of labeled clusters of data objects, and the algorithm proceeds as follows:

1. input a training data set D, a neighborhood radius parameter epsilon and a neighborhood density threshold MinPts;

2. mark all points as unvisited;

3. randomly select an unvisited point p and mark it as visited;

4. calculate the distance between point p and the other points (the distance may be the Euclidean distance, the Mahalanobis distance, etc.); if the neighborhood of point p contains at least MinPts objects and point p does not belong to any set, create a new set and add point p to it; otherwise, if point p does not belong to any set, mark point p as noise;

5. mark all objects in the neighborhood of point p as visited and add them to the set to which point p belongs;

6. repeat steps 4 and 5 with each object in the neighborhood of point p taken as point p, until no unvisited points remain in the neighborhoods of the visited points;

7. repeat steps 3, 4 and 5 until all data has been traversed.
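The steps above can be sketched as a compact DBSCAN routine. This is a minimal sketch under stated assumptions, not a reference implementation: points are visited in index order instead of randomly so that the result is reproducible, the Euclidean distance is used, a point counts as a member of its own neighborhood, and only points with at least `min_pts` neighbors expand a cluster further (the usual DBSCAN convention). The names `dbscan`, `NOISE`, `eps` and `min_pts` are chosen for the example.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts):
    """Return one cluster label per point (NOISE for outliers)."""
    labels = [None] * len(points)            # None means unvisited (step 2)

    def neighbors(i):
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):             # steps 3 and 7, in fixed order
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)                   # step 4
        if len(nbrs) < min_pts:
            labels[i] = NOISE
            continue
        cluster += 1                          # new cluster seeded by point i
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:                          # steps 5 and 6
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster           # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            more = neighbors(j)
            if len(more) >= min_pts:          # only dense points keep expanding
                queue.extend(more)
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)

# The clustering model: clustered points are normal, noise points are abnormal.
normal = [p for p, l in zip(points, labels) if l != NOISE]
abnormal = [p for p, l in zip(points, labels) if l == NOISE]
```

Every call to `neighbors` scans the whole data set, so training costs on the order of the square of the number of points. That is the cost an offline retrain pays again on each update.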

The DBSCAN algorithm involves many iterations and a large amount of computation. If offline updating is used, a large amount of data must be recalculated every time, so the computational cost is high. The invention provides a method that updates the model online by following the training process, which saves a great deal of computation, updates the model in real time, and improves the accuracy of the model when detecting new data. The specific flow is shown in FIG. 2.

First, a portion of data is collected in real time as the training set of the initialization model, and the model is initialized by training on it. The model is then used to judge whether each subsequent piece of industrial data is an abnormal value. If so, the data enters the abnormal value processing flow. If not, the neighborhood range of the data is searched for abnormal values, and any abnormal values found are marked as normal values and assigned to a normal value cluster. The reset values are then used to iteratively update other abnormal values until all of them have been traversed, so that some abnormal values are corrected to normal values and the accuracy of the model is improved in real time.

In addition, update iterations of the model can also be implemented by detecting online and updating the model offline. For example, the model is deployed on the platform for online anomaly detection; when the amount of detected data reaches a certain quantity, an offline update of the model is triggered, and the updated model is then deployed to the platform again (the offline update may be triggered according to different strategies).
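A count-based trigger of the kind mentioned in this paragraph can be sketched as follows. The class name `OfflineUpdateTrigger` and the threshold value are hypothetical, and the retraining and redeployment themselves are left to the caller; other trigger strategies would replace the length check.

```python
class OfflineUpdateTrigger:
    """Buffers detected data and signals when an offline update is due."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.buffer = []

    def observe(self, point):
        """Record one detected datum.

        Returns the buffered batch once the threshold is reached, so the
        caller can retrain and redeploy; returns None otherwise.
        """
        self.buffer.append(point)
        if len(self.buffer) >= self.threshold:
            batch, self.buffer = self.buffer, []
            return batch
        return None

trigger = OfflineUpdateTrigger(threshold=3)
results = [trigger.observe(i) for i in range(7)]
```

Here a batch is handed back after every third datum, and the model in use stays unchanged between batches. That gap is what the per-datum online update avoids.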

As can be seen from the above embodiments, the method for updating an abnormal data detection model based on a data stream provided by the embodiments of the present invention adopts the technical solution of: training an initialization model of a clustering algorithm on an initial data set to obtain a clustering model comprising a normal data set and an abnormal data set; continuously receiving new data from the data stream and, for each new data in turn, determining whether the new data is normal data or abnormal data based on the clustering model; in response to the new data being normal data, determining whether the clustering model has abnormal data within the neighborhood range of the new data; in response to such abnormal data existing, resetting each abnormal data as normal data, and iteratively repeating the previous step with the reset data as the new data until no abnormal data remains within the neighborhood range; and moving all normal data obtained by resetting abnormal data from the abnormal data set into the normal data set, and adding the new data to the normal data set. This technical solution can reduce computational cost, increase processing speed, improve accuracy, and is easy to implement.

It should be noted that the steps in the foregoing embodiments of the method for updating an abnormal data detection model based on a data stream may be interchanged, replaced, added and removed. Methods obtained through such reasonable permutations and combinations therefore also fall within the protection scope of the present invention, and the protection scope of the present invention should not be limited to the embodiments.

In view of the above-described object, a second aspect of the embodiments of the present invention proposes an embodiment of an abnormal data detection model update apparatus capable of reducing the calculation cost, improving the processing speed, improving the accuracy, and being easy to implement. The abnormal data detection model updating device based on the data flow comprises:

a processor; and

As can be seen from the above embodiments, the device for updating an abnormal data detection model based on a data stream provided by the embodiments of the present invention adopts the technical solution of: training an initialization model of a clustering algorithm on an initial data set to obtain a clustering model comprising a normal data set and an abnormal data set; continuously receiving new data from the data stream and, for each new data in turn, determining whether the new data is normal data or abnormal data based on the clustering model; in response to the new data being normal data, determining whether the clustering model has abnormal data within the neighborhood range of the new data; in response to such abnormal data existing, resetting each abnormal data as normal data, and iteratively repeating the previous step with the reset data as the new data until no abnormal data remains within the neighborhood range; and moving all normal data obtained by resetting abnormal data from the abnormal data set into the normal data set, and adding the new data to the normal data set. This technical solution can reduce computational cost, increase processing speed, improve accuracy, and is easy to implement.

It should be noted that the foregoing embodiment of the device for updating an abnormal data detection model based on a data stream uses the embodiment of the corresponding method to describe the working process of each module in detail, and those skilled in the art can readily conceive of applying these modules to other embodiments of the method. Of course, since the steps in the method embodiments may be interchanged, replaced, added and removed, devices obtained through such reasonable permutations and combinations also fall within the protection scope of the invention, and the protection scope of the invention should not be limited to the embodiments.

The foregoing is an exemplary embodiment of the present disclosure, but it should be noted that various changes and modifications could be made herein without departing from the scope of the disclosure as defined by the appended claims. The functions, steps and/or actions of the method claims in accordance with the disclosed embodiments described herein need not be performed in any particular order. Furthermore, although elements of the disclosed embodiments may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated.

It should be understood that, as used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that "and/or" as used herein is meant to include any and all possible combinations of one or more of the associated listed items. The serial numbers of the foregoing embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.

It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.

Those of ordinary skill in the art will appreciate that: the above discussion of any embodiment is merely exemplary and is not intended to imply that the scope of the disclosure of embodiments of the invention, including the claims, is limited to such examples; combinations of features of the above embodiments or in different embodiments are also possible within the idea of an embodiment of the invention, and there are many other variations of the different aspects of the embodiments of the invention as described above, which are not provided in detail for the sake of brevity. Therefore, any omission, modification, equivalent replacement, improvement, etc. of the embodiments should be included in the protection scope of the embodiments of the present invention.

Claims

1. A method for updating an abnormal data detection model based on a data stream, comprising the steps of:

continuously receiving new data from a data stream and processing each of said new data and in response to an update of said normal data set and said abnormal data set being caused when a previous new data is processed, processing a subsequent new data using said cluster model with updated said normal data set and said abnormal data set, wherein said step of processing each of said new data comprises the steps of, in order:

determining that the new data is normal data or abnormal data based on the clustering model;

determining whether abnormal data exists in a neighborhood range of the new data by the clustering model in response to the new data being normal data;

resetting each abnormal data to be normal data in response to the existence of the abnormal data, and further iteratively and repeatedly executing the last step by taking the reset data as the new data until no abnormal data exist in the neighborhood range;

performing an abnormal data process on the new data using the clustering model in response to the new data being abnormal data, and adding the new data to the abnormal data set;

and moving all the normal data obtained from the abnormal data reset from the abnormal data set into the normal data set, and adding the new data into the normal data set.

2. The method of claim 1, wherein training an initialization model of a clustering algorithm based on an initial dataset comprises:

randomly selecting data marked as not accessed from the initial dataset is continued to sequentially perform the following steps until the initial dataset does not include data marked as not accessed:

modifying the mark of the selected data from unvisited to accessed;

determining data located within a neighborhood radius of the selected data based on a distance of the selected data from other data and the neighborhood radius;

modifying the marks of the data in the neighborhood radius of the selected data to be accessed to be included in the new cluster, and repeatedly executing the steps by taking the data in the neighborhood radius of the selected data as the selected data respectively.

3. The method of claim 2, wherein the clustering algorithm is a DBSCAN algorithm; the distance from the selected data to other data is Euclidean distance or Mahalanobis distance;

4. An abnormal data detection model updating device based on a data stream, comprising:

a processor; and

5. The apparatus of claim 4, wherein training an initialization model of a clustering algorithm based on an initial dataset comprises:

modifying the mark of the selected data from unvisited to accessed;

6. The apparatus of claim 5, wherein the clustering algorithm is a DBSCAN algorithm; the distance from the selected data to other data is Euclidean distance or Mahalanobis distance;

the steps further include: the selected data is marked as noise in response to the number of data within the neighborhood radius of the selected data being less than the neighborhood density and at least one data belonging to any existing cluster.