CN111814908A - Abnormal data detection model updating method and device based on data flow - Google Patents


Info

Publication number
CN111814908A
Authority
CN
China
Prior art keywords
data
new
abnormal
normal
neighborhood
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010751183.6A
Other languages
Chinese (zh)
Other versions
CN111814908B (en
Inventor
王腾江
陈兆瑞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur General Software Co Ltd
Original Assignee
Inspur General Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur General Software Co Ltd
Priority to CN202010751183.6A
Publication of CN111814908A
Application granted
Publication of CN111814908B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method and a device for updating an abnormal data detection model based on a data stream. The method comprises the following steps: training an initialization model of a clustering algorithm based on an initial data set to obtain a clustering model comprising a normal data set and an abnormal data set; continuously receiving new data from the data stream and, for each new data in sequence, determining based on the clustering model whether the new data is normal data or abnormal data; in response to the new data being normal data, determining whether the clustering model contains abnormal data within a neighborhood range of the new data; in response to such abnormal data existing, resetting each such abnormal data to normal data, then taking each reset data as the new data and iteratively repeating the previous step until no abnormal data exists within the neighborhood range; moving all normal data obtained by resetting abnormal data from the abnormal data set into the normal data set; and adding the new data to the normal data set. The invention can reduce computation cost, increase processing speed, improve accuracy, and is easy to implement.

Description

Abnormal data detection model updating method and device based on data flow
Technical Field
The present invention relates to the field of abnormal data detection, and more particularly, to a method and an apparatus for updating an abnormal data detection model based on data flow.
Background
At present, on most industrial big data analysis platforms, anomaly detection uses a model trained offline for online detection. Either there is no step for updating the model automatically, or a separate mechanism is built to trigger offline retraining, update the model, and redeploy it. For example, Spark, a batch data computing tool widely used by many platforms, integrates the machine learning tool Spark ML, which helps a platform store and load anomaly detection models and implement a flow of offline training and online detection. This approach has several drawbacks: model updates and iterations are slow, and the error rate is high when an old model is used to detect new data; each deployment is cumbersome and error-prone; and because every iteration requires retraining, the computation cost is too high, so the approach is mostly used on batch computing platforms and is unsuitable for real-time stream computing. More and more platforms therefore want an algorithm that updates the model online in real time, so that the model updates itself, its accuracy improves in real time, and platform stability improves because a single successful deployment suffices. The FTRL (Follow-the-Regularized-Leader) algorithm is favored for this purpose: for each training sample it computes the predicted class and the loss, uses the loss produced by that sample to compute the gradient, and back-propagates once to update the model.
For example, when the Alink algorithm tool is integrated into the Flink streaming data processing tool and the model is updated by online learning, real-time performance is good, and the FTRL algorithm works well for models based on logistic regression that are guided by back-propagation. However, the algorithm is difficult to understand, it requires a model that supports back-propagation, it is not suitable for unsupervised algorithms, and introducing norms as supplementary terms of the loss function is complex and increases the computation cost.
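For illustration only, the per-sample update that the background attributes to FTRL can be sketched in its commonly used FTRL-Proximal form for logistic regression. The class name `FTRLProximal`, the hyperparameters `alpha`, `beta`, `l1`, `l2`, and their default values are assumptions made for this sketch; they are not taken from this document or from Alink.

```python
import math


class FTRLProximal:
    """Per-coordinate FTRL-Proximal for logistic regression (illustrative sketch)."""

    def __init__(self, dim, alpha=0.1, beta=1.0, l1=0.0, l2=0.0):
        # alpha, beta: learning-rate parameters; l1, l2: norm penalties (assumed values)
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = [0.0] * dim  # accumulated adjusted gradients
        self.n = [0.0] * dim  # accumulated squared gradients

    def _weight(self, i):
        # Closed-form weight; the L1 penalty forces small coordinates to exactly 0.
        z = self.z[i]
        if abs(z) <= self.l1:
            return 0.0
        sign = 1.0 if z > 0 else -1.0
        denom = (self.beta + math.sqrt(self.n[i])) / self.alpha + self.l2
        return -(z - sign * self.l1) / denom

    def _probability(self, weights, x):
        s = sum(w * xi for w, xi in zip(weights, x))
        s = max(-35.0, min(35.0, s))  # clamp to keep exp() well-behaved
        return 1.0 / (1.0 + math.exp(-s))

    def predict(self, x):
        weights = [self._weight(i) for i in range(len(x))]
        return self._probability(weights, x)

    def update(self, x, y):
        """Use one sample (x, y in {0, 1}) to compute the loss gradient and update once."""
        weights = [self._weight(i) for i in range(len(x))]
        p = self._probability(weights, x)
        for i, xi in enumerate(x):
            g = (p - y) * xi  # gradient of the log loss for coordinate i
            sigma = (math.sqrt(self.n[i] + g * g) - math.sqrt(self.n[i])) / self.alpha
            self.z[i] += g - sigma * weights[i]
            self.n[i] += g * g
        return p


model = FTRLProximal(dim=2)
for _ in range(50):
    model.update([1.0, 1.0], 1)   # first feature acts as a bias term
    model.update([1.0, -1.0], 0)
```

The update needs a label `y` and a differentiable loss for every sample, which is why this style of online learning does not carry over directly to an unsupervised clustering model.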
No effective solution is currently available for the prior-art problems of high computation cost, low speed, low accuracy, and implementation difficulty that arise when an unsupervised algorithm processes streaming data.
Disclosure of Invention
In view of this, an object of the embodiments of the present invention is to provide a method and an apparatus for updating an abnormal data detection model based on a data stream, which can reduce computation cost, increase processing speed, increase accuracy, and are easy to implement.
In view of the foregoing, a first aspect of the embodiments of the present invention provides a method for updating an abnormal data detection model based on data flow, including the following steps:
training an initialization model of a clustering algorithm based on an initial data set to obtain a clustering model comprising a normal data set and an abnormal data set;
continuously receiving new data from the data stream, processing the new data, and sequentially executing the following steps:
determining the new data to be normal data or abnormal data based on the clustering model;
determining whether abnormal data exists in the clustering model in the neighborhood range of the new data or not in response to the new data being normal data;
in response to abnormal data existing in the neighborhood range, resetting each such abnormal data to normal data, then taking each reset data as the new data and iteratively repeating the previous step until no abnormal data exists in the neighborhood range;
moving all normal data obtained by resetting abnormal data from the abnormal data set into the normal data set, and adding the new data to the normal data set.
In some embodiments, the method further comprises: in response to the new data being abnormal data, performing abnormal data processing on the new data using the clustering model, and adding the new data to the abnormal data set.
In some embodiments, the new data are processed one at a time: after the previous new data has been processed and the normal data set and abnormal data set have been updated as a result, the next new data is processed using the clustering model with the updated normal data set and abnormal data set.
In some embodiments, training the initialization model of the clustering algorithm based on the initial data set comprises:
determining a neighborhood radius and a neighborhood density based on the initial data set, and marking all data in the initial data set as unvisited;
randomly selecting data marked as unvisited from the initial data set and performing the following steps in sequence, until the initial data set contains no data marked as unvisited:
changing the mark of the selected data from unvisited to visited;
determining the data located within the neighborhood radius of the selected data, based on the distances from the selected data to the other data and on the neighborhood radius;
in response to the number of data within the neighborhood radius of the selected data being greater than or equal to the neighborhood density and not belonging to any existing cluster, creating a new cluster and incorporating the selected data into the new cluster;
changing the mark of the data within the neighborhood radius of the selected data to visited, incorporating those data into the new cluster, and repeating the above steps with each of those data in turn taken as the selected data.
In some embodiments, the clustering algorithm is a DBSCAN algorithm; the distance from the selected data to other data is Euclidean distance or Mahalanobis distance;
the method further comprises the following steps: the selected data is labeled as noise in response to the number of data within the neighborhood radius of the selected data being less than the neighborhood density and there being at least one data belonging to any existing cluster.
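The two distance options named above, and the neighborhood query built on them, can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names and the choice to pass the inverse covariance matrix `inv_cov` explicitly are assumptions made here.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def euclidean(a: Vector, b: Vector) -> float:
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mahalanobis(a: Vector, b: Vector, inv_cov: Matrix) -> float:
    """sqrt(d^T * inv_cov * d), where d = a - b.

    inv_cov is the inverse of the data's covariance matrix and is assumed
    to have been computed beforehand (for example from the initial data set).
    """
    d = [x - y for x, y in zip(a, b)]
    size = len(d)
    q = sum(d[i] * inv_cov[i][j] * d[j] for i in range(size) for j in range(size))
    return math.sqrt(q)


def neighbors(points: Sequence[Vector], index: int, radius: float,
              dist: Callable[[Vector, Vector], float] = euclidean) -> List[int]:
    """Indices of all other points lying within `radius` of points[index]."""
    center = points[index]
    return [j for j, p in enumerate(points)
            if j != index and dist(center, p) <= radius]
```

With an identity matrix as `inv_cov`, the Mahalanobis distance reduces to the Euclidean distance; a non-identity matrix rescales each direction by the spread of the data in that direction.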
A second aspect of the embodiments of the present invention provides an abnormal data detection model updating apparatus based on a data flow, including:
a processor; and
a memory storing program code executable by the processor, the program code when executed performing the steps of:
training an initialization model of a clustering algorithm based on an initial data set to obtain a clustering model comprising a normal data set and an abnormal data set;
continuously receiving new data from the data stream, processing the new data, and sequentially executing the following steps:
determining the new data to be normal data or abnormal data based on the clustering model;
determining whether abnormal data exists in the clustering model in the neighborhood range of the new data or not in response to the new data being normal data;
in response to abnormal data existing in the neighborhood range, resetting each such abnormal data to normal data, then taking each reset data as the new data and iteratively repeating the previous step until no abnormal data exists in the neighborhood range;
moving all normal data obtained by resetting abnormal data from the abnormal data set into the normal data set, and adding the new data to the normal data set.
In some embodiments, the steps further comprise: in response to the new data being abnormal data, performing abnormal data processing on the new data using the clustering model, and adding the new data to the abnormal data set.
In some embodiments, the new data are processed one at a time: after the previous new data has been processed and the normal data set and abnormal data set have been updated as a result, the next new data is processed using the clustering model with the updated normal data set and abnormal data set.
In some embodiments, training the initialization model of the clustering algorithm based on the initial data set comprises:
determining a neighborhood radius and a neighborhood density based on the initial data set, and marking all data in the initial data set as unvisited;
randomly selecting data marked as unvisited from the initial data set and performing the following steps in sequence, until the initial data set contains no data marked as unvisited:
changing the mark of the selected data from unvisited to visited;
determining the data located within the neighborhood radius of the selected data, based on the distances from the selected data to the other data and on the neighborhood radius;
in response to the number of data within the neighborhood radius of the selected data being greater than or equal to the neighborhood density and not belonging to any existing cluster, creating a new cluster and incorporating the selected data into the new cluster;
changing the mark of the data within the neighborhood radius of the selected data to visited, incorporating those data into the new cluster, and repeating the above steps with each of those data in turn taken as the selected data.
In some embodiments, the clustering algorithm is a DBSCAN algorithm; the distance from the selected data to other data is Euclidean distance or Mahalanobis distance;
the method also comprises the following steps: the selected data is labeled as noise in response to the number of data within the neighborhood radius of the selected data being less than the neighborhood density and there being at least one data belonging to any existing cluster.
The invention has the following beneficial technical effects. In the method and device for updating an abnormal data detection model based on a data stream provided by the embodiments of the present invention, an initialization model of a clustering algorithm is trained on an initial data set to obtain a clustering model comprising a normal data set and an abnormal data set. New data is then continuously received from the data stream, and each new data is processed in turn: the clustering model determines whether the new data is normal data or abnormal data; if it is normal data, it is determined whether the clustering model contains abnormal data within the neighborhood range of the new data; if so, each such abnormal data is reset to normal data, and each reset data is in turn taken as the new data and the previous step is repeated until no abnormal data exists within the neighborhood range; finally, all normal data obtained by resetting abnormal data is moved from the abnormal data set into the normal data set, and the new data is added to the normal data set. This technical solution can reduce computation cost, increase processing speed, improve accuracy, and is easy to implement.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a schematic flow chart of an abnormal data detection model updating method based on data flow according to the present invention;
fig. 2 is a detailed flowchart of an abnormal data detection model updating method based on data flow according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the following embodiments of the present invention are described in further detail with reference to the accompanying drawings.
It should be noted that all uses of "first" and "second" in the embodiments of the present invention are intended to distinguish two entities that share the same name but are not the same entity, or two parameters that are not the same. "First" and "second" are used merely for convenience of description and should not be construed as limiting the embodiments of the present invention; this is not explained again in the following embodiments.
In view of the above, a first aspect of the embodiments of the present invention provides an embodiment of an abnormal data detection model updating method that can reduce the computation cost, increase the processing speed, increase the accuracy, and is easy to implement. Fig. 1 is a schematic flow chart of an abnormal data detection model updating method based on data flow according to the present invention.
The abnormal data detection model updating method based on data flow, as shown in fig. 1, includes the following steps:
step S101: training an initialization model of a clustering algorithm based on an initial data set to obtain a clustering model comprising a normal data set and an abnormal data set;
step S103: continuously receiving new data from the data stream, processing the new data, and sequentially executing the following steps:
step S1031: determining the new data to be normal data or abnormal data based on the clustering model;
step S1033: determining whether abnormal data exists in the clustering model in the neighborhood range of the new data or not in response to the new data being normal data;
step S1035: in response to abnormal data existing in the neighborhood range, resetting each such abnormal data to normal data, then taking each reset data as the new data and iteratively repeating the previous step until no abnormal data exists in the neighborhood range;
step S1037: moving all normal data obtained by resetting abnormal data from the abnormal data set into the normal data set, and adding the new data to the normal data set.
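Steps S1031 to S1037 can be sketched as a small in-memory model. This is a hedged illustration under stated assumptions, not the patented implementation: the class name `StreamingModel`, the use of Euclidean distance, and in particular the rule in `is_normal` (a point counts as normal when at least `min_pts` normal points lie within `eps` of it) are choices made for this sketch, since the text does not fix how the clustering model classifies a new point.

```python
import math
from collections import deque


def within(points, center, eps):
    """All points of `points` whose Euclidean distance to `center` is at most eps."""
    return [p for p in points if math.dist(p, center) <= eps]


class StreamingModel:
    def __init__(self, normal, abnormal, eps, min_pts):
        self.normal = set(normal)      # normal data set (points are tuples)
        self.abnormal = set(abnormal)  # abnormal data set
        self.eps = eps                 # neighborhood radius
        self.min_pts = min_pts         # neighborhood density

    def is_normal(self, point):
        # S1031. ASSUMPTION: normal means "enough normal neighbors within eps".
        return len(within(self.normal, point, self.eps)) >= self.min_pts

    def update(self, point):
        """Process one new point; return True if it was judged normal."""
        if not self.is_normal(point):
            self.abnormal.add(point)   # abnormal branch: keep it in the abnormal set
            return False
        reset = set()                  # abnormal points re-labelled as normal
        queue = deque([point])
        while queue:                   # S1033/S1035: iterate over reset points
            center = queue.popleft()
            for q in within(self.abnormal - reset, center, self.eps):
                reset.add(q)
                queue.append(q)        # each reset point is treated as new data
        self.abnormal -= reset         # S1037: move reset points between sets
        self.normal |= reset
        self.normal.add(point)
        return True


model = StreamingModel(
    normal={(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)},
    abnormal={(2.0, 1.0), (3.0, 1.0), (9.0, 9.0)},
    eps=1.5,
    min_pts=3,
)
first = model.update((1.0, 1.0))     # normal; pulls in (2, 1) and then (3, 1)
second = model.update((20.0, 20.0))  # abnormal; stored in the abnormal set
```

Only the points reachable through the chain of neighborhoods are touched, so one update costs far less than retraining the clustering model on all data.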
Unsupervised learning algorithms are commonly used for anomaly detection in industrial data. An unsupervised learning algorithm needs to learn from batch data, so most platforms at the present stage update the anomaly detection model offline: the model is trained offline and deployed online for prediction, and whenever the model needs to be updated (by a manual or an automatic strategy), it must be retrained on the collected data and then deployed again. With this strategy the old model cannot be updated in real time and cannot adapt to new data in time, which harms accuracy, and retraining the model on a large amount of data is computationally expensive. On a streaming data platform, data arrives in real time, which provides the data needed to update a model online, and real-time online updating algorithms are also in use. For example, the FTRL algorithm used by some platforms adapts well to algorithms based on logistic regression, but it is difficult to understand, it requires a model that supports back-propagation, and it is not suitable for unsupervised algorithms. Therefore, drawing on the idea of the FTRL algorithm and on an unsupervised algorithm, the invention provides a corresponding online model updating algorithm for streaming data on an industrial big data analysis platform, which can improve the accuracy of the model in real time and features low computation cost, high speed, and ease of use.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program to instruct relevant hardware to perform the processes, and the processes can be stored in a computer readable storage medium, and when executed, the processes can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a Random Access Memory (RAM), or the like. Embodiments of the computer program may achieve the same or similar effects as any of the preceding method embodiments to which it corresponds.
In some embodiments, the method further comprises: in response to the new data being abnormal data, performing abnormal data processing on the new data using the clustering model, and adding the new data to the abnormal data set.
In some embodiments, the new data are processed one at a time: after the previous new data has been processed and the normal data set and abnormal data set have been updated as a result, the next new data is processed using the clustering model with the updated normal data set and abnormal data set.
In some embodiments, training the initialization model of the clustering algorithm based on the initial data set comprises:
determining a neighborhood radius and a neighborhood density based on the initial data set, and marking all data in the initial data set as unvisited;
randomly selecting data marked as unvisited from the initial data set and performing the following steps in sequence, until the initial data set contains no data marked as unvisited:
changing the mark of the selected data from unvisited to visited;
determining the data located within the neighborhood radius of the selected data, based on the distances from the selected data to the other data and on the neighborhood radius;
in response to the number of data within the neighborhood radius of the selected data being greater than or equal to the neighborhood density and not belonging to any existing cluster, creating a new cluster and incorporating the selected data into the new cluster;
changing the mark of the data within the neighborhood radius of the selected data to visited, incorporating those data into the new cluster, and repeating the above steps with each of those data in turn taken as the selected data.
In some embodiments, the clustering algorithm is a DBSCAN algorithm; the distance from the selected data to other data is Euclidean distance or Mahalanobis distance; the method further comprises the following steps: the selected data is labeled as noise in response to the number of data within the neighborhood radius of the selected data being less than the neighborhood density and there being at least one data belonging to any existing cluster.
The method disclosed according to an embodiment of the present invention may also be implemented as a computer program executed by a CPU, which may be stored in a computer-readable storage medium. The computer program, when executed by the CPU, performs the above-described functions defined in the method disclosed in the embodiments of the present invention. The above-described method steps and system elements may also be implemented using a controller and a computer-readable storage medium for storing a computer program for causing the controller to implement the functions of the above-described steps or elements.
The following further illustrates embodiments of the invention in terms of specific examples.
The DBSCAN algorithm is a density-based clustering algorithm that is widely used in anomaly detection; it groups data objects into labeled clusters. The algorithm steps are as follows:
1. input a training data set D, a neighborhood radius parameter, and a neighborhood density value MinPts;
2. mark all points as unvisited;
3. randomly select an unvisited point p and mark it as visited;
4. calculate the distance from point p to the other points (the distance may be the Euclidean distance, the Mahalanobis distance, or the like); if the neighborhood of point p contains at least MinPts objects and they do not belong to any set, create a new set and add point p to it; otherwise, mark point p as noise;
5. mark all objects in the neighborhood of point p as visited, and add them to the set to which point p belongs;
6. repeat steps 4 and 5 with each object in the neighborhood of point p taken as point p, until no point in the neighborhood is still marked as unvisited;
7. repeat steps 3, 4 and 5 until all data has been traversed.
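The training steps above can be sketched in a few lines. Note the hedges: this follows the widely used textbook form of DBSCAN rather than the exact wording of the steps (points are visited in index order instead of randomly, and a point first marked as noise may later join a cluster as a border point), and the names `dbscan`, `region_query`, and `NOISE` are chosen for this illustration.

```python
import math
from collections import deque

NOISE = -1


def region_query(points, i, eps):
    """Indices of the points within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)         # None means "unvisited"
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue                      # already visited
        seeds = region_query(points, i, eps)
        if len(seeds) < min_pts:
            labels[i] = NOISE             # too sparse: mark as noise
            continue
        cluster_id += 1                   # dense enough: start a new cluster
        labels[i] = cluster_id
        queue = deque(seeds)
        while queue:                      # grow the cluster through neighborhoods
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster_id    # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            nbrs = region_query(points, j, eps)
            if len(nbrs) >= min_pts:
                queue.extend(nbrs)        # j is dense too: expand from it
    return labels


data = [(0, 0), (0, 1), (1, 0), (1, 1),
        (10, 10), (10, 11), (11, 10), (11, 11),
        (50, 50)]
labels = dbscan(data, eps=1.5, min_pts=3)
normal_set = [p for p, lab in zip(data, labels) if lab != NOISE]
abnormal_set = [p for p, lab in zip(data, labels) if lab == NOISE]
```

Splitting the labels into `normal_set` and `abnormal_set` yields the two data sets that the clustering model is described as containing after initialization.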
The DBSCAN algorithm requires many iterations and a large amount of computation; if offline updating is used, a large amount of data must be recomputed each time, so the computation cost is high. Based on the training process above, the invention provides a method for updating the model online, which saves a large amount of computation cost, updates the model in real time, and improves the accuracy of the model when detecting new data. The specific flow is shown in Fig. 2.
First, a portion of data is collected in real time as the training set of the initialization model, and the model is initially trained. Each subsequent item of industrial data is then judged by the model to be an abnormal value or not. If it is an abnormal value, the abnormal value processing flow is entered. If it is a normal value, its neighborhood range is searched for abnormal values, and any abnormal values found there are marked as normal values and assigned to the normal value cluster. These re-marked values are then used in the same way to iteratively update other abnormal values until the other abnormal values have been traversed. In this way some abnormal values are corrected to normal values, and the accuracy of the model improves in real time.
In addition, model update iterations can also be implemented by combining online detection with offline updating. For example, the model is deployed on a platform for online anomaly detection; when the amount of detected data reaches a certain level, an offline update of the model is triggered, and the updated model is deployed on the platform again (the strategies for triggering the offline update may differ).
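One possible triggering strategy, a simple count threshold, can be sketched as follows. The class name `OfflineUpdateTrigger`, the `threshold` parameter, and the `retrain` callback are hypothetical names introduced for this illustration; the text leaves the actual strategy open.

```python
class OfflineUpdateTrigger:
    """Collect detected data and fire an offline retraining once enough has accumulated."""

    def __init__(self, threshold, retrain):
        self.threshold = threshold  # amount of detected data that triggers an update
        self.retrain = retrain      # hypothetical callback: batch of data -> new model
        self.buffer = []

    def observe(self, item):
        """Record one detected item; return a new model when retraining fires, else None."""
        self.buffer.append(item)
        if len(self.buffer) < self.threshold:
            return None
        batch, self.buffer = self.buffer, []  # hand over the batch and start afresh
        return self.retrain(batch)


# Stand-in "retraining" that just summarizes the batch, to show the control flow.
trigger = OfflineUpdateTrigger(threshold=3, retrain=lambda batch: sum(batch))
results = [trigger.observe(value) for value in (1, 2, 3, 4)]
```

In a real deployment the callback would retrain the clustering model on the collected batch and the caller would redeploy the returned model.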
It can be seen from the foregoing embodiments that, in the method for updating an abnormal data detection model based on a data stream provided by the embodiments of the present invention, an initialization model of a clustering algorithm is trained on an initial data set to obtain a clustering model comprising a normal data set and an abnormal data set. New data is then continuously received from the data stream, and each new data is processed in turn: the clustering model determines whether the new data is normal data or abnormal data; if it is normal data, it is determined whether the clustering model contains abnormal data within the neighborhood range of the new data; if so, each such abnormal data is reset to normal data, and each reset data is in turn taken as the new data and the previous step is repeated until no abnormal data exists within the neighborhood range; finally, all normal data obtained by resetting abnormal data is moved from the abnormal data set into the normal data set, and the new data is added to the normal data set. This technical solution can reduce computation cost, increase processing speed, improve accuracy, and is easy to implement.
It should be particularly noted that the steps in the embodiments of the method for updating an abnormal data detection model based on a data stream may be interleaved, replaced, added, or deleted. Such reasonable permutations, combinations, and transformations of the method therefore also fall within the scope of protection of the present invention, and the scope of protection of the present invention should not be limited to the described embodiments.
In view of the above-mentioned objects, a second aspect of the embodiments of the present invention provides an embodiment of an abnormal data detection model update apparatus, which can reduce the calculation cost, increase the processing speed, increase the accuracy, and is easy to implement. The abnormal data detection model updating device based on the data flow comprises:
a processor; and
a memory storing program code executable by the processor, the program code when executed performing the steps of:
training an initialization model of a clustering algorithm based on an initial data set to obtain a clustering model comprising a normal data set and an abnormal data set;
continuously receiving new data from the data stream, processing the new data, and sequentially executing the following steps:
determining the new data to be normal data or abnormal data based on the clustering model;
determining whether abnormal data exists in the clustering model in the neighborhood range of the new data or not in response to the new data being normal data;
in response to abnormal data existing in the neighborhood range, resetting each such abnormal data to normal data, then taking each reset data as the new data and iteratively repeating the previous step until no abnormal data exists in the neighborhood range;
moving all normal data obtained by resetting abnormal data from the abnormal data set into the normal data set, and adding the new data to the normal data set.
In some embodiments, the steps further comprise: in response to the new data being abnormal data, performing abnormal data processing on the new data using the clustering model, and adding the new data to the abnormal data set.
In some embodiments, the new data are processed one at a time: after the previous new data has been processed and the normal data set and abnormal data set have been updated as a result, the next new data is processed using the clustering model with the updated normal data set and abnormal data set.
In some embodiments, training the initialization model of the clustering algorithm based on the initial data set comprises:
determining a neighborhood radius and a neighborhood density based on the initial data set, and marking all data in the initial data set as unvisited;
randomly selecting data marked as unvisited from the initial data set and performing the following steps in sequence, until the initial data set contains no data marked as unvisited:
changing the mark of the selected data from unvisited to visited;
determining the data located within the neighborhood radius of the selected data, based on the distances from the selected data to the other data and on the neighborhood radius;
in response to the number of data within the neighborhood radius of the selected data being greater than or equal to the neighborhood density and not belonging to any existing cluster, creating a new cluster and incorporating the selected data into the new cluster;
changing the mark of the data within the neighborhood radius of the selected data to visited, incorporating those data into the new cluster, and repeating the above steps with each of those data in turn taken as the selected data.
In some embodiments, the clustering algorithm is the DBSCAN algorithm; the distance from the selected data to other data is the Euclidean distance or the Mahalanobis distance; and the method further comprises: marking the selected data as noise in response to the number of data within the neighborhood radius of the selected data being less than the neighborhood density and there being at least one data belonging to any existing cluster.
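By way of non-limiting illustration, the two distance measures named above may be computed as in the following Python sketch. The function names are hypothetical, and the Mahalanobis version assumes that the caller supplies the inverse covariance matrix `inv_cov` of the data as a nested list; how that matrix is estimated is not specified in the text.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def mahalanobis(x, y, inv_cov):
    """Mahalanobis distance sqrt((x - y)^T * inv_cov * (x - y))."""
    diff = [a - b for a, b in zip(x, y)]
    # Matrix-vector product inv_cov * diff, written out without NumPy.
    weighted = [sum(row[k] * diff[k] for k in range(len(diff)))
                for row in inv_cov]
    return math.sqrt(sum(d * w for d, w in zip(diff, weighted)))


identity = [[1.0, 0.0], [0.0, 1.0]]
scaled = [[1.0 / 9.0, 0.0], [0.0, 1.0 / 16.0]]  # variances 9 and 16

d_euclid = euclidean((0.0, 0.0), (3.0, 4.0))
d_same = mahalanobis((0.0, 0.0), (3.0, 4.0), identity)
d_scaled = mahalanobis((0.0, 0.0), (3.0, 4.0), scaled)
```

With the identity matrix the Mahalanobis distance reduces to the Euclidean distance. With the scaled matrix each coordinate is measured in units of its own standard deviation, which may be preferable when the features of the data have very different ranges.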
It can be seen from the foregoing embodiments that the abnormal data detection model updating apparatus based on data stream provided in the embodiments of the present invention trains an initialization model of a clustering algorithm based on an initial data set to obtain a clustering model comprising a normal data set and an abnormal data set; continuously receives new data from the data stream; and, for each new data, sequentially executes the following steps: determining whether the new data is normal data or abnormal data based on the clustering model; in response to the new data being normal data, determining whether the clustering model has abnormal data within a neighborhood range of the new data; in response to abnormal data existing, resetting each abnormal data to normal data, and taking each reset data as the new data to iteratively repeat the previous step until no abnormal data exists in the neighborhood range; and moving all normal data obtained by resetting abnormal data from the abnormal data set into the normal data set, and adding the new data to the normal data set.
It should be particularly noted that the above embodiment of the data flow-based abnormal data detection model updating apparatus uses the embodiment of the data flow-based abnormal data detection model updating method to describe the working process of each module, and those skilled in the art can readily conceive of applying these modules to other embodiments of the method. Of course, since the steps in the method embodiment may be interchanged, replaced, added, or deleted, such reasonable permutations, combinations, and transformations should also fall within the scope of the present invention, and the scope of the present invention should not be limited to the described embodiment.
The foregoing is an exemplary embodiment of the present disclosure, but it should be noted that various changes and modifications could be made herein without departing from the scope of the present disclosure as defined by the appended claims. The functions, steps and/or actions of the method claims in accordance with the disclosed embodiments described herein need not be performed in any particular order. Furthermore, although elements of the disclosed embodiments of the invention may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated.
It should be understood that, as used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that "and/or" as used herein is meant to include any and all possible combinations of one or more of the associated listed items. The serial numbers of the embodiments disclosed in the embodiments of the present invention are for description only and do not indicate the merits of the embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
Those of ordinary skill in the art will understand that the discussion of any embodiment above is exemplary only and is not intended to imply that the scope of the disclosure of the embodiments of the invention, including the claims, is limited to these examples. Within the idea of the embodiments of the invention, technical features in the above embodiment or in different embodiments may also be combined, and there are many other variations of the different aspects of the embodiments of the invention as described above, which are not provided in detail for the sake of brevity. Therefore, any omissions, modifications, substitutions, improvements, and the like that may be made without departing from the spirit and principles of the embodiments of the present invention are intended to be included within the scope of the embodiments of the present invention.

Claims (10)

1. An abnormal data detection model updating method based on data flow is characterized by comprising the following steps:
training an initialization model of a clustering algorithm based on an initial data set to obtain a clustering model comprising a normal data set and an abnormal data set;
continuously receiving new data from a data stream, and, for each new data, sequentially executing the following steps:
determining whether the new data is normal data or abnormal data based on the clustering model;
in response to the new data being normal data, determining whether the clustering model has abnormal data within a neighborhood range of the new data;
in response to abnormal data existing, resetting each abnormal data to normal data, and taking each reset data as the new data to iteratively repeat the previous step until no abnormal data exists in the neighborhood range;
moving all normal data obtained by resetting abnormal data from the abnormal data set into the normal data set, and adding the new data to the normal data set.
2. The method of claim 1, further comprising: in response to the new data being abnormal data, performing abnormal data processing on the new data using the clustering model, and adding the new data to the abnormal data set.
3. The method of claim 1, wherein processing each new data comprises: processing the previous new data, and, after the normal data set and the abnormal data set have been updated as a result of that processing, processing the next new data using the clustering model with the updated normal data set and abnormal data set.
4. The method of claim 1, wherein training an initialization model of a clustering algorithm based on an initial data set comprises:
determining a neighborhood radius and a neighborhood density based on the initial data set, and marking all data in the initial data set as unvisited;
randomly selecting data marked as unvisited from the initial data set and sequentially executing the following steps, until the initial data set no longer includes data marked as unvisited:
changing the mark of the selected data from unvisited to visited;
determining the data located within the neighborhood radius of the selected data based on the distances from the selected data to other data and the neighborhood radius;
in response to the number of data within the neighborhood radius of the selected data being greater than or equal to the neighborhood density and the selected data not belonging to any existing cluster, creating a new cluster and incorporating the selected data into the new cluster;
changing the mark of the data within the neighborhood radius of the selected data to visited and including those data in the new cluster, and taking each of the data within the neighborhood radius of the selected data as the selected data in turn to repeat the above steps.
5. The method of claim 4, wherein the clustering algorithm is a DBSCAN algorithm; the distance from the selected data to other data is Euclidean distance or Mahalanobis distance;
the method further comprises the following steps: in response to the number of data within a neighborhood radius of the selected data being less than the neighborhood density and there being at least one data belonging to any existing cluster, marking the selected data as noise.
6. An abnormal data detection model updating device based on data flow is characterized by comprising:
a processor; and
a memory storing program code executable by the processor, the program code when executed performing the steps of:
training an initialization model of a clustering algorithm based on an initial data set to obtain a clustering model comprising a normal data set and an abnormal data set;
continuously receiving new data from a data stream, and, for each new data, sequentially executing the following steps:
determining whether the new data is normal data or abnormal data based on the clustering model;
in response to the new data being normal data, determining whether the clustering model has abnormal data within a neighborhood range of the new data;
in response to abnormal data existing, resetting each abnormal data to normal data, and taking each reset data as the new data to iteratively repeat the previous step until no abnormal data exists in the neighborhood range;
moving all normal data obtained by resetting abnormal data from the abnormal data set into the normal data set, and adding the new data to the normal data set.
7. The apparatus of claim 6, wherein the steps further comprise: in response to the new data being abnormal data, performing abnormal data processing on the new data using the clustering model, and adding the new data to the abnormal data set.
8. The apparatus of claim 6, wherein processing each new data comprises: processing the previous new data, and, after the normal data set and the abnormal data set have been updated as a result of that processing, processing the next new data using the clustering model with the updated normal data set and abnormal data set.
9. The apparatus of claim 6, wherein training an initialization model of a clustering algorithm based on an initial data set comprises:
determining a neighborhood radius and a neighborhood density based on the initial data set, and marking all data in the initial data set as unvisited;
randomly selecting data marked as unvisited from the initial data set and sequentially executing the following steps, until the initial data set no longer includes data marked as unvisited:
changing the mark of the selected data from unvisited to visited;
determining the data located within the neighborhood radius of the selected data based on the distances from the selected data to other data and the neighborhood radius;
in response to the number of data within the neighborhood radius of the selected data being greater than or equal to the neighborhood density and the selected data not belonging to any existing cluster, creating a new cluster and incorporating the selected data into the new cluster;
changing the mark of the data within the neighborhood radius of the selected data to visited and including those data in the new cluster, and taking each of the data within the neighborhood radius of the selected data as the selected data in turn to repeat the above steps.
10. The apparatus of claim 9, wherein the clustering algorithm is a DBSCAN algorithm; the distance from the selected data to other data is Euclidean distance or Mahalanobis distance;
the steps further include: in response to the number of data within a neighborhood radius of the selected data being less than the neighborhood density and there being at least one data belonging to any existing cluster, marking the selected data as noise.
CN202010751183.6A 2020-07-30 2020-07-30 Abnormal data detection model updating method and device based on data flow Active CN111814908B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010751183.6A CN111814908B (en) 2020-07-30 2020-07-30 Abnormal data detection model updating method and device based on data flow

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010751183.6A CN111814908B (en) 2020-07-30 2020-07-30 Abnormal data detection model updating method and device based on data flow

Publications (2)

Publication Number Publication Date
CN111814908A (en) 2020-10-23
CN111814908B (en) 2023-06-27

Family

ID=72863426

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010751183.6A Active CN111814908B (en) 2020-07-30 2020-07-30 Abnormal data detection model updating method and device based on data flow

Country Status (1)

Country Link
CN (1) CN111814908B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112651617A (en) * 2020-12-21 2021-04-13 福州大学 Wind turbine generator anomaly detection method based on batch flow unified data clustering

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107528823A (en) * 2017-07-03 2017-12-29 中山大学 A kind of network anomaly detection method based on improved K Means clustering algorithms
CN109032829A (en) * 2018-07-23 2018-12-18 腾讯科技(深圳)有限公司 Data exception detection method, device, computer equipment and storage medium
WO2020038353A1 (en) * 2018-08-21 2020-02-27 瀚思安信(北京)软件技术有限公司 Abnormal behavior detection method and system
CN110942099A (en) * 2019-11-29 2020-03-31 华侨大学 Abnormal data identification and detection method of DBSCAN based on core point reservation

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112651617A (en) * 2020-12-21 2021-04-13 福州大学 Wind turbine generator anomaly detection method based on batch flow unified data clustering
CN112651617B (en) * 2020-12-21 2022-07-01 福州大学 Wind turbine generator anomaly detection method based on batch flow unified data clustering

Also Published As

Publication number Publication date
CN111814908B (en) 2023-06-27

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant