CN111814908B - Abnormal data detection model updating method and device based on data flow - Google Patents

Abnormal data detection model updating method and device based on data flow

Info

Publication number
CN111814908B
Authority
CN
China
Prior art keywords
data
abnormal
new
normal
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010751183.6A
Other languages
Chinese (zh)
Other versions
CN111814908A (en)
Inventor
王腾江
陈兆瑞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur General Software Co Ltd
Original Assignee
Inspur General Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur General Software Co Ltd
Priority to CN202010751183.6A
Publication of CN111814908A
Application granted
Publication of CN111814908B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and a device for updating an abnormal data detection model based on a data stream. The method comprises: training an initialization model of a clustering algorithm on an initial data set to obtain a clustering model comprising a normal data set and an abnormal data set; continuously receiving new data from the data stream and, for each new data item in turn, determining from the clustering model whether it is normal data or abnormal data; in response to the new data being normal data, determining whether the clustering model contains abnormal data within the neighborhood range of the new data; in response to such abnormal data existing, resetting each of them as normal data and repeating the previous step iteratively with the reset data taken as the new data, until no abnormal data remain within the neighborhood range; and moving all data reset from abnormal to normal out of the abnormal data set into the normal data set, and adding the new data to the normal data set. The invention can reduce computational cost, increase processing speed, improve accuracy, and is easy to implement.

Description

Abnormal data detection model updating method and device based on data flow
Technical Field
The present invention relates to the field of abnormal data detection, and in particular, to a method and apparatus for updating an abnormal data detection model based on a data stream.
Background
At present, in most industrial big data analysis platforms, the anomaly detection flow trains a model offline and uses it for online detection. Such a flow either lacks any step for automatically updating the model, or relies on a mechanism that triggers offline retraining and then redeploys the model. For example, Spark, a batch data computing tool widely used by many platforms, integrates Spark ML, a machine learning tool that helps a platform store and load anomaly detection models and thereby implements an offline-training, online-detection flow. With this approach, update iterations are slow, and an old model has a high error rate when detecting new data; each deployment is tedious and error-prone; and because the approach is mostly used on batch computing platforms and every iteration requires retraining, its computational cost is too high for real-time stream computing. More and more platforms therefore hope to use algorithms that update the model online in real time, so that the model updates itself, its accuracy improves in real time, a single deployment suffices, and the stability of the platform improves. The FTRL (Follow-the-Regularized-Leader) algorithm is a popular choice. For each training sample, this model update algorithm first predicts the class of the sample and computes the loss, then uses that loss to compute a gradient, and updates the model once by back-propagation.
For example, the Flink streaming data processing tool integrates the Alink algorithm tool, which updates the model by online learning and offers good real-time performance. The FTRL algorithm works well for models that are guided by back-propagation, such as logistic regression. However, the algorithm is difficult to understand, requires a model that supports back-propagation, and is not suitable for unsupervised algorithms; moreover, introducing norms as supplementary terms of the loss function is complex and increases the computational cost.
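The per-sample update described above can be illustrated with a minimal sketch. It follows the commonly published per-coordinate FTRL-Proximal form for logistic regression; it is not taken from this patent, from Alink, or from any specific library. The class name `FTRLProximal` and the hyperparameters `alpha`, `beta`, `l1` and `l2` are assumptions chosen for illustration.

```python
import math


class FTRLProximal:
    """Illustrative per-coordinate FTRL-Proximal for logistic regression."""

    def __init__(self, dim, alpha=0.1, beta=1.0, l1=0.0, l2=0.0):
        # alpha, beta: learning-rate parameters; l1, l2: norm penalties (assumed names)
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = [0.0] * dim  # accumulated, adjusted gradients
        self.n = [0.0] * dim  # accumulated squared gradients

    def weights(self):
        """Derive the current weights from the accumulated state."""
        w = []
        for z_i, n_i in zip(self.z, self.n):
            if abs(z_i) <= self.l1:
                w.append(0.0)  # the L1 penalty keeps this coordinate at zero
            else:
                sign = 1.0 if z_i > 0 else -1.0
                denom = (self.beta + math.sqrt(n_i)) / self.alpha + self.l2
                w.append(-(z_i - sign * self.l1) / denom)
        return w

    def _probability(self, w, x):
        s = sum(w_i * x_i for w_i, x_i in zip(w, x))
        s = max(-35.0, min(35.0, s))  # clamp to keep exp() in range
        return 1.0 / (1.0 + math.exp(-s))

    def predict(self, x):
        return self._probability(self.weights(), x)

    def update(self, x, y):
        """Predict one sample first, then update the state once from its gradient."""
        w = self.weights()
        p = self._probability(w, x)
        for i, x_i in enumerate(x):
            g = (p - y) * x_i  # gradient of the log loss for coordinate i
            sigma = (math.sqrt(self.n[i] + g * g) - math.sqrt(self.n[i])) / self.alpha
            self.z[i] += g - sigma * w[i]
            self.n[i] += g * g
        return p


# Usage: one update per arriving labeled sample (the first feature is a bias term).
model = FTRLProximal(dim=2)
for features, label in [([1.0, 2.0], 1), ([1.0, -2.0], 0)]:
    model.update(features, label)
```

The sketch needs a label and a gradient for every sample, which is the reason given above for why this style of update does not carry over directly to an unsupervised clustering model.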
To date, there is no effective solution to the problems of high computational cost, low speed, low accuracy, and difficulty of implementation that arise in the prior art when an unsupervised algorithm processes stream data.
Disclosure of Invention
In view of the above, an object of the embodiments of the present invention is to provide a method and apparatus for updating an abnormal data detection model based on a data stream, which can reduce the calculation cost, increase the processing speed, increase the accuracy and be easy to implement.
Based on the above object, a first aspect of the embodiments of the present invention provides a method for updating an abnormal data detection model based on a data stream, including performing the following steps:
training an initialization model of a clustering algorithm based on the initial data set to obtain a clustering model comprising a normal data set and an abnormal data set;
continuously receiving new data from the data stream, processing each new data, and sequentially executing the following steps:
determining new data as normal data or abnormal data based on the clustering model;
determining whether the clustering model has abnormal data in a neighborhood range of the new data in response to the new data being the normal data;
in response to the existence of the abnormal data, resetting each abnormal data as normal data, and further iteratively and repeatedly executing the previous step by taking the reset data as new data until no abnormal data exist in the neighborhood range;
all normal data obtained from the resetting of the abnormal data is moved from the abnormal data set into the normal data set, and new data is added to the normal data set.
In some embodiments, the method further comprises: performing abnormal data processing on the new data using a clustering model in response to the new data being abnormal data, and adding the new data to the abnormal data set.
In some embodiments, processing each new data item includes: processing the previous new data item and, after both the normal data set and the abnormal data set have been updated as a result of that processing, processing the next new data item using the clustering model with the updated normal data set and the updated abnormal data set.
In some embodiments, training an initialization model of a clustering algorithm based on an initial data set includes:
determining a neighborhood radius and a neighborhood density based on the initial dataset, and marking all data in the initial dataset as unvisited;
repeatedly selecting, at random, data marked as unvisited from the initial dataset and performing the following steps in sequence, until the initial dataset no longer includes data marked as unvisited:
changing the mark of the selected data from unvisited to visited;
determining data located within a neighborhood radius of the selected data based on the distance of the selected data from other data and the neighborhood radius;
creating a new cluster and incorporating the selected data into the new cluster in response to the number of data within the neighborhood radius of the selected data being greater than or equal to the neighborhood density and none belonging to any existing cluster;
changing the marks of the data within the neighborhood radius of the selected data to visited and including those data in the new cluster, and repeating the above steps with each data item within the neighborhood radius of the selected data taken in turn as the selected data.
In some embodiments, the clustering algorithm is the DBSCAN algorithm; the distance from the selected data to other data is the Euclidean distance or the Mahalanobis distance;
the method further comprises: marking the selected data as noise in response to the number of data within the neighborhood radius of the selected data being less than the neighborhood density and at least one data belonging to any existing cluster.
A second aspect of an embodiment of the present invention provides an abnormal data detection model updating apparatus based on a data stream, including:
a processor; and
a memory storing program code executable by a processor, the program code when executed performing the steps of:
training an initialization model of a clustering algorithm based on the initial data set to obtain a clustering model comprising a normal data set and an abnormal data set;
continuously receiving new data from the data stream, processing each new data, and sequentially executing the following steps:
determining new data as normal data or abnormal data based on the clustering model;
determining whether the clustering model has abnormal data in a neighborhood range of the new data in response to the new data being the normal data;
in response to the existence of the abnormal data, resetting each abnormal data as normal data, and further iteratively and repeatedly executing the previous step by taking the reset data as new data until no abnormal data exist in the neighborhood range;
all normal data obtained from the resetting of the abnormal data is moved from the abnormal data set into the normal data set, and new data is added to the normal data set.
In some embodiments, the steps further comprise: performing abnormal data processing on the new data using a clustering model in response to the new data being abnormal data, and adding the new data to the abnormal data set.
In some embodiments, processing each new data item includes: processing the previous new data item and, after both the normal data set and the abnormal data set have been updated as a result of that processing, processing the next new data item using the clustering model with the updated normal data set and the updated abnormal data set.
In some embodiments, training an initialization model of a clustering algorithm based on an initial data set includes:
determining a neighborhood radius and a neighborhood density based on the initial dataset, and marking all data in the initial dataset as unvisited;
repeatedly selecting, at random, data marked as unvisited from the initial dataset and performing the following steps in sequence, until the initial dataset no longer includes data marked as unvisited:
changing the mark of the selected data from unvisited to visited;
determining data located within a neighborhood radius of the selected data based on the distance of the selected data from other data and the neighborhood radius;
creating a new cluster and incorporating the selected data into the new cluster in response to the number of data within the neighborhood radius of the selected data being greater than or equal to the neighborhood density and none belonging to any existing cluster;
changing the marks of the data within the neighborhood radius of the selected data to visited and including those data in the new cluster, and repeating the above steps with each data item within the neighborhood radius of the selected data taken in turn as the selected data.
In some embodiments, the clustering algorithm is the DBSCAN algorithm; the distance from the selected data to other data is the Euclidean distance or the Mahalanobis distance;
the steps further comprise: marking the selected data as noise in response to the number of data within the neighborhood radius of the selected data being less than the neighborhood density and at least one data belonging to any existing cluster.
The invention has the following beneficial technical effects. In the method and device for updating an abnormal data detection model based on a data stream provided by the embodiments of the invention, an initialization model of a clustering algorithm is trained on an initial data set to obtain a clustering model comprising a normal data set and an abnormal data set; new data are continuously received from the data stream, and each new data item is processed in turn by: determining from the clustering model whether the new data is normal data or abnormal data; in response to the new data being normal data, determining whether the clustering model contains abnormal data within the neighborhood range of the new data; in response to such abnormal data existing, resetting each of them as normal data and repeating the previous step iteratively with the reset data taken as the new data, until no abnormal data remain within the neighborhood range; and moving all data reset from abnormal to normal out of the abnormal data set into the normal data set and adding the new data to the normal data set. This technical scheme can reduce computational cost, increase processing speed, improve accuracy, and is easy to implement.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of the method for updating an abnormal data detection model based on a data stream provided by the invention;
FIG. 2 is a detailed flow chart of the method for updating an abnormal data detection model based on a data stream provided by the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the following embodiments of the present invention will be described in further detail with reference to the accompanying drawings.
It should be noted that, in the embodiments of the present invention, the expressions "first" and "second" are used only to distinguish two distinct entities or parameters that share the same name. They are used merely for convenience of expression and should not be construed as limiting the embodiments of the present invention; this note is not repeated in the following embodiments.
Based on the above object, a first aspect of the embodiments of the present invention proposes an embodiment of an abnormal data detection model updating method capable of reducing the calculation cost, improving the processing speed, improving the accuracy, and being easy to implement. Fig. 1 is a schematic flow chart of an updating method of an abnormal data detection model based on data flow.
The method for updating the abnormal data detection model based on the data flow, as shown in fig. 1, comprises the following steps:
step S101: training an initialization model of a clustering algorithm based on the initial data set to obtain a clustering model comprising a normal data set and an abnormal data set;
step S103: continuously receiving new data from the data stream, processing each new data, and sequentially executing the following steps:
step S1031: determining new data as normal data or abnormal data based on the clustering model;
step S1033: determining whether the clustering model has abnormal data in a neighborhood range of the new data in response to the new data being the normal data;
step S1035: in response to the existence of the abnormal data, resetting each abnormal data as normal data, and further iteratively and repeatedly executing the previous step by taking the reset data as new data until no abnormal data exist in the neighborhood range;
step S1037: all normal data obtained from the resetting of the abnormal data is moved from the abnormal data set into the normal data set, and new data is added to the normal data set.
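Steps S1031 to S1037 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the text does not specify how the clustering model decides that new data is normal, so the rule in `is_normal` (at least one known normal point within `eps`) is a hypothetical stand-in, and the names `update_model`, `neighbors` and `eps` are likewise chosen for illustration. Points are tuples of numbers.

```python
import math
from collections import deque


def neighbors(center, points, eps):
    """Return the points lying within distance eps of center."""
    return [p for p in points if math.dist(center, p) <= eps]


def is_normal(point, normal, eps):
    # Assumed rule (not from the source): a point is normal when at least
    # one known normal point lies within its neighborhood.
    return any(math.dist(point, q) <= eps for q in normal)


def update_model(point, normal, abnormal, eps):
    """Process one new point and update both data sets in place."""
    # S1031: classify the new point with the current model.
    if not is_normal(point, normal, eps):
        abnormal.append(point)
        return False
    reset = []
    queue = deque([point])
    while queue:
        current = queue.popleft()
        # S1033: look for abnormal data in the neighborhood of the current point.
        for q in neighbors(current, abnormal, eps):
            # S1035: reset q to normal and treat it as new data in a later pass.
            abnormal.remove(q)  # removed at once so it is not found again
            reset.append(q)
            queue.append(q)
    # S1037: move the reset data into the normal set and add the new point.
    normal.extend(reset)
    normal.append(point)
    return True


normal_set = [(0.0, 0.0), (0.5, 0.0)]
abnormal_set = [(1.4, 0.0), (2.3, 0.0), (10.0, 10.0)]
update_model((1.0, 0.0), normal_set, abnormal_set, eps=1.0)
```

In this sketch the point (1.4, 0.0) is reset because it lies near the new point, and (2.3, 0.0) is reset in the next pass because it lies near (1.4, 0.0); only the stored points near the new data are examined, and nothing is retrained.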
The unsupervised learning algorithm addressed by the present invention is a commonly used algorithm for detecting anomalies in industrial data. Because an unsupervised learning algorithm needs to learn from a large amount of data, most platforms currently update the anomaly detection model offline: the model is first trained offline and then deployed online for prediction, and when it needs to be updated (under a manual or automatic strategy), it must be retrained on the collected data and deployed again. Under this strategy the old model is not updated in real time and cannot adapt to new data promptly, which harms accuracy, and retraining on a large amount of data makes the computational cost high. A streaming data platform delivers data in real time, which provides the data needed to update a model online, and real-time online update algorithms are already in use there. For example, the FTRL algorithm used by some platforms suits algorithms based on logistic regression well, but it is difficult to understand, requires a model that supports back-propagation, and is not suitable for unsupervised algorithms. Therefore, building on the idea of the FTRL algorithm, the invention provides a corresponding online model updating algorithm for unsupervised algorithms that operate on stream data in an industrial big data analysis platform. It can improve the accuracy of the model in real time, and it features low computational cost, high speed, and ease of use.
Those skilled in the art will appreciate that implementing all or part of the above-described methods in the embodiments may be accomplished by a computer program to instruct related hardware, and the program may be stored in a computer readable storage medium, and the program may include the steps of the embodiments of the above-described methods when executed. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a Random Access Memory (RAM), or the like. Embodiments of the computer program may achieve the same or similar effects as any of the previously described method embodiments corresponding thereto.
In some embodiments, the method further comprises: performing abnormal data processing on the new data using a clustering model in response to the new data being abnormal data, and adding the new data to the abnormal data set.
In some embodiments, processing each new data item includes: processing the previous new data item and, after both the normal data set and the abnormal data set have been updated as a result of that processing, processing the next new data item using the clustering model with the updated normal data set and the updated abnormal data set.
In some embodiments, training an initialization model of a clustering algorithm based on an initial data set includes:
determining a neighborhood radius and a neighborhood density based on the initial dataset, and marking all data in the initial dataset as unvisited;
repeatedly selecting, at random, data marked as unvisited from the initial dataset and performing the following steps in sequence, until the initial dataset no longer includes data marked as unvisited:
changing the mark of the selected data from unvisited to visited;
determining data located within a neighborhood radius of the selected data based on the distance of the selected data from other data and the neighborhood radius;
creating a new cluster and incorporating the selected data into the new cluster in response to the number of data within the neighborhood radius of the selected data being greater than or equal to the neighborhood density and none belonging to any existing cluster;
changing the marks of the data within the neighborhood radius of the selected data to visited and including those data in the new cluster, and repeating the above steps with each data item within the neighborhood radius of the selected data taken in turn as the selected data.
In some embodiments, the clustering algorithm is the DBSCAN algorithm; the distance from the selected data to other data is the Euclidean distance or the Mahalanobis distance; the method further comprises: marking the selected data as noise in response to the number of data within the neighborhood radius of the selected data being less than the neighborhood density and at least one data belonging to any existing cluster.
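The two distance options named above can be sketched as follows. This is an illustration only: the function names are hypothetical, and the Mahalanobis version assumes that the caller supplies the inverse covariance matrix `inv_cov` of the data as a list of rows, since the text does not say how that matrix is obtained.

```python
import math


def euclidean(x, y):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def mahalanobis(x, y, inv_cov):
    """Distance weighted by the inverse covariance matrix inv_cov."""
    d = [a - b for a, b in zip(x, y)]
    total = 0.0
    for i, d_i in enumerate(d):
        for j, d_j in enumerate(d):
            total += d_i * inv_cov[i][j] * d_j  # accumulates d^T * inv_cov * d
    return math.sqrt(total)


identity = [[1.0, 0.0], [0.0, 1.0]]
plain = euclidean((0.0, 0.0), (3.0, 4.0))
scaled = mahalanobis((0.0, 0.0), (3.0, 4.0), identity)
```

With an identity matrix the two distances coincide. A non-identity `inv_cov` rescales each feature by its variance and accounts for correlation between features, which can matter when the features of the industrial data use very different units.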
The method disclosed in the embodiments of the present invention may also be implemented as a computer program executed by a CPU, and the program may be stored in a computer-readable storage medium. When executed by the CPU, the computer program performs the functions defined above in the methods disclosed in the embodiments of the present invention. The above method steps and system units may also be implemented with a controller and a computer-readable storage medium storing a computer program that causes the controller to implement the above steps or unit functions.
Specific embodiments of the present invention are further described below with reference to specific examples.
The DBSCAN algorithm is a density-based clustering algorithm that is widely used in anomaly detection. Its model is a set of labeled clusters of data objects, and the algorithm comprises the following steps:
1. inputting a training data set D, a neighborhood radius parameter ε, and a neighborhood density threshold MinPts;
2. marking all points as unvisited;
3. randomly selecting an unvisited point p and marking it as visited;
4. calculating the distance between the point p and the other points (the distance can be the Euclidean distance, the Mahalanobis distance, or the like); if the neighborhood of the point p contains at least MinPts objects and the point p does not belong to any set, creating a new set and adding the point p to it; otherwise, if the point p does not belong to any set, marking the point p as noise;
5. marking all objects in the neighborhood of the point p as visited, and adding them to the set in which the point p is located;
6. repeating steps 4 and 5 with each object in the neighborhood of the point p taken as the point p, until the neighborhoods of the visited points contain no points still marked as unvisited;
7. repeating steps 3, 4, and 5 until all data have been traversed.
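The steps above can be sketched in pure Python. This is a minimal illustration that follows the conventional formulation of DBSCAN rather than the exact wording of the list: points are visited in index order instead of at random, a point counts itself among its neighbors, and noise is labeled -1. The names `dbscan`, `region_query`, `eps` and `min_pts` are assumptions for illustration.

```python
import math

NOISE = -1


def region_query(data, i, eps):
    """Indices of all points within eps of data[i], including i itself."""
    return [j for j, q in enumerate(data) if math.dist(data[i], q) <= eps]


def dbscan(data, eps, min_pts):
    """Return one label per point: a cluster id starting at 0, or NOISE."""
    labels = [None] * len(data)  # None means unvisited
    cluster = -1
    for i in range(len(data)):
        if labels[i] is not None:
            continue  # already visited
        nbrs = region_query(data, i, eps)
        if len(nbrs) < min_pts:
            labels[i] = NOISE  # may later be absorbed as a border point
            continue
        cluster += 1  # the point is dense enough, so it starts a new cluster
        labels[i] = cluster
        seeds = list(nbrs)
        k = 0
        while k < len(seeds):
            j = seeds[k]
            k += 1
            if labels[j] == NOISE:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue  # already assigned, do not expand again
            labels[j] = cluster
            j_nbrs = region_query(data, j, eps)
            if len(j_nbrs) >= min_pts:
                seeds.extend(j_nbrs)  # dense neighbor: keep growing the cluster
    return labels


points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
normal_set = [p for p, lab in zip(points, labels) if lab != NOISE]
abnormal_set = [p for p, lab in zip(points, labels) if lab == NOISE]
```

The last two lines show one way to derive the normal data set and the abnormal data set of the clustering model: clustered points are treated as normal and noise points as abnormal. That mapping is an interpretation made for this sketch.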
The DBSCAN algorithm has the drawbacks of requiring many iterations and a large amount of computation. If offline updating is used, a large amount of data must be recalculated every time, so the computational cost is high. The invention provides a method that updates the model online by following the training process, which saves a great deal of computation, updates the model in real time, and improves the accuracy of the model on new data. The specific flow is shown in FIG. 2.
First, a portion of data is collected in real time as the training set of the initialization model, and the model initialization training is performed. The model is then used to judge whether each subsequent item of industrial data is an abnormal value. If it is, the abnormal value processing flow is entered. If it is not, the neighborhood range is searched for abnormal values, and any abnormal values found are marked as normal values and assigned to the normal value cluster. These reset values are then used to update other abnormal values iteratively until all of them have been traversed, so that some abnormal values are corrected to normal values and the accuracy of the model is improved in real time.
In addition, update iterations of the model can also be achieved by online detection combined with offline updating. For example, the model is deployed on the platform for online anomaly detection; when the amount of detected data reaches a certain quantity, an offline update of the model is triggered, and the updated model is then redeployed on the platform (the offline update may be triggered according to different strategies).
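The count-based trigger mentioned in this alternative can be sketched as follows. It is a hypothetical illustration: the class `OfflineUpdateTrigger`, the `train_fn` callback and the `threshold` parameter are assumed names, and retraining on all data collected so far is only one possible strategy.

```python
class OfflineUpdateTrigger:
    """Retrain the model offline once enough new data has been detected."""

    def __init__(self, train_fn, threshold):
        self.train_fn = train_fn    # callable that builds a model from a list of data
        self.threshold = threshold  # number of new items that triggers retraining
        self.collected = []         # all data seen so far
        self.since_update = 0       # items seen since the last retraining
        self.model = None
        self.version = 0

    def observe(self, point):
        """Record one detected item; return True if retraining was triggered."""
        self.collected.append(point)
        self.since_update += 1
        if self.since_update < self.threshold:
            return False
        # Offline update: rebuild from the collected data, then swap in the new model.
        self.model = self.train_fn(list(self.collected))
        self.version += 1
        self.since_update = 0
        return True


# Usage with a stand-in training function that only counts its input.
trigger = OfflineUpdateTrigger(train_fn=len, threshold=3)
flags = [trigger.observe(value) for value in range(7)]
```

Each retraining pass touches the whole collected data set, which illustrates why this route costs more computation than updating only the neighborhood of each new point.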
As can be seen from the above embodiments, in the method for updating an abnormal data detection model based on a data stream provided by the embodiments of the invention, an initialization model of a clustering algorithm is trained on an initial data set to obtain a clustering model comprising a normal data set and an abnormal data set; new data are continuously received from the data stream, and each new data item is processed in turn by: determining from the clustering model whether the new data is normal data or abnormal data; in response to the new data being normal data, determining whether the clustering model contains abnormal data within the neighborhood range of the new data; in response to such abnormal data existing, resetting each of them as normal data and repeating the previous step iteratively with the reset data taken as the new data, until no abnormal data remain within the neighborhood range; and moving all data reset from abnormal to normal out of the abnormal data set into the normal data set and adding the new data to the normal data set. This technical scheme can reduce computational cost, increase processing speed, improve accuracy, and is easy to implement.
It should be noted that the steps in the foregoing embodiments of the method for updating an abnormal data detection model based on a data stream may be interleaved, replaced, added, or deleted. Methods obtained through such reasonable permutations and combinations therefore also fall within the protection scope of the present invention, and the protection scope of the present invention should not be limited to the embodiments.
In view of the above-described object, a second aspect of the embodiments of the present invention proposes an embodiment of an abnormal data detection model update apparatus capable of reducing the calculation cost, improving the processing speed, improving the accuracy, and being easy to implement. The abnormal data detection model updating device based on the data flow comprises:
a processor; and
a memory storing program code executable by a processor, the program code when executed performing the steps of:
training an initialization model of a clustering algorithm based on the initial data set to obtain a clustering model comprising a normal data set and an abnormal data set;
continuously receiving new data from the data stream, processing each new data, and sequentially executing the following steps:
determining new data as normal data or abnormal data based on the clustering model;
determining whether the clustering model has abnormal data in a neighborhood range of the new data in response to the new data being the normal data;
in response to the existence of the abnormal data, resetting each abnormal data as normal data, and further iteratively and repeatedly executing the previous step by taking the reset data as new data until no abnormal data exist in the neighborhood range;
all normal data obtained from the resetting of the abnormal data is moved from the abnormal data set into the normal data set, and new data is added to the normal data set.
In some embodiments, the steps further comprise: performing abnormal data processing on the new data using a clustering model in response to the new data being abnormal data, and adding the new data to the abnormal data set.
In some embodiments, processing for each new data includes: and processing the previous new data, and after both the normal data set and the abnormal data set caused by processing the previous new data are updated, processing the next new data by using a clustering model with the updated normal data set and the updated abnormal data set.
In some embodiments, training an initialization model of a clustering algorithm based on an initial data set includes:
determining a neighborhood radius and a neighborhood density based on the initial dataset, and marking all data in the initial dataset as unvisited;
the random selection of data marked as non-accessed from the initial dataset is continued to sequentially perform the following steps until the initial dataset does not include data marked as non-accessed:
modifying the mark of the selected data from unvisited to accessed;
determining data located within a neighborhood radius of the selected data based on the distance of the selected data from other data and the neighborhood radius;
creating a new cluster and incorporating the selected data into the new cluster in response to the number of data within the neighborhood radius of the selected data being greater than or equal to the neighborhood density and none belonging to any existing cluster;
modifying the marks of the data in the neighborhood radius of the selected data to be accessed to be included in the new clusters, and repeatedly executing the steps by taking the data in the neighborhood radius of the selected data as the selected data respectively.
In some embodiments, the clustering algorithm is a DBSCAN algorithm; the distance from the selected data to other data is a Euclidean distance or a Mahalanobis distance; the method further comprises the step of: marking the selected data as noise in response to the number of data within the neighborhood radius of the selected data being less than the neighborhood density and at least one data belonging to any existing cluster.
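To illustrate how the choice of distance affects neighborhood membership, the sketch below compares a Euclidean distance with a Mahalanobis distance, taking the second of the two distance options named above to be the Mahalanobis distance. It is a pure-Python sketch under stated assumptions: the helper names `covariance`, `solve`, and `make_mahalanobis` are hypothetical, the covariance matrix is estimated from the initial dataset, and that matrix is assumed to be non-singular.

```python
import math


def covariance(data):
    """Sample covariance matrix of a list of equal-length points."""
    n, dim = len(data), len(data[0])
    mean = [sum(p[k] for p in data) / n for k in range(dim)]
    return [[sum((p[a] - mean[a]) * (p[b] - mean[b]) for p in data) / (n - 1)
             for b in range(dim)] for a in range(dim)]


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x


def make_mahalanobis(data):
    """Build a distance function sqrt(d^T S^-1 d), with S estimated from data."""
    cov = covariance(data)

    def dist(a, b):
        d = [x - y for x, y in zip(a, b)]
        # Solving S x = d avoids forming the inverse of S explicitly.
        q = sum(di * xi for di, xi in zip(d, solve(cov, d)))
        return math.sqrt(max(0.0, q))

    return dist
```

For data that is stretched along one axis, the Euclidean distance `math.dist` treats a step along the stretched axis as far larger than a step along the narrow axis, while the Mahalanobis distance rescales both by the observed spread. A single neighborhood radius can therefore select quite different neighbors under the two distances.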
As can be seen from the above embodiments, the device for updating an abnormal data detection model based on a data stream according to the embodiments of the present invention adopts the following technical solution: training an initialization model of a clustering algorithm based on an initial data set to obtain a clustering model including a normal data set and an abnormal data set; continuously receiving new data from the data stream and processing each new data item in turn by determining whether the new data is normal data or abnormal data based on the clustering model; in response to the new data being normal data, determining whether abnormal data exist in the clustering model within the neighborhood range of the new data; in response to abnormal data existing, resetting each abnormal data item as normal data and iteratively repeating the previous step with the reset data taken as the new data until no abnormal data exist in the neighborhood range; and moving all normal data obtained by resetting abnormal data from the abnormal data set into the normal data set and adding the new data to the normal data set.
It should be noted that the foregoing embodiment of the data stream based abnormal data detection model updating device uses the embodiment of the data stream based abnormal data detection model updating method to describe the operation of each module, and those skilled in the art will readily see that these modules can be applied to other embodiments of the method. Of course, since the steps in the method embodiment can be interleaved, replaced, added, and deleted, such reasonable permutations, combinations, and transformations of the device also fall within the protection scope of the invention, and the protection scope of the invention should not be limited to the described embodiments.
The foregoing is an exemplary embodiment of the present disclosure, but it should be noted that various changes and modifications could be made herein without departing from the scope of the disclosure as defined by the appended claims. The functions, steps and/or actions of the method claims in accordance with the disclosed embodiments described herein need not be performed in any particular order. Furthermore, although elements of the disclosed embodiments may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated.
It should be understood that, as used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that "and/or" as used herein is meant to include any and all possible combinations of one or more of the associated listed items. The serial numbers of the foregoing embodiments of the present invention are for description only and do not represent the relative merits of the embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
Those of ordinary skill in the art will appreciate that: the above discussion of any embodiment is merely exemplary and is not intended to imply that the scope of the disclosure of embodiments of the invention, including the claims, is limited to such examples; combinations of features of the above embodiments or in different embodiments are also possible within the idea of an embodiment of the invention, and there are many other variations of the different aspects of the embodiments of the invention as described above, which are not provided in detail for the sake of brevity. Therefore, any omission, modification, equivalent replacement, improvement, etc. of the embodiments should be included in the protection scope of the embodiments of the present invention.

Claims (6)

1. A method for updating an abnormal data detection model based on a data stream, comprising the steps of:
training an initialization model of a clustering algorithm based on the initial data set to obtain a clustering model comprising a normal data set and an abnormal data set;
continuously receiving new data from a data stream and processing each of said new data, wherein, in response to said normal data set and said abnormal data set having been updated while a previous new data was processed, a subsequent new data is processed using said clustering model with the updated said normal data set and said abnormal data set, and wherein said step of processing each of said new data comprises the following steps, in order:
determining that the new data is normal data or abnormal data based on the clustering model;
determining whether abnormal data exists in a neighborhood range of the new data by the clustering model in response to the new data being normal data;
resetting each abnormal data to be normal data in response to the existence of the abnormal data, and further iteratively and repeatedly executing the last step by taking the reset data as the new data until no abnormal data exist in the neighborhood range;
performing an abnormal data process on the new data using the clustering model in response to the new data being abnormal data, and adding the new data to the abnormal data set;
and moving all the normal data obtained from the abnormal data reset from the abnormal data set into the normal data set, and adding the new data into the normal data set.
2. The method of claim 1, wherein training an initialization model of a clustering algorithm based on an initial dataset comprises:
determining a neighborhood radius and a neighborhood density based on the initial dataset, and marking all data in the initial dataset as unvisited;
repeatedly selecting, at random, data marked as unvisited from the initial dataset and performing the following steps in order, until the initial dataset contains no data marked as unvisited:
changing the mark of the selected data from unvisited to visited;
determining data located within a neighborhood radius of the selected data based on a distance of the selected data from other data and the neighborhood radius;
creating a new cluster and incorporating the selected data into the new cluster in response to the number of data within the neighborhood radius of the selected data being greater than or equal to the neighborhood density and none belonging to any existing cluster;
changing the marks of the data within the neighborhood radius of the selected data to visited and incorporating those data into the new cluster, and repeating the above steps with each data item within the neighborhood radius of the selected data taken in turn as the selected data.
3. The method of claim 2, wherein the clustering algorithm is a DBSCAN algorithm; the distance from the selected data to other data is a Euclidean distance or a Mahalanobis distance;
the method further comprises the steps of: the selected data is marked as noise in response to the number of data within the neighborhood radius of the selected data being less than the neighborhood density and at least one data belonging to any existing cluster.
4. An abnormal data detection model updating device based on a data stream, comprising:
a processor; and
a memory storing program code executable by a processor, the program code when executed performing the steps of:
training an initialization model of a clustering algorithm based on the initial data set to obtain a clustering model comprising a normal data set and an abnormal data set;
continuously receiving new data from a data stream and processing each of said new data, wherein, in response to said normal data set and said abnormal data set having been updated while a previous new data was processed, a subsequent new data is processed using said clustering model with the updated said normal data set and said abnormal data set, and wherein said step of processing each of said new data comprises the following steps, in order:
determining that the new data is normal data or abnormal data based on the clustering model;
determining whether abnormal data exists in a neighborhood range of the new data by the clustering model in response to the new data being normal data;
resetting each abnormal data to be normal data in response to the existence of the abnormal data, and further iteratively and repeatedly executing the last step by taking the reset data as the new data until no abnormal data exist in the neighborhood range;
performing an abnormal data process on the new data using the clustering model in response to the new data being abnormal data, and adding the new data to the abnormal data set;
and moving all the normal data obtained from the abnormal data reset from the abnormal data set into the normal data set, and adding the new data into the normal data set.
5. The apparatus of claim 4, wherein training an initialization model of a clustering algorithm based on an initial dataset comprises:
determining a neighborhood radius and a neighborhood density based on the initial dataset, and marking all data in the initial dataset as unvisited;
repeatedly selecting, at random, data marked as unvisited from the initial dataset and performing the following steps in order, until the initial dataset contains no data marked as unvisited:
changing the mark of the selected data from unvisited to visited;
determining data located within a neighborhood radius of the selected data based on a distance of the selected data from other data and the neighborhood radius;
creating a new cluster and incorporating the selected data into the new cluster in response to the number of data within the neighborhood radius of the selected data being greater than or equal to the neighborhood density and none belonging to any existing cluster;
changing the marks of the data within the neighborhood radius of the selected data to visited and incorporating those data into the new cluster, and repeating the above steps with each data item within the neighborhood radius of the selected data taken in turn as the selected data.
6. The apparatus of claim 5, wherein the clustering algorithm is a DBSCAN algorithm; the distance from the selected data to other data is a Euclidean distance or a Mahalanobis distance;
the steps further include: the selected data is marked as noise in response to the number of data within the neighborhood radius of the selected data being less than the neighborhood density and at least one data belonging to any existing cluster.
CN202010751183.6A 2020-07-30 2020-07-30 Abnormal data detection model updating method and device based on data flow Active CN111814908B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010751183.6A CN111814908B (en) 2020-07-30 2020-07-30 Abnormal data detection model updating method and device based on data flow

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010751183.6A CN111814908B (en) 2020-07-30 2020-07-30 Abnormal data detection model updating method and device based on data flow

Publications (2)

Publication Number Publication Date
CN111814908A CN111814908A (en) 2020-10-23
CN111814908B CN111814908B (en) 2023-06-27

Family

ID=72863426

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010751183.6A Active CN111814908B (en) 2020-07-30 2020-07-30 Abnormal data detection model updating method and device based on data flow

Country Status (1)

Country Link
CN (1) CN111814908B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112651617B (en) * 2020-12-21 2022-07-01 福州大学 Wind turbine generator anomaly detection method based on batch flow unified data clustering

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107528823A (en) * 2017-07-03 2017-12-29 中山大学 A kind of network anomaly detection method based on improved K Means clustering algorithms
CN109032829A (en) * 2018-07-23 2018-12-18 腾讯科技(深圳)有限公司 Data exception detection method, device, computer equipment and storage medium
WO2020038353A1 (en) * 2018-08-21 2020-02-27 瀚思安信(北京)软件技术有限公司 Abnormal behavior detection method and system
CN110942099A (en) * 2019-11-29 2020-03-31 华侨大学 Abnormal data identification and detection method of DBSCAN based on core point reservation

Also Published As

Publication number Publication date
CN111814908A (en) 2020-10-23

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant