CN111813618A - Data anomaly detection method, device, equipment and storage medium - Google Patents


Info

Publication number
CN111813618A
Authority
CN
China
Prior art keywords
data
classification model
hypersphere
value
unmarked
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010468770.4A
Other languages
Chinese (zh)
Inventor
邓悦
郑立颖
徐亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202010468770.4A priority Critical patent/CN111813618A/en
Priority to PCT/CN2020/118524 priority patent/WO2021139249A1/en
Publication of CN111813618A publication Critical patent/CN111813618A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3065 Monitoring arrangements determined by the means or processing involved in reporting the monitored data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133 Distances to prototypes
    • G06F18/24143 Distances to neighbourhood prototypes, e.g. restricted Coulomb energy networks [RCEN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a data anomaly detection method, a device, equipment and a storage medium, wherein the method comprises the following steps: acquiring unmarked data; extracting primary abnormal data from the unmarked data according to a preset query strategy; after the primary abnormal data are subjected to identification marking, the primary abnormal data are stored into a marked first data set to form a second data set, and a pre-trained hypersphere classification model is trained through the second data set; identifying whether the hypersphere classification model reaches a training termination condition; and when the training termination condition is reached, inputting the unlabeled data into the hypersphere classification model under the training termination condition for classification screening to obtain target abnormal data. The method of the application trains the classification model by using a small amount of marked data, classifies the unmarked data by using the classification model after reaching the training termination condition, has no limitation on the original distribution of the data, reduces the data amount required to be marked by operators, and has high accuracy of the classification result.

Description

Data anomaly detection method, device, equipment and storage medium
Technical Field
The application relates to the technical field of data processing in artificial intelligence, in particular to a data anomaly detection method, device, equipment and storage medium.
Background
Monitoring of computer systems is an important component of intelligent operations (AIOps). During monitoring, the CPU, disks and other components of a computer system generate a large amount of index data, which also includes some abnormal values. Analyzing the abnormal points makes it possible to find the cause of a system abnormality and to provide suggestions for subsequent operation. Anomaly detection technology therefore plays an important role in the field of intelligent operations.
Conventional anomaly detection includes statistical-based methods and density-based methods.
The statistics-based method usually trains on a large amount of labeled data in order to find suspected abnormal points, and belongs to supervised learning. Past experience shows that supervised learning has the following problems in practical anomaly detection applications:
1. Most of the mass data generated while a program runs is unmarked, and data marking usually has to be done by professionals, so obtaining enough data labels consumes a large amount of manpower, material resources and money.
2. Abnormal data makes up only a small proportion of the data, so potential abnormal points and their corresponding classifications have to be found in a large amount of data.
The density-based method belongs to unsupervised learning and can be carried out without data marking, but its detection accuracy is generally low and its classification results lack theoretical support.
Disclosure of Invention
The application aims to provide a data anomaly detection method, a data anomaly detection device, data anomaly detection equipment and a storage medium. The following presents a simplified summary in order to provide a basic understanding of some aspects of the disclosed embodiments. This summary is not an extensive overview and is intended to neither identify key/critical elements nor delineate the scope of such embodiments. Its sole purpose is to present some concepts in a simplified form as a prelude to the more detailed description that is presented later.
According to an aspect of an embodiment of the present application, there is provided a data anomaly detection method, including:
acquiring unmarked data;
extracting primary abnormal data from the unmarked data according to a preset query strategy, wherein the primary abnormal data is data which meets a preset condition in the unmarked data screened out by the query strategy;
after the primary abnormal data are subjected to identification marking, storing the primary abnormal data into a marked first data set to form a second data set, and training a pre-trained hypersphere classification model through the second data set, wherein the hypersphere classification model is a classification model that fits, for the currently marked data, a hypersphere covering most sample values in a high-dimensional space and identifies abnormal data and normal data by taking the hypersphere surface as a boundary;
identifying whether the hypersphere classification model reaches a training termination condition;
and when the training termination condition is reached, inputting the unlabeled data into the hypersphere classification model under the training termination condition for classification screening to obtain target abnormal data.
Further, the training method of the hypersphere classification model comprises the following steps:
respectively setting different penalty coefficients for the abnormal data and the normal data to generate a loss function, wherein the penalty coefficients are preset constants;
after constraint conditions are set, calculating to obtain a sphere center value representing the central position of the hyper-sphere and a radius value representing the distance between the sphere center value of the hyper-sphere and the surface of the hyper-sphere in the hyper-sphere classification model;
and generating a decision function for identifying a normal value and an abnormal value according to the sphere center value and the radius value.
Further, when the training termination condition is reached, inputting the unlabeled data into the hypersphere classification model under the training termination condition for classification and screening to obtain target abnormal data, including:
respectively substituting the unlabeled data into the decision function to generate decision result values;
judging whether each decision result value is greater than or equal to zero;
and when the decision result value is greater than or equal to zero, outputting the corresponding unmarked data and marking it as target abnormal data.
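The classification step above can be sketched in a few lines. This is a minimal illustration, assuming the decision function takes the form f(x) = ||x - a|| - R that the detailed description gives, with a Euclidean norm; the function names `decision_values` and `flag_anomalies` and the sample values are hypothetical and are not taken from the application.

```python
import numpy as np


def decision_values(X, center, radius):
    """Return f(x) = ||x - a|| - R for every row x of X (Euclidean norm assumed)."""
    X = np.asarray(X, dtype=float)
    center = np.asarray(center, dtype=float)
    return np.linalg.norm(X - center, axis=1) - radius


def flag_anomalies(X, center, radius):
    """Boolean mask that is True where the decision result value is >= 0."""
    return decision_values(X, center, radius) >= 0.0


# Hypothetical 2-D samples and a unit sphere centered at the origin.
X = np.array([[0.0, 0.0], [0.5, 0.5], [3.0, 4.0]])
mask = flag_anomalies(X, center=[0.0, 0.0], radius=1.0)
print(mask)  # only the third sample lies outside the sphere
```

A sample exactly on the surface gives f(x) = 0 and is flagged here, following the "greater than or equal to zero" wording of this step.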
Further, the query strategy screens primary abnormal data based on the pre-trained hypersphere classification model, and the preset condition is that a weighted distance value from the hypersphere surface of the hypersphere classification model is minimum.
Further, the extracting primary abnormal data from the unmarked data according to a preset query policy includes:
substituting the unmarked data into the decision function and taking the absolute value to obtain the nearest spherical distance;
calculating, for each unmarked data item, the distance values to the other unmarked data, and taking the minimum value as the nearest neighbor sample distance;
and normalizing the nearest spherical distance and the nearest neighbor sample distance, and weighting by using a preset coefficient to obtain a weighted distance value of each unlabeled data.
Further, the method for the nearest spherical distance normalization processing comprises the following steps:
selecting a first minimum value with a minimum value and a first maximum value with a maximum value from the nearest spherical distances of all the unmarked data;
dividing the difference between the nearest spherical distance of each unmarked data and the first minimum value by the first maximum value to obtain the normalized nearest spherical distance corresponding to all the unmarked data.
Further, the method for distance normalization processing of nearest neighbor samples includes:
selecting a second minimum value with the smallest value and a second maximum value with the largest value from the nearest neighbor sample distances of all the unlabeled data;
and respectively calculating the difference between the nearest neighbor sample distance of each unlabeled data and the second minimum value, and dividing the difference by the second maximum value to obtain the normalized nearest neighbor sample distances of all unlabeled data.
According to another aspect of the embodiments of the present application, there is provided a data anomaly detection apparatus including:
an acquisition module: configured to acquire unmarked data;
a query module: configured to extract primary abnormal data from the unmarked data according to a preset query strategy, wherein the primary abnormal data is data which meets a preset condition in the unmarked data screened out by the query strategy;
a training module: configured to perform identification marking on the primary abnormal data, store the primary abnormal data into a marked first data set to form a second data set, and train a pre-trained hypersphere classification model through the second data set, wherein the hypersphere classification model is a classification model that fits, for the currently marked data, a hypersphere covering most sample values in a high-dimensional space and identifies abnormal data and normal data by taking the hypersphere surface as a boundary;
an identification module: configured to identify whether the hypersphere classification model reaches a training termination condition;
a result output module: configured to input the unlabeled data into the hypersphere classification model under the training termination condition for classification screening when the training termination condition is reached, so as to obtain target abnormal data.
According to another aspect of the embodiments of the present application, there is provided a computer device including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the data anomaly detection method described above when executing the computer program.
According to another aspect of embodiments of the present application, there is provided a computer-readable storage medium having a computer program stored thereon, the computer program being executed by a processor to implement the data anomaly detection method described above.
The technical scheme provided by one aspect of the embodiment of the application can have the following beneficial effects:
The data anomaly detection method provided by the embodiment of the application trains a hypersphere classification model by using a small amount of marked data. Once the training termination condition is reached, the hypersphere classification model is used to classify the unmarked data; otherwise, training of the hypersphere classification model continues with the updated marked data. The method combines an unsupervised method with a supervised method. A hypersphere classification model trained on a small amount of marked data places no restriction on the original distribution of the data, so its application range is wider. The query strategy, which is based on boundary distance and sample density, can accurately find the most valuable data and reduces the influence of noise. This greatly reduces the amount of data that operators need to mark while maintaining the classification precision of the hypersphere classification model, saves the cost of intelligent operations, is better suited to practical industrial scenarios, and is convenient for large-scale deployment.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the embodiments of the application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments described in the present application, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 shows a flow diagram of a data anomaly detection method of an embodiment of the present application;
FIG. 2 is a flowchart illustrating steps involved in extracting primary anomalous data from the unlabeled data according to a predetermined query policy, according to an embodiment of the present application;
FIG. 3 is a flowchart of a method for nearest sphere distance normalization processing according to an embodiment of the present application;
FIG. 4 is a flow diagram illustrating a method for nearest neighbor sample distance normalization in accordance with an embodiment of the present application;
FIG. 5 illustrates a flowchart of a method of training a hypersphere classification model of an embodiment of the present application;
FIG. 6 is a flowchart illustrating the steps involved in entering the unlabeled data into the hypersphere classification model under training termination conditions for classification screening to obtain target anomaly data according to an embodiment of the present application;
FIG. 7 is a block diagram of a data anomaly detection apparatus according to an embodiment of the present application;
FIG. 8 shows a hardware architecture diagram of a computer device of an embodiment of the present application;
FIG. 9 shows a flow chart of a data anomaly detection method according to another embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is further described with reference to the accompanying drawings and specific embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
As shown in fig. 1, an embodiment of the present application provides a data anomaly detection method, including:
S1, acquiring unlabeled data.
In the actual intelligent operation process, data generated by a computer system is often unbalanced, and most of the data belong to normal data, so that abnormal data detection in the operation process can be regarded as a single classification problem. Considering that the monitoring index data of the computer system is distributed in the high-dimensional space, the trained classification model needs to have the capability of distinguishing whether the high-dimensional space data is normal or not. Data generated by a computer system is divided into marked data and unmarked data, the marked data is divided into a marked data set, and the unmarked data is divided into an unmarked data set. The classification model may also be referred to as a classifier.
For example, the monitoring index data of the computer system in one embodiment is shown in table 1:
TABLE 1 System monitoring index data
[Table 1 appears as an image in the original publication and is not reproduced here.]
S2, extracting primary abnormal data from the unmarked data according to a preset query strategy, wherein the primary abnormal data is data which meets a preset condition in the unmarked data screened out by the query strategy.
Because operators have limited capacity and there is a large amount of unmarked data, they cannot mark the unmarked data one by one; the query strategy is therefore used to decide which unmarked data are selected for the operators to mark.
The query strategy screens primary abnormal data based on the pre-trained hypersphere classification model, and the preset condition is that the weighted distance value from the hypersphere surface of the hypersphere classification model is minimum.
In certain embodiments, as shown in fig. 2, step S2 includes:
S21, substituting the unmarked data into the decision function and taking the absolute value to obtain the nearest spherical distance;
S22, calculating, for each unlabeled data item, the distance values to the other unlabeled data, and taking the minimum value as the nearest neighbor sample distance;
S23, normalizing the nearest spherical distance and the nearest neighbor sample distance, and weighting them by a preset coefficient to obtain a weighted distance value of each unlabeled data.
The surface of the hypersphere in the hypersphere classification model is the key to distinguishing whether index data is normal, and it is the most uncertain region in the high-dimensional space. Therefore, the distance from the data x to the surface of the hypersphere is taken as a measure and is recorded as the nearest spherical distance $|f(x)|$.
In addition, the more concentrated the data distribution in the regions through which the hypersphere surface passes, the more representative the data. Therefore, the distance between a data item and its nearest neighboring data item is used to measure the distribution density, and is recorded as the nearest neighbor sample distance $d(x, NN_1(x))$. The larger the distribution density, the smaller the nearest neighbor sample distance. Therefore, when two points are at the same distance from the boundary, the sample with the higher density in its vicinity (that is, the smaller nearest neighbor sample distance) is preferentially selected.
The query strategy therefore selects the data with the smallest weighted distance each time. The data with the smallest weighted distance is the most representative data, i.e. the primary abnormal data.
As shown in fig. 3, the method for normalizing the nearest spherical distance includes:
S231, selecting a first minimum value with a minimum numerical value and a first maximum value with a maximum numerical value from the nearest spherical distances of all the unmarked data;
S232, dividing the difference between the nearest spherical distance of each unmarked data and the first minimum value by the first maximum value to obtain the normalized nearest spherical distance corresponding to all the unmarked data.
In actual operation, when calculating the normalized nearest spherical distance of all unmarked data, each unmarked data item is substituted into the decision function $f(x) = \|x - a\| - R$ and the absolute value is taken to obtain its nearest spherical distance $|f(x)|$. The minimum value and the maximum value over all $|f(x)|$ are recorded respectively as
$$|f(x_1)| = \min_{x \in U} |f(x)|, \qquad |f(x_2)| = \max_{x \in U} |f(x)|$$
where U represents the unlabeled data set; $|f(x)|$ takes its minimum value when $x = x_1$ and its maximum value when $x = x_2$. The decision function $f(x) = \|x - a\| - R$ is the difference between the distance from the data x to the sphere center a and the radius R. The distance between a data item and the sphere center of the classification model may be referred to as the sphere-center distance of that data item.
Subtracting the minimum value from each $|f(x)|$ and dividing the result by the maximum value gives the normalized nearest spherical distance of all the data:
$$|f(x)|_{\mathrm{norm}} = \frac{|f(x)| - |f(x_1)|}{|f(x_2)|}$$
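The normalization of the nearest spherical distance can be sketched as follows. This is an illustrative sketch only: it assumes a Euclidean norm and NumPy arrays, the function name `normalized_sphere_distance` and the sample values are hypothetical, and the guard for an all-zero maximum is an addition that the text does not discuss.

```python
import numpy as np


def normalized_sphere_distance(X, center, radius):
    """Compute |f(x)| = | ||x - a|| - R | per row, then (value - min) / max."""
    X = np.asarray(X, dtype=float)
    center = np.asarray(center, dtype=float)
    dist = np.abs(np.linalg.norm(X - center, axis=1) - radius)
    d_min, d_max = dist.min(), dist.max()
    if d_max == 0.0:
        # Assumption: if every sample sits exactly on the surface, return zeros.
        return np.zeros_like(dist)
    return (dist - d_min) / d_max


# Hypothetical samples: on the surface, 1 unit outside, 3 units outside.
X = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 4.0]])
norm_sphere = normalized_sphere_distance(X, center=[0.0, 0.0], radius=1.0)
print(norm_sphere)
```

Note that the rule divides by the maximum rather than by the range (maximum minus minimum), so it differs from conventional min-max scaling whenever the minimum is not zero.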
As shown in fig. 4, the method for distance normalization processing of nearest neighbor samples includes:
S231', selecting a second minimum value with the smallest value and a second maximum value with the largest value from the nearest neighbor sample distances of all the unmarked data;
S232', calculating the difference between the nearest neighbor sample distance of each unlabeled data and the second minimum value, and dividing the difference by the second maximum value to obtain the normalized nearest neighbor sample distance of all the unlabeled data.
Specifically, when calculating the normalized nearest neighbor sample distance of all unlabeled data, for each data x the distances between x and all other data are calculated, the minimum of these distances is taken as the nearest neighbor sample distance, and the distance from x to its nearest neighbor is recorded as $d(x, NN_1(x))$. The minimum and maximum of the nearest neighbor sample distances over all data are recorded as
$$d(x_3, NN_1(x_3)) = \min_{x \in U} d(x, NN_1(x)), \qquad d(x_4, NN_1(x_4)) = \max_{x \in U} d(x, NN_1(x))$$
where U represents the unlabeled data set; $d(x, NN_1(x))$ takes its minimum value when $x = x_3$ and its maximum value when $x = x_4$.
A normalization operation is then carried out: the minimum of all nearest neighbor sample distances is subtracted from the nearest neighbor sample distance of each data item, and the difference is divided by the maximum of all nearest neighbor sample distances, giving the normalized nearest neighbor sample distance of all the data:
$$d_{\mathrm{norm}}(x) = \frac{d(x, NN_1(x)) - d(x_3, NN_1(x_3))}{d(x_4, NN_1(x_4))}$$
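A brute-force version of the nearest neighbor sample distance and its normalization might look like the sketch below. The Euclidean metric is an assumption (the text does not name a metric), the function names and sample values are hypothetical, and the pairwise computation uses O(n²) memory, so it is meant for illustration rather than for large monitoring data sets.

```python
import numpy as np


def nearest_neighbor_distance(X):
    """Distance from each row of X to its closest other row (needs >= 2 rows)."""
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]
    pairwise = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(pairwise, np.inf)  # a sample is not its own neighbor
    return pairwise.min(axis=1)


def normalize_by_min_and_max(values):
    """Apply the rule described in the text: (value - min) / max."""
    values = np.asarray(values, dtype=float)
    v_min, v_max = values.min(), values.max()
    if v_max == 0.0:
        # Assumption: all samples coincide, so every normalized distance is zero.
        return np.zeros_like(values)
    return (values - v_min) / v_max


# Hypothetical samples on a line at 0, 1 and 4.
X = np.array([[0.0, 0.0], [1.0, 0.0], [4.0, 0.0]])
nn_dist = nearest_neighbor_distance(X)
norm_nn = normalize_by_min_and_max(nn_dist)
print(nn_dist, norm_nn)
```

For larger data sets a spatial index (for example a k-d tree) would avoid the full pairwise matrix; that is a design alternative, not something the application specifies.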
The normalized nearest spherical distance and the normalized nearest neighbor sample distance of each unmarked data item are each weighted with a coefficient of 0.5 to obtain the corresponding weighted distance:
$$\text{weighted distance}(x) = 0.5 \cdot \frac{|f(x)| - |f(x_1)|}{|f(x_2)|} + 0.5 \cdot \frac{d(x, NN_1(x)) - d(x_3, NN_1(x_3))}{d(x_4, NN_1(x_4))}$$
The weighted distances of all data are sorted in ascending order, and the first five data items are taken, as listed in Table 2:
TABLE 2 Top five unlabeled data by weighted distance
[Table 2 appears as an image in the original publication and is not reproduced here.]
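The weighting and ranking step can be sketched as follows, taking the two normalized distance arrays as inputs. The function name `select_queries`, its parameters and the input values are hypothetical; the two coefficients default to 0.5 as in this embodiment, and the stable sort is an assumption made so that ties are broken by original order.

```python
import numpy as np


def select_queries(norm_sphere, norm_neighbor, k=5,
                   sphere_coef=0.5, neighbor_coef=0.5):
    """Return indices of the k smallest weighted distances, plus all scores."""
    norm_sphere = np.asarray(norm_sphere, dtype=float)
    norm_neighbor = np.asarray(norm_neighbor, dtype=float)
    scores = sphere_coef * norm_sphere + neighbor_coef * norm_neighbor
    order = np.argsort(scores, kind="stable")  # ascending weighted distance
    return order[:k], scores


# Hypothetical normalized distances for four unmarked samples.
norm_sphere = [0.0, 0.2, 1.0, 0.4]
norm_neighbor = [0.6, 0.0, 0.0, 0.4]
top, scores = select_queries(norm_sphere, norm_neighbor, k=2)
print(top, scores)
```

The first returned index is the sample the query strategy would hand to an operator for marking.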
S3, after the primary abnormal data are subjected to identification marking, storing them into the marked first data set to form a second data set, and training a pre-trained hypersphere classification model through the second data set, wherein the hypersphere classification model is a classification model that fits, for the currently marked data, a hypersphere covering most sample values in a high-dimensional space and identifies abnormal data and normal data by taking the hypersphere surface as a boundary.
The primary abnormal data are identified and marked to obtain new marked data.
The primary abnormal data are identified as normal data or abnormal data by a computer or manually and are marked according to the identification result; the marked primary abnormal data serve as the new marked data. In other words, a judgment mark is received for the most representative unmarked data, which thereby becomes new marked data.
Accordingly, following the rule of the query strategy, the unmarked data $x_{107048}$ with the smallest weighted distance is selected and handed to the AI operators for judgment and marking to obtain new marked data.
Whether the most representative unmarked data is normal or abnormal can be judged manually or by a computer; once it has been judged and marked, it becomes new marked data.
And adding the new marked data into the marked data set to obtain an updated marked data set.
The secondary abnormal data obtained after identification marking of the primary abnormal data are stored into the marked first data set to form a second data set. The marked first data set is the marked data set. The second data set is the updated marked data set obtained after the primary abnormal data are stored in the marked data set.
The new marked data is added to the marked data set, thereby updating it. In this embodiment, the marked $x_{107048}$ is added to the marked data set.
The hypersphere classification model is trained using the marked data in the updated marked data set.
That is, the pre-trained hypersphere classification model is trained through the second data set. The hypersphere classification model is a classification model that fits, for the currently marked data, a hypersphere covering most sample values in a high-dimensional space and identifies abnormal data and normal data by taking the hypersphere surface as a boundary.
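The overall query, mark, retrain and check cycle of steps S2 to S5 can be outlined as a small loop. This is a sketch under stated assumptions: `active_learning_loop` and all of its callables are hypothetical placeholders, removing a marked sample from the unmarked pool is an assumption the text does not state, and the 1-D `toy_fit` and `toy_query` stand-ins exist only to make the example runnable; they are not the hypersphere model of the application.

```python
def active_learning_loop(labeled, unlabeled, fit, query, oracle,
                         should_stop, max_rounds=100):
    """Run query -> mark -> retrain rounds until the termination check passes.

    labeled:   list of (sample, is_abnormal) pairs (the marked data set)
    unlabeled: list of samples (the unmarked data set)
    All callables are hypothetical placeholders supplied by the caller.
    """
    model = fit(labeled)                          # pre-trained model
    for _ in range(max_rounds):
        if not unlabeled:
            break
        idx = query(model, unlabeled)             # S2: pick the sample to mark
        sample = unlabeled.pop(idx)               # assumption: it leaves the pool
        labeled.append((sample, oracle(sample)))  # S3: mark and store
        model = fit(labeled)                      # S3: retrain on the updated set
        if should_stop(model, labeled):           # S4: termination check
            break
    return model


# Toy 1-D stand-ins, for illustration only (not the patent's model).
def toy_fit(labeled):
    normals = [x for x, is_abnormal in labeled if not is_abnormal]
    center = sum(normals) / len(normals)
    radius = max(abs(x - center) for x in normals)
    return center, radius


def toy_query(model, unlabeled):
    center, radius = model
    # Choose the sample closest to the current boundary.
    return min(range(len(unlabeled)),
               key=lambda i: abs(abs(unlabeled[i] - center) - radius))


labeled = [(0.0, False), (1.0, False)]
unlabeled = [0.5, 1.5, 5.0, -1.0]
model = active_learning_loop(
    labeled, unlabeled, toy_fit, toy_query,
    oracle=lambda x: abs(x) > 2.0,                # stands in for the operator
    should_stop=lambda m, marked: len(marked) >= 5,
)
center, radius = model
# S5: classify what is still unmarked with the final model.
remaining_flags = [abs(x - center) - radius >= 0 for x in unlabeled]
print(model, unlabeled, remaining_flags)
```

In this toy run the termination check is simply a budget on the number of marked samples; the application leaves the actual termination condition to the embodiment.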
In some embodiments, as shown in fig. 5, the training method of the hypersphere classification model includes:
S31, respectively setting different penalty coefficients for the abnormal data and the normal data to generate a loss function, wherein the penalty coefficients are preset constants;
S32, after constraint conditions are set, calculating to obtain a sphere center value representing the center position of the hypersphere and a radius value representing the distance between the sphere center of the hypersphere and the surface of the hypersphere in the hypersphere classification model;
S33, generating a decision function for identifying a normal value and an abnormal value according to the sphere center value and the radius value.
In some embodiments, training the hypersphere classification model using marked data in the marked data set includes the following. For the marked data in the current marked data set, a hypersphere model is fitted in the high-dimensional space; the hypersphere model contains a plurality of marked data, and the number of marked data it contains meets a preset condition. The preset condition may be set according to actual needs; for example, the number of marked data located on the surface of or inside the hypersphere model is the largest, or the proportion of marked data located on the surface of or inside the hypersphere model reaches a preset threshold. The sphere center and the radius of the hypersphere model are then determined, giving a hypersphere classification model that classifies data by taking the surface of the hypersphere model as the boundary. With the surface as the classification boundary, data located on the surface of or inside the hypersphere are normal data and data located outside the hypersphere are abnormal data; the original distribution of the marked data does not need to be considered.
The sphere center and the radius of the hypersphere classification model are determined by solving with a loss function and constraint conditions. Because the marked data contain many normal data and few abnormal data, different penalty coefficients are set for the normal data and the abnormal data when constructing the loss function of the classification model, so as to increase the influence of the abnormal data on the classification model. The constructed loss function is as follows:
$$\min_{R,\,a,\,\xi}\; R^2 + C_1 \sum_{i:\,x_i \in L_{in}} \xi_i + C_2 \sum_{j:\,x_j \in L_{out}} \xi_j$$
the constraint conditions are as follows:
$$\|x_i - a\|^2 \le R^2 + \xi_i, \quad i: x_i \in L_{in}$$
$$\|x_j - a\|^2 \ge R^2 - \xi_j, \quad j: x_j \in L_{out}$$
$$\xi_i,\ \xi_j \ge 0$$
where a is the sphere center, R is the radius of the hypersphere model, $\xi_i$ and $\xi_j$ are slack variables, $x_i$ and $x_j$ are marked data, $L_{in}$ is the marked normal data set, $L_{out}$ is the marked abnormal data set, i and j are indices that identify individual data, and the penalty coefficients $C_1$ and $C_2$ are constants in the range 0 to 1.
The above problem is a non-convex optimization problem, and the Lagrange multiplier method cannot find a globally optimal solution. To solve it, the constraint conditions containing the slack variables are expressed in the form of a risk function, which converts the problem expressed by the above formulas into an unconstrained optimization problem, as follows:
$$\xi_i = l\left(R^2 - \|\phi(x_i) - a\|^2\right)$$
$$\xi_j = l\left(\|\phi(x_j) - a\|^2 - R^2\right)$$
where $\phi(x)$ is a transformation mapping function that maps the original data x into a new feature space after feature transformation, and
$l(t)$ is a risk function whose value is $\max\{-t, 0\}$. To combine the sample-independent variables in the risk function $l(t)$ for solving, let $T = R^2 - \|a\|^2$, which gives the optimization objective:
[Equation image in the original publication (optimization objective expressed in terms of T), not reproduced here.]
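To make the role of the risk function concrete, the sketch below evaluates the slack variables $\xi_i = l(R^2 - \|x_i - a\|^2)$ and $\xi_j = l(\|x_j - a\|^2 - R^2)$ with $l(t) = \max\{-t, 0\}$ for a given center and radius. Two things here are assumptions rather than statements from the source: the mapping $\phi$ is taken to be the identity, and the objective is taken to have the common form $R^2 + C_1 \sum \xi_i + C_2 \sum \xi_j$, since the original loss function is only available as an image. The function names and sample values are hypothetical.

```python
import numpy as np


def hinge_risk(t):
    """Risk function l(t) = max(-t, 0), applied elementwise."""
    return np.maximum(-np.asarray(t, dtype=float), 0.0)


def penalized_objective(X_in, X_out, center, radius, c1, c2):
    """Assumed objective: R^2 + c1 * sum(xi_i) + c2 * sum(xi_j), with phi(x) = x."""
    X_in = np.asarray(X_in, dtype=float)
    X_out = np.asarray(X_out, dtype=float)
    center = np.asarray(center, dtype=float)
    sq_in = ((X_in - center) ** 2).sum(axis=1)    # squared distances, normal data
    sq_out = ((X_out - center) ** 2).sum(axis=1)  # squared distances, abnormal data
    xi_in = hinge_risk(radius ** 2 - sq_in)       # > 0 when a normal point is outside
    xi_out = hinge_risk(sq_out - radius ** 2)     # > 0 when an abnormal point is inside
    return radius ** 2 + c1 * xi_in.sum() + c2 * xi_out.sum(), xi_in, xi_out


# Hypothetical marked data: one normal point outside and one abnormal point inside.
value, xi_in, xi_out = penalized_objective(
    X_in=[[0.0, 0.0], [2.0, 0.0]],
    X_out=[[0.5, 0.0], [3.0, 0.0]],
    center=[0.0, 0.0], radius=1.0, c1=0.5, c2=0.25,
)
print(value, xi_in, xi_out)
```

Only misplaced points contribute: a normal point inside the sphere and an abnormal point outside it both have zero slack.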
however, when the risk function l (t) is used, the second derivative of the function does not exist, so that a gradient method cannot be applied for solving, and for this purpose the following risk function l (t) is used:
Figure BDA0002513584950000112
The constant in this risk function constrains the value of t, so that the risk function $l(t)$ is twice differentiable within a small range while its value differs little from that of the initial risk function. Here, based on practical experience, the constant is set to 0.5; substituting the risk function $l(t)$ into the optimization target expression gives:
Figure BDA0002513584950000113
where the elements of the matrix K are $K_{ij} = k(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle = (x_i x_j)$, and $e_i$ denotes the standard basis of $\mathbb{R}^{n+m}$. Solving in dual form, ignoring constant terms, and simplifying gives the loss function:
Figure BDA0002513584950000114
$i, j: x_i, x_j \in L_{in}$; $l, m: x_l, x_m \in L_{out}$
where $(x_i x_j)$ denotes the vector inner product of the i-th sample and the j-th sample.
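For illustration, the kernel matrix K for the linear kernel k(x_i, x_j) = (x_i x_j) may be sketched in Python as follows. The function names `linear_gram_matrix` and `squared_distances_from_gram` and the sample values are hypothetical. The second function relies on the standard identity ||x_i - x_j||^2 = K_ii + K_jj - 2 K_ij to show that distances between samples can be recovered from K alone.

```python
import numpy as np

def linear_gram_matrix(X):
    """K[i, j] = inner product of rows x_i and x_j of X (identity feature map)."""
    X = np.asarray(X, dtype=float)
    return X @ X.T

def squared_distances_from_gram(K):
    """Pairwise squared Euclidean distances computed from the Gram matrix only."""
    diag = np.diag(K)
    return diag[:, None] + diag[None, :] - 2.0 * K

# Hypothetical samples; in the described method the rows would be the marked data.
X = np.array([[1.0, 2.0], [3.0, 4.0], [0.0, 1.0]])
K = linear_gram_matrix(X)
D2 = squared_distances_from_gram(K)
```

Replacing `linear_gram_matrix` with another kernel function would correspond to a different feature mapping φ; the application itself only states the inner-product form.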
Two inequalities of the constraint condition are simplified to obtain:
$\xi_i \ge \|x_i - a\|^2 - R^2,\quad i: x_i \in L_{in}$
$\xi_j \le R^2 - \|x_j - a\|^2,\quad j: x_j \in L_{out}$
$\xi_i$ and $\xi_j$ are then each multiplied by a Lagrange coefficient.
A constrained problem of this kind is typically solved with the Lagrangian method to obtain an optimal solution.
After the above function is solved to obtain the Lagrange multiplier $\alpha_i$ corresponding to each $x_i$, the center of the sphere is calculated as:
Figure BDA0002513584950000121
The value of the center a is then substituted into the loss function, and the radius R is solved by an optimization method;
in this way, an initially trained hypersphere classification model is obtained.
S4, identifying whether the hypersphere classification model reaches a training termination condition.
S5, when the training termination condition is reached, inputting the unlabeled data into the hypersphere classification model obtained under the training termination condition for classification and screening to obtain target abnormal data.
As shown in fig. 6, step S5 includes:
S51, substituting each item of unlabeled data into the decision function to generate a decision result value;
S52, judging whether the decision result value is greater than or equal to zero;
S53, when the decision result value is greater than or equal to zero, outputting the unlabeled data and marking it as target abnormal data.
The query strategy screens primary abnormal data based on the pre-trained hypersphere classification model, and the preset condition is that the weighted distance value from the hypersphere surface of the hypersphere classification model is minimum.
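The query strategy may be sketched in Python as follows, using the steps recited in claims 5 to 7: the nearest spherical distance is the absolute value of the decision function, the nearest neighbor sample distance is the smallest distance to another unlabeled sample, and each is normalized as (value minus minimum) divided by maximum. Several points are assumptions made only for this sketch: the weighted sum `lam * d_sphere + (1 - lam) * d_nn`, the default coefficient 0.5, the Euclidean distance, and the hypothetical names `weighted_distance` and `select_queries`.

```python
import numpy as np

def _normalize(d):
    """(d - min) / max as worded in claims 6 and 7; zeros if max is 0."""
    d_max = d.max()
    return (d - d.min()) / d_max if d_max > 0 else np.zeros_like(d)

def weighted_distance(X_unlabeled, a, R, lam=0.5):
    """Weighted distance value of each unlabeled sample (needs >= 2 samples)."""
    X = np.asarray(X_unlabeled, dtype=float)
    # Nearest spherical distance: |f(x)| with f(x) = ||x - a|| - R.
    d_sphere = np.abs(np.linalg.norm(X - a, axis=1) - R)
    # Nearest neighbor sample distance among the unlabeled data.
    pairwise = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(pairwise, np.inf)  # ignore each sample's distance to itself
    d_nn = pairwise.min(axis=1)
    # Assumed weighting form: convex combination with preset coefficient lam.
    return lam * _normalize(d_sphere) + (1.0 - lam) * _normalize(d_nn)

def select_queries(X_unlabeled, a, R, k=1, lam=0.5):
    """Indices of the k samples with the smallest weighted distance value."""
    return np.argsort(weighted_distance(X_unlabeled, a, R, lam))[:k]

# Hypothetical toy data: unit sphere at the origin, four unlabeled samples.
a = np.array([0.0, 0.0])
X_unl = np.array([[1.1, 0.0], [1.0, 0.2], [5.0, 5.0], [0.0, 0.1]])
picked = select_queries(X_unl, a, R=1.0, k=2)
```

Samples that are both close to the hypersphere surface and located in a dense region receive small scores, which matches the stated intent of selecting informative data while reducing the influence of isolated noise.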
Specifically, if the training of the classification model has not reached the training termination condition, the classification model is retrained using the updated marked data set, and this cycle repeats. As the marked data is continuously updated, the sphere center position, the radius, and the decision function of the hypersphere classification model are adjusted accordingly each time the classification model is retrained. When the variation of the classification model after an iteration is smaller than a preset threshold, the training termination condition is reached; the trained classification model, and thus the final decision function, is then obtained. The decision function $f(x) = \|x - a\| - R$ represents the difference between the distance from the data x to the center a and the radius R. The values of the center a and the radius R of the classification model are finally determined once the training termination condition is reached.
During classification, the unlabeled data $x_i$ is substituted into the decision function f(x) and the sign of the result is checked: if f(x) ≤ 0, the unlabeled system index data is considered normal data; if f(x) > 0, the corresponding system index data is considered abnormal.
In other words, during classification, the distance between the unlabeled data $x_i$ and the center of the hypersphere classification model is calculated and compared with the radius of the hypersphere classification model: if the distance is less than or equal to the radius, the unlabeled data $x_i$ is normal data; if the distance is greater than the radius, the unlabeled data $x_i$ is abnormal.
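The classification rule just described can be sketched in Python as follows. The function names `decision_function` and `classify` and the toy values are hypothetical. The sketch treats f(x) = 0 as normal, following this paragraph; step S53 instead recites "greater than or equal to zero", so the handling of the boundary case is a choice made here rather than something the application settles.

```python
import numpy as np

def decision_function(X, a, R):
    """f(x) = ||x - a|| - R for every row x of X."""
    X = np.atleast_2d(np.asarray(X, dtype=float))
    return np.linalg.norm(X - a, axis=1) - R

def classify(X, a, R):
    """True marks a sample as abnormal (distance to the center exceeds the radius)."""
    return decision_function(X, a, R) > 0

# Hypothetical toy values: radius 5 around the origin.
a = np.array([0.0, 0.0])
X = np.array([[3.0, 4.0], [6.0, 8.0], [1.0, 1.0]])
f = decision_function(X, a, R=5.0)
is_abnormal = classify(X, a, R=5.0)
```

The first sample lies exactly on the surface and is treated as normal, the second lies outside and is treated as abnormal, and the third lies inside.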
For example, the center of the hypersphere classification model obtained by the above steps is
a = 602.94 (92.69%, 3.28%, 3.49%, 52.36%, 495.53, 63, 69.72%, 98, 357, 54, 91.77%, 58.92%). Taking the actual data in Table 1 as an example, each item of data has 12 attribute values. For the data
$x_i$ = (94.76%, 3.76%, 1.29%, 47%, 434.78, 59, 78.37%, 104, 379, 50, 95.47%, 64.55%), substituting into the decision function $f(x) = \|x - a\| - R$ gives f(x) = -49.09 < 0, so the data $x_i$ lies inside the hypersphere of the classification model and $x_i$ can be considered normal data.
For abnormal data, the AI operator can continue to perform root cause analysis and the like, find out the cause of the system abnormality, and give a repair suggestion.
As shown in fig. 7, another embodiment of the present application further provides a data anomaly detection apparatus, including:
the acquisition module 100: configured to acquire unmarked data;
the query module 200: configured to extract primary abnormal data from the unmarked data according to a preset query strategy, wherein the primary abnormal data is the data in the unmarked data that is screened out by the query strategy and meets a preset condition;
the training module 300: configured to identify and mark the primary abnormal data, store it into a marked first data set to form a second data set, and train a pre-trained hypersphere classification model with the second data set, wherein the hypersphere classification model is a classification model that fits, in a high-dimensional space, a hypersphere covering most sample values of the currently marked data and identifies abnormal data and normal data by taking the hypersphere surface as a boundary;
the identification module 400: configured to identify whether the hypersphere classification model reaches a training termination condition;
the result output module 500: configured to, when the training termination condition is reached, input the unlabeled data into the hypersphere classification model obtained under the training termination condition for classification and screening, so as to obtain target abnormal data.
As shown in fig. 8, another embodiment of the present application discloses a computer device 600, which includes a memory 601, a processor 602, and a computer program stored on the memory 601 and executable on the processor 602, wherein the processor 602 implements the data anomaly detection method described above when executing the computer program. The computer device is a device capable of automatically performing numerical calculation and/or information processing in accordance with instructions set or stored in advance. As shown, the computer apparatus 600 includes, but is not limited to, at least a memory 601, a processor 602, and a network interface 603, which may be communicatively coupled to each other via a device bus. Wherein:
In this embodiment, the memory 601 includes at least one type of computer-readable storage medium, including a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the memory 601 may be an internal storage unit of the computer device 600, such as a hard disk or a memory of the computer device 600. In other embodiments, the memory 601 may also be an external storage device of the computer device 600, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash memory card (Flash Card) provided on the computer device 600. Of course, the memory 601 may also include both an internal storage unit and an external storage device of the computer device 600. In this embodiment, the memory 601 is generally used for storing the operating system and various application software installed in the computer device 600, such as the program code of the data anomaly detection apparatus 500. In addition, the memory 601 can also be used to temporarily store various types of data that have been output or are to be output.
In some embodiments, the processor 602 may be a Central Processing Unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 602 is typically used to control the overall operation of the computer device 600. In this embodiment, the processor 602 is configured to run the program code stored in the memory 601 or to process data, for example to run the data anomaly detection apparatus 500, so as to implement the data anomaly detection method in each of the above embodiments.
The network interface 603 may include a wireless network interface or a wired network interface, and the network interface 603 is generally used for establishing a communication connection between the computer apparatus 600 and other electronic devices. For example, the network interface 603 is used to connect the computer device 600 to an external terminal through a network, establish a data transmission channel and a communication connection between the computer device 600 and the external terminal, and the like. The network may be a wireless or wired network such as an Intranet (Intranet), the Internet (Internet), a Global System of Mobile communication (GSM), Wideband Code Division Multiple Access (WCDMA), 4G network, 5G network, Bluetooth (Bluetooth), Wi-Fi, and the like.
It is noted that fig. 8 only shows the computer device 600 with components 601 to 603, but it is to be understood that not all of the shown components are required to be implemented, and that more or fewer components may be implemented instead.
In this embodiment, the data anomaly detection apparatus 500 stored in the memory 601 may be further divided into one or more program modules, which are stored in the memory 601 and executed by one or more processors (in this embodiment, the processor 602) to complete the data anomaly detection method of the present invention.
The present embodiment also provides a computer-readable storage medium, such as a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, a server, an app store, etc., on which a computer program is stored that implements the corresponding functions when executed by a processor. The computer-readable storage medium of this embodiment is used for storing the data anomaly detection apparatus 500, so as to implement the data anomaly detection method of the present invention when executed by the processor.
Another embodiment of the present application also discloses a computer-readable storage medium, on which a computer program is stored, the program being executed by a processor to implement the data anomaly detection method described above.
As shown in fig. 9, another embodiment of the present application provides a data anomaly detection method, including:
S00, acquiring unlabeled data.
S10, training a hypersphere classification model using the marked data in the marked data set.
S20, extracting primary abnormal data from the unmarked data according to a preset query strategy, wherein the primary abnormal data is the data in the unmarked data that is screened out by the query strategy and meets a preset condition.
S30, identifying and marking the primary abnormal data to obtain new marked data.
S40, adding the new marked data to the marked data set to obtain an updated marked data set.
S50, training the hypersphere classification model using the marked data in the updated marked data set.
S60, identifying whether the hypersphere classification model reaches a training termination condition; if the training termination condition is reached, go to step S70; if the training termination condition is not reached, return to step S20.
S70, classifying the unlabeled data using the hypersphere classification model, that is, inputting the unlabeled data into the hypersphere classification model obtained under the training termination condition for classification and screening to obtain target abnormal data.
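The loop formed by steps S00 to S70 can be sketched in Python as follows. This is a simplified illustration, not the solver of the application: `fit_hypersphere` is a hypothetical stand-in that takes the mean of the marked normal samples as the center and a quantile of their distances as the radius; the query step uses only the distance to the sphere surface; the change measure used for the termination condition, the batch size, the threshold, and the `oracle` labeling function are all assumptions.

```python
import numpy as np

def fit_hypersphere(X, y, quantile=0.95):
    """Stand-in trainer (assumption): center = mean of marked normal samples
    (label 0), radius = a quantile of their distances to that center."""
    normal = X[y == 0]
    a = normal.mean(axis=0)
    R = np.quantile(np.linalg.norm(normal - a, axis=1), quantile)
    return a, R

def active_learning(X_lab, y_lab, X_unl, oracle, batch=5, tol=1e-3, max_rounds=20):
    pool = np.arange(len(X_unl))                  # indices still unmarked
    a, R = fit_hypersphere(X_lab, y_lab)          # S10: initial training
    for _ in range(max_rounds):
        if len(pool) == 0:
            break
        # S20: query the pool samples closest to the hypersphere surface.
        f = np.linalg.norm(X_unl[pool] - a, axis=1) - R
        picked = pool[np.argsort(np.abs(f))[:batch]]
        # S30/S40: obtain marks for the queried samples and extend the marked set.
        X_lab = np.vstack([X_lab, X_unl[picked]])
        y_lab = np.concatenate([y_lab, oracle(X_unl[picked])])
        pool = np.setdiff1d(pool, picked)
        # S50: retrain on the updated marked data set.
        a_new, R_new = fit_hypersphere(X_lab, y_lab)
        # S60: stop once the model changes by less than the preset threshold.
        change = np.linalg.norm(a_new - a) + abs(R_new - R)
        a, R = a_new, R_new
        if change < tol:
            break
    # S70: classify the unlabeled data with the final model.
    is_abnormal = np.linalg.norm(X_unl - a, axis=1) - R > 0
    return a, R, is_abnormal

# Hypothetical data: 100 typical samples plus 5 far-away samples.
rng = np.random.default_rng(0)
X_lab = rng.normal(size=(30, 2))
y_lab = np.zeros(30, dtype=int)
X_unl = np.vstack([rng.normal(size=(100, 2)), rng.normal(loc=10.0, size=(5, 2))])
oracle = lambda X: (np.linalg.norm(X, axis=1) > 4.0).astype(int)  # stands in for the operator
a, R, is_abnormal = active_learning(X_lab, y_lab, X_unl, oracle)
```

In practice the stand-in trainer would be replaced by the loss-function-based solution described earlier, and the query step by the weighted distance of the query strategy; the control flow of the loop stays the same.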
According to the data anomaly detection method provided by the embodiments of the application, an unsupervised learning method and a supervised learning method are combined, starting from the data itself; the hypersphere classification model constructed from a small amount of marked data is not restricted by the original distribution of the data and therefore has a wider application range. Moreover, the query strategy based on the boundary distance and the sample density can accurately find the most valuable data and reduce the influence of noise, which greatly reduces the amount of data that operators need to mark while ensuring data classification precision, saves the cost of artificial intelligence operation, is better suited to actual industrial scenarios, and facilitates large-scale deployment.
The embodiments of the present disclosure have been described above. However, these examples are for illustrative purposes only and are not intended to limit the scope of the present disclosure. The scope of the disclosure is defined by the appended claims and equivalents thereof. Various alternatives and modifications can be devised by those skilled in the art without departing from the scope of the present disclosure, and such alternatives and modifications are intended to be within the scope of the present disclosure.

Claims (10)

1. A data anomaly detection method is characterized by comprising the following steps:
acquiring unmarked data;
extracting primary abnormal data from the unmarked data according to a preset query strategy, wherein the primary abnormal data is data which meets a preset condition in the unmarked data screened out by the query strategy;
the primary abnormal data are subjected to identification marking and then stored into a marked first data set to form a second data set, and a pre-trained hypersphere classification model is trained through the second data set, wherein the hypersphere classification model is a classification model that fits, in a high-dimensional space, a hypersphere covering most sample values of the currently marked data and identifies abnormal data and normal data by taking the hypersphere surface as a boundary;
identifying whether the hypersphere classification model reaches a training termination condition;
and when the training termination condition is reached, inputting the unlabeled data into the hypersphere classification model under the training termination condition for classification screening to obtain target abnormal data.
2. The method of claim 1, wherein the training method of the hypersphere classification model comprises:
respectively setting different penalty coefficients for the abnormal data and the normal data to generate a loss function, wherein the penalty coefficients are constants within a preset range;
after constraint conditions are set, calculating to obtain a sphere center value representing the central position of the hyper-sphere and a radius value representing the distance between the sphere center value of the hyper-sphere and the surface of the hyper-sphere in the hyper-sphere classification model;
and generating a decision function for identifying a normal value and an abnormal value according to the sphere center value and the radius value.
3. The method of claim 2, wherein when the training termination condition is reached, inputting the unlabeled data into the hypersphere classification model under the training termination condition for classification screening to obtain target abnormal data, comprising:
substituting each item of unlabeled data into the decision function to generate a decision result value;
judging whether the decision result value is greater than or equal to zero;
and when the decision result value is greater than or equal to zero, outputting the unlabeled data and marking it as target abnormal data.
4. The method of claim 2, wherein the query strategy screens primary anomaly data based on the pre-trained hypersphere classification model, and the predetermined condition is that a weighted distance value from a hypersphere surface of the hypersphere classification model is minimum.
5. The method according to claim 4, wherein the extracting primary abnormal data from the unlabeled data according to a preset query policy comprises:
bringing the unmarked data into the decision function and taking an absolute value to obtain a nearest spherical distance;
calculating the distance value between the unmarked data, and taking the minimum value as the nearest neighbor sample distance;
and normalizing the nearest spherical distance and the nearest neighbor sample distance, and weighting by using a preset coefficient to obtain a weighted distance value of each unlabeled data.
6. The method of claim 5, wherein the nearest sphere distance normalization processing method comprises:
selecting a first minimum value with a minimum value and a first maximum value with a maximum value from the nearest spherical distances of all the unmarked data;
dividing the difference between the nearest spherical distance of each unmarked data and the first minimum value by the first maximum value to obtain the normalized nearest spherical distance corresponding to all the unmarked data.
7. The method of claim 5, wherein the nearest neighbor sample distance normalization processing method comprises:
selecting a second minimum value with the smallest value and a second maximum value with the largest value from the nearest neighbor sample distances of all the unlabeled data;
and calculating the difference between the nearest neighbor sample distance of each unlabeled data and the second minimum value, and dividing the difference by the second maximum value to obtain the normalized nearest neighbor sample distances of all the unlabeled data.
8. A data abnormality detection apparatus, characterized by comprising:
an acquisition module: configured to acquire unmarked data;
a query module: configured to extract primary abnormal data from the unmarked data according to a preset query strategy, wherein the primary abnormal data is the data in the unmarked data that is screened out by the query strategy and meets a preset condition;
a training module: configured to identify and mark the primary abnormal data, store it into a marked first data set to form a second data set, and train a pre-trained hypersphere classification model with the second data set, wherein the hypersphere classification model is a classification model that fits, in a high-dimensional space, a hypersphere covering most sample values of the currently marked data and identifies abnormal data and normal data by taking the hypersphere surface as a boundary;
an identification module: configured to identify whether the hypersphere classification model reaches a training termination condition;
a result output module: configured to, when the training termination condition is reached, input the unlabeled data into the hypersphere classification model obtained under the training termination condition for classification and screening, so as to obtain target abnormal data.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the data anomaly detection method according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium on which a computer program is stored, the program being executable by a processor to implement the data anomaly detection method according to any one of claims 1 to 7.
CN202010468770.4A 2020-05-28 2020-05-28 Data anomaly detection method, device, equipment and storage medium Pending CN111813618A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010468770.4A CN111813618A (en) 2020-05-28 2020-05-28 Data anomaly detection method, device, equipment and storage medium
PCT/CN2020/118524 WO2021139249A1 (en) 2020-05-28 2020-09-28 Data anomaly detection method, apparatus and device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010468770.4A CN111813618A (en) 2020-05-28 2020-05-28 Data anomaly detection method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN111813618A true CN111813618A (en) 2020-10-23

Family

ID=72847794

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010468770.4A Pending CN111813618A (en) 2020-05-28 2020-05-28 Data anomaly detection method, device, equipment and storage medium

Country Status (2)

Country Link
CN (1) CN111813618A (en)
WO (1) WO2021139249A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112306835A (en) * 2020-11-02 2021-02-02 平安科技(深圳)有限公司 User data monitoring and analyzing method, device, equipment and medium
CN113590392A (en) * 2021-06-30 2021-11-02 中国南方电网有限责任公司超高压输电公司昆明局 Converter station equipment abnormality detection method and device, computer equipment and storage medium
CN113687972A (en) * 2021-08-30 2021-11-23 中国平安人寿保险股份有限公司 Method, device and equipment for processing abnormal data of business system and storage medium
CN117333486A (en) * 2023-11-30 2024-01-02 清远欧派集成家居有限公司 UV finish paint performance detection data analysis method, device and storage medium

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114443635B (en) * 2022-01-20 2024-04-09 广西壮族自治区林业科学研究院 Data cleaning method and device in soil big data analysis

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11275975B2 (en) * 2017-10-05 2022-03-15 Applied Materials, Inc. Fault detection classification
CN110555054B (en) * 2018-06-15 2023-06-09 泉州信息工程学院 Data classification method and system based on fuzzy double-supersphere classification model
CN110320894B (en) * 2019-08-01 2022-04-15 陕西工业职业技术学院 Thermal power plant pulverizing system fault detection method capable of accurately dividing aliasing area data categories
CN110825545A (en) * 2019-08-31 2020-02-21 武汉理工大学 Cloud service platform anomaly detection method and system
CN110796172A (en) * 2019-09-27 2020-02-14 北京淇瑀信息科技有限公司 Sample label determination method and device for financial data and electronic equipment

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112306835A (en) * 2020-11-02 2021-02-02 平安科技(深圳)有限公司 User data monitoring and analyzing method, device, equipment and medium
CN112306835B (en) * 2020-11-02 2024-05-28 平安科技(深圳)有限公司 User data monitoring and analyzing method, device, equipment and medium
CN113590392A (en) * 2021-06-30 2021-11-02 中国南方电网有限责任公司超高压输电公司昆明局 Converter station equipment abnormality detection method and device, computer equipment and storage medium
CN113590392B (en) * 2021-06-30 2024-04-02 中国南方电网有限责任公司超高压输电公司昆明局 Converter station equipment abnormality detection method, device, computer equipment and storage medium
CN113687972A (en) * 2021-08-30 2021-11-23 中国平安人寿保险股份有限公司 Method, device and equipment for processing abnormal data of business system and storage medium
CN113687972B (en) * 2021-08-30 2023-07-25 中国平安人寿保险股份有限公司 Processing method, device, equipment and storage medium for abnormal data of business system
CN117333486A (en) * 2023-11-30 2024-01-02 清远欧派集成家居有限公司 UV finish paint performance detection data analysis method, device and storage medium
CN117333486B (en) * 2023-11-30 2024-03-22 清远欧派集成家居有限公司 UV finish paint performance detection data analysis method, device and storage medium

Also Published As

Publication number Publication date
WO2021139249A1 (en) 2021-07-15


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination