CN111813618A

CN111813618A - Data anomaly detection method, device, equipment and storage medium

Info

Publication number: CN111813618A
Application number: CN202010468770.4A
Authority: CN
Inventors: 邓悦; 郑立颖; 徐亮
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2020-05-28
Filing date: 2020-05-28
Publication date: 2020-10-23
Also published as: WO2021139249A1

Abstract

The application discloses a data anomaly detection method, device, equipment and storage medium. The method comprises the following steps: acquiring unlabeled data; extracting primary abnormal data from the unlabeled data according to a preset query strategy; identifying and labeling the primary abnormal data, storing it into a labeled first data set to form a second data set, and training a pre-trained hypersphere classification model with the second data set; identifying whether the hypersphere classification model reaches a training termination condition; and, when the training termination condition is reached, inputting the unlabeled data into the resulting hypersphere classification model for classification screening to obtain target abnormal data. The method trains the classification model with a small amount of labeled data and uses the model obtained once the training termination condition is reached to classify the unlabeled data. It places no restriction on the original distribution of the data, reduces the amount of data that operators must label, and yields highly accurate classification results.

Description

Data anomaly detection method, device, equipment and storage medium

Technical Field

The application relates to the technical field of data processing in artificial intelligence, in particular to a data anomaly detection method, device, equipment and storage medium.

Background

Monitoring of computer systems is an important component of intelligent operations (AIOps). During monitoring, the CPU, disks and other components of a computer system generate a large amount of index data, which also includes some abnormal values. Analyzing the abnormal points makes it possible to find the cause of a system abnormality and to provide suggestions for subsequent operation. Anomaly detection technology therefore plays an important role in the field of intelligent operations.

Conventional anomaly detection includes statistical-based methods and density-based methods.

The statistics-based method usually trains on a large amount of labeled data to find suspected abnormal points, and belongs to supervised learning. Past experience shows that supervised learning has the following problems in practical anomaly detection:

1. Most of the mass data generated while programs run is unlabeled, and labeling is usually done by professionals, so obtaining enough labels consumes a large amount of manpower, material and financial resources.

2. Abnormal data make up only a small proportion of the data, so potential abnormal points and their classifications have to be found within a large amount of data.

The density-based method belongs to unsupervised learning and needs no data labeling, but its detection accuracy is generally low and its classification results lack theoretical support.

Disclosure of Invention

The application aims to provide a data anomaly detection method, a data anomaly detection device, data anomaly detection equipment and a storage medium. The following presents a simplified summary in order to provide a basic understanding of some aspects of the disclosed embodiments. This summary is not an extensive overview and is intended to neither identify key/critical elements nor delineate the scope of such embodiments. Its sole purpose is to present some concepts in a simplified form as a prelude to the more detailed description that is presented later.

According to an aspect of an embodiment of the present application, there is provided a data anomaly detection method, including:

acquiring unmarked data;

extracting primary abnormal data from the unmarked data according to a preset query strategy, wherein the primary abnormal data is data which meets a preset condition in the unmarked data screened out by the query strategy;

identifying and marking the primary abnormal data and storing it into a marked first data set to form a second data set, and training a pre-trained hypersphere classification model through the second data set, wherein the hypersphere classification model is a classification model which fits, for the currently marked data, a hypersphere covering most sample values in a high-dimensional space, and which identifies abnormal data and normal data by taking the hypersphere surface as a boundary;

identifying whether the hypersphere classification model reaches a training termination condition;

and when the training termination condition is reached, inputting the unlabeled data into the hypersphere classification model under the training termination condition for classification screening to obtain target abnormal data.

Further, the training method of the hypersphere classification model comprises the following steps:

respectively setting different penalty coefficients for the abnormal data and the normal data to generate a loss function, wherein the penalty coefficients are preset constants;

after constraint conditions are set, calculating to obtain a sphere center value representing the central position of the hyper-sphere and a radius value representing the distance between the sphere center value of the hyper-sphere and the surface of the hyper-sphere in the hyper-sphere classification model;

and generating a decision function for identifying a normal value and an abnormal value according to the sphere center value and the radius value.

Further, when the training termination condition is reached, inputting the unlabeled data into the hypersphere classification model under the training termination condition for classification and screening to obtain target abnormal data, including:

respectively substituting the unlabeled data into the decision functions to generate decision result values;

judging whether the decision result value is greater than or equal to zero;

and when the decision result value is greater than or equal to zero, outputting the unmarked data and marking the unmarked data as target abnormal data.

Further, the query strategy screens primary abnormal data based on the pre-trained hypersphere classification model, and the preset condition is that a weighted distance value from the hypersphere surface of the hypersphere classification model is minimum.

Further, the extracting primary abnormal data from the unmarked data according to a preset query policy includes:

bringing the unmarked data into the decision function and taking an absolute value to obtain a nearest spherical distance;

calculating the distance value between the unmarked data, and taking the minimum value as the nearest neighbor sample distance;

and normalizing the nearest spherical distance and the nearest neighbor sample distance, and weighting by using a preset coefficient to obtain a weighted distance value of each unlabeled data.

Further, the method for the nearest spherical distance normalization processing comprises the following steps:

selecting a first minimum value with a minimum value and a first maximum value with a maximum value from the nearest spherical distances of all the unmarked data;

dividing the difference between the nearest spherical distance of each unmarked data and the first minimum value by the first maximum value to obtain the normalized nearest spherical distance corresponding to all the unmarked data.

Further, the method for the nearest neighbor sample distance normalization processing comprises the following steps:

selecting a second minimum value with the smallest value and a second maximum value with the largest value from the nearest neighbor sample distances of all the unlabeled data;

and dividing the difference between the nearest neighbor sample distance of each unlabeled data and the second minimum value by the second maximum value to obtain the normalized nearest neighbor sample distances corresponding to all the unlabeled data.

According to another aspect of the embodiments of the present application, there is provided a data anomaly detection apparatus including:

an acquisition module: configured to acquire unmarked data;

a query module: configured to extract primary abnormal data from the unmarked data according to a preset query strategy, wherein the primary abnormal data are data which meet a preset condition in the unmarked data screened out by the query strategy;

a training module: configured to identify and mark the primary abnormal data, store the primary abnormal data into a marked first data set to form a second data set, and train a pre-trained hypersphere classification model through the second data set, wherein the hypersphere classification model is a classification model which fits, for the currently marked data, a hypersphere covering most sample values in a high-dimensional space, and which identifies abnormal data and normal data by taking the hypersphere surface as a boundary;

an identification module: configured to identify whether the hypersphere classification model reaches a training termination condition;

a result output module: configured to input, when the training termination condition is reached, the unlabeled data into the hypersphere classification model under the training termination condition for classification screening, so as to obtain target abnormal data.

According to another aspect of the embodiments of the present application, there is provided a computer device including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the data anomaly detection method described above when executing the computer program.

According to another aspect of embodiments of the present application, there is provided a computer-readable storage medium having a computer program stored thereon, the computer program being executed by a processor to implement the data anomaly detection method described above.

The technical scheme provided by one aspect of the embodiment of the application can have the following beneficial effects:

The data anomaly detection method provided by the embodiment of the application trains a hypersphere classification model with a small amount of marked data. Once the training termination condition is reached, the hypersphere classification model is used to classify the unmarked data; otherwise, training of the hypersphere classification model continues with the updated marked data. The method combines an unsupervised method with a supervised method. A hypersphere classification model trained with a small amount of marked data places no restriction on the original distribution of the data, so its application range is wider. The query strategy, based on boundary distance and sample density, can accurately find the most valuable data and reduce the influence of noise. The amount of data that operators need to mark is greatly reduced while the classification precision of the hypersphere classification model is ensured, which saves the cost of artificial intelligence operation, better suits practical industrial scenarios, and facilitates large-scale deployment.

Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the embodiments of the application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments described in the present application, and other drawings can be obtained by those skilled in the art without creative efforts.

FIG. 1 shows a flow diagram of a data anomaly detection method of an embodiment of the present application;

FIG. 2 is a flowchart illustrating steps involved in extracting primary anomalous data from the unlabeled data according to a predetermined query policy, according to an embodiment of the present application;

FIG. 3 is a flowchart of a method for nearest sphere distance normalization processing according to an embodiment of the present application;

FIG. 4 is a flow diagram illustrating a method for nearest neighbor sample distance normalization in accordance with an embodiment of the present application;

FIG. 5 illustrates a flowchart of a method of training a hypersphere classification model of an embodiment of the present application;

FIG. 6 is a flowchart illustrating the steps involved in entering the unlabeled data into the hypersphere classification model under training termination conditions for classification screening to obtain target anomaly data according to an embodiment of the present application;

FIG. 7 is a block diagram of a data anomaly detection apparatus according to an embodiment of the present application;

FIG. 8 shows a hardware architecture diagram of a computer device of an embodiment of the present application;

FIG. 9 shows a flow chart of a data anomaly detection method according to another embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is further described with reference to the accompanying drawings and specific embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

As shown in fig. 1, an embodiment of the present application provides a data anomaly detection method, including:

S1, acquiring unlabeled data.

In the actual intelligent operation process, data generated by a computer system is often unbalanced, and most of the data belong to normal data, so that abnormal data detection in the operation process can be regarded as a single classification problem. Considering that the monitoring index data of the computer system is distributed in the high-dimensional space, the trained classification model needs to have the capability of distinguishing whether the high-dimensional space data is normal or not. Data generated by a computer system is divided into marked data and unmarked data, the marked data is divided into a marked data set, and the unmarked data is divided into an unmarked data set. The classification model may also be referred to as a classifier.

For example, the monitoring index data of the computer system in one embodiment is shown in table 1:

TABLE 1 System monitoring index data

S2, extracting primary abnormal data from the unmarked data according to a preset query strategy, wherein the primary abnormal data is data which meets a preset condition in the unmarked data screened by the query strategy.

Because operators have limited capacity and there is a large amount of unmarked data, they cannot mark the unmarked data one by one. The query strategy is therefore used to decide which unmarked data are selected for the operators to mark.

The query strategy screens primary abnormal data based on the pre-trained hypersphere classification model, and the preset condition is that the weighted distance value from the hypersphere surface of the hypersphere classification model is minimum.

In certain embodiments, as shown in fig. 2, step S2 includes:

S21, substituting the unmarked data into the decision function and taking an absolute value to obtain the nearest spherical distance;

S22, calculating the distance values between the unlabeled data, and taking the minimum value as the nearest neighbor sample distance;

S23, normalizing the nearest spherical distance and the nearest neighbor sample distance, and weighting by a preset coefficient to obtain a weighted distance value of each unlabeled data.

The surface of the hypersphere classification model is the key to distinguishing whether index data is normal, and it is the most uncertain region in the high-dimensional space. Therefore, the distance from the data x to the surface of the hypersphere classification model is taken as a measure and is recorded as the nearest spherical distance |f(x)|.

In addition, the more concentrated the data distribution in the region crossed by the surface of the hypersphere classification model, the more representative the data. Therefore, the distance between a data item and its nearest neighboring data item is used to measure the distribution density, and is recorded as the nearest neighbor sample distance d(x, NN₁(x)). The larger the distribution density, the smaller the nearest neighbor sample distance. Therefore, when two points are at the same distance from the boundary, the sample with the higher density in its vicinity (that is, the smaller nearest neighbor sample distance) is preferentially selected.

So the query strategy selects the data with the smallest weighted distance each time. The data with the smallest weighted distance is the most representative data, i.e. the primary abnormal data.
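The two quantities used by the query strategy can be sketched in Python with NumPy as follows. This is a minimal illustration, not the patented implementation: the function names, the Euclidean distance, and the toy values of `U`, `a` and `R` are assumptions made for the example.

```python
import numpy as np

def nearest_sphere_distance(X, a, R):
    """|f(x)| with f(x) = ||x - a|| - R, computed for every row x of X."""
    return np.abs(np.linalg.norm(X - a, axis=1) - R)

def nearest_neighbor_distance(X):
    """d(x, NN_1(x)): distance from each row of X to its closest other row."""
    diff = X[:, None, :] - X[None, :, :]   # pairwise differences, shape (n, n, d)
    dist = np.linalg.norm(diff, axis=2)    # pairwise Euclidean distances
    np.fill_diagonal(dist, np.inf)         # exclude each point's distance to itself
    return dist.min(axis=1)

# Hypothetical toy data: four unlabeled 2-D points and a sphere centred at the origin.
U = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 0.0], [3.0, 1.0]])
a = np.zeros(2)
R = 2.0

sphere_dist = nearest_sphere_distance(U, a, R)
nn_dist = nearest_neighbor_distance(U)
```

The pairwise computation uses memory quadratic in the number of unlabeled points, so a nearest-neighbor index would likely be preferable for large data sets.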

As shown in fig. 3, the method for normalizing the nearest spherical distance includes:

S231, selecting a first minimum value with a minimum numerical value and a first maximum value with a maximum numerical value from the nearest spherical distances of all the unmarked data;

S232, dividing the difference between the nearest spherical distance of each unmarked data and the first minimum value by the first maximum value to obtain the normalized nearest spherical distance corresponding to all the unmarked data.

In actual operation, to calculate the normalized nearest spherical distance of all unmarked data, each unmarked data item is substituted into the decision function f(x) = ||x − a|| − R and the absolute value is taken to obtain its nearest spherical distance |f(x)|. The minimum value and the maximum value over all |f(x)| are recorded as

$$|f(x_1)| = \min_{x \in U} |f(x)|, \qquad |f(x_2)| = \max_{x \in U} |f(x)|$$

where U denotes the unlabeled data set: |f(x)| takes its minimum value when x = x₁ and its maximum value when x = x₂. The decision function f(x) = ||x − a|| − R represents the difference between the distance from the data x to the sphere center a and the radius R. The distance between a data item and the sphere center of the classification model may be referred to as the sphere-center distance of that data item.

Subtracting the minimum value from each |f(x)| and dividing the result by the maximum value gives the normalized nearest spherical distance of all the data:

$$\widetilde{|f(x)|} = \frac{|f(x)| - |f(x_1)|}{|f(x_2)|}$$

As shown in fig. 4, the method for the nearest neighbor sample distance normalization processing includes:

S231', selecting a second minimum value with the smallest value and a second maximum value with the largest value from the nearest neighbor sample distances of all the unmarked data;

S232', calculating the difference between the nearest neighbor sample distance of each unlabeled data and the second minimum value, and dividing the difference by the second maximum value to obtain the normalized nearest neighbor sample distance of all the unlabeled data.

Specifically, to calculate the normalized nearest neighbor sample distance of all unlabeled data, the distances between each data item x and all other data are calculated. The point at the minimum distance is the nearest neighbor NN₁(x) of x, and that minimum distance is the nearest neighbor sample distance d(x, NN₁(x)). The minimum and maximum of the nearest neighbor sample distances of all data are recorded as

$$d(x_3, NN_1(x_3)) = \min_{x \in U} d(x, NN_1(x)), \qquad d(x_4, NN_1(x_4)) = \max_{x \in U} d(x, NN_1(x))$$

where U denotes the unlabeled data set: d(x, NN₁(x)) takes its minimum value when x = x₃ and its maximum value when x = x₄.

The normalization operation is then carried out: the minimum of all nearest neighbor sample distances is subtracted from the nearest neighbor sample distance of each data item, and the difference is divided by the maximum of all nearest neighbor sample distances, giving the normalized nearest neighbor sample distance of all the data:

$$\tilde{d}(x, NN_1(x)) = \frac{d(x, NN_1(x)) - d(x_3, NN_1(x_3))}{d(x_4, NN_1(x_4))}$$
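The normalization described for both distances can be sketched as a single helper. Note that it divides by the maximum, as the text states, rather than by the range (maximum minus minimum) used in conventional min-max scaling. The function name and the handling of an all-zero input are assumptions added for the example.

```python
import numpy as np

def normalize_as_described(values):
    """Return (v - min) / max for each value, the form given in the text."""
    values = np.asarray(values, dtype=float)
    v_min, v_max = values.min(), values.max()
    if v_max == 0:                      # all distances are zero (assumed handling)
        return np.zeros_like(values)
    return (values - v_min) / v_max

# Hypothetical distances for four unlabeled points.
sphere_dist = np.array([1.0, 0.0, 1.0, 4.0])
nn_dist = np.array([2.0, 4.0, 6.0])

sphere_norm = normalize_as_described(sphere_dist)
nn_norm = normalize_as_described(nn_dist)
```

With this form the largest normalized value is 1 only when the minimum is 0; otherwise the result lies in the interval from 0 to 1 − min/max.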

The normalized nearest spherical distance and the normalized nearest neighbor sample distance of each unmarked data item are each weighted with a coefficient of 0.5 to obtain the corresponding weighted distance:

$$w(x) = 0.5\,\widetilde{|f(x)|} + 0.5\,\tilde{d}(x, NN_1(x))$$

where w(x) denotes the weighted distance and the tilde denotes a normalized value.

The weighted distances of all data are sorted from small to large, and the first five data are taken as follows:

TABLE 2 weighted distance top five of unlabeled data
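The weighting and the selection of the five smallest weighted distances can be sketched as follows. The input arrays are hypothetical normalized distances, not the values of Table 2, and the function names are assumptions made for the example.

```python
import numpy as np

def weighted_distance(sphere_norm, nn_norm, coef=0.5):
    """Weight both normalized distances by the same coefficient and add them."""
    return coef * np.asarray(sphere_norm) + coef * np.asarray(nn_norm)

def select_queries(weights, k=5):
    """Indices of the k smallest weighted distances, in ascending order."""
    return np.argsort(weights, kind="stable")[:k]

# Hypothetical normalized distances for six unlabeled points.
sphere_norm = np.array([0.25, 0.0, 0.25, 1.0, 0.5, 0.1])
nn_norm = np.array([0.5, 0.6, 0.0, 0.0, 0.2, 1.0])

weights = weighted_distance(sphere_norm, nn_norm)
top_five = select_queries(weights, k=5)
```

The first index in `top_five` identifies the point with the minimum weighted distance, which is the one the query strategy would hand to the operator.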

S3, identifying and marking the primary abnormal data and storing it into the marked first data set to form a second data set, and training a pre-trained hypersphere classification model through the second data set, wherein the hypersphere classification model is a classification model which fits, for the currently marked data, a hypersphere covering most sample values in a high-dimensional space, and which identifies abnormal data and normal data by taking the hypersphere surface as a boundary.

The primary abnormal data are identified and marked to obtain new marked data. Whether the primary abnormal data, that is, the most representative unmarked data, are normal or abnormal can be judged manually or by a computer; the data are marked according to the identification result, and the marked primary abnormal data become the new marked data.

In this embodiment, according to the rule of the query strategy, the unmarked data x₁₀₇₀₄₈ with the minimum weighted distance is selected and handed to the AI operators for judgment and marking, to obtain new marked data.

The new marked data are added to the marked data set to obtain an updated marked data set. In other words, the data obtained after the primary abnormal data are identified and marked are stored into the marked first data set to form the second data set: the first data set is the marked data set, and the second data set is the updated marked data set obtained after the marked primary abnormal data are stored in it. In this embodiment, the marked x₁₀₇₀₄₈ is added to the marked data set.

The hypersphere classification model is then trained using the marked data in the updated marked data set, that is, the pre-trained hypersphere classification model is trained through the second data set. The hypersphere classification model is a classification model which fits, for the currently marked data, a hypersphere covering most sample values in a high-dimensional space, and which identifies abnormal data and normal data by taking the hypersphere surface as a boundary.

In some embodiments, as shown in fig. 5, the training method of the hypersphere classification model includes:

S31, respectively setting different penalty coefficients for the abnormal data and the normal data to generate a loss function, wherein the penalty coefficients are preset constants;

S32, after constraint conditions are set, calculating a sphere center value representing the center position of the hypersphere and a radius value representing the distance between the sphere center of the hypersphere and the surface of the hypersphere in the hypersphere classification model;

S33, generating a decision function for identifying a normal value and an abnormal value according to the sphere center value and the radius value.

In some embodiments, training the hypersphere classification model using marked data in the marked data set includes: fitting, for the marked data in the current marked data set, a hypersphere model in the high-dimensional space, wherein the hypersphere model contains a plurality of marked data and the number of marked data it contains meets a preset condition; and determining the sphere center and the radius of the hypersphere model, so as to obtain a hypersphere classification model that classifies data by taking the surface of the hypersphere model as a boundary. The preset condition may be set according to actual needs: for example, the number of marked data located on the surface of and inside the hypersphere model is the largest, or the proportion of marked data located on the surface of and inside the hypersphere model reaches a preset threshold. With the surface of the hypersphere classification model as the classification boundary, data located on the surface of or inside the hypersphere classification model are normal data and data located outside it are abnormal data; the original distribution of the marked data does not need to be considered.

The sphere center and the radius of the hypersphere classification model are solved using a loss function and constraint conditions. Because the marked data contain many normal data and few abnormal data, different penalty coefficients are set for the normal data and the abnormal data when constructing the loss function of the classification model, so as to increase the influence of the abnormal data on the classification model. The constructed loss function is therefore as follows:

$$\min_{R,\,a,\,\xi}\; R^2 + C_1 \sum_{i} \xi_i + C_2 \sum_{j} \xi_j$$

the constraint conditions are as follows:

||x_i-a||² ≤ R²+ξ_i, i: x_i ∈ L_in

||x_j-a||² ≥ R²-ξ_j, j: x_j ∈ L_out

ξ_i, ξ_j ≥ 0

where a is the sphere center, R is the radius of the hypersphere model, ξ_i and ξ_j are slack variables, x_i and x_j are marked data, i and j are indices distinguishing different data, L_in is the marked normal data set, L_out is the marked abnormal data set, and the penalty coefficients C1 and C2 are constants ranging from 0 to 1.

The above problem is a non-convex optimization problem, for which the Lagrange multiplier method cannot find a global optimal solution. To solve it, the constraint conditions containing the slack variables are expressed in the form of a risk function, thereby converting the problem expressed by the above formula into an unconstrained optimization problem, as follows:

ξ_i＝l(R²-||φ(x_i)-a||²)

ξ_j＝l(||φ(x_j)-a||²-R²)

φ(x) is a transformation mapping function, used to map the original data x into a new feature space after feature transformation;

l(t) is a risk function whose value is max{-t, 0}. To combine the sample-independent variables in the risk function l(t) for solving, let T = R² - ||a||², obtaining the optimization objective:
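The substitution of the slack variables by the risk function can be checked numerically with a short sketch. It assumes that φ is the identity mapping and that the loss has the form R² + C1·Σξ_i + C2·Σξ_j; both are assumptions made for illustration, as are the function names and the toy data.

```python
import numpy as np

def risk(t):
    """Risk function l(t) = max{-t, 0}, applied element-wise."""
    return np.maximum(-t, 0.0)

def objective(a, R, X_in, X_out, C1, C2):
    """Unconstrained objective with slacks replaced by the risk function.

    Assumes phi(x) = x and a loss of the form R^2 + C1*sum(xi_i) + C2*sum(xi_j).
    """
    d_in = np.sum((X_in - a) ** 2, axis=1)    # squared distances of normal data
    d_out = np.sum((X_out - a) ** 2, axis=1)  # squared distances of abnormal data
    xi_in = risk(R ** 2 - d_in)               # positive when a normal point is outside
    xi_out = risk(d_out - R ** 2)             # positive when an abnormal point is inside
    return R ** 2 + C1 * xi_in.sum() + C2 * xi_out.sum()

# Hypothetical toy data: sphere of radius 2 centred at the origin.
a = np.zeros(2)
X_in = np.array([[1.0, 0.0], [3.0, 0.0]])     # second normal point lies outside
X_out = np.array([[1.0, 1.0], [4.0, 0.0]])    # first abnormal point lies inside
value = objective(a, 2.0, X_in, X_out, C1=0.5, C2=0.25)
```

Here only the normal point outside the sphere and the abnormal point inside it contribute to the penalty terms.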

However, the second derivative of this risk function l(t) does not exist, so a gradient method cannot be applied for solving. For this purpose the following risk function l(t) is used:

The constant constrains the value of t, so that the risk function l(t) is twice differentiable within a small range while its value differs little from that of the initial risk function. Here, based on practical experience, the constant is taken as 0.5, and the risk function l(t) is substituted into the optimization objective to obtain:

where the elements of the matrix K are K_ij = k(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩ = (x_i · x_j), and e_i denotes the standard basis of R^(n+m). Solving in dual form, ignoring constant terms and simplifying yields the loss function:

i, j: x_i, x_j ∈ L_in; l, m: x_l, x_m ∈ L_out

(x_i · x_j) denotes the vector inner product of the i-th sample and the j-th sample.

Two inequalities of the constraint condition are simplified to obtain:

ξ_i ≥ ||x_i-a||²-R², i: x_i ∈ L_in

ξ_j ≥ R²-||x_j-a||², j: x_j ∈ L_out

ξ_i and ξ_j are each multiplied by a Lagrange coefficient.

Constrained problems of this kind are typically solved with the Lagrangian method to obtain an optimal solution.

After the above function is solved to obtain the Lagrange multiplier α_i corresponding to each x_i, the sphere center is calculated:

The value of the sphere center a is substituted into the loss function, and the radius R is solved using an optimization method;

thus, a hypersphere classification model which is initially trained can be obtained.
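As a rough illustration of fitting such a hypersphere, the sketch below minimizes the unconstrained objective R² + C1·Σ max(0, ||x_i − a||² − R²) + C2·Σ max(0, R² − ||x_j − a||²) by plain subgradient descent. This is a deliberately simplified stand-in, not the dual or Lagrangian solution described above. The function name, learning rate, iteration count, initialization and synthetic data are all assumptions.

```python
import numpy as np

def fit_hypersphere(X_in, X_out, C1=0.5, C2=0.5, lr=0.01, n_iter=2000):
    """Subgradient descent on the hinge-form objective; returns (a, R)."""
    a = X_in.mean(axis=0)                             # start at the normal-data centroid
    s = np.median(np.sum((X_in - a) ** 2, axis=1))    # s stands for R^2
    for _ in range(n_iter):
        d_in = np.sum((X_in - a) ** 2, axis=1)
        d_out = np.sum((X_out - a) ** 2, axis=1)
        viol_in = d_in > s                            # normal points outside the sphere
        viol_out = d_out < s                          # abnormal points inside the sphere
        grad_s = 1.0 - C1 * viol_in.sum() + C2 * viol_out.sum()
        grad_a = (2 * C1 * (a - X_in[viol_in]).sum(axis=0)
                  - 2 * C2 * (a - X_out[viol_out]).sum(axis=0))
        s = max(s - lr * grad_s, 0.0)                 # keep R^2 non-negative
        a = a - lr * grad_a
    return a, np.sqrt(s)

# Synthetic example: normal data around the origin, abnormal data on a distant ring.
rng = np.random.default_rng(0)
X_in = rng.normal(size=(100, 2))
angles = np.linspace(0.0, 2.0 * np.pi, 10, endpoint=False)
X_out = 6.0 * np.column_stack([np.cos(angles), np.sin(angles)])

a, R = fit_hypersphere(X_in, X_out)
f_in = np.linalg.norm(X_in - a, axis=1) - R           # decision values of normal data
f_out = np.linalg.norm(X_out - a, axis=1) - R         # decision values of abnormal data
```

With C1 = 0.5 the radius settles where roughly two normal points remain outside the sphere, which shows how the penalty coefficient trades sphere size against coverage of the normal data.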

S4, identifying whether the hypersphere classification model reaches a training termination condition.

S5, when the training termination condition is reached, inputting the unlabeled data into the hypersphere classification model under the training termination condition for classification and screening to obtain target abnormal data.

As shown in fig. 6, step S5 includes:

S51, respectively substituting the unlabeled data into the decision function to generate decision result values;

S52, judging whether the decision result value is greater than or equal to zero;

S53, when the decision result value is greater than or equal to zero, outputting the unmarked data and marking the unmarked data as target abnormal data.

Specifically, if training of the classification model has not reached the training termination condition, the classification model is retrained using the updated marked data set, and the cycle repeats. As the marked data are continuously updated, the sphere center position, the radius and the decision function of the hypersphere classification model are adjusted accordingly each time the classification model is retrained. When the change in the classification model after an iteration is smaller than a preset threshold value, the training termination condition is reached; the trained classification model, and hence the final decision function, is then obtained. The decision function f(x) = ||x − a|| − R represents the difference between the distance from the data x to the sphere center a and the radius R. The values of the sphere center a and the radius R of the classification model are finally determined once the training termination condition is reached.
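The query, mark, retrain and check cycle can be sketched as a loop. Several simplifications are assumed: the trainer `fit_toy_sphere` is a stand-in (centroid plus 95th-percentile radius of the normal data) rather than the loss-based training described above, the query uses only the nearest spherical distance, the model change is measured as the shift of the center plus the change of the radius, and `oracle` plays the role of the operator. All names and thresholds are hypothetical.

```python
import numpy as np

def fit_toy_sphere(X_normal):
    """Stand-in trainer (assumption): centroid and 95th-percentile radius."""
    a = X_normal.mean(axis=0)
    R = np.quantile(np.linalg.norm(X_normal - a, axis=1), 0.95)
    return a, R

def active_learning_loop(X_lab, y_lab, X_pool, oracle, tol=1e-2, max_rounds=100):
    """Label 0 marks normal data and 1 marks abnormal data (assumed encoding)."""
    a, R = fit_toy_sphere(X_lab[y_lab == 0])
    rounds = 0
    while len(X_pool) > 0 and rounds < max_rounds:
        rounds += 1
        # Simplified query: the pool point closest to the sphere surface.
        idx = np.argmin(np.abs(np.linalg.norm(X_pool - a, axis=1) - R))
        x = X_pool[idx]
        X_lab = np.vstack([X_lab, x])            # add the newly marked point
        y_lab = np.append(y_lab, oracle(x))
        X_pool = np.delete(X_pool, idx, axis=0)
        a_new, R_new = fit_toy_sphere(X_lab[y_lab == 0])
        change = np.linalg.norm(a_new - a) + abs(R_new - R)
        a, R = a_new, R_new
        if change < tol:                         # training termination condition
            break
    return a, R, X_lab, y_lab, rounds

rng = np.random.default_rng(0)
X_lab0 = rng.normal(size=(20, 2))                # initially marked normal data
y_lab0 = np.zeros(20, dtype=int)
X_pool0 = rng.normal(scale=2.0, size=(200, 2))   # unmarked pool
oracle = lambda x: int(np.linalg.norm(x) > 2.5)  # hypothetical operator judgment

a, R, X_lab, y_lab, rounds = active_learning_loop(X_lab0, y_lab0, X_pool0, oracle)
```

Because the stand-in trainer ignores abnormal points, a round in which the operator marks the queried point as abnormal leaves the model unchanged and ends the loop; a trainer that uses both classes would not have this property.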

When the classification is carried out, the unlabeled data x_iAnd substituting the data into a decision function f (x) to judge whether the data is positive or negative, if f (x) is less than or equal to 0, considering the unmarked system index data as normal data, and if f (x) is more than 0, considering the corresponding system index data as abnormal.

In other words, when classification is performed, the distance between the unlabeled data x_i and the sphere center of the hypersphere classification model is calculated and compared with the radius of the hypersphere classification model: if the distance is less than or equal to the radius, the unlabeled data x_i is normal data; if the distance is greater than the radius, the unlabeled data x_i is abnormal.
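The classification rule can be illustrated with a minimal sketch, assuming a plain Euclidean distance and the convention stated in the preceding paragraphs (f(x) ≤ 0 is normal, f(x) > 0 is abnormal). The function names and the toy values in the usage note are hypothetical and are not taken from the application.

```python
import math
from typing import List, Sequence


def decision_value(x: Sequence[float], center: Sequence[float], radius: float) -> float:
    """f(x) = ||x - a|| - R: distance from x to the sphere center minus the radius."""
    return math.dist(x, center) - radius


def classify(samples: Sequence[Sequence[float]],
             center: Sequence[float],
             radius: float) -> List[str]:
    # A point on or inside the hypersphere (f(x) <= 0) is treated as normal,
    # a point outside it (f(x) > 0) as abnormal.
    return [
        "abnormal" if decision_value(x, center, radius) > 0 else "normal"
        for x in samples
    ]
```

With a toy center at the origin and radius 5, the point (1, 1) would be labeled normal and the point (6, 8), which lies at distance 10, abnormal. A point exactly on the sphere surface is treated as normal here, which is a choice of this sketch.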

For example, the hypersphere classification model obtained by the above steps has the center a = (92.69%, 3.28%, 3.49%, 52.36%, 495.53, 63, 69.72%, 98, 357, 54, 91.77%, 58.92%), with the value 602.94 given alongside it. Taking the actual data in Table 1 as an example, each data item has 12 attribute values. When the data x_i = (94.76%, 3.76%, 1.29%, 47%, 434.78, 59, 78.37%, 104, 379, 50, 95.47%, 64.55%) is substituted into the decision function f(x) = ||x - a|| - R, the result is f(x_i) = -49.09 < 0, so the data x_i lies within the hypersphere of the classification model and x_i can be considered normal data.

For abnormal data, the AI operator can continue to perform root cause analysis and the like, find out the cause of the system abnormality, and give a repair suggestion.

As shown in fig. 7, another embodiment of the present application further provides a data anomaly detection apparatus, including:

the acquisition module 100: configured to obtain unmarked data;

the query module 200: configured to extract primary abnormal data from the unmarked data according to a preset query strategy, wherein the primary abnormal data are the data in the unmarked data that are screened out by the query strategy as meeting a preset condition;

the training module 300: configured to identify and mark the primary abnormal data, store the primary abnormal data into a marked first data set to form a second data set, and train a pre-trained hypersphere classification model through the second data set, wherein the hypersphere classification model is a classification model which fits, for the currently marked data, a hypersphere covering most sample values in a high-dimensional space and distinguishes abnormal data from normal data by taking the hypersphere surface as a boundary;

the identification module 400: configured to identify whether the hypersphere classification model reaches a training termination condition;

the result output module 500: configured to input the unlabeled data into the hypersphere classification model under the training termination condition for classification and screening when the training termination condition is reached, so as to obtain target abnormal data.

As shown in fig. 8, another embodiment of the present application discloses a computer device 600, which includes a memory 601, a processor 602, and a computer program stored on the memory 601 and executable on the processor 602, wherein the processor 602 implements the data anomaly detection method described above when executing the computer program. The computer device is a device capable of automatically performing numerical calculation and/or information processing in accordance with instructions set or stored in advance. As shown in the figure, the computer device 600 includes, but is not limited to, at least a memory 601, a processor 602, and a network interface 603, which may be communicatively coupled to each other via a system bus. Wherein:

in this embodiment, the memory 601 includes at least one type of computer-readable storage medium including a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the memory 601 may be an internal storage unit of the computer device 600, such as a hard disk or a memory of the computer device 600. In other embodiments, the memory 601 may also be an external storage device of the computer device 600, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), etc. provided on the computer device 600. Of course, the memory 601 may also include both internal and external storage devices of the computer device 600. In this embodiment, the memory 601 is generally used for storing the operating system and various application software installed in the computer device 600, such as the program code of the data anomaly detection apparatus 500. In addition, the memory 601 can also be used to temporarily store various types of data that have been output or are to be output.

In some embodiments, the processor 602 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data processing chip. The processor 602 is typically used to control the overall operation of the computer device 600. In this embodiment, the processor 602 is configured to run program code stored in the memory 601 or to process data, for example, to run the data anomaly detection apparatus 500, so as to implement the data anomaly detection method in each of the above embodiments.

The network interface 603 may include a wireless network interface or a wired network interface, and the network interface 603 is generally used for establishing a communication connection between the computer device 600 and other electronic devices. For example, the network interface 603 is used to connect the computer device 600 to an external terminal through a network, and to establish a data transmission channel and a communication connection between the computer device 600 and the external terminal. The network may be a wireless or wired network such as an intranet, the Internet, a Global System for Mobile Communications (GSM) network, a Wideband Code Division Multiple Access (WCDMA) network, a 4G network, a 5G network, Bluetooth, Wi-Fi, and the like.

It is noted that fig. 8 only shows the computer device 600 with components 601-603, but it is to be understood that not all of the shown components are required to be implemented, and that more or fewer components may be implemented instead.

In this embodiment, the data anomaly detection apparatus 500 stored in the memory 601 may be further divided into one or more program modules, and the one or more program modules are stored in the memory 601 and executed by one or more processors (in this embodiment, the processor 602) to complete the data anomaly detection method of the present invention.

The present embodiment also provides a computer-readable storage medium, such as a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, a server, an App store, etc., on which a computer program is stored, which when executed by a processor implements corresponding functions. The computer-readable storage medium of this embodiment is used for storing the data anomaly detection apparatus 500, so as to implement the data anomaly detection method of the present invention when executed by the processor.

Another embodiment of the present application also discloses a computer-readable storage medium, on which a computer program is stored, the program being executed by a processor to implement the data anomaly detection method described above.

As shown in fig. 9, another embodiment of the present application provides a data anomaly detection method, including:

S00, acquiring unlabeled data.

S10, training a hypersphere classification model by using the marked data in the marked data set.

S20, extracting primary abnormal data from the unmarked data according to a preset query strategy, wherein the primary abnormal data are the data in the unmarked data that are screened out by the query strategy as meeting a preset condition.

S30, identifying and marking the primary abnormal data to obtain new marked data.

S40, adding the new marked data into the marked data set to obtain an updated marked data set.

S50, training the hypersphere classification model by using the marked data in the updated marked data set.

S60, identifying whether the hypersphere classification model reaches a training termination condition; if the training termination condition is reached, go to step S70; if the training termination condition is not met, the process goes to step S20.

S70, classifying the unlabeled data by using the hypersphere classification model, that is, inputting the unlabeled data into the hypersphere classification model under the training termination condition for classification and screening to obtain target abnormal data.
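Steps S00 to S70 can be sketched as a single loop. The sketch below is illustrative only and rests on several assumptions that are not taken from the application: the hypersphere is fitted with a simple stand-in (centroid of the points labeled normal, radius equal to the farthest such point) that ignores the abnormal labels, the query strategy picks the unlabeled point closest to the sphere surface without the sample-density weighting, and the variation of the model is measured as center shift plus radius change. The names `fit_sphere`, `detect_anomalies` and `oracle` (standing in for the operator who labels queried data) are hypothetical.

```python
import math


def fit_sphere(normal_points):
    # Stand-in fit (assumption): centroid of the labeled-normal points,
    # radius = distance to the farthest labeled-normal point.
    dim = len(normal_points[0])
    count = len(normal_points)
    center = tuple(sum(p[i] for p in normal_points) / count for i in range(dim))
    radius = max(math.dist(p, center) for p in normal_points)
    return center, radius


def detect_anomalies(unlabeled, labeled, oracle, threshold=1e-3, max_rounds=20):
    """labeled is a list of (point, is_abnormal); oracle(point) returns is_abnormal."""
    labeled = list(labeled)
    pool = list(unlabeled)                                              # S00
    center, radius = fit_sphere([p for p, bad in labeled if not bad])   # S10
    for _ in range(max_rounds):
        if not pool:
            break
        # S20: query the unlabeled point nearest to the sphere surface.
        query = min(pool, key=lambda x: abs(math.dist(x, center) - radius))
        pool.remove(query)
        labeled.append((query, oracle(query)))                          # S30, S40
        new_center, new_radius = fit_sphere(
            [p for p, bad in labeled if not bad])                       # S50
        change = math.dist(center, new_center) + abs(radius - new_radius)
        center, radius = new_center, new_radius
        if change < threshold:                                          # S60
            break
    # S70: points outside the final hypersphere are the target abnormal data.
    return [x for x in unlabeled if math.dist(x, center) - radius > 0]
```

In use, `oracle` would be replaced by whatever manual labeling interface is available, and `max_rounds` caps the labeling effort in case the assumed termination measure never falls below the threshold.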

The data anomaly detection method provided by the embodiments of the present application starts from the data itself and combines unsupervised learning with supervised learning. The hypersphere classification model, constructed from a small amount of marked data, is not limited by the original distribution of the data and therefore has a wider application range. The query strategy based on boundary distance and sample density can accurately find the most valuable data and reduce the influence of noise, which greatly reduces the amount of data that operators need to mark while maintaining the data classification precision. This saves the cost of intelligent operations, suits actual industrial scenarios, and facilitates large-scale deployment.

The embodiments of the present disclosure have been described above. However, these examples are for illustrative purposes only and are not intended to limit the scope of the present disclosure. The scope of the disclosure is defined by the appended claims and equivalents thereof. Various alternatives and modifications can be devised by those skilled in the art without departing from the scope of the present disclosure, and such alternatives and modifications are intended to be within the scope of the present disclosure.

Claims

1. A data anomaly detection method is characterized by comprising the following steps:

acquiring unmarked data;

2. The method of claim 1, wherein the training method of the hypersphere classification model comprises:

3. The method of claim 2, wherein when the training termination condition is reached, inputting the unlabeled data into the hypersphere classification model under the training termination condition for classification screening to obtain target abnormal data, comprising:

judging whether the decision result value is greater than or equal to zero;

4. The method of claim 2, wherein the query strategy screens primary anomaly data based on the pre-trained hypersphere classification model, and the predetermined condition is that a weighted distance value from a hypersphere surface of the hypersphere classification model is minimum.

5. The method according to claim 4, wherein the extracting primary abnormal data from the unlabeled data according to a preset query policy comprises:

6. The method of claim 5, wherein the nearest sphere distance normalization processing method comprises:

7. The method of claim 5, wherein the nearest neighbor sample distance normalization processing method comprises:

8. A data abnormality detection apparatus, characterized by comprising:

an acquisition module: configured to perform obtaining unmarked data;

9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the data anomaly detection method according to any one of claims 1 to 7 when executing the computer program.

10. A computer-readable storage medium on which a computer program is stored, the program being executable by a processor to implement the data anomaly detection method according to any one of claims 1 to 7.