CN114020811A

CN114020811A - Data anomaly detection method and device and electronic equipment

Info

Publication number: CN114020811A
Application number: CN202111321740.1A
Authority: CN
Inventors: 张为欢; 王培君; 管虹翔; 梁广会
Original assignee: Industrial and Commercial Bank of China Ltd ICBC
Current assignee: Industrial and Commercial Bank of China Ltd ICBC
Priority date: 2021-11-09
Filing date: 2021-11-09
Publication date: 2022-02-08

Abstract

The invention discloses a data anomaly detection method and device, and electronic equipment. The detection method comprises the following steps: receiving product operation data; sending the product operation data to a target park, wherein the target park is connected to an anomaly detection system in which a data detection model runs, the data detection model mines, during model training, the sample data that cause the loss value rate to exceed a preset probability threshold and is retrained on that sample data, and the loss value rate indicates the ratio of the number of samples misclassified by the model to the total number of samples; analyzing the product operation data with the anomaly detection system to obtain a detection result; and sending the abnormal data in the detection result to an alarm system. The invention solves the technical problem of low detection accuracy of anomaly detection systems in the related art.

Description

Data anomaly detection method and device and electronic equipment

Technical Field

The invention relates to the technical field of data processing, in particular to a data anomaly detection method and device and electronic equipment.

Background

In the financial technology industry, most enterprises have begun to build alarm systems based on artificial intelligence algorithms to identify abnormal data so that the relevant personnel can handle them quickly. In the related art, a mature artificial intelligence algorithm is introduced, and a model trained on historical data is used to detect abnormal data in production and raise an alarm. However, existing models are obtained by training and iterating on offline data from an earlier period. With the rapid development of financial business, the training samples contain more and more positive and negative samples that are hard to distinguish, so the actual detection rate and the detection accuracy of existing models decline.

In view of the above problems, no effective solution has been proposed.

Disclosure of Invention

The embodiment of the invention provides a data anomaly detection method, a device and electronic equipment thereof, which at least solve the technical problem of low detection accuracy of an anomaly detection system in the related technology.

According to an aspect of an embodiment of the present invention, there is provided a data anomaly detection method, including: receiving product operation data; sending the product operation data to a target park, wherein the target park is connected to an anomaly detection system in which a data detection model runs, the data detection model mines, during model training, the sample data that cause the loss value rate to exceed a preset probability threshold and is retrained on that sample data, and the loss value rate indicates the ratio of the number of samples misclassified by the model to the total number of samples; analyzing the product operation data with the anomaly detection system to obtain a detection result; and sending the abnormal data in the detection result to an alarm system.

Optionally, before receiving the product operation data, the detection method further includes: obtaining product operation data in a first preset time period in a historical process to obtain historical sample data; receiving product error data of product operation data input by external terminal equipment, wherein the product error data indicate sample data which causes a loss value rate to be greater than a preset probability threshold value in a model training process; and inputting the historical sample data and the product error data into a data detection model so as to carry out iterative training on the data detection model.

Optionally, the step of inputting the historical sample data and the product error data into a data detection model to perform iterative training on the data detection model includes: performing anomaly detection on the historical sample data by adopting a preset normal distribution algorithm to obtain a negative sample set and a positive sample set; and analyzing the sample data in the positive sample set by adopting a local abnormal factor algorithm to determine wrong data in the positive sample set and finish model training.

Optionally, the step of performing anomaly detection on the historical sample data by using a preset normal distribution algorithm to obtain a negative sample set and a positive sample set includes: taking each index to be detected as a reference, extracting sample data corresponding to each index to be detected in the historical sample data; calculating the sample mean value of all the extracted sample data to obtain an index detection mean value; performing sample variance calculation on all the extracted sample data to obtain an index detection variance value; calculating a normal distribution area based on the index detection mean value and the index detection variance value; and classifying the historical sample data in the normal distribution area into a positive sample set, and classifying the historical sample data which does not fall in the normal distribution area into the negative sample set.

Optionally, the step of analyzing the sample data in the positive sample set by using a local abnormal factor algorithm to determine the error data in the positive sample set includes: combining the index data corresponding to each index to be detected and the corresponding data time in the historical sample data to obtain a plurality of time sequence data; determining surrounding neighborhood points in the target time sequence data by taking each time sequence data as a center to obtain a neighborhood set; calculating the reachable distance of other neighborhood points in the neighborhood set and the local reachable density of the target time sequence data; calculating the density mean value of the local reachable densities of all other neighborhood points in the neighborhood set, and calculating the density ratio between the density mean value and the local reachable density of the target time sequence data; and if the density ratio is larger than a preset ratio threshold, determining that the target time sequence data is wrong data in the positive sample set.

Optionally, the step of determining a neighborhood point around the target time series data by taking each time series data as a center to obtain a neighborhood set includes: acquiring a vector distance value between the target time sequence data and other time sequence data; sequencing all vector distance values to obtain a sequencing result; and classifying the time sequence data with the vector distance value smaller than or equal to a preset distance threshold value into the neighborhood set based on the sorting result.

Optionally, the step of calculating the reachable distance of other neighborhood points in the neighborhood set and the local reachable density of the target time series data includes: if the actual distance between another neighborhood point in the neighborhood set and the target time series data is less than or equal to the k-th distance, determining the reachable distance between that neighborhood point and the target time series data to be the preset distance threshold; if the actual distance between another neighborhood point in the neighborhood set and the target time series data is greater than the k-th distance, determining the reachable distance between that neighborhood point and the target time series data to be the actual distance; and calculating the reciprocal of the reachable distance between the target time series data and the other neighborhood points in the neighborhood set to obtain the local reachable density of the target time series data.

Optionally, after sending the abnormal data in the detection result to the alarm system, the detection method further includes: receiving a data confirmation result fed back by the alarm system; if the data confirmation result indicates that the detection result is real, classifying the abnormal data into a positive sample library; and if the data confirmation result indicates that the detection result is false, classifying the abnormal data into a negative sample library.
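
The feedback step described above can be sketched as follows. This is a minimal illustration, not the claimed implementation: the `SampleLibraries` class, the `route_confirmed_sample` function and the record fields are hypothetical names introduced here, and the two libraries are modeled as in-memory lists.

```python
from dataclasses import dataclass, field


@dataclass
class SampleLibraries:
    """Hypothetical in-memory stand-ins for the positive and negative sample libraries."""
    positive: list = field(default_factory=list)
    negative: list = field(default_factory=list)


def route_confirmed_sample(libraries, abnormal_record, detection_is_real):
    """File an alerted record according to the confirmation fed back by the alarm system."""
    if detection_is_real:
        # Confirmation says the detection result is real: positive sample library.
        libraries.positive.append(abnormal_record)
    else:
        # Confirmation says the detection result is false: negative sample library.
        libraries.negative.append(abnormal_record)
    return libraries


libraries = SampleLibraries()
route_confirmed_sample(libraries, {"metric": "success_rate", "value": 0.41}, True)
route_confirmed_sample(libraries, {"metric": "success_rate", "value": 0.97}, False)
```

Keeping the confirmed records in separate libraries makes them available as labeled samples for a later training iteration.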

According to another aspect of the embodiments of the present invention, there is also provided a data anomaly detection apparatus, including: a receiving unit, configured to receive product operation data; a first sending unit, configured to send the product operation data to a target park, wherein the target park is connected to an anomaly detection system in which a data detection model runs, the data detection model mines, during model training, the sample data that cause the loss value rate to exceed a preset probability threshold and is retrained on that sample data, and the loss value rate indicates the ratio of the number of samples misclassified by the model to the total number of samples; an analysis unit, configured to analyze the product operation data with the anomaly detection system to obtain a detection result; and a second sending unit, configured to send the abnormal data in the detection result to the alarm system.

Optionally, the detection apparatus further comprises: a first acquisition module, configured to acquire, before the product operation data are received, product operation data within a first preset time period in the historical process to obtain historical sample data; a first receiving module, configured to receive product error data of the product operation data input by an external terminal device, wherein the product error data indicate the sample data that cause the loss value rate to exceed a preset probability threshold during model training; and a first training module, configured to input the historical sample data and the product error data into the data detection model so as to iteratively train the data detection model.

Optionally, the first training module comprises: the first detection submodule is used for carrying out abnormal detection on the historical sample data by adopting a preset normal distribution algorithm to obtain a negative sample set and a positive sample set; and the first analysis submodule is used for analyzing the sample data in the positive sample set by adopting a local abnormal factor algorithm so as to determine the error data in the positive sample set and finish model training.

Optionally, the first detection submodule includes: the first extraction submodule is used for extracting sample data corresponding to each index to be detected from the historical sample data by taking each index to be detected as a reference; the first calculation submodule is used for calculating the sample mean value of all the extracted sample data to obtain an index detection mean value; the second calculation submodule is used for performing sample variance calculation on all the extracted sample data to obtain an index detection variance value; the third calculation submodule is used for calculating a normal distribution area based on the index detection mean value and the index detection variance value; and the first classification submodule is used for classifying the historical sample data in the normal distribution area into a positive sample set, and classifying the historical sample data which does not fall in the normal distribution area into the negative sample set.

Optionally, the first analysis sub-module comprises: the first combination sub-module is used for combining the index data corresponding to each index to be detected and the corresponding data time in the historical sample data to obtain a plurality of time sequence data; the first determining submodule is used for determining surrounding neighborhood points of the target time sequence data by taking each time sequence data as a center to obtain a neighborhood set; the fourth calculation submodule is used for calculating the reachable distance of other neighborhood points in the neighborhood set and the local reachable density of the target time sequence data; the fifth calculation submodule is used for calculating the density mean value of the local reachable densities of all other neighborhood points in the neighborhood set and calculating the density ratio between the density mean value and the local reachable density of the target time sequence data; and the second determining submodule is used for determining that the target time sequence data is wrong data in the positive sample set if the density ratio is larger than a preset ratio threshold.

Optionally, the first determining sub-module includes: the first acquisition submodule is used for acquiring vector distance values between the target time sequence data and other time sequence data; the first sequencing submodule is used for sequencing all the vector distance values to obtain a sequencing result; and the second classification submodule is used for classifying the time sequence data of which the vector distance value is less than or equal to a preset distance threshold value into the neighborhood set based on the sequencing result.

Optionally, the fourth calculation submodule includes: a third determining submodule, configured to determine, if the actual distance between another neighborhood point in the neighborhood set and the target time series data is less than or equal to the k-th distance, that the reachable distance between that neighborhood point and the target time series data is the preset distance threshold; a fourth determining submodule, configured to determine, if the actual distance between another neighborhood point in the neighborhood set and the target time series data is greater than the k-th distance, that the reachable distance between that neighborhood point and the target time series data is the actual distance; and a sixth calculation submodule, configured to calculate the reciprocal of the reachable distance between the target time series data and the other neighborhood points in the neighborhood set to obtain the local reachable density of the target time series data.

Optionally, the detection apparatus further comprises: a second receiving module, configured to receive, after the abnormal data in the detection result are sent to the alarm system, a data confirmation result fed back by the alarm system; a first classification module, configured to classify the abnormal data into a positive sample library if the data confirmation result indicates that the detection result is real; and a second classification module, configured to classify the abnormal data into a negative sample library if the data confirmation result indicates that the detection result is false.

According to another aspect of the embodiments of the present invention, there is also provided an electronic device, including: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform any of the data anomaly detection methods described above via execution of the executable instructions.

According to another aspect of the embodiments of the present invention, there is also provided a computer-readable storage medium, where the computer-readable storage medium includes a stored computer program, and when the computer program runs, the apparatus where the computer-readable storage medium is located is controlled to execute any one of the above data anomaly detection methods.

In the embodiments of the invention, product operation data are received and sent to a target park. The target park is connected to an anomaly detection system in which a data detection model runs; during model training, the data detection model mines the sample data that cause the loss value rate to exceed a preset probability threshold and is retrained on that sample data, where the loss value rate indicates the ratio of the number of samples misclassified by the model to the total number of samples. The anomaly detection system analyzes the product operation data to obtain a detection result, and the abnormal data in the detection result are sent to an alarm system. In this application, performing hard example mining on the real-time product operation data (i.e., mining the sample data with a large loss value during model training) and retraining on those data improves the detection rate of the anomaly detection system, so that abnormal data can be found more accurately in real time. This solves the technical problem of low detection accuracy of anomaly detection systems in the related art.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:

FIG. 1 is a flow diagram of an alternative data anomaly detection method according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of an alternative method for increasing the accuracy of an artificial intelligence alarm system in accordance with embodiments of the present invention;

FIG. 3 is a flow diagram of an alternative hard case mining according to an embodiment of the invention;

FIG. 4 is a schematic diagram of an alternative data anomaly detection apparatus according to an embodiment of the present invention.

Detailed Description

In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

To facilitate understanding of the invention by those skilled in the art, some terms or nouns referred to in the embodiments of the invention are explained below:

Hard example mining (difficult mining): mining the sample data that cause a high loss value rate during model training (i.e., sample data that the model misclassifies with high probability) and retraining on that sample data.

Hard-to-classify positive (negative) samples: positive (negative) samples that are misclassified as negative (positive) samples; these are the positive (negative) samples with the highest loss during training.

LOF algorithm: Local Outlier Factor, a local anomaly factor algorithm. It compares the local density of a given data point with the local density of its neighborhood points; a data point is considered an anomalous sample when its local density is significantly lower than that of its neighborhood points.

The following embodiments of the invention can be applied to various systems and applications that detect abnormal data, or to scenarios in which abnormal data need to be detected. The data involved may be data running in actual production, for example data related to various financial products (such as fund products and bond products).

Example one

In accordance with an embodiment of the present invention, an embodiment of a data anomaly detection method is provided. It should be noted that the steps illustrated in the flowchart of the drawings may be performed in a computer system, for example as a set of computer-executable instructions, and that although a logical order is illustrated in the flowchart, in some cases the steps illustrated or described may be performed in a different order.

FIG. 1 is a flow chart of an alternative data anomaly detection method according to an embodiment of the present invention. As shown in FIG. 1, the method includes the following steps:

Step S102: receiving product operation data.

Step S104: sending the product operation data to a target park, wherein the target park is connected to an anomaly detection system in which a data detection model runs, the data detection model mines, during model training, the sample data that cause the loss value rate to exceed a preset probability threshold and is retrained on that sample data, and the loss value rate indicates the ratio of the number of samples misclassified by the model to the total number of samples.

Step S106: analyzing the product operation data with the anomaly detection system to obtain a detection result.

Step S108: sending the abnormal data in the detection result to an alarm system.

Through the above steps, product operation data are received and sent to the target park. The target park is connected to an anomaly detection system in which a data detection model runs; during model training, the data detection model mines the sample data that cause the loss value rate to exceed a preset probability threshold and is retrained on that sample data, where the loss value rate indicates the ratio of the number of samples misclassified by the model to the total number of samples. The anomaly detection system analyzes the product operation data to obtain a detection result, and the abnormal data in the detection result are sent to an alarm system. In the embodiment of the invention, performing hard example mining on the real-time product operation data (anomaly detection is carried out by the data detection model running in the anomaly detection system, and the sample data with a large loss value during model training are mined) and retraining on those data improves the detection rate of the anomaly detection system, so that abnormal data can be found more accurately in real time. This solves the technical problem of low detection accuracy of anomaly detection systems in the related art.
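
The flow of steps S102 to S108 can be sketched as below. This is only an illustrative outline under stated assumptions: `run_detection_pipeline`, the `detect` callback and the `send_alarm` callback are hypothetical names, the detector is reduced to a yes/no function, and the transfer to the target park is not modeled.

```python
from typing import Callable, Iterable, List


def run_detection_pipeline(
    operation_data: Iterable[dict],
    detect: Callable[[dict], bool],
    send_alarm: Callable[[dict], None],
) -> List[dict]:
    """Receive records, analyze each one, and forward the abnormal ones to the alarm sink."""
    abnormal = []
    for record in operation_data:      # S102/S104: record received and handed to the detector
        if detect(record):             # S106: analysis by the (stand-in) anomaly detector
            abnormal.append(record)
            send_alarm(record)         # S108: abnormal data forwarded to the alarm system
    return abnormal


# Usage with a toy threshold detector and a plain list acting as the alarm sink.
alarms = []
records = [{"success_rate": 0.99}, {"success_rate": 0.42}, {"success_rate": 0.98}]
flagged = run_detection_pipeline(
    records,
    detect=lambda r: r["success_rate"] < 0.9,
    send_alarm=alarms.append,
)
```

Passing the detector and the alarm sink as callbacks keeps the sketch independent of any particular model or alerting channel.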

The following will explain the embodiments of the present invention in detail with reference to the above steps.

In an embodiment of the present invention, before receiving the product operation data, the detection method further includes: obtaining product operation data in a first preset time period in a historical process to obtain historical sample data; receiving product error data of product operation data input by external terminal equipment, wherein the product error data indicate sample data which causes a loss value rate to be greater than a preset probability threshold value in a model training process; and inputting the historical sample data and the product error data into the data detection model so as to carry out iterative training on the data detection model.

In the embodiment of the present invention, historical sample data (i.e., the product operation data within a first preset time period in the historical process) and hard-to-classify sample data may be input to the model (i.e., the data detection model) for iterative training. The hard-to-classify sample data are the product error data, that is, the sample data, obtained by hard example mining on real-time production data, that cause the loss value rate to exceed a preset probability threshold. The threshold may be, for example, sixty percent; it is determined by the actual situation and is not limited herein.
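
One possible reading of the loss value rate and the threshold test is sketched below. The text does not say how samples are grouped when the rate is computed, so this sketch assumes the rate is evaluated per group of samples (for example per batch); the function names, the grouping and the toy labels are all assumptions made for illustration, and only the sixty percent threshold comes from the example above.

```python
def loss_value_rate(labels, predictions):
    """Ratio of misclassified samples to the total number of samples."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    if not labels:
        return 0.0
    errors = sum(1 for y, p in zip(labels, predictions) if y != p)
    return errors / len(labels)


def mine_hard_groups(groups, threshold=0.6):
    """Return the names of the groups whose loss value rate exceeds the threshold."""
    return [
        name
        for name, (labels, predictions) in groups.items()
        if loss_value_rate(labels, predictions) > threshold
    ]


# Toy data: (true labels, model predictions) per hypothetical batch.
groups = {
    "batch_a": ([1, 0, 1, 0, 1], [0, 1, 0, 0, 0]),  # 4 of 5 misclassified
    "batch_b": ([1, 0, 1, 0, 1], [1, 0, 1, 0, 0]),  # 1 of 5 misclassified
}
hard = mine_hard_groups(groups)
```

Under this reading, the samples of the selected groups would be the ones fed back into the next training iteration.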

Optionally, the step of inputting the historical sample data and the product error data into the data detection model to perform iterative training on the data detection model includes: performing anomaly detection on historical sample data by adopting a preset normal distribution algorithm to obtain a negative sample set and a positive sample set; and analyzing the sample data in the positive sample set by adopting a local abnormal factor algorithm to determine wrong data in the positive sample set and finish model training.

In the embodiment of the invention, the historical sample data can be mined in two passes to determine the product error data and complete model training. First, the 3-sigma algorithm (i.e., the preset normal distribution algorithm) is used to perform anomaly detection on the historical sample data, yielding a negative sample set and a positive sample set. The historical sample data are required to follow a normal distribution; if they do not, a log calculation can be used to convert them to a normal distribution. Then the LOF algorithm (i.e., the local anomaly factor algorithm) is used to examine the sample data in the positive sample set, so as to determine the misclassified data and complete model training.

Optionally, the step of performing anomaly detection on the historical sample data by using a preset normal distribution algorithm to obtain a negative sample set and a positive sample set includes: taking each index to be detected as a reference, extracting sample data corresponding to each index to be detected in historical sample data; calculating the sample mean value of all the extracted sample data to obtain an index detection mean value; performing sample variance calculation on all the extracted sample data to obtain an index detection variance value; calculating a normal distribution area based on the index detection mean value and the index detection variance value; and classifying the historical sample data in the normal distribution area into a positive sample set, and classifying the historical sample data which does not fall in the normal distribution area into a negative sample set.

In the embodiment of the present invention, the 3-sigma algorithm (i.e., the preset normal distribution algorithm) is used to perform anomaly detection on the historical sample data. For the sample data of each index to be detected (such as success rate or transaction time consumption), the sample mean is calculated to obtain the index detection mean value μ, and the sample variance is calculated to obtain the index detection variance value σ. Sample data outside the 3-sigma range (i.e., outside the normal distribution region, such as (μ - 3σ, μ + 3σ)) are classified into the negative sample set, and sample data within the 3-sigma range are classified into the positive sample set.
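
A minimal sketch of the 3-sigma split for a single index follows. It assumes that σ in the range (μ - 3σ, μ + 3σ) is the square root of the sample variance, and it adds an optional log transform for data that are not normally distributed; the function name `three_sigma_split` and the toy values are hypothetical.

```python
import math
import statistics


def three_sigma_split(values, log_transform=False):
    """Split one index's samples into (positive, negative) sets by the 3-sigma rule."""
    # Optional log transform; it requires strictly positive values.
    data = [math.log(v) for v in values] if log_transform else list(values)
    mu = statistics.mean(data)
    # Assumption: sigma is the square root of the sample variance.
    sigma = math.sqrt(statistics.variance(data))
    low, high = mu - 3 * sigma, mu + 3 * sigma
    positive, negative = [], []
    for original, x in zip(values, data):
        # Inside the open interval: positive set; otherwise: negative set.
        (positive if low < x < high else negative).append(original)
    return positive, negative


# Twenty ordinary transaction times plus one extreme value.
values = [98, 99, 100, 101, 102] * 4 + [500]
positive, negative = three_sigma_split(values)
```

Because the mean and variance are computed from the same data that contain the outlier, a single extreme value can only fall outside the range when enough ordinary samples are present, which is one reason to compute the statistics over a sufficiently long historical window.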

Optionally, the step of analyzing the sample data in the positive sample set by using a local abnormal factor algorithm to determine the misclassified data in the positive sample set includes: combining the index data corresponding to each index to be detected and the corresponding data time in the historical sample data to obtain a plurality of time sequence data; determining surrounding neighborhood points in the target time sequence data by taking each time sequence data as a center to obtain a neighborhood set; calculating the reachable distance of other neighborhood points in the neighborhood set and the local reachable density of the target time sequence data; calculating the density mean value of the local reachable density of all other neighborhood points in the neighborhood set, and calculating the density ratio between the density mean value and the local reachable density of the target time sequence data; and if the density ratio is larger than a preset ratio threshold, determining that the target time sequence data is wrong data in the positive sample set.

In the embodiment of the present invention, the LOF algorithm may be used to examine the sample data in the positive sample set. The algorithm requires two-dimensional data (i.e., time series data), so the index data corresponding to each index to be detected (e.g., success rate) are combined with the corresponding data time (e.g., the success rate or the transaction time consumption is combined with the time point at which it occurs into two-dimensional time series data) to obtain the time series data. Taking each time series data in turn as the center, the neighborhood points around the target time series data (denoted point P) are determined (a point O denotes a point inside the neighborhood), giving the neighborhood set, denoted N_k(P), which represents the k-th distance neighborhood, i.e., all points whose distance to point P is less than or equal to the k-th distance. Next, the reachable distance of the other neighborhood points in the neighborhood set is calculated (e.g., the k-th reachable distance from point O to point P, which is either the k-th distance of point O or the actual distance between point O and point P), together with the local reachable density of the target time series data (i.e., the reciprocal of the reachable distance from point P to the other neighborhood points). Finally, the density ratio between the mean of the local reachable densities of all other neighborhood points in the neighborhood set and the local reachable density of the target time series data is calculated, i.e., the mean of the local reachable densities of the points in the neighborhood divided by the local reachable density of point P:

$$\mathrm{LOF}_k(P) = \frac{\dfrac{1}{|N_k(P)|}\sum_{O \in N_k(P)} \rho_k(O)}{\rho_k(P)}$$

where ρ_k(O) represents the local reachable density of point O, ρ_k(P) represents the local reachable density of point P, |N_k(P)| represents the number of points in the neighborhood, and LOF_k(P) represents the density ratio.

If the density ratio is greater than a preset ratio threshold (for example, 1), the target time series data is determined to be misclassified data in the positive sample set. If the density ratio is close to the preset ratio threshold, the density of point P is almost the same as that of its neighborhood points, and point P may belong to the same cluster as its neighborhood. If the density ratio is less than the preset ratio threshold, the density of point P is higher than that of its neighborhood points, which indicates that point P is a dense point.
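
The density ratio and the threshold decision can be illustrated with a short sketch. It assumes the local reachable densities have already been computed; `density_ratio`, `is_misclassified` and the sample density values are hypothetical, and the default threshold of 1 is taken from the example above.

```python
def density_ratio(neighbor_densities, target_density):
    """Mean local reachable density of the neighbors divided by that of point P."""
    if not neighbor_densities or target_density <= 0:
        raise ValueError("need at least one neighbor and a positive target density")
    mean_density = sum(neighbor_densities) / len(neighbor_densities)
    return mean_density / target_density


def is_misclassified(neighbor_densities, target_density, ratio_threshold=1.0):
    """Flag point P when its density ratio exceeds the preset ratio threshold."""
    return density_ratio(neighbor_densities, target_density) > ratio_threshold


neighbors = [1.0, 1.5, 0.5]                     # densities of the points in the neighborhood
sparse_point = density_ratio(neighbors, 0.5)    # P is sparser than its neighbors
dense_point = density_ratio(neighbors, 2.0)     # P is denser than its neighbors
```

Here `sparse_point` evaluates to 2.0, so that point would be flagged, while `dense_point` evaluates to 0.5 and would be treated as a dense point.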

Optionally, the step of determining, with each time series data as a center, the neighborhood points around the target time series data to obtain a neighborhood set includes: acquiring the vector distance values between the target time series data and the other time series data; sorting all the vector distance values to obtain a sorting result; and, based on the sorting result, classifying the time series data whose vector distance value is less than or equal to a preset distance threshold into the neighborhood set.

In the embodiment of the present invention, the vector distance values between the target time series data and the other time series data may be calculated and sorted to obtain a sorting result. For example, to calculate the k-th distance of the target time series data (e.g., point P), the distances between point P and the other points (i.e., the other time series data) are sorted from small to large, and the k-th of these is the k-distance. The time series data whose vector distance value is less than or equal to a preset distance threshold (when the k-distance neighborhood is calculated, the preset distance threshold is the k-distance) are then classified into the neighborhood set. For example, the k-distance neighborhood, i.e., the set of all points whose distance to point P is less than or equal to the k-distance, is denoted as N_k(P).
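The neighborhood construction described above can be sketched as follows. This is a minimal illustration, assuming Euclidean distance as the vector distance and hypothetical (time, success rate) pairs; the function name and sample values are not from the original text.

```python
import math


def k_distance_neighborhood(points, p_index, k):
    """Return (k_distance, neighbor_indices) for points[p_index].

    Neighbors are all other points whose distance to P is <= the k-th
    smallest distance, so ties can yield more than k neighbors.
    """
    p = points[p_index]
    # Distances from P to every other point, sorted from small to large.
    dists = sorted(
        (math.dist(p, q), i) for i, q in enumerate(points) if i != p_index
    )
    k_dist = dists[k - 1][0]  # the k-th smallest distance is the k-distance
    neighbors = [i for d, i in dists if d <= k_dist]
    return k_dist, neighbors


# Hypothetical (time, success rate) pairs; the last one lies far from the rest.
series = [(0.0, 0.99), (1.0, 0.98), (2.0, 0.99), (3.0, 0.97), (10.0, 0.40)]
k_dist, hood = k_distance_neighborhood(series, 0, k=2)
```

Here `hood` holds the indices of the points in N_k(P) for the first point, and `k_dist` plays the role of the preset distance threshold.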

Optionally, the step of calculating the reachable distance of other neighborhood points in the neighborhood set and the local reachable density of the target time series data includes: if the actual distance between another neighborhood point in the neighborhood set and the target time series data is less than or equal to the k-distance, determining the reachable distance between that neighborhood point and the target time series data to be the preset distance threshold; if the actual distance between another neighborhood point in the neighborhood set and the target time series data is greater than the k-distance, determining the reachable distance between that neighborhood point and the target time series data to be the actual distance; and calculating the reciprocal of the reachable distance between the target time series data and the other neighborhood points in the neighborhood set to obtain the local reachable density of the target time series data.

In an embodiment of the present invention, the reachable distance may be calculated by the following formula, where point P represents the target time series data, point O represents another neighborhood point in the neighborhood set, d_k(O) represents the k-th distance of point O, d(P,O) represents the actual distance between point O and point P, and d_k(P,O) is the reachable distance.

d_k(P,O)＝max{d_k(O),d(P,O)}；

That is, if the actual distance between another neighborhood point in the neighborhood set and the target time series data is less than or equal to the k-distance, the reachable distance is the k-distance (i.e., the preset distance threshold); otherwise, the reachable distance between that neighborhood point and the target time series data is the actual distance.
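The reachable distance formula d_k(P,O)＝max{d_k(O),d(P,O)} can be illustrated with a short sketch. Euclidean distance and the sample coordinates are assumptions made for the example; the helper names are hypothetical.

```python
import math


def k_distance(points, o_index, k):
    """k-th smallest distance from points[o_index] to any other point."""
    o = points[o_index]
    dists = sorted(
        math.dist(o, q) for i, q in enumerate(points) if i != o_index
    )
    return dists[k - 1]


def reach_distance(points, p_index, o_index, k):
    """d_k(P, O) = max{d_k(O), d(P, O)}."""
    actual = math.dist(points[p_index], points[o_index])
    return max(k_distance(points, o_index, k), actual)


pts = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (10.0, 0.0)]
# O = pts[1]: actual distance 1.0 is below d_k(O) = 2.0, so the k-distance wins.
near = reach_distance(pts, 0, 1, k=2)
# O = pts[3]: actual distance 10.0 exceeds d_k(O) = 9.0, so the actual distance wins.
far = reach_distance(pts, 0, 3, k=2)
```

The two calls cover both branches of the rule: `near` takes the k-distance of O, while `far` takes the actual distance.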

Then the local reachable density can be calculated, where point O represents a point in the neighborhood and point P represents the target time series data: the local reachable density of point P is the reciprocal of the reachable distance between point P and the other neighborhood points.
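As a sketch of how this density may be written out, the standard LOF definition averages the reachable distance over the neighborhood before taking the reciprocal. The averaging is an assumption here, since the text above only states "the reciprocal of the reachable distance":

```latex
\rho_k(P) = \left( \frac{1}{|N_k(P)|} \sum_{O \in N_k(P)} d_k(P,O) \right)^{-1}
```

For instance, if point P has two neighborhood points with reachable distances 2 and 4, the mean reachable distance is 3 and ρ_k(P) = 1/3.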

After the model has been trained and iteratively updated, it can be used directly to detect abnormal data. Because hard-example mining is performed on real-time data and the model is iterated accordingly, production anomalies can be found more accurately and in real time.

Step S102, product operation data is received.

In an embodiment of the present invention, the product operation data is real-time production data in production, for example, data about funds, data about bonds, and the like.

In the embodiment of the present invention, after the existing historical sample data in production is used for basic model training to obtain an initial model (i.e., the data detection model), the data detection model can be deployed in the anomaly detection system built in production. During model training, the data detection model mines the sample data that causes the loss value rate to exceed a preset probability threshold (for example, sixty percent) and retrains on that sample data, where the loss value rate indicates the ratio of the amount of sample data misclassified by the model to the total amount of data. The anomaly detection system may be divided into different parks (e.g., parks A and B), and different parks may perform different functions; that is, the system may be deployed in dual-park, dual-active mode to relieve the pressure on a single park. The product operation data can therefore be sent to the corresponding target park according to its type.
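The loss value rate can be illustrated with a small sketch. It assumes the rate is evaluated per batch of labeled samples and that a batch is flagged for retraining when the rate strictly exceeds the threshold; the function names and the batch granularity are assumptions, not details given in the text.

```python
def loss_value_rate(predictions, labels):
    """Ratio of misclassified samples to the total number of samples."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("at least one sample is required")
    wrong = sum(1 for p, y in zip(predictions, labels) if p != y)
    return wrong / len(labels)


def needs_retraining(predictions, labels, threshold=0.6):
    """True when the loss value rate exceeds the preset probability threshold."""
    return loss_value_rate(predictions, labels) > threshold


# Hypothetical batch: 1 = abnormal, 0 = normal; four of five predictions are wrong.
batch_labels = [0, 1, 0, 1, 0]
batch_preds = [1, 0, 1, 0, 0]
rate = loss_value_rate(batch_preds, batch_labels)
flagged = needs_retraining(batch_preds, batch_labels)
```

With the sixty percent example threshold, this batch would be mined and fed back into training.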

In the embodiment of the present invention, each park may correspond to one anomaly detection system, and the parks may run in parallel: two parks may run in dual-active mode, three parks in triple-active mode, and so on. Running multiple parks in parallel reduces system pressure and increases data processing speed, and if one of the parks fails, the multi-active (e.g., dual-active) function can switch the anomaly detection system (i.e., select the anomaly detection system corresponding to another park) so that data processing continues.
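A minimal sketch of type-based routing with multi-active failover is shown below. The `Park` class, the type-to-park mapping, and the health flag are all hypothetical; the text does not specify how park health is determined or how data types map to parks.

```python
class Park:
    """Hypothetical stand-in for one park's anomaly detection system."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def detect(self, record):
        # Placeholder for the real model call; here it only tags the record.
        return {"park": self.name, "record": record}


# Assumed mapping from data type to preferred park (not specified in the text).
PREFERRED_PARK = {"fund": "A", "bond": "B"}


def route(record, parks):
    """Send a record to its preferred park, falling back to any healthy one."""
    preferred = PREFERRED_PARK.get(record["type"])
    # Stable sort: the preferred park first, the rest in their original order.
    order = sorted(parks, key=lambda name: name != preferred)
    for name in order:
        if parks[name].healthy:
            return parks[name].detect(record)
    raise RuntimeError("no healthy park available")


parks = {"A": Park("A", healthy=False), "B": Park("B")}
result = route({"type": "fund", "value": 0.97}, parks)
```

Because park A is marked as failed in this example, the fund record is processed by park B instead, which mirrors the dual-active switch described above.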

In the embodiment of the present invention, the product operation data can be detected by the anomaly detection system, whether the data is abnormal is preliminarily determined according to the detection result, and the data is stored in the sample library.

In the embodiment of the present invention, the anomaly alarm system detects in real time whether a new suspected anomaly exists, displays the anomaly, and notifies the relevant personnel. After receiving the notification, the relevant personnel confirm, according to the actual situation, whether the suspected anomaly is a real anomaly and record the confirmation in the anomaly alarm system (i.e., a data confirmation result fed back by the alarm system is received). After confirmation, data that is abnormal enters the positive sample library (i.e., if the data confirmation result indicates that the detection result is real, the abnormal data is classified into the positive sample library), and data that is not abnormal enters the negative sample library (i.e., if the data confirmation result indicates that the detection result is false, the abnormal data is classified into the negative sample library). The alarm notification system captures the content of the positive sample library and quickly sends the real alarm data to the relevant personnel in real time.
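The confirmation feedback loop can be sketched as follows, with in-memory lists standing in for the positive and negative sample libraries. The record layout, the `id` field, and the de-duplication of already sent alerts are assumptions made for illustration.

```python
def file_confirmation(record, is_real_anomaly, positive_library, negative_library):
    """File a reviewed alert according to the operator's confirmation."""
    target = positive_library if is_real_anomaly else negative_library
    target.append(record)


def alerts_to_send(positive_library, sent_ids):
    """Return confirmed anomalies not yet forwarded, and mark them as sent."""
    fresh = [r for r in positive_library if r["id"] not in sent_ids]
    sent_ids.update(r["id"] for r in fresh)
    return fresh


positive, negative, sent = [], [], set()
file_confirmation({"id": 1, "success_rate": 0.41}, True, positive, negative)
file_confirmation({"id": 2, "success_rate": 0.96}, False, positive, negative)
first_batch = alerts_to_send(positive, sent)   # the confirmed anomaly
second_batch = alerts_to_send(positive, sent)  # nothing new to send
```

Only confirmed anomalies reach the notification step, and the false alarm is kept in the negative library for later training.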

The embodiment of the present invention provides a method for improving the accuracy of an artificial intelligence alarm system. It solves the problem that detection accuracy declines when an artificial intelligence model in production is updated and trained by the traditional method, and, by performing hard-example mining on real-time data and iterating the model, it can find production anomalies more accurately and in real time. The method is suitable for improving the accuracy of alarm systems based on artificial intelligence models in production, is compatible with various artificial intelligence algorithms, and has good expandability.

Example two

FIG. 2 is a schematic diagram of an alternative method for improving the accuracy of an artificial intelligence alarm system according to an embodiment of the present invention, including the following steps:

Step 1: model confirmation and training. The existing historical sample data in production is used for basic model training to obtain an initial model, and the model is put into production to build the anomaly detection system (the anomaly detection system may be divided into different parks (e.g., parks A and B), and different parks may perform different functions; that is, the system may be deployed in dual-park, dual-active mode to relieve the pressure on a single park).

Step 2: production data is accessed to the corresponding park in real time. The initial model detects the data through the anomaly detection system, whether the data is abnormal is preliminarily determined according to the model output, and the data is stored in a sample library.

Step 3: alarm detection. The anomaly alarm system detects in real time whether a new suspected anomaly exists, displays the anomaly, and notifies development and operation and maintenance personnel to confirm whether the anomaly is real.

Step 4: after receiving the notification, the development and operation and maintenance personnel confirm whether the suspected anomaly is a real anomaly according to the actual situation and record the confirmation in the anomaly alarm system. After confirmation, data that is abnormal enters the positive sample library, and data that is not abnormal enters the negative sample library.

The alarm notification system captures the content of the positive sample library and quickly sends the real alarm data to the relevant personnel in real time.

Step 5: model update iteration. The historical sample data (i.e., obtained from the sample library) and the hard samples mined from the production real-time data are input into the model together for iterative training, so as to update the model.

Fig. 3 is a flowchart of optional hard-example mining according to an embodiment of the present invention. As shown in Fig. 3, in order to improve the accuracy of hard-example mining, the embodiment of the present invention may perform mining with two algorithms. Firstly, anomaly detection can be performed on the real-time data with the 3-Sigma rule (the real-time data is required to follow a normal distribution; if it does not, a logarithmic transformation can be used to bring it to a normal distribution). A sample mean is calculated over the sample data of the index to be detected (such as success rate or transaction time consumption) to obtain the index detection mean value μ, and a sample variance calculation is then performed to obtain the index detection variance value σ. Sample data outside the 3-Sigma range (μ−3σ, μ+3σ) is confirmed as the hard negative sample set, and sample data inside the 3-Sigma range (μ−3σ, μ+3σ) is confirmed as the positive sample set.
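The 3-Sigma stage can be sketched as below. The sketch assumes that σ is used as the sample standard deviation (the usual meaning in the 3-Sigma rule, although the text calls it a variance value), that the optional logarithmic transformation applies to strictly positive values, and that the transaction durations are hypothetical.

```python
import math
import statistics


def three_sigma_split(values, log_transform=False):
    """Split values into (positive_set, hard_negative_set) by the 3-Sigma rule."""
    # Optionally move skewed data closer to a normal distribution first.
    data = [math.log(v) for v in values] if log_transform else list(values)
    mu = statistics.mean(data)
    sigma = statistics.stdev(data)  # sample standard deviation (assumption)
    low, high = mu - 3 * sigma, mu + 3 * sigma
    positive, hard_negative = [], []
    for original, x in zip(values, data):
        (positive if low <= x <= high else hard_negative).append(original)
    return positive, hard_negative


# Hypothetical transaction durations in milliseconds with one extreme value.
durations = [100.0 + (i % 5) for i in range(19)] + [500.0]
normal, hard = three_sigma_split(durations)
```

Values that land in `hard` would go to the hard sample library, while `normal` is what the LOF stage examines next.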

If the 3-Sigma detection assigns the real-time data to the hard negative sample set (i.e., the real-time data is abnormal), the real-time data is stored in the hard sample library. If the 3-Sigma detection assigns it to the positive sample set (i.e., the real-time data is normal), detection continues with the LOF algorithm (the LOF algorithm requires two-dimensional data, so an index value, such as the success rate or transaction time consumption, can be combined with the time point at which it occurs into two-dimensional time series data).

Firstly, the k-th distance is calculated: the distances between point P (i.e., the selected target real-time data point) and the other points are sorted from small to large, and the k-th of these is the k-distance. Secondly, the k-distance neighborhood set is calculated: all points whose distance to point P is less than or equal to the k-distance, k points in total (more if distances tie), denoted as N_k(P). Thirdly, the reachable distance is calculated: if the distance to point P is less than or equal to the k-distance, the reachable distance is the k-distance; otherwise it is the actual distance, as shown in the following formula:

d_k(P,O)＝max{d_k(O),d(P,O)}；

Fourthly, the local reachable density is calculated (where point O is a point in the neighborhood) as the reciprocal of the reachable distance between point P and the other neighborhood points.

Fifthly, the local outlier factor is calculated by the following formula, i.e., the mean of the local reachable densities of the points in the neighborhood divided by the local reachable density of point P, where ρ_k denotes the local reachable density and |N_k(P)| the number of points in the neighborhood:

LOF_k(P)＝(1/|N_k(P)|)·Σ_{O∈N_k(P)} ρ_k(O)/ρ_k(P)；

LOF_k(P) represents the ratio of the mean local reachable density of the other points in the neighborhood N_k(P) of point P to the local reachable density of point P. If the ratio is close to 1, the density of P is almost the same as that of its neighborhood points, and P may belong to the same cluster as its neighborhood. If the ratio is less than 1, the density of P is higher than that of its neighborhood points, and P is a dense point. If the ratio is greater than 1, the density of P is lower than that of its neighborhood points, and P is an abnormal point; that is, real-time data with a ratio greater than 1 enters the hard negative sample library.
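The five steps above can be combined into one end-to-end sketch. It assumes Euclidean distance, takes the local reachable density as the reciprocal of the mean reachable distance (the standard LOF reading of "reciprocal of the reachable distance"), and requires distinct points so that no density is infinite. The sample coordinates are hypothetical.

```python
import math


def _k_neighbors(points, i, k):
    """Steps 1-2: k-distance of points[i] and its k-distance neighborhood."""
    dists = sorted(
        (math.dist(points[i], q), j) for j, q in enumerate(points) if j != i
    )
    k_dist = dists[k - 1][0]
    return k_dist, [j for d, j in dists if d <= k_dist]


def lof_scores(points, k):
    """Return LOF_k(P) for every point (points are assumed to be distinct)."""
    n = len(points)
    info = [_k_neighbors(points, i, k) for i in range(n)]

    def reach(p, o):
        # Step 3: d_k(P, O) = max{d_k(O), d(P, O)}
        return max(info[o][0], math.dist(points[p], points[o]))

    # Step 4: local reachable density = 1 / mean reachable distance.
    lrd = []
    for p in range(n):
        hood = info[p][1]
        lrd.append(len(hood) / sum(reach(p, o) for o in hood))

    # Step 5: mean density of the neighbors divided by the density of P.
    return [
        (sum(lrd[o] for o in info[p][1]) / len(info[p][1])) / lrd[p]
        for p in range(n)
    ]


# Hypothetical two-dimensional points: a tight cluster plus one remote point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (8.0, 8.0)]
scores = lof_scores(points, k=2)
ratio_threshold = 1.0
hard_negatives = [p for p, s in zip(points, scores) if s > ratio_threshold]
```

In this example only the remote point scores well above 1 and is routed to the hard negative sample library. In practice a threshold slightly above 1 may be preferable, since points inside a cluster typically score close to, rather than exactly, 1.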

The embodiment of the invention has the following beneficial effects:

(1) by performing hard-example mining on real-time data and iterating the model, production anomalies can be found more accurately and in real time;

(2) the artificial intelligence model trained by the present application is applicable to the production alarm systems of various products, is compatible with various artificial intelligence algorithms, and has good expandability.

Example three

The data anomaly detection device provided in the present embodiment includes a plurality of implementation units, each implementation unit corresponding to each implementation step in the first embodiment.

Fig. 4 is a schematic diagram of an alternative data anomaly detection apparatus according to an embodiment of the present invention, as shown in fig. 4, the detection apparatus may include: a receiving unit 40, a first transmitting unit 42, an analyzing unit 44, a second transmitting unit 46, wherein,

a receiving unit 40 for receiving product operation data;

the first sending unit 42 is configured to send the product operation data to a target park, where the target park is connected to an anomaly detection system, a data detection model runs in the anomaly detection system, the data detection model mines sample data that causes a loss value rate to exceed a preset probability threshold during model training and retrains on the sample data, and the loss value rate indicates the ratio of the amount of sample data misclassified by the model to the total amount of data;

the analysis unit 44 is used for analyzing the product operation data by adopting an anomaly detection system to obtain a detection result;

and a second sending unit 46, configured to send the abnormal data in the detection result to the alarm system.

The detection device can receive product operation data through the receiving unit 40 and send the product operation data to the target park through the first sending unit 42, where the target park is connected to an anomaly detection system, a data detection model runs in the anomaly detection system, the data detection model mines sample data that causes a loss value rate to exceed a preset probability threshold during model training and retrains on the sample data, and the loss value rate indicates the ratio of the amount of sample data misclassified by the model to the total amount of data. The analysis unit 44 analyzes the product operation data with the anomaly detection system to obtain a detection result, and the second sending unit 46 sends the abnormal data in the detection result to the alarm system. In the embodiment of the present invention, performing hard-example mining on the real-time product operation data and retraining on it improves the detection rate of the anomaly detection system, so that abnormal data can be found more accurately and in real time, which solves the technical problem of low detection accuracy of the anomaly detection system in the related art.

Optionally, the detection device further includes: a first acquisition module, configured to acquire, before the product operation data is received, product operation data within a first preset time period in the historical process to obtain historical sample data; a first receiving module, configured to receive product error data of the product operation data input by an external terminal device, where the product error data indicates the sample data that causes the loss value rate to exceed the preset probability threshold during model training; and a first training module, configured to input the historical sample data and the product error data into the data detection model so as to iteratively train the data detection model.

Optionally, the first training module includes: a first detection submodule, configured to perform anomaly detection on the historical sample data by using a preset normal distribution algorithm to obtain a negative sample set and a positive sample set; and a first analysis submodule, configured to analyze the sample data in the positive sample set by using the local outlier factor (LOF) algorithm, so as to determine wrong data in the positive sample set and complete model training.

Optionally, the first detection submodule includes: the first extraction submodule is used for extracting sample data corresponding to each index to be detected from historical sample data by taking each index to be detected as a reference; the first calculation submodule is used for calculating the sample mean value of all the extracted sample data to obtain an index detection mean value; the second calculation submodule is used for performing sample variance calculation on all the extracted sample data to obtain an index detection variance value; the third calculation submodule is used for calculating a normal distribution area based on the index detection mean value and the index detection variance value; and the first classification submodule is used for classifying the historical sample data in the normal distribution area into a positive sample set and classifying the historical sample data which does not fall in the normal distribution area into a negative sample set.

Optionally, the first analysis sub-module includes: the first combination submodule is used for combining the index data corresponding to each index to be detected and the corresponding data time in the historical sample data to obtain a plurality of time sequence data; the first determining submodule is used for determining surrounding neighborhood points of the target time sequence data by taking each time sequence data as a center to obtain a neighborhood set; the fourth calculation submodule is used for calculating the reachable distance of other neighborhood points in the neighborhood set and the local reachable density of the target time sequence data; the fifth calculation submodule is used for calculating the density mean value of the local reachable density of all other neighborhood points in the neighborhood set and calculating the density ratio between the density mean value and the local reachable density of the target time sequence data; and the second determining submodule is used for determining that the target time sequence data is wrong data in the positive sample set if the density ratio is larger than the preset ratio threshold.

Optionally, the first determining submodule includes: a first acquisition submodule, configured to acquire the vector distance values between the target time series data and the other time series data; a first sorting submodule, configured to sort all the vector distance values to obtain a sorting result; and a second classification submodule, configured to classify, based on the sorting result, the time series data whose vector distance value is less than or equal to a preset distance threshold into the neighborhood set.

Optionally, the fourth calculation submodule includes: a third determining submodule, configured to determine the reachable distance between another neighborhood point and the target time series data to be the preset distance threshold if the actual distance between that neighborhood point in the neighborhood set and the target time series data is less than or equal to the k-distance; a fourth determining submodule, configured to determine the reachable distance between another neighborhood point and the target time series data to be the actual distance if the actual distance between that neighborhood point in the neighborhood set and the target time series data is greater than the k-distance; and a sixth calculation submodule, configured to calculate the reciprocal of the reachable distance between the target time series data and the other neighborhood points in the neighborhood set to obtain the local reachable density of the target time series data.

Optionally, the detection device further includes: a second receiving module, configured to receive, after the abnormal data in the detection result is sent to the alarm system, a data confirmation result fed back by the alarm system; a first classification module, configured to classify the abnormal data into a positive sample library if the data confirmation result indicates that the detection result is real; and a second classification module, configured to classify the abnormal data into a negative sample library if the data confirmation result indicates that the detection result is false.

The above-mentioned detection device may further include a processor and a memory, the above-mentioned receiving unit 40, the first transmitting unit 42, the analyzing unit 44, the second transmitting unit 46, etc. are all stored in the memory as program units, and the processor executes the above-mentioned program units stored in the memory to implement the corresponding functions.

The processor comprises a kernel, and the kernel calls a corresponding program unit from the memory. The kernel can be set to be one or more, and abnormal data in the detection result is sent to the alarm system by adjusting the kernel parameters.

The memory may include volatile memory in a computer readable medium, such as random access memory (RAM), and/or nonvolatile memory, such as read-only memory (ROM) or flash memory (flash RAM), and the memory includes at least one memory chip.

The present application further provides a computer program product which, when executed on a data processing device, is adapted to execute a program initialized with the following method steps: receiving product operation data, and sending the product operation data to a target park, where the target park is connected to an anomaly detection system, a data detection model runs in the anomaly detection system, the data detection model mines sample data that causes a loss value rate to exceed a preset probability threshold during model training and retrains on the sample data, and the loss value rate indicates the ratio of the amount of sample data misclassified by the model to the total amount of data; analyzing the product operation data with the anomaly detection system to obtain a detection result; and sending the abnormal data in the detection result to an alarm system.

According to another aspect of the embodiments of the present invention, there is also provided an electronic device, including: a processor; and a memory for storing executable instructions for the processor; wherein the processor is configured to perform any of the data anomaly detection methods described above via execution of executable instructions.

The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.

In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.

In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units may be a logical division, and in actual implementation, there may be another division, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.

The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various improvements and refinements without departing from the principle of the present invention, and these improvements and refinements should also be regarded as falling within the protection scope of the present invention.

Claims

1. A data anomaly detection method is characterized by comprising the following steps:

receiving product operation data;

sending the product operation data to a target park, wherein the target park is connected to an anomaly detection system, a data detection model runs in the anomaly detection system, the data detection model mines sample data that causes a loss value rate to exceed a preset probability threshold during model training and retrains on the sample data, and the loss value rate indicates the ratio of the amount of sample data misclassified by the model to the total amount of data;

analyzing the product operation data by adopting the anomaly detection system to obtain a detection result;

and sending abnormal data in the detection result to an alarm system.

2. The detection method according to claim 1, wherein before receiving the product operation data, the detection method further comprises:

obtaining product operation data in a first preset time period in a historical process to obtain historical sample data;

receiving product error data of product operation data input by external terminal equipment, wherein the product error data indicate sample data which causes a loss value rate to be greater than a preset probability threshold value in a model training process;

and inputting the historical sample data and the product error data into a data detection model so as to carry out iterative training on the data detection model.

3. The method of claim 2, wherein the step of inputting the historical sample data and the product error data into a data detection model for iterative training of the data detection model comprises:

performing anomaly detection on the historical sample data by adopting a preset normal distribution algorithm to obtain a negative sample set and a positive sample set;

and analyzing the sample data in the positive sample set by adopting a local outlier factor (LOF) algorithm to determine wrong data in the positive sample set and complete model training.

4. The detection method according to claim 3, wherein the step of performing anomaly detection on the historical sample data by using a preset normal distribution algorithm to obtain a negative sample set and a positive sample set comprises:

taking each index to be detected as a reference, extracting sample data corresponding to each index to be detected in the historical sample data;

calculating the sample mean value of all the extracted sample data to obtain an index detection mean value;

performing sample variance calculation on all the extracted sample data to obtain an index detection variance value;

calculating a normal distribution area based on the index detection mean value and the index detection variance value;

and classifying the historical sample data in the normal distribution area into a positive sample set, and classifying the historical sample data which does not fall in the normal distribution area into the negative sample set.

5. The detection method according to claim 3, wherein the step of analyzing the sample data in the positive sample set by adopting a local outlier factor (LOF) algorithm to determine the wrong data in the positive sample set comprises:

combining the index data corresponding to each index to be detected and the corresponding data time in the historical sample data to obtain a plurality of time sequence data;

determining, with each time series data as a center, the neighborhood points around the target time series data to obtain a neighborhood set;

calculating the reachable distance of other neighborhood points in the neighborhood set and the local reachable density of the target time sequence data;

calculating the density mean value of the local reachable densities of all other neighborhood points in the neighborhood set, and calculating the density ratio between the density mean value and the local reachable density of the target time sequence data;

and if the density ratio is larger than a preset ratio threshold, determining that the target time sequence data is wrong data in the positive sample set.

6. The detection method according to claim 5, wherein the step of determining a neighborhood point around the target time series data by centering on each time series data to obtain a neighborhood set comprises:

acquiring a vector distance value between the target time sequence data and other time sequence data;

sequencing all vector distance values to obtain a sequencing result;

and classifying the time sequence data with the vector distance value smaller than or equal to a preset distance threshold value into the neighborhood set based on the sorting result.
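The neighborhood construction in this claim can be sketched as follows. The sketch assumes that each time sequence data is a numeric tuple such as `(data_time, index_value)` and that the vector distance is Euclidean; neither is stated in the claim, and the function name `neighborhood_set` is hypothetical.

```python
import math


def neighborhood_set(target_index, series, distance_threshold):
    """Return indices of points within distance_threshold of the target.

    Euclidean distance is an assumption; the claim only says
    "vector distance value".
    """
    target = series[target_index]
    # Vector distance from the target to every other time sequence data.
    distances = [
        (math.dist(target, point), i)
        for i, point in enumerate(series)
        if i != target_index
    ]
    distances.sort()  # sequencing result, nearest first
    # Keep only the points at or below the preset distance threshold.
    return [i for d, i in distances if d <= distance_threshold]
```

Because the distances are sorted first, the returned indices are ordered from nearest to farthest, which is convenient when a later step needs the k-th nearest neighbor.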

7. The detection method according to claim 5, wherein the step of calculating the reachable distance of other neighborhood points in the neighborhood set and the local reachable density of the target time sequence data comprises:

if the actual distance between another neighborhood point in the neighborhood set and the target time sequence data is less than or equal to the k-distance, determining the reachable distance between that neighborhood point and the target time sequence data to be a preset distance threshold;

if the actual distance between another neighborhood point in the neighborhood set and the target time sequence data is greater than the k-distance, determining the reachable distance between that neighborhood point and the target time sequence data to be the actual distance;

and calculating the reciprocal of the reachable distance between the target time sequence data and other neighborhood points in the neighborhood set to obtain the local reachable density of the target time sequence data.
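The reachable distance rule and the density calculation of this claim can be sketched as below. Two assumptions are made and should be read as such: the "preset distance threshold" is taken to equal the k-distance passed in as `k_distance`, and the density is the reciprocal of the mean reachable distance over the neighborhood set, as in the standard local outlier factor formulation. The function names are hypothetical.

```python
def reachable_distance(actual_distance, k_distance):
    """Reachable distance between a neighborhood point and the target.

    Assumes the preset distance threshold equals k_distance.
    """
    if actual_distance <= k_distance:
        return k_distance
    return actual_distance


def local_reachable_density(actual_distances, k_distance):
    """Reciprocal of the mean reachable distance over the neighborhood set."""
    reach = [reachable_distance(d, k_distance) for d in actual_distances]
    return 1.0 / (sum(reach) / len(reach))
```

Flooring the distance at the k-distance keeps very close neighbors from producing an arbitrarily large density, which stabilizes the density ratio computed afterwards.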

8. The detection method according to claim 1, wherein after sending the abnormal data in the detection result to an alarm system, the detection method further comprises:

receiving a data confirmation result fed back by the alarm system;

if the data confirmation result indicates that the detection result is real, classifying the abnormal data into a positive sample library;

and if the data confirmation result indicates that the detection result is false, classifying the abnormal data into a negative sample library.
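The feedback step of this claim can be sketched as a simple routing function. The sketch assumes the data confirmation result has been reduced to a boolean and that the two sample libraries are in-memory lists; the name `route_confirmed_data` and these representations are hypothetical.

```python
def route_confirmed_data(abnormal_data, confirmed_real, positive_library, negative_library):
    """File abnormal data by the alarm system's confirmation result.

    confirmed_real=True means the detection result was confirmed as real,
    so the data goes to the positive sample library; otherwise it goes to
    the negative sample library.
    """
    library = positive_library if confirmed_real else negative_library
    library.append(abnormal_data)
    return library
```

Records accumulated in the two libraries this way can later serve as labeled samples when the data detection model is retrained.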

9. A data anomaly detection apparatus, characterized by comprising:

the receiving unit is used for receiving product operation data;

the first sending unit is used for sending the product operation data to a target park, wherein the target park is connected to an anomaly detection system, a data detection model runs in the anomaly detection system, the data detection model mines, in a model training process, sample data causing a loss value rate to be larger than a preset probability threshold value and retrains on the sample data, and the loss value rate is used for indicating a ratio of the quantity of sample data misclassified by the model to the total data quantity;

the analysis unit is used for analyzing the product operation data by adopting the anomaly detection system to obtain a detection result;

and the second sending unit is used for sending the abnormal data in the detection result to the alarm system.

10. An electronic device, comprising:

a processor; and

a memory for storing executable instructions of the processor;

wherein the processor is configured to perform the data anomaly detection method of any one of claims 1 to 8 via execution of the executable instructions.

11. A computer-readable storage medium, comprising a stored computer program, wherein when the computer program runs, a device in which the computer-readable storage medium is located is controlled to execute the data anomaly detection method according to any one of claims 1 to 8.