CN117216597A

CN117216597A - Data anomaly detection method and device, storage medium and computer equipment

Info

Publication number: CN117216597A
Application number: CN202311177487.6A
Authority: CN
Inventors: 张天力
Original assignee: Kangjian Information Technology Shenzhen Co Ltd
Current assignee: Kangjian Information Technology Shenzhen Co Ltd
Priority date: 2023-09-12
Filing date: 2023-09-12
Publication date: 2023-12-12

Abstract

The invention discloses a data anomaly detection method, a device, a storage medium and computer equipment, relates to the technical field of information and digital medical treatment, and mainly aims to improve the anomaly detection accuracy of data. The method comprises the following steps: acquiring data in a data source to be detected; classifying the data in the data source to be detected to obtain a target cluster category to which the data in the data source to be detected belongs; determining an abnormality detection mode corresponding to the target cluster type; and carrying out anomaly detection on the data in the data source to be detected by using the anomaly detection mode to obtain the anomaly data in the data source to be detected. The method and the device are suitable for detecting the abnormality of the data in the data source.

Description

Data anomaly detection method and device, storage medium and computer equipment

Technical Field

The present invention relates to the field of information technologies, and in particular, to a method and apparatus for detecting data anomalies, a storage medium, and a computer device.

Background

Anomaly detection is an important problem in research in the fields of data mining and statistical analysis, and the objective is to detect data points from raw data that are significantly different from other data, and anomaly detection has wide application in many fields, such as financial fraud detection, medical diagnosis, network intrusion detection, etc.

Currently, all data is usually detected by a unified detection method. However, because a single detection method is not necessarily applicable to all data, the anomaly detection accuracy for part of the data is low.

Disclosure of Invention

The invention provides a data anomaly detection method, a data anomaly detection device, a storage medium and computer equipment, which mainly aim at improving the anomaly detection accuracy of data.

According to a first aspect of the present invention, there is provided a data anomaly detection method comprising:

acquiring data in a data source to be detected;

classifying the data in the data source to be detected to obtain a target cluster category to which the data in the data source to be detected belongs;

determining an abnormality detection mode corresponding to the target cluster type;

and carrying out anomaly detection on the data in the data source to be detected by using the anomaly detection mode to obtain the anomaly data in the data source to be detected.

Optionally, the classifying the data in the data source to be detected to obtain a target cluster category to which the data in the data source to be detected belongs includes:

determining a classification feature vector corresponding to the data in the data source to be detected and determining a reference feature vector corresponding to the reference data under different clustering categories;

calculating, according to the classification feature vector and the reference feature vector, the similarity distance between the data in the data source to be detected and the reference data under each of the different clustering categories;

and determining the target cluster category to which the data in the data source to be detected belongs based on the similarity distance.

Optionally, before the determining the reference feature vector corresponding to the reference data under the different cluster categories, the method further includes:

determining initial centroids corresponding to the k clusters;

calculating the distance between the reference data and the initial centroids corresponding to the k clusters, and dividing the reference data into the k clusters based on the distance;

determining updated centroids corresponding to the k clusters based on reference data in the k clusters;

and dividing the reference data into the k clusters again based on the updated centroids until the updated centroids are unchanged, and determining the reference data finally divided into the k clusters as the reference data in different cluster categories.

Optionally, the performing anomaly detection on the data in the data source to be detected by using the anomaly detection mode to obtain anomaly data in the data source to be detected includes:

If the data in the data source to be detected belongs to the asset clustering category, determining a preset box-whisker diagram detection algorithm corresponding to the asset clustering category;

performing anomaly detection on the asset data in the data source to be detected by using the preset box-whisker diagram detection algorithm to obtain anomaly data in the asset data;

if the data in the data source to be detected belongs to a log clustering class, determining a preset abnormal detection model corresponding to the log clustering class;

and carrying out anomaly detection on the log data in the data source to be detected by using the preset anomaly detection model to obtain anomaly data in the log data.

Optionally, the performing anomaly detection on the asset data in the data source to be detected by using the preset box-whisker diagram detection algorithm to obtain anomaly data in the asset data includes:

determining a median data value corresponding to the asset data;

determining a lower quartile corresponding to the asset data based on a minimum data value in the asset data and the median data value;

determining an upper quartile corresponding to the asset data based on a maximum data value in the asset data and the median data value;

calculating the distance between the lower quartile and the upper quartile to obtain an interquartile range corresponding to the asset data;

calculating an anomaly detection lower limit value corresponding to the asset data according to the lower quartile and the interquartile range;

calculating an anomaly detection upper limit value corresponding to the asset data according to the upper quartile and the interquartile range;

and determining, among the asset data, data outside the range from the anomaly detection lower limit value to the anomaly detection upper limit value as anomaly data in the asset data.

Optionally, before performing anomaly detection on the log data in the data source to be detected by using the preset anomaly detection model to obtain the anomaly data in the log data, the method further includes:

constructing a preset initial anomaly detection model, and acquiring sample data and actual anomaly data in the sample data;

inputting the sample data into the preset initial abnormality detection model to detect abnormal data, and obtaining predicted abnormal data in the sample data;

constructing a loss function corresponding to the preset initial anomaly detection model based on the actual anomaly data and the predicted anomaly data;

constructing the preset anomaly detection model based on the loss function;

performing anomaly detection on the log data in the data source to be detected by using the preset anomaly detection model to obtain the anomaly data in the log data, including:

obtaining an abnormality detection type;

determining a detection prompt word corresponding to the abnormality detection type;

and inputting the log data in the data source to be detected and the detection prompt word into the preset abnormality detection model to detect the abnormal data, so as to obtain the abnormal data in the log data.

Optionally, after performing anomaly detection on the data in the data source to be detected by using the anomaly detection mode to obtain the anomaly data in the data source to be detected, the method further includes:

determining an abnormal alarm template corresponding to the target cluster type;

based on the abnormal alarm template, acquiring various abnormal information corresponding to the abnormal data;

based on the abnormal information, generating an alarm prompt list corresponding to the abnormal data, and sending the alarm prompt list to an operation and maintenance personnel terminal, so that the operation and maintenance personnel of the operation and maintenance personnel terminal can correct the abnormal data according to the alarm prompt list.

According to a second aspect of the present invention, there is provided a data anomaly detection apparatus comprising:

the acquisition unit is used for acquiring a data source to be detected;

the classification unit is used for classifying the data in the data source to be detected to obtain a target cluster category to which the data in the data source to be detected belongs;

the determining unit is used for determining an abnormality detection mode corresponding to the target cluster type;

the abnormality detection unit is used for carrying out abnormality detection on the data in the data source to be detected by using the abnormality detection mode to obtain abnormal data in the data source to be detected.

According to a third aspect of the present invention, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the above data abnormality detection method.

According to a fourth aspect of the present invention, there is provided a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the above data anomaly detection method when executing the program.

According to the data anomaly detection method, the device, the storage medium and the computer equipment provided by the invention, compared with the current mode of detecting all data through a unified detection method, the data in the data source to be detected is obtained; classifying the data in the data source to be detected to obtain a target cluster category to which the data in the data source to be detected belongs; then determining an abnormality detection mode corresponding to the target cluster type; and finally, carrying out anomaly detection on the data in the data source to be detected by using the anomaly detection mode to obtain the anomaly data in the data source to be detected. The method comprises the steps of determining a target cluster type to which data belongs in a data source, determining an abnormality detection mode corresponding to the target cluster type, and finally carrying out abnormality detection on the data by utilizing the abnormality detection mode, namely determining the abnormality detection mode by the cluster type to which the data belongs, wherein the different abnormality detection modes correspond to the data under different cluster types, so that the determined abnormality detection mode can be suitable for the data of the type, and the abnormality detection accuracy of the data can be improved.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:

FIG. 1 shows a flow chart of a data anomaly detection method provided by an embodiment of the application;

FIG. 2 is a flowchart of another method for detecting data anomalies according to an embodiment of the present application;

FIG. 3 is a schematic structural diagram of a data anomaly detection device according to an embodiment of the present application;

FIG. 4 is a schematic structural diagram of another data anomaly detection device according to an embodiment of the present application;

FIG. 5 shows a schematic physical structure of a computer device according to an embodiment of the present application.

Detailed Description

The application will be described in detail hereinafter with reference to the drawings in conjunction with embodiments. It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be combined with each other.

At present, all data is detected through a unified detection method. Because that method is not necessarily suitable for all data, the anomaly detection accuracy for part of the data is low.

In order to solve the above problem, an embodiment of the present invention provides a data anomaly detection method, as shown in fig. 1, including:

101. Acquiring data in the data source to be detected.

The data source is the origin that provides the data. The data source to be detected contains a large amount of data of at least one type.

For the embodiment of the invention, the data in the data source to be detected can be medical insurance information data, personal health record data, system operation data, asset data and the like. The data may be acquired from the data source to be detected using crawler technology.

102. Classifying the data in the data source to be detected to obtain the target cluster category to which the data in the data source to be detected belongs.

The target cluster may be any data category, such as an asset cluster category, a log cluster category, a medical cluster category, and the like.

For the embodiment of the invention, after the data in the data source to be detected is acquired, the data is first classified so that a suitable anomaly detection mode can be determined for it, and the target cluster category to which the data belongs is determined from the classification result. For example, if classification shows that one part of the data belongs to medical record type data and another part belongs to asset type data, the anomaly detection mode corresponding to the medical record type data and the anomaly detection mode corresponding to the asset type data are determined separately, and each type of data is then checked with its own anomaly detection mode.

103. And determining an abnormality detection mode corresponding to the target cluster type.

For the embodiment of the invention, different cluster categories correspond to different anomaly detection modes. After the data in the data source to be detected is classified, if one part of the data belongs to insurance type data and another part belongs to log type data, the anomaly detection mode corresponding to the insurance type data is queried in a preset detection method configuration table and used to perform anomaly detection on the insurance type data to obtain a detection result; likewise, the anomaly detection mode corresponding to the log type data is queried in the preset detection method configuration table and used to perform anomaly detection on the log type data to obtain a detection result. It should be noted that the preset detection method configuration table stores the anomaly detection modes that have been verified as applicable to the various types of data.
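The table lookup described above can be illustrated with a minimal Python sketch. The table contents, the category names "insurance" and "log", and the two toy detector functions are hypothetical placeholders chosen only for illustration; the embodiment does not specify which detection modes the configuration table holds for these categories.

```python
from typing import Callable, Dict, List, Sequence

Detector = Callable[[Sequence[float]], List[float]]


def detect_negative_values(values: Sequence[float]) -> List[float]:
    """Hypothetical detection mode: flag values below zero."""
    return [v for v in values if v < 0]


def detect_above_threshold(values: Sequence[float]) -> List[float]:
    """Hypothetical detection mode: flag values above a fixed threshold."""
    threshold = 100.0  # assumed threshold, for illustration only
    return [v for v in values if v > threshold]


# Stand-in for the preset detection method configuration table:
# each cluster category maps to the detection mode verified for it.
DETECTION_CONFIG: Dict[str, Detector] = {
    "insurance": detect_negative_values,
    "log": detect_above_threshold,
}


def detect_anomalies(category: str, values: Sequence[float]) -> List[float]:
    """Query the table for the category's detection mode and apply it."""
    if category not in DETECTION_CONFIG:
        raise ValueError(f"no detection mode configured for {category!r}")
    return DETECTION_CONFIG[category](values)
```

Keeping the mapping in one table means a new cluster category can be supported by adding a single entry, without changing the dispatch logic.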

104. And carrying out anomaly detection on the data in the data source to be detected by utilizing an anomaly detection mode to obtain the anomaly data in the data source to be detected.

For the embodiment of the invention, if the data in the data source to be detected belongs to medical record type data, the anomaly detection mode corresponding to the medical record type data is determined in the preset detection method configuration table and then used to detect the data; if the data belongs to medical insurance information data, the anomaly detection mode corresponding to the medical insurance information data is determined in the preset detection method configuration table and then used to detect the data. In this way, the target cluster category to which the data belongs is determined first, the anomaly detection mode corresponding to that category is determined next, and anomaly detection is finally performed with that mode. Because data under different cluster categories corresponds to different anomaly detection modes, the mode that is determined suits the type of data, and the anomaly detection accuracy of the data can be improved.

According to the data anomaly detection method provided by the invention, compared with the current mode of detecting all data through a unified detection method, the data in the data source to be detected is obtained; classifying the data in the data source to be detected to obtain a target cluster category to which the data in the data source to be detected belongs; then determining an abnormality detection mode corresponding to the target cluster type; and finally, carrying out anomaly detection on the data in the data source to be detected by using the anomaly detection mode to obtain the anomaly data in the data source to be detected. The method comprises the steps of determining a target cluster type to which data belongs in a data source, determining an abnormality detection mode corresponding to the target cluster type, and finally carrying out abnormality detection on the data by utilizing the abnormality detection mode, namely determining the abnormality detection mode by the cluster type to which the data belongs, wherein the different abnormality detection modes correspond to the data under different cluster types, so that the determined abnormality detection mode can be suitable for the data of the type, and the abnormality detection accuracy of the data can be improved.

Further, in order to better illustrate the foregoing process of anomaly detection for data, as a refinement and extension of the foregoing embodiment, an embodiment of the present invention provides another data anomaly detection method, as shown in fig. 2, where the method includes:

201. Acquiring data in the data source to be detected.

The data sources include enterprise internal data sources and external data sources. The data sources include various types of data such as databases, data warehouses, log files, data generated by application programs, and the like.

For the embodiment of the invention, the data source acquisition interface can be determined through information such as the attribute of the data source to be detected, then the data source is acquired through the data source acquisition interface, the data is acquired in the data source to be detected by utilizing the crawler technology, and then the data in the data source to be detected is subjected to anomaly detection.

202. And determining the classification feature vectors corresponding to the data in the data source to be detected, and determining the reference feature vectors corresponding to the reference data under different clustering categories.

The reference data refers to data which are accurately classified into different clustering categories in advance.

For the embodiment of the invention, in order to determine the reference feature vectors corresponding to the reference data under different clustering categories, the reference data needs to be clustered first, so as to obtain the reference data under different clustering categories, and based on the method, the method comprises the following steps: determining initial centroids corresponding to the k clusters; calculating the distance between the reference data and the initial centroids corresponding to the k clusters, and dividing the reference data into the k clusters based on the distance; determining updated centroids corresponding to the k clusters based on reference data in the k clusters; and dividing the reference data into the k clusters again based on the updated centroids until the updated centroids are unchanged, and determining the reference data finally divided into the k clusters as the reference data in different cluster categories.

Specifically, initial centroids corresponding to the K clusters are first selected, and the distance from each piece of reference data to each of the K initial centroids is calculated. Each piece of reference data is assigned to the cluster whose initial centroid is closest to it. The centroid of each cluster is then recalculated, and the reference data is divided among the clusters again based on the updated centroids. This division is repeated until the positions of the centroids no longer change, and the reference data finally divided into the different clusters is determined as the reference data under the different cluster categories.
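The clustering loop described above can be sketched in plain Python as follows. This is an illustrative sketch only: the random choice of initial centroids, the use of Euclidean distance, the `max_iter` safety cap and the function name `k_means` are assumptions, since the embodiment does not state how the initial centroids are chosen or which distance is used.

```python
import math
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def k_means(
    points: Sequence[Point], k: int, seed: int = 0, max_iter: int = 100
) -> Tuple[List[Point], List[List[Point]]]:
    """Divide points into k clusters until the centroids stop changing."""
    rng = random.Random(seed)
    # Assumption: initial centroids are k distinct reference points drawn at random.
    centroids: List[Point] = rng.sample(list(points), k)
    clusters: List[List[Point]] = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assign each point to the cluster whose centroid is nearest.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster;
        # an empty cluster keeps its previous centroid.
        updated = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster
            else centroids[j]
            for j, cluster in enumerate(clusters)
        ]
        if updated == centroids:  # centroids unchanged: the division is final
            break
        centroids = updated
    return centroids, clusters
```

For example, calling `k_means(points, k=2)` on two well-separated groups of points returns the two centroids and the reference data finally divided into each cluster.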

Further, after the reference data under the different clustering categories is determined, the data in the data source to be detected needs to be classified according to that reference data. To this end, each character contained in the data in the data source to be detected is determined, together with the embedded vector corresponding to each character, and the embedded vectors are input into a preset feature extraction model for feature extraction to obtain the classification feature vector corresponding to the data in the data source to be detected. Likewise, each character contained in the reference data under the different clustering categories is determined, together with the embedded vector corresponding to each character, and the embedded vectors are input into the preset feature extraction model for feature extraction to obtain the reference feature vectors corresponding to the reference data under the different clustering categories. Finally, the data in the data source to be detected is classified based on the classification feature vector and the reference feature vectors. Determining these feature vectors improves the classification accuracy of the data.

203. And respectively calculating the similarity distances between the data in the data source to be detected and the reference data under different clustering categories according to the classification feature vector and the reference feature vector.

204. And determining the target cluster category to which the data in the data source to be detected belongs based on the similar distance.

For the embodiment of the invention, the similarity distance between the data in the data source to be detected and the reference data under different clustering categories can be calculated by means of cosine similarity, and the specific calculation formula is as follows:

$$\cos(\theta) = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^{2}}\,\sqrt{\sum_{i=1}^{n} y_i^{2}}}$$

where cos(θ) represents the cosine similarity between the data in the data source to be detected and the reference data under different clustering categories, x_i represents the i-th element of the classification feature vector, y_i represents the i-th element of the reference feature vector, n represents the number of elements in the classification feature vector, and i indexes the elements. The similarity between the data in the data source to be detected and the reference data under different clustering categories can be calculated according to this formula. Alternatively, the similarity distance can be calculated as a Euclidean distance; the embodiment of the invention does not specifically limit how the similarity distance between the data in the data source to be detected and the reference data under different clustering categories is calculated.

Further, after calculating the similarity distance between the data in the data source to be detected and the reference data under different clustering categories, if the similarity distance is calculated in a cosine similarity mode, determining the maximum similarity distance among the similarity distances, and determining the clustering category corresponding to the maximum similarity distance as the target clustering category to which the data in the data source to be detected belongs. For example, if the cosine similarity between the data in the data source to be detected and the reference data under the cluster category a is 0.25, the cosine similarity between the data in the data source to be detected and the reference data under the cluster category B is 0.3, and the cosine similarity between the data in the data source to be detected and the reference data under the cluster category C is 0.89, it is finally determined that the data in the data source to be detected belongs to the cluster category C. Further, if a calculation mode of Euclidean distance is adopted to determine the similar distance between the data in the data source to be detected and the reference data under different clustering categories, the clustering category with the smallest Euclidean distance is determined as the target clustering category to which the data in the data source to be detected belongs.
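As an illustration of the selection rule above, the following Python sketch picks the cluster category with the largest cosine similarity or, alternatively, the smallest Euclidean distance. It assumes that each cluster category is represented by a single reference feature vector; the function names and the `metric` parameter are hypothetical and not part of the embodiment.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(x: Sequence[float], y: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors of equal length."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # assumption: a zero vector is treated as dissimilar to everything
    return dot / (norm_x * norm_y)


def classify(
    feature: Sequence[float],
    references: Dict[str, Sequence[float]],
    metric: str = "cosine",
) -> str:
    """Return the cluster category whose reference vector best matches `feature`."""
    if metric == "cosine":
        # Larger cosine similarity means a closer match.
        return max(references, key=lambda c: cosine_similarity(feature, references[c]))
    if metric == "euclidean":
        # Smaller Euclidean distance means a closer match.
        return min(references, key=lambda c: math.dist(feature, references[c]))
    raise ValueError(f"unsupported metric: {metric!r}")
```

Note the opposite direction of the two rules: cosine similarity is maximized, while Euclidean distance is minimized.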

205. And determining an abnormality detection mode corresponding to the target cluster type.

Specifically, the preset detection method configuration table stores anomaly detection modes corresponding to various clustering categories, after determining a target clustering category to which data in a data source to be detected belongs, the anomaly detection modes need to be determined in the preset detection method configuration table according to the target clustering category, and based on the anomaly detection modes, the method for determining the anomaly detection modes comprises the following steps: inputting the target clustering category into a preset semantic information extraction model for semantic extraction to obtain a target semantic information vector corresponding to the target clustering category; respectively inputting different clustering categories stored in a preset detection method configuration table into a preset semantic information extraction model to carry out semantic extraction, and obtaining standard semantic information vectors respectively corresponding to the different clustering categories; based on the target semantic information vector and the standard semantic information vector, respectively calculating cosine similarity between the target clustering category and the different clustering categories; determining a clustering category corresponding to the target clustering category in the preset detection method configuration table according to the cosine similarity; and determining the abnormal detection mode corresponding to the clustering type as the abnormal detection mode corresponding to the target clustering type. 
That is, the similarity between the target cluster category and each cluster category in the preset detection method configuration table is calculated, the maximum of these similarities is determined, the cluster category corresponding to the maximum similarity is treated as the same cluster category as the target cluster category, and the detection mode corresponding to that cluster category is determined as the detection mode corresponding to the target cluster category.

206. And carrying out anomaly detection on the data in the data source to be detected by utilizing an anomaly detection mode to obtain the anomaly data in the data source to be detected.

For the embodiment of the present invention, after determining the abnormality detection mode corresponding to the data in the data source to be detected, the abnormality detection mode needs to be used to detect the abnormality of the data in the data source, based on which step 206 specifically includes: if the data in the data source to be detected belongs to the asset clustering category, determining a preset box-whisker diagram detection algorithm corresponding to the asset clustering category; performing anomaly detection on the asset data in the data source to be detected by using the preset box-whisker diagram detection algorithm to obtain anomaly data in the asset data; if the data in the data source to be detected belongs to a log clustering class, determining a preset abnormal detection model corresponding to the log clustering class; and carrying out anomaly detection on the log data in the data source to be detected by using the preset anomaly detection model to obtain anomaly data in the log data.

Specifically, according to the detection modes recorded in the preset detection method configuration table, the anomaly detection mode corresponding to the asset clustering category is a preset box-whisker diagram detection algorithm, and the anomaly detection mode corresponding to the log clustering category is a preset anomaly detection model. If the data in the data source to be detected belongs to the asset clustering category, the asset data in the data source to be detected needs to be detected by using the preset box-whisker diagram detection algorithm. The specific detection method comprises: determining a median data value corresponding to the asset data; determining a lower quartile corresponding to the asset data based on a minimum data value in the asset data and the median data value; determining an upper quartile corresponding to the asset data based on a maximum data value in the asset data and the median data value; calculating the distance between the lower quartile and the upper quartile to obtain an interquartile range corresponding to the asset data; calculating an anomaly detection lower limit value corresponding to the asset data according to the lower quartile and the interquartile range; calculating an anomaly detection upper limit value corresponding to the asset data according to the upper quartile and the interquartile range; and determining, among the asset data, data outside the range from the anomaly detection lower limit value to the anomaly detection upper limit value as anomaly data in the asset data.

Specifically, the median data value M of the asset data, that is, the middle value of the asset data, is determined first. The lower quartile D corresponding to the asset data is then determined as the middle value between the minimum data value and the median data value, and the upper quartile U is determined as the middle value between the median data value and the maximum data value. The interquartile range IQR corresponding to the asset data is the distance between the lower quartile and the upper quartile, that is, IQR = U - D. The anomaly detection upper limit value MAX = U + 1.5 × IQR and the anomaly detection lower limit value MIN = D - 1.5 × IQR corresponding to the asset data are then calculated. Finally, the data in the asset data that falls outside the range from MIN to MAX is determined as anomaly data.
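The box-whisker calculation above can be sketched in Python as follows. The sketch assumes that the lower quartile D is the median of the lower half of the sorted values and the upper quartile U is the median of the upper half, which is one common reading of "the middle value between the minimum (or maximum) data value and the median data value"; the function name and the minimum sample size are likewise assumptions made for illustration.

```python
from statistics import median
from typing import List, Sequence


def box_whisker_anomalies(values: Sequence[float]) -> List[float]:
    """Return the values lying outside [D - 1.5 * IQR, U + 1.5 * IQR]."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 4:
        raise ValueError("at least four values are needed to estimate quartiles")
    # Assumption: quartiles are medians of the lower and upper halves,
    # leaving out the middle element when n is odd.
    lower_quartile = median(ordered[: n // 2])        # D
    upper_quartile = median(ordered[(n + 1) // 2 :])  # U
    iqr = upper_quartile - lower_quartile             # IQR = U - D
    lower_limit = lower_quartile - 1.5 * iqr          # MIN
    upper_limit = upper_quartile + 1.5 * iqr          # MAX
    return [v for v in values if v < lower_limit or v > upper_limit]
```

For the values 10, 12, 11, 13, 12, 11, 100, 12, this gives D = 11, U = 12.5 and IQR = 1.5, so MIN = 8.75 and MAX = 14.75, and only the value 100 is reported as anomaly data.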

In the embodiment of the invention, if the data in the data source to be detected belongs to the log cluster category, the log data in the data source to be detected is detected by using a preset anomaly detection model. On this basis, the specific detection method comprises: obtaining an anomaly detection type; determining a detection prompt word corresponding to the anomaly detection type; and inputting the log data in the data source to be detected and the detection prompt word into the preset anomaly detection model for anomaly detection, so as to obtain the anomaly data in the log data.

The preset anomaly detection model may specifically be a neural network model. In order to improve the detection precision of the preset anomaly detection model, the model needs to be trained and constructed first. On this basis, the method comprises: constructing a preset initial anomaly detection model, and acquiring sample data and the actual anomaly data in the sample data; inputting the sample data into the preset initial anomaly detection model for anomaly detection to obtain predicted anomaly data in the sample data; constructing a loss function corresponding to the preset initial anomaly detection model based on the actual anomaly data and the predicted anomaly data; and constructing the preset anomaly detection model based on the loss function.

Specifically, in order to improve the construction precision of the model, the selected sample data may be log-type data, and the anomaly data in the sample data, that is, the actual anomaly data, is labeled. At least one preset initial anomaly detection model is constructed. A semantic information vector corresponding to the sample data is then determined and input into the preset initial anomaly detection model for anomaly detection to obtain predicted anomaly data in the sample data. A loss function corresponding to the preset initial anomaly detection model is constructed according to the deviation between the actual anomaly data and the predicted anomaly data corresponding to the same sample data, and the parameters of the preset initial anomaly detection model are continuously optimized according to the loss function. The preset initial anomaly detection model with the optimal parameters is finally obtained and determined as the preset anomaly detection model.

In addition, a plurality of preset initial anomaly detection models may be constructed at the outset. The sample data is divided into a training set and a test set, both of which contain the actual anomaly data corresponding to the sample data. Each preset initial anomaly detection model is trained with the sample data in the training set, and the trained models are checked with the sample data in the test set. During checking, the sample data in the test set is input into each preset initial anomaly detection model to obtain the predicted anomaly data output by that model. The prediction accuracy of each preset initial anomaly detection model is then calculated according to the difference between the predicted anomaly data and the actual anomaly data, and the preset initial anomaly detection model with the highest prediction accuracy is determined as the preset anomaly detection model.
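The selection step described above, choosing the candidate with the highest prediction accuracy on the test set, might be sketched as follows. The training step is omitted, and the two candidate "models" are deliberately trivial stand-ins (a keyword rule and a length rule) rather than neural networks; all names and the sample log lines are hypothetical.

```python
def accuracy(model, samples, labels):
    """Fraction of test samples whose predicted label matches the actual label."""
    correct = sum(1 for s, y in zip(samples, labels) if model(s) == y)
    return correct / len(samples)


def select_best_model(candidates, test_samples, test_labels):
    """Score every candidate on the test set and return the most accurate one."""
    scores = {
        name: accuracy(model, test_samples, test_labels)
        for name, model in candidates.items()
    }
    best_name = max(scores, key=scores.get)
    return best_name, scores


# Stand-in candidates: each maps a log line to True (anomalous) or False.
candidates = {
    "keyword": lambda line: "ERROR" in line,
    "length": lambda line: len(line) > 20,
}
test_samples = [
    "INFO start",
    "ERROR disk full",
    "INFO heartbeat received from node",
    "ERROR oom",
]
test_labels = [False, True, False, True]  # actual anomaly labels

best_name, scores = select_best_model(candidates, test_samples, test_labels)
print(best_name, scores)
```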

Further, after the preset anomaly detection model with higher precision is constructed, it is used to perform anomaly detection on the log data. On this basis, the specific detection method comprises the following steps. First, an anomaly detection type is determined; the anomaly detection type is set manually according to actual requirements, for example, detecting only anomalies relating to system operation in the log data. Then, in order to improve detection accuracy, the detection prompt words corresponding to the anomaly detection type are determined. Different anomaly detection types correspond to different detection prompt words; for example, the prompt words corresponding to system operation may be CPU memory, disk capacity and the like. Further, the log data is input into a preset semantic information extraction model for semantic information extraction to obtain a data semantic feature vector, and the detection prompt word is input into the preset semantic information extraction model for semantic information extraction to obtain a prompt word semantic feature vector. Cross processing is performed on the data semantic feature vector and the prompt word semantic feature vector to obtain a detection cross feature vector, wherein the cross processing comprises at least one of low-order cross processing, high-order cross processing and element-level cross processing. For example, low-order cross processing is performed on the data semantic feature vector and the prompt word semantic feature vector to obtain a first cross vector; high-order cross processing is performed on the two vectors to obtain a second cross vector; element-level cross processing is performed on the two vectors to obtain a third cross vector; and the first cross vector, the second cross vector and the third cross vector are then fused by using a preset fusion function to obtain the detection cross feature vector. The detection cross feature vector is then input into the preset anomaly detection model for anomaly detection, and the anomaly data in the log data is output by the preset anomaly detection model. Because cross processing of the detection prompt word and the data in the data source yields deeper information, the anomaly detection accuracy of the data can be improved.
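The text does not define the three cross operations or the fusion function, so the following Python sketch rests entirely on assumptions: low-order crossing is taken to be the inner product, high-order crossing is taken to be two stacked cross layers of the form x_next = x_0 * (x · p) + x, element-level crossing is taken to be the element-wise product, and fusion is taken to be concatenation. All function names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def low_order_cross(data_vec, prompt_vec):
    # Assumption: a single second-order interaction (inner product).
    return [dot(data_vec, prompt_vec)]


def high_order_cross(data_vec, prompt_vec, depth=2):
    # Assumption: stacked cross layers x_next = x_0 * (x . p) + x,
    # which introduce higher-degree interaction terms as depth grows.
    x = list(data_vec)
    for _ in range(depth):
        scale = dot(x, prompt_vec)
        x = [x0 * scale + xl for x0, xl in zip(data_vec, x)]
    return x


def element_level_cross(data_vec, prompt_vec):
    # Assumption: element-wise (Hadamard) product.
    return [x * y for x, y in zip(data_vec, prompt_vec)]


def fuse(*vectors):
    # Assumption: the "preset fusion function" is plain concatenation.
    return [v for vec in vectors for v in vec]


def detection_cross_feature(data_vec, prompt_vec):
    if len(data_vec) != len(prompt_vec):
        raise ValueError("feature vectors must have the same length")
    return fuse(
        low_order_cross(data_vec, prompt_vec),      # first cross vector
        high_order_cross(data_vec, prompt_vec),     # second cross vector
        element_level_cross(data_vec, prompt_vec),  # third cross vector
    )


print(detection_cross_feature([1.0, 2.0], [3.0, 4.0]))
```

In practice the two input vectors would come from the semantic information extraction model, and the fused vector would be passed to the anomaly detection model; both are outside the scope of this sketch.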

Further, after anomaly detection is performed on the data in the data source to be detected to obtain anomaly data, the anomaly information needs to be sent to operation and maintenance personnel in real time so that they can correct the anomaly data promptly. On this basis, the method comprises: determining an anomaly alarm template corresponding to the target cluster category; acquiring, based on the anomaly alarm template, the items of anomaly information corresponding to the anomaly data; and generating, based on the anomaly information, an alarm prompt list corresponding to the anomaly data and sending the alarm prompt list to an operation and maintenance personnel terminal, so that the operation and maintenance personnel at the terminal can correct the anomaly data according to the alarm prompt list.

Different cluster categories correspond to different anomaly alarm templates. The anomaly information includes the location of the anomaly data, the cause of the anomaly data, the type of anomaly, and the like.

Specifically, the anomaly alarm template corresponding to the target cluster category is determined, and the items of anomaly information relating to the anomaly data are filled into the template. Once filling is complete, the alarm prompt list is generated, and a preset communication tool interface is called to send the alarm prompt list to the operation and maintenance personnel terminal. The terminal can identify the type of the anomaly data from the type of the anomaly alarm template, so that suitable operation and maintenance personnel can be assigned to handle the anomaly data. This avoids repair errors caused by unsuitable personnel handling the anomaly data, and also avoids the time wasted in repeatedly determining who should handle it.
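The template-filling and sending steps can be illustrated with the short Python sketch below. The template wording, the field names (`anomaly_type`, `location`, `cause`) and the `send` callable standing in for the "preset communication tool interface" are all hypothetical choices made for illustration.

```python
from string import Template

# One hypothetical alarm template per cluster category.
ALARM_TEMPLATES = {
    "asset": Template("[asset anomaly] type=$anomaly_type; location=$location; cause=$cause"),
    "log": Template("[log anomaly] type=$anomaly_type; location=$location; cause=$cause"),
}


def build_alarm_list(cluster_category, anomaly_infos):
    """Fill the category's template with each item of anomaly information."""
    template = ALARM_TEMPLATES[cluster_category]
    return [template.substitute(info) for info in anomaly_infos]


def send_alarm_list(alarm_list, send):
    """Push each entry through a caller-supplied sender; return the count sent."""
    for entry in alarm_list:
        send(entry)
    return len(alarm_list)


alarms = build_alarm_list(
    "log",
    [{"anomaly_type": "system operation", "location": "line 42", "cause": "disk capacity exceeded"}],
)
send_alarm_list(alarms, print)  # print stands in for the real communication interface
```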

According to the other data anomaly detection method provided by the invention, in contrast to the current approach of detecting all data with a unified detection method, data in a data source to be detected is acquired; classification feature vectors corresponding to the data in the data source to be detected are determined, and reference feature vectors corresponding to reference data under different cluster categories are determined; the similarity distances between the data in the data source to be detected and the reference data under the different cluster categories are calculated according to the classification feature vectors and the reference feature vectors; the target cluster category to which the data in the data source to be detected belongs is determined based on the similarity distances; the anomaly detection mode corresponding to the target cluster category is determined; and finally anomaly detection is performed on the data in the data source to be detected by using the anomaly detection mode to obtain the anomaly data in the data source to be detected. Determining the target cluster category by means of reference data under different cluster categories improves the classification accuracy of the data in the data source. Moreover, because the anomaly detection mode is selected according to the cluster category to which the data belongs, and different cluster categories correspond to different anomaly detection modes, the selected mode is suited to data of that category, so that the anomaly detection accuracy of the data can be improved.
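The classification step summarized above might look like the following Python sketch. It assumes the similarity distance is the Euclidean distance between the classification feature vector and the mean of each category's reference feature vectors; the text does not fix the distance measure, and the function names and sample vectors are hypothetical.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def classify(feature, reference_by_category):
    """Return the category whose reference data is closest to the feature vector."""
    distances = {
        category: math.dist(feature, centroid(refs))  # assumed: Euclidean distance
        for category, refs in reference_by_category.items()
    }
    target = min(distances, key=distances.get)
    return target, distances


references = {
    "asset": [[1.0, 0.0], [0.8, 0.2]],
    "log": [[0.0, 1.0], [0.2, 0.8]],
}
target, distances = classify([0.9, 0.1], references)
print(target, distances)
```

The resulting category would then be looked up in the configuration table to select the box-and-whisker algorithm or the anomaly detection model.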

Further, as a specific implementation of fig. 1, an embodiment of the present invention provides a data anomaly detection apparatus, as shown in fig. 3, where the apparatus includes: an acquisition unit 31, a classification unit 32, a determination unit 33, and an abnormality detection unit 34.

The acquiring unit 31 may be configured to acquire data in a data source to be detected.

The classifying unit 32 may be configured to classify the data in the data source to be detected, so as to obtain a target cluster class to which the data in the data source to be detected belongs.

The determining unit 33 may be configured to determine an anomaly detection manner corresponding to the target cluster category.

The anomaly detection unit 34 may be configured to perform anomaly detection on the data in the data source to be detected by using the anomaly detection manner, so as to obtain anomaly data in the data source to be detected.
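As a rough illustration of how the four units cooperate, the sketch below wires them together as plain Python callables. The class name, the toy classifier and the two toy detectors are hypothetical stand-ins and do not reproduce the detection algorithms described in the method embodiments.

```python
class AnomalyDetectionDevice:
    def __init__(self, acquire, classify, detectors):
        self.acquire = acquire      # role of acquisition unit 31
        self.classify = classify    # role of classification unit 32
        self.detectors = detectors  # lookup table used by determining unit 33

    def run(self, source):
        data = self.acquire(source)
        category = self.classify(data)
        detect = self.detectors[category]  # detection mode for the target category
        return category, detect(data)      # role of anomaly detection unit 34


def toy_classify(data):
    # Stand-in rule: purely numeric data is "asset", anything else is "log".
    return "asset" if all(isinstance(x, (int, float)) for x in data) else "log"


device = AnomalyDetectionDevice(
    acquire=lambda source: list(source),
    classify=toy_classify,
    detectors={
        "asset": lambda data: [x for x in data if x > 1000],    # stand-in detector
        "log": lambda data: [x for x in data if "ERROR" in x],  # stand-in detector
    },
)

print(device.run([5, 7, 5000]))
print(device.run(["ok", "ERROR disk full"]))
```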

In a specific application scenario, in order to determine a target cluster category to which data in a data source to be detected belongs, as shown in fig. 4, the classification unit 32 includes a first determining module 321 and a calculating module 322.

The first determining module 321 may be configured to determine a classification feature vector corresponding to data in the data source to be detected, and determine a reference feature vector corresponding to reference data in different clustering categories.

The calculating module 322 may be configured to calculate, according to the classification feature vector and the reference feature vector, a similarity distance between the data in the data source to be detected and the reference data under the different clustering categories.

The first determining module 321 may specifically be configured to determine, based on the similarity distance, a target cluster category to which the data in the data source to be detected belongs.

In a specific application scenario, in order to determine the reference data under different cluster categories, the first determining module 321 may be further configured to determine initial centroids corresponding to k clusters.

The calculation module 322 may be further configured to calculate a distance between reference data and initial centroids corresponding to the k clusters, and divide the reference data into the k clusters based on the distance.

The first determining module 321 may be further specifically configured to determine updated centroids corresponding to the k clusters based on the reference data in the k clusters.

The first determining module 321 may be further configured to re-divide the reference data into the k clusters based on the updated centroids until the updated centroids are unchanged, and determine the reference data finally divided into the k clusters as the reference data under the different cluster categories.
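The centroid-based division of reference data described for modules 321 and 322 corresponds to a k-means style iteration, which can be sketched as follows. Euclidean distance, the `max_iter` safety cap and the function name `k_means` are assumptions added for illustration.

```python
import math


def k_means(points, initial_centroids, max_iter=100):
    """Assign points to the nearest centroid and update until centroids stop changing."""
    centroids = [list(c) for c in initial_centroids]
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Divide the reference data into k clusters by distance to each centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            idx = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[idx].append(p)
        # Update each centroid as the mean of its cluster (keep it if the cluster is empty).
        updated = [
            [sum(col) / len(cluster) for col in zip(*cluster)] if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if updated == centroids:  # centroids unchanged: the division is final
            break
        centroids = updated
    return centroids, clusters


points = [[1.0, 1.0], [1.0, 2.0], [8.0, 8.0], [9.0, 8.0]]
centroids, clusters = k_means(points, initial_centroids=[[0.0, 0.0], [10.0, 10.0]])
print(centroids)
print(clusters)
```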

In a specific application scenario, in order to perform anomaly detection on the data in the data source, the anomaly detection unit 34 includes a second determination module 341 and an anomaly detection module 342.

The second determining module 341 may be configured to determine a preset box-whisker-graph detection algorithm corresponding to an asset cluster category if the data in the to-be-detected data source belongs to the asset cluster category.

The anomaly detection module 342 may be configured to perform anomaly detection on the asset data in the data source to be detected by using the preset box-whisker diagram detection algorithm, so as to obtain anomaly data in the asset data.

The second determining module 341 may be further configured to determine a preset anomaly detection model corresponding to a log cluster type if data in the data source to be detected belongs to the log cluster type.

The anomaly detection module 342 may be further configured to perform anomaly detection on the log data in the data source to be detected by using the preset anomaly detection model, so as to obtain anomaly data in the log data.

In a specific application scenario, in order to perform anomaly detection on asset data under the asset cluster category, the anomaly detection module 342 includes a determination sub-module and a calculation sub-module.

The determination submodule can be used for determining a median data value corresponding to the asset data.

The determining submodule may be specifically configured to determine a lower quartile corresponding to the asset data based on a minimum data value in the asset data and the median data value.

The determining submodule is further used for determining the upper quartile corresponding to the asset data based on the maximum data value and the median data value in the asset data.

The calculating submodule may be configured to calculate the distance between the lower quartile and the upper quartile to obtain the interquartile range corresponding to the asset data.

The calculating submodule may be specifically configured to calculate an anomaly detection lower limit value corresponding to the asset data according to the lower quartile and the interquartile range.

The calculating submodule may be further configured to calculate an anomaly detection upper limit value corresponding to the asset data according to the upper quartile and the interquartile range.

The determining submodule may be specifically configured to determine, as anomaly data in the asset data, the asset data that falls outside the range from the anomaly detection lower limit value to the anomaly detection upper limit value.

In a specific application scenario, in order to perform anomaly detection on log data under the log clustering category, the anomaly detection module 342 includes a construction sub-module and a detection sub-module.

The construction submodule can be used for constructing a preset initial abnormality detection model, acquiring sample data and actual abnormality data in the sample data.

The detection sub-module can be used for inputting the sample data into the preset initial abnormality detection model to detect abnormal data, so as to obtain predicted abnormal data in the sample data.

The construction submodule is specifically configured to construct a loss function corresponding to the preset initial anomaly detection model based on the actual anomaly data and the predicted anomaly data.

The construction submodule may be specifically configured to construct the preset anomaly detection model based on the loss function.

The detection submodule can be used for acquiring an abnormal detection type; determining a detection prompt word corresponding to the abnormality detection type; and inputting the log data in the data source to be detected and the detection prompt word into the preset abnormality detection model to detect the abnormal data, so as to obtain the abnormal data in the log data.

In a specific application scenario, in order to generate the alarm prompt list, the device further includes: a generating unit 35.

The determining unit 33 may be further configured to determine an abnormal alert template corresponding to the target cluster category.

The obtaining unit 31 may be further configured to obtain, based on the anomaly alert template, each item of anomaly information corresponding to the anomaly data.

The generating unit 35 may be configured to generate an alarm prompt list corresponding to the abnormal data based on the various abnormal information, and send the alarm prompt list to an operation and maintenance personnel terminal, so that an operation and maintenance personnel of the operation and maintenance personnel terminal corrects the abnormal data according to the alarm prompt list.

It should be noted that, other corresponding descriptions of each functional module related to the data anomaly detection apparatus provided by the embodiment of the present invention may refer to corresponding descriptions of the method shown in fig. 1, and are not repeated herein.

Based on the above method as shown in fig. 1, correspondingly, the embodiment of the present invention further provides a computer readable storage medium, on which a computer program is stored, which when being executed by a processor, implements the following steps: acquiring data in a data source to be detected; classifying the data in the data source to be detected to obtain a target cluster category to which the data in the data source to be detected belongs; determining an abnormality detection mode corresponding to the target cluster type; and carrying out anomaly detection on the data in the data source to be detected by using the anomaly detection mode to obtain the anomaly data in the data source to be detected.

Based on the embodiment of the method shown in fig. 1 and the device shown in fig. 3, the embodiment of the invention further provides a physical structure diagram of a computer device, as shown in fig. 5, where the computer device includes: a processor 41, a memory 42, and a computer program stored on the memory 42 and executable on the processor, wherein the memory 42 and the processor 41 are both arranged on a bus 43, the processor 41 performing the following steps when said program is executed: acquiring data in a data source to be detected; classifying the data in the data source to be detected to obtain a target cluster category to which the data in the data source to be detected belongs; determining an abnormality detection mode corresponding to the target cluster type; and carrying out anomaly detection on the data in the data source to be detected by using the anomaly detection mode to obtain the anomaly data in the data source to be detected.

According to the technical scheme of the invention, data in a data source to be detected is acquired; the data in the data source to be detected is classified to obtain the target cluster category to which it belongs; the anomaly detection mode corresponding to the target cluster category is then determined; and finally anomaly detection is performed on the data in the data source to be detected by using the anomaly detection mode to obtain the anomaly data in the data source to be detected. Because the anomaly detection mode is selected according to the cluster category to which the data belongs, and different cluster categories correspond to different anomaly detection modes, the selected mode is suited to data of that category, so that the anomaly detection accuracy of the data can be improved.

It will be appreciated by those skilled in the art that the modules or steps of the invention described above may be implemented by a general-purpose computing device. They may be concentrated on a single computing device or distributed across a network of multiple computing devices. Alternatively, they may be implemented in program code executable by computing devices, so that they may be stored in a storage device and executed by the computing devices; in some cases, the steps shown or described may be performed in an order different from that shown or described herein. They may also be fabricated separately into individual integrated circuit modules, or multiple modules or steps among them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.

The above description covers only the preferred embodiments of the present invention and is not intended to limit the present invention; those skilled in the art may make various modifications and variations to the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A data anomaly detection method, comprising:

Acquiring data in a data source to be detected;

2. The method of claim 1, wherein classifying the data in the data source to be detected to obtain a target cluster class to which the data in the data source to be detected belongs comprises:

3. The method of claim 2, wherein prior to said determining the reference feature vector corresponding to the reference data under the different cluster categories, the method further comprises:

Determining initial centroids corresponding to the k clusters;

4. The method of claim 1, wherein the performing anomaly detection on the data in the data source to be detected by using the anomaly detection method to obtain the anomaly data in the data source to be detected comprises:

5. The method of claim 4, wherein the performing anomaly detection on the asset data in the data source to be detected using the preset box-and-whisker diagram detection algorithm to obtain anomaly data in the asset data comprises:

determining a median data value corresponding to the asset data;

6. The method of claim 4, wherein prior to performing anomaly detection on the log data in the data source to be detected using the preset anomaly detection model to obtain the anomaly data in the log data, the method further comprises:

constructing the preset anomaly detection model based on the loss function;

Obtaining an abnormality detection type;

7. The method according to claim 1, wherein after performing anomaly detection on the data in the data source to be detected by using the anomaly detection method to obtain the anomaly data in the data source to be detected, the method further comprises:

8. A data anomaly detection device, comprising:

the acquisition unit is used for acquiring a data source to be detected;

9. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 7.

10. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the computer program when executed by the processor implements the steps of the method according to any one of claims 1 to 7.