CN113569006A

CN113569006A - Large-scale data quality anomaly detection method based on data characteristics

Info

Publication number: CN113569006A
Application number: CN202110671429.3A
Authority: CN
Inventors: 葛俊; 梁云丹; 黄建平; 张旭东; 张建松; 陈浩
Original assignee: State Grid Corp of China SGCC; State Grid Zhejiang Electric Power Co Ltd
Current assignee: State Grid Corp of China SGCC; State Grid Zhejiang Electric Power Co Ltd
Priority date: 2021-06-17
Filing date: 2021-06-17
Publication date: 2021-10-29

Abstract

The invention discloses a large-scale data quality anomaly detection method based on data characteristics, comprising the following steps: constructing a data anomaly detection method library by setting a corresponding detection method for each data feature and collecting these methods into the library; matching each data feature to an anomaly detection method and performing detection with the matched method; and traversing the large-scale data features, matching and detecting each one. The substantial effects of the invention include: anomaly detection driven by detection rules is replaced with anomaly detection driven by data features; a corresponding outlier detection method is generated from the feature information of the data in each field; and a dedicated fuzzification mechanism is provided for large-scale data, so that data quality checking is scaled up and automated and data quality problems are detected more efficiently.

Description

Large-scale data quality anomaly detection method based on data characteristics

Technical Field

The invention relates to the field of data anomaly detection, in particular to a large-scale data quality anomaly detection method based on data characteristics.

Background

With the development of the digital economy, industries no longer simply pursue data volume, and the requirements on data quality in data applications keep rising. Faced with massive data resources, finding and locating data quality problems more quickly, accurately, and intelligently, and carrying out the corresponding remediation work, is the key to and core of current enterprise-level data asset management.

The invention with publication number CN108256074A discloses a verification processing method, which includes: obtaining the models of a data warehouse to be verified, each model including multiple items of field information, the field information including field definitions and field types; checking the field information against a pre-stored data dictionary, the data dictionary comprising multiple standard terms, each including a standard definition and a standard type; and, if a field definition matches a standard definition but the field type does not match the standard type, modifying the field type to be consistent with the standard type. That method checks the data warehouse model against standard terms and corrects the field type in a targeted way when the definition matches but the type does not, so that the model conforms to the standard. Traditional data quality anomaly detection of this kind is rule-driven: a business expert designs a set of quality anomaly detection methods for a specific field of a specific table, based on business specifications and experience, and the corresponding dedicated remediation work is then carried out.

Disclosure of Invention

In view of the above problems, the invention provides a large-scale data quality anomaly detection method based on data characteristics. It converts rule-driven anomaly detection into anomaly detection driven by data features and generates a corresponding outlier detection method from the feature information of the data in each field, thereby scaling up and automating data quality checks, broadening the scope of data quality detection, and improving the efficiency with which data quality problems are detected.

The technical scheme of the invention is as follows.

A large-scale data quality anomaly detection method based on data characteristics comprises the following steps:

constructing a data anomaly detection method library, setting a corresponding detection method according to each data characteristic, and summarizing to form the data anomaly detection method library; carrying out anomaly detection method matching on the data characteristics, and detecting according to an anomaly detection method in a matching result; and traversing the large-scale data features, and matching and detecting each data feature.

The method library of the invention is set up by designing corresponding anomaly detection methods for different data features from the perspectives of statistics, common sense, natural laws, domain knowledge, and the like. For example, a method designed for the numeric-value feature reports an anomaly when an extreme value appears in a field, and a method for the date feature reports an anomaly for field content that does not conform to the date format. The library is configured according to actual usage requirements, and after matching, detection is performed in a targeted manner.
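The two example methods named above (extreme values in a numeric field, content that does not conform to a date format) might look like the following minimal sketch. The function names, the z-score rule for "extreme value", and the default date format are illustrative assumptions; the source does not specify how an extreme value is defined.

```python
from datetime import datetime
from statistics import mean, stdev


def extreme_value_anomalies(values, z_threshold=3.0):
    """Return indices of values whose z-score exceeds z_threshold.

    Assumption: "extreme value" is read here as a z-score rule.
    """
    if len(values) < 2:
        return []
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    if sigma == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > z_threshold]


def date_format_anomalies(values, fmt="%Y-%m-%d"):
    """Return indices of entries that cannot be parsed with the given format."""
    bad = []
    for i, v in enumerate(values):
        try:
            datetime.strptime(v, fmt)
        except (ValueError, TypeError):  # wrong format, or not a string
            bad.append(i)
    return bad


readings = [10] * 20 + [1000]
dates = ["2021-06-17", "2021-13-01", "17/06/2021", None]
print(extreme_value_anomalies(readings))  # index of the value 1000
print(date_format_anomalies(dates))       # indices of the unparsable entries
```

Each function returns the positions of the offending values, so a caller can report anomalies per field.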

Preferably, the data anomaly detection method library is stored as a dictionary: a tuple consisting of a data feature name and its feature parameters serves as the key, and the anomaly detection method corresponding to that data feature serves as the value. Python's dictionary type stores key-value pairs and is used here to hold the data features and their anomaly detection methods. The threshold of each anomaly detection method is given by the feature parameters. Storing the library as a dictionary cleanly separates keys from values, which facilitates subsequent matching.
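As a minimal sketch of this storage layout, the library below maps `(feature name, feature parameters)` tuples to callables. The feature names, parameter values, and the helper factories `make_range_check` and `make_length_check` are hypothetical and chosen only for illustration.

```python
def make_range_check(low, high):
    """Build a method that flags values outside [low, high]."""
    def check(values):
        return [i for i, v in enumerate(values) if not (low <= v <= high)]
    return check


def make_length_check(length):
    """Build a method that flags values whose string length differs from length."""
    def check(values):
        return [i for i, v in enumerate(values) if len(str(v)) != length]
    return check


# (feature name, feature parameters, factory); the parameters supply the thresholds.
specs = [
    ("age", (0, 120), make_range_check),
    ("voltage", (198.0, 242.0), make_range_check),
    ("id number", (18,), make_length_check),
]

# Key: tuple of (name, parameters). Value: the detection method built from them.
method_library = {
    (name, params): factory(*params) for name, params, factory in specs
}

age_check = method_library[("age", (0, 120))]
print(age_check([25, -3, 130, 120]))  # positions of out-of-range ages
```

Keeping the parameters inside the key means that the same feature name can appear with different thresholds without colliding, provided the parameters are hashable (tuples rather than lists).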

Preferably, the matching comprises the following process: the name of the data feature to be processed and the keys in the anomaly detection method library are each embedded as word vectors by NLP, and the cosine similarity between the word vectors is calculated; a key whose similarity reaches the threshold is a potential key for the data feature, and the anomaly detection methods corresponding to such keys are the matching result. Because a word vector contains multidimensional numerical values, cosine similarity allows a more accurate comparison.

Preferably, the cosine similarity is calculated as follows:

$$\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}$$

where u and v represent the two word vectors. This is the standard formula for cosine similarity.
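The cosine-similarity matching described above can be sketched as follows. The three-dimensional vectors are hand-written placeholders standing in for real NLP embeddings, the threshold of 0.8 is arbitrary, and the comparison `>=` reflects one reading of the threshold condition; all of these are assumptions rather than values from the source.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def match_keys(feature_vector, key_vectors, threshold=0.8):
    """Return the library keys whose similarity to the feature reaches the threshold."""
    return [
        key
        for key, vec in key_vectors.items()
        if cosine_similarity(feature_vector, vec) >= threshold
    ]


# Hypothetical embeddings of two library keys and one incoming feature name.
key_vectors = {
    ("date", ("%Y-%m-%d",)): [0.9, 0.1, 0.0],
    ("amount", (0, 1000000)): [0.0, 0.2, 0.9],
}
feature_vector = [0.8, 0.2, 0.1]  # stand-in for the embedding of "created_date"

print(match_keys(feature_vector, key_vectors))
```

With these placeholder vectors only the date key passes the threshold, so only the date-format method would be run on that field.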

Preferably, the large-scale data feature traversal process includes: scaling the value of each dimension of the word vector to be matched proportionally into the range 0-255; representing each word vector as an array of n pixels arranged in sequence, where n is the dimension of the word vector and the value of each dimension is the gray value of the corresponding pixel; copying the image represented by the pixel array onto a white background picture of m pixels to obtain a replica image, where m = x^2 · n and x is a natural number greater than or equal to 2; reducing the replica image to n pixels; reading the gray value of each pixel to form a new special word vector; and calculating the cosine similarity with the special word vectors to reduce the computational load under large data volumes. When facing massive data, applying exactly the same method as for a single item of data is accurate but computationally expensive and inefficient overall, so the vectors are fuzzified in this way. Although a fuzzified word vector deviates from the original, word vectors that were originally similar remain suitably similar, so the similarity results differ little, and the method can cope with the computational pressure of massive data.

As an alternative, the large-scale data feature traversal process includes: scaling the value of each dimension of the word vector to be matched into the range 0-255; dividing 0-255 into several levels; replacing the value of each dimension with the middle value of the level it falls in, generating a new special word vector; and calculating the cosine similarity with the special word vectors to reduce the computational load under large data volumes. This scheme likewise relies on fuzzifying the word vectors to reduce the amount of computation under large-scale data.
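The level-based fuzzification in this alternative can be sketched as below. The source does not state how many levels are used or which input range is mapped to 0-255, so the bounds `lo=-1.0`, `hi=1.0` and the default of 8 levels are assumptions, as are the function names.

```python
def scale_to_gray(vector, lo=-1.0, hi=1.0):
    """Linearly map values from [lo, hi] to [0, 255], clamping out-of-range input.

    Assumption: every vector shares the same bounds so results stay comparable.
    """
    return [(min(max(v, lo), hi) - lo) / (hi - lo) * 255.0 for v in vector]


def quantize(gray, levels=8):
    """Replace each gray value with the midpoint of the level it falls in."""
    width = 255.0 / levels
    out = []
    for g in gray:
        idx = min(int(g // width), levels - 1)  # 255 belongs to the last level
        out.append((idx + 0.5) * width)
    return out


def fuzzify(vector, levels=8):
    """Produce the 'special word vector' used in place of the original."""
    return quantize(scale_to_gray(vector), levels)


print(fuzzify([-1.0, 0.0, 1.0], levels=4))
```

After this step each dimension can take only `levels` distinct values, which is what makes the vectors coarser; cosine similarity is then computed on the fuzzified vectors in the usual way.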

The substantial effects of the invention include: anomaly detection driven by detection rules is replaced with anomaly detection driven by data features; a corresponding outlier detection method is generated from the feature information of the data in each field; and a dedicated fuzzification mechanism is provided for large-scale data, so that data quality checking is scaled up and automated and data quality problems are detected more efficiently.

Detailed Description

The technical solution of the present application will be described with reference to the following examples. In addition, numerous specific details are set forth below in order to provide a better understanding of the present invention. It will be understood by those skilled in the art that the present invention may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present invention.

Example 1:

Step S1: Construct a data anomaly detection method library: set a corresponding detection method for each data feature and collect these methods to form the library.

The data anomaly detection method library is stored as a dictionary: a tuple consisting of a data feature name and its feature parameters serves as the key, and the anomaly detection method corresponding to that data feature serves as the value. Python's dictionary type stores key-value pairs and is used here to hold the data features and their anomaly detection methods. The threshold of each anomaly detection method is given by the feature parameters. Storing the library as a dictionary cleanly separates keys from values, which facilitates subsequent matching.

Step S2: Match each data feature to an anomaly detection method, and perform detection with the matched method.

The matching comprises the following process: the name of the data feature to be processed and the keys in the anomaly detection method library are each embedded as word vectors by NLP, and the cosine similarity between the word vectors is calculated; a key whose similarity reaches the threshold is a potential key for the data feature, and the anomaly detection methods corresponding to such keys are the matching result. Because a word vector contains multidimensional numerical values, cosine similarity allows a more accurate comparison.

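The cosine similarity is calculated as follows:

$$\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}$$

where u and v represent the two word vectors.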

Step S3: Traverse the large-scale data features, matching and detecting each data feature.

The large-scale data feature traversal process of this example includes: scaling the value of each dimension of the word vector to be matched proportionally into the range 0-255; representing each word vector as an array of n pixels arranged in sequence, where n is the dimension of the word vector and the value of each dimension is the gray value of the corresponding pixel; copying the image represented by the pixel array onto a white background picture of m pixels to obtain a replica image, where m = x^2 · n and x is a natural number greater than or equal to 2; reducing the replica image to n pixels; reading the gray value of each pixel to form a new special word vector; and calculating the cosine similarity with the special word vectors to reduce the computational load under large data volumes. When facing massive data, applying exactly the same method as for a single item of data is accurate but computationally expensive and inefficient overall, so the vectors are fuzzified in this way. Although a fuzzified word vector deviates from the original, word vectors that were originally similar remain suitably similar, so the similarity results differ little, and the method can cope with the computational pressure of massive data.

The method library of this example is set up by designing corresponding anomaly detection methods for different data features from the perspectives of statistics, common sense, natural laws, domain knowledge, and the like. For example, a method designed for the numeric-value feature reports an anomaly when an extreme value appears in a field, and a method for the date feature reports an anomaly for field content that does not conform to the date format. The library is configured according to actual usage requirements, and after matching, detection is performed in a targeted manner.

Example 2:

This example is the same as Example 1 except for the large-scale data feature traversal process, which in this example includes: scaling the value of each dimension of the word vector to be matched into the range 0-255; dividing 0-255 into several levels; replacing the value of each dimension with the middle value of the level it falls in, generating a new special word vector; and calculating the cosine similarity with the special word vectors to reduce the computational load under large data volumes. This scheme likewise relies on fuzzifying the word vectors to reduce the amount of computation under large-scale data.

The substantial effects of the above examples include: anomaly detection driven by detection rules is replaced with anomaly detection driven by data features; a corresponding outlier detection method is generated from the feature information of the data in each field; and a dedicated fuzzification mechanism is provided for large-scale data, so that data quality checking is scaled up and automated and data quality problems are detected more efficiently.

From the above description of the embodiments, those skilled in the art will understand that the embodiments, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a readable storage medium. Based on this understanding, the technical solutions of the embodiments of the present application, in essence or in the part that contributes over the prior art, or in whole or in part, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions that cause a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to execute all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

The above describes only specific embodiments of the present application, but the scope of protection of the present application is not limited thereto. Any change or substitution that a person skilled in the art can readily conceive of within the technical scope disclosed in the present application shall be covered by the scope of protection of the present application. Therefore, the scope of protection of the present application shall be subject to the scope of protection of the claims.

Claims

1. A large-scale data quality anomaly detection method based on data characteristics is characterized by comprising the following steps:

constructing a data anomaly detection method library, setting a corresponding detection method according to each data characteristic, and summarizing to form the data anomaly detection method library;

carrying out anomaly detection method matching on the data characteristics, and detecting according to an anomaly detection method in a matching result;

and traversing the large-scale data features, and matching and detecting each data feature.

2. The large-scale data quality anomaly detection method based on data features as claimed in claim 1, wherein the data anomaly detection method library is stored as a dictionary, a tuple consisting of a data feature name and its feature parameters is used as a key of the dictionary, and the anomaly detection method corresponding to the data feature is used as the value of the dictionary.

3. The large-scale data quality anomaly detection method based on data features as claimed in claim 2, wherein the matching comprises the following process: the name of the data feature to be processed and the keys in the anomaly detection method library are each embedded as word vectors by NLP, and the cosine similarity between the word vectors is calculated, wherein a key whose similarity reaches the threshold is a potential key for the data feature, and the anomaly detection methods corresponding to such keys are the matching result.

4. The large-scale data quality anomaly detection method based on data characteristics according to claim 3, wherein the cosine similarity is calculated according to the following formula:

$$\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}$$

where u and v represent the two word vectors.

5. The large-scale data quality anomaly detection method based on data features as claimed in claim 3 or 4, wherein the large-scale data feature traversal process comprises: scaling the value of each dimension of the word vector to be matched proportionally into the range 0-255; representing each word vector as an array of n pixels arranged in sequence, where n is the dimension of the word vector and the value of each dimension is the gray value of the corresponding pixel; copying the image represented by the pixel array onto a white background picture of m pixels to obtain a replica image, where m = x^2 · n and x is a natural number greater than or equal to 2; reducing the replica image to n pixels; reading the gray value of each pixel to form a new special word vector; and calculating the cosine similarity with the special word vectors to reduce the computational load under large data volumes.

6. The large-scale data quality anomaly detection method based on data features as claimed in claim 3 or 4, wherein the large-scale data feature traversal process comprises: scaling the value of each dimension of the word vector to be matched into the range 0-255; dividing 0-255 into several levels; replacing the value of each dimension with the middle value of the level it falls in, generating a new special word vector; and calculating the cosine similarity with the special word vectors to reduce the computational load under large data volumes.