CN113569006A - Large-scale data quality anomaly detection method based on data characteristics - Google Patents

Large-scale data quality anomaly detection method based on data characteristics

Info

Publication number
CN113569006A
CN113569006A CN202110671429.3A CN202110671429A CN113569006A CN 113569006 A CN113569006 A CN 113569006A CN 202110671429 A CN202110671429 A CN 202110671429A CN 113569006 A CN113569006 A CN 113569006A
Authority
CN
China
Prior art keywords
data
detection method
anomaly detection
word vector
scale
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110671429.3A
Other languages
Chinese (zh)
Inventor
葛俊
梁云丹
黄建平
张旭东
张建松
陈浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Corp of China SGCC
State Grid Zhejiang Electric Power Co Ltd
Original Assignee
State Grid Corp of China SGCC
State Grid Zhejiang Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Corp of China SGCC and State Grid Zhejiang Electric Power Co Ltd
Priority to CN202110671429.3A
Publication of CN113569006A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a large-scale data quality anomaly detection method based on data characteristics, comprising the following steps: constructing a data anomaly detection method library by setting a corresponding detection method for each data feature and collecting these methods into the library; matching each data feature to an anomaly detection method and performing detection according to the method in the matching result; and traversing the large-scale data features, matching and detecting each one. The substantial effects of the invention include: anomaly detection driven by detection rules is replaced by anomaly detection driven by data features, and the corresponding outlier detection method is generated from the feature information of the data in each field; in addition, a dedicated fuzzification mechanism is provided for large-scale data, so that data quality checking can be scaled up and automated and data quality problems are detected more efficiently.

Description

Large-scale data quality anomaly detection method based on data characteristics
Technical Field
The invention relates to the field of data anomaly detection, in particular to a large-scale data quality anomaly detection method based on data characteristics.
Background
With the development of the digital economy, industries no longer simply pursue ever-larger data volumes, and the requirements on data quality in data applications keep rising. Faced with massive data resources, finding and locating data quality problems more quickly, accurately, and intelligently, and carrying out the corresponding governance work, is the key and core of current enterprise-level data asset management.
The invention with publication number CN108256074A discloses a verification processing method, which includes: obtaining the models of a data warehouse to be verified, each model including a plurality of pieces of field information, the field information including field definitions and field types; checking the field information against a pre-stored data dictionary, the data dictionary including a plurality of standard expressions, each of which includes a standard definition and a standard type; and, if a field definition matches a standard definition but the field type does not match the standard type, modifying the field type to be consistent with the standard type. That method checks the model of a data warehouse against standard terms and, when a field definition matches a standard definition but the field type does not, modifies the field type so that the model conforms to the standard. Traditional data quality anomaly detection of this kind is driven by rules: a business expert designs a set of quality anomaly detection methods for a specific field of a specific table according to business specifications and experience, and the corresponding dedicated governance work is then carried out.
Disclosure of Invention
To address the above problems, the invention provides a large-scale data quality anomaly detection method based on data characteristics. It replaces anomaly detection driven by detection rules with anomaly detection driven by data features and generates the corresponding outlier detection method from the feature information of the data in each field, so that data quality checking can be scaled up and automated, the range of data quality detection is widened, and data quality problems are detected more efficiently.
The technical scheme of the invention is as follows.
A large-scale data quality anomaly detection method based on data characteristics comprises the following steps:
constructing a data anomaly detection method library by setting a corresponding detection method for each data feature and collecting these methods into the library; matching each data feature to an anomaly detection method and performing detection according to the method in the matching result; and traversing the large-scale data features, matching and detecting each one.
The method library of the invention is built by designing corresponding anomaly detection methods for different data features from the perspectives of statistics, common sense, natural laws, professional domain knowledge, and the like. For example, for a numeric-value feature the method reports an anomaly when a field value is an extreme value, and for a date feature it reports an anomaly for field content that does not conform to the date format. The specific content of the library is determined by actual usage requirements, and after matching, detection is performed in a targeted manner.
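The two example checks named above (extreme numeric values and malformed dates) can be illustrated with a minimal Python sketch. The function names, the z-score rule with a default threshold of 3, and the `%Y-%m-%d` date format are assumptions made for this example; the patent does not define what counts as an extreme value or which date format applies.

```python
from datetime import datetime
from statistics import mean, pstdev


def detect_extreme_values(values, z_threshold=3.0):
    """Return indices of values more than z_threshold standard deviations from the mean.

    The z-score rule is an assumed stand-in for "an extreme value occurs".
    """
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:  # all values identical: nothing can be extreme
        return []
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > z_threshold]


def detect_bad_dates(values, date_format="%Y-%m-%d"):
    """Return indices of entries that do not parse with the assumed date format."""
    bad = []
    for i, text in enumerate(values):
        try:
            datetime.strptime(text, date_format)
        except (ValueError, TypeError):  # wrong format, or not a string at all
            bad.append(i)
    return bad


readings = [10] * 20 + [1000]
print(detect_extreme_values(readings))                                # index of the outlier
print(detect_bad_dates(["2021-06-17", "2021-13-01", "17/06/2021"]))   # indices of bad dates
```

Each function returns the positions of the suspect entries rather than raising, so a caller can report anomalies per field without stopping the traversal.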
Preferably, the data anomaly detection method library is stored as a dictionary: a tuple composed of a data feature name and its feature parameters serves as the key, and the anomaly detection method corresponding to that data feature serves as the value. The Python dictionary type stores key-value pairs and is used here to hold the data features and their anomaly detection methods; the threshold of each anomaly detection method is given by the feature parameters. Storing the library as a dictionary separates keys from values clearly, which facilitates subsequent matching.
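A minimal sketch of such a dictionary follows. The feature names, parameter names, and check functions are hypothetical; the only elements taken from the text are the key layout (a tuple of feature name and feature parameters), the value (the detection method), and the idea that thresholds come from the feature parameters. Because dictionary keys must be hashable, the parameters are stored here as a tuple of (name, value) pairs, which is a design choice of this example.

```python
def check_range(values, low, high):
    """Return indices of values outside the closed interval [low, high]."""
    return [i for i, v in enumerate(values) if not (low <= v <= high)]


def check_max_length(values, max_len):
    """Return indices of strings longer than max_len characters."""
    return [i for i, v in enumerate(values) if len(v) > max_len]


# Key: (feature name, feature parameters); value: the detection method.
# The parameters double as the thresholds of the method.
method_library = {
    ("numeric value", (("low", 0.0), ("high", 150.0))): check_range,
    ("text length", (("max_len", 8),)): check_max_length,
}


def run_method(key, values):
    """Look up the method stored under key and call it with the key's parameters."""
    _feature_name, params = key
    return method_library[key](values, **dict(params))


key = ("numeric value", (("low", 0.0), ("high", 150.0)))
print(run_method(key, [20, 200, -1]))  # positions of out-of-range values
```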
Preferably, the matching comprises the following process: the name of the data feature to be processed and the keys in the anomaly detection method library are each converted into word vectors by NLP word embedding, and the cosine similarity between the word vectors is calculated; a key whose similarity reaches the threshold value is a potential key for the data feature, and the anomaly detection methods corresponding to such keys form the matching result. A word vector contains multidimensional numerical values, and cosine similarity allows such vectors to be compared more accurately.
Preferably, the calculation formula of the cosine similarity is as follows:
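$$\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert} = \frac{\sum_{i=1}^{n} u_i v_i}{\sqrt{\sum_{i=1}^{n} u_i^{2}} \, \sqrt{\sum_{i=1}^{n} v_i^{2}}}$$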
where u and v represent two word vectors, respectively. This formula is a common formula for cosine similarity calculation.
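The matching step can be sketched in Python as follows. The three-dimensional toy vectors stand in for real NLP word embeddings, which the patent does not specify; the threshold of 0.8 and the use of `>=` (a key matches once its similarity reaches the threshold) are likewise assumptions of this example.

```python
import math

# Hypothetical toy embeddings; a real system would obtain these from an NLP model.
TOY_EMBEDDINGS = {
    "order date": [0.9, 0.1, 0.0],
    "date": [1.0, 0.0, 0.0],
    "numeric value": [0.0, 1.0, 0.1],
}


def cosine_similarity(u, v):
    """Standard cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def match_methods(feature_name, library_keys, threshold=0.8):
    """Return the library keys whose feature name is similar enough to feature_name."""
    query = TOY_EMBEDDINGS[feature_name]
    return [
        key
        for key in library_keys
        if cosine_similarity(query, TOY_EMBEDDINGS[key[0]]) >= threshold
    ]


keys = [("date", ()), ("numeric value", ())]
print(match_methods("order date", keys))  # keys judged similar to "order date"
```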
Preferably, the large-scale data feature traversal process includes: scaling the value of each dimension of the word vector to be matched proportionally into the range 0-255; representing each word vector as an array of n pixels arranged in sequence, where n is the dimension of the word vector and the value of each dimension is the gray value of the corresponding pixel; copying the image represented by the pixel array onto a white background picture of m pixels to obtain a replica picture, where m is x^2 times n and x is a natural number greater than or equal to 2; reducing the replica picture to n pixels; reading the gray value of each pixel to form a new special word vector; and calculating the cosine similarity with the special word vectors, so as to reduce the computational load for large-scale data. When facing massive data, processing every item exactly as a single item would be processed is accurate but computationally expensive and inefficient overall. The vectors are therefore fuzzified by the above method: although a fuzzified word vector deviates from the original, word vectors that were originally similar remain suitably similar, so the similarity results differ only slightly and the computational pressure of massive data can be handled.
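One possible reading of this image-based fuzzification is sketched below in plain Python. Several details are assumptions of the example, since the text does not fix them: min-max scaling into 0-255, laying the vector out as a single row of n pixels, pasting that row into the top-left corner of a white canvas of x rows and x·n columns (so m = x^2·n), and shrinking the canvas back to n pixels by averaging each x-by-x block.

```python
def scale_to_gray(vector):
    """Min-max scale a vector into integer gray values 0-255 (assumed scaling rule)."""
    lo, hi = min(vector), max(vector)
    if hi == lo:  # constant vector: map everything to 0
        return [0] * len(vector)
    return [round(255 * (v - lo) / (hi - lo)) for v in vector]


def fuzzify_via_image(vector, x=2):
    """Return the 'special word vector' read from the shrunken replica picture."""
    if x < 2:
        raise ValueError("x must be a natural number >= 2")
    gray = scale_to_gray(vector)
    n = len(gray)
    # White background of x rows by x*n columns, i.e. m = x^2 * n pixels.
    canvas = [[255] * (x * n) for _ in range(x)]
    canvas[0][:n] = gray  # copy the 1-by-n image into the top-left corner
    # Shrink to 1 row of n pixels by averaging each x-by-x block.
    special = []
    for j in range(n):
        block = [canvas[r][c] for r in range(x) for c in range(j * x, (j + 1) * x)]
        special.append(round(sum(block) / len(block)))
    return special


print(fuzzify_via_image([0.0, 1.0, 0.5, 0.25], x=2))
```

Under this reading the original values are blended with the white background and with their neighbours, so the output keeps the length n while carrying coarser information than the input.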
As an alternative, the large-scale data feature traversal process includes: scaling the value of each dimension of the word vector to be matched into the range 0-255; dividing 0-255 into a number of levels; replacing the value of each dimension with the middle value of the level it falls in, thereby generating a new special word vector; and calculating the cosine similarity with the special word vectors, so as to reduce the computational load for large-scale data. This scheme is likewise based on fuzzified word vectors and reduces the amount of computation for large-scale data.
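The level-based alternative amounts to quantizing each gray value to the middle of its level. The sketch below assumes equal-width levels and that the input has already been scaled to integers in 0-255; the number of levels (8 here) is a hypothetical choice, as the text does not state one.

```python
def quantize_to_midpoints(gray_values, levels=8):
    """Replace each 0-255 value with the middle value of the level it falls in.

    Levels are assumed to be equal-width, so `levels` must divide 256.
    """
    if levels < 1 or 256 % levels != 0:
        raise ValueError("levels must be a positive divisor of 256")
    width = 256 // levels
    special = []
    for g in gray_values:
        if not 0 <= g <= 255:
            raise ValueError("gray values must lie in 0-255")
        index = int(g) // width                      # which level the value falls in
        special.append(index * width + width // 2)   # middle value of that level
    return special


print(quantize_to_midpoints([0, 31, 32, 200, 255], levels=8))
```

With 8 levels each dimension can take only 8 distinct values, which is what makes the resulting special word vectors cheaper to store and compare than the originals.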
The substantial effects of the invention include: anomaly detection driven by detection rules is replaced by anomaly detection driven by data features, and the corresponding outlier detection method is generated from the feature information of the data in each field; in addition, a dedicated fuzzification mechanism is provided for large-scale data, so that data quality checking can be scaled up and automated and data quality problems are detected more efficiently.
Detailed Description
The technical solution of the present application will be described with reference to the following examples. In addition, numerous specific details are set forth below in order to provide a better understanding of the present invention. It will be understood by those skilled in the art that the present invention may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present invention.
Example 1:
A large-scale data quality anomaly detection method based on data characteristics comprises the following steps:
Step S1: construct a data anomaly detection method library by setting a corresponding detection method for each data feature and collecting these methods into the library.
The data anomaly detection method library is stored as a dictionary: a tuple composed of a data feature name and its feature parameters serves as the key, and the anomaly detection method corresponding to that data feature serves as the value. The Python dictionary type stores key-value pairs and is used here to hold the data features and their anomaly detection methods; the threshold of each anomaly detection method is given by the feature parameters. Storing the library as a dictionary separates keys from values clearly, which facilitates subsequent matching.
Step S2: match each data feature to an anomaly detection method and perform detection according to the method in the matching result.
The matching comprises the following process: the name of the data feature to be processed and the keys in the anomaly detection method library are each converted into word vectors by NLP word embedding, and the cosine similarity between the word vectors is calculated; a key whose similarity reaches the threshold value is a potential key for the data feature, and the anomaly detection methods corresponding to such keys form the matching result. A word vector contains multidimensional numerical values, and cosine similarity allows such vectors to be compared more accurately.
The cosine similarity is calculated as follows:
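$$\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert} = \frac{\sum_{i=1}^{n} u_i v_i}{\sqrt{\sum_{i=1}^{n} u_i^{2}} \, \sqrt{\sum_{i=1}^{n} v_i^{2}}}$$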
where u and v represent two word vectors, respectively. This formula is a common formula for cosine similarity calculation.
Step S3: traverse the large-scale data features, matching and detecting each one.
The large-scale data feature traversal process of this embodiment includes: scaling the value of each dimension of the word vector to be matched proportionally into the range 0-255; representing each word vector as an array of n pixels arranged in sequence, where n is the dimension of the word vector and the value of each dimension is the gray value of the corresponding pixel; copying the image represented by the pixel array onto a white background picture of m pixels to obtain a replica picture, where m is x^2 times n and x is a natural number greater than or equal to 2; reducing the replica picture to n pixels; reading the gray value of each pixel to form a new special word vector; and calculating the cosine similarity with the special word vectors, so as to reduce the computational load for large-scale data. When facing massive data, processing every item exactly as a single item would be processed is accurate but computationally expensive and inefficient overall. The vectors are therefore fuzzified by the above method: although a fuzzified word vector deviates from the original, word vectors that were originally similar remain suitably similar, so the similarity results differ only slightly and the computational pressure of massive data can be handled.
The method library of this embodiment is built by designing corresponding anomaly detection methods for different data features from the perspectives of statistics, common sense, natural laws, professional domain knowledge, and the like. For example, for a numeric-value feature the method reports an anomaly when a field value is an extreme value, and for a date feature it reports an anomaly for field content that does not conform to the date format. The specific content of the library is determined by actual usage requirements, and after matching, detection is performed in a targeted manner.
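Steps S1 to S3 of this embodiment can be tied together in a short end-to-end Python sketch. Everything concrete in it is hypothetical: the two-dimensional toy embeddings, the two library entries with empty parameter tuples, the `%Y-%m-%d` date format, and the threshold of 0.8. The fuzzification of the word vectors is left out to keep the example short.

```python
import math
from datetime import datetime


def cosine_similarity(u, v):
    """Standard cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def check_dates(values):
    """Return indices of entries that are not valid %Y-%m-%d dates (assumed format)."""
    bad = []
    for i, text in enumerate(values):
        try:
            datetime.strptime(text, "%Y-%m-%d")
        except (ValueError, TypeError):
            bad.append(i)
    return bad


def check_non_negative(values):
    """Return indices of negative values."""
    return [i for i, v in enumerate(values) if v < 0]


# S1: method library keyed by (feature name, feature parameters).
METHOD_LIBRARY = {("date", ()): check_dates, ("amount", ()): check_non_negative}

# Hypothetical toy embeddings standing in for real NLP word vectors.
TOY_EMBEDDINGS = {
    "date": [1.0, 0.0],
    "order date": [0.9, 0.2],
    "amount": [0.0, 1.0],
    "payment amount": [0.1, 0.95],
}


def detect_all(table, threshold=0.8):
    """S3: traverse every field; S2: match it to library methods and run them."""
    report = {}
    for field, values in table.items():
        query = TOY_EMBEDDINGS[field]
        flagged = set()
        for (name, _params), method in METHOD_LIBRARY.items():
            if cosine_similarity(query, TOY_EMBEDDINGS[name]) >= threshold:
                flagged.update(method(values))
        report[field] = sorted(flagged)
    return report


table = {"order date": ["2021-06-17", "bad"], "payment amount": [10, -5]}
print(detect_all(table))
```

No rule is written per field here: each field is routed to a check purely through the similarity between its name and the library keys, which is the feature-driven behaviour the embodiment describes.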
Example 2:
This embodiment is the same as Embodiment 1 except for the large-scale data feature traversal process, which in this embodiment includes: scaling the value of each dimension of the word vector to be matched into the range 0-255; dividing 0-255 into a number of levels; replacing the value of each dimension with the middle value of the level it falls in, thereby generating a new special word vector; and calculating the cosine similarity with the special word vectors, so as to reduce the computational load for large-scale data. This scheme is likewise based on fuzzified word vectors and reduces the amount of computation for large-scale data.
The substantial effects of the above embodiments include: anomaly detection driven by detection rules is replaced by anomaly detection driven by data features, and the corresponding outlier detection method is generated from the feature information of the data in each field; in addition, a dedicated fuzzification mechanism is provided for large-scale data, so that data quality checking can be scaled up and automated and data quality problems are detected more efficiently.
From the above description of the embodiments, those skilled in the art will understand that the present embodiment, if implemented as a software functional unit and sold or used as a stand-alone product, may be stored in a readable storage medium. On this understanding, the technical solutions of the embodiments of the present application, in essence or in the part that contributes over the prior art, or in whole or in part, may be embodied as a software product that is stored in a storage medium and includes several instructions for causing a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to execute all or some of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The above describes only specific embodiments of the present application, but the scope of protection of the present application is not limited thereto; any change or substitution that a person skilled in the art can readily conceive within the technical scope disclosed by the present application shall fall within the scope of protection of the present application. The scope of protection of the present application shall therefore be determined by the scope of protection of the claims.

Claims (6)

1. A large-scale data quality anomaly detection method based on data characteristics is characterized by comprising the following steps:
constructing a data anomaly detection method library, setting a corresponding detection method according to each data characteristic, and summarizing to form the data anomaly detection method library;
carrying out anomaly detection method matching on the data characteristics, and detecting according to an anomaly detection method in a matching result;
and traversing the large-scale data features, and matching and detecting each data feature.
2. The large-scale data quality anomaly detection method based on data features as claimed in claim 1, wherein the data anomaly detection method library is stored as a dictionary, a tuple consisting of a data feature name and its feature parameters is used as a key of the dictionary, and the anomaly detection method corresponding to the data feature is used as a value of the dictionary.
3. The large-scale data quality anomaly detection method based on data features as claimed in claim 2, wherein the matching comprises the following process: the name of the data feature to be processed and the keys in the anomaly detection method library are each converted into word vectors by NLP word embedding, and the cosine similarity between the word vectors is calculated, wherein a key whose similarity reaches the threshold value is a potential key for the data feature, and the anomaly detection methods corresponding to such keys form the matching result.
4. The large-scale data quality anomaly detection method based on data characteristics according to claim 3, wherein the cosine similarity is calculated according to the following formula:
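$$\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert} = \frac{\sum_{i=1}^{n} u_i v_i}{\sqrt{\sum_{i=1}^{n} u_i^{2}} \, \sqrt{\sum_{i=1}^{n} v_i^{2}}}$$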
where u and v represent two word vectors, respectively.
5. The large-scale data quality anomaly detection method based on data features as claimed in claim 3 or 4, wherein the large-scale data feature traversal process comprises: scaling the value of each dimension of the word vector to be matched proportionally into the range 0-255; representing each word vector as an array of n pixels arranged in sequence, where n is the dimension of the word vector and the value of each dimension is the gray value of the corresponding pixel; copying the image represented by the pixel array onto a white background picture of m pixels to obtain a replica picture, where m is x^2 times n and x is a natural number greater than or equal to 2; reducing the replica picture to n pixels; reading the gray value of each pixel to form a new special word vector; and calculating the cosine similarity with the special word vectors, so as to reduce the computational load for large-scale data.
6. The large-scale data quality anomaly detection method based on data features as claimed in claim 3 or 4, wherein the large-scale data feature traversal process comprises: scaling the value of each dimension of the word vector to be matched into the range 0-255; dividing 0-255 into a number of levels; replacing the value of each dimension with the middle value of the level it falls in, thereby generating a new special word vector; and calculating the cosine similarity with the special word vectors, so as to reduce the computational load for large-scale data.
CN202110671429.3A 2021-06-17 2021-06-17 Large-scale data quality anomaly detection method based on data characteristics Pending CN113569006A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110671429.3A CN113569006A (en) 2021-06-17 2021-06-17 Large-scale data quality anomaly detection method based on data characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110671429.3A CN113569006A (en) 2021-06-17 2021-06-17 Large-scale data quality anomaly detection method based on data characteristics

Publications (1)

Publication Number Publication Date
CN113569006A true CN113569006A (en) 2021-10-29

Family

ID=78162179

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110671429.3A Pending CN113569006A (en) 2021-06-17 2021-06-17 Large-scale data quality anomaly detection method based on data characteristics

Country Status (1)

Country Link
CN (1) CN113569006A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106708909A (en) * 2015-11-18 2017-05-24 阿里巴巴集团控股有限公司 Data quality detection method and apparatus
WO2017107566A1 (en) * 2015-12-25 2017-06-29 广州视源电子科技股份有限公司 Retrieval method and system based on word vector similarity
CN109766331A (en) * 2018-12-06 2019-05-17 中科恒运股份有限公司 Method for processing abnormal data and device
CN111783442A (en) * 2019-12-19 2020-10-16 国网江西省电力有限公司电力科学研究院 Intrusion detection method, device, server and storage medium
CN112329847A (en) * 2020-11-03 2021-02-05 北京神州泰岳软件股份有限公司 Abnormity detection method and device, electronic equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
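Wan Yaoqing (万耀青), Huang Xinyuan (黄心渊), Wang Wenqing (王文清) et al. (eds.): "Modeling of Mechanical Optimization Design and Evaluation of Optimization Methods" (机械优化设计建模与优化方法评价), 31 October 1995, Beijing Institute of Technology Press (北京理工大学出版社), pages: 96 *
Kong Lingxin (孔令信), Liu Zhendong (刘振东), Ma Yajun (马亚军) (eds.): "Python Programming" (Python程序设计), 31 March 2021, Chongqing University Press (重庆大学出版社), pages: 72 - 73 *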

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113987190A (en) * 2021-11-16 2022-01-28 全球能源互联网研究院有限公司 Data quality check rule extraction method and system
CN113987190B (en) * 2021-11-16 2023-02-28 国网智能电网研究院有限公司 Data quality check rule extraction method and system

Similar Documents

Publication Publication Date Title
Zhao et al. Robust hashing for image authentication using Zernike moments and local features
CN110263821B (en) Training of transaction feature generation model, and method and device for generating transaction features
CN111597348B (en) User image drawing method, device, computer equipment and storage medium
CN114092474B (en) Method and system for detecting processing defects of complex texture background of mobile phone shell
CN111104241A (en) Server memory anomaly detection method, system and equipment based on self-encoder
CN112990281A (en) Abnormal bid identification model training method, abnormal bid identification method and abnormal bid identification device
CN111428757A (en) Model training method, abnormal data detection method and device and electronic equipment
CN113569006A (en) Large-scale data quality anomaly detection method based on data characteristics
US20220164850A1 (en) Method and apparatus for providing information using trained model based on machine learning
CN116522003B (en) Information recommendation method, device, equipment and medium based on embedded table compression
CN112612810A (en) Slow SQL statement identification method and system
CN109902162B (en) Text similarity identification method based on digital fingerprints, storage medium and device
CN112465012A (en) Machine learning modeling method and device, electronic equipment and readable storage medium
CN110033113B (en) Information processing system and learning method for information processing system
CN116703427A (en) Anti-counterfeiting tracing method combining dot matrix code and digital watermark
CN113591485A (en) Intelligent data quality auditing system and method based on data science
CN115829925A (en) Appearance defect detection method and device, computer equipment and storage medium
CN113569005B (en) Large-scale data characteristic intelligent extraction method based on data content
CN115203364A (en) Software fault feedback processing method, device, equipment and readable storage medium
CN111538905B (en) Object recommendation method and device
Wang et al. Median filtering detection using LBP encoding pattern★
CN114547285B (en) Method and device for inferring meaning of table data, computer device and storage medium
CN117951271A (en) Data analysis method, device, computer equipment and computer readable storage medium
CN117350295A (en) E-commerce comment relation extraction method and device
CN115660756A (en) Price monitoring method, device, equipment and medium for E-commerce commodities

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination