CN113569006A - Large-scale data quality anomaly detection method based on data characteristics - Google Patents
Large-scale data quality anomaly detection method based on data characteristics
- Publication number
- CN113569006A (application number CN202110671429.3A)
- Authority
- CN
- China
- Prior art keywords
- data
- detection method
- anomaly detection
- word vector
- scale
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/242—Dictionaries
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Complex Calculations (AREA)
Abstract
The invention discloses a large-scale data quality anomaly detection method based on data characteristics, comprising the following steps: constructing a data anomaly detection method library by setting a corresponding detection method for each data feature and collecting these methods into the library; matching each data feature to an anomaly detection method and performing detection with the matched method; and traversing the large-scale data features, matching and detecting each one. The substantial effects of the invention include: anomaly detection is driven by data features rather than by detection rules; an outlier detection method is generated for each field from the feature information of its data; and a dedicated fuzzification mechanism is provided for large-scale data. Together these make data quality checking scalable and automated and improve the efficiency of detecting data quality problems.
Description
Technical Field
The invention relates to the field of data anomaly detection, in particular to a large-scale data quality anomaly detection method based on data characteristics.
Background
With the development of the digital economy, industries no longer simply pursue data volume, and the requirements on data quality during data application keep rising. Faced with massive data resources, locating data quality problems more quickly, accurately, and intelligently, and carrying out the corresponding governance work, is the core of current enterprise-level data asset management.
The invention with publication number CN108256074A discloses a verification processing method, which includes: obtaining the models of a data warehouse to be verified, each model including a plurality of pieces of field information, the field information including field definitions and field types; checking the field information against a pre-stored data dictionary, where the data dictionary comprises a plurality of standard terms and each standard term comprises a standard definition and a standard type; and, if a field definition matches a standard definition but the field type does not match the standard type, modifying the field type to be consistent with the standard type. That method checks the models of a data warehouse against standard terms and, when a field definition matches a standard definition but the field type does not, modifies the field type to match the standard type so that the model conforms to the standard. Traditional data quality anomaly detection of this kind is rule-driven: for a specific field of a specific table, a business expert designs a set of quality anomaly detection methods based on business specifications and experience, and the corresponding targeted governance work is then carried out.
Disclosure of Invention
In view of the above problems, the invention provides a large-scale data quality anomaly detection method based on data characteristics. It changes anomaly detection from being driven by detection rules to being driven by data features, and generates a corresponding outlier detection method from the feature information of the data in each field. This makes data quality checking scalable and automated, widens the range of data quality detection, and improves the efficiency of detecting data quality problems.
The technical scheme of the invention is as follows.
A large-scale data quality anomaly detection method based on data characteristics comprises the following steps:
constructing a data anomaly detection method library, setting a corresponding detection method according to each data characteristic, and summarizing to form the data anomaly detection method library; carrying out anomaly detection method matching on the data characteristics, and detecting according to an anomaly detection method in a matching result; and traversing the large-scale data features, and matching and detecting each data feature.
The method library of the invention is set up by designing anomaly detection methods for different data features from the perspectives of statistics, common sense, natural laws, professional knowledge, and the like. For example, a numeric-type feature reports an anomaly when a field value is an extreme value, and a date feature reports an anomaly for field content that does not conform to a date format. The specific contents of the library are determined by actual usage requirements, and after matching, detection is performed in a targeted manner.
Preferably, the data anomaly detection method library is stored as a dictionary: a tuple consisting of a data feature name and its feature parameters serves as the key, and the anomaly detection method corresponding to that data feature serves as the value. Python's dictionary type stores key-value pairs and is used here to hold the data features and their anomaly detection methods. The threshold of each anomaly detection method is given by the feature parameters. Storing the library as a dictionary separates keys from values clearly, which facilitates subsequent matching.
Preferably, the matching comprises the following process: the data feature name to be processed and the keys in the anomaly detection method library are each converted into word vectors through NLP word embedding, and the cosine similarity between the word vectors is calculated. A key whose similarity meets the threshold is a potential key for the data feature, and the anomaly detection methods corresponding to such keys are the matching result. A word vector contains multi-dimensional numerical values, so cosine similarity allows a more accurate comparison.
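The dictionary layout described above can be sketched in Python as follows. This is a minimal illustration rather than the patent's implementation: the two detector functions, the z-score rule, the feature names, and the parameter values are all assumptions chosen for the example.

```python
from datetime import datetime
from statistics import mean, pstdev


def detect_extreme_values(values, z_threshold):
    """Return values whose z-score exceeds z_threshold (assumed rule)."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return [v for v in values if abs(v - mu) / sigma > z_threshold]


def detect_bad_dates(values, date_format):
    """Return values that cannot be parsed with date_format."""
    bad = []
    for v in values:
        try:
            datetime.strptime(v, date_format)
        except (ValueError, TypeError):
            bad.append(v)
    return bad


# Key: (feature name, feature parameters); value: detection method.
# The feature parameters carry the threshold or format the method needs.
# Names and parameter values here are hypothetical.
method_library = {
    ("numeric value", (3.0,)): detect_extreme_values,
    ("date", ("%Y-%m-%d",)): detect_bad_dates,
}


def run_detection(key, values):
    """Look up the method for key and call it with the key's parameters."""
    _, params = key
    return method_library[key](values, *params)
```

Because the feature parameters are stored as a tuple inside the key, the key stays hashable, and the same detector function could be registered under several keys with different thresholds.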
Preferably, the cosine similarity is calculated as follows:
$$\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}$$
where u and v represent the two word vectors. This is the commonly used formula for cosine similarity.
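The matching step can be illustrated with a short Python sketch. The three-dimensional vectors below are hand-written stand-ins for real word embeddings, the function names are hypothetical, and treating "meets the threshold" as greater than or equal to is an assumption.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention for zero vectors
    return dot / (norm_u * norm_v)


def match_keys(feature_vector, key_vectors, threshold):
    """Return library keys whose vector is similar enough to the feature."""
    return [
        key
        for key, vec in key_vectors.items()
        if cosine_similarity(feature_vector, vec) >= threshold
    ]


# Hypothetical embeddings of two library keys and one field name.
key_vectors = {
    ("date", ("%Y-%m-%d",)): [0.9, 0.1, 0.0],
    ("numeric value", (3.0,)): [0.0, 0.2, 0.9],
}
birth_date_vector = [0.8, 0.2, 0.1]
matched = match_keys(birth_date_vector, key_vectors, threshold=0.8)
```

With these toy vectors only the date key passes the 0.8 threshold, so the date-format check would be the matching result for the field.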
Preferably, the large-scale data feature traversal process includes: scaling the value of each dimension of the word vector to be matched proportionally into the range 0-255; representing each word vector as an array of n pixels arranged in sequence, where n is the dimension of the word vector and the value of each dimension is the gray value of the corresponding pixel; copying the image represented by the pixel array onto a white background picture of m pixels to obtain a replica picture, where m is x^2 times n and x is a natural number greater than or equal to 2; reducing the replica picture to n pixels; reading the gray value of each pixel to form a new special word vector; and calculating the cosine similarity with the special word vectors to reduce the computational load under large-scale data volumes. When facing massive data, processing every item exactly as a single item would be processed gives high accuracy but requires a large amount of computation and lowers overall efficiency, so the vectors are fuzzified in this way. Although a fuzzified word vector deviates from the original, word vectors that were originally similar remain suitably similar, so the similarity results differ only slightly and the method can cope with the computational pressure of massive data.
As an alternative, the large-scale data feature traversal process includes: scaling the value of each dimension of the word vector to be matched into the range 0-255; dividing 0-255 into several levels; replacing the value of each dimension with the middle value of the level it falls in to generate a new special word vector; and calculating the cosine similarity with the special word vectors to reduce the computational load under large-scale data volumes. This scheme likewise relies on fuzzified word vectors to reduce the amount of computation under large-scale data.
The substantial effects of the invention include: anomaly detection is driven by data features rather than by detection rules; an outlier detection method is generated for each field from the feature information of its data; and a dedicated fuzzification mechanism is provided for large-scale data. Together these make data quality checking scalable and automated and improve the efficiency of detecting data quality problems.
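One possible reading of this image-based fuzzification is sketched below in plain Python. The text does not specify the pixel layout, where the image is placed on the white background, or how the picture is reduced, so the sketch assumes a single row of n pixels pasted at the top-left of an x-row by (x times n)-column white canvas, block averaging for the reduction, and a shared input range of [-1, 1] for the scaling. All function and parameter names are hypothetical.

```python
def to_gray(vec, lo=-1.0, hi=1.0):
    """Linearly map values from the assumed range [lo, hi] to 0..255."""
    return [(min(max(v, lo), hi) - lo) / (hi - lo) * 255.0 for v in vec]


def replica_blur(vec, x=2, lo=-1.0, hi=1.0):
    """Fuzzify a word vector via a white-canvas replica picture."""
    gray = to_gray(vec, lo, hi)
    n = len(gray)
    # White canvas with x rows and x * n columns, so m = x^2 * n pixels.
    canvas = [[255.0] * (x * n) for _ in range(x)]
    # Paste the 1 x n image at the top-left corner (assumed placement).
    canvas[0][:n] = gray
    # Reduce the canvas back to n pixels by averaging each x-by-x block.
    blurred = []
    for j in range(n):
        block = [
            canvas[r][c]
            for r in range(x)
            for c in range(j * x, (j + 1) * x)
        ]
        blurred.append(sum(block) / len(block))
    return blurred


special_vector = replica_blur([1.0, -1.0, 0.0, 0.5], x=2)
```

Under these assumptions the first n / x output values are averages of x neighbouring input values mixed with white, and the remaining values are pure white (255). An image library's resize routine could replace the manual block averaging.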
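The level-based alternative amounts to quantizing each dimension and substituting the midpoint of its level. A minimal Python sketch, assuming eight equal-width levels over the 256 gray values and a shared input range of [-1, 1] (neither is specified in the text, and the function name is hypothetical):

```python
def quantize_to_midpoints(vec, levels=8, lo=-1.0, hi=1.0):
    """Replace each scaled value by the midpoint of the level it falls in."""
    width = 256 / levels  # equal-width levels over gray values 0..255
    result = []
    for v in vec:
        # Scale from the assumed range [lo, hi] to 0..255.
        gray = (min(max(v, lo), hi) - lo) / (hi - lo) * 255.0
        # Index of the level containing this gray value.
        index = min(int(gray // width), levels - 1)
        result.append((index + 0.5) * width)
    return result


special_vector = quantize_to_midpoints([-1.0, 1.0, 0.0])
```

After quantization each dimension takes one of only `levels` distinct values, which is what makes the vectors coarser and cheaper to store or compare in bulk.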
Detailed Description
The technical solution of the present application will be described with reference to the following examples. In addition, numerous specific details are set forth below in order to provide a better understanding of the present invention. It will be understood by those skilled in the art that the present invention may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present invention.
Example 1:
A large-scale data quality anomaly detection method based on data characteristics comprises the following steps:
Step S1: constructing a data anomaly detection method library by setting a corresponding detection method for each data feature and collecting these methods into the library.
The data anomaly detection method library is stored as a dictionary: a tuple consisting of a data feature name and its feature parameters serves as the key, and the anomaly detection method corresponding to that data feature serves as the value. Python's dictionary type stores key-value pairs and is used here to hold the data features and their anomaly detection methods. The threshold of each anomaly detection method is given by the feature parameters. Storing the library as a dictionary separates keys from values clearly, which facilitates subsequent matching.
Step S2: matching each data feature to an anomaly detection method and performing detection with the matched method.
The matching comprises the following process: the data feature name to be processed and the keys in the anomaly detection method library are each converted into word vectors through NLP word embedding, and the cosine similarity between the word vectors is calculated. A key whose similarity meets the threshold is a potential key for the data feature, and the anomaly detection methods corresponding to such keys are the matching result. A word vector contains multi-dimensional numerical values, so cosine similarity allows a more accurate comparison.
The cosine similarity is calculated as follows:
$$\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}$$
where u and v represent the two word vectors. This is the commonly used formula for cosine similarity.
Step S3: traversing the large-scale data features, and matching and detecting each data feature.
The large-scale data feature traversal process of this embodiment includes: scaling the value of each dimension of the word vector to be matched proportionally into the range 0-255; representing each word vector as an array of n pixels arranged in sequence, where n is the dimension of the word vector and the value of each dimension is the gray value of the corresponding pixel; copying the image represented by the pixel array onto a white background picture of m pixels to obtain a replica picture, where m is x^2 times n and x is a natural number greater than or equal to 2; reducing the replica picture to n pixels; reading the gray value of each pixel to form a new special word vector; and calculating the cosine similarity with the special word vectors to reduce the computational load under large-scale data volumes. When facing massive data, processing every item exactly as a single item would be processed gives high accuracy but requires a large amount of computation and lowers overall efficiency, so the vectors are fuzzified in this way. Although a fuzzified word vector deviates from the original, word vectors that were originally similar remain suitably similar, so the similarity results differ only slightly and the method can cope with the computational pressure of massive data.
The method library of this embodiment is set up by designing anomaly detection methods for different data features from the perspectives of statistics, common sense, natural laws, professional knowledge, and the like. For example, a numeric-type feature reports an anomaly when a field value is an extreme value, and a date feature reports an anomaly for field content that does not conform to a date format. The specific contents of the library are determined by actual usage requirements, and after matching, detection is performed in a targeted manner.
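Steps S1 to S3 can be tied together in a short end-to-end Python sketch. It is illustrative only: the two detectors, the column names, the two-dimensional vectors standing in for word embeddings, and the 0.8 threshold are all assumptions, and the fuzzification of the vectors is omitted for brevity.

```python
import math
from datetime import datetime


def cosine(u, v):
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def out_of_range(values, low, high):
    """Return values outside the closed interval [low, high]."""
    return [v for v in values if not low <= v <= high]


def bad_dates(values, fmt):
    """Return values that cannot be parsed with the format fmt."""
    bad = []
    for v in values:
        try:
            datetime.strptime(v, fmt)
        except (ValueError, TypeError):
            bad.append(v)
    return bad


# S1: method library keyed by (feature name, feature parameters).
LIBRARY = {
    ("date", ("%Y-%m-%d",)): bad_dates,
    ("numeric range", (0, 150)): out_of_range,
}
# Hypothetical embeddings of the library keys.
KEY_VECTORS = {
    ("date", ("%Y-%m-%d",)): [1.0, 0.0],
    ("numeric range", (0, 150)): [0.0, 1.0],
}


def traverse(columns, column_vectors, threshold=0.8):
    """S3: visit every column; S2: match it to methods and run them."""
    report = {}
    for name, values in columns.items():
        for key, method in LIBRARY.items():
            if cosine(column_vectors[name], KEY_VECTORS[key]) >= threshold:
                feature_name, params = key
                report.setdefault(name, {})[feature_name] = method(values, *params)
    return report


columns = {"birth_date": ["2021-06-17", "bad"], "age": [30, 41, 999]}
column_vectors = {"birth_date": [0.9, 0.1], "age": [0.1, 0.9]}
report = traverse(columns, column_vectors)
```

In this toy run the `birth_date` column is matched to the date check and the `age` column to the range check, and each reports the single offending value.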
Example 2:
This embodiment is the same as Example 1 except for the large-scale data feature traversal process, which in this embodiment includes: scaling the value of each dimension of the word vector to be matched into the range 0-255; dividing 0-255 into several levels; replacing the value of each dimension with the middle value of the level it falls in to generate a new special word vector; and calculating the cosine similarity with the special word vectors to reduce the computational load under large-scale data volumes. This scheme likewise relies on fuzzified word vectors to reduce the amount of computation under large-scale data.
The substantial effects of the above embodiments include: anomaly detection is driven by data features rather than by detection rules; an outlier detection method is generated for each field from the feature information of its data; and a dedicated fuzzification mechanism is provided for large-scale data. Together these make data quality checking scalable and automated and improve the efficiency of detecting data quality problems.
From the above description of the embodiments, those skilled in the art will understand that the embodiments, if implemented as a software functional unit and sold or used as a stand-alone product, may be stored in a readable storage medium. On this understanding, the technical solutions of the embodiments of the present application, in essence or in the part that contributes over the prior art, or in whole or in part, may be embodied as a software product. The software product is stored in a storage medium and includes several instructions that cause a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to execute all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The above is only a description of specific embodiments of the present application, but the scope of protection of the present application is not limited thereto. Any change or substitution that a person skilled in the art could readily conceive within the technical scope of the present application shall fall within the scope of protection of the present application. Therefore, the scope of protection of the present application shall be subject to the scope of protection of the claims.
Claims (6)
1. A large-scale data quality anomaly detection method based on data characteristics is characterized by comprising the following steps:
constructing a data anomaly detection method library, setting a corresponding detection method according to each data characteristic, and summarizing to form the data anomaly detection method library;
carrying out anomaly detection method matching on the data characteristics, and detecting according to an anomaly detection method in a matching result;
and traversing the large-scale data features, and matching and detecting each data feature.
2. The large-scale data quality anomaly detection method based on data characteristics as claimed in claim 1, wherein the data anomaly detection method library is stored as a dictionary, a tuple consisting of a data feature name and its feature parameters is used as a key of the dictionary, and the anomaly detection method corresponding to the data feature is used as a value of the dictionary.
3. The large-scale data quality anomaly detection method based on data characteristics as claimed in claim 2, wherein the matching comprises the following process: the data feature name to be processed and the keys in the anomaly detection method library are each converted into word vectors through NLP word embedding, and the cosine similarity between the word vectors is calculated, wherein a key whose similarity meets the threshold is a potential key corresponding to the data feature, and the anomaly detection methods corresponding to such keys are the matching result.
5. The large-scale data quality anomaly detection method based on data characteristics as claimed in claim 3 or 4, wherein the large-scale data feature traversal process comprises: scaling the value of each dimension of the word vector to be matched proportionally into the range 0-255; representing each word vector as an array of n pixels arranged in sequence, wherein n is the dimension of the word vector and the value of each dimension is the gray value of the corresponding pixel; copying the image represented by the pixel array onto a white background picture of m pixels to obtain a replica picture, wherein m is x^2 times n and x is a natural number greater than or equal to 2; reducing the replica picture to n pixels; reading the gray value of each pixel to form a new special word vector; and calculating the cosine similarity with the special word vectors to reduce the computational load under large-scale data volumes.
6. The large-scale data quality anomaly detection method based on data characteristics as claimed in claim 3 or 4, wherein the large-scale data feature traversal process comprises: scaling the value of each dimension of the word vector to be matched into the range 0-255; dividing 0-255 into several levels; replacing the value of each dimension with the middle value of the level it falls in to generate a new special word vector; and calculating the cosine similarity with the special word vectors to reduce the computational load under large-scale data volumes.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110671429.3A CN113569006A (en) | 2021-06-17 | 2021-06-17 | Large-scale data quality anomaly detection method based on data characteristics |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113569006A true CN113569006A (en) | 2021-10-29 |
Family
ID=78162179
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110671429.3A Pending CN113569006A (en) | 2021-06-17 | 2021-06-17 | Large-scale data quality anomaly detection method based on data characteristics |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113569006A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106708909A (en) * | 2015-11-18 | 2017-05-24 | 阿里巴巴集团控股有限公司 | Data quality detection method and apparatus |
WO2017107566A1 (en) * | 2015-12-25 | 2017-06-29 | 广州视源电子科技股份有限公司 | Retrieval method and system based on word vector similarity |
CN109766331A (en) * | 2018-12-06 | 2019-05-17 | 中科恒运股份有限公司 | Method for processing abnormal data and device |
CN111783442A (en) * | 2019-12-19 | 2020-10-16 | 国网江西省电力有限公司电力科学研究院 | Intrusion detection method, device, server and storage medium |
CN112329847A (en) * | 2020-11-03 | 2021-02-05 | 北京神州泰岳软件股份有限公司 | Abnormity detection method and device, electronic equipment and storage medium |
Non-Patent Citations (2)
Title |
---|
万耀青,黄心渊,王文清等编: "机械优化设计建模与优化方法评价", 31 October 1995, 北京理工大学出版社, pages: 96 * |
孔令信,刘振东,马亚军编: "Python程序设计", 31 March 2021, 重庆大学出版社, pages: 72 - 73 * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113987190A (en) * | 2021-11-16 | 2022-01-28 | 全球能源互联网研究院有限公司 | Data quality check rule extraction method and system |
CN113987190B (en) * | 2021-11-16 | 2023-02-28 | 国网智能电网研究院有限公司 | Data quality check rule extraction method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||