CN113569005B

CN113569005B - Large-scale data characteristic intelligent extraction method based on data content

Info

Publication number: CN113569005B
Application number: CN202110670587.7A
Authority: CN
Inventors: 葛俊; 梁云丹; 黄建平; 张旭东; 张建松; 陈浩
Original assignee: State Grid Corp of China SGCC; State Grid Zhejiang Electric Power Co Ltd
Current assignee: State Grid Corp of China SGCC; State Grid Zhejiang Electric Power Co Ltd
Priority date: 2021-06-17
Filing date: 2021-06-17
Publication date: 2024-02-20
Anticipated expiration: 2041-06-17
Also published as: CN113569005A

Abstract

The invention discloses an intelligent method for extracting features from large-scale data based on data content, comprising the following steps: performing preliminary identification of field types on the data and eliminating invalid data; comparing the Chinese description of the data with its field type, sampling the data for which the two do not match, calculating the proportion of each field type in the sample, and revising the field type according to the proportion result; and extracting features according to the field types. The substantial effects of the invention include: the generality of table processing and detection and the meaning represented by data of different field types are both taken into account; corresponding features can be extracted for each field from the table header information and the data content alone; feature extraction is automated and performed at scale; accurate identification and positioning of detection objects is provided for investigating data quality problems; and a foundation is laid for improving the efficiency of subsequent quality detection work.

Description

Large-scale data characteristic intelligent extraction method based on data content

Technical Field

The invention relates to the technical field of data feature extraction, in particular to a large-scale data feature intelligent extraction method based on data content.

Background

With the development of the digital economy, industries no longer pursue data volume alone, and the requirements on data quality in data applications keep rising. How to find and locate data quality problems faster, more accurately and more intelligently, and to carry out the corresponding remediation work, is the key and core of current enterprise-level data asset management.

In the prior art, the invention with publication No. CN105554152A discloses a method and a device for extracting data features. In more detail, another invention, publication No. CN108256074A, discloses a verification processing method, comprising: obtaining models of a data warehouse to be verified, each model comprising a plurality of pieces of field information, the field information comprising field definitions and field types; verifying the field information against a pre-stored data dictionary, wherein the data dictionary comprises a plurality of standard terms and each standard term comprises a standard definition and a standard type; and, if a field definition matches the standard definition but the field type does not match the standard type, modifying the field type to be consistent with the standard type. That method verifies the model of a data warehouse against standard terms and purposefully modifies a field type to agree with the standard type when the definitions match but the types do not, thereby obtaining a model consistent with the standard.

In the prior art there are many approaches to the related problems. In the traditional data quality management mode, the objects of problem detection are selected by business experts, who must specify the particular data tables and fields according to business specifications and experience, and must also specify what features each field has and which rules apply to it. This mode and its results place extremely high demands on the experience and expertise of the business experts, and the range of objects checked for data quality problems is limited. Because the approach depends heavily on business experts, for large-scale, massive data the experts must specify the corresponding detection objects and ranges one by one. The resulting data features generalize poorly and maintenance is time-consuming and labor-intensive. Large-scale, automatic definition of data quality detection objects and extraction of the corresponding data features cannot be achieved, so data quality inspection is inefficient and strongly affected by human experience.

Disclosure of Invention

In view of the problems in the prior art, the invention aims to provide an intelligent method for extracting features from large-scale data based on data content. The method takes the generality of table processing and detection into account and can extract corresponding features for each field from the table header information and the data content alone, without any additional knowledge being supplied. It automates feature extraction and performs it at scale, removes the need to designate the data objects and feature conditions for data quality detection one by one, reduces the dependence on the knowledge and experience of business personnel, provides accurate identification and positioning of detection objects for investigating data quality problems, and lays a foundation for improving the efficiency of subsequent quality detection work.

The following is a technical scheme of the invention.

A large-scale data characteristic intelligent extraction method based on data content comprises the following steps:

performing preliminary identification of field types on the data, and eliminating invalid data;

comparing the Chinese description of the data with its field type, sampling the data for which the two do not match, calculating the proportion of each field type in the sample, and revising the field type according to the proportion result;

and extracting the characteristics according to the field types.

In the extraction process of the invention, the characteristics of the different field types and the Chinese descriptions of the corresponding data are considered together, and the field type is determined comprehensively in two steps, preliminary identification and revision, which improves identification accuracy.

Preferably, the preliminary identification process includes: performing preliminary identification on the data to be identified according to an existing field type database, or introducing an identification model trained with a neural network to perform the preliminary identification, so as to obtain a preliminary identification result for the field type. Different field types have their own characteristics, and the prior art in this field generally identifies them by comparison against a database, a trained model or the like. In the invention this technique is used only for preliminary identification, which reduces implementation cost while ensuring a basic level of accuracy.

Preferably, the process of eliminating invalid data includes: defining invalid tables and invalid fields; based on the metadata information and data content of each table, uniformly judging empty tables, zombie tables, log tables, backup tables, temporary tables, single-field tables and low-heat (rarely accessed) tables to be invalid tables; uniformly judging null fields and single-value fields to be invalid fields; and identifying and eliminating the invalid tables and fields. The invalid tables and invalid fields cover the common kinds of invalid data, and eliminating them reduces the processing load of the subsequent data extraction and analysis.

Preferably, the process of revising the field type includes: performing word segmentation and semantic recognition on the Chinese description of the data using an NLP (natural language processing) module; after this analysis, performing path recognition of synonyms or near-synonyms through a type decision tree; where the semantics of the Chinese description do not match the field type, marking the field as having a suspected field type to be revised; then sampling, multiple times, the data content whose Chinese descriptions have the same or similar semantics, counting the proportions of the different field types in the sampled data, and taking the type whose proportion exceeds a threshold as the recommended revised field type; and finally revising the field type to the recommended type, that is, the field type to which the actually stored data belong. Natural language processing can segment the Chinese description and recognize its semantics, and the decision tree can recognize paths with similar meaning, which helps judge whether the field has a suspected field type to be revised. The result is finally determined by using the proportion as the criterion against a set threshold. The revision process complements the preliminary identification and further improves identification accuracy.
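The mismatch check between a Chinese description and an identified field type can be illustrated with a minimal sketch. It assumes a small keyword table (`KEYWORD_TO_TYPE`, hypothetical) in place of the type decision tree, and plain substring matching in place of real word segmentation and semantic recognition; neither is specified by the invention.

```python
from typing import Optional

# Hypothetical keyword table standing in for the type decision tree:
# a keyword found in the Chinese description suggests an expected field type.
KEYWORD_TO_TYPE = {
    "日期": "date",     # "date"
    "时间": "date",     # "time"
    "金额": "numeric",  # "amount"
    "数量": "numeric",  # "quantity"
    "名称": "text",     # "name"
    "地址": "text",     # "address"
}


def expected_type(description: str) -> Optional[str]:
    """Return the type suggested by the description, or None if no keyword matches."""
    for keyword, field_type in KEYWORD_TO_TYPE.items():
        if keyword in description:
            return field_type
    return None


def is_suspected(description: str, identified_type: str) -> bool:
    """Flag a field whose description semantics disagree with its identified type."""
    suggested = expected_type(description)
    return suggested is not None and suggested != identified_type
```

For example, a field described as "开户日期" (account opening date) but identified as text would be flagged, while a description that matches no keyword is left unflagged.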

Preferably, the field type includes at least one of a numeric type, a text type, and a date type.

Preferably, extracting features according to the field type includes: for numerical type fields, extracting features and feature values by means of the mean, maximum, minimum, median, variance, interquartile range, value clustering and length clustering; for text type fields, extracting statistical attribute features from length clustering and structural distribution, and extracting content features through word segmentation and semantic recognition of the data content; and for date type fields, performing structural analysis and extracting features of the date format and length.

Preferably, after the field type revision is finished, the method further comprises a verification step: the date data are converted into text data and copied into a verification group and an interference group; in the verification group, year, month and day descriptions are inserted according to the original date format; in the interference group, counting-unit descriptions are added according to the number of digits of the original date data; the verification group and the interference group are each inserted into their adjacent text data; semantic recognition is performed on the spliced text data by the NLP (natural language processing) module, and the recognition speed of each pair of interference group and verification group is recorded; if the verification group is recognized faster than the interference group by more than an amplitude threshold, the verification is passed; otherwise, the corresponding original date data are listed as a suspected error type. Because date type or numerical type data are often associated with the adjacent text type data, when the original identification is correct the text spliced with the verification group is easier to recognize and is therefore recognized faster. If the original identification is wrong, the text spliced with the verification group is itself erroneous, so it has no speed advantage over the interference group and may even be slower, and the data are listed as a suspected error type.

The substantial effects of the invention include: the generality of table processing and detection and the meaning represented by data of different field types are both taken into account; corresponding features can be extracted for each field from the table header information and the data content alone; feature extraction is automated and performed at scale; accurate identification and positioning of detection objects is provided for investigating data quality problems; and a foundation is laid for improving the efficiency of subsequent quality detection work.

Detailed Description

The technical scheme of the present application will be described below with reference to examples. In addition, numerous specific details are set forth in the following description in order to provide a better understanding of the present invention. It will be understood by those skilled in the art that the present invention may be practiced without some of these specific details. In some instances, well known methods, procedures, components, and circuits have not been described in detail so as not to obscure the present invention.

Example 1:

An intelligent method for extracting features from large-scale data based on data content, mainly directed at the field types numerical type, text type and date type.

The method comprises the following steps:

S01: Perform preliminary identification of field types on the data, and eliminate invalid data.

The preliminary identification process comprises: performing preliminary identification on the data to be identified according to an existing field type database, or introducing an identification model trained with a neural network to perform the preliminary identification, so as to obtain a preliminary identification result for the field type. Different field types have their own characteristics, and the prior art in this field generally identifies them by comparison against a database, a trained model or the like. In this embodiment the technique is used only for preliminary identification, which reduces implementation cost while ensuring a basic level of accuracy.
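As an illustration of the database-based variant of preliminary identification, the sketch below maps a declared column type to one of the three field types. The lookup table `DECLARED_TYPE_TO_FIELD_TYPE` and its entries are assumptions made for the example; the embodiment does not specify the contents of the field type database.

```python
# Hypothetical "field type database": declared column types mapped to the
# three field types handled by the method. Entries are illustrative only.
DECLARED_TYPE_TO_FIELD_TYPE = {
    "int": "numeric",
    "bigint": "numeric",
    "decimal": "numeric",
    "float": "numeric",
    "double": "numeric",
    "char": "text",
    "varchar": "text",
    "text": "text",
    "date": "date",
    "datetime": "date",
    "timestamp": "date",
}


def preliminary_type(declared: str) -> str:
    """Map a declared type such as 'VARCHAR(32)' to a preliminary field type."""
    # Drop any length/precision suffix and normalise case before the lookup.
    base = declared.strip().lower().split("(")[0].strip()
    return DECLARED_TYPE_TO_FIELD_TYPE.get(base, "unknown")
```

The result is only a first guess; a column declared as `VARCHAR` may still hold dates, which is what the revision in S02 is meant to catch.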

The process of eliminating invalid data includes: defining invalid tables and invalid fields; based on the metadata information and data content of each table, uniformly judging empty tables, zombie tables, log tables, backup tables, temporary tables, single-field tables and low-heat (rarely accessed) tables to be invalid tables; uniformly judging null fields and single-value fields to be invalid fields; and identifying and eliminating the invalid tables and fields. The invalid tables and invalid fields cover the common kinds of invalid data, and eliminating them reduces the processing load of the subsequent data extraction and analysis.
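A minimal sketch of the invalid table and invalid field checks follows. The `TableInfo` structure, the name markers used to spot log, backup and temporary tables, and the thresholds for zombie and low-heat tables are all assumptions for illustration; the embodiment only names the categories.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TableInfo:
    """Hypothetical metadata and content summary for one table."""
    name: str
    row_count: int
    days_since_last_write: int
    reads_last_90_days: int
    columns: Dict[str, List[Optional[str]]] = field(default_factory=dict)


# Assumed conventions and thresholds, not taken from the patent.
NAME_MARKERS = ("_log", "_bak", "_backup", "_tmp", "_temp")
ZOMBIE_DAYS = 365      # no writes for longer than this: zombie table
LOW_HEAT_READS = 1     # fewer reads than this: low-heat table


def invalid_table_reason(table: TableInfo) -> Optional[str]:
    """Return why a table is invalid, or None if it should be kept."""
    if table.row_count == 0:
        return "empty"
    if len(table.columns) <= 1:
        return "single-field"
    if any(marker in table.name.lower() for marker in NAME_MARKERS):
        return "log/backup/temporary"
    if table.days_since_last_write > ZOMBIE_DAYS:
        return "zombie"
    if table.reads_last_90_days < LOW_HEAT_READS:
        return "low-heat"
    return None


def invalid_fields(table: TableInfo) -> List[str]:
    """List null fields (no values) and single-value fields (one distinct value)."""
    bad = []
    for name, values in table.columns.items():
        distinct = {v for v in values if v not in (None, "")}
        if len(distinct) <= 1:
            bad.append(name)
    return bad
```

Tables with a non-`None` reason and the fields returned by `invalid_fields` would be dropped before S02.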

S02: Compare the Chinese description of the data with its field type, sample the data for which the two do not match, calculate the proportion of each field type in the sample, and revise the field type according to the proportion result.

The process of revising the field type includes: performing word segmentation and semantic recognition on the Chinese description of the data using an NLP (natural language processing) module; after this analysis, performing path recognition of synonyms or near-synonyms through a type decision tree; where the semantics of the Chinese description do not match the field type, marking the field as having a suspected field type to be revised; then sampling, multiple times, the data content whose Chinese descriptions have the same or similar semantics, counting the proportions of the different field types in the sampled data, and taking the type whose proportion exceeds a threshold as the recommended revised field type; and finally revising the field type to the recommended type, that is, the field type to which the actually stored data belong. Natural language processing can segment the Chinese description and recognize its semantics, and the decision tree can recognize paths with similar meaning, which helps judge whether the field has a suspected field type to be revised. The result is finally determined by using the proportion as the criterion against a set threshold. The revision process complements the preliminary identification and further improves identification accuracy.
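The sampling and proportion step can be sketched as follows. The per-value classifier `value_type`, the accepted date layouts, and the default sample size, number of rounds and threshold are assumptions chosen for the example, not values given in the embodiment.

```python
import random
import re
from collections import Counter
from datetime import datetime
from typing import List, Optional

# Assumed date layouts: YYYY-MM-DD, YYYY/MM/DD and YYYYMMDD.
DATE_RE = re.compile(r"(\d{4})([-/]?)(\d{2})\2(\d{2})")
NUMERIC_RE = re.compile(r"[+-]?\d+(\.\d+)?")


def value_type(value: str) -> str:
    """Classify one stored value as 'date', 'numeric' or 'text'."""
    match = DATE_RE.fullmatch(value)
    if match:
        try:
            # Reject digit strings that are not real calendar dates.
            datetime(int(match.group(1)), int(match.group(3)), int(match.group(4)))
            return "date"
        except ValueError:
            pass
    if NUMERIC_RE.fullmatch(value):
        return "numeric"
    return "text"


def recommend_type(values: List[str], sample_size: int = 100, rounds: int = 5,
                   threshold: float = 0.8, seed: int = 0) -> Optional[str]:
    """Sample several times and return the type whose proportion exceeds the threshold."""
    rng = random.Random(seed)
    counts: Counter = Counter()
    for _ in range(rounds):
        sample = rng.sample(values, min(sample_size, len(values)))
        counts.update(value_type(v) for v in sample)
    total = sum(counts.values())
    if total == 0:
        return None
    best, count = counts.most_common(1)[0]
    return best if count / total > threshold else None
```

When no type exceeds the threshold the function returns `None`, in which case the preliminary type would simply be kept.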

S03: Extract features according to the field types.

The process of extracting features according to the field types includes: for numerical type fields, extracting features and feature values by means of the mean, maximum, minimum, median, variance, interquartile range, value clustering and length clustering; for text type fields, extracting statistical attribute features from length clustering and structural distribution, and extracting content features through word segmentation and semantic recognition of the data content; and for date type fields, performing structural analysis and extracting features of the date format and length.
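The statistical part of S03 can be sketched with the Python standard library. The example assumes population variance and the default quartile method of `statistics.quantiles`, and it approximates length clustering and date structure analysis by simple frequency counts; the embodiment does not fix any of these choices.

```python
import re
import statistics
from collections import Counter
from typing import Dict, List


def numeric_features(values: List[float]) -> Dict[str, float]:
    """Summary statistics for a numerical field (needs at least two values)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.mean(values),
        "max": max(values),
        "min": min(values),
        "median": statistics.median(values),
        "variance": statistics.pvariance(values),  # population variance (assumption)
        "iqr": q3 - q1,
    }


def length_clusters(values: List[str]) -> Dict[int, int]:
    """Count how many values share each string length."""
    return dict(Counter(len(v) for v in values))


def date_features(values: List[str]) -> Dict[str, Dict]:
    """Describe date structure: digits become 'd', separators are kept."""
    formats = Counter(re.sub(r"\d", "d", v) for v in values)
    return {"formats": dict(formats), "lengths": length_clusters(values)}
```

For instance, `date_features(["2021-06-17", "2021/06/17"])` reports two distinct formats of the same length, which is the kind of inconsistency a later quality check could target.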

In addition, more specifically, the data features applicable to a field type and their extraction methods can be looked up in a data feature library, and all applicable feature extraction methods for the field type are traversed according to the network of dependency and mutual-exclusion relations among the corresponding data features. For example, after a data field is determined to be numerical, the feature extraction algorithm loads feature extraction methods such as length, integer, positive number, negative number and decimal, as well as business feature extraction methods such as mobile phone number and postal code. By continuously identifying and extracting from the data content, features such as concentrated length, integer and mobile phone number can be obtained, while the two mutually exclusive features, positive and negative, are distinguished, yielding multi-angle features and feature values for the field.
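The traversal over dependent and mutually exclusive features can be illustrated with a small sketch. The `Feature` structure, the specific `requires` and `excludes` relations, and the patterns for mobile phone numbers (11 digits starting with 1) and postal codes (6 digits) are assumptions for the example; the embodiment names the features but not their rules.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Set, Tuple


@dataclass(frozen=True)
class Feature:
    name: str
    test: Callable[[List[str]], bool]
    requires: Tuple[str, ...] = ()   # features that must already hold
    excludes: Tuple[str, ...] = ()   # features that rule this one out


def all_match(pattern: str) -> Callable[[List[str]], bool]:
    """Build a test that passes when every value fully matches the pattern."""
    rx = re.compile(pattern)
    return lambda values: all(rx.fullmatch(v) for v in values)


# Hypothetical feature library for numerical fields, listed in dependency order.
FEATURES = [
    Feature("integer", all_match(r"[+-]?\d+")),
    Feature("decimal", all_match(r"[+-]?\d+\.\d+"), excludes=("integer",)),
    Feature("positive", lambda vs: all(float(v) > 0 for v in vs)),
    Feature("negative", lambda vs: all(float(v) < 0 for v in vs), excludes=("positive",)),
    Feature("fixed_length", lambda vs: len({len(v) for v in vs}) == 1),
    Feature("mobile_number", all_match(r"1\d{10}"),
            requires=("integer", "positive", "fixed_length")),
    Feature("postal_code", all_match(r"\d{6}"), requires=("integer", "fixed_length")),
]


def extract(values: List[str]) -> Set[str]:
    """Return the features that hold for a non-empty numerical field."""
    found: Set[str] = set()
    for feature in FEATURES:
        if any(req not in found for req in feature.requires):
            continue  # a prerequisite feature is missing
        if any(exc in found for exc in feature.excludes):
            continue  # a mutually exclusive feature already holds
        if feature.test(values):
            found.add(feature.name)
    return found
```

Skipping a feature whose prerequisite failed or whose exclusive counterpart already holds avoids running tests that cannot succeed, which matters when many fields are processed.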

Example 2:

This embodiment is generally identical to the previous embodiment, except that after the field type revision is finished and before the features are extracted, a verification step is further included: the date data are converted into text data and copied into a verification group and an interference group; in the verification group, year, month and day descriptions are inserted according to the original date format; in the interference group, counting-unit descriptions are added according to the number of digits of the original date data; the verification group and the interference group are each inserted into their adjacent text data; semantic recognition is performed on the spliced text data by the NLP (natural language processing) module, and the recognition speed of each pair of interference group and verification group is recorded; if the verification group is recognized faster than the interference group by more than an amplitude threshold, the verification is passed; otherwise, the corresponding original date data are listed as a suspected error type. Because date type or numerical type data are often associated with the adjacent text type data, when the original identification is correct the text spliced with the verification group is easier to recognize and is therefore recognized faster. If the original identification is wrong, the text spliced with the verification group is itself erroneous, so it has no speed advantage over the interference group and may even be slower, and the data are listed as a suspected error type.
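Two pieces of the verification step lend themselves to a short sketch: building the verification-group text and applying the amplitude threshold to a pair of recognition times. The date layouts accepted, the reading of "exceeds an amplitude threshold" as a relative speed-up, and the default threshold of 0.1 are assumptions. The construction of the interference group and the NLP recognizer itself are not specified in enough detail to sketch, so the timings are taken as inputs.

```python
import re
from typing import Optional

# Assumed date layouts: YYYY-MM-DD, YYYY/MM/DD and YYYYMMDD.
DATE_RE = re.compile(r"(\d{4})[-/]?(\d{2})[-/]?(\d{2})")


def verification_text(date_value: str) -> Optional[str]:
    """Insert year, month and day descriptions into a date string."""
    match = DATE_RE.fullmatch(date_value)
    if match is None:
        return None
    year, month, day = match.groups()
    return f"{year}年{month}月{day}日"


def passes_verification(t_verification: float, t_interference: float,
                        amplitude_threshold: float = 0.1) -> bool:
    """Pass when the verification group is recognized sufficiently faster.

    Times are recognition durations in seconds for one verification and
    interference pair; the relative speed-up is compared with the threshold.
    """
    if t_interference <= 0:
        return False
    speedup = (t_interference - t_verification) / t_interference
    return speedup > amplitude_threshold
```

A pair that fails this check would have its original date data listed as a suspected error type.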

The substantial effects of the above embodiments include: the generality of table processing and detection and the meaning represented by data of different field types are both taken into account; corresponding features can be extracted for each field from the table header information and the data content alone; feature extraction is automated and performed at scale; accurate identification and positioning of detection objects is provided for investigating data quality problems; and a foundation is laid for improving the efficiency of subsequent quality detection work.

From the description of the above embodiments, those skilled in the art will appreciate that, in practical applications, the above functions may be allocated to different functional modules as needed; that is, the internal structure of a specific apparatus may be divided into different functional modules to perform all or part of the functions described above.

In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.

The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a readable storage medium. Based on this understanding, the technical solution of the embodiments of the present application, in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium and including several instructions that cause a device (which may be a single-chip microcomputer, a chip or the like) or a processor to perform all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.

The foregoing are merely specific embodiments of the present application, but the scope of protection of the present application is not limited thereto. Any change or substitution that a person skilled in the art could readily conceive within the technical scope of the present application is intended to be covered by the scope of protection of the present application. Therefore, the scope of protection of the present application shall be subject to the scope of protection of the claims.

Claims

1. The large-scale data characteristic intelligent extraction method based on the data content is characterized by comprising the following steps of:

extracting features according to the field types;

the process of revising the field type includes: performing word segmentation and semantic recognition on the Chinese description of the data using an NLP (natural language processing) module; after this analysis, performing path recognition of synonyms or near-synonyms through a type decision tree; where the semantics of the Chinese description do not match the field type, marking the field as having a suspected field type to be revised; then sampling, multiple times, the data content whose Chinese descriptions have the same or similar semantics, counting the proportions of the different field types in the sampled data, and taking the type whose proportion exceeds a threshold as the recommended revised field type; and finally revising the field type to the recommended type, that is, the field type to which the actually stored data belong;

after the field type revision is finished, the method further comprises a verification step: the date data are converted into text data and copied into a verification group and an interference group; in the verification group, year, month and day descriptions are inserted according to the original date format; in the interference group, counting-unit descriptions are added according to the number of digits of the original date data; the verification group and the interference group are each inserted into their adjacent text data; semantic recognition is performed on the spliced text data by the NLP (natural language processing) module, and the recognition speed of each pair of interference group and verification group is recorded; if the verification group is recognized faster than the interference group by more than an amplitude threshold, the verification is passed; otherwise, the corresponding original date data are listed as a suspected error type.

2. The method for intelligently extracting large-scale data features based on data content according to claim 1, wherein the preliminary identification process comprises: performing preliminary identification on the data to be identified according to an existing field type database, or introducing an identification model trained with a neural network to perform the preliminary identification, so as to obtain a preliminary identification result for the field type.

3. The method for intelligently extracting large-scale data features based on data content according to claim 1, wherein the process of eliminating invalid data comprises: defining invalid tables and invalid fields; based on the metadata information and data content of each table, uniformly judging empty tables, zombie tables, log tables, backup tables, temporary tables, single-field tables and low-heat tables to be invalid tables; uniformly judging null fields and single-value fields to be invalid fields; and identifying and eliminating the invalid tables and fields.

4. The method for intelligent extraction of large-scale data features based on data content according to claim 1, wherein the field type includes at least one of a numeric type, a text type, and a date type.

5. The method for intelligently extracting large-scale data features based on data content according to claim 4, wherein the process of extracting features according to the field types comprises: for numerical type fields, extracting features and feature values by means of the mean, maximum, minimum, median, variance, interquartile range, value clustering and length clustering; for text type fields, extracting statistical attribute features from length clustering and structural distribution, and extracting content features through word segmentation and semantic recognition of the data content; and for date type fields, performing structural analysis and extracting features of the date format and length.