CN113569005A

CN113569005A - Large-scale data feature intelligent extraction method based on data content

Info

Publication number: CN113569005A
Application number: CN202110670587.7A
Authority: CN
Inventors: 葛俊; 梁云丹; 黄建平; 张旭东; 张建松; 陈浩
Original assignee: State Grid Corp of China SGCC; State Grid Zhejiang Electric Power Co Ltd
Current assignee: State Grid Corp of China SGCC; State Grid Zhejiang Electric Power Co Ltd
Priority date: 2021-06-17
Filing date: 2021-06-17
Publication date: 2021-10-29
Anticipated expiration: 2041-06-17
Also published as: CN113569005B

Abstract

The invention discloses an intelligent method for extracting features from large-scale data based on data content, comprising the following steps: performing preliminary identification of the field types of the data, and eliminating invalid data; comparing the Chinese description of the data with its field type, sampling the data that do not match, calculating the proportion of each field type in the sample, and revising the field type according to the proportions; and extracting features according to field type. The substantial effects of the invention include: the generality of table processing and detection, and the relationships between the meanings represented by data of different field types, are taken into account; corresponding features can be extracted for each field from the header information and data content alone; feature extraction is automated and can be performed at scale; detection objects are accurately identified and located for data quality troubleshooting; and a foundation is laid for improving the efficiency of subsequent quality detection work.

Description

Large-scale data feature intelligent extraction method based on data content

Technical Field

The invention relates to the technical field of data feature extraction, in particular to a large-scale data feature intelligent extraction method based on data content.

Background

With the development of the digital economy, industries no longer pursue data volume alone, and the requirements on data quality in data applications keep rising. Faced with massive data resources, finding and locating data quality problems more quickly, accurately and intelligently, and carrying out the corresponding remediation work, is the key to current enterprise-level data asset management.

In the prior art, publication No. CN105554152A discloses a method and apparatus for extracting data features. In more detail, publication No. CN108256074A discloses a verification processing method, including: obtaining the models of a data warehouse to be verified, each model including a plurality of pieces of field information, the field information including field definitions and field types; checking the field information against a pre-stored data dictionary, the data dictionary comprising a plurality of standard expressions, each of which comprises a standard definition and a standard type; and, if a field definition matches the standard definition but the field type does not match the standard type, modifying the field type to be consistent with the standard type. That method verifies the models of the data warehouse against the standard expressions and, when a field definition matches the standard definition but the field type does not match the standard type, modifies the field type accordingly, so that a model consistent with the standard is obtained.

The prior art addresses these problems in different ways. In the traditional data quality management model, business experts must select the problem detection objects by designating specific data tables and fields according to business specifications and experience, and must also specify the features of each field and the rules that apply to it. Both the method and its results place very high demands on the experience and professional skill of the business experts, and the range of objects that can be checked for data quality problems is relatively limited. For large-scale data, business experts must designate the detection objects and ranges one by one, the resulting data features generalize poorly, and maintenance is time-consuming and labor-intensive. Large-scale, automated definition of data quality detection objects and extraction of the corresponding data features therefore cannot be achieved, and data quality auditing is inefficient and heavily influenced by manual experience.

Disclosure of Invention

In view of the problems in the prior art, the invention aims to provide a large-scale data feature intelligent extraction method based on data content. The method takes the generality of table processing and detection into account and requires no additional knowledge about the tables: corresponding features can be extracted for each field from the header information and data content alone. Feature extraction is thereby automated and can be performed at scale, the data objects and feature conditions for data quality detection need not be designated one by one, and dependence on the knowledge and experience of business personnel is reduced. The method provides accurate identification and location of detection objects for data quality troubleshooting and lays a foundation for improving the efficiency of subsequent quality detection.

The technical scheme of the invention is as follows.

A large-scale data feature intelligent extraction method based on data content comprises the following steps:

performing preliminary identification of the field types of the data, and eliminating invalid data;

comparing the Chinese description of the data with its field type, sampling the data that do not match, calculating the proportion of each field type in the sample, and revising the field type according to the proportions;

extracting features according to field type.

During extraction, the characteristics of the different field types are considered together with the Chinese description of the corresponding data, and the field type is determined in two steps, preliminary identification followed by revision, which improves identification accuracy.

Preferably, the process of preliminary identification includes: performing preliminary identification of the data to be identified against an existing field type database, or using a recognition model trained with a neural network, to obtain a preliminary identification result for the field type. Different field types have their own characteristics, and the prior art in this field generally uses databases, trained models and the like for comparison and identification. Here these techniques are used only for preliminary identification, which reduces implementation cost while guaranteeing a baseline level of accuracy.

Preferably, the process of eliminating invalid data includes: defining invalid tables and invalid fields; judging from the metadata information and data content of each table, and uniformly classifying empty tables, zombie tables, log tables, backup tables, temporary tables, single-character field tables and low-heat tables as invalid tables; uniformly classifying null fields and single-value fields as invalid fields; and identifying and eliminating the invalid tables and fields. The invalid tables and invalid fields cover the common kinds of invalid data, and removing them reduces the processing load of subsequent data extraction and analysis.

Preferably, the process of revising the field type includes: performing word segmentation and semantic recognition on the Chinese description of the data with an NLP (natural language processing) module; after this analysis, performing path recognition on similar words or similar characters through a type decision tree, and marking the field as a suspected field type to be revised when the semantics of the Chinese description do not match the field type; then sampling, multiple times, the data content of fields whose Chinese descriptions have the same or similar semantics, counting the proportion of each field type in the sampled data, and taking the type whose proportion exceeds a threshold as the recommended revised field type; and finally revising the field type to the recommended type, as the type to which the actually stored data belongs. Natural language processing segments the Chinese description and recognizes its semantics; the decision tree performs path recognition over similar meanings to help judge whether a field is a suspected field type to be revised; and the final result is determined by using the proportion as the criterion against a set threshold. The revision process supplements the preliminary identification and further improves identification accuracy.
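The comparison between a Chinese description and a field type can be illustrated with a minimal sketch. It replaces the NLP module and the type decision tree with a plain keyword lookup, which is a deliberate simplification; the keyword table and the function names expected_type and is_suspected_mismatch are hypothetical and do not come from the source.

```python
# Hypothetical keyword table; a simple stand-in for word segmentation
# plus a type decision tree, which the source does not specify in detail.
TYPE_KEYWORDS = {
    "date": ("日期", "时间"),
    "numeric": ("金额", "数量"),
    "text": ("名称", "地址"),
}


def expected_type(description):
    """Return the field type suggested by the description, or None."""
    for field_type, keywords in TYPE_KEYWORDS.items():
        if any(keyword in description for keyword in keywords):
            return field_type
    return None


def is_suspected_mismatch(description, field_type):
    """Flag a field whose description suggests a different type."""
    suggested = expected_type(description)
    return suggested is not None and suggested != field_type
```

For instance, a field described as 抄表日期 (meter reading date) but initially identified as text would be flagged, while a description with no known keyword is left alone.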

Preferably, the field type includes at least one of a numeric type, a text type, and a date type.

Preferably, extracting features according to field type includes: for numerical fields, extracting features and feature values using the mean, maximum, minimum, median, variance, interquartile range, numerical clustering and length clustering; for text fields, computing attribute features from length clustering and structure distribution, and extracting content features through word segmentation and semantic recognition of the data content; and for date fields, performing structure analysis and extracting features of the date format and date length.

Preferably, after the field type has been revised, the method further comprises a verification step: converting the date data into text data and copying it into a verification group and an interference group; inserting year, month and day descriptors into the verification group according to the original date format; adding counting-unit descriptors to the interference group according to the number of digits of the original date data; inserting the verification group and the interference group each into the text data adjacent to it; performing semantic recognition on the spliced text data with the NLP (natural language processing) module; and recording the recognition speed for each pair of interference group and verification group. If the recognition speed of the verification group is higher than that of the interference group by more than an amplitude threshold, the verification passes; otherwise the corresponding original date data is marked as a suspected error type. Whether a value is date-type or numerical-type data, it is usually related to the adjacent text-type data. When the original identification is correct, the text spliced from the verification group is easier to recognize, so its recognition speed is high. When the original identification is wrong, the text spliced from the verification group is itself wrong, so it has no speed advantage over the interference group and may even be slower; the data is therefore classified as a suspected error type.

The substantial effects of the invention include: the generality of table processing and detection, and the relationships between the meanings represented by data of different field types, are taken into account; corresponding features can be extracted for each field from the header information and data content alone; feature extraction is automated and can be performed at scale; detection objects are accurately identified and located for data quality troubleshooting; and a foundation is laid for improving the efficiency of subsequent quality detection work.

Detailed Description

The technical solution of the present application will be described with reference to the following examples. In addition, numerous specific details are set forth below in order to provide a better understanding of the present invention. It will be understood by those skilled in the art that the present invention may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present invention.

Example 1:

A large-scale data feature intelligent extraction method based on data content, directed mainly at the numerical, text and date field types.

The method comprises the following steps:

S01: performing preliminary identification of the field types of the data, and eliminating invalid data.

The process of preliminary identification comprises: performing preliminary identification of the data to be identified against an existing field type database, or using a recognition model trained with a neural network, to obtain a preliminary identification result for the field type. Different field types have their own characteristics, and the prior art in this field generally uses databases, trained models and the like for comparison and identification. In this embodiment these techniques are used only for preliminary identification, which reduces implementation cost while guaranteeing a baseline level of accuracy.
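A minimal sketch of preliminary identification follows. It uses hand-written rules in place of the field type database or neural-network model mentioned above, so it only illustrates the shape of the step. The list of date layouts, the regular expression and the function names are assumptions, not part of the source.

```python
import re
from collections import Counter
from datetime import datetime

# Hypothetical date layouts; the source does not list the formats it supports.
DATE_FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%Y%m%d")
NUMERIC_RE = re.compile(r"[+-]?\d+(\.\d+)?")


def classify_value(value):
    """Classify one raw string value as 'date', 'numeric' or 'text'."""
    text = value.strip()
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(text, fmt)
            return "date"
        except ValueError:
            continue
    if NUMERIC_RE.fullmatch(text):
        return "numeric"
    return "text"


def identify_field_type(values):
    """Preliminary field type: the most common type among non-empty values."""
    counts = Counter(classify_value(v) for v in values if v and v.strip())
    if not counts:
        return "unknown"
    return counts.most_common(1)[0][0]
```

Because dates are tested before numbers, an eight-digit value that parses as a calendar date is treated as a date; resolving such ambiguous cases is what the later revision step is for.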

The process of eliminating invalid data comprises: defining invalid tables and invalid fields; judging from the metadata information and data content of each table, and uniformly classifying empty tables, zombie tables, log tables, backup tables, temporary tables, single-character field tables and low-heat tables as invalid tables; uniformly classifying null fields and single-value fields as invalid fields; and identifying and eliminating the invalid tables and fields. The invalid tables and invalid fields cover the common kinds of invalid data, and removing them reduces the processing load of subsequent data extraction and analysis.
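The elimination rules can be sketched as simple predicates over table metadata and column contents. Everything concrete here is an assumption made for illustration: the TableMeta structure, the name markers used to spot log, backup and temporary tables, and the day thresholds used to approximate zombie and low-heat tables. The single-character field table is omitted because the source does not define it.

```python
from dataclasses import dataclass

# Hypothetical naming conventions and thresholds, not taken from the source.
INVALID_NAME_MARKERS = ("_log", "_bak", "_backup", "_tmp", "_temp")
ZOMBIE_DAYS = 365    # assumed: no writes for a year
LOW_HEAT_DAYS = 180  # assumed: no reads for half a year


@dataclass
class TableMeta:
    name: str
    row_count: int
    days_since_last_write: int
    days_since_last_read: int
    columns: dict  # column name -> list of values


def is_invalid_table(table):
    """Apply the empty, log/backup/temporary, zombie and low-heat rules."""
    if table.row_count == 0:
        return True
    if any(marker in table.name.lower() for marker in INVALID_NAME_MARKERS):
        return True
    if table.days_since_last_write > ZOMBIE_DAYS:
        return True
    return table.days_since_last_read > LOW_HEAT_DAYS


def invalid_fields(table):
    """Return columns that are entirely null or hold a single distinct value."""
    result = []
    for column, values in table.columns.items():
        distinct = {v for v in values if v not in (None, "")}
        if len(distinct) <= 1:
            result.append(column)
    return result
```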

S02: comparing the Chinese description of the data with its field type, sampling the data that do not match, calculating the proportion of each field type in the sample, and revising the field type according to the proportions.

The process of revising the field type includes: performing word segmentation and semantic recognition on the Chinese description of the data with an NLP (natural language processing) module; after this analysis, performing path recognition on similar words or similar characters through a type decision tree, and marking the field as a suspected field type to be revised when the semantics of the Chinese description do not match the field type; then sampling, multiple times, the data content of fields whose Chinese descriptions have the same or similar semantics, counting the proportion of each field type in the sampled data, and taking the type whose proportion exceeds a threshold as the recommended revised field type; and finally revising the field type to the recommended type, as the type to which the actually stored data belongs. Natural language processing segments the Chinese description and recognizes its semantics; the decision tree performs path recognition over similar meanings to help judge whether a field is a suspected field type to be revised; and the final result is determined by using the proportion as the criterion against a set threshold. The revision process supplements the preliminary identification and further improves identification accuracy.
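The sampling and proportion test described above can be sketched as follows. The number of sampling rounds, the sample size, the 0.8 threshold and the small regex-based classify helper are all assumptions chosen for illustration; the source does not give concrete values or a classifier.

```python
import random
import re
from collections import Counter


def classify(value):
    """Tiny illustrative classifier (assumed, not the source's method)."""
    if re.fullmatch(r"\d{4}-\d{2}-\d{2}", value):
        return "date"
    if re.fullmatch(r"[+-]?\d+(\.\d+)?", value):
        return "numeric"
    return "text"


def recommend_type(values, rounds=5, sample_size=100, threshold=0.8, seed=0):
    """Sample several times and return the type whose share exceeds threshold.

    Returns None when no type is dominant enough to recommend a revision.
    """
    rng = random.Random(seed)
    counts = Counter()
    total = 0
    for _ in range(rounds):
        sample = rng.sample(values, min(sample_size, len(values)))
        counts.update(classify(v) for v in sample)
        total += len(sample)
    if total == 0:
        return None
    field_type, hits = counts.most_common(1)[0]
    return field_type if hits / total > threshold else None
```

A field flagged as suspected would then be revised to the returned type, and left unchanged when the function returns None.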

S03: extracting features according to field type.

The process of extracting features according to field type comprises: for numerical fields, extracting features and feature values using the mean, maximum, minimum, median, variance, interquartile range, numerical clustering and length clustering; for text fields, computing attribute features from length clustering and structure distribution, and extracting content features through word segmentation and semantic recognition of the data content; and for date fields, performing structure analysis and extracting features of the date format and date length.
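A minimal sketch of the per-type statistics, using only the Python standard library. It covers the summary statistics and uses plain length and structure counts as a simple stand-in for the clustering steps, which are not reproduced here. The function names, the character classes D/A/C, and the choice of population variance are assumptions.

```python
import statistics
from collections import Counter


def numeric_features(values):
    """Summary statistics for a numerical field (needs at least two values)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.mean(values),
        "max": max(values),
        "min": min(values),
        "median": statistics.median(values),
        "variance": statistics.pvariance(values),  # population variance assumed
        "iqr": q3 - q1,
    }


def structure(value):
    """Map characters to classes: D digit, A ASCII letter, C other letter."""
    classes = []
    for ch in value:
        if ch.isdigit():
            classes.append("D")
        elif ch.isascii() and ch.isalpha():
            classes.append("A")
        elif ch.isalpha():
            classes.append("C")
        else:
            classes.append(ch)  # keep separators such as '-' or '/'
    return "".join(classes)


def text_features(values):
    """Length and structure distributions for a text field."""
    return {
        "length_counts": Counter(len(v) for v in values),
        "structure_counts": Counter(structure(v) for v in values),
    }


def date_features(values):
    """Date format (as a structure pattern) and date length distributions."""
    return {
        "format_counts": Counter(structure(v) for v in values),
        "length_counts": Counter(len(v) for v in values),
    }
```

With this encoding a value such as 2021-06-17 yields the pattern DDDD-DD-DD, so a column mixing several date layouts shows up as several distinct patterns.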

More specifically, the data features and feature extraction methods applicable to a field type can be looked up in a data feature library, and all feature extraction methods applicable to that field type are traversed according to the network of dependency and mutual-exclusion relationships among the corresponding data features. For example, once a data field has been determined to be of numerical type, the feature extraction algorithm loads the methods for extracting attribute features such as length, integer, positive number, negative number and decimal, together with the methods for extracting business features such as mobile phone number and zip code.
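One way to picture the traversal of the feature library is an ordered list of entries, each naming the features it depends on and the features it excludes. The sketch below is illustrative only: the library contents, the tuple layout and the simplified predicates (11 digits starting with 1 for a mobile number, 6 digits for a zip code) are assumptions rather than the source's definitions.

```python
# Each entry: (feature name, field type, requires, excludes, predicate).
# Hypothetical library; entries are ordered so dependencies come first.
FEATURE_LIBRARY = [
    ("integer", "numeric", (), (), lambda v: v.lstrip("+-").isdigit()),
    ("decimal", "numeric", (), ("integer",), lambda v: "." in v),
    ("mobile_number", "numeric", ("integer",), (),
     lambda v: len(v) == 11 and v.startswith("1")),
    ("zip_code", "numeric", ("integer",), ("mobile_number",),
     lambda v: len(v) == 6),
]


def applicable_features(field_type, values):
    """Traverse the library, honouring dependencies and mutual exclusions."""
    found = []
    for name, entry_type, requires, excludes, predicate in FEATURE_LIBRARY:
        if entry_type != field_type:
            continue
        if any(dep not in found for dep in requires):
            continue  # a prerequisite feature was not established
        if any(other in found for other in excludes):
            continue  # a mutually exclusive feature already holds
        if values and all(predicate(v) for v in values):
            found.append(name)
    return found
```

Here a feature is attached to a field only when every value satisfies its predicate; a real system might instead accept a feature above some coverage ratio.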

Example 2:

This embodiment is largely the same as the previous embodiment, except that after the field type has been revised and before the features are extracted, the method further includes a verification step: converting the date data into text data and copying it into a verification group and an interference group; inserting year, month and day descriptors into the verification group according to the original date format; adding counting-unit descriptors to the interference group according to the number of digits of the original date data; inserting the verification group and the interference group each into the text data adjacent to it; performing semantic recognition on the spliced text data with the NLP (natural language processing) module; and recording the recognition speed for each pair of interference group and verification group. If the recognition speed of the verification group is higher than that of the interference group by more than an amplitude threshold, the verification passes; otherwise the corresponding original date data is marked as a suspected error type. Whether a value is date-type or numerical-type data, it is usually related to the adjacent text-type data. When the original identification is correct, the text spliced from the verification group is easier to recognize, so its recognition speed is high. When the original identification is wrong, the text spliced from the verification group is itself wrong, so it has no speed advantage over the interference group and may even be slower; the data is therefore classified as a suspected error type.
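The verification step can be sketched under several stated assumptions. The date is taken to be an eight-digit YYYYMMDD string; the interference text is built by inserting the counting unit 万 before the last four digits, which is only one possible reading of "counting-unit descriptors"; the spliced text is formed by appending to the neighbouring text; and recognition time is supplied by a caller-provided function recognize_seconds, a hypothetical hook standing in for timing the NLP module. The 0.2 gain threshold is likewise illustrative.

```python
def build_groups(date_digits, neighbor_text):
    """Return (verification_text, interference_text) for a YYYYMMDD string."""
    if len(date_digits) != 8 or not date_digits.isdigit():
        raise ValueError("expected an eight-digit YYYYMMDD string")
    year, month, day = date_digits[:4], date_digits[4:6], date_digits[6:]
    # Verification group: year, month and day descriptors inserted.
    verification = f"{neighbor_text}{year}年{month}月{day}日"
    # Interference group: a counting unit inserted by digit position (assumed).
    interference = f"{neighbor_text}{date_digits[:4]}万{date_digits[4:]}"
    return verification, interference


def passes_verification(date_digits, neighbor_text, recognize_seconds,
                        gain_threshold=0.2):
    """Pass if the verification text is recognized sufficiently faster."""
    verification, interference = build_groups(date_digits, neighbor_text)
    speed_v = 1.0 / recognize_seconds(verification)
    speed_i = 1.0 / recognize_seconds(interference)
    return (speed_v - speed_i) / speed_i > gain_threshold
```

Values that fail the check would be marked as a suspected error type rather than being passed on to feature extraction as dates.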

The substantial effects of the above embodiments include: the generality of table processing and detection, and the relationships between the meanings represented by data of different field types, are taken into account; corresponding features can be extracted for each field from the header information and data content alone; feature extraction is automated and can be performed at scale; detection objects are accurately identified and located for data quality troubleshooting; and a foundation is laid for improving the efficiency of subsequent quality detection work.

Through the description of the above embodiments, those skilled in the art can understand that in practical applications, the above function distribution may be performed by different functional modules as needed, that is, the internal structure of a specific device is divided into different functional modules, so as to perform all or part of the above described functions.

In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

If the integrated unit is implemented as a software functional unit and sold or used as a stand-alone product, it may be stored in a readable storage medium. Based on this understanding, the technical solutions of the embodiments of the present application, in essence, or the part that contributes to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions that cause a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to execute all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

The above describes only specific embodiments of the present application, but the scope of protection of the present application is not limited thereto. Any change or substitution that a person skilled in the art could readily conceive of within the technical scope disclosed in the present application shall be covered by the scope of protection of the present application. Therefore, the scope of protection of the present application shall be subject to the scope of protection of the claims.

Claims

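1. A large-scale data feature intelligent extraction method based on data content, characterized by comprising the following steps:

performing preliminary identification of the field types of the data, and eliminating invalid data;

comparing the Chinese description of the data with its field type, sampling the data that do not match, calculating the proportion of each field type in the sample, and revising the field type according to the proportions;

extracting features according to field type.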

2. The method for intelligently extracting large-scale data features based on data content according to claim 1, wherein the process of preliminary identification comprises: performing preliminary identification of the data to be identified against an existing field type database, or using a recognition model trained with a neural network, to obtain a preliminary identification result for the field type.

3. The method for intelligently extracting large-scale data features based on data content according to claim 1, wherein the process of eliminating invalid data comprises: defining invalid tables and invalid fields; judging from the metadata information and data content of each table, and uniformly classifying empty tables, zombie tables, log tables, backup tables, temporary tables, single-character field tables and low-heat tables as invalid tables; uniformly classifying null fields and single-value fields as invalid fields; and identifying and eliminating the invalid tables and fields.

4. The method for intelligently extracting large-scale data features based on data content according to claim 1 or 2, wherein the process of revising the field type comprises: performing word segmentation and semantic recognition on the Chinese description of the data with an NLP (natural language processing) module; after this analysis, performing path recognition on similar words or similar characters through a type decision tree, and marking the field as a suspected field type to be revised when the semantics of the Chinese description do not match the field type; then sampling, multiple times, the data content of fields whose Chinese descriptions have the same or similar semantics, counting the proportion of each field type in the sampled data, and taking the type whose proportion exceeds a threshold as the recommended revised field type; and finally revising the field type to the recommended type, as the type to which the actually stored data belongs.

5. The method according to claim 1, wherein the field type includes at least one of a numerical type, a text type and a date type.

6. The method for intelligently extracting large-scale data features based on data content according to claim 5, wherein the process of extracting features according to field type comprises: for numerical fields, extracting features and feature values using the mean, maximum, minimum, median, variance, interquartile range, numerical clustering and length clustering; for text fields, computing attribute features from length clustering and structure distribution, and extracting content features through word segmentation and semantic recognition of the data content; and for date fields, performing structure analysis and extracting features of the date format and date length.

7. The method for intelligently extracting large-scale data features based on data content according to claim 4, wherein after the field type has been revised, the method further comprises a verification step of: converting the date data into text data and copying it into a verification group and an interference group; inserting year, month and day descriptors into the verification group according to the original date format; adding counting-unit descriptors to the interference group according to the number of digits of the original date data; inserting the verification group and the interference group each into the text data adjacent to it; performing semantic recognition on the spliced text data with the NLP (natural language processing) module; and recording the recognition speed for each pair of interference group and verification group, wherein if the recognition speed of the verification group is higher than that of the interference group by more than an amplitude threshold, the verification passes, and otherwise the corresponding original date data is marked as a suspected error type.