CN117708102A

CN117708102A - Intelligent matching and checking method for data standard

Info

Publication number: CN117708102A
Application number: CN202311451931.9A
Authority: CN
Inventors: 逯明; 胡运燕; 叶国强
Original assignee: Zhuhai Huafa Group Technology Research Institute Co ltd
Current assignee: Zhuhai Huafa Group Technology Research Institute Co ltd
Priority date: 2023-11-03
Filing date: 2023-11-03
Publication date: 2024-03-15

Abstract

The invention discloses a method for intelligently matching and checking data standards, comprising the following steps: establishing a data standard set, in which the data standard is divided into a field naming standard and a field coding standard; extracting system metadata and field data, in which a data exploration tool extracts the metadata in a system database and the sample data corresponding to each field and stores both in an algorithm training library; matching the fields to be standardized through a machine learning algorithm; manually supplementing the unmatched fields; and finding data standard problems and driving their rectification. By establishing data standards and checking them against the data in each system, the method unifies the data standards across the databases of multi-source heterogeneous systems and provides a solid foundation for data processing, analysis and data application. The supervised machine learning algorithm replaces the heavy workload of identifying data standards manually, and working efficiency can be improved by 30-50%, depending on model training accuracy and the characteristics of the data standards.

Description

Intelligent matching and checking method for data standard

Technical Field

The invention relates to the technical field of data processing, in particular to a method for intelligently matching and checking data standards.

Background

In enterprise digitalization, data standards provide normative guidance for defining enterprise data, so that the data can effectively serve business decisions. The more complete the data standards, the lower the cost of data governance.

However, enterprises generally run heterogeneous, multi-source information systems. Although this does not affect the conduct of business, it creates a huge workload when data standards are put into practice.

Disclosure of Invention

The invention aims to provide a method for intelligently matching and checking data standards, which aims to solve the problems in the background technology.

In order to achieve the above purpose, the present invention provides the following technical solutions: a method for intelligently matching and checking data standards comprises the following steps:

step S1: establishing a data standard set:

the data standard is divided into a field naming standard and a field coding standard;

step S2: extracting system metadata and field data:

extracting metadata in a system database and sample data corresponding to fields by a data exploration tool, and storing the metadata and the sample data in an algorithm training library;

step S3: matching fields to be standardized through a machine learning algorithm:

matching the specified data standard with the extracted metadata and sample data by establishing a supervised machine learning algorithm;

step S4: manually supplementing the unmatched fields:

testing shows that, after model training, the recognition rate for associations between data tables and the data standard is 30%-50%; fields whose association is not recognized but which still require data standardization are supplemented manually;

step S5: finding data standard problems and driving rectification:

after the association result of the finally identified data standard and the data table is output, the following data standard problems are output:

1) Table naming is not standard;

2) The naming of the table field is not standard;

3) The data encoding in the table field is not in accordance with the data standard;

and distributing the data standard problems to each system data manager to carry out unified correction of the data standard.
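The three problem types of step S5 can be illustrated with a minimal sketch. It assumes the association result is available as a mapping from each actual field name to the standard field name it was matched with; the function names, the shape of the inputs, and the table-to-manager mapping are all hypothetical and are not specified by the source.

```python
from collections import defaultdict

def find_standard_issues(table, standard_table_names, field_matches, code_sets, samples):
    """Return (table, field, problem) tuples for the three problem types of step S5.

    field_matches maps actual field name -> matched standard field name (assumed input).
    code_sets maps standard field name -> allowed codes (assumed input).
    """
    issues = []
    # 1) Table naming is not standard.
    if table not in standard_table_names:
        issues.append((table, None, "table naming is not standard"))
    for field_name, standard_name in field_matches.items():
        # 2) The naming of the table field is not standard.
        if field_name != standard_name:
            issues.append((table, field_name,
                           f"field naming is not standard (expected {standard_name})"))
        # 3) The data encoding in the field does not follow the data standard.
        allowed = code_sets.get(standard_name)
        if allowed is not None:
            bad = sorted({v for v in samples.get(field_name, []) if v not in allowed})
            if bad:
                issues.append((table, field_name, f"codes not in standard: {bad}"))
    return issues

def assign_to_managers(issues, table_managers):
    """Group issues by the data manager responsible for each table."""
    work = defaultdict(list)
    for issue in issues:
        work[table_managers.get(issue[0], "unassigned")].append(issue)
    return dict(work)

issues = find_standard_issues(
    table="t_day",
    standard_table_names={"Day"},
    field_matches={"dt": "DATE", "CURRENCY": "CURRENCY"},
    code_sets={"CURRENCY": {"CNY", "USD"}},
    samples={"CURRENCY": ["CNY", "RMB"]},
)
work = assign_to_managers(issues, {"t_day": "finance-system-manager"})
```

Each manager then receives only the issues for the tables they own, which is one simple way to drive the unified correction described above.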

Preferably, the field naming standard of step S1 includes: the naming of data tables, for example, the data table for days is named Day; and the naming of data fields, for example, the date field is named DATE and the time field is named TIME;

the field coding standard includes: detailed codes specified for each data category, for example currency codes such as CNY and USD, country codes such as CN and US, enterprise organization codes, and personnel position codes.
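One possible in-memory representation of the data standard set is sketched below. The class name, attribute names, and lookup methods are hypothetical; only the example values (Day, DATE, TIME, CNY, USD, CN, US) come from the text.

```python
from dataclasses import dataclass, field
from typing import Dict, Set

@dataclass
class DataStandardSet:
    """Hypothetical container for the two parts of the data standard."""
    table_names: Set[str] = field(default_factory=set)   # field naming standard: tables
    field_names: Set[str] = field(default_factory=set)   # field naming standard: fields
    code_sets: Dict[str, Set[str]] = field(default_factory=dict)  # field coding standard

    def is_standard_table(self, name: str) -> bool:
        return name in self.table_names

    def is_standard_field(self, name: str) -> bool:
        return name in self.field_names

    def is_valid_code(self, category: str, value: str) -> bool:
        # Unknown categories have no allowed codes.
        return value in self.code_sets.get(category, set())

standards = DataStandardSet(
    table_names={"Day"},
    field_names={"DATE", "TIME"},
    code_sets={"currency": {"CNY", "USD"}, "country": {"CN", "US"}},
)
```

Keeping naming and coding standards in separate attributes mirrors the division made in step S1 and lets later checks query each part independently.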

Preferably, in step S3, a model is built from metadata field feature vectors and data feature vectors, the sample data is labeled and fed back to the model for learning, and the trained model is then used to match the full set of metadata fields and field data, thereby establishing the matching relationship between the data standard and the actual data.
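The source does not name a specific learning algorithm, so the following is only a minimal sketch of the idea using a nearest-neighbour classifier over cosine similarity. The choice of features (character bigrams of the field name, plus coarse statistics of the sample values), the class name `NearestStandardMatcher`, and the threshold value are all assumptions made for illustration.

```python
import math

def featurize(field_name, samples, n=2):
    """Combine a metadata field feature vector and a data feature vector."""
    text = f"^{field_name.lower()}$"
    features = {}
    # Metadata features: character n-gram counts of the field name.
    for i in range(len(text) - n + 1):
        key = "name:" + text[i:i + n]
        features[key] = features.get(key, 0.0) + 1.0
    # Data features: coarse statistics of the sample values.
    values = [str(v) for v in samples]
    chars = "".join(values)
    total = max(len(chars), 1)
    features["data:digit_ratio"] = sum(c.isdigit() for c in chars) / total
    features["data:upper_ratio"] = sum(c.isupper() for c in chars) / total
    features["data:mean_len"] = sum(len(v) for v in values) / max(len(values), 1) / 10.0
    return features

def cosine(a, b):
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class NearestStandardMatcher:
    """Supervised matcher: labeled examples in, predicted standard name out."""

    def __init__(self, threshold=0.4):
        self.threshold = threshold  # below this, the field is left for manual work
        self.examples = []

    def fit(self, labelled):
        # labelled: iterable of (field_name, sample_values, standard_name)
        self.examples = [(featurize(name, samples), std) for name, samples, std in labelled]
        return self

    def predict(self, field_name, samples):
        if not self.examples:
            return None
        features = featurize(field_name, samples)
        score, standard = max((cosine(features, vec), std) for vec, std in self.examples)
        return standard if score >= self.threshold else None

matcher = NearestStandardMatcher().fit([
    ("order_date", ["2023-11-03", "2023-11-04"], "DATE"),
    ("create_dt", ["2024-01-02"], "DATE"),
    ("order_time", ["12:30:00", "08:15:59"], "TIME"),
    ("currency_code", ["CNY", "USD"], "CURRENCY"),
])
```

Fields for which `predict` returns `None` are the ones that would fall through to the manual supplementation of step S4.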

Preferably, big data analysis is performed before step S1: the data element information in the data is summarized to form a standard data element format, the standard data elements are maintained in a data element standard management system, and word segmentation is performed on the data elements by using the search server Elasticsearch.

Preferably, the word segmentation of the data elements using the search server Elasticsearch is implemented in the following steps:

firstly, submitting the data to the Elasticsearch database;

then, using the word segmentation controller to segment the corresponding sentences, and storing the weights and the word segmentation results in the database;

finally, when the user searches for data, ranking and scoring the search results according to the weights, and presenting the returned results to the user.
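The submit, segment-and-weight, and ranked-search steps above can be illustrated without a running search server. The sketch below is a toy in-memory index, not the Elasticsearch API; the class `TinyIndex`, its method names, and the term-frequency weighting are assumptions, and the analysis and relevance scoring of a real search server are more elaborate.

```python
import re
from collections import defaultdict

def segment(text):
    """Very simple word segmentation: lower-cased alphanumeric runs."""
    return re.findall(r"[a-z0-9]+", text.lower())

class TinyIndex:
    def __init__(self):
        # term -> {doc_id: weight}; stores segmentation results together with weights
        self.postings = defaultdict(dict)

    def submit(self, doc_id, text):
        """Steps 1 and 2: submit data, segment it, store terms and weights."""
        terms = segment(text)
        for term in terms:
            previous = self.postings[term].get(doc_id, 0.0)
            # Weight is the normalized term frequency (an assumed weighting scheme).
            self.postings[term][doc_id] = previous + 1.0 / len(terms)

    def search(self, query):
        """Step 3: score documents by summed weights and return them ranked."""
        scores = defaultdict(float)
        for term in segment(query):
            for doc_id, weight in self.postings.get(term, {}).items():
                scores[doc_id] += weight
        # Highest score first; ties are broken by document id for stable output.
        return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

index = TinyIndex()
index.submit("DE001", "order date")
index.submit("DE002", "order currency code")
results = index.search("order date")
```

Here `results` ranks `DE001` ahead of `DE002`, because both query terms occur in the first data element and only one occurs in the second.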

Preferably, the word segmentation controller uses a Chinese word segmentation plug-in to segment the Chinese names of the standard data elements and construct a Chinese dictionary; the word segmentation controller uses an English word segmentation plug-in to segment the English names of the standard data elements and construct an English dictionary.
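The source does not say which segmentation plug-ins are used, so the following sketch swaps in two simple stand-ins: English names are split on underscores and camelCase boundaries, and Chinese names are split into overlapping two-character tokens, a common fallback when no dictionary-based segmenter is available. All function names and the sample data elements are hypothetical.

```python
import re
from collections import Counter

def segment_english(name):
    """Split snake_case / camelCase names into lower-cased words."""
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", name)
    return [token.lower() for token in re.split(r"[^A-Za-z0-9]+", spaced) if token]

def segment_chinese(name):
    """Split each run of Chinese characters into overlapping bigrams."""
    tokens = []
    for run in re.findall(r"[\u4e00-\u9fff]+", name):
        if len(run) == 1:
            tokens.append(run)
        else:
            tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
    return tokens

def build_dictionaries(data_elements):
    """Return (chinese_dictionary, english_dictionary) as token frequency counters."""
    chinese, english = Counter(), Counter()
    for chinese_name, english_name in data_elements:
        chinese.update(segment_chinese(chinese_name))
        english.update(segment_english(english_name))
    return chinese, english

chinese_dict, english_dict = build_dictionaries([
    ("订单日期", "order_date"),
    ("订单币种", "orderCurrency"),
])
```

Tokens shared by several data elements, such as `order` here, end up with higher counts, which a downstream search could use as a weighting signal.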

Preferably, in step S4, a data table to be detected is selected through the data source management system and the fields to be detected are maintained; an automatically scheduled task queries the fields marked in the database table and stores them, then searches for and lists the associated data element information according to the Chinese and English names of the fields to form a detection instance plan; after the plan is completed, manual review and supplementation are performed to further verify the accuracy of the system detection, correct errors, and accumulate improvements to the detection system, and a detection report is submitted after verification is complete.
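The plan-then-review flow of step S4 might look like the sketch below. It assumes that candidate data elements are found by exact Chinese or English name lookup and that reviewer decisions arrive as a mapping keyed by table and field; the function names, record layout, and identifiers such as `DE_DATE` are hypothetical.

```python
def build_detection_plan(fields, data_elements):
    """List candidate data elements for each field by Chinese or English name."""
    plan = []
    for item in fields:
        candidates = [
            element["id"]
            for element in data_elements
            if item["english_name"].lower() == element["english_name"].lower()
            or (item["chinese_name"] and item["chinese_name"] == element["chinese_name"])
        ]
        plan.append({
            "table": item["table"],
            "field": item["english_name"],
            "candidates": candidates,
            # Zero or several candidates cannot be resolved automatically.
            "needs_manual_review": len(candidates) != 1,
        })
    return plan

def apply_manual_review(plan, decisions):
    """Merge reviewer decisions, keyed by (table, field), into a detection report."""
    report = []
    for entry in plan:
        key = (entry["table"], entry["field"])
        if key in decisions:
            matched, source = decisions[key], "manual"
        elif not entry["needs_manual_review"]:
            matched, source = entry["candidates"][0], "automatic"
        else:
            matched, source = None, "unresolved"
        report.append({"table": entry["table"], "field": entry["field"],
                       "data_element": matched, "source": source})
    return report

data_elements = [
    {"id": "DE_DATE", "chinese_name": "日期", "english_name": "DATE"},
    {"id": "DE_CURRENCY", "chinese_name": "币种", "english_name": "CURRENCY"},
]
fields = [
    {"table": "orders", "chinese_name": "日期", "english_name": "order_dt"},
    {"table": "orders", "chinese_name": "", "english_name": "ccy"},
]
plan = build_detection_plan(fields, data_elements)
report = apply_manual_review(plan, {("orders", "ccy"): "DE_CURRENCY"})
```

Recording whether each association came from the automatic lookup or from a reviewer makes it possible to feed the manual corrections back into later detection runs.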

Compared with the prior art, the invention has the beneficial effects that:

1. unifying the data standards in the multi-source heterogeneous system database by establishing the data standards and carrying out matching check on the data standards and the data in each system, and providing a solid foundation for data processing analysis and data application;

2. the supervised machine learning algorithm replaces the heavy workload of identifying data standards manually, and working efficiency can be improved by 30-50%, depending on model training accuracy and the characteristics of the data standards.

Drawings

FIG. 1 is a schematic flow chart of the present invention;

FIG. 2 is a schematic diagram of supervised machine learning according to the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

Referring to fig. 1-2, the present invention provides a technical solution: a method for intelligently matching and checking data standards comprises the following steps:

step S1: establishing a data standard set:

step S2: extracting system metadata and field data:

step S4: manually supplementing the unmatched fields:

step S5: finding data standard problems and driving rectification:

1) Table naming is not standard;

2) The naming of the table field is not standard;

In the present invention, the field naming standard of step S1 includes: the naming of data tables, for example, the data table for days is named Day; and the naming of data fields, for example, the date field is named DATE and the time field is named TIME;

In the invention, step S3 builds a model from metadata field feature vectors and data feature vectors, labels the sample data and feeds it back to the model for learning, and then uses the trained model to match the full set of metadata fields and field data, thereby establishing the matching relationship between the data standard and the actual data.

In the invention, big data analysis is performed before step S1: the data element information in the data is summarized to form a standard data element format, the standard data elements are maintained in a data element standard management system, and word segmentation is performed on the data elements by using the search server Elasticsearch.

In the invention, the word segmentation of the data elements using the search server Elasticsearch is implemented in the following steps:

firstly, submitting the data to the Elasticsearch database;

In the invention, the word segmentation controller uses a Chinese word segmentation plugin to segment the Chinese name of the standard data element to construct a Chinese dictionary; the word segmentation controller uses an English word segmentation plug-in to segment English names in standard data elements, and an English dictionary is constructed.

In the invention, in step S4, a data table to be detected is selected through the data source management system and the fields to be detected are maintained; an automatically scheduled task queries the fields marked in the database table and stores them, then searches for and lists the associated data element information according to the Chinese and English names of the fields to form a detection instance plan; after the plan is completed, manual review and supplementation are performed to further verify the accuracy of the system detection, correct errors, and accumulate improvements to the detection system, and a detection report is submitted after verification is complete.

The invention comprises the following steps:

establishing a data standard set: the data standard is divided into a field naming standard and a field coding standard, wherein the field naming standard includes the naming of data tables, for example, the data table for days is named Day, and the naming of data fields, for example, the date field is named DATE and the time field is named TIME; and the field coding standard includes detailed codes specified for each data category, for example currency codes such as CNY and USD, country codes such as CN and US, enterprise organization codes, personnel position codes and the like;

extracting system metadata and field data: a data exploration tool extracts the metadata in a system database and the sample data corresponding to each field, and stores the metadata and the sample data in an algorithm training library;

matching the fields to be standardized through a machine learning algorithm: the specified data standard is matched with the extracted metadata and sample data by establishing a supervised machine learning algorithm; specifically, a model is built from metadata field feature vectors and data feature vectors, the sample data is labeled and fed back to the model for learning, and the trained model is then used to match the full set of metadata fields and field data, thereby establishing the matching relationship between the data standard and the actual data;

manually supplementing the unmatched fields: testing shows that, after model training, the recognition rate for associations between data tables and the data standard is 30%-50%; fields whose association is not recognized but which still require data standardization are supplemented manually;

finding data standard problems and driving rectification: after the final association result between the data standard and the data tables is output, the following data standard problems are output: table naming is not standard; the naming of the table field is not standard; the data encoding in the table field is not in accordance with the data standard; the data standard problems are then distributed to each system data manager to carry out unified correction of the data standard.
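The extraction step can be sketched against a concrete database. The example below uses SQLite from the Python standard library purely as a stand-in for a source system; the function name, the record layout of the "algorithm training library", and the sample size are assumptions, and the data exploration tool referred to in the text is not identified by the source.

```python
import sqlite3

def extract_metadata_and_samples(conn, sample_size=5):
    """Return one record per column: table, field, declared type, sample values."""
    records = []
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
    )]
    for table in tables:
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column.
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        for _cid, name, col_type, *_rest in columns:
            rows = conn.execute(
                f'SELECT "{name}" FROM "{table}" WHERE "{name}" IS NOT NULL LIMIT ?',
                (sample_size,),
            ).fetchall()
            records.append({
                "table": table,
                "field": name,
                "type": col_type,
                "samples": [row[0] for row in rows],
            })
    return records

# A throwaway in-memory database standing in for one source system.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (order_date TEXT, currency TEXT)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("2023-11-03", "CNY"), ("2023-11-04", "USD")],
)
training_library = extract_metadata_and_samples(source)
```

Each record pairs a field's metadata with a handful of its values, which is the input shape a supervised matcher would need for both metadata features and data features.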

By establishing data standards and checking them against the data in each system, the invention unifies the data standards across the databases of multi-source heterogeneous systems and provides a solid foundation for data processing, analysis and data application; the supervised machine learning algorithm replaces the heavy workload of identifying data standards manually, and working efficiency can be improved by 30-50%, depending on model training accuracy and the characteristics of the data standards.

While embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations may be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims

1. A method for intelligently matching and checking data standards is characterized in that: the method comprises the following steps:

step S1: establishing a data standard set:

step S2: extracting system metadata and field data:

step S4: manually supplementing the unmatched fields:

step S5: finding data standard problems and driving rectification:

1) Table naming is not standard;

2) The naming of the table field is not standard;

2. The method for intelligently matching and checking data standards according to claim 1, wherein: the field naming standard of step S1 comprises: the naming of data tables, for example, the data table for days is named Day; and the naming of data fields, for example, the date field is named DATE and the time field is named TIME;

3. The method for intelligently matching and checking data standards according to claim 1, wherein: step S3 specifically builds a model from metadata field feature vectors and data feature vectors, labels the sample data and feeds it back to the model for learning, and then uses the trained model to match the full set of metadata fields and field data, so as to establish the matching relationship between the data standard and the actual data.

4. The method for intelligently matching and checking data standards according to claim 1, wherein: big data analysis is performed before step S1, the data element information in the data is summarized to form a standard data element format, the standard data elements are maintained in a data element standard management system, and word segmentation is performed on the data elements by using the search server Elasticsearch.

5. The method for intelligently matching and checking data standards according to claim 4, wherein: the word segmentation of the data elements using the search server Elasticsearch is implemented in the following steps:

firstly, submitting the data to the Elasticsearch database;

6. The method for intelligently matching and checking data standards according to claim 5, wherein: the word segmentation controller uses a Chinese word segmentation plug-in to segment the Chinese name of the standard data element to construct a Chinese dictionary; the word segmentation controller uses an English word segmentation plug-in to segment English names in standard data elements, and an English dictionary is constructed.

7. The method for intelligently matching and checking data standards according to claim 1, wherein: in step S4, a data table to be detected is selected through a data source management system and the fields to be detected are maintained; an automatically scheduled task queries the fields marked in the database table and stores them, then searches for and lists the associated data element information according to the Chinese and English names of the fields to form a detection instance plan; after the plan is completed, manual review and supplementation are performed to further verify the accuracy of the system detection, correct errors, and accumulate improvements to the detection system, and a detection report is submitted after verification is complete.