CN117708102A - Intelligent matching and checking method for data standard - Google Patents

Publication number
CN117708102A
Authority
CN
China
Prior art keywords
data
standard
field
matching
metadata
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311451931.9A
Other languages
Chinese (zh)
Inventor
逯明
胡运燕
叶国强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhuhai Huafa Group Technology Research Institute Co ltd
Original Assignee
Zhuhai Huafa Group Technology Research Institute Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhuhai Huafa Group Technology Research Institute Co ltd
Priority to CN202311451931.9A
Publication of CN117708102A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for intelligently matching and checking data standards, comprising the following steps: establishing a data standard set, in which the data standard is divided into a field naming standard and a field coding standard; extracting system metadata and field data, in which a data exploration tool extracts the metadata in each system database and the sample data corresponding to each field and stores them in an algorithm training library; matching the fields to be standardized through a machine learning algorithm; manually supplementing the unmatched fields; and identifying data standard problems and driving their rectification. By establishing data standards and matching and checking them against the data in each system, the method unifies the data standards across multi-source heterogeneous system databases and provides a solid foundation for data processing, analysis and data application. The supervised machine learning algorithm replaces much of the heavy manual work of identifying data standards, and working efficiency can be improved by 30-50%, depending on model training accuracy and the characteristics of the data standards.

Description

Intelligent matching and checking method for data standard
Technical Field
The invention relates to the technical field of data processing, in particular to a method for intelligently matching and checking data standards.
Background
In enterprise digitalization, data standards provide normative guidance for defining enterprise data, so that the data can effectively serve business decisions. The more complete the data standards, the lower the cost of data governance.
However, enterprises generally run heterogeneous, multi-source information systems. Although this does not hinder day-to-day business, it creates a huge workload when the data standards have to be put into practice.
Disclosure of Invention
The invention aims to provide a method for intelligently matching and checking data standards, which aims to solve the problems in the background technology.
In order to achieve the above purpose, the present invention provides the following technical solutions: a method for intelligently matching and checking data standards comprises the following steps:
step S1: establishing a data standard set:
the data standard is divided into a field naming standard and a field coding standard;
step S2: extracting system metadata and field data:
extracting metadata in a system database and sample data corresponding to fields by a data exploration tool, and storing the metadata and the sample data in an algorithm training library;
step S3: matching fields to be standardized through a machine learning algorithm:
matching the specified data standard with the extracted metadata and sample data by establishing a supervised machine learning algorithm;
step S4: manually supplementing the unmatched fields:
testing shows that, after model training, the model recognizes 30%-50% of the associations between data tables and data standards; fields whose association is not recognized but which still require data standardization are supplemented manually;
step S5: identifying data standard problems and driving rectification:
after the final association results between the data standards and the data tables are output, the following data standard problems are reported:
1) Table naming does not conform to the standard;
2) Table field naming does not conform to the standard;
3) Data encoding in a table field does not conform to the data standard;
and the data standard problems are distributed to the data manager of each system for unified rectification against the data standard.
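The three problem types reported in step S5 can be illustrated with a minimal Python sketch. The naming rules (`TABLE_NAME_RULE`, `FIELD_NAME_RULE`), the `FieldMatch` record and the `find_problems` function are hypothetical names assumed for illustration; the patent does not specify how the checks are implemented.

```python
import re
from dataclasses import dataclass

# Hypothetical naming rules, assumed for illustration only.
TABLE_NAME_RULE = re.compile(r"^[A-Z][a-z]+$")      # e.g. "Day"
FIELD_NAME_RULE = re.compile(r"^[A-Z][A-Z0-9_]*$")  # e.g. "DATE", "TIME"


@dataclass
class FieldMatch:
    table: str
    field: str
    allowed_codes: frozenset  # codes permitted by the matched standard (empty = no code list)
    sample_values: tuple      # values sampled from the source system


def find_problems(matches):
    """Return (problem_type, table, field, detail) tuples for each violation."""
    problems = []
    seen_tables = set()
    for m in matches:
        # 1) table naming, reported once per table
        if m.table not in seen_tables:
            seen_tables.add(m.table)
            if not TABLE_NAME_RULE.match(m.table):
                problems.append(("table_naming", m.table, None, m.table))
        # 2) field naming
        if not FIELD_NAME_RULE.match(m.field):
            problems.append(("field_naming", m.table, m.field, m.field))
        # 3) data encoding, only checked when the standard defines a code list
        bad = sorted(set(m.sample_values) - m.allowed_codes)
        if m.allowed_codes and bad:
            problems.append(("field_encoding", m.table, m.field, ",".join(bad)))
    return problems


matches = [
    FieldMatch("Day", "DATE", frozenset(), ("2023-11-03",)),
    FieldMatch("t_order", "currency", frozenset({"CNY", "USD"}), ("CNY", "RMB")),
]
problems = find_problems(matches)
```

In this example the `Day` table passes all checks, while `t_order` yields one problem of each type; such a list could then be routed to the data manager of the system concerned.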
Preferably, the field naming standard of step S1 includes: the naming of data tables, for example, the day data table is named Day; and the naming of data fields, for example, the date field is named DATE and the time field is named TIME;
the field coding standard includes: detailed codes specified for each data category, for example currency codes such as CNY and USD, country codes such as CN and US, enterprise organization codes, and personnel position codes.
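One possible in-memory representation of such a data standard set is sketched below in Python. The class names (`NamingStandard`, `CodingStandard`, `DataStandardSet`) and their methods are assumptions made for illustration; only the example values (Day, DATE, TIME, CNY, USD, CN, US) come from the text.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class NamingStandard:
    concept: str        # what the table or field represents
    kind: str           # "table" or "field"
    standard_name: str  # the name mandated by the standard


@dataclass(frozen=True)
class CodingStandard:
    category: str     # data category, e.g. "currency"
    codes: frozenset  # the code values allowed for that category


@dataclass
class DataStandardSet:
    naming: list = field(default_factory=list)
    coding: dict = field(default_factory=dict)

    def add_naming(self, concept, kind, standard_name):
        self.naming.append(NamingStandard(concept, kind, standard_name))

    def add_coding(self, category, codes):
        self.coding[category] = CodingStandard(category, frozenset(codes))

    def is_valid_code(self, category, value):
        return value in self.coding[category].codes


standards = DataStandardSet()
standards.add_naming("day", "table", "Day")
standards.add_naming("date", "field", "DATE")
standards.add_naming("time", "field", "TIME")
standards.add_coding("currency", ["CNY", "USD"])
standards.add_coding("country", ["CN", "US"])
```

Keeping naming rules and code lists in separate collections mirrors the division into a field naming standard and a field coding standard.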
Preferably, in step S3, a model is built from metadata field feature vectors and data feature vectors; sample data are labeled and fed back to the model for training; the trained model is then applied to the full set of metadata fields and field data, thereby establishing the matching relationship between the data standards and the actual data.
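The patent does not name a particular supervised algorithm or feature encoding. As a rough sketch only, the Python below uses character-bigram counts of the field name and its sample values as the feature vector and a nearest-centroid classifier with cosine similarity as the model; both choices, the `threshold` value and all function names are assumptions.

```python
import math
from collections import Counter, defaultdict


def features(field_name, sample_values):
    """Character-bigram counts; name and value bigrams are kept apart by a prefix."""
    vec = Counter()
    texts = [("n:", field_name)] + [("v:", str(v)) for v in sample_values]
    for prefix, text in texts:
        padded = f"^{text.lower()}$"
        for i in range(len(padded) - 1):
            vec[prefix + padded[i:i + 2]] += 1
    return vec


def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def train(labelled):
    """labelled: iterable of (field_name, sample_values, standard_label)."""
    centroids = defaultdict(Counter)
    for name, values, label in labelled:
        centroids[label].update(features(name, values))  # sum vectors per label
    return dict(centroids)


def match(model, field_name, sample_values, threshold=0.3):
    """Return the best-matching standard, or None to leave the field for manual review."""
    vec = features(field_name, sample_values)
    label, score = max(((lbl, cosine(vec, c)) for lbl, c in model.items()),
                       key=lambda pair: pair[1])
    return label if score >= threshold else None


# Labeled sample data (hypothetical) used to train the model.
model = train([
    ("currency", ["CNY", "USD"], "currency_code"),
    ("curr_code", ["USD", "CNY"], "currency_code"),
    ("country", ["CN", "US"], "country_code"),
    ("nation_cd", ["US", "CN"], "country_code"),
])
best = match(model, "currency_cd", ["USD"])
unknown = match(model, "zzz", ["123"])
```

Fields for which `match` returns `None` correspond to the unmatched fields that step S4 hands over to manual supplementation.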
Preferably, before step S1, big data analysis is performed, the data element information in the data is summarized into a standard data element format, the standard data elements are maintained in a data element standard management system, and word segmentation is performed on the data elements using the search server Elasticsearch.
Preferably, the word segmentation of the data elements using the search server Elasticsearch is implemented in the following steps:
first, the data is submitted to the Elasticsearch database;
the word segmentation controller segments the corresponding sentences, and the weights and segmentation results are stored in the database;
when a user searches for data, the search results are scored and ranked according to the weights, and the results are returned to the user.
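The submit, segment and rank flow above can be mimicked in plain Python to show the idea. This is a toy stand-in, not the Elasticsearch API: the `segment` tokenizer (ASCII words plus single CJK characters), the term-frequency weights and the `TinyIndex` class are all assumptions made for illustration.

```python
import re
from collections import Counter, defaultdict


def segment(text):
    """Crude tokenizer: lowercase ASCII words plus single CJK characters."""
    return re.findall(r"[a-z0-9]+|[\u4e00-\u9fff]", text.lower())


class TinyIndex:
    def __init__(self):
        self.postings = defaultdict(dict)  # term -> {doc_id: weight}

    def submit(self, doc_id, text):
        """Segment the text and store each term with a term-frequency weight."""
        counts = Counter(segment(text))
        total = sum(counts.values())
        for term, n in counts.items():
            self.postings[term][doc_id] = n / total

    def search(self, query):
        """Score documents by summed weights of matching terms, best first."""
        scores = Counter()
        for term in segment(query):
            for doc_id, weight in self.postings.get(term, {}).items():
                scores[doc_id] += weight
        return [doc_id for doc_id, _ in scores.most_common()]


index = TinyIndex()
index.submit("currency_code", "currency code 货币代码")
index.submit("country_code", "country code 国家代码")
index.submit("create_time", "create time 创建时间")
```

A real deployment would rely on the analyzers and relevance scoring of the search server itself; the sketch only shows why storing weights at indexing time makes ranking at query time cheap.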
Preferably, the word segmentation controller uses a Chinese word segmentation plug-in to segment the Chinese names of the standard data elements and construct a Chinese dictionary; the word segmentation controller uses an English word segmentation plug-in to segment the English names of the standard data elements and construct an English dictionary.
Preferably, in step S4, the data tables to be checked are selected through the data source management system, and the fields to be checked are maintained and stored; an automatically scheduled task queries the marked fields in the database tables, and the associated data element information is searched and listed according to each field's Chinese and English names to form a detection instance plan; once the plan is complete, manual review and supplementation are performed to verify the accuracy of the system's detection and correct errors, so that the detection system is progressively optimized, and a detection report is submitted after verification.
Compared with the prior art, the invention has the beneficial effects that:
1. unifying the data standards in the multi-source heterogeneous system database by establishing the data standards and carrying out matching check on the data standards and the data in each system, and providing a solid foundation for data processing analysis and data application;
2. the supervised machine learning algorithm replaces huge workload of manual recognition data standard, and the working efficiency can be improved by 30-50% based on the difference of model training accuracy and data standard characteristics.
Drawings
FIG. 1 is a schematic flow chart of the present invention;
FIG. 2 is a schematic diagram of supervised machine learning according to the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to fig. 1-2, the present invention provides a technical solution: a method for intelligently matching and checking data standards comprises the following steps:
step S1: establishing a data standard set:
the data standard is divided into a field naming standard and a field coding standard;
step S2: extracting system metadata and field data:
extracting metadata in a system database and sample data corresponding to fields by a data exploration tool, and storing the metadata and the sample data in an algorithm training library;
step S3: matching fields to be standardized through a machine learning algorithm:
matching the specified data standard with the extracted metadata and sample data by establishing a supervised machine learning algorithm;
step S4: manually supplementing the unmatched fields:
testing shows that, after model training, the model recognizes 30%-50% of the associations between data tables and data standards; fields whose association is not recognized but which still require data standardization are supplemented manually;
step S5: identifying data standard problems and driving rectification:
after the final association results between the data standards and the data tables are output, the following data standard problems are reported:
1) Table naming does not conform to the standard;
2) Table field naming does not conform to the standard;
3) Data encoding in a table field does not conform to the data standard;
and the data standard problems are distributed to the data manager of each system for unified rectification against the data standard.
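Step S2 relies on a data exploration tool that the patent does not describe further. As a minimal sketch, assuming the source system is a SQLite database, the metadata and a few sample values per field could be collected as follows; the function name `extract_metadata`, the record layout and the `training_library` list are hypothetical.

```python
import sqlite3


def extract_metadata(conn, sample_size=3):
    """Return one record per column: table, field, declared type, sample values."""
    records = []
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    for table in tables:
        # Identifiers come from the database catalogue itself, so they are
        # quoted and interpolated; values are passed as bound parameters.
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        # Each PRAGMA row is (cid, name, type, notnull, dflt_value, pk).
        for _cid, column, col_type, *_rest in columns:
            samples = [row[0] for row in conn.execute(
                f'SELECT DISTINCT "{column}" FROM "{table}" '
                f'WHERE "{column}" IS NOT NULL LIMIT ?', (sample_size,))]
            records.append({"table": table, "field": column,
                            "type": col_type, "samples": samples})
    return records


# A throwaway in-memory database standing in for one source system.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE t_order (id INTEGER, currency TEXT)")
source.executemany("INSERT INTO t_order VALUES (?, ?)",
                   [(1, "CNY"), (2, "USD"), (3, "CNY")])
training_library = extract_metadata(source)
```

Other database engines expose the same information through their own catalogues, so the extraction query would differ per source system while the record layout stays the same.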
In the present invention, the field naming standard of step S1 includes: the naming of data tables, for example, the day data table is named Day; and the naming of data fields, for example, the date field is named DATE and the time field is named TIME;
the field coding standard includes: detailed codes specified for each data category, for example currency codes such as CNY and USD, country codes such as CN and US, enterprise organization codes, and personnel position codes.
In the invention, step S3 specifically builds a model from metadata field feature vectors and data feature vectors; sample data are labeled and fed back to the model for training; the trained model is then applied to the full set of metadata fields and field data, thereby establishing the matching relationship between the data standards and the actual data.
In the invention, before step S1, big data analysis is performed, the data element information in the data is summarized into a standard data element format, the standard data elements are maintained in a data element standard management system, and word segmentation is performed on the data elements using the search server Elasticsearch.
In the invention, the word segmentation of the data elements using the search server Elasticsearch is implemented in the following steps:
first, the data is submitted to the Elasticsearch database;
the word segmentation controller segments the corresponding sentences, and the weights and segmentation results are stored in the database;
when a user searches for data, the search results are scored and ranked according to the weights, and the results are returned to the user.
In the invention, the word segmentation controller uses a Chinese word segmentation plugin to segment the Chinese name of the standard data element to construct a Chinese dictionary; the word segmentation controller uses an English word segmentation plug-in to segment English names in standard data elements, and an English dictionary is constructed.
In the invention, in step S4, the data tables to be checked are selected through the data source management system, and the fields to be checked are maintained and stored; an automatically scheduled task queries the marked fields in the database tables, and the associated data element information is searched and listed according to each field's Chinese and English names to form a detection instance plan; once the plan is complete, manual review and supplementation are performed to verify the accuracy of the system's detection and correct errors, so that the detection system is progressively optimized, and a detection report is submitted after verification.
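A detection instance plan of this kind can be pictured as a list of fields, each with candidate data elements awaiting manual review. The Python sketch below is an illustration under stated assumptions: the `DATA_ELEMENTS` catalogue, the `PlanItem` record, the status values and the use of `difflib.get_close_matches` for the English-name lookup are not taken from the patent.

```python
from dataclasses import dataclass, field
from difflib import get_close_matches

# Hypothetical catalogue of standard data elements: English name -> Chinese name.
DATA_ELEMENTS = {
    "currency_code": "货币代码",
    "country_code": "国家代码",
    "create_time": "创建时间",
}


@dataclass
class PlanItem:
    table: str
    field_name: str
    candidates: list = field(default_factory=list)
    status: str = "pending_review"  # a reviewer later sets "confirmed" or "corrected"


def build_plan(fields):
    """fields: iterable of (table, english_name, chinese_name) marked for detection."""
    plan = []
    for table, en_name, zh_name in fields:
        # Fuzzy lookup on the English name, substring lookup on the Chinese name.
        by_en = get_close_matches(en_name.lower(), list(DATA_ELEMENTS), n=3, cutoff=0.6)
        by_zh = [en for en, zh in DATA_ELEMENTS.items() if zh_name and zh_name in zh]
        candidates = list(dict.fromkeys(by_en + by_zh))  # merge, keep order, drop duplicates
        plan.append(PlanItem(table, en_name, candidates))
    return plan


plan = build_plan([("t_order", "currency_cd", "货币"), ("t_order", "remark", "")])
```

Items with an empty candidate list, such as `remark` here, are the ones a reviewer would have to supplement by hand before the detection report is submitted.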
The invention proceeds as follows. A data standard set is established: the data standard is divided into a field naming standard and a field coding standard. The field naming standard covers the naming of data tables (for example, the day data table is named Day) and the naming of data fields (for example, the date field is named DATE and the time field is named TIME). The field coding standard specifies detailed codes for each data category, for example currency codes such as CNY and USD, country codes such as CN and US, enterprise organization codes, and personnel position codes. System metadata and field data are extracted: a data exploration tool extracts the metadata in each system database and the sample data corresponding to each field and stores them in an algorithm training library. The fields to be standardized are matched through a machine learning algorithm: the specified data standards are matched against the extracted metadata and sample data by a supervised machine learning algorithm; specifically, a model is built from metadata field feature vectors and data feature vectors, sample data are labeled and fed back to the model for training, and the trained model is then applied to the full set of metadata fields and field data, thereby establishing the matching relationship between the data standards and the actual data. Unmatched fields are supplemented manually: testing shows that, after model training, the model recognizes 30%-50% of the associations between data tables and data standards; fields whose association is not recognized but which still require data standardization are supplemented manually. Data standard problems are identified and rectification is driven: after the final association results between the data standards and the data tables are output, the following data standard problems are reported: table naming does not conform to the standard; table field naming does not conform to the standard; data encoding in a table field does not conform to the data standard. The data standard problems are then distributed to the data manager of each system for unified rectification against the data standard.
By establishing data standards and matching and checking them against the data in each system, the invention unifies the data standards across multi-source heterogeneous system databases and provides a solid foundation for data processing, analysis and data application. The supervised machine learning algorithm replaces much of the heavy manual work of identifying data standards, and working efficiency can be improved by 30-50%, depending on model training accuracy and the characteristics of the data standards.
While embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations may be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (7)

1. A method for intelligently matching and checking data standards is characterized in that: the method comprises the following steps:
step S1: establishing a data standard set:
the data standard is divided into a field naming standard and a field coding standard;
step S2: extracting system metadata and field data:
extracting metadata in a system database and sample data corresponding to fields by a data exploration tool, and storing the metadata and the sample data in an algorithm training library;
step S3: matching fields to be standardized through a machine learning algorithm:
matching the specified data standard with the extracted metadata and sample data by establishing a supervised machine learning algorithm;
step S4: manually supplementing the unmatched fields:
testing shows that, after model training, the model recognizes 30%-50% of the associations between data tables and data standards; fields whose association is not recognized but which still require data standardization are supplemented manually;
step S5: identifying data standard problems and driving rectification:
after the final association results between the data standards and the data tables are output, the following data standard problems are reported:
1) Table naming does not conform to the standard;
2) Table field naming does not conform to the standard;
3) Data encoding in a table field does not conform to the data standard;
and the data standard problems are distributed to the data manager of each system for unified rectification against the data standard.
2. The method for intelligently matching and checking data standards according to claim 1, wherein: the field naming standard of step S1 includes: the naming of data tables, for example, the day data table is named Day; and the naming of data fields, for example, the date field is named DATE and the time field is named TIME;
the field coding standard includes: detailed codes specified for each data category, for example currency codes such as CNY and USD, country codes such as CN and US, enterprise organization codes, and personnel position codes.
3. The method for intelligently matching and checking data standards according to claim 1, wherein: step S3 specifically builds a model from metadata field feature vectors and data feature vectors; sample data are labeled and fed back to the model for training; the trained model is then applied to the full set of metadata fields and field data, thereby establishing the matching relationship between the data standards and the actual data.
4. The method for intelligently matching and checking data standards according to claim 1, wherein: before step S1, big data analysis is performed, the data element information in the data is summarized into a standard data element format, the standard data elements are maintained in a data element standard management system, and word segmentation is performed on the data elements using the search server Elasticsearch.
5. The method for intelligently matching and checking data standards according to claim 4, wherein: the word segmentation of the data elements using the search server Elasticsearch is implemented in the following steps:
first, the data is submitted to the Elasticsearch database;
the word segmentation controller segments the corresponding sentences, and the weights and segmentation results are stored in the database;
when a user searches for data, the search results are scored and ranked according to the weights, and the results are returned to the user.
6. The method for intelligently matching and checking data standards according to claim 5, wherein: the word segmentation controller uses a Chinese word segmentation plug-in to segment the Chinese name of the standard data element to construct a Chinese dictionary; the word segmentation controller uses an English word segmentation plug-in to segment English names in standard data elements, and an English dictionary is constructed.
7. The method for intelligently matching and checking data standards according to claim 1, wherein: in step S4, the data tables to be checked are selected through a data source management system, and the fields to be checked are maintained and stored; an automatically scheduled task queries the marked fields in the database tables, and the associated data element information is searched and listed according to each field's Chinese and English names to form a detection instance plan; once the plan is complete, manual review and supplementation are performed to verify the accuracy of the system's detection and correct errors, so that the detection system is progressively optimized, and a detection report is submitted after verification.
CN202311451931.9A 2023-11-03 2023-11-03 Intelligent matching and checking method for data standard Pending CN117708102A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311451931.9A CN117708102A (en) 2023-11-03 2023-11-03 Intelligent matching and checking method for data standard

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311451931.9A CN117708102A (en) 2023-11-03 2023-11-03 Intelligent matching and checking method for data standard

Publications (1)

Publication Number Publication Date
CN117708102A (en) 2024-03-15

Family

ID=90157757

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311451931.9A Pending CN117708102A (en) 2023-11-03 2023-11-03 Intelligent matching and checking method for data standard

Country Status (1)

Country Link
CN (1) CN117708102A (en)

Legal Events

Date Code Title Description
PB01 Publication