CN115587125A - Metadata management method and device - Google Patents

Metadata management method and device

Info

Publication number
CN115587125A
CN115587125A
Authority
CN
China
Prior art keywords
metadata
data
thermal
sub
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211282562.0A
Other languages
Chinese (zh)
Inventor
马静
杨卓群
姜凯文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Telecom Corp Ltd
Original Assignee
China Telecom Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Telecom Corp Ltd filed Critical China Telecom Corp Ltd
Priority to CN202211282562.0A priority Critical patent/CN115587125A/en
Publication of CN115587125A publication Critical patent/CN115587125A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2457 Query processing with adaptation to user needs
    • G06F16/24573 Query processing with adaptation to user needs using data annotations, e.g. user-defined metadata
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a metadata management method and device. The method comprises the following steps: acquiring all stock first metadata generated before a target time in a target data system, and generating a metadata base based on the first metadata; classifying all first metadata in the metadata base with a pre-trained data classification model, dividing each first metadata into first hot metadata or first cold metadata; determining, among the plurality of first hot metadata, first sub-hot metadata that meets a preset data standard and second sub-hot metadata that does not meet the preset data standard, and performing data correction on the second sub-hot metadata; and synchronizing the first sub-hot metadata and the data-corrected second sub-hot metadata onto the blockchain. The method and device address the technical problem that, in the related art, metadata management is disordered, so that metadata quality is low and the metadata is difficult to apply.

Description

Metadata management method and device
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a metadata management method and device.
Background
With the rapid development of the big-data era, enterprises that have undergone years of informatization have accumulated a large number of IT systems and a large amount of metadata. However, the aggregation of large-scale data brings unreasonable data distribution and sharing, leakage of private data, misuse of data and low data quality, so problems such as untrustworthy data-driven decisions urgently need to be solved.
In general, enterprise-level metadata management adopts either a centralized or a distributed architecture. A centralized architecture facilitates standardized, unified management and application of metadata, but to guarantee data consistency it must process a large amount of data and places high demands on storage and the platform. A distributed architecture keeps metadata up to date and effective and makes queries simple, but it has difficulty guaranteeing data consistency, and the data standards of different data sources are hard to unify.
Neither architecture can therefore guarantee the accuracy, integrity and consistency of metadata, so metadata quality is poor and enterprise data is difficult to manage effectively through metadata.
In view of the above problems, no effective solution has been proposed.
Disclosure of Invention
The embodiments of the application provide a metadata management method and a metadata management device, so as to at least solve the technical problem that, in the related art, metadata management is disordered, so that metadata quality is low and the metadata is difficult to apply.
According to one aspect of the embodiments of the present application, a metadata management method is provided, including: acquiring all stock first metadata generated before a target time in a target data system, and generating a metadata base based on the first metadata; classifying all first metadata in the metadata base with a pre-trained data classification model, dividing each first metadata into first hot metadata or first cold metadata; determining, among the plurality of first hot metadata, first sub-hot metadata that meets a preset data standard and second sub-hot metadata that does not meet the preset data standard, and performing data correction on the second sub-hot metadata; and synchronizing the first sub-hot metadata and the data-corrected second sub-hot metadata onto a blockchain.
Optionally, incremental second metadata generated after the target time in the target data system is obtained; data verification is performed on the second metadata, the data verification at least comprising data accuracy verification, data integrity verification and data consistency verification; and when the second metadata passes the data verification, the second metadata is added into the metadata base and synchronized onto the blockchain.
Optionally, after a preset time period, all first cold metadata in the metadata base are classified again with the data classification model, and each first cold metadata is divided into second hot metadata or second cold metadata; third sub-hot metadata that meets the preset data standard and fourth sub-hot metadata that does not meet the preset data standard are determined among the plurality of second hot metadata, and data correction is performed on the fourth sub-hot metadata; and the third sub-hot metadata and the data-corrected fourth sub-hot metadata are synchronized onto the blockchain.
Optionally, acquiring the stock first metadata generated before the target time in the target data system and generating a metadata base based on the first metadata includes: determining a target collection task, the target collection task at least comprising table-building statements, path information, required authority and collection frequency; collecting data from a plurality of relational and non-relational databases in the target data system based on the target collection task, to obtain all stock technical metadata before the target time; and performing data cleaning and conversion on the technical metadata to obtain the first metadata, the first metadata at least comprising data source system information, database information, data table information, table field information, index information and constraint information.
Optionally, the training process of the data classification model includes: constructing a neural network model to be trained based on a gated recurrent unit (GRU), the neural network model comprising an input layer, an output layer, a reset gate and an update gate; determining, by parsing metadata operation logs, the number of operations and the latest operation time of each first metadata in the metadata base within a target time period, and ranking the first metadata by number of operations and by latest operation time; for each first metadata, determining a training word vector and a hyperparameter corresponding to the first metadata, the training word vector at least comprising the metadata name of the first metadata, the number of operations of the first metadata in the target time period and the latest operation time of the first metadata in the target time period, and the hyperparameter being a weighted average of the first metadata's operation-count rank and latest-operation-time rank; and sequentially inputting each training word vector and hyperparameter into the neural network model for iterative training, and adjusting the model parameters of the neural network model with a back-propagation algorithm, to obtain the data classification model.
Optionally, performing data correction on the second sub-hot metadata includes: for each second sub-hot metadata, determining a plurality of table metadata in the data source system corresponding to the second sub-hot metadata, and determining a plurality of field metadata in each table metadata; vectorizing the plurality of field metadata with one-hot encoding, determining the central word of each table metadata based on a continuous bag-of-words model algorithm, and determining the central topics of the plurality of table metadata based on a probabilistic latent semantic analysis algorithm; determining, from the target data system, a target sub-data system matching the central topic, and determining a correction value of the second sub-hot metadata based on data in the target sub-data system; and performing data correction on the second sub-hot metadata based on the correction value.
Optionally, synchronizing the first sub-hot metadata and the data-corrected second sub-hot metadata onto the blockchain includes: uploading the first sub-hot metadata and the data-corrected second sub-hot metadata to a target blockchain platform, the target blockchain platform being used to verify the first sub-hot metadata and the data-corrected second sub-hot metadata and confirm rights to them, and to perform trusted transactions on the first sub-hot metadata and the data-corrected second sub-hot metadata in a smart-contract manner.
Optionally, performing data verification on the second metadata includes: inputting the second metadata into a pre-trained data verification model to obtain a target confidence output by the data verification model, the data verification model being used to perform accuracy verification, integrity verification and consistency verification on input data and to output the confidence that the data passes verification; when the target confidence is greater than a preset confidence threshold, determining that the second metadata passes data verification; and when the target confidence is not greater than the preset confidence threshold, sending the second metadata to a manual verification module for manual data verification.
According to another aspect of the embodiments of the present application, a metadata management apparatus is also provided, including: an acquisition module, used to acquire all stock first metadata generated before the target time in the target data system and to generate a metadata base based on the first metadata; a classification module, used to classify all first metadata in the metadata base with a pre-trained data classification model and to divide each first metadata into first hot metadata or first cold metadata; a correction module, used to determine, among the plurality of first hot metadata, first sub-hot metadata that meets a preset data standard and second sub-hot metadata that does not meet the preset data standard, and to perform data correction on the second sub-hot metadata; and a synchronization module, used to synchronize the first sub-hot metadata and the data-corrected second sub-hot metadata onto the blockchain.
According to another aspect of the embodiments of the present application, there is also provided an electronic device, including: a memory in which a computer program is stored, and a processor configured to execute the above-described metadata management method by the computer program.
In the embodiments of the application, all stock first metadata generated before a target time in a target data system are acquired, and a metadata base is generated based on the first metadata; all first metadata in the metadata base are classified with a pre-trained data classification model, and each first metadata is divided into first hot metadata or first cold metadata; among the plurality of first hot metadata, first sub-hot metadata that meets a preset data standard and second sub-hot metadata that does not meet the preset data standard are determined, and data correction is performed on the second sub-hot metadata; and the first sub-hot metadata and the data-corrected second sub-hot metadata are synchronized onto the blockchain. After the first metadata are classified by the pre-trained data classification model, the second sub-hot metadata obtained by the classification are accurately corrected, and the corrected second sub-hot metadata and the first sub-hot metadata are synchronized onto the blockchain, realizing the shared transaction of metadata within the enterprise. This guarantees metadata quality and data consistency, keeps the metadata in the latest, effective state, and keeps the query mode simple, thereby solving the technical problem that, in the related art, metadata management is disordered, so that metadata quality is low and the metadata is difficult to apply.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a schematic flow chart diagram of a metadata management method according to an embodiment of the present application;
FIG. 2 is a schematic flow chart diagram illustrating an alternative metadata management method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of an alternative GRU neural network model according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a metadata management apparatus according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only partial embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
For a better understanding of the embodiments of the present application, some of the nouns and terms appearing in the description of the embodiments are first explained as follows:
GRU (gated recurrent unit): the GRU introduces a reset gate and an update gate, where the reset gate helps capture short-term dependencies in a time series and the update gate helps capture long-term dependencies. The inputs of the reset gate and the update gate are both the input of the current time step and the hidden state of the previous time step, and their outputs are obtained through a fully connected layer whose activation function is the sigmoid function; by controlling the flow of information through the learned gates, the GRU can better capture dependencies over larger time-step distances in the time series. The reset gate and the update gate modify the way the hidden state is computed in the GRU: the output of the hidden state is obtained by combining the hidden state of the previous time step with the candidate hidden state of the current step, weighted by the output of the update gate of the current step.
One-Hot encoding: also known as one-bit-effective encoding, it uses an N-bit status register to encode N states, each state being represented by its own independent register bit, with only one bit active at any time. One-Hot encoding represents a categorical variable as a binary vector: the categorical values are first mapped to integer values, and each integer value is then represented as a binary vector in which all positions are zero except the position of the integer index, which is marked as 1.
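By way of illustration only (not part of the original disclosure), the following Python sketch one-hot encodes a few hypothetical field-metadata names; the vocabulary and names are assumptions.

```python
# Minimal one-hot encoding sketch for a few hypothetical field-metadata names.
# The vocabulary and field names are illustrative assumptions, not from the patent.
field_names = ["lan_id", "cust_id", "order_time"]

# Map each distinct name to an integer index.
index = {name: i for i, name in enumerate(sorted(set(field_names)))}

def one_hot(name: str) -> list[int]:
    """Return an N-bit vector with a single 1 at the position of `name`."""
    vec = [0] * len(index)
    vec[index[name]] = 1
    return vec

for name in field_names:
    print(name, one_hot(name))
```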
CBOW (Continuous Bag Of Words) model: CBOW is a variant of the word2vec model that predicts the central word from its context in a single pass. All words in the context window (excluding the central word itself) are used as input, and CBOW outputs the word most likely to appear at the center.
Probabilistic Latent Semantic Analysis (PLSA) model: probabilistic latent semantic analysis, also called probabilistic latent semantic indexing, is an unsupervised learning method that performs topic analysis on a text collection with a probabilistic generative model. Given a text collection in which each text discusses several topics and each topic is represented by several words, probabilistic latent semantic analysis of the collection reveals the topics of each text and the words of each topic. The most important characteristic of the model is that it uses a probabilistic generative model for topic analysis of text and represents topics with latent variables. Through the model, a text collection can be converted into text-word co-occurrence data, i.e. a word-text matrix.
Example 1
With the rapid development of the big-data era, enterprises that have undergone years of informatization have accumulated a large number of IT systems and a large amount of metadata. However, the aggregation of large-scale data brings unreasonable data distribution and sharing, leakage of private data, misuse of data and low data quality, so problems such as untrustworthy data-driven decisions urgently need to be solved.
Metadata management is the foundation of data management. For an enterprise, successfully realizing data value must rely on high-quality metadata: only accurate, complete and consistent metadata can be used to build a high-value metadata catalog and thereby achieve service data value acquisition, business model innovation and management risk control. For a telecommunications carrier, the high degree of informatization means tens of thousands of systems, so its metadata is also enormous. Although these data have already formed metadata and converged into a big data lake, forming a metadata repository, the metadata quality is low because of empty fields, garbled characters, missing Chinese comments, and large numbers of duplicated tables and fields, so the metadata cannot provide strong support for business decisions and data applications.
In general, enterprise-level metadata management adopts either a centralized or a distributed architecture. A centralized architecture facilitates standardized, unified management and application of metadata, but to guarantee data consistency it must process a large amount of data and places high demands on storage and the platform. A distributed architecture keeps metadata up to date and effective and makes queries simple, but it has difficulty guaranteeing data consistency, and the data standards of different data sources are hard to unify.
The related art therefore cannot guarantee the accuracy, integrity and consistency of metadata, so metadata quality is poor and enterprise data is difficult to manage effectively through metadata.
To solve the foregoing problems, an embodiment of the present application provides a metadata management method, in which a pre-trained data classification model classifies the first metadata into first hot metadata or first cold metadata, where the first cold metadata is not put onto the chain for the time being. The second sub-hot metadata in the first hot metadata that does not meet the preset data standard is accurately corrected, and the corrected second sub-hot metadata and the first sub-hot metadata are synchronized onto a blockchain to realize the shared transaction of metadata within the enterprise. This guarantees metadata quality and consistency, keeps the metadata in the latest, effective state, and keeps the query mode relatively simple.
It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system such as a set of computer-executable instructions and that, although a logical order is illustrated in the flowcharts, in some cases, the steps illustrated or described may be performed in an order different than here.
Fig. 1 is a schematic flowchart of an alternative metadata management method according to an embodiment of the present application, and as shown in fig. 1, the method at least includes steps S102-S108, where:
Step S102, acquiring all stock first metadata generated before the target time in the target data system, and generating a metadata base based on the first metadata.
According to an optional embodiment of the present application, the collection of all metadata in the data system may be implemented as follows: first, a target collection task is determined, the target collection task at least comprising table-building statements, path information, required authority and collection frequency; data is collected from a plurality of relational and non-relational databases in the target data system based on the target collection task, to obtain all stock technical metadata before the target time; and data cleaning and conversion are performed on the technical metadata to obtain the first metadata, the first metadata at least comprising data source system information, database information, data table information, table field information, index information and constraint information.
For example, FIG. 2 shows a flow diagram of an optional metadata management method. As shown in FIG. 2, data collection is performed, according to the collection task, on the relational and non-relational databases of data systems such as the MSS domain, the OSS domain, the BSS domain and the network-element platform at the current time, so that all technical metadata before the current time are obtained. The technical metadata form a data resource list; the data resource list is cleaned and converted, records that are duplicated, garbled, empty or lacking Chinese comments are deleted, and the format of all data is unified to obtain the first metadata. Finally, all first metadata are gathered into the big data lake to form the metadata base. The metadata base can serve applications in different scenarios, the management of enterprise-level data asset catalogs, and big-data decision analysis.
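To make the collection-and-cleaning step above more concrete, here is a hedged Python sketch with a hypothetical collection task and a simple pandas cleaning routine; the table-building statement, path, authority, column names and cleaning rules are assumptions, not the patent's actual configuration.

```python
import pandas as pd

# Hypothetical collection task (assumed fields, mirroring the description above:
# table-building statement, path information, required authority, collection frequency).
collection_task = {
    "create_table_sql": "CREATE TABLE meta_resource (name TEXT, comment TEXT, src TEXT)",
    "path": "/warehouse/mss/meta/",          # assumed path
    "required_authority": "metadata_reader", # assumed role
    "frequency": "daily",
}

def clean_resource_list(df: pd.DataFrame) -> pd.DataFrame:
    """Drop empty fields, garbled names and duplicates, and unify the format."""
    df = df.dropna(subset=["name"]).copy()                   # remove empty fields
    df["name"] = df["name"].str.strip().str.lower()          # unify format
    df = df[df["name"].str.match(r"^[\w\u4e00-\u9fff]+$")]   # drop garbled names (assumed rule)
    return df.drop_duplicates()                              # remove duplicated records

raw = pd.DataFrame({"name": ["lan_id", "lan_id", None, "cust_id "],
                    "comment": ["LAN id", "LAN id", "", "customer id"],
                    "src": ["MSS", "MSS", "OSS", "BSS"]})
print(clean_resource_list(raw))
```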
And step S104, classifying all first metadata in the metadata database by using a pre-trained data classification model, and dividing each first metadata into first hot metadata or first cold metadata.
According to an alternative embodiment of the present application, the training of the data classification model may be accomplished by steps S1-S4:
Step S1, a GRU neural network model to be trained (i.e., a neural network model based on a gated recurrent unit) is constructed, where the GRU neural network model comprises an input layer, an output layer, a reset gate and an update gate.
Specifically, FIG. 3 shows a schematic diagram of an alternative GRU neural network model. As shown in FIG. 3, the input x_t at the current time and the hidden state h_{t-1} passed from the previous node, which carries the information of the previous node, are input into the GRU neural network model to obtain the output value of the current hidden layer and the hidden state h_t passed to the next node.

The training word vector x_t at time t, the hyperparameter value c and the output value h_{t-1} of the GRU hidden layer at the previous time are combined into a new word vector [h_{t-1}, x_t, c], which serves as the input vector of the model.

The word vector [h_{t-1}, x_t, c] is input into the reset gate, which determines how much past information needs to be forgotten; the expression of the reset gate is as follows:

r_t = σ(w_r · [h_{t-1}, x_t, c])

where w_r represents the weight parameter of the reset gate.

The new word vector [h_{t-1}, x_t, c] is input into the update gate, which determines how much past information is passed into the future, i.e. how much of the information of the previous time and the current time needs to be passed on; the expression of the update gate is as follows:

z_t = σ(w_z · [h_{t-1}, x_t, c])

where w_z is the weight parameter of the update gate.

It should be noted that the reset gate and the update gate both use the sigmoid activation function σ and differ only in the parameters of their linear transformations. The sigmoid function maps a real number into the interval (0, 1) and is expressed as follows:

σ(x) = 1 / (1 + e^(-x))

The reset gate output r_t is used to form the candidate hidden state h̃_t, which represents the new information at the current time; the expression of the candidate hidden layer is as follows:

h̃_t = tanh(W · [r_t * h_{t-1}, x_t, c])

where W is the weight parameter of the candidate hidden layer and tanh is the hyperbolic tangent function:

tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))

The candidate hidden state h̃_t is then combined with the hidden state of the previous time step through the update gate to obtain the output value h_t of the hidden layer, which is represented by the following formula:

h_t = (1 - z_t) * h_{t-1} + z_t * h̃_t

Finally, the output value h_t of the hidden layer is input into the output layer to obtain the final output result y; the expression of the output layer is as follows:

y = w_y * h_t + b_y

where w_y is the weight parameter of the output layer and b_y is the bias of the output layer.
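A minimal NumPy sketch of the forward pass described above is given below; the toy dimensions, random initialization and hyperparameter value are illustrative assumptions only, not the patent's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy dimensions (assumptions): word-vector part of the input has size d, hidden state size h.
d, h = 4, 3
w_r = rng.normal(size=(h, h + d + 1))   # reset-gate weights over [h_{t-1}, x_t, c]
w_z = rng.normal(size=(h, h + d + 1))   # update-gate weights
W   = rng.normal(size=(h, h + d + 1))   # candidate-hidden-layer weights
w_y, b_y = rng.normal(size=(1, h)), 0.0 # output layer

def gru_step(h_prev, x_t, c):
    """One GRU step on the concatenated vector [h_{t-1}, x_t, c], as described above."""
    v = np.concatenate([h_prev, x_t, [c]])
    r_t = sigmoid(w_r @ v)                          # reset gate
    z_t = sigmoid(w_z @ v)                          # update gate
    v_reset = np.concatenate([r_t * h_prev, x_t, [c]])
    h_cand = np.tanh(W @ v_reset)                   # candidate hidden state
    h_t = (1.0 - z_t) * h_prev + z_t * h_cand       # new hidden state
    y = (w_y @ h_t + b_y).item()                    # output layer
    return h_t, y

h_t = np.zeros(h)
for x_t in rng.normal(size=(5, d)):                 # five toy word vectors
    h_t, y = gru_step(h_t, x_t, c=0.5665)           # assumed hyperparameter (cf. the worked example below)
print("final output y:", y)
```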
And S2, determining the operation times and the latest operation time of each first metadata in the metadata base in a target time period by analyzing the metadata operation logs, and sequencing the first metadata according to the operation times and the latest operation time.
The number of operations may be the number of accesses or the number of modifications of the first metadata.
Specifically, the number of recent accesses or modifications and the access time or modification time of each first metadata in the metadata base can be determined by parsing the metadata operation log. Each first metadata in the metadata base is then ranked in ascending order of its number of operations to obtain a sequence number A for each metadata, and ranked from the least recent to the most recent access or modification time to obtain a sequence number B for each first metadata.
For example, when the number of times of modification of the xth first metadata in the metadata base is arranged at 268 th in all 1000 metadata, A is equal to 268/1000. When the access time of the xth first metadata in the metadata database is arranged at the 865 th bit in all 1000 metadata, B is equal to 865/1000, so that the hyperparameter C of the xth first metadata is equal to 1133/2000.
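For concreteness, a short Python sketch reproducing the worked example above (the equal weighting of the two ranks is implied by the value 1133/2000):

```python
# Worked example from the text: the x-th first metadata ranks 268th by modification
# count and 865th by access time among 1000 metadata entries.
total = 1000
A = 268 / total      # normalized rank by number of operations
B = 865 / total      # normalized rank by latest access/modification time
C = (A + B) / 2      # hyperparameter as the average of the two ranks
print(A, B, C)       # 0.268 0.865 0.5665  (i.e. 1133/2000)
```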
S3, determining a training word vector and a hyper-parameter corresponding to the first metadata for each piece of first metadata, wherein the training word vector at least comprises: the meta-data name of the first meta-data, the number of operations of the first meta-data in the target time period, and the latest operation time of the first meta-data in the target time period, and the hyper-parameter is a weighted average of the ranking of the number of operations of the first meta-data and the ranking of the latest operation time of the first meta-data.
Specifically, the training word vector of the first metadata may be determined according to a format [ a metadata name of the first metadata, a recent operation number of the first metadata, a recent access time or modification time of the first metadata ], as shown in table 1.
TABLE 1
First metadata name | Number of operations | Access time or modification time
lan_id              | 55                   | 2022/05/30 16:46
cust_id             | 34                   | 2021/12/25 09:27
Meanwhile, in order to improve the performance and effect of the data classification model, the optimal hyper-parameter of the model at the beginning of the learning process is usually determined in advance. In the embodiment of the application, the corresponding hyper-parameter is determined according to the operation times of the first metadata in the target time period and the latest operation time of the first metadata in the target time period, and the determined hyper-parameter is used as the optimal hyper-parameter of the GRU neural network model.
Specifically, after the first metadata are ranked by number of operations and by latest operation time to obtain the sequence number A and the sequence number B, the hyperparameter C can be calculated from A and B as follows:

C = (A + B) / 2
and S4, sequentially inputting each training word vector and the hyper-parameter into the neural network model for iterative training, and adjusting the model parameters of the neural network model based on a back propagation algorithm to obtain a data classification model.
Specifically, a loss function J of the model is determined from the training word vectors and hyperparameters corresponding to the first metadata, and the partial derivatives of the loss function J with respect to the input layer, the output layer, the update gate and the reset gate are computed for use in back-propagation.
and then updating model parameters of the model through a back propagation algorithm, so that the trained GRU neural network model is more accurate, and the performance and effect of the obtained data classification model are better.
Therefore, the output value y of the GRU neural network model can be used as the output of the final data classification model, with y ∈ [0, 1]. When the output of the GRU neural network model is y = 1, the first metadata is classified as first hot metadata; when the output is y = 0, the first metadata is classified as first cold metadata. Through the data classification model, all first metadata in the metadata base can be classified accurately.
After the classification of the first metadata is completed, the hot metadata in the first hot metadata that does not meet the data standard is accurately corrected through the following steps, and the corrected hot metadata is synchronized onto the blockchain.
Step S106, determining, among the plurality of first hot metadata, first sub-hot metadata that meets a preset data standard and second sub-hot metadata that does not meet the preset data standard, and performing data correction on the second sub-hot metadata.
According to an optional embodiment of the present application, the data correction of the second sub-hot metadata may be implemented as follows: for each second sub-hot metadata, determining a plurality of table metadata in the data source system corresponding to the second sub-hot metadata, and determining a plurality of field metadata in each table metadata; vectorizing the plurality of field metadata with one-hot encoding, determining the central word of each table metadata based on the continuous bag-of-words model algorithm, and determining the central topics of the plurality of table metadata based on the probabilistic latent semantic analysis algorithm; determining, from the target data system, a target sub-data system matching the central topic, and determining a correction value of the second sub-hot metadata based on data in the target sub-data system; and performing data correction on the second sub-hot metadata based on the correction value.
Specifically, first, the set of data source systems is denoted as P = {P_1, P_2, P_3, …, P_t, …, P_m}, where P_t represents the t-th data source system that does not meet the preset data standard and m represents the total number of data source systems that do not meet the preset data standard. Since each data source system contains a plurality of table metadata, the set of table metadata in a data source system can be written as P(x) = {T_1, T_2, T_3, …, T_t, …, T_n}, where T_t represents the t-th table metadata and n represents the total number of table metadata in the data source system. In addition, each table metadata includes a plurality of field metadata, so the set of field metadata in one table metadata can be written as T_x = {C_1, C_2, C_3, …, C_t, …, C_k}, where C_t represents the t-th field metadata and k represents the total number of field metadata in the table metadata.
Secondly, each field metadata in the table metadata is vectorized with One-Hot encoding, and the central word of each field metadata in the table metadata is determined with the CBOW (continuous bag-of-words) algorithm. If the field metadata T(x) is denoted as w_input and the predicted central word is denoted as w_t, the probability of the central word of each field metadata in the table metadata can be expressed as:

p(w_t | w_input) = exp(w_j) / Σ_i exp(w_i)

Preferably, the maximum conditional probability is used to determine the central word of each field metadata in each table metadata, which can be expressed by the following formula:

max p(w_t | w_input) = max log p(w_t | w_input) = w_j - log Σ_i exp(w_i)
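As a small illustration of the softmax scoring above, the following NumPy sketch uses made-up candidate words and scores; none of the values come from the patent.

```python
import numpy as np

# Hypothetical raw scores w_j for three candidate central words of one table metadata.
candidates = ["cust_id", "lan_id", "order_id"]
scores = np.array([2.1, 0.3, 1.2])                    # assumed CBOW output scores, not real data

log_probs = scores - np.log(np.sum(np.exp(scores)))   # log p(w_t | w_input)
center = candidates[int(np.argmax(log_probs))]        # central word with maximum probability
print(center, np.exp(log_probs))
```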
after determining the core word for each field metadata in each table metadata, a PLSA topic model (i.e., probabilistic latent semantic analysis) algorithm may be utilized, where the probability P (P) is first applied n ) Selecting a data source system P from a metadata set P m Then with probability P (T) n |P n ) Selecting a piece of table metadata T from a data source system set P (x) n Finally with probability P (C) k |T n ) Selecting a field metadata C from the table metadata T (x) k The calculation is carried out through a maximum likelihood function, and the calculation formula is as follows:
Figure BDA0003898763700000111
wherein n (P) in the above formula m |C k ) Is represented by (P) m |C k ) The number of occurrences, m 'represents the number of table metadata, and k' represents the number of field metadata.
Taking logarithms on two sides of the formula respectively, and finally taking the prediction center main body with the maximum probability as the center theme of each table metadata of each data system, wherein the specific calculation formula is as follows:
Figure BDA0003898763700000112
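To make the topic-analysis step more concrete, the following is a compact EM sketch of probabilistic latent semantic analysis over a made-up co-occurrence matrix; the counts, the number of latent topics and the mapping of rows and columns to source systems and field metadata are assumptions, not data from the patent.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed toy co-occurrence counts n(P_m, C_k): rows = data source systems, columns = field metadata.
n = np.array([[4, 2, 0, 1],
              [0, 1, 5, 3],
              [3, 0, 1, 4]], dtype=float)
D, W = n.shape          # number of source systems, number of field-metadata "words"
Z = 2                   # number of latent topics (table-metadata themes), an assumption

p_z_d = rng.dirichlet(np.ones(Z), size=D)   # P(z | d)
p_w_z = rng.dirichlet(np.ones(W), size=Z)   # P(w | z)

for _ in range(50):                          # EM iterations
    # E-step: responsibility P(z | d, w) for every (d, w) pair.
    post = p_z_d[:, :, None] * p_w_z[None, :, :]           # shape (D, Z, W)
    post /= post.sum(axis=1, keepdims=True) + 1e-12
    # M-step: re-estimate P(w | z) and P(z | d) from expected counts.
    exp_counts = n[:, None, :] * post                      # shape (D, Z, W)
    p_w_z = exp_counts.sum(axis=0)
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    p_z_d = exp_counts.sum(axis=2)
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)

print("most likely central topic of each source system:", p_z_d.argmax(axis=1))
```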
after the central theme of the table metadata is determined, the sub-data system matched with the central theme type is determined from the data system, and the correction value of the second sub-hot metadata is determined according to the data in the sub-data system, so that the second sub-hot metadata is accurately corrected to meet the preset data standard of the data system.
Step S108, synchronizing the first sub-hot metadata and the data-corrected second sub-hot metadata onto the blockchain.
According to an optional embodiment of the application, the first sub-hot metadata and the data-corrected second sub-hot metadata are uploaded to a target blockchain platform, where the target blockchain platform is used to verify the first sub-hot metadata and the data-corrected second sub-hot metadata and confirm rights to them, and to perform trusted transactions on the first sub-hot metadata and the data-corrected second sub-hot metadata in a smart-contract manner.
Specifically, the second sub-hot metadata corrected in step S106 and the first sub-hot metadata are uploaded to the blockchain platform. On the blockchain platform, the first sub-hot metadata and the data-corrected second sub-hot metadata are verified and their rights confirmed as valid hot metadata, and query and retrieval operations are performed on the first sub-hot metadata and the data-corrected second sub-hot metadata in the form of data-service smart contracts according to the requirements of enterprise systems. The query and retrieval operations execute SQL statements against the metadata base through the PostgreSQL external (foreign) table function and return the results, realizing the trusted transaction.
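As a hedged illustration of querying metadata through a PostgreSQL external (foreign) table, the sketch below uses psycopg2 with an assumed foreign-table name and connection parameters; the smart-contract integration on the blockchain platform itself is outside the scope of this sketch.

```python
import psycopg2

# Assumed connection parameters and foreign-table name; not taken from the patent.
conn = psycopg2.connect(host="localhost", dbname="metadata_db",
                        user="metadata_reader", password="***")
try:
    with conn.cursor() as cur:
        # The foreign table "chain_hot_metadata" is assumed to be mapped onto the
        # blockchain platform's metadata store via a foreign data wrapper.
        cur.execute(
            "SELECT name, source_system, modified_at "
            "FROM chain_hot_metadata WHERE source_system = %s",
            ("BSS",),
        )
        for row in cur.fetchall():
            print(row)
finally:
    conn.close()
```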
After the on-chain operation for the stock data (i.e., the first metadata) of the data system at and before the target time is completed through steps S106 to S108, the on-chain operation for the incremental data of the data system after the target time needs to be implemented.
According to an optional embodiment of the present application, incremental second metadata generated after the target time in the target data system is first obtained; data verification is then performed on the second metadata, the data verification at least comprising data accuracy verification, data integrity verification and data consistency verification; and when the second metadata passes the data verification, the second metadata is added into the metadata base and synchronized onto the blockchain.
The data accuracy check verifies data uniqueness and includes single-index value-range uniqueness detection, joint value-range uniqueness detection over multiple indexes in the same group, and value-range uniqueness detection over multiple indexes in multiple groups. The data integrity check includes non-null detection of a single index value range and joint non-null detection over multiple indexes in the same group. The data consistency check includes field-based detection, index-based detection and calculation-based detection.
Optionally, the data verification of the second metadata may be implemented as follows: the second metadata is input into a pre-trained data verification model to obtain a target confidence output by the data verification model, where the data verification model performs accuracy verification, integrity verification and consistency verification on input data and outputs the confidence that the data passes verification; when the target confidence is greater than a preset confidence threshold, the second metadata is determined to pass data verification; and when the target confidence is not greater than the preset confidence threshold, the second metadata is sent to a manual verification module for manual data verification.
Specifically, according to the settings of the collection task in step S102, metadata is continuously added to the metadata base of the enterprise system. The second metadata newly added to the data system after the target time can therefore be obtained and input into the pre-trained data verification model, which verifies the second metadata against preset verification rules such as data accuracy, data integrity and data consistency and outputs the confidence of the second metadata. The output confidence is compared with the preset confidence threshold to judge whether the second metadata passes data verification; second metadata that passes the data verification is added into the metadata base and synchronized onto the blockchain, while second metadata that fails the data verification goes on to manual verification.
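The routing of incremental metadata can be illustrated with a short Python sketch; the verification model, threshold value and module interfaces are placeholders standing in for the components described above.

```python
from typing import Callable

CONFIDENCE_THRESHOLD = 0.9  # assumed preset confidence threshold

def handle_incremental_metadata(metadata: dict,
                                verify_model: Callable[[dict], float],
                                add_to_metadata_base: Callable[[dict], None],
                                sync_to_blockchain: Callable[[dict], None],
                                send_to_manual_check: Callable[[dict], None]) -> None:
    """Verify incremental (second) metadata and route it as described above."""
    confidence = verify_model(metadata)   # combined accuracy, integrity and consistency checks
    if confidence > CONFIDENCE_THRESHOLD:
        add_to_metadata_base(metadata)    # passed: add to the metadata base
        sync_to_blockchain(metadata)      # and synchronize it onto the blockchain
    else:
        send_to_manual_check(metadata)    # failed: hand over to the manual verification module
```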
Further, after the hot metadata are uploaded to the blockchain platform through steps S106 to S108, the first cold metadata may be reclassified, and the hot metadata among the first cold metadata that meets the preset data standard may be uploaded onto the blockchain.
According to an optional embodiment of the present application, after a preset time period elapses, the data classification model is used again to classify all first cold metadata in the metadata base, and each first cold metadata is divided into second hot metadata or second cold metadata; third sub-hot metadata that meets the preset data standard and fourth sub-hot metadata that does not meet the preset data standard are determined among the plurality of second hot metadata, and data correction is performed on the fourth sub-hot metadata; and the third sub-hot metadata and the data-corrected fourth sub-hot metadata are synchronized onto the blockchain.
For example, every quarter, all first cold metadata in the metadata base are classified by the pre-trained data classification model to distinguish the latest second hot metadata from the latest second cold metadata; the fourth sub-hot metadata in each second hot metadata that does not meet the system's preset data standard is then corrected through step S106; and finally the third sub-hot metadata in the second hot metadata that meets the system's preset data standard and the corrected fourth sub-hot metadata are synchronously uploaded onto the blockchain.
In the embodiments of the application, all stock first metadata generated before a target time in a target data system are acquired, and a metadata base is generated based on the first metadata; all first metadata in the metadata base are classified with a pre-trained data classification model, and each first metadata is divided into first hot metadata or first cold metadata; among the plurality of first hot metadata, first sub-hot metadata that meets a preset data standard and second sub-hot metadata that does not meet the preset data standard are determined, and data correction is performed on the second sub-hot metadata; and the first sub-hot metadata and the data-corrected second sub-hot metadata are synchronized onto the blockchain. After the first metadata are classified by the pre-trained data classification model, the second sub-hot metadata obtained by the classification are accurately corrected, and the corrected second sub-hot metadata and the first sub-hot metadata are synchronized onto the blockchain, realizing the shared transaction of metadata within the enterprise. This guarantees metadata quality and consistency, keeps the metadata in the latest, effective state, and keeps the query mode simple, thereby solving the technical problem that, in the related art, metadata management is disordered, so that metadata quality is low and the metadata is difficult to apply.
Example 2
According to an embodiment of the present application, there is further provided a metadata management apparatus for implementing the metadata management method in embodiment 1, as shown in fig. 4, the metadata management apparatus at least includes an obtaining module 41, a classifying module 42, a modifying module 43, and a synchronizing module 44, where:
an obtaining module 41, configured to obtain first metadata of all stock before the target time in the target data system, and generate a metadata base based on the first metadata.
According to an optional embodiment of the present application, the collection of all metadata in the data system may be implemented as follows: first, a target collection task is determined, the target collection task at least comprising table-building statements, path information, required authority and collection frequency; data is collected from a plurality of relational and non-relational databases in the target data system based on the target collection task, to obtain all stock technical metadata before the target time; and data cleaning and conversion are performed on the technical metadata to obtain the first metadata, the first metadata at least comprising data source system information, database information, data table information, table field information, index information and constraint information.
And the classification module 42 is configured to classify all the first metadata in the metadata base by using the pre-trained data classification model, and divide each first metadata into first hot metadata or first cold metadata.
According to an alternative embodiment of the present application, the classification module 42 may complete the training of the data classification model through steps S1-S4:
Step S1, a GRU neural network model to be trained (i.e., a neural network model based on a gated recurrent unit) is constructed, where the GRU neural network model comprises an input layer, an output layer, a reset gate and an update gate.
And S2, determining the operation times and the latest operation time of each first metadata in the metadata base in a target time period by analyzing the metadata operation logs, and sequencing the plurality of first metadata according to the operation times and the latest operation time.
The number of operations may be the number of accesses or the number of modifications of the first metadata.
Specifically, the number of recent accesses or modifications and the access time or modification time of each first metadata in the metadata base can be determined by parsing the metadata operation log. Each first metadata in the metadata base is then ranked in ascending order of its number of operations to obtain a sequence number A for each metadata, and ranked from the least recent to the most recent access or modification time to obtain a sequence number B for each first metadata.
S3, determining a training word vector and a hyper-parameter corresponding to the first metadata for each piece of first metadata, wherein the training word vector at least comprises: the meta-data name of the first meta-data, the number of operations of the first meta-data in the target time period, and the latest operation time of the first meta-data in the target time period, and the hyper-parameter is a weighted average of the ranking of the number of operations of the first meta-data and the ranking of the latest operation time of the first meta-data.
And S4, sequentially inputting each training word vector and the hyper-parameter into the neural network model for iterative training, and adjusting the model parameters of the neural network model based on a back propagation algorithm to obtain a data classification model.
Specifically, the output value y of the GRU neural network model can be used as the output of the final data classification model, with y ∈ [0, 1]. When the output of the GRU neural network model is y = 1, the first metadata is classified as first hot metadata; when the output is y = 0, the first metadata is classified as first cold metadata. Through the data classification model, all first metadata in the metadata base can be classified accurately.
After the classification of the first metadata is completed, the hot metadata in the first hot metadata that does not meet the data standard needs to be accurately corrected through the following steps, and the corrected hot metadata is synchronized onto the blockchain.
And the modifying module 43, configured to determine, among the plurality of first hot metadata, first sub-hot metadata that meets a preset data standard and second sub-hot metadata that does not meet the preset data standard, and to perform data correction on the second sub-hot metadata.
According to an optional embodiment of the present application, the modifying module 43 may implement data correction of the second sub-hot metadata as follows: for each second sub-hot metadata, determining a plurality of table metadata in the data source system corresponding to the second sub-hot metadata, and determining a plurality of field metadata in each table metadata; vectorizing the plurality of field metadata with one-hot encoding, determining the central word of each table metadata based on the continuous bag-of-words model algorithm, and determining the central topics of the plurality of table metadata based on the probabilistic latent semantic analysis algorithm; determining, from the target data system, a target sub-data system matching the central topic, and determining a correction value of the second sub-hot metadata based on data in the target sub-data system; and performing data correction on the second sub-hot metadata based on the correction value.
And the synchronizing module 44, configured to synchronize the first sub-hot metadata and the data-corrected second sub-hot metadata onto the blockchain.
According to an optional embodiment of the present application, the synchronizing module 44 uploads the first sub-hot metadata and the data-corrected second sub-hot metadata to the target blockchain platform, where the target blockchain platform is used to verify the first sub-hot metadata and the data-corrected second sub-hot metadata and confirm rights to them, and to perform trusted transactions on the first sub-hot metadata and the data-corrected second sub-hot metadata in a smart-contract manner.
Specifically, the second sub-hot metadata corrected by the modifying module 43 and the first sub-hot metadata are uploaded to the blockchain platform. On the blockchain platform, the first sub-hot metadata and the data-corrected second sub-hot metadata are verified and their rights confirmed as valid hot metadata, and query and retrieval operations are performed on the first sub-hot metadata and the data-corrected second sub-hot metadata in the form of data-service smart contracts according to the requirements of enterprise systems. The query and retrieval operations execute SQL statements against the metadata base through the PostgreSQL external (foreign) table function and return the results, realizing the trusted transaction.
After the on-chain operation for the stock data (i.e., the first metadata) of the data system at and before the target time is completed by the above modules, the on-chain operation for the incremental data of the data system after the target time needs to be implemented.
According to an optional embodiment of the present application, incremental second metadata generated after the target time in the target data system is first obtained; data verification is then performed on the second metadata, the data verification at least comprising data accuracy verification, data integrity verification and data consistency verification; and when the second metadata passes the data verification, the second metadata is added into the metadata base and synchronized onto the blockchain.
The data accuracy check verifies data uniqueness and includes single-index value-range uniqueness detection, joint value-range uniqueness detection over multiple indexes in the same group, and value-range uniqueness detection over multiple indexes in multiple groups. The data integrity check includes non-null detection of a single index value range and joint non-null detection over multiple indexes in the same group. The data consistency check includes field-based detection, index-based detection and calculation-based detection.
Optionally, the data verification of the second metadata may be implemented as follows: the second metadata is input into a pre-trained data verification model to obtain a target confidence output by the data verification model, where the data verification model performs accuracy verification, integrity verification and consistency verification on input data and outputs the confidence that the data passes verification; when the target confidence is greater than a preset confidence threshold, the second metadata is determined to pass data verification; and when the target confidence is not greater than the preset confidence threshold, the second metadata is sent to a manual verification module for manual data verification.
Specifically, according to the collection tasks configured in the acquisition module 41, metadata is continuously added to the metadata base of the enterprise system. The second metadata newly added to the data system after the target time can therefore be obtained and input into the pre-trained data verification model, which verifies it against preset rules such as data accuracy, data integrity and data consistency and outputs a confidence for the second metadata. The output confidence is compared with the preset confidence threshold to decide whether the second metadata passes the data verification: second metadata that passes is added to the metadata base and synchronized into the blockchain, while second metadata that fails is passed on for manual verification.
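For illustration only, the sketch below shows this routing logic with the verification model, threshold value and storage back ends replaced by simple placeholders; none of these names or values are fixed by this application.

```python
# Incremental metadata is scored by a verification model; records above the
# confidence threshold go to the metadata base and the blockchain, the rest
# are queued for manual review. Everything here is a stand-in placeholder.
from queue import Queue

CONFIDENCE_THRESHOLD = 0.9  # hypothetical preset value

def verify_and_route(batch, score_fn, metadata_base, chain_log, manual_queue):
    """Route each incremental metadata record by model confidence."""
    for record in batch:
        confidence = score_fn(record)     # accuracy/integrity/consistency score in [0, 1]
        if confidence > CONFIDENCE_THRESHOLD:
            metadata_base.append(record)  # add to the metadata base
            chain_log.append(record)      # synchronize to the blockchain (stubbed)
        else:
            manual_queue.put(record)      # hand off for manual data verification

if __name__ == "__main__":
    base, chain, manual = [], [], Queue()
    score = lambda r: 1.0 if r.get("name") else 0.1   # stand-in for the trained model
    verify_and_route([{"name": "tbl_orders"}, {"name": ""}], score, base, chain, manual)
    print(len(base), len(chain), manual.qsize())      # -> 1 1 1
```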
Furthermore, after the hot metadata have been uploaded to the blockchain platform by the above modules, the first cold metadata can be reclassified, and the metadata in the first cold metadata that have become hot and meet the preset data standard are uploaded to the blockchain.
According to an optional embodiment of the present application, after a preset time period elapses, the data classification model is reused to classify all the first cold metadata in the metadata base, and each first cold metadata is divided into second hot metadata or second cold metadata; determining third sub-thermal metadata which meets a preset data standard and fourth sub-thermal metadata which does not meet the preset data standard in the plurality of second thermal metadata, and performing data correction on the fourth sub-thermal metadata; and synchronizing the third sub-thermal metadata and the data-corrected fourth sub-thermal metadata into the block chain.
It should be noted that the modules in the metadata management apparatus of this embodiment correspond one to one to the implementation steps of the metadata management method in Embodiment 1; since those steps have already been described in detail in Embodiment 1, details not fully set forth in this embodiment may refer to Embodiment 1 and are not repeated here.
Example 3
According to an embodiment of the present application, there is also provided a nonvolatile storage medium including a stored program, where a device in which the nonvolatile storage medium is located executes the metadata management method in embodiment 1 by running the program.
Specifically, the device in which the nonvolatile storage medium is located executes the following steps by running the program: acquiring first metadata of all stock before a target moment in a target data system, and generating a metadata base based on the first metadata; classifying all first metadata in a metadata base by using a pre-trained data classification model, and dividing each first metadata into first hot metadata or first cold metadata; determining first sub-thermal metadata which meets a preset data standard and second sub-thermal metadata which does not meet the preset data standard in the plurality of first thermal metadata, and performing data correction on the second sub-thermal metadata; and synchronizing the first sub-thermal metadata and the data-corrected second sub-thermal metadata into a block chain.
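For illustration only, the following sketch strings the four steps listed above together at a high level; the classifier, data-standard predicate and upload target are placeholders standing in for the components described in Embodiment 1.

```python
# High-level flow of the method: build the metadata base from stock metadata,
# split it into hot and cold, correct the non-compliant hot part, and sync to
# the chain. All callables and fields below are illustrative placeholders.
def correct(m):
    """Placeholder for the correction procedure of Embodiment 1."""
    return m

def manage_metadata(stock_metadata, classifier, meets_standard, upload):
    metadata_base = list(stock_metadata)                        # step 1: metadata base
    hot = [m for m in metadata_base if classifier(m) == "hot"]  # step 2: hot/cold split
    compliant = [m for m in hot if meets_standard(m)]           # step 3a: meets the standard
    corrected = [correct(m) for m in hot if not meets_standard(m)]  # step 3b: corrected
    for m in compliant + corrected:                             # step 4: synchronize to the chain
        upload(m)
    return metadata_base

if __name__ == "__main__":
    chain = []
    rows = [{"name": "tbl_orders", "ops": 120}, {"name": "tmp_bak", "ops": 1}]
    manage_metadata(rows,
                    classifier=lambda m: "hot" if m["ops"] > 10 else "cold",
                    meets_standard=lambda m: not m["name"].startswith("tmp"),
                    upload=chain.append)
    print(chain)
```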
Optionally, after a preset time period, classifying all the first cold metadata in the metadata database by using the data classification model again, and dividing each first cold metadata into second hot metadata or second cold metadata; determining third sub-thermal metadata which meets a preset data standard and fourth sub-thermal metadata which does not meet the preset data standard in the plurality of second thermal metadata, and performing data correction on the fourth sub-thermal metadata; and synchronizing the third sub-thermal metadata and the data-corrected fourth sub-thermal metadata into the block chain.
Optionally, the training process of the data classification model includes: constructing a neural network model to be trained based on a gated recurrent unit, wherein the neural network model comprises an input layer, an output layer, a reset gate and an update gate; determining, by analyzing the metadata operation logs, the number of operations and the latest operation time of each first metadata in the metadata base within a target time period, and ranking the first metadata by the number of operations and by the latest operation time respectively; for each first metadata, determining a training word vector and a hyper-parameter corresponding to the first metadata, wherein the training word vector at least comprises: the metadata name of the first metadata, the number of operations of the first metadata in the target time period and the latest operation time of the first metadata in the target time period, and the hyper-parameter is a weighted average of the operation-count ranking and the latest-operation-time ranking of the first metadata; and sequentially inputting each training word vector and hyper-parameter into the neural network model for iterative training, and adjusting the model parameters of the neural network model based on a back propagation algorithm to obtain the data classification model.
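For illustration only, the following PyTorch sketch shows one possible form of such a GRU-based hot/cold classifier. The feature layout (hashed name tokens plus operation count, latest operation time and the ranking hyper-parameter), the equal ranking weights and all dimensions are assumptions made for the sketch and are not fixed by this application.

```python
import torch
import torch.nn as nn

class HotColdGRU(nn.Module):
    """Input layer (embedding + numeric features), a GRU (whose reset and
    update gates are internal to nn.GRU), and an output layer with hot/cold logits."""
    def __init__(self, vocab_size=1000, embed_dim=16, hidden_dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim + 3, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, 2)

    def forward(self, name_tokens, numeric):
        # name_tokens: (B, T) hashed tokens of the metadata name
        # numeric:     (B, 3) operation count, latest operation time, ranking hyper-parameter
        x = self.embed(name_tokens)
        x = torch.cat([x, numeric.unsqueeze(1).expand(-1, x.size(1), -1)], dim=-1)
        _, h = self.gru(x)
        return self.out(h.squeeze(0))

# toy batch of 4 metadata items, names hashed to 5 token ids each
names = torch.randint(0, 1000, (4, 5))
# ranking hyper-parameter: weighted average of the operation-count ranking and
# the latest-operation-time ranking (equal weights are an assumption)
count_rank = torch.tensor([0.10, 0.90, 0.50, 0.05])
recency_rank = torch.tensor([0.20, 0.70, 0.30, 0.05])
hyper = 0.5 * count_rank + 0.5 * recency_rank
numeric = torch.stack([torch.tensor([120., 3., 45., 200.]),  # operation count
                       torch.tensor([0.9, 0.1, 0.6, 1.0]),   # latest operation time (scaled)
                       hyper], dim=1)
labels = torch.tensor([0, 1, 0, 0])                          # 0 = hot, 1 = cold

model = HotColdGRU()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
for _ in range(10):                                          # iterative training with backpropagation
    optimizer.zero_grad()
    loss = loss_fn(model(names, numeric), labels)
    loss.backward()
    optimizer.step()
```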
Optionally, performing data correction on the second sub-thermal metadata includes: for each second sub-thermal metadata, determining a plurality of table metadata in the data source system corresponding to the second sub-thermal metadata, and determining a plurality of field metadata in each table metadata; vectorizing the plurality of field metadata through one-hot encoding, determining a central word of each table metadata based on a continuous bag-of-words model algorithm, and determining a central theme of the plurality of table metadata based on a probabilistic latent semantic analysis algorithm; determining a target sub data system matching the central theme from the target data system, and determining a correction value of the second sub-thermal metadata based on data in the target sub data system; and performing data correction on the second sub-thermal metadata based on the correction value.
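For illustration only, the sketch below walks through the vectorization and topic steps on invented table metadata: gensim's CBOW Word2Vec supplies the central word of each table, and scikit-learn's NMF over a binary (one-hot) term matrix is used as a simplified stand-in for probabilistic latent semantic analysis. All table names, field names and parameters are hypothetical.

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import NMF

# each table's metadata reduced to its list of field names (invented examples)
tables = {
    "ord_detail": ["order_id", "customer_id", "order_amount", "order_date"],
    "cust_info":  ["customer_id", "customer_name", "customer_level"],
    "pay_flow":   ["order_id", "pay_amount", "pay_channel", "pay_date"],
}

# CBOW word vectors over the field-name "sentences" (sg=0 selects CBOW)
w2v = Word2Vec(sentences=list(tables.values()), vector_size=32, window=3,
               min_count=1, sg=0, epochs=50, seed=7)

def central_word(fields):
    """Field whose CBOW vector is closest to the mean vector of the table."""
    vecs = np.stack([w2v.wv[f] for f in fields])
    mean = vecs.mean(axis=0)
    return fields[int(np.argmin(np.linalg.norm(vecs - mean, axis=1)))]

centers = {t: central_word(f) for t, f in tables.items()}

# binary (one-hot style) term matrix over tables, then a one-topic NMF as a
# simplified stand-in for pLSA to surface the shared central theme
docs = [" ".join(f) for f in tables.values()]
vec = CountVectorizer(binary=True, token_pattern=r"[^ ]+").fit(docs)
H = NMF(n_components=1, init="nndsvda", random_state=0).fit(vec.transform(docs)).components_
terms = vec.get_feature_names_out()
central_theme_terms = [terms[i] for i in H[0].argsort()[::-1][:3]]

print(centers)
print(central_theme_terms)   # e.g. fields shared across tables, such as order_id
```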
Example 4
According to an embodiment of the present application, there is also provided a processor configured to execute a program, where the program executes the metadata management method in embodiment 1.
Specifically, the program executes the following steps when running: acquiring first metadata of all stock before a target moment in a target data system, and generating a metadata base based on the first metadata; classifying all first metadata in a metadata database by using a pre-trained data classification model, and dividing each first metadata into first hot metadata or first cold metadata; determining first sub-thermal metadata which meets a preset data standard and second sub-thermal metadata which does not meet the preset data standard in the plurality of first thermal metadata, and performing data correction on the second sub-thermal metadata; and synchronizing the first sub-thermal metadata and the data-corrected second sub-thermal metadata into a block chain.
Optionally, after a preset time period, classifying all the first cold metadata in the metadata database by using the data classification model again, and dividing each first cold metadata into second hot metadata or second cold metadata; determining third sub-thermal metadata which meets a preset data standard and fourth sub-thermal metadata which does not meet the preset data standard in the plurality of second thermal metadata, and performing data correction on the fourth sub-thermal metadata; and synchronizing the third sub-thermal metadata and the data-corrected fourth sub-thermal metadata into the block chain.
Optionally, the training process of the data classification model includes: constructing a neural network model to be trained based on a gated recurrent unit, wherein the neural network model comprises an input layer, an output layer, a reset gate and an update gate; determining, by analyzing the metadata operation logs, the number of operations and the latest operation time of each first metadata in the metadata base within a target time period, and ranking the first metadata by the number of operations and by the latest operation time respectively; for each first metadata, determining a training word vector and a hyper-parameter corresponding to the first metadata, wherein the training word vector at least comprises: the metadata name of the first metadata, the number of operations of the first metadata in the target time period and the latest operation time of the first metadata in the target time period, and the hyper-parameter is a weighted average of the operation-count ranking and the latest-operation-time ranking of the first metadata; and sequentially inputting each training word vector and hyper-parameter into the neural network model for iterative training, and adjusting the model parameters of the neural network model based on a back propagation algorithm to obtain the data classification model.
Optionally, performing data correction on the second sub-thermal metadata includes: for each second sub-thermal metadata, determining a plurality of table metadata in the data source system corresponding to the second sub-thermal metadata, and determining a plurality of field metadata in each table metadata; vectorizing the plurality of field metadata through one-hot encoding, determining a central word of each table metadata based on a continuous bag-of-words model algorithm, and determining a central theme of the plurality of table metadata based on a probabilistic latent semantic analysis algorithm; determining a target sub data system matching the central theme from the target data system, and determining a correction value of the second sub-thermal metadata based on data in the target sub data system; and performing data correction on the second sub-thermal metadata based on the correction value.
Example 5
According to an embodiment of the present application, there is also provided an electronic device, including: a memory in which a computer program is stored, and a processor configured to execute the metadata management method in embodiment 1 by the computer program.
In particular, the processor is configured to implement the following steps by computer program execution: acquiring first metadata of all stock before a target moment in a target data system, and generating a metadata base based on the first metadata; classifying all first metadata in a metadata base by using a pre-trained data classification model, and dividing each first metadata into first hot metadata or first cold metadata; determining first sub-thermal metadata which meets a preset data standard and second sub-thermal metadata which does not meet the preset data standard in the plurality of first thermal metadata, and performing data correction on the second sub-thermal metadata; and synchronizing the first sub-thermal metadata and the data-corrected second sub-thermal metadata into the block chain.
Optionally, after a preset time period, classifying all the first cold metadata in the metadata database by using the data classification model again, and dividing each first cold metadata into second hot metadata or second cold metadata; determining third sub-thermal metadata which meets a preset data standard and fourth sub-thermal metadata which does not meet the preset data standard in the plurality of second thermal metadata, and performing data correction on the fourth sub-thermal metadata; and synchronizing the third sub-thermal metadata and the data-corrected fourth sub-thermal metadata into the block chain.
Optionally, the training process of the data classification model includes: constructing a neural network model to be trained based on a gated recurrent unit, wherein the neural network model comprises an input layer, an output layer, a reset gate and an update gate; determining, by analyzing the metadata operation logs, the number of operations and the latest operation time of each first metadata in the metadata base within a target time period, and ranking the first metadata by the number of operations and by the latest operation time respectively; for each first metadata, determining a training word vector and a hyper-parameter corresponding to the first metadata, wherein the training word vector at least comprises: the metadata name of the first metadata, the number of operations of the first metadata in the target time period and the latest operation time of the first metadata in the target time period, and the hyper-parameter is a weighted average of the operation-count ranking and the latest-operation-time ranking of the first metadata; and sequentially inputting each training word vector and hyper-parameter into the neural network model for iterative training, and adjusting the model parameters of the neural network model based on a back propagation algorithm to obtain the data classification model.
Optionally, performing data correction on the second sub-thermal metadata includes: for each second sub-thermal metadata, determining a plurality of table metadata in the data source system corresponding to the second sub-thermal metadata, and determining a plurality of field metadata in each table metadata; vectorizing the plurality of field metadata through one-hot encoding, determining a central word of each table metadata based on a continuous bag-of-words model algorithm, and determining a central theme of the plurality of table metadata based on a probabilistic latent semantic analysis algorithm; determining a target sub data system matching the central theme from the target data system, and determining a correction value of the second sub-thermal metadata based on data in the target sub data system; and performing data correction on the second sub-thermal metadata based on the correction value.
The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
In the above embodiments of the present application, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described apparatus embodiments are merely illustrative, and for example, a division of a unit may be a logical division, and in actual implementation, there may be another division, for example, multiple units or components may be combined or may be integrated into another system, or some features may be omitted, or may not be executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a separate product, may be stored in a computer readable storage medium. Based on such understanding, the technical solutions of the present application, in essence or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product stored in a storage medium and including several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
The foregoing is only a preferred embodiment of the present application. It should be noted that those skilled in the art can make several improvements and modifications without departing from the principle of the present application, and these improvements and modifications shall also fall within the protection scope of the present application.

Claims (10)

1. A metadata management method, comprising:
acquiring first metadata of all stock before a target moment in a target data system, and generating a metadata base based on the first metadata;
classifying all the first metadata in the metadata base by using a pre-trained data classification model, and dividing each first metadata into first hot metadata or first cold metadata;
determining first sub-thermal metadata which meets a preset data standard and second sub-thermal metadata which does not meet the preset data standard in the plurality of first thermal metadata, and performing data correction on the second sub-thermal metadata; and synchronizing the first sub-thermal metadata and the second sub-thermal metadata after data modification into a block chain.
2. The method of claim 1, further comprising:
acquiring incremental second metadata in the target data system after the target time;
performing data verification on the second metadata, wherein the data verification at least comprises: data accuracy verification, data integrity verification and data consistency verification;
when the second metadata passes the data verification, adding the second metadata to the metadata base, and synchronizing the second metadata to the block chain.
3. The method of claim 1, further comprising:
after a preset time period, classifying all the first cold metadata in the metadata base by reusing the data classification model, and dividing each first cold metadata into second hot metadata or second cold metadata;
determining third sub-thermal metadata which meets the preset data standard and fourth sub-thermal metadata which does not meet the preset data standard in the plurality of second thermal metadata, and performing data correction on the fourth sub-thermal metadata;
and synchronizing the third sub-thermal metadata and the data-corrected fourth sub-thermal metadata into a block chain.
4. The method of claim 1, wherein obtaining first metadata for all inventories in the target data system prior to the target time and generating a metadata base based on the first metadata comprises:
determining a target acquisition task, wherein the target acquisition task at least comprises: table building statements, path information, required authority and acquisition frequency;
acquiring data of a plurality of relational databases and non-relational databases in the target data system based on the target acquisition task to obtain technical metadata of all stocks before the target moment;
performing data cleaning and conversion on the technical metadata to obtain the first metadata, wherein the first metadata at least comprises: data source system information, database information, data table information, table field information, index information, and constraint information.
5. The method of claim 1, wherein the training process of the data classification model comprises:
constructing a neural network model to be trained based on a gated recurrent unit, wherein the neural network model comprises an input layer, an output layer, a reset gate and an update gate;
determining the operation times and the latest operation time of each first metadata in the metadata base in a target time period by analyzing a metadata operation log, and sequencing the first metadata according to the operation times and the latest operation time respectively;
for each piece of the first metadata, determining a training word vector and a hyperparameter corresponding to the first metadata, wherein the training word vector at least comprises: the meta-data name of the first meta-data, the number of operations of the first meta-data in the target time period, and the latest operation time of the first meta-data in the target time period, wherein the hyper-parameter is a weighted average of the ranking of the number of operations of the first meta-data and the ranking of the latest operation time of the first meta-data;
and sequentially inputting the training word vectors and the hyperparameters into the neural network model for iterative training, and adjusting model parameters of the neural network model based on a back propagation algorithm to obtain the data classification model.
6. The method of claim 1, wherein performing data modification on the second sub-thermal metadata comprises:
for each second sub-thermal metadata, determining a plurality of table metadata in a data source system corresponding to the second sub-thermal metadata, and determining a plurality of field metadata in each table metadata;
vectorizing the plurality of field metadata through a one-hot coding technology, determining a central word of each table metadata based on a continuous bag-of-words model algorithm, and determining a central theme of the plurality of table metadata based on a probabilistic latent semantic analysis algorithm;
determining a target sub data system matched with the central theme from the target data system, and determining a correction value of the second sub thermal metadata based on data in the target sub data system;
and performing data correction on the second sub-thermal metadata based on the correction value.
7. The method of claim 1, wherein synchronizing the first sub-thermal metadata and the data-modified second sub-thermal metadata into a blockchain comprises:
and uploading the first sub-thermal metadata and the second sub-thermal metadata after data modification to a target block chain platform, wherein the target block chain platform is used for verifying and confirming the first sub-thermal metadata and the second sub-thermal metadata after data modification, and performing a trusted transaction on the first sub-thermal metadata and the second sub-thermal metadata after data modification based on a smart contract.
8. The method of claim 2, wherein performing a data check on the second metadata comprises:
inputting the second metadata into a pre-trained data verification model to obtain a target confidence coefficient output by the data verification model, wherein the data verification model is used for performing accuracy verification, integrity verification and consistency verification on the input data and outputting the confidence coefficient that the data passes the verification;
when the target confidence degree is larger than a preset confidence degree threshold value, determining that the second metadata passes the data verification;
and when the target confidence coefficient is not greater than a preset confidence coefficient threshold value, sending the second metadata to a manual verification module for manual data verification.
9. A metadata management apparatus, characterized by comprising:
the acquisition module is used for acquiring first metadata of all stock before a target moment in a target data system and generating a metadata base based on the first metadata;
the classification module is used for classifying all the first metadata in the metadata base by utilizing a pre-trained data classification model and dividing each first metadata into first hot metadata or first cold metadata;
the correction module is used for determining first sub-thermal metadata meeting a preset data standard and second sub-thermal metadata not meeting the preset data standard in the plurality of first thermal metadata and performing data correction on the second sub-thermal metadata;
and the synchronization module is used for synchronizing the first sub-thermal metadata and the second sub-thermal metadata after data correction into a block chain.
10. An electronic device, comprising: a memory in which a computer program is stored, and a processor configured to execute the metadata management method of any one of claims 1 to 7 by the computer program.
CN202211282562.0A 2022-10-19 2022-10-19 Metadata management method and device Pending CN115587125A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211282562.0A CN115587125A (en) 2022-10-19 2022-10-19 Metadata management method and device


Publications (1)

Publication Number Publication Date
CN115587125A true CN115587125A (en) 2023-01-10

Family

ID=84780111

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211282562.0A Pending CN115587125A (en) 2022-10-19 2022-10-19 Metadata management method and device

Country Status (1)

Country Link
CN (1) CN115587125A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116860564A (en) * 2023-09-05 2023-10-10 山东智拓大数据有限公司 Cloud server data management method and data management device thereof
CN116860564B (en) * 2023-09-05 2023-11-21 山东智拓大数据有限公司 Cloud server data management method and data management device thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination