CN114969001B - Database metadata field matching method, device, equipment and medium

Info

Publication number: CN114969001B
Application number: CN202210570053.1A
Authority: CN
Inventors: 傅玉鑫; 孙永超; 申传旺; 李照川; 罗森; 张艳雪
Original assignee: Chaozhou Zhuoshu Big Data Industry Development Co Ltd
Current assignee: Chaozhou Zhuoshu Big Data Industry Development Co Ltd
Priority date: 2022-05-24
Filing date: 2022-05-24
Publication date: 2024-05-10
Anticipated expiration: 2042-05-24
Also published as: CN114969001A

Abstract

The embodiments of this specification disclose a database metadata field matching method, comprising the following steps: acquiring unmatched database fields, and obtaining trunk feature words of the unmatched database fields; if the trunk feature words of the unmatched database fields exist in the values of a trunk data structure, matching the database metadata fields corresponding to the unmatched database fields in the trunk data structure; if the trunk feature words of the unmatched database fields do not exist in the values of the trunk data structure, converting the trunk feature words of the unmatched database fields into word vectors to be matched according to a pre-trained word vector model, and determining the vector value of the trunk feature words of the unmatched database fields; calculating the similarity between the vector value of the trunk feature words of the unmatched database fields and the vector values of the database metadata fields in a pre-generated vector data structure; and matching the database metadata fields corresponding to the unmatched database fields according to the similarity.

Description

Database metadata field matching method, device, equipment and medium

Technical Field

The present disclosure relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a medium for matching metadata fields of a database.

Background

Database fields are the basic components of a database. Each database field represents one attribute of a database record object and holds the information about that attribute, such as student name, student contact phone, or home address.

A metadata field expresses the meaning of a database field in a more general form with a wider scope, and can serve as a category for classifying database fields. For example, the metadata field of the "student name" field is "name", and the metadata field of the "home address" field is "address".

Database metadata field matching means analyzing database field names to find, for each database field, the metadata field with a similar meaning. One metadata field can match multiple database fields, whereas one database field can have only one corresponding metadata field. Matching database metadata fields clarifies the associations among database fields and divides the database fields into categories, which further improves data quality and makes the logic among the data clearer.

In the prior art, database metadata field matching is mostly carried out manually, which is inefficient and cannot meet the requirements of users.

Disclosure of Invention

One or more embodiments of the present disclosure provide a method, an apparatus, a device, and a medium for matching metadata fields of a database, which are used to solve the following technical problems:

One or more embodiments of the present disclosure adopt the following technical solutions:

One or more embodiments of the present disclosure provide a database metadata field matching method, including:

acquiring unmatched database fields, and performing preset dependency syntax analysis on the unmatched database fields to obtain trunk feature words of the unmatched database fields;

judging whether the trunk feature words of the unmatched database fields exist in a pre-generated trunk data structure;

if the trunk feature words of the unmatched database fields exist in the values of the trunk data structure, matching the database metadata fields corresponding to the unmatched database fields in the trunk data structure;

if the trunk feature words of the unmatched database fields do not exist in the values of the trunk data structure, converting the trunk feature words of the unmatched database fields into word vectors to be matched according to a pre-trained word vector model, and determining the vector value of the trunk feature words of the unmatched database fields;

calculating the similarity between the vector value of the trunk feature words of the unmatched database fields and the vector values of the database metadata fields in a pre-generated vector data structure;

and matching the database metadata fields corresponding to the unmatched database fields according to the similarity.

One or more embodiments of the present disclosure provide a database metadata field matching apparatus, including:

an acquisition analysis unit, configured to acquire unmatched database fields and perform preset dependency syntax analysis on the unmatched database fields to obtain trunk feature words of the unmatched database fields;

a judgment unit, configured to judge whether the trunk feature words of the unmatched database fields exist in a pre-generated trunk data structure;

a first matching unit, configured to match the database metadata fields corresponding to the unmatched database fields in the trunk data structure if the trunk feature words of the unmatched database fields exist in the values of the trunk data structure;

a vector conversion unit, configured to, if the trunk feature words of the unmatched database fields do not exist in the values of the trunk data structure, convert the trunk feature words of the unmatched database fields into word vectors to be matched according to a pre-trained word vector model and determine the vector value of the trunk feature words of the unmatched database fields;

a calculation unit, configured to calculate the similarity between the vector value of the trunk feature words of the unmatched database fields and the vector values of the database metadata fields in a pre-generated vector data structure;

and a second matching unit, configured to match the database metadata fields corresponding to the unmatched database fields according to the similarity.

One or more embodiments of the present specification provide a database metadata field matching device, including:

at least one processor; and

A memory communicatively coupled to the at least one processor; wherein,

The memory stores instructions executable by the at least one processor to enable the at least one processor to:

One or more embodiments of the present specification provide a non-volatile computer storage medium storing computer-executable instructions configured to:

At least one of the above technical solutions adopted in the embodiments of this specification can achieve the following beneficial effects:

Matching metadata fields helps sort out the relationships among data and clarify the similarity relationships among data, thereby improving the quality of data assets. In doing so, the embodiments of this specification use word vectors from natural language processing to judge the similarity between words, which keeps the similarity judgment accurate while remaining efficient. Dependency syntax analysis is used to extract the trunk information from the fields, which reduces the influence of irrelevant business words on the metadata field matching result. In addition, the embodiments of this specification can automatically and quickly match metadata fields for newly added fields, without a metadata field having to be manually selected from many metadata fields and manually matched, which is more efficient.

Drawings

In order to more clearly illustrate the embodiments of the present description or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some of the embodiments described in the present description, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art. In the drawings:

FIG. 1 is a flow diagram of a database metadata field matching method according to one or more embodiments of the present disclosure;

FIG. 2 is a schematic diagram of data preprocessing provided in one or more embodiments of the present disclosure;

FIG. 3 is a schematic diagram of a database metadata field matching process provided by one or more embodiments of the present disclosure;

FIG. 4 is a schematic diagram of a database metadata field matching device according to one or more embodiments of the present disclosure;

FIG. 5 is a schematic structural diagram of a database metadata field matching device according to one or more embodiments of the present disclosure.

Detailed Description

The embodiment of the specification provides a database metadata field matching method, device, equipment and medium.

In order to make the technical solutions in the present specification better understood by those skilled in the art, the technical solutions in the embodiments of the present specification will be clearly and completely described below with reference to the drawings in the embodiments of the present specification, and it is obvious that the described embodiments are only some embodiments of the present specification, not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, shall fall within the scope of the present disclosure.

FIG. 1 is a schematic flow chart of a database metadata field matching method provided in one or more embodiments of the present disclosure. The flow may be executed by a database metadata field matching system. By analyzing database field names, the system automatically finds the metadata field whose meaning is similar to each database field name and matches it, so that manual matching of database metadata fields is no longer needed, the matching is more efficient, and the requirements of users are better met. Some input parameters or intermediate results in the flow allow manual intervention and adjustment to help improve accuracy.

The method flow steps of the embodiment of the present specification are as follows:

S102, obtaining unmatched database fields, and performing preset dependency syntactic analysis on the unmatched database fields to obtain trunk feature words of the unmatched database fields.

In the embodiments of the present disclosure, the preset dependency syntax analysis may obtain the trunk feature words of an unmatched database field as follows: first, dependency syntax analysis is performed on the unmatched database field to generate a dependency syntax tree; next, the part of speech and the dependency relation of each word of the field phrase in the unmatched database field are obtained from the dependency syntax tree; then, according to the parts of speech and dependency relations, the direct object and the noun that does not depend on any other word of the field phrase are determined, and that independent noun is taken as the noun virtual root (ROOT) of the dependency syntax tree; finally, the trunk feature words of the unmatched database field are determined from the direct object and the noun virtual root. For example, in the field phrases "student name" and "detailed address", the trunk information "name" and "address" does not depend on any other word of the field phrase, so it serves as the noun virtual root of the dependency syntax tree.
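The extraction step can be illustrated with a minimal Python sketch. It assumes the dependency parse is already available as a list of tokens; the `Token` class, the label names `ROOT` and `dobj`, the part-of-speech tag `NOUN`, the hand-written parses, and the "update address" phrase are illustrative assumptions and are not specified by this disclosure.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str  # the word itself
    pos: str   # part of speech, e.g. "NOUN" or "VERB" (assumed tag set)
    dep: str   # dependency label, e.g. "ROOT" or "dobj" (assumed label set)
    head: int  # index of the word this token depends on; -1 for the root


def trunk_feature_words(tokens):
    """Keep the noun root and any direct objects of a parsed field phrase."""
    words = []
    for tok in tokens:
        is_noun_root = tok.dep == "ROOT" and tok.pos == "NOUN"
        is_direct_object = tok.dep == "dobj"
        if is_noun_root or is_direct_object:
            words.append(tok.text)
    return words


# Hand-written parse of "student name": "name" is the root, "student" modifies it.
student_name = [
    Token("student", "NOUN", "compound", 1),
    Token("name", "NOUN", "ROOT", -1),
]

# Hypothetical phrase "update address": the verb is the root, "address" its direct object.
update_address = [
    Token("update", "VERB", "ROOT", -1),
    Token("address", "NOUN", "dobj", 0),
]
```

In this sketch a verb root is discarded because only a noun root counts as trunk information, while a direct object is always kept.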

S104, judging whether the trunk feature words of the unmatched database fields exist in a pre-generated trunk data structure.

In the embodiments of the present disclosure, before judging whether the trunk feature words of the unmatched database fields exist in the pre-generated trunk data structure, the trunk data structure needs to be generated, as described below.

First, the required database metadata fields may be determined according to business requirements and the data table structure. The data table structure may be a preset database table structure, which defines the fields, types, primary key, foreign keys, and indexes of a table. Next, pre-acquired database fields are matched with the database metadata fields, and the matched database fields and database metadata fields are stored in a data set. Then, dependency syntax analysis is performed on the database fields in the data set to obtain the trunk feature words of the database fields. Finally, the trunk feature words of the database fields are matched with the database metadata fields, and the matched trunk feature words and database metadata fields are stored in the data set through the trunk data structure. The trunk data structure may be a preset data structure that mainly stores the trunk feature words of the matched database fields together with the database metadata fields.

Further, in the embodiments of the present disclosure, a first key-value pair is stored in the trunk data structure: the database metadata field is the key of the first key-value pair, and a list of the trunk feature words of the database fields corresponding to that database metadata field is the value of the first key-value pair. This storage mode makes the trunk feature words of the database fields and the database metadata fields easier to store and manage.

Further, in the embodiments of the present disclosure, a second key-value pair is stored in a field data structure: the database metadata field is the key of the second key-value pair, and a list of the database fields corresponding to that database metadata field is the value of the second key-value pair. This storage mode makes the database fields and the database metadata fields easier to store and manage.
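As a rough illustration of the two key-value layouts, the following Python sketch models both structures as dictionaries and shows the membership test used in step S104. The variable names, the helper `find_metadata_field`, and the sample entry "teacher name" are assumptions made for the example.

```python
# Field data structure (second key-value pair):
# metadata field -> list of database fields.
field_structure = {
    "name": ["student name", "teacher name"],
    "address": ["home address", "detailed address"],
}

# Trunk data structure (first key-value pair):
# metadata field -> list of trunk feature words.
trunk_structure = {
    "name": ["name"],
    "address": ["address"],
}


def find_metadata_field(trunk_word, structure):
    """Return the key whose value list contains trunk_word, or None."""
    for metadata_field, trunk_words in structure.items():
        if trunk_word in trunk_words:
            return metadata_field
    return None
```

Because the trunk feature words sit in the values rather than the keys, the lookup scans every value list; for large structures an inverted index from trunk word to metadata field would be a natural optimization, although the text does not describe one.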

S106, if the trunk feature words of the unmatched database fields exist in the values of the trunk data structure, matching the database metadata fields corresponding to the unmatched database fields in the trunk data structure.

S108, if the trunk feature words of the unmatched database fields do not exist in the values of the trunk data structure, converting the trunk feature words of the unmatched database fields into word vectors to be matched according to a pre-trained word vector model, and determining the vector value of the trunk feature words of the unmatched database fields.

S110, calculating the similarity between the vector value of the trunk feature words of the unmatched database fields and the vector values of the database metadata fields in the pre-generated vector data structure.

Before calculating the similarity between the vector value of the trunk feature words of the unmatched database fields and the vector values of the database metadata fields in the pre-generated vector data structure, the vector data structure needs to be generated, as described below.

First, the trunk feature words of the database fields corresponding to a database metadata field may be converted into word vectors according to the word vector model, and the word vectors are used as the vector value of the database metadata field. Next, the database metadata field is matched with its vector value, and the matched database metadata field and vector value are stored in the data set. Finally, the matched database metadata field and vector value are stored through a vector data structure in which a third key-value pair is stored: the database metadata field is the key of the third key-value pair, and the vector value of the database metadata field is the value of the third key-value pair.

It should be noted that, in the embodiments of the present disclosure, each database metadata field may correspond to the trunk feature words of multiple database fields. In that case, when the trunk feature words of the database fields corresponding to the database metadata field are converted into word vectors and used as the vector value of the database metadata field, the trunk feature words of the multiple database fields may first be converted into multiple word vectors according to the word vector model; the word vectors are then summed and averaged to obtain an average word vector; finally, the average word vector is used as the vector value of the database metadata field.
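The averaging step can be sketched in plain Python as follows. The three-dimensional toy vectors and the word "surname" are made-up values used only to show the arithmetic; real word vectors would come from the pre-trained word vector model.

```python
def average_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    if not vectors:
        raise ValueError("at least one word vector is required")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all word vectors must have the same dimension")
    # zip(*vectors) groups the i-th components of all vectors together.
    return [sum(component) / len(vectors) for component in zip(*vectors)]


# Toy vectors for two trunk feature words of one metadata field (assumed values).
toy_vectors = {"name": [1.0, 0.0, 2.0], "surname": [3.0, 2.0, 0.0]}
ave_vec = average_vector(list(toy_vectors.values()))
```

With these toy values the average is `[2.0, 1.0, 1.0]`, which would be stored as the value of the third key-value pair for that metadata field.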

S112, matching the database metadata fields corresponding to the unmatched database fields according to the similarity.

In the embodiment of the present specification, the above technical solutions may be further described:

The above-mentioned dependency syntax analysis is a basic method of natural language processing that obtains dependency relationships between individual words in a phrase by generating a dependency syntax tree.

The dependency syntax tree of the embodiment of the present specification has the following features:

(1) There is one and only one word that does not depend on any other word (the ROOT, that is, the virtual root node, also called the imaginary root for short).

(2) Every word other than the ROOT word must depend on another word.

(3) No word can depend on more than one word.

(4) If word A depends on word B, then any word C located between A and B can only depend on A, on B, or on a word between A and B.
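The four features can be checked mechanically. The following Python sketch assumes a phrase is represented by a list `heads`, where `heads[i]` is the index of the word that word `i` depends on and -1 marks the ROOT word; this representation and the function name are assumptions made for illustration. Feature (3) holds by construction, because each position stores a single head.

```python
def is_valid_dependency_tree(heads):
    """Check features (1), (2) and (4) on a head-index list."""
    n = len(heads)
    # Feature (1): exactly one word has no head.
    roots = [i for i, h in enumerate(heads) if h == -1]
    if len(roots) != 1:
        return False
    for i, h in enumerate(heads):
        if h == -1:
            continue
        # Feature (2): every other word depends on a different, existing word.
        if not 0 <= h < n or h == i:
            return False
        # Feature (4): words strictly between i and its head may only depend
        # on words inside the closed span [lo, hi].
        lo, hi = min(i, h), max(i, h)
        for k in range(lo + 1, hi):
            if not lo <= heads[k] <= hi:
                return False
    return True
```

For "student name", where "student" depends on "name", the list is `[1, -1]` and the check passes. The sketch does not test for cycles, so it is a necessary rather than a sufficient test of tree shape.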

Through dependency syntax analysis, the parts of speech of the words and the dependencies between them can be obtained, for example the subject-predicate relationship, the verb-object relationship, and the direct object. The association relations among the words reveal the trunk information of a phrase, so that the modifiers can be removed and the trunk feature words of the phrase obtained.

The word vector used in the embodiments of this specification is another important method of natural language processing, which maps words to multidimensional vectors through one-hot encoding or neural network training. One-hot encoding uses the position of a word in a vocabulary as the corresponding word vector; it is simple to implement, but it can hardly capture the semantic features of words. The embodiments of this specification may instead train word vectors with a neural network: a large amount of text is fed into the neural network to train it on an unsupervised text prediction task, the output of the hidden layer of the trained network is used as the word vector, and the part of the network that extracts these features is used as the word vector generator. Because this method is based on a text prediction task, the word vectors it produces contain the basic semantic features of words, so applying it in the embodiments of this specification gives better results. The method of training word vectors with a neural network may be one of word2vec, GloVe, and BERT; preferably, word2vec is used as the word vector generation method in the embodiments of the present disclosure.
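A short Python sketch shows why one-hot encoding carries no semantic information: the vectors of any two different words are orthogonal, however related the words are. The three-word vocabulary is an assumption chosen for the example.

```python
def one_hot(word, vocabulary):
    """Return a vector with a single 1 at the word's position in the vocabulary."""
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(word)] = 1
    return vector


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


vocabulary = ["name", "surname", "address"]
name_vec = one_hot("name", vocabulary)
surname_vec = one_hot("surname", vocabulary)
address_vec = one_hot("address", vocabulary)
```

Here `dot(name_vec, surname_vec)` and `dot(name_vec, address_vec)` are both 0, so "name" looks equally unrelated to "surname" and to "address". Vectors learned from a text prediction task avoid this limitation, which is the reason given above for preferring them.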

The database metadata field matching in the embodiment of the present specification has the following characteristics:

The purpose of the embodiments of this specification is to automatically and efficiently match each database field to its corresponding database metadata field according to the database field name, and to clarify the associations among data fields, thereby providing a basis for sorting out data relationships and making the data organization clearer.

The specific technical solution of the embodiments of the present specification may be as follows:

1. Data preparation

The data required herein are as follows:

(1) All required database metadata fields, compiled according to the business requirements and the existing table structure.

(2) A data set formed by manually annotating some of the existing database fields with their matching database metadata fields.

(3) A pre-trained word2vec word vector model.

2. Data preprocessing

The annotated data set is loaded and stored in a map data structure that holds multiple key-value pairs: the database metadata field is the key of the map, and a list of the database fields corresponding to that database metadata field is the value. This map may be defined as standard_map.

Because the embodiments of this specification aim to obtain the main meaning of a field after the business-descriptive vocabulary has been removed, dependency syntax analysis is performed on the database field name, and the words corresponding to the noun virtual root (ROOT) and the direct object are used as the trunk feature words of the field. A new map stores the database metadata fields and the trunk feature words of the database fields: the database metadata field is the key of the map, and a list of the trunk feature words of the database fields corresponding to that database metadata field is the value, with duplicate entries removed from the list. This map may be defined as feature_map.
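The construction of standard_map and feature_map might look like the following Python sketch. The function names, the annotated sample pairs, and in particular `last_word` are assumptions: `last_word` is a deliberately crude stand-in that keeps the final word of a field name and replaces the dependency syntax analysis only so that the example stays self-contained.

```python
def last_word(field_name):
    """Hypothetical stand-in for dependency parsing: keep the final word."""
    return field_name.split()[-1]


def build_standard_map(annotated_pairs):
    """Group annotated (database field, metadata field) pairs by metadata field."""
    standard_map = {}
    for database_field, metadata_field in annotated_pairs:
        standard_map.setdefault(metadata_field, []).append(database_field)
    return standard_map


def build_feature_map(standard_map, extract=last_word):
    """Map each metadata field to its de-duplicated trunk feature words."""
    feature_map = {}
    for metadata_field, database_fields in standard_map.items():
        words = [extract(field) for field in database_fields]
        # dict.fromkeys removes duplicates while keeping first-seen order.
        feature_map[metadata_field] = list(dict.fromkeys(words))
    return feature_map


annotated_pairs = [
    ("student name", "name"),
    ("teacher name", "name"),
    ("home address", "address"),
    ("detailed address", "address"),
]
standard_map = build_standard_map(annotated_pairs)
feature_map = build_feature_map(standard_map)
```

The de-duplication matters here: both "student name" and "teacher name" yield the trunk feature word "name", which is stored only once in feature_map.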

To measure the similarity between a database field and a metadata field, the words are converted into word vectors, and the cosine similarity between vectors is calculated to reflect the similarity between words. Therefore, the embodiments of this specification may use the word2vec model to convert each word in the value of each key-value pair in feature_map into a word vector w_vec, and sum and average the word vectors w_vec to obtain the average feature vector ave_vec of each database metadata field. The average feature vectors are stored in a map in which the database metadata field is the key and the average feature vector ave_vec is the value. This map is defined as vector_map.
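Written out, the two quantities used here take the standard forms below. The symbols n (the number of trunk feature words of one metadata field), u, and v (any two word vectors of dimension d) are notation introduced for this illustration.

```latex
\mathrm{ave\_vec} \;=\; \frac{1}{n}\sum_{i=1}^{n} \mathrm{w\_vec}_i ,
\qquad
\cos(\mathbf{u},\mathbf{v}) \;=\;
\frac{\mathbf{u}\cdot\mathbf{v}}{\lVert\mathbf{u}\rVert\,\lVert\mathbf{v}\rVert}
\;=\;
\frac{\sum_{j=1}^{d} u_j v_j}
     {\sqrt{\sum_{j=1}^{d} u_j^{2}}\;\sqrt{\sum_{j=1}^{d} v_j^{2}}}
```

The cosine similarity lies between -1 and 1 and depends only on the angle between the two vectors, not on their lengths.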

The data preprocessing process described above is shown in the data preprocessing schematic diagram of FIG. 2.

3. Database metadata field matching

An unmatched database field is input, dependency syntax analysis is performed on it to generate a dependency syntax tree, and the part of speech and the dependency relation of each word in the database field are obtained from the dependency syntax tree. Because the embodiments of this specification need the trunk information of the database field, the direct object of the field phrase is extracted as trunk information. In addition, for fields such as "student name" and "detailed address", the trunk information "name" and "address" does not depend on any other word of the field phrase and is the virtual root (ROOT) of the dependency syntax tree, so the trunk information of a field should contain the noun virtual root of the dependency syntax tree in addition to the direct object. Therefore, after dependency syntax analysis of a database field, the direct object and the noun virtual root are selected as the trunk information of the field. It should be noted that the trunk information of a database field mentioned here is the trunk feature word of the database field.

The extracted trunk information of the field is searched for in the values of feature_map. If it exists, the field is directly matched with the database metadata field that is the corresponding key in feature_map, and the matching is complete.

If the trunk information does not exist in feature_map, the word2vec model is used to convert the trunk information of the field into the corresponding vector field_vector, and the cosine similarity between field_vector and every ave_vec in vector_map is calculated. A larger cosine similarity indicates that the vectors are closer and the words are more similar in meaning, so the metadata field corresponding to the ave_vec with the largest cosine similarity to field_vector is selected as the match for the database field. The database field is added to standard_map, feature_map is updated at the same time, and the matching is complete.

When calculating the cosine similarity, if the maximum cosine similarity between field_vector and all ave_vec values is smaller than 0.4, the field is not close to any existing metadata field, and a metadata field should be assigned manually or a new metadata field should be added.
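The matching procedure as a whole can be sketched in Python as follows. The function `match_field`, its parameters, the `embed` callable standing in for the word2vec lookup, and all toy vectors are assumptions made for the example; only the three stages (lookup in feature_map, fallback to the largest cosine similarity, the 0.4 threshold) come from the description above.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def match_field(field, trunk_word, feature_map, vector_map, standard_map,
                embed, threshold=0.4):
    """Return the matched metadata field, or None if manual handling is needed."""
    # Stage 1: look the trunk word up in the values of feature_map.
    for metadata_field, trunk_words in feature_map.items():
        if trunk_word in trunk_words:
            return metadata_field
    # Stage 2: fall back to the most similar average vector in vector_map.
    field_vector = embed(trunk_word)
    best_field, best_score = None, -1.0
    for metadata_field, ave_vec in vector_map.items():
        score = cosine_similarity(field_vector, ave_vec)
        if score > best_score:
            best_field, best_score = metadata_field, score
    # Stage 3: below the threshold no existing metadata field is close enough.
    if best_field is None or best_score < threshold:
        return None
    # Record the new match in standard_map and feature_map.
    standard_map.setdefault(best_field, []).append(field)
    if trunk_word not in feature_map.setdefault(best_field, []):
        feature_map[best_field].append(trunk_word)
    return best_field


# Toy two-dimensional data (assumed values, not real word2vec output).
toy_embeddings = {"surname": [0.9, 0.1], "salary": [-1.0, 0.0]}
feature_map = {"name": ["name"], "address": ["address"]}
vector_map = {"name": [1.0, 0.0], "address": [0.0, 1.0]}
standard_map = {"name": ["student name"], "address": ["home address"]}
```

A call such as `match_field("student surname", "surname", feature_map, vector_map, standard_map, toy_embeddings.__getitem__)` takes the vector path and returns "name", whereas the trunk word "salary" falls below the threshold and returns `None`. The text does not say whether vector_map is recalculated after feature_map changes, so the sketch leaves it untouched.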

The database metadata field matching process described above is shown in the schematic diagram of FIG. 3.

It should be noted that the metadata field matching in the embodiments of the present disclosure helps sort out the relationships among data and clarify the similarity relationships among data, thereby improving the quality of data assets. In doing so, the embodiments of this specification use word vectors from natural language processing to judge the similarity between words, which keeps the similarity judgment accurate while remaining efficient. Dependency syntax analysis is used to extract the trunk information from the fields, which reduces the influence of irrelevant business words on the metadata field matching result. In addition, the embodiments of this specification can automatically and quickly match metadata fields for newly added fields, without a metadata field having to be manually selected from many metadata fields and manually matched, which is more efficient. The embodiments of this specification can also generate standard_map, the correspondence between database fields and metadata fields.

Fig. 4 is a schematic structural diagram of a database metadata field matching apparatus according to one or more embodiments of the present disclosure, including: an acquisition analysis unit 402, a judgment unit 404, a first matching unit 406, a vector conversion unit 408, a calculation unit 410, and a second matching unit 412.

The acquisition analysis unit 402 is configured to acquire unmatched database fields and perform preset dependency syntax analysis on the unmatched database fields to obtain trunk feature words of the unmatched database fields;

the judgment unit 404 is configured to judge whether the trunk feature words of the unmatched database fields exist in a pre-generated trunk data structure;

the first matching unit 406 is configured to match the database metadata fields corresponding to the unmatched database fields in the trunk data structure if the trunk feature words of the unmatched database fields exist in the values of the trunk data structure;

the vector conversion unit 408 is configured to, if the trunk feature words of the unmatched database fields do not exist in the values of the trunk data structure, convert the trunk feature words of the unmatched database fields into word vectors to be matched according to a pre-trained word vector model and determine the vector value of the trunk feature words of the unmatched database fields;

the calculation unit 410 is configured to calculate the similarity between the vector value of the trunk feature words of the unmatched database fields and the vector values of the database metadata fields in the pre-generated vector data structure;

and the second matching unit 412 is configured to match the database metadata fields corresponding to the unmatched database fields according to the similarity.

Fig. 5 is a schematic structural diagram of a database metadata field matching device according to one or more embodiments of the present disclosure, including:

at least one processor; and

A memory communicatively coupled to the at least one processor; wherein,

In this specification, the embodiments are described in a progressive manner; for identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, the apparatus, device, and non-volatile computer storage medium embodiments are described relatively briefly because they are substantially similar to the method embodiments; for the relevant parts, refer to the description of the method embodiments.

The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.

The foregoing is merely one or more embodiments of the present description and is not intended to limit the present description. Various modifications and alterations to one or more embodiments of this description will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, or the like, which is within the spirit and principles of one or more embodiments of the present description, is intended to be included within the scope of the claims of the present description.

Claims

1. A method for matching metadata fields of a database, comprising:

2. The method of claim 1, wherein before judging whether the trunk feature words of the unmatched database fields exist in the pre-generated trunk data structure, the method further comprises:

determining the required database metadata fields according to business requirements and a data table structure;

matching database fields acquired in advance with the database metadata fields, and storing the matched database fields and database metadata fields in a data set;

performing dependency syntax analysis on the database fields in the data set to obtain trunk feature words of the database fields;

and matching the trunk feature words of the database fields with the database metadata fields, and storing the matched trunk feature words of the database fields and database metadata fields in the data set through the trunk data structure.

3. The method of claim 2, wherein a first key-value pair is stored in the trunk data structure, the database metadata field is used as the key of the first key-value pair, and a list of the trunk feature words of the database fields corresponding to the database metadata field is used as the value of the first key-value pair.

4. The method of claim 2, wherein a second key-value pair is stored in a field data structure, the database metadata field is used as the key of the second key-value pair, and a list of the database fields corresponding to the database metadata field is used as the value of the second key-value pair.

5. The method of claim 2, wherein before calculating the similarity between the vector value of the trunk feature words of the unmatched database fields and the vector values of the database metadata fields in the pre-generated vector data structure, the method further comprises:

converting, according to the word vector model, the trunk feature words of the database fields corresponding to the database metadata field into word vectors, and using the word vectors as the vector value of the database metadata field;

matching the database metadata field with the vector value of the database metadata field, and storing the matched database metadata field and vector value in the data set;

and storing the matched database metadata field and vector value through a vector data structure, wherein a third key-value pair is stored in the vector data structure, the database metadata field is used as the key of the third key-value pair, and the vector value of the database metadata field is used as the value of the third key-value pair.

6. The method of claim 5, wherein each database metadata field corresponds to the trunk feature words of a plurality of database fields;

and wherein converting, according to the word vector model, the trunk feature words of the database fields corresponding to the database metadata field into word vectors and using the word vectors as the vector value of the database metadata field specifically comprises:

converting, according to the word vector model, the trunk feature words of the plurality of database fields corresponding to the database metadata field into a plurality of word vectors;

summing and averaging the plurality of word vectors to obtain an average word vector;

and using the average word vector as the vector value of the database metadata field.

7. The method of claim 1, wherein performing the preset dependency syntax analysis on the unmatched database fields to obtain the trunk feature words of the unmatched database fields specifically comprises:

performing dependency syntax analysis on the unmatched database fields to generate a dependency syntax tree;

obtaining the part of speech and the dependency relation of each word of the field phrases in the unmatched database fields according to the dependency syntax tree;

determining, according to the parts of speech and dependency relations of the field phrases in the unmatched database fields, the direct object and the noun that does not depend on any other word of the field phrase, and taking that independent noun as the noun virtual root (ROOT) of the dependency syntax tree;

and determining the trunk feature words of the unmatched database fields according to the direct object and the noun virtual root.

8. A database metadata field matching apparatus, comprising:

9. A database metadata field matching device, comprising:

at least one processor; and

A memory communicatively coupled to the at least one processor; wherein,

10. A non-transitory computer storage medium storing computer-executable instructions configured to: