CN114969001A - Database metadata field matching method, device, equipment and medium

Info

Publication number
CN114969001A
Authority
CN
China
Prior art keywords
database
fields
unmatched
trunk
field
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210570053.1A
Other languages
Chinese (zh)
Other versions
CN114969001B (en)
Inventor
傅玉鑫
孙永超
申传旺
李照川
罗森
张艳雪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chaozhou Zhuoshu Big Data Industry Development Co Ltd
Original Assignee
Chaozhou Zhuoshu Big Data Industry Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chaozhou Zhuoshu Big Data Industry Development Co Ltd
Priority to CN202210570053.1A
Publication of CN114969001A
Application granted
Publication of CN114969001B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/242 Query formulation
    • G06F16/243 Natural language query formulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the specification discloses a database metadata field matching method, which comprises the following steps: obtaining unmatched database fields and obtaining trunk characteristic words of the unmatched database fields; if the trunk characteristic words of the unmatched database fields exist in the values of the trunk data structure, matching database metadata fields corresponding to the unmatched database fields in the trunk data structure; if the trunk characteristic words of the unmatched database fields do not exist in the values of the trunk data structure, converting the trunk characteristic words of the unmatched database fields into word vectors to be matched according to a pre-trained word vector model, and determining vector values of the trunk characteristic words of the unmatched database fields; calculating the similarity between the vector value of the trunk characteristic words of the unmatched database fields and the vector value of the database metadata fields in a vector data structure generated in advance; and matching database metadata fields corresponding to the unmatched database fields according to the similarity.

Description

Database metadata field matching method, device, equipment and medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a medium for matching metadata fields of a database.
Background
The database fields are basic components of the database and represent the attributes of a database record object, and each database field comprises information of a specific attribute in the database record object, such as the name of a student, the contact number of the student, the home address and the like.
The metadata field represents the meaning of a database field in a more general and broader form, and database fields can be classified by using the metadata field as a category. For example, the metadata field of the "student name" field is "name", and the metadata field of the "home address" field is "address".
Database metadata field matching means that the database field names are analyzed to find metadata fields with similar meanings, and the two are matched. One metadata field can match a plurality of database fields, but one database field can have only one corresponding metadata field. Matching the metadata fields of the database clarifies the association relationships among the database fields and divides the database fields into categories, so that the data quality is further improved and the logic among the data is clearer.
In the prior art, database metadata field matching is mostly performed manually, which is inefficient and cannot meet the requirements of users.
Disclosure of Invention
One or more embodiments of the present specification provide a database metadata field matching method, apparatus, device, and medium, which are used to solve the following technical problems:
in the prior art, database metadata field matching is mostly performed manually, which is inefficient and cannot meet the requirements of users.
One or more embodiments of the present disclosure adopt the following technical solutions:
one or more embodiments of the present specification provide a database metadata field matching method, including:
obtaining unmatched database fields, and performing preset dependency syntax analysis on the unmatched database fields to obtain trunk characteristic words of the unmatched database fields;
judging whether a trunk characteristic word of the unmatched database field exists in a pre-generated trunk data structure;
if the trunk characteristic words of the unmatched database fields exist in the values of the trunk data structure, matching database metadata fields corresponding to the unmatched database fields in the trunk data structure;
if the trunk characteristic words of the unmatched database fields do not exist in the values of the trunk data structure, converting the trunk characteristic words of the unmatched database fields into word vectors to be matched according to a pre-trained word vector model, and determining vector values of the trunk characteristic words of the unmatched database fields;
calculating the similarity between the vector value of the trunk characteristic words of the unmatched database fields and the vector value of the database metadata fields in a vector data structure generated in advance;
and matching database metadata fields corresponding to the unmatched database fields according to the similarity.
One or more embodiments of the present specification provide a database metadata field matching apparatus, including:
the acquisition and analysis unit is used for acquiring unmatched database fields and carrying out preset dependency syntax analysis on the unmatched database fields to obtain trunk characteristic words of the unmatched database fields;
the judging unit is used for judging whether the trunk characteristic words of the unmatched database fields exist in a pre-generated trunk data structure;
the first matching unit is used for matching a database metadata field corresponding to the unmatched database field in the trunk data structure if the trunk characteristic word of the unmatched database field exists in the values of the trunk data structure;
the vector conversion unit is used for converting the trunk characteristic words of the unmatched database fields into word vectors to be matched according to a pre-trained word vector model if the trunk characteristic words of the unmatched database fields do not exist in the values of the trunk data structure, and determining vector values of the trunk characteristic words of the unmatched database fields;
the calculation unit is used for calculating the similarity between the vector value of the trunk characteristic word of the unmatched database field and the vector value of the database metadata field in a vector data structure generated in advance;
and the second matching unit is used for matching the database metadata fields corresponding to the unmatched database fields according to the similarity.
One or more embodiments of the present specification provide a database metadata field matching device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to:
obtaining unmatched database fields, and performing preset dependency syntax analysis on the unmatched database fields to obtain trunk characteristic words of the unmatched database fields;
judging whether a trunk characteristic word of the unmatched database field exists in a pre-generated trunk data structure;
if the trunk characteristic words of the unmatched database fields exist in the values of the trunk data structure, matching database metadata fields corresponding to the unmatched database fields in the trunk data structure;
if the trunk characteristic words of the unmatched database fields do not exist in the values of the trunk data structure, converting the trunk characteristic words of the unmatched database fields into word vectors to be matched according to a pre-trained word vector model, and determining vector values of the trunk characteristic words of the unmatched database fields;
calculating the similarity between the vector value of the trunk characteristic words of the unmatched database fields and the vector value of the database metadata fields in a vector data structure generated in advance;
and matching database metadata fields corresponding to the unmatched database fields according to the similarity.
One or more embodiments of the present specification provide a non-transitory computer storage medium storing computer-executable instructions configured to:
obtaining unmatched database fields, and performing preset dependency syntax analysis on the unmatched database fields to obtain trunk characteristic words of the unmatched database fields;
judging whether a trunk characteristic word of the unmatched database field exists in a pre-generated trunk data structure;
if the trunk characteristic words of the unmatched database fields exist in the values of the trunk data structure, matching database metadata fields corresponding to the unmatched database fields in the trunk data structure;
if the trunk characteristic words of the unmatched database fields do not exist in the values of the trunk data structure, converting the trunk characteristic words of the unmatched database fields into word vectors to be matched according to a pre-trained word vector model, and determining vector values of the trunk characteristic words of the unmatched database fields;
calculating the similarity between the vector value of the trunk characteristic word of the unmatched database field and the vector value of the database metadata field in a vector data structure generated in advance;
and matching database metadata fields corresponding to the unmatched database fields according to the similarity.
The embodiment of the specification adopts at least one technical scheme which can achieve the following beneficial effects:
the embodiment of the specification performs metadata field matching, which helps to sort out the relationships among data, clarify how closely the data are related, and improve the quality of data assets. Meanwhile, the similarity between words is judged by using the word vector method from natural language processing, so that the judgment is both efficient and accurate. The trunk information of each field is extracted by using dependency syntax analysis, which reduces the influence of irrelevant business vocabulary on the metadata field matching result. In addition, the embodiment of the specification can automatically and quickly match metadata fields for newly added fields, without manually selecting from a plurality of metadata fields and matching by hand, and is therefore more efficient.
Drawings
In order to more clearly illustrate the embodiments of the present specification or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only some embodiments described in the present specification, and for those skilled in the art, other drawings can be obtained according to the drawings without any creative effort. In the drawings:
FIG. 1 is a flow diagram illustrating a database metadata field matching method according to one or more embodiments of the present disclosure;
FIG. 2 is a schematic diagram of data preprocessing provided by one or more embodiments of the present disclosure;
FIG. 3 is a diagram of a database metadata field matching process provided in one or more embodiments of the present description;
FIG. 4 is a block diagram illustrating an exemplary database metadata field matching apparatus according to one or more embodiments of the present disclosure;
FIG. 5 is a schematic structural diagram of a database metadata field matching device according to one or more embodiments of the present specification.
Detailed Description
The embodiment of the specification provides a database metadata field matching method, device, equipment and medium.
In order to make those skilled in the art better understand the technical solutions in the present specification, the technical solutions in the embodiments of the present specification will be clearly and completely described below with reference to the drawings in the embodiments of the present specification, and it is obvious that the described embodiments are only a part of the embodiments of the present specification, and not all of the embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present specification without any creative effort shall fall within the protection scope of the present specification.
FIG. 1 is a schematic flowchart of a database metadata field matching method according to one or more embodiments of the present disclosure. The process may be executed by a database metadata field matching system, which analyzes the database field names and automatically finds metadata fields with similar meanings to match against, so as to avoid manual database metadata field matching, improve matching efficiency, and better meet the requirements of users. Certain input parameters or intermediate results in the flow may be adjusted manually to help improve accuracy.
The method of the embodiment of the specification comprises the following steps:
S102, obtaining unmatched database fields, and performing preset dependency syntax analysis on the unmatched database fields to obtain the trunk characteristic words of the unmatched database fields.
In this embodiment of the present specification, the preset dependency syntax analysis that yields the trunk characteristic words of the unmatched database field may proceed as follows. First, dependency syntax analysis is performed on the unmatched database field to generate a dependency syntax tree. Next, the part of speech and the dependency relationship of each word in the field phrase are obtained from the dependency syntax tree. Then, based on the parts of speech and dependency relationships, the direct objects and the independent nouns, that is, nouns that do not depend on any other word in the field phrase, are determined, and each independent noun is taken as the noun virtual root of the dependency syntax tree. Finally, the trunk characteristic words of the unmatched database field are determined from the direct object and the noun virtual root. For example, in field phrases such as "student name" and "detailed address", the trunk information "name" and "address" does not depend on any other word in the phrase and therefore serves as the noun virtual root of the dependency syntax tree.
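The selection rule described above can be sketched in Python. This is only a minimal illustration that assumes the dependency parse is already available as a list of tokens; the `Token` type, the label names `root`, `dobj` and `nmod`, and the part-of-speech tag `NOUN` are assumptions made for the example and are not taken from this specification or from any particular parser.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    text: str    # the word itself
    pos: str     # part of speech, e.g. "NOUN" (assumed tag set)
    head: int    # index of the word this token depends on, -1 for the root
    deprel: str  # dependency label, e.g. "root", "dobj" (assumed label set)


def trunk_feature_words(tokens: List[Token]) -> List[str]:
    """Keep the noun virtual root and any direct objects; drop modifiers."""
    words = []
    for tok in tokens:
        is_noun_root = tok.head == -1 and tok.pos == "NOUN"
        is_direct_object = tok.deprel == "dobj"
        if is_noun_root or is_direct_object:
            words.append(tok.text)
    return words


# Hand-written parse of the field phrase "student name":
# "name" is the virtual root and "student" modifies it.
parsed = [
    Token("student", "NOUN", 1, "nmod"),
    Token("name", "NOUN", -1, "root"),
]
print(trunk_feature_words(parsed))
```

In practice the token list would come from whichever dependency parser is used, and the label names would have to be adapted to that parser's scheme.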
S104, judging whether the trunk characteristic words of the unmatched database fields exist in a pre-generated trunk data structure.
In this embodiment of the present specification, before judging whether a trunk characteristic word of the unmatched database field exists in the pre-generated trunk data structure, the trunk data structure must first be generated as follows:
First, the required database metadata fields are determined according to the business requirements and a data table structure. The data table structure may be a preset database table structure that defines the fields, types, primary key, foreign keys and indexes of a table. Next, the pre-acquired database fields are matched with the database metadata fields, and the matched database fields and database metadata fields are stored in a data set. Then, dependency syntax analysis is performed on the database fields in the data set to obtain the trunk characteristic words of the database fields. Finally, the trunk characteristic words of the database fields are matched with the database metadata fields, and the matched trunk characteristic words and database metadata fields are stored in the data set through the trunk data structure. The trunk data structure may be a preset data structure that is mainly used for storing the trunk characteristic words of matched database fields together with the database metadata fields.
Further, the trunk data structure of the embodiment of the present specification stores first key-value pairs: the database metadata field is the key of a first key-value pair, and a list composed of the trunk characteristic words of the database fields corresponding to that database metadata field is its value. This storage mode makes it easier to store and manage the trunk characteristic words of the database fields and the database metadata fields.
Further, the field data structure of the embodiment of the present specification stores second key-value pairs: the database metadata field is the key of a second key-value pair, and a list composed of the database fields corresponding to that database metadata field is its value. This storage mode makes it easier to store and manage the database fields and the database metadata fields.
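As a rough illustration, the two key-value structures and the lookup of step S104 might look like this in Python. The dictionary contents (including the field "teacher name") and the helper name `lookup_metadata_field` are hypothetical and serve only to show the shape of the data.

```python
from typing import Dict, List, Optional

# Field data structure (second key-value pairs):
# metadata field -> list of database fields. Contents are illustrative.
field_structure: Dict[str, List[str]] = {
    "name": ["student name", "teacher name"],
    "address": ["home address", "detailed address"],
}

# Trunk data structure (first key-value pairs):
# metadata field -> list of trunk characteristic words. Contents are illustrative.
trunk_structure: Dict[str, List[str]] = {
    "name": ["name"],
    "address": ["address"],
}


def lookup_metadata_field(trunk_word: str,
                          structure: Dict[str, List[str]]) -> Optional[str]:
    """Return the metadata field whose value list contains trunk_word, if any."""
    for metadata_field, trunk_words in structure.items():
        if trunk_word in trunk_words:
            return metadata_field
    return None


print(lookup_metadata_field("address", trunk_structure))
print(lookup_metadata_field("phone", trunk_structure))
```

A hit corresponds to the direct match of step S106; a `None` result corresponds to the case that falls through to the word vector comparison of step S108.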
S106, if the trunk characteristic words of the unmatched database fields exist in the values of the trunk data structure, matching the database metadata fields corresponding to the unmatched database fields in the trunk data structure.
S108, if the trunk characteristic words of the unmatched database fields do not exist in the values of the trunk data structure, converting the trunk characteristic words of the unmatched database fields into word vectors to be matched according to a pre-trained word vector model, and determining the vector values of the trunk characteristic words of the unmatched database fields.
S110, calculating the similarity between the vector value of the trunk characteristic words of the unmatched database fields and the vector value of the database metadata fields in the vector data structure generated in advance.
Before calculating the similarity between the vector value of the trunk characteristic words of the unmatched database fields and the vector value of the database metadata field in the pre-generated vector data structure, the vector data structure must first be generated as follows:
First, the trunk characteristic words of the database fields corresponding to each database metadata field are converted into word vectors according to the word vector model, and the word vectors are taken as the vector value of the database metadata field. Next, each database metadata field is matched with its vector value, and the matched database metadata field and vector value are stored in the data set. Finally, the matched database metadata fields and vector values are stored through a vector data structure that holds third key-value pairs: the database metadata field is the key of a third key-value pair, and the vector value of the database metadata field is its value.
It should be noted that, in the embodiment of the present specification, each database metadata field may correspond to the trunk characteristic words of a plurality of database fields. In that case, the trunk characteristic words of the plurality of database fields corresponding to the database metadata field are converted into a plurality of word vectors according to the word vector model; the word vectors are then added and averaged to obtain an average word vector; finally, the average word vector is taken as the vector value of the database metadata field.
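The averaging step can be sketched in plain Python. The two-dimensional vectors, the word "surname" and the helper name `average_vector` are toy assumptions used only for illustration; real word vectors would come from the pre-trained word vector model and have many more dimensions.

```python
from typing import Dict, List


def average_vector(vectors: List[List[float]]) -> List[float]:
    """Add the word vectors component-wise and divide by their count."""
    if not vectors:
        raise ValueError("at least one word vector is required")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# Toy stand-in for the word vector model (values are made up).
toy_word_vectors: Dict[str, List[float]] = {
    "name": [1.0, 0.0],
    "surname": [0.0, 1.0],
}

# Trunk characteristic words assumed to belong to one metadata field.
trunk_words = ["name", "surname"]
ave = average_vector([toy_word_vectors[w] for w in trunk_words])
print(ave)  # [0.5, 0.5]
```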
S112, matching database metadata fields corresponding to the unmatched database fields according to the similarity.
In the embodiments of the present specification, with respect to the above technical solutions, the following may be further explained:
The above-mentioned dependency syntax analysis is a basic method of natural language processing, which obtains the dependency relationships between the individual words in a phrase by generating a dependency syntax tree.
The dependency syntax tree of the embodiments of the present specification has the following characteristics:
(1) Exactly one word is the ROOT (virtual root node, abbreviated as virtual root), and it does not depend on any other word.
(2) Every word except the ROOT must depend on another word.
(3) No word can depend on more than one word.
(4) If the word A depends on the word B, then a word C located between A and B can only depend on A, on B, or on a word located between A and B.
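These four properties can be checked mechanically. The sketch below assumes the tree is given as a list `heads`, where `heads[i]` is the index of the word that word `i` depends on and `-1` marks the virtual root; this representation and the function name are assumptions for illustration. Property (3) holds by construction because each word has a single entry, and the sketch does not check for cycles.

```python
from typing import List


def is_valid_dependency_tree(heads: List[int]) -> bool:
    n = len(heads)
    # (1) exactly one word is the virtual root
    if sum(1 for h in heads if h == -1) != 1:
        return False
    # (2) every other word depends on some other word of the phrase
    for i, h in enumerate(heads):
        if h != -1 and (h < 0 or h >= n or h == i):
            return False
    # (4) a word between A and B may only depend on a word within [A, B]
    for i, h in enumerate(heads):
        if h == -1:
            continue
        lo, hi = min(i, h), max(i, h)
        for k in range(lo + 1, hi):
            if not (lo <= heads[k] <= hi):
                return False
    return True


print(is_valid_dependency_tree([1, -1]))        # "student" -> "name" (root)
print(is_valid_dependency_tree([2, 3, -1, 2]))  # crossing dependencies
```

The first call returns `True`; the second returns `False` because word 1 lies between words 0 and 2 but depends on word 3, which is outside that span.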
Through dependency parsing, the dependency relationship between words can be obtained, for example: a cardinal relationship, a motile relationship, a formal foreign language, a direct object, etc. The trunk information of a phrase can be found through these relationships among its words, so that the modifier words are removed and the trunk characteristic words of the phrase are obtained.
The word vector of the embodiment of the present specification is another important method of natural language processing, in which words are mapped to multidimensional vectors through one-hot coding or neural network training. One-hot coding encodes the position of a word in a vocabulary as the corresponding word vector; it is simple to implement, but it can hardly capture the semantic characteristics of a word. The embodiment of this specification may instead train word vectors with a neural network: a large amount of text is fed into the network to train an unsupervised text prediction task, the output of the trained network's hidden layer is used as the word vector, and the feature extraction part of the network serves as the word vector generator. Because this method is based on a text prediction task, the trained word vectors contain the basic semantic features of words, and they can achieve a good effect when applied in the embodiment of this specification. The method for training word vectors with a neural network may be one of word2vec, GloVe and BERT; preferably, word2vec is used as the word vector generation method in the embodiments of the present specification.
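A small Python sketch shows why one-hot coding carries no semantic information: any two distinct words are mapped to orthogonal vectors, so their dot product (and hence their cosine similarity) is always zero regardless of meaning. The three-word vocabulary and the helper names are made up for the example.

```python
from typing import List

vocabulary = ["name", "address", "phone"]  # illustrative vocabulary


def one_hot(word: str, vocab: List[str]) -> List[int]:
    """Encode a word as a vector with a single 1 at its vocabulary position."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec


def dot(a: List[int], b: List[int]) -> int:
    return sum(x * y for x, y in zip(a, b))


print(one_hot("address", vocabulary))  # [0, 1, 0]
print(dot(one_hot("name", vocabulary), one_hot("address", vocabulary)))  # 0
```

Dense vectors trained by a neural network do not have this limitation, which is why they are suited to the similarity comparison used in the matching step.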
The database metadata field matching of the embodiment of the specification has the following characteristics:
The database fields are basic components of the database and represent the attributes of a database record object, and each database field comprises information of a specific attribute in the database record object, such as the name of a student, the contact number of the student, the home address and the like.
The metadata field represents the meaning of a database field in a more general and broader form, and database fields can be classified by using the metadata field as a category. For example, the metadata field of the "student name" field is "name", and the metadata field of the "home address" field is "address".
Database metadata field matching means that the database field names are analyzed to find metadata fields with similar meanings, and the two are matched. One metadata field can be matched with a plurality of database fields, but one database field has only one corresponding metadata field. Matching the metadata fields of the database clarifies the association relationships among the database fields and divides the database fields into categories, so that the data quality is further improved and the logic among the data is clearer.
The purpose of the embodiments of the present specification is to automatically and efficiently match database field names with the corresponding database metadata fields, clarify the association relationships between the data fields, provide a basis for sorting out the data relationships, and make the organization of the data clearer.
The specific technical scheme of the embodiment of the specification can be as follows:
1. Data preparation
The data required herein are as follows:
(1) All required database metadata fields, sorted out according to the business requirements and the existing table structure.
(2) A data set built by manually labeling some of the existing database fields with their matching database metadata fields.
(3) A pre-trained word2vec word vector model.
2. Data pre-processing
The labeled data set is loaded and stored in a map data structure that holds a plurality of key-value pairs: each database metadata field is a key of the map, and a list composed of the database fields corresponding to that database metadata field is the value. This map may be defined as standard_map.
Dependency syntax analysis is performed on all database fields in standard_map. Since the embodiment of the present specification aims to obtain the main meaning of a field after removing the business-descriptive vocabulary, the words corresponding to the noun virtual root and the direct object obtained from the dependency syntax analysis of the database field name may be used as the trunk characteristic words of the field. A new map is used to store the database metadata fields and the trunk characteristic words of the database fields: the database metadata field is the key of the map, a list composed of the trunk characteristic words of the database fields corresponding to that database metadata field is the value, and duplicate entries in the list are removed. This map may be defined as feature_map.
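The construction of feature_map from standard_map can be sketched as follows. The extractor is passed in as a function; the `last_word_extractor` used here is a deliberately crude stand-in that is not dependency syntax analysis, and the map contents are hypothetical examples.

```python
from typing import Callable, Dict, List


def build_feature_map(
    standard_map: Dict[str, List[str]],
    extract_trunk_words: Callable[[str], List[str]],
) -> Dict[str, List[str]]:
    """Map each metadata field to the de-duplicated trunk words of its fields."""
    feature_map: Dict[str, List[str]] = {}
    for metadata_field, db_fields in standard_map.items():
        trunk_words: List[str] = []
        for db_field in db_fields:
            for word in extract_trunk_words(db_field):
                if word not in trunk_words:  # de-duplication step
                    trunk_words.append(word)
        feature_map[metadata_field] = trunk_words
    return feature_map


def last_word_extractor(field_name: str) -> List[str]:
    # Stand-in only: a real system would run dependency syntax analysis here.
    return [field_name.split()[-1]]


standard_map = {
    "name": ["student name", "teacher name"],
    "address": ["home address", "detailed address"],
}
feature_map = build_feature_map(standard_map, last_word_extractor)
print(feature_map)
```

Passing the extractor as a parameter keeps the map-building logic independent of whichever parser is used.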
To measure the similarity between a database field and a metadata field, words are converted into word vectors, and the similarity between words is reflected by the cosine similarity between their vectors. Therefore, in the embodiment of the present specification, each word in the value of each key-value pair in feature_map is converted into a word vector w_vec by using the word2vec model, and the word vectors w_vec belonging to the same database metadata field are added and averaged to obtain the average feature vector ave_vec of that database metadata field. The result is stored in a map in which the database metadata field is the key and the average feature vector ave_vec is the value. This map may be defined as vector_map.
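In standard notation, the two quantities used here can be written as below. The symbols $n$ (the number of trunk characteristic words of one metadata field), $\mathbf{w}_i$ (the word vector w_vec of the $i$-th such word), and $\mathbf{u}$, $\mathbf{v}$ (any two vectors being compared) are introduced only for this illustration.

```latex
\mathrm{ave\_vec} = \frac{1}{n}\sum_{i=1}^{n} \mathbf{w}_i,
\qquad
\cos(\mathbf{u}, \mathbf{v}) =
  \frac{\mathbf{u}\cdot\mathbf{v}}{\lVert\mathbf{u}\rVert\,\lVert\mathbf{v}\rVert}
```

The cosine similarity lies in $[-1, 1]$, and a larger value means the two vectors point in more similar directions.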
The data preprocessing process can be referred to the data preprocessing diagram shown in fig. 2.
3. Database metadata field matching
Inputting an unmatched database field, performing dependency syntax analysis on the unmatched database field to generate a dependency syntax tree, and obtaining the part of speech and the dependency relationship of each word in the database field according to the dependency syntax tree. Since the embodiment of the present specification needs to obtain the backbone information of the database field, it is necessary to extract the direct object of the field phrase as the backbone information. In addition, since the stem information "name" and "address" of the field such as "student name" and "detailed address" is not dependent on other words of the field phrase and is used as a virtual root in the dependency syntax tree, the field stem information should include a nominal virtual root of the dependency syntax tree in addition to the direct object. Therefore, after the dependency syntax analysis is carried out on the database field, the direct object and the noun virtual root can be selected as the main information of the field. It should be noted that the backbone information of the database field mentioned here is the backbone feature words of the database field mentioned above.
And searching in the value of the feature _ map by using the extracted field backbone information, and if the field backbone information exists, directly matching the field with the key value database metadata field corresponding to the field backbone information in the feature _ map, thereby completing matching.
If the backbone information does not exist in feature_map, the backbone information is converted into a word vector using the word2vec model to obtain the corresponding vector field_vector, and the cosine similarity between field_vector and every ave_vec in vector_map is calculated. A larger cosine similarity means the vectors are closer and the word semantics are more similar. The metadata field corresponding to the ave_vec with the largest cosine similarity to field_vector is therefore selected as the match for the database field; the database field is added to standard_map, feature_map is updated at the same time, and matching is complete.
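The fallback step might look like the following Python sketch. The function names, the toy vectors, and the sample field "student alias" are assumptions made for this example, and the cosine similarity is computed by hand rather than through any particular library.

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """cos(a, b) = (a . b) / (|a| * |b|); returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_by_similarity(
    field: str,
    trunk_word: str,
    field_vector: List[float],
    vector_map: Dict[str, List[float]],
    standard_map: Dict[str, List[str]],
    feature_map: Dict[str, List[str]],
) -> str:
    """Pick the metadata field with the largest cosine similarity and record the match."""
    best_meta = max(
        vector_map, key=lambda meta: cosine_similarity(field_vector, vector_map[meta])
    )
    standard_map.setdefault(best_meta, []).append(field)      # metadata field -> database fields
    feature_map.setdefault(best_meta, []).append(trunk_word)  # metadata field -> trunk words
    return best_meta


# Sample data (assumed).
vector_map = {"NAME": [0.9, 0.1, 0.0], "ADDRESS": [0.0, 0.9, 0.1]}
standard_map = {"NAME": ["student name"], "ADDRESS": ["detailed address"]}
feature_map = {"NAME": ["name"], "ADDRESS": ["address"]}

matched = match_by_similarity(
    "student alias", "alias", [0.7, 0.3, 0.0], vector_map, standard_map, feature_map
)
```

The sketch does not recompute ave_vec after feature_map changes, because the text does not say whether vector_map is refreshed at this point.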
During the cosine similarity calculation, if the maximum cosine similarity between field_vector and all ave_vec values is less than 0.4, the field is not sufficiently close to any existing metadata field, and a metadata field should be assigned manually or a new metadata field should be added.
For the database metadata field matching process described above, refer to the schematic shown in Fig. 3.
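The threshold check can be sketched in Python as below. The function name select_match and the sample similarity scores are assumptions; only the 0.4 threshold and the "less than" comparison come from the text.

```python
from typing import Dict, Optional

SIMILARITY_THRESHOLD = 0.4  # value stated in the description


def select_match(
    similarities: Dict[str, float], threshold: float = SIMILARITY_THRESHOLD
) -> Optional[str]:
    """Return the best-scoring metadata field, or None when manual handling is needed."""
    if not similarities:
        return None
    best_meta = max(similarities, key=similarities.get)
    if similarities[best_meta] < threshold:
        return None  # too far from every existing metadata field
    return best_meta


# Sample cosine similarities between field_vector and each ave_vec (assumed values).
confident = select_match({"NAME": 0.95, "ADDRESS": 0.39})
needs_manual_review = select_match({"NAME": 0.31, "ADDRESS": 0.12})
```

A None result signals that the field should be assigned by hand or that a new metadata field should be created.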
Note that the metadata field matching performed in the embodiment of the present specification helps to sort out the relationships between data, clarify how closely data items are related, and improve the quality of data assets. Using word vectors from natural language processing to judge the similarity between words keeps the judgment accurate while remaining efficient. Using dependency syntax analysis to extract the backbone information of a field reduces the influence of irrelevant business vocabulary on the metadata field matching result. In addition, the embodiment can automatically and quickly match metadata fields for newly added fields without manual selection from many metadata fields, which is more efficient. The embodiment may also generate the correspondence standard_map between database fields and metadata fields.
Fig. 4 is a schematic structural diagram of a database metadata field matching apparatus according to one or more embodiments of the present specification. The apparatus includes an obtaining and analyzing unit 402, a judging unit 404, a first matching unit 406, a vector conversion unit 408, a calculating unit 410 and a second matching unit 412:
an obtaining and analyzing unit 402, configured to obtain an unmatched database field, and perform preset dependency syntax analysis on the unmatched database field to obtain trunk feature words of the unmatched database field;
a judging unit 404, configured to judge whether a trunk feature word of the unmatched database field exists in a pre-generated trunk data structure;
a first matching unit 406, configured to match a database metadata field corresponding to the unmatched database field in the trunk data structure if a trunk feature word of the unmatched database field exists in the values of the trunk data structure;
a vector conversion unit 408, configured to, if the trunk feature words of the unmatched database field do not exist in the values of the trunk data structure, convert the trunk feature words of the unmatched database field into word vectors to be matched according to a pre-trained word vector model, and determine vector values of the trunk feature words of the unmatched database field;
a calculating unit 410, configured to calculate a similarity between a vector value of a trunk feature word of the unmatched database field and a vector value of a database metadata field in a vector data structure generated in advance;
a second matching unit 412, configured to match the database metadata field corresponding to the unmatched database field according to the similarity.
Fig. 5 is a schematic structural diagram of a database metadata field matching device according to one or more embodiments of the present specification, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to cause the at least one processor to:
obtaining unmatched database fields, and performing preset dependency syntax analysis on the unmatched database fields to obtain trunk characteristic words of the unmatched database fields;
judging whether a trunk characteristic word of the unmatched database field exists in a pre-generated trunk data structure;
if the trunk characteristic words of the unmatched database fields exist in the numerical values of the trunk data structure, matching database metadata fields corresponding to the unmatched database fields in the trunk data structure;
if the trunk characteristic words of the unmatched database fields do not exist in the numerical values of the trunk data structure, converting the trunk characteristic words of the unmatched database fields into word vectors to be matched according to a pre-trained word vector model, and determining vector values of the trunk characteristic words of the unmatched database fields;
calculating the similarity between the vector value of the trunk characteristic words of the unmatched database fields and the vector value of the database metadata fields in a vector data structure generated in advance;
and matching database metadata fields corresponding to the unmatched database fields according to the similarity.
One or more embodiments of the present specification provide a non-transitory computer storage medium storing computer-executable instructions configured to:
obtaining unmatched database fields, and performing preset dependency syntax analysis on the unmatched database fields to obtain trunk characteristic words of the unmatched database fields;
judging whether a trunk characteristic word of the unmatched database field exists in a pre-generated trunk data structure;
if the trunk characteristic words of the unmatched database fields exist in the numerical values of the trunk data structure, matching database metadata fields corresponding to the unmatched database fields in the trunk data structure;
if the trunk characteristic words of the unmatched database fields do not exist in the numerical values of the trunk data structure, converting the trunk characteristic words of the unmatched database fields into word vectors to be matched according to a pre-trained word vector model, and determining vector values of the trunk characteristic words of the unmatched database fields;
calculating the similarity between the vector value of the trunk characteristic words of the unmatched database fields and the vector value of the database metadata fields in a vector data structure generated in advance;
and matching database metadata fields corresponding to the unmatched database fields according to the similarity.
The embodiments in the present specification are described in a progressive manner; for parts that are the same or similar, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, because the apparatus, device, and non-transitory computer storage medium embodiments are substantially similar to the method embodiment, they are described briefly; for relevant details, refer to the corresponding description of the method embodiment.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
The above description is intended to represent one or more embodiments of the present disclosure, and should not be taken to be limiting of the present disclosure. Various modifications and alterations to one or more embodiments of the present description will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement or the like made within the spirit and principle of one or more embodiments of the present specification should be included in the scope of the claims of the present specification.

Claims (10)

1. A database metadata field matching method, comprising:
obtaining unmatched database fields, and performing preset dependency syntax analysis on the unmatched database fields to obtain trunk characteristic words of the unmatched database fields;
judging whether a trunk characteristic word of the unmatched database field exists in a pre-generated trunk data structure;
if the trunk characteristic words of the unmatched database fields exist in the numerical values of the trunk data structure, matching database metadata fields corresponding to the unmatched database fields in the trunk data structure;
if the trunk characteristic words of the unmatched database fields do not exist in the numerical values of the trunk data structure, converting the trunk characteristic words of the unmatched database fields into word vectors to be matched according to a pre-trained word vector model, and determining vector values of the trunk characteristic words of the unmatched database fields;
calculating the similarity between the vector value of the trunk characteristic word of the unmatched database field and the vector value of the database metadata field in a vector data structure generated in advance;
and matching database metadata fields corresponding to the unmatched database fields according to the similarity.
2. The method of claim 1, wherein before the judging whether a trunk characteristic word of the unmatched database field exists in a pre-generated trunk data structure, the method further comprises:
determining a required database metadata field according to the service requirement and the data table structure;
matching a database field obtained in advance with a database metadata field, and storing the matched database field and the database metadata field into a data set;
performing dependency syntax analysis on the database fields in the data set to obtain the trunk characteristic words of the database fields;
and matching the trunk characteristic words of the database fields with the database metadata fields, and storing the matched trunk characteristic words of the database fields and the database metadata fields into the data set through a trunk data structure.
3. The method of claim 2, wherein the trunk data structure stores a first key-value pair, wherein the database metadata field serves as a key of the first key-value pair, and wherein a list of trunk characteristic words of the database fields corresponding to the database metadata field serves as a value of the first key-value pair.
4. The method of claim 2, wherein the field data structure stores a second key-value pair, wherein the database metadata field serves as a key of the second key-value pair, and wherein a list of database fields corresponding to the database metadata field serves as a value of the second key-value pair.
5. The method of claim 2, wherein before calculating the similarity between the vector value of the trunk characteristic words of the unmatched database field and the vector value of the database metadata field in the pre-generated vector data structure, the method further comprises:
converting the trunk characteristic words of the database fields corresponding to the database metadata fields into word vectors according to the word vector model, and taking the word vectors as vector values of the database metadata fields;
matching the database metadata field with the vector value of the database metadata field, and storing the matched database metadata field and the vector value of the database metadata field into the data set;
and storing the matched database metadata field and the vector value of the database metadata field through a vector data structure, wherein a third key-value pair is stored in the vector data structure, the database metadata field is used as a key of the third key-value pair, and the vector value of the database metadata field is used as a value of the third key-value pair.
6. The method of claim 5, wherein each of the database metadata fields corresponds to trunk characteristic words of a plurality of database fields;
converting the trunk characteristic words of the database fields corresponding to the database metadata fields into word vectors according to the word vector model, and taking the word vectors as vector values of the database metadata fields, specifically including:
converting the trunk characteristic words of the database fields corresponding to the database metadata fields into a plurality of word vectors according to the word vector model;
adding the word vectors and averaging to obtain an average word vector;
and taking the average word vector as a vector value of the database metadata field.
7. The method according to claim 1, wherein the performing a preset dependency syntax analysis on the unmatched database field to obtain trunk characteristic words of the unmatched database field specifically includes:
performing dependency syntax analysis on the unmatched database fields to generate a dependency syntax tree;
obtaining the part of speech and the dependency relationship of the field phrases in the unmatched database fields according to the dependency syntax tree;
determining noun independent fields and direct objects independent of other field phrases according to the part of speech and dependency relationship of the field phrases in the unmatched database fields, and taking the noun independent fields as noun virtual roots in the dependency syntax tree;
and determining the trunk characteristic words of the unmatched database fields according to the direct object and the noun virtual root.
8. A database metadata field matching apparatus, comprising:
the acquisition and analysis unit is used for acquiring unmatched database fields and carrying out preset dependency syntax analysis on the unmatched database fields to obtain the trunk characteristic words of the unmatched database fields;
the judging unit is used for judging whether the trunk characteristic words of the unmatched database fields exist in a pre-generated trunk data structure;
the first matching unit is used for matching a database metadata field corresponding to the unmatched database field in the trunk data structure if the trunk characteristic word of the unmatched database field exists in the numerical value of the trunk data structure;
the vector conversion unit is used for converting the trunk characteristic words of the unmatched database fields into word vectors to be matched according to a pre-trained word vector model if the trunk characteristic words of the unmatched database fields do not exist in the numerical values of the trunk data structure, and determining the vector values of the trunk characteristic words of the unmatched database fields;
the calculating unit is used for calculating the similarity between the vector value of the trunk characteristic word of the unmatched database field and the vector value of the database metadata field in a vector data structure generated in advance;
and the second matching unit is used for matching the database metadata fields corresponding to the unmatched database fields according to the similarity.
9. A database metadata field matching device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to:
obtaining unmatched database fields, and performing preset dependency syntax analysis on the unmatched database fields to obtain trunk characteristic words of the unmatched database fields;
judging whether a trunk characteristic word of the unmatched database field exists in a pre-generated trunk data structure;
if the trunk characteristic words of the unmatched database fields exist in the numerical values of the trunk data structure, matching database metadata fields corresponding to the unmatched database fields in the trunk data structure;
if the trunk characteristic words of the unmatched database fields do not exist in the numerical values of the trunk data structure, converting the trunk characteristic words of the unmatched database fields into word vectors to be matched according to a pre-trained word vector model, and determining vector values of the trunk characteristic words of the unmatched database fields;
calculating the similarity between the vector value of the trunk characteristic words of the unmatched database fields and the vector value of the database metadata fields in a vector data structure generated in advance;
and matching database metadata fields corresponding to the unmatched database fields according to the similarity.
10. A non-transitory computer storage medium having stored thereon computer-executable instructions configured to:
obtaining unmatched database fields, and performing preset dependency syntax analysis on the unmatched database fields to obtain trunk characteristic words of the unmatched database fields;
judging whether a trunk characteristic word of the unmatched database field exists in a pre-generated trunk data structure;
if the trunk characteristic words of the unmatched database fields exist in the numerical values of the trunk data structure, matching database metadata fields corresponding to the unmatched database fields in the trunk data structure;
if the trunk characteristic words of the unmatched database fields do not exist in the numerical values of the trunk data structure, converting the trunk characteristic words of the unmatched database fields into word vectors to be matched according to a pre-trained word vector model, and determining vector values of the trunk characteristic words of the unmatched database fields;
calculating the similarity between the vector value of the trunk characteristic words of the unmatched database fields and the vector value of the database metadata fields in a vector data structure generated in advance;
and matching database metadata fields corresponding to the unmatched database fields according to the similarity.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210570053.1A CN114969001B (en) 2022-05-24 2022-05-24 Database metadata field matching method, device, equipment and medium

Publications (2)

Publication Number Publication Date
CN114969001A (en) 2022-08-30
CN114969001B (en) 2024-05-10

Family

ID=82956194

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210570053.1A Active CN114969001B (en) 2022-05-24 2022-05-24 Database metadata field matching method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN114969001B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115827645A (en) * 2023-02-15 2023-03-21 畅捷通信息技术股份有限公司 Cross-service-field matching method, device and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170139984A1 (en) * 2015-11-13 2017-05-18 International Business Machines Corporation Method And System For Semantic-Based Queries Using Word Vector Representation
CN109766436A (en) * 2018-12-04 2019-05-17 北京明略软件系统有限公司 A kind of matched method and apparatus of data element of the field and knowledge base of tables of data
CN109783635A (en) * 2017-11-13 2019-05-21 埃森哲环球解决方案有限公司 Use machine learning and fuzzy matching AUTOMATIC ZONING classifying documents and identification metadata
CN111652299A (en) * 2020-05-26 2020-09-11 泰康保险集团股份有限公司 Method and equipment for automatically matching service data
CN112380321A (en) * 2020-11-19 2021-02-19 深圳季连科技有限公司 Primary and secondary database distribution method based on bill knowledge graph and related equipment
CN112580691A (en) * 2020-11-25 2021-03-30 北京北大千方科技有限公司 Term matching method, matching system and storage medium of metadata field

Also Published As

Publication number Publication date
CN114969001B (en) 2024-05-10

Similar Documents

Publication Publication Date Title
CN108804521B (en) Knowledge graph-based question-answering method and agricultural encyclopedia question-answering system
CN108304372B (en) Entity extraction method and device, computer equipment and storage medium
WO2021093755A1 (en) Matching method and apparatus for questions, and reply method and apparatus for questions
CN107704453B (en) Character semantic analysis method, character semantic analysis terminal and storage medium
CN111026886B (en) Multi-round dialogue processing method for professional scene
CN112069298A (en) Human-computer interaction method, device and medium based on semantic web and intention recognition
CN111274267A (en) Database query method and device and computer readable storage medium
WO2020005601A1 (en) Semantic parsing of natural language query
CN111694940A (en) User report generation method and terminal equipment
CN112052324A (en) Intelligent question answering method and device and computer equipment
CN111666764A (en) XLNET-based automatic summarization method and device
CN111813923A (en) Text summarization method, electronic device and storage medium
CN111625621A (en) Document retrieval method and device, electronic equipment and storage medium
CN112650833A (en) API (application program interface) matching model establishing method and cross-city government affair API matching method
CN114969001B (en) Database metadata field matching method, device, equipment and medium
CN111859950A (en) Method for automatically generating lecture notes
CN113343692B (en) Search intention recognition method, model training method, device, medium and equipment
CN116628173B (en) Intelligent customer service information generation system and method based on keyword extraction
CN113779987A (en) Event co-reference disambiguation method and system based on self-attention enhanced semantics
CN111813916A (en) Intelligent question and answer method, device, computer equipment and medium
CN111460114A (en) Retrieval method, device, equipment and computer readable storage medium
CN111400495A (en) Video bullet screen consumption intention identification method based on template characteristics
CN115858733A (en) Cross-language entity word retrieval method, device, equipment and storage medium
CN111783465B (en) Named entity normalization method, named entity normalization system and related device
CN112416754B (en) Model evaluation method, terminal, system and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant