CN112580691A - Term matching method, matching system and storage medium of metadata field - Google Patents

Term matching method, matching system and storage medium of metadata field

Info

Publication number
CN112580691A
Authority
CN
China
Prior art keywords
matching
trained
term
words
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011342621.XA
Other languages
Chinese (zh)
Other versions
CN112580691B (en)
Inventor
袁建华
熊赟
和瑞楷
夏曙东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CHINA TRANSINFO TECHNOLOGY CORP
Original Assignee
CHINA TRANSINFO TECHNOLOGY CORP
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CHINA TRANSINFO TECHNOLOGY CORP
Priority to CN202011342621.XA
Publication of CN112580691A
Application granted
Publication of CN112580691B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/28 Determining representative reference patterns, e.g. by averaging or distorting; Generating dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Library & Information Science (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a term matching method, a term matching system and a storage medium for metadata fields. The method comprises the following steps: preprocessing a first search term in a metadata training set to obtain a second search term; judging the second search term, matching it against the vocabulary to be trained in the vocabulary database to be trained, and taking the second search terms that are not matched successfully as the words to be retrieved; performing word segmentation on the words to be retrieved by using a conditional random field word segmentation algorithm to obtain a plurality of third search terms for each word to be retrieved, matching the third search terms against the vocabulary to be trained, and determining matching words; updating the vocabulary database to be trained according to the matching words, and training the conditional random field word segmentation algorithm to obtain a trained classifier and a trained vocabulary; and matching the words to be matched in the metadata fields to be matched by using the trained classifier and the trained vocabulary. Because the words to be retrieved are segmented with the conditional random field word segmentation algorithm before the matching words are determined, the matching accuracy can be improved while matching remains automatic.

Description

Term matching method, matching system and storage medium of metadata field
Technical Field
The present application relates to the field of data identification technologies, and in particular, to a term matching method, a term matching system, and a storage medium for metadata fields.
Background
Most businesses spend a great deal of time and effort dealing with disorganized, poorly fitting data. Their employees either cannot find the appropriate data or do not trust the data they find. Most importantly, self-service and data autonomy processes are restricted by a wide variety of industry regulations. As a result, enterprises attempt to repair data through a variety of labor-intensive tasks (including writing custom programs, developing global replacement functions, etc.), which severely impacts the productivity of data analysts and data scientists. This is especially true for large enterprises, where years of mergers and acquisitions have brought together systems and databases of every kind, resulting in extremely complex data environments. While maintaining these legacy data environments is already exhausting for businesses, new data continues to be generated at an unpredictable rate.
In view of the foregoing, it is desirable to provide a term matching method, matching system, and storage medium for metadata fields that are automatic and accurate.
Disclosure of Invention
To solve the above problems, the present application proposes a term matching method, matching system, and storage medium for metadata fields.
In a first aspect, the present application provides a term matching method for metadata fields, including:
preprocessing a first search term in a metadata training set to obtain a second search term;
judging the second search term, matching the second search term with the vocabulary to be trained in the vocabulary database to be trained, and acquiring the search term to be searched which is not matched successfully;
performing word segmentation on the words to be retrieved by using a conditional random field word segmentation algorithm to obtain a plurality of third retrieval words of each word to be retrieved, matching the third retrieval words with the words to be trained, and determining matched words;
updating a vocabulary database to be trained according to the matched words, and training the conditional random field word segmentation algorithm to obtain a trained classifier and a trained vocabulary;
and matching the metadata fields to be matched by using the trained classifier and the trained vocabulary.
Preferably, before the preprocessing the first search term in the metadata training set to obtain the second search term, the method further includes:
collecting table data, cleaning illegal characters in the table data, and establishing a vocabulary table to be trained;
and establishing a vocabulary database to be trained by using the vocabulary to be trained.
Preferably, the preprocessing the first search term in the metadata training set to obtain the second search term includes:
acquiring all first search terms in metadata in a metadata training set;
and removing illegal characters in each first search word to obtain a second search word corresponding to each first search word.
Preferably, the determining the second search term and matching the second search term with the vocabulary to be trained in the vocabulary database to be trained to obtain the search term includes:
judging the second search term to obtain a Chinese second search term and a non-Chinese second search term;
directly matching the Chinese second search terms in a vocabulary database to be trained, and determining the Chinese second search terms which cannot be matched with the matched terms;
and taking the non-Chinese second search term and the Chinese second search term which is not matched with the matching term as the to-be-searched term.
Preferably, the using a conditional random field word segmentation algorithm to segment the words to be retrieved to obtain a plurality of third retrieval words of each word to be retrieved, and matching the third retrieval words with the words to be trained to determine matching words includes:
performing word segmentation on the longer to-be-retrieved word by using a conditional random field word segmentation algorithm, and generating a plurality of third retrieval words of each word segmentation of the to-be-retrieved word, wherein the third retrieval words comprise: full spelling codes, abbreviated spelling codes, English names and/or Chinese names;
performing character matching on all the third search terms of each to-be-searched term and the to-be-trained vocabulary, and calculating the matching degree of each third search term corresponding to each to-be-searched term;
and determining the matching words of the words to be retrieved according to the matching degree and the matching threshold value.
Preferably, the matching the metadata fields to be matched by using the trained classifier and the trained vocabulary includes:
matching the metadata fields to be matched by using the trained vocabulary, and acquiring the metadata fields with unsuccessful matching as words to be matched;
using a trained conditional random field word segmentation algorithm as a trained classifier, and performing word segmentation on each word to be matched to obtain a plurality of third search words corresponding to each word to be matched;
performing character matching on the plurality of third search terms in the trained vocabulary, and calculating the matching degree of each third search term;
and determining a matching word corresponding to each word to be matched according to the matching degree and the matching threshold, wherein the matching word is a matching term of the word to be matched.
Preferably, the table data includes:
a customer glossary, an industry standard field interpretation mapping table, and a comparison table of metadata fields and interpretations finalized for a business system release, wherein each table comprises: Chinese name, English name, full spelling code and simple spelling code.
Preferably, the types of the first search term and the second search term each include: Chinese, full pinyin, shorthand pinyin, English acronym, and/or a mixture thereof.
In a second aspect, the present application provides a term matching system for metadata fields, comprising:
the training module is used for preprocessing a first search term in the metadata training set to obtain a second search term; judging the second search term, matching the second search term with the vocabulary to be trained in the vocabulary database to be trained, and acquiring the search term to be searched which is not matched successfully; performing word segmentation on the words to be retrieved by using a conditional random field word segmentation algorithm to obtain a plurality of third retrieval words of each word to be retrieved, matching the third retrieval words with the words to be trained, and determining matched words; updating a vocabulary database to be trained according to the matched words, and training the conditional random field word segmentation algorithm to obtain a trained classifier and a trained vocabulary;
and the matching module is used for matching the words to be matched in the metadata fields to be matched by using the trained classifier and the trained vocabulary.
In a third aspect, the present application is directed to a storage medium storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform a term matching method for metadata fields as described above.
The application has the advantages that: the second search word is judged and matched with the vocabulary in the trained vocabulary table database to obtain the unsuccessfully matched to-be-searched word, the to-be-searched word is segmented by using the conditional random field segmentation algorithm and classified by using the trained classifier to obtain a plurality of third search words, the third search words are matched with the trained vocabulary table, the matching degree is calculated, the matching words are determined according to the matching threshold, and the matching accuracy of the to-be-searched word can be improved on the premise of automatic matching.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to denote like parts throughout the drawings. In the drawings:
FIG. 1 is a schematic diagram of the steps of a term matching method for metadata fields provided herein;
FIG. 2 is a flow chart illustrating a term matching method for metadata fields provided herein;
FIG. 3 is a schematic diagram of the Aho-Corasick algorithm of a term matching method for metadata fields provided by the present application for preprocessing a pattern string into a finite state automaton;
FIG. 4 is a diagram illustrating a data structure of an adjacency matrix of a term matching method for metadata fields provided in the present application;
FIG. 5 is a diagram illustrating an adjacency list data structure for solving the Viterbi algorithm for term matching of metadata fields as provided herein;
FIG. 6 is a schematic diagram of a term matching system for metadata fields as provided herein.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
In a first aspect, according to an embodiment of the present application, a term matching method for metadata fields is provided, as shown in fig. 1, including:
S101, preprocessing a first search term in a metadata training set to obtain a second search term;
S102, judging the second search term, matching the second search term with the vocabulary to be trained in the vocabulary database to be trained, and acquiring the search term to be searched which is not matched successfully;
S103, performing word segmentation on the words to be retrieved by using a conditional random field word segmentation algorithm to obtain a plurality of third retrieval words of each word to be retrieved, matching the third retrieval words with the words to be trained, and determining matched words;
S104, updating a vocabulary database to be trained according to the matched words, and training the conditional random field word segmentation algorithm to obtain a trained classifier and a trained vocabulary;
and S105, matching the metadata fields to be matched by using the trained classifier and the trained vocabulary.
Preferably, before preprocessing the first search term in the metadata training set to obtain the second search term, the method further includes: collecting table data, cleaning illegal characters in the table data, and establishing a vocabulary table to be trained; and establishing a vocabulary database to be trained by using the vocabulary to be trained.
The vocabulary database to be trained is formed from lookup tables that map the vocabulary words and/or their participles to the corresponding Chinese, simple spelling codes, full spelling codes, English abbreviations, or mixed forms.
The collected table data includes: a customer glossary, an industry standard field interpretation mapping table, a comparison table of metadata fields and interpretations finalized for a business system release, and the like. Each table comprises: Chinese name, English name, full spelling code, simple spelling code, etc.
Preprocessing a first search term in a metadata training set to obtain a second search term, wherein the preprocessing comprises the following steps: acquiring all first search terms in metadata in a metadata training set; and removing illegal characters in each first search word to obtain a second search word corresponding to each first search word.
The first search terms in the metadata training set are labeled words (words whose correct matching word has already been determined). The types of the first search term and the second search term each include: Chinese, full pinyin, shorthand pinyin, English acronym, and/or a mixture thereof.
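The preprocessing step can be sketched as follows. This is a minimal illustration, not the method as claimed: the text does not say which characters count as illegal, so the whitelist below (CJK ideographs, ASCII letters, digits and the underscore separator) and the function names `preprocess` and `preprocess_all` are assumptions made for the example.

```python
import re

# Assumption: anything outside CJK ideographs, ASCII letters, digits and "_"
# is treated as an illegal character. The source does not define the set.
_ILLEGAL = re.compile(r"[^\u4e00-\u9fffA-Za-z0-9_]")


def preprocess(first_search_term: str) -> str:
    """Return the second search term: the first term with illegal characters removed."""
    return _ILLEGAL.sub("", first_search_term)


def preprocess_all(first_search_terms):
    """Map every first search term in a training set to its second search term."""
    return {term: preprocess(term) for term in first_search_terms}
```

For instance, `preprocess(" xing-ming# ")` yields `"xingming"`, while `"xing_ming"` passes through unchanged because the underscore is kept as a separator for later segmentation.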
Judging the second search term, and matching the second search term with the vocabulary to be trained in the vocabulary database to be trained to obtain the search term, wherein the search term comprises the following steps:
judging the second search term to obtain a Chinese second search term and a non-Chinese second search term; directly matching the Chinese second search terms in the vocabulary database to be trained, and determining the Chinese second search terms which cannot be matched with the matching terms; and taking the non-Chinese second search term and the Chinese second search term which is not matched with the matching term as the to-be-searched term. Direct matching means directly retrieving matches.
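A rough sketch of this judging and direct-matching step is shown below. It assumes that a "Chinese" term is one made up entirely of characters in the basic CJK ideograph range and that the vocabulary is available as a set of Chinese names; both the range test and the names `is_chinese` and `select_terms_to_retrieve` are illustrative assumptions.

```python
def is_chinese(term: str) -> bool:
    """Assumption: a Chinese term consists only of basic CJK ideographs."""
    return bool(term) and all("\u4e00" <= ch <= "\u9fff" for ch in term)


def select_terms_to_retrieve(second_terms, vocabulary):
    """Split second search terms into direct matches and to-be-searched words.

    vocabulary: a set of Chinese names from the vocabulary to be trained
    (hypothetical shape chosen for this example).
    """
    matched = {}       # Chinese term -> matching word found by direct retrieval
    to_retrieve = []   # non-Chinese terms plus unmatched Chinese terms
    for term in second_terms:
        if is_chinese(term) and term in vocabulary:
            matched[term] = term
        else:
            to_retrieve.append(term)
    return matched, to_retrieve
```

Only Chinese terms are tried against the vocabulary directly; everything else falls through to the segmentation step.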
Performing word segmentation on the words to be retrieved by using a conditional random field word segmentation algorithm to obtain a plurality of third retrieval words of each word to be retrieved, matching the third retrieval words with the words to be trained, and determining the matching words, wherein the word segmentation method comprises the following steps: performing word segmentation on a long word to be retrieved by using a conditional random field word segmentation algorithm, and generating a plurality of third retrieval words of each word segmentation of the word to be retrieved, wherein the third retrieval words comprise: full spelling codes, abbreviated spelling codes, English names and/or Chinese names; performing character matching on all third search terms of each to-be-searched term and the to-be-trained vocabulary, and calculating the matching degree of each third search term corresponding to each to-be-searched term; and determining the matching words of the words to be searched according to the matching degree and the matching threshold value. The method for obtaining a plurality of third search terms of each to-be-searched term further comprises the following steps: and directly generating a plurality of third search terms of the search terms to be searched. If the second search term is a Chinese word, a plurality of third search terms corresponding to the second search term can be directly generated.
A third search term is a Chinese, simple spelling, English abbreviation and/or mixed form of the second search term.
Character matching is measured as the ratio of the number of characters matched between the third search word and the corresponding item (word) in the vocabulary to be trained to the length of the longer of the two character strings. The matching degree CD is defined as:
$$CD = \frac{I_C}{L_{MAX}}$$
where $I_C$ denotes the number of matched characters, and $L_{MAX}$ denotes the larger of the number of characters in the third search word and the number of characters in the corresponding item in the vocabulary to be trained.
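One possible reading of the matching degree, sketched in Python. The description says elsewhere that the order of characters is ignored and that English characters are compared case-insensitively; counting matched characters as a multiset intersection is an assumption made here, as is the function name `matching_degree`.

```python
from collections import Counter


def matching_degree(third_search_word: str, vocabulary_item: str) -> float:
    """CD = I_C / L_MAX for two strings, ignoring character order and case."""
    a, b = third_search_word.lower(), vocabulary_item.lower()
    l_max = max(len(a), len(b))
    if l_max == 0:
        return 0.0  # assumption: two empty strings do not count as a match
    # I_C: characters common to both strings, counted with multiplicity.
    i_c = sum((Counter(a) & Counter(b)).values())
    return i_c / l_max
```

With this reading, a one-character word matched against a three-character item that contains it scores 1/3, and identical strings score 1.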
Matching the metadata fields to be matched by using the trained classifier and the trained vocabulary comprises: matching the metadata fields to be matched against the trained vocabulary, and taking the metadata fields that are not matched successfully as words to be matched; using the trained Conditional Random Field (CRF) word segmentation algorithm as the trained classifier, and performing word segmentation on each word to be matched to obtain a plurality of third search terms corresponding to each word to be matched; performing character matching of the plurality of third search terms against the trained vocabulary, and calculating the matching degree of each third search term; and determining a matching word corresponding to each word to be matched according to the matching degree and the matching threshold, wherein the matching word is the matching term of the word to be matched. When the trained vocabulary database is updated with matching words, each matching word is stored together with its corresponding first, second and/or third search terms. The vocabulary database includes a plurality of vocabularies. Matching follows a decision suggestion and a matching degree ranking: the decision suggestion is to select, as the matching word (matching term) of the word to be matched, the candidate with the maximum matching degree; the matching degree ranking applies when several classification results share the same maximum matching degree, in which case the candidate that is ranked first, or whose matching degree was calculated first, is selected as the matching word.
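The decision suggestion and tie-breaking rule can be illustrated with a short sketch. It assumes candidates arrive as `(matching_word, matching_degree)` pairs in ranking order and that a candidate qualifies when its degree is at least the threshold; the name `choose_match` and the `>=` comparison are assumptions.

```python
def choose_match(candidates, threshold):
    """Pick the candidate with the highest matching degree at or above the threshold.

    candidates: list of (matching_word, matching_degree) in ranking order.
    On ties the earliest candidate wins, because a later one must be strictly
    better to replace it.
    """
    best = None
    for word, degree in candidates:
        if degree < threshold:
            continue
        if best is None or degree > best[1]:
            best = (word, degree)
    return best[0] if best is not None else None
```

Returning `None` leaves room for a fallback such as manual matching when no candidate clears the threshold.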
The following examples are provided to further illustrate the present application.
As shown in fig. 2, a vocabulary to be trained must first be established for matching against the words in the metadata training set. The table data collected for this purpose includes at least: a customer glossary, an industry standard field interpretation mapping table, a comparison table of metadata fields and interpretations finalized for a business system release, and the like, each containing Chinese names, English names, full spelling codes and/or simple spelling codes, etc. The model is then pre-trained on existing industry-oriented data sets and labels to form a machine-learning-based classifier that provides decision suggestions and term allocation rules. A confidence score is determined by establishing a term matching degree between 0 and 1, modeled according to the alphabetical order in the keywords and the number of matches.
In the process of establishing the vocabulary table to be trained, firstly, the collected table data needs to be cleaned, illegal characters in the table data are removed, vocabulary and vocabulary participles are established, and the vocabulary table to be trained is formed by a lookup table corresponding to the vocabulary and/or the vocabulary participles, Chinese (Chinese name), abbreviated spelling codes, full spelling codes, English (English name), English abbreviations or mixture of the vocabulary and the vocabulary; and establishing a vocabulary database to be trained by using the vocabulary to be trained.
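As an illustration of what such a lookup table might look like in memory, the sketch below stores each vocabulary entry with its four forms and indexes every non-empty form. The field names and the `VocabEntry` and `build_lookup` names are hypothetical; the source only lists which kinds of names a table contains.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class VocabEntry:
    """One row of the vocabulary to be trained (field names are assumptions)."""
    chinese_name: str
    english_name: str
    full_spelling: str
    simple_spelling: str


def build_lookup(entries):
    """Index every non-empty form of each entry, lower-cased, to its entries."""
    index = defaultdict(list)
    for entry in entries:
        forms = (entry.chinese_name, entry.english_name,
                 entry.full_spelling, entry.simple_spelling)
        for form in forms:
            if form:
                index[form.lower()].append(entry)
    return index
```

A simple spelling code such as `xm` may map to several entries, which is why each key holds a list rather than a single entry.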
Obtaining metadata of search words (first search words) with determined meanings, forming a metadata training set, removing illegal characters in each first search word, and obtaining a second search word of each first search word.
And judging the type of the second search term, determining the Chinese words and the non-Chinese words in the second search term, and obtaining the Chinese second search term and the non-Chinese second search term. The type of the second search term includes: full spellings, abbreviated spellings, english abbreviations, and/or hybrids. And performing direct retrieval matching on the second Chinese retrieval word in the vocabulary database to be trained, determining the second Chinese retrieval word which cannot be matched with the matching word, and taking the second non-Chinese retrieval word and the second Chinese retrieval word which cannot be matched with the matching word as the to-be-retrieved word. And the corresponding matching words of the Chinese search words matched with the matching words are directly used for updating the vocabulary table database to be trained.
And performing word segmentation on the words to be retrieved by using a conditional random field word segmentation algorithm to obtain a plurality of third retrieval words of each word to be retrieved, matching the third retrieval words with the words to be trained, and determining the matching words. Specifically, a plurality of third search terms of the to-be-searched term may be directly generated, or a conditional random field word segmentation algorithm may be used to segment a longer to-be-searched term, so as to generate a plurality of third search terms of each segment of the to-be-searched term, where the third search terms include: full spelling codes, abbreviated spelling codes, English names and/or Chinese names. The Chinese of the word to be searched can obtain the full spelling code and the simple spelling code according to the spelling rule, carry on the search through Chinese-English dictionary thesaurus and get the English name, and then can obtain the English simple writing code according to English and English simple writing look-up table.
If the to-be-searched word is the Chinese word for "name", the third search words of the to-be-searched word are generated directly and include: name, xingming, and/or xm, etc. If the to-be-searched word is xing_ming, it is segmented by using the conditional random field word segmentation algorithm and the illegal characters in each participle are removed, yielding the two participles xing and ming, for which third search words are then generated. The third search words corresponding to the participle xing include: family name, surname, last name, x, etc.; the third search words corresponding to the participle ming include: name, given name, first name, m, etc.
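The xing_ming example can be sketched as follows. A trained conditional random field segmenter is not reproduced here; as a clearly labeled stand-in, the sketch simply splits on non-letter separators, which happens to give the same participles for this example. The function names `segment` and `spelling_variants` are hypothetical, and the English and Chinese variants, which need dictionary lookups, are left out.

```python
import re


def segment(term: str):
    """Stand-in for CRF word segmentation: split on non-letter separators.

    This is NOT a CRF; it only reproduces the xing_ming example.
    """
    return [part for part in re.split(r"[^A-Za-z]+", term) if part]


def spelling_variants(participles):
    """Build the full and simple spelling codes from pinyin participles."""
    full_spelling = "".join(participles).lower()
    simple_spelling = "".join(p[0] for p in participles).lower()
    return {"full_spelling": full_spelling, "simple_spelling": simple_spelling}
```

Calling `spelling_variants(segment("xing_ming"))` produces the full spelling code `xingming` and the simple spelling code `xm` mentioned in the text.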
Traversing the vocabulary to be trained in the vocabulary to be trained, performing character matching on all third search terms of each search term to be trained, and calculating the matching degree of each third search term corresponding to each search term to be trained and the vocabulary to be trained; and determining the matching words of the words to be searched according to the matching degree and the set matching threshold. And selecting the Chinese names of the vocabularies with the highest sum of the matching degrees and larger than the matching threshold value as new search words. And for the search word of which the third search word is English, if the full spelling code is the same as the simple spelling code, the matching degree of the full spelling code is not 1, and the matching degree of the simple spelling code is 1, selecting all Chinese names of the vocabulary with the matching degree of the simple spelling code of 1.
Take a longer Chinese search word "A, B, C, D" as an example. After word segmentation, the four participles "A", "B", "C" and "D" are obtained, and the Chinese names, English names, full spelling codes, simple spelling codes, and the like corresponding to each of the four are obtained as third search words.
The third search words corresponding to each participle are then matched by character matching. Taking character matching of the Chinese names of "A", "B", "C" and "D" as an example, if the matched Chinese names are "turtle", "hole-ethyl-hexyl", "dipropyl" and "pudding" respectively, the matching degrees are 1/2, 1/3, 1/2 and 1/2 respectively. Character matching is then carried out on the full spelling codes of "A", "B", "C" and "D", and the matching degrees obtained are 1, 2/3, 4/5 and 0 respectively. With a matching threshold of 1, the sum of the matching degrees of the full spelling codes of "A", "B" and "C" and the Chinese name of "D" is the highest, about 2.967, so the Chinese names corresponding to the full spelling codes of "A", "B" and "C" and the Chinese name of "D" are used as the matching words of the word to be searched.
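The arithmetic behind the figure of about 2.967 can be checked directly. The sketch below assumes that the "highest sum" is obtained by taking, for each participle, the better of its Chinese-name and full-spelling matching degrees; that reading is an interpretation of the example, and the variable names are illustrative.

```python
from fractions import Fraction as F

# Matching degrees from the example, in participle order A, B, C, D.
chinese_name_degrees = [F(1, 2), F(1, 3), F(1, 2), F(1, 2)]
full_spelling_degrees = [F(1), F(2, 3), F(4, 5), F(0)]

# Assumption: keep the better-scoring variant for each participle.
best_per_participle = [
    max(cn, fs) for cn, fs in zip(chinese_name_degrees, full_spelling_degrees)
]
total = sum(best_per_participle)  # 1 + 2/3 + 4/5 + 1/2 = 89/30
```

The full spelling code wins for the first three participles and the Chinese name wins for the fourth, and `float(total)` rounds to 2.967.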
A confidence score is computed for each candidate matching word of every third search term to help select the correct matching word; if the matching confidence reaches 90%, the term is automatically assigned to the metadata. If the matching is unsuccessful, manual matching is performed, and the vocabulary to be trained is updated with the manual matching result.
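A minimal sketch of this auto-assignment rule, assuming confidence scores are expressed as fractions in the range 0 to 1 and that the best candidate is the one with the highest confidence. The constant and function names are hypothetical.

```python
AUTO_ASSIGN_CONFIDENCE = 0.90  # the 90% threshold stated in the text


def assign_term(candidates):
    """Return (term, needs_manual_matching) for one metadata field.

    candidates: dict mapping candidate matching word -> confidence in [0, 1]
    (hypothetical shape chosen for this example).
    """
    if candidates:
        word, confidence = max(candidates.items(), key=lambda kv: kv[1])
        if confidence >= AUTO_ASSIGN_CONFIDENCE:
            return word, False
    # No candidate is confident enough: fall back to manual matching.
    return None, True
```

The second element of the result signals that the field should be routed to manual matching, whose outcome would then be used to update the vocabulary.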
The to-be-searched words are classified as: Chinese search terms, English search terms, mixed search terms, simple spelling search terms and English abbreviation search terms. Chinese search terms contain only Chinese characters, English search terms contain only English characters, and the rest are mixed search terms. The Chinese name of a Chinese search term is the search term itself, and its English name is an empty string; the English name of an English search term is the search term itself, and its Chinese name is an empty string; the Chinese name and the English name of a mixed search term are both the search term itself.
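The three rules that the paragraph spells out (Chinese only, English only, everything else mixed) can be sketched as below. The simple spelling and English abbreviation classes are not distinguished because the text gives no rule for them; the character-range test and the names `classify` and `names_for` are assumptions.

```python
def classify(term: str) -> str:
    """Classify a to-be-searched word as 'chinese', 'english' or 'mixed'."""
    if term and all("\u4e00" <= ch <= "\u9fff" for ch in term):
        return "chinese"
    if term and all(ch.isascii() and ch.isalpha() for ch in term):
        return "english"
    return "mixed"


def names_for(term: str):
    """Return (kind, chinese_name, english_name) following the stated rules."""
    kind = classify(term)
    chinese_name = term if kind in ("chinese", "mixed") else ""
    english_name = term if kind in ("english", "mixed") else ""
    return kind, chinese_name, english_name
```

Note that a pinyin string such as `xingming` is classified as English under these rules, since it contains only English letters.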
During character matching, the character string of the corresponding item in the vocabulary to be trained is matched character by character, from left to right, against the third search word, and the number of matching characters is counted; the order in which the characters occur is ignored, and English characters are matched case-insensitively.
When traversing the vocabulary to be trained in the vocabulary to be trained, traversing full spelling codes, Chinese names, simple spelling codes and the like of the Chinese word to be searched in sequence and calculating the matching degree; for English to-be-searched words, sequentially traversing English names, full spelling codes, simple spelling codes and the like and calculating the matching degree of the English to-be-searched words; and for the mixed word to be searched, traversing the full spelling codes, the English names, the simple spelling codes, the Chinese names and the like in sequence and calculating the matching degree of the full spelling codes, the English names, the simple spelling codes, the Chinese names and the like. If the full spelling code or the English name is found in the vocabulary to be trained and the full spelling code or the English name of the word to be searched is the vocabulary with the matching degree of 1 in the traversal calculation process, determining that the Chinese name of the word to be trained is a new search word, finishing the traversal, completing the matching of the word to be searched, and updating the vocabulary to be trained.
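The traversal order and the early exit on a matching degree of 1 might look like the sketch below. It simplifies the method by comparing the to-be-searched word itself against each field of an entry, uses dictionaries with hypothetical keys for vocabulary entries, and computes the matching degree as an order-insensitive, case-insensitive character overlap; all of these are assumptions for illustration.

```python
from collections import Counter

# Field order per kind of to-be-searched word, as listed in the text.
TRAVERSAL_ORDER = {
    "chinese": ("full_spelling", "chinese_name", "simple_spelling"),
    "english": ("english_name", "full_spelling", "simple_spelling"),
    "mixed": ("full_spelling", "english_name", "simple_spelling", "chinese_name"),
}


def degree(a: str, b: str) -> float:
    """Order-insensitive, case-insensitive character overlap over the longer length."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return sum((Counter(a) & Counter(b)).values()) / longest


def traverse(term, kind, vocabulary):
    """Return (chinese_name or None, scores); stop early on an exact hit."""
    scores = []
    for entry in vocabulary:
        for field in TRAVERSAL_ORDER[kind]:
            d = degree(term, entry[field])
            if d == 1 and field in ("full_spelling", "english_name"):
                # Exact full-spelling or English-name hit ends the traversal.
                return entry["chinese_name"], scores
            scores.append((entry["chinese_name"], field, d))
    return None, scores
```

When no exact hit occurs, the collected scores can be passed on to threshold-based selection.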
After all first search terms in the metadata training set have been matched, the conditional random field word segmentation algorithm is trained and updated according to the matching word that truly corresponds to each first search term and the matching word that was actually matched. The trained conditional random field word segmentation algorithm serves as the trained classifier, so that the trained classifier and the trained, updated vocabulary are finally obtained. The metadata fields to be matched are then matched using the same method as in training, together with the trained classifier and the trained vocabulary.
Each word to be searched that was not matched successfully is segmented using the conditional random field word segmentation algorithm, and the segments are classified using the trained classifier to obtain a classification result for each third search term. The third search terms are then matched against the trained vocabulary, the matching degree is calculated, and the matching word is determined according to the matching threshold. In this way, the matching accuracy of the words to be searched can be improved while matching remains automatic.
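The final threshold step can be illustrated with a very small sketch. The function name and the default threshold of 0.8 are assumptions; the text does not state a threshold value.

```python
def select_matching_word(degrees, threshold=0.8):
    """Pick the candidate with the highest matching degree, if it clears the threshold.

    degrees maps each candidate vocabulary word to its matching degree in [0, 1].
    Returns None when there are no candidates or the best one is below the threshold.
    """
    if not degrees:
        return None
    word, best = max(degrees.items(), key=lambda item: item[1])
    return word if best >= threshold else None
```

Returning `None` leaves the term unmatched, so that it can be handled manually or added to the vocabulary later.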
The dictionary storage data structure for the vocabulary to be trained and the trained vocabulary is as follows. Chinese has 7000 common characters and 56000 common words; although it is easy to load this data into memory, it is difficult to perform highly concurrent operations on it at millisecond latency. A double-array trie (Double-Array Trie) structure is therefore adopted, which represents the trie using only two linear arrays. This structure combines the efficient retrieval time of a digital search tree (Digital Search Tree) with the compact space usage of a linked representation of a trie, and can complete single-pattern matching in O(n) time.
In essence, the double-array trie is a deterministic finite automaton (DFA): each node represents a state of the automaton, state transitions are made according to the input, and a query is complete when an end state is reached or no further transition is possible. The relationships between the characters contained in all keys of the double array are expressed by simple addition, which not only improves retrieval speed but also eliminates the large number of pointers used in a linked structure, saving storage space.
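The transition rule "next state = base[state] + code, valid if check[next] == state" can be sketched as follows. This is an illustrative, unoptimized construction under stated assumptions: the class name, the character coding, the reserved end-of-key code, and the first-fit search for free slots are choices made for this sketch and are not taken from the patent.

```python
class DoubleArrayTrie:
    END = 1  # code reserved for the end-of-key marker (assumption of this sketch)

    def __init__(self, keys):
        chars = sorted({ch for key in keys for ch in key})
        self.code = {ch: i + 2 for i, ch in enumerate(chars)}
        # A plain nested-dict trie is used only while building the two arrays.
        root = {}
        for key in keys:
            node = root
            for ch in key:
                node = node.setdefault(self.code[ch], {})
            node[self.END] = {}
        self.base = [0]
        self.check = [0]  # -1 marks a free slot; slot 0 is the root state
        stack = [(0, root)]
        while stack:
            state, node = stack.pop()
            if not node:
                continue  # leaf: no outgoing transitions
            codes = sorted(node)
            b = self._find_base(codes)
            self.base[state] = b
            for c in codes:
                self.check[b + c] = state  # child slot b + c belongs to `state`
            for c in codes:
                stack.append((b + c, node[c]))

    def _find_base(self, codes):
        """First-fit search for an offset whose child slots are all free."""
        b = 1
        while True:
            need = b + codes[-1] + 1
            if need > len(self.check):
                extra = need - len(self.check)
                self.base.extend([0] * extra)
                self.check.extend([-1] * extra)
            if all(self.check[b + c] == -1 for c in codes):
                return b
            b += 1

    def _step(self, state, c):
        """One transition: a single addition and a single comparison."""
        t = self.base[state] + c
        if t < len(self.check) and self.check[t] == state:
            return t
        return -1

    def contains(self, key):
        state = 0
        for ch in key:
            c = self.code.get(ch)
            if c is None:
                return False
            state = self._step(state, c)
            if state < 0:
                return False
        return self._step(state, self.END) >= 0
```

A lookup performs one addition and one array comparison per character, which is where the O(n) single-pattern matching time in the key length comes from. For example, `DoubleArrayTrie(["用户", "用户名"]).contains("用户")` is true, while the prefix `"用"` is not reported as a key.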
To complete multi-pattern matching in O(n) time and construct a word graph, the Aho-Corasick algorithm is needed, which preprocesses the pattern strings into a finite state automaton. As shown in FIG. 3, take the pattern strings he/she/his/hers and the text "ushers" as an example. When leaf node 5 is reached for the first time, the next match can continue directly from node 2, so all pattern strings can be identified in a single traversal.
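The he/she/his/hers example can be reproduced with a compact Aho-Corasick sketch. The function names and the list-of-dicts representation are choices made for this illustration. When the patterns are inserted in the order he, she, his, hers, the states happen to be numbered so that "she" ends in state 5 and "he" in state 2, matching the node numbers mentioned above.

```python
from collections import deque


def build_automaton(patterns):
    """Build goto, failure, and output tables for the given patterns."""
    goto, fail, out = [{}], [0], [[]]
    for p in patterns:
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].append(p)
    # Breadth-first pass: the failure link of a state is the longest proper
    # suffix of its string that is also a prefix of some pattern.
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] = out[t] + out[fail[t]]
    return goto, fail, out


def search(text, patterns):
    """Return (start_index, pattern) for every occurrence, in one pass over text."""
    goto, fail, out = build_automaton(patterns)
    s, hits = 0, []
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]  # fall back instead of restarting from the root
        s = goto[s].get(ch, 0)
        for p in out[s]:
            hits.append((i - len(p) + 1, p))
    return hits
```

For the text "ushers", `search("ushers", ["he", "she", "his", "hers"])` returns `[(1, 'she'), (2, 'he'), (2, 'hers')]`: after "she" is recognized in state 5, the failure link leads to state 2, from which "hers" is completed without rescanning the text.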
The conditional random field word segmentation algorithm is a character-based word segmentation algorithm. A character-based algorithm does not match words in the sentence against a dictionary in advance; instead, it treats word segmentation as a sequence labeling problem in which each character is labeled B (begin), I (inside), O (outside), E (end) or S (single). Segmentation can therefore be regarded as a classification problem for each character: the input is the features formed by the character and the characters before and after it, and the output is the class label. This classification problem is solved with a statistical machine learning method. The conditional random field is a discriminative undirected graphical model that can model the conditional probability of multiple variables given observed values; for a given label sequence Y and observation sequence X, it defines the conditional probability P(Y | X) rather than modeling the joint probability. The conditional random field is currently the most commonly used algorithm for word segmentation, part-of-speech tagging and entity recognition, and it recognizes out-of-vocabulary words well. In the embodiments of the present application, the conditional random field word segmentation algorithm is used as the statistical machine learning method: the problem is abstracted, a model is obtained by training, and the model is then used to solve similar problems. The model may be regarded as a function f: taking the input characters as X, a label f(X) = Y is obtained for each character. In machine learning, models are generally divided into two categories, generative models and discriminative models, whose essential difference lies in the assumed generative relationship between X and Y.
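The labeling scheme itself, independent of how the labels are predicted, can be illustrated by converting between a segmentation and its per-character tags. The helper names are hypothetical, and the conditional random field that would predict the tags is not shown.

```python
def words_to_tags(words):
    """Turn a segmentation (list of words) into one B/I/E/S tag per character."""
    tags = []
    for w in words:
        if not w:
            continue
        if len(w) == 1:
            tags.append("S")  # single-character word
        else:
            tags.extend(["B"] + ["I"] * (len(w) - 2) + ["E"])
    return tags


def tags_to_words(chars, tags):
    """Rebuild the words from characters and their predicted tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        if tag == "O":  # character outside any word: close what is open, skip it
            if current:
                words.append(current)
                current = ""
            continue
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:  # tolerate a sequence that ends without E or S
        words.append(current)
    return words
```

A trained sequence labeler produces the tag list; `tags_to_words` is then the step that turns its output back into the segments used as third search terms.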
A generative model assumes that the output Y generates the input X according to some rule, and models the joint probability P(X, Y); a discriminative model assumes that Y is determined by X, and directly models the posterior probability P(Y | X). Each has its advantages and disadvantages: a generative model describes the relationships between variables more clearly, while a discriminative model is easier to build and train.
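For reference, the two modeling choices can be written out as below. The formulas are the standard textbook forms and are not stated in this application; the feature functions f_k, weights λ_k and sequence length T are notation introduced here for illustration.

```latex
% Generative: model the joint distribution, then obtain the posterior from it
P(X, Y) = P(Y)\,P(X \mid Y), \qquad
P(Y \mid X) = \frac{P(X, Y)}{\sum_{Y'} P(X, Y')}

% Discriminative (linear-chain conditional random field): model P(Y | X) directly
P(Y \mid X) = \frac{1}{Z(X)}
  \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k\, f_k(y_{t-1}, y_t, X, t) \Big),
\qquad
Z(X) = \sum_{Y'} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k\, f_k(y'_{t-1}, y'_t, X, t) \Big)
```

Here Y = (y_1, ..., y_T) is the tag sequence for the T characters of X, and Z(X) normalizes over all candidate tag sequences Y'.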
The embodiments of the present application use an adjacency matrix to store the matching degrees between the words to be searched, the third search terms and the words to be trained; when the words to be matched in the metadata fields to be matched are matched, the adjacency matrix likewise stores the matching degrees between the words to be matched and the matching words in the trained vocabulary. The adjacency matrix stores the relationships between individual characters and between words. An adjacency list stores the results of the matching calculation and the result weights between the words to be searched and the matching words. These are the data structures that hold the weights between search terms and matching terms when the matching degree is calculated.
As shown in FIG. 4, the adjacency matrix uses array subscripts to represent nodes and array values to represent edge weights; that is, d[i][j] = v means that the weight of the edge between node i and node j is v.
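A minimal sketch of such a matrix is shown below. The function name, the use of 0.0 for "no edge", and the treatment of edges as directed are assumptions made for the illustration.

```python
def build_adjacency_matrix(n, edges, no_edge=0.0):
    """Return an n x n matrix d where d[i][j] is the weight of the edge i -> j.

    edges is an iterable of (i, j, v) triples. Pairs that never appear keep
    the `no_edge` value.
    """
    d = [[no_edge] * n for _ in range(n)]
    for i, j, v in edges:
        d[i][j] = v
    return d


# Example: node 0 is a search term, nodes 1 and 2 are vocabulary words, and
# the weights are matching degrees (values invented for the example).
d = build_adjacency_matrix(3, [(0, 1, 0.9), (0, 2, 0.4)])
```

Reading `d[0][1]` then gives the matching degree between the search term and the first vocabulary word in constant time, at the cost of storing n × n entries.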
The adjacency list establishes a singly linked list for each node in the graph, which can greatly reduce storage space for a sparse graph. The nodes in the i-th singly linked list represent the edges attached to vertex i, as shown in FIG. 5. In practical applications, especially when the Viterbi algorithm is used to find the optimal path, the graph is traversed breadth-first, so it is preferable to store the graph as an adjacency list, which makes it convenient to access all nodes under a given node.
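The following sketch stores a small word graph as an adjacency list and finds the highest-scoring path with a Viterbi-style dynamic program. It assumes that nodes are character positions 0..n, that every edge goes from a lower to a higher position, and that path scores add up; the function name and the example weights are invented for the illustration.

```python
def best_path(adj, n):
    """Return the word sequence on the highest-scoring path from node 0 to node n.

    adj maps node i to a list of (j, weight, word) edges with j > i.
    Returns [] if node n cannot be reached.
    """
    best = [float("-inf")] * (n + 1)
    back = [None] * (n + 1)
    best[0] = 0.0
    for i in range(n):  # positions are already in topological order
        if best[i] == float("-inf"):
            continue
        for j, w, word in adj.get(i, []):  # all edges leaving node i
            if best[i] + w > best[j]:
                best[j] = best[i] + w
                back[j] = (i, word)
    if n > 0 and back[n] is None:
        return []
    words, j = [], n
    while j > 0:
        i, word = back[j]
        words.append(word)
        j = i
    return words[::-1]


# Word graph over the four characters of "用户名称".
graph = {
    0: [(1, 0.5, "用"), (2, 2.0, "用户"), (3, 2.2, "用户名")],
    1: [(2, 0.5, "户")],
    2: [(3, 0.5, "名"), (4, 2.0, "名称")],
    3: [(4, 0.5, "称")],
}
```

Because each node keeps only its own outgoing edges, the inner loop touches exactly the successors of node i, which is the access pattern the adjacency list is meant to make cheap. On this graph the best path is "用户" followed by "名称".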
In a second aspect, the present application provides a term matching system for metadata fields which, as shown in FIG. 6, includes:
the training module 101 is configured to preprocess a first search term in a metadata training set to obtain a second search term; judge the second search term and match it against the words to be trained in the vocabulary database to be trained, to obtain the words to be searched that were not matched successfully; segment the words to be searched using a conditional random field word segmentation algorithm to obtain a plurality of third search terms for each word to be searched, match the third search terms against the words to be trained, and determine matching words; and update the vocabulary database to be trained according to the matching words and train the conditional random field word segmentation algorithm to obtain a trained classifier and a trained vocabulary;
and the matching module 102 is configured to match the metadata fields to be matched by using the trained classifier and the trained vocabulary.
In a third aspect, the present application is directed to a storage medium storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the above-described term matching method for metadata fields.
According to the method, the second search terms are judged and matched against the words in the trained vocabulary database to obtain the words to be searched that were not matched successfully; these words are segmented using the conditional random field word segmentation algorithm to obtain a plurality of third search terms for each word to be searched; the third search terms are matched against the trained vocabulary, the matching degree is calculated, and the matching words are determined according to the matching threshold. The matching accuracy of the words to be searched can thereby be improved while matching remains automatic. In addition, the double-array trie (Double-Array Trie) structure improves retrieval speed and saves storage space.
The above description is only for the preferred embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A method for term matching of metadata fields, comprising:
preprocessing a first search term in a metadata training set to obtain a second search term;
judging the second search term, matching the second search term with the vocabulary to be trained in the vocabulary database to be trained, and acquiring the search term to be searched which is not matched successfully;
performing word segmentation on the words to be retrieved by using a conditional random field word segmentation algorithm to obtain a plurality of third retrieval words of each word to be retrieved, matching the third retrieval words with the words to be trained, and determining matched words;
updating a vocabulary database to be trained according to the matched words, and training the conditional random field word segmentation algorithm to obtain a trained classifier and a trained vocabulary;
and matching the metadata fields to be matched by using the trained classifier and the trained vocabulary.
2. The method of claim 1, wherein before preprocessing the first term in the metadata training set to obtain the second term, the method further comprises:
collecting table data, cleaning illegal characters in the table data, and establishing a vocabulary table to be trained;
and establishing a vocabulary database to be trained by using the vocabulary to be trained.
3. The method of claim 1, wherein the preprocessing a first term in a metadata training set to obtain a second term comprises:
acquiring all first search terms in metadata in a metadata training set;
and removing illegal characters in each first search word to obtain a second search word corresponding to each first search word.
4. The method for term matching of metadata fields as claimed in claim 1, wherein said determining the second search term to match the vocabulary to be trained in the vocabulary database to obtain the search term comprises:
judging the second search term to obtain a Chinese second search term and a non-Chinese second search term;
directly matching the Chinese second search terms in a vocabulary database to be trained, and determining the Chinese second search terms which cannot be matched with the matched terms;
and taking the non-Chinese second search term and the Chinese second search term which is not matched with the matching term as the to-be-searched term.
5. The method for term matching of metadata fields as claimed in claim 1, wherein the said word segmentation using conditional random field segmentation algorithm to obtain a plurality of third terms of each word to be retrieved, matching with the word to be trained to determine the matching word comprises:
performing word segmentation on the longer to-be-retrieved word by using a conditional random field word segmentation algorithm, and generating a plurality of third retrieval words of each word segmentation of the to-be-retrieved word, wherein the third retrieval words comprise: full spelling codes, abbreviated spelling codes, English names and/or Chinese names;
performing character matching on all the third search terms of each to-be-searched term and the to-be-trained vocabulary, and calculating the matching degree of each third search term corresponding to each to-be-searched term;
and determining the matching words of the words to be retrieved according to the matching degree and the matching threshold value.
6. The method for term matching of metadata fields according to claim 1, wherein said matching of metadata fields to be matched using the trained classifier and the trained vocabulary comprises:
matching the metadata fields to be matched by using a trained vocabulary, and acquiring the metadata fields with unsuccessful matching as words to be matched;
using a trained conditional random field word segmentation algorithm as a trained classifier, and performing word segmentation on each word to be matched to obtain a plurality of third search words corresponding to each word to be matched;
performing character matching on the plurality of third search terms in the trained vocabulary, and calculating the matching degree of each third search term;
and determining a matching word corresponding to each word to be matched according to the matching degree and the matching threshold, wherein the matching word is a matching term of the word to be matched.
7. The term matching method for metadata fields according to claim 2, wherein the table data comprises:
a client glossary, an industry standard field interpretation mapping table, and a metadata field and interpretation comparison table approved for a business system version, wherein each table comprises: Chinese name, English name, full spelling code and simple spelling code.
8. The term matching method for metadata fields according to claim 3, wherein the types of the first search term and the second search term each comprise: Chinese, full pinyin, shorthand pinyin, English acronym, and/or mixed.
9. A term matching system for metadata fields, comprising:
the training module is used for preprocessing a first search term in the metadata training set to obtain a second search term; judging the second search term, matching the second search term with the vocabulary to be trained in the vocabulary database to be trained, and acquiring the search term to be searched which is not matched successfully; performing word segmentation on the words to be retrieved by using a conditional random field word segmentation algorithm to obtain a plurality of third retrieval words of each word to be retrieved, matching the third retrieval words with the words to be trained, and determining matched words; updating a vocabulary database to be trained according to the matched words, and training the conditional random field word segmentation algorithm to obtain a trained classifier and a trained vocabulary;
and the matching module is used for matching the metadata fields to be matched by using the trained classifier and the trained vocabulary.
10. A storage medium storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform a term matching method for a metadata field as claimed in any one of claims 1 to 8.
CN202011342621.XA 2020-11-25 2020-11-25 Term matching method, matching system and storage medium for metadata field Active CN112580691B (en)


Publications (2)

Publication Number Publication Date
CN112580691A true CN112580691A (en) 2021-03-30
CN112580691B CN112580691B (en) 2024-05-14

Family

ID=75123569





Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant