CN112580691A - Term matching method, matching system and storage medium of metadata field
- Publication number
- CN112580691A (application number CN202011342621.XA)
- Authority
- CN
- China
- Prior art keywords
- matching
- trained
- term
- words
- word
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/28—Determining representative reference patterns, e.g. by averaging or distorting; Generating dictionaries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/38—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
Abstract
The application discloses a term matching method, a term matching system and a storage medium for metadata fields. The method comprises the following steps: preprocessing a first search term in a metadata training set to obtain a second search term; determining the type of the second search term, matching it against the vocabulary to be trained in the vocabulary database to be trained, and taking the second search terms that are not matched successfully as words to be retrieved; performing word segmentation on the words to be retrieved by using a conditional random field word segmentation algorithm to obtain a plurality of third search terms for each word to be retrieved, matching the third search terms against the vocabulary to be trained, and determining matching words; updating the vocabulary database to be trained according to the matching words, and training the conditional random field word segmentation algorithm to obtain a trained classifier and a trained vocabulary; and matching the words to be matched in the metadata fields to be matched by using the trained classifier and the trained vocabulary. Because the words to be retrieved are segmented with the conditional random field word segmentation algorithm before the matching words are determined, the matching accuracy can be improved while matching remains automatic.
Description
Technical Field
The present application relates to the field of data identification technologies, and in particular, to a term matching method, a term matching system, and a storage medium for metadata fields.
Background
Most businesses spend a great deal of time and effort dealing with disorganized and poorly matched data. Their employees either cannot find the appropriate data or do not trust the data they find. Most importantly, self-service and data autonomy processes are restricted by a wide variety of industry regulations. As a result, enterprises attempt to repair data through a variety of labor-intensive tasks (including writing custom programs, developing global replacement functions, etc.), which severely impacts the productivity of data analysts and data scientists. This is especially true for large enterprises, where years of mergers and acquisitions have brought together systems and databases of every kind, resulting in extremely complex data environments. While maintaining these legacy data environments is already exhausting for businesses, new data continues to be generated at an unpredictable rate.
In view of the foregoing, it is desirable to provide a term matching method, matching system, and storage medium for metadata fields that are automatic and accurate.
Disclosure of Invention
To solve the above problems, the present application proposes a term matching method, matching system, and storage medium for metadata fields.
In a first aspect, the present application provides a term matching method for metadata fields, including:
preprocessing a first search term in a metadata training set to obtain a second search term;
determining the type of the second search term, matching the second search term against the vocabulary to be trained in the vocabulary database to be trained, and taking the second search terms that are not matched successfully as words to be retrieved;
performing word segmentation on the words to be retrieved by using a conditional random field word segmentation algorithm to obtain a plurality of third search terms for each word to be retrieved, matching the third search terms against the vocabulary to be trained, and determining matching words;
updating the vocabulary database to be trained according to the matching words, and training the conditional random field word segmentation algorithm to obtain a trained classifier and a trained vocabulary;
and matching the metadata fields to be matched by using the trained classifier and the trained vocabulary.
Preferably, before the preprocessing the first search term in the metadata training set to obtain the second search term, the method further includes:
collecting table data, cleaning illegal characters in the table data, and establishing a vocabulary table to be trained;
and establishing a vocabulary database to be trained by using the vocabulary to be trained.
Preferably, the preprocessing the first search term in the metadata training set to obtain the second search term includes:
acquiring all first search terms in metadata in a metadata training set;
and removing illegal characters in each first search word to obtain a second search word corresponding to each first search word.
Preferably, the determining the second search term and matching the second search term with the vocabulary to be trained in the vocabulary database to be trained to obtain the search term includes:
judging the second search term to obtain a Chinese second search term and a non-Chinese second search term;
directly matching the Chinese second search terms in a vocabulary database to be trained, and determining the Chinese second search terms which cannot be matched with the matched terms;
and taking the non-Chinese second search term and the Chinese second search term which is not matched with the matching term as the to-be-searched term.
Preferably, the using a conditional random field word segmentation algorithm to segment the words to be retrieved to obtain a plurality of third retrieval words of each word to be retrieved, and matching the third retrieval words with the words to be trained to determine matching words includes:
performing word segmentation on the longer to-be-retrieved word by using a conditional random field word segmentation algorithm, and generating a plurality of third retrieval words of each word segmentation of the to-be-retrieved word, wherein the third retrieval words comprise: full spelling codes, abbreviated spelling codes, English names and/or Chinese names;
performing character matching on all the third search terms of each to-be-searched term and the to-be-trained vocabulary, and calculating the matching degree of each third search term corresponding to each to-be-searched term;
and determining the matching words of the words to be retrieved according to the matching degree and the matching threshold value.
Preferably, the matching the metadata fields to be matched by using the trained classifier and the trained vocabulary includes:
matching the metadata fields to be matched against the trained vocabulary, and taking the metadata fields that are not matched successfully as words to be matched;
using the trained conditional random field word segmentation algorithm as the trained classifier, and performing word segmentation on each word to be matched to obtain a plurality of third search terms corresponding to each word to be matched;
performing character matching on the plurality of third search terms in the trained vocabulary, and calculating the matching degree of each third search term;
and determining a matching word corresponding to each word to be matched according to the matching degree and the matching threshold, wherein the matching word is a matching term of the word to be matched.
Preferably, the table data includes:
a client glossary, an industry standard field interpretation mapping table, and a comparison table of metadata fields and interpretations finalized for a business system version, wherein each table comprises: Chinese name, English name, full spelling code and simple spelling code.
Preferably, the types of the first search term and the second search term each include: Chinese, full pinyin, abbreviated pinyin, English abbreviation, and/or a mixture of these.
In a second aspect, the present application provides a term matching system for metadata fields, comprising:
the training module is used for preprocessing a first search term in the metadata training set to obtain a second search term; judging the second search term, matching the second search term with the vocabulary to be trained in the vocabulary database to be trained, and acquiring the search term to be searched which is not matched successfully; performing word segmentation on the words to be retrieved by using a conditional random field word segmentation algorithm to obtain a plurality of third retrieval words of each word to be retrieved, matching the third retrieval words with the words to be trained, and determining matched words; updating a vocabulary database to be trained according to the matched words, and training the conditional random field word segmentation algorithm to obtain a trained classifier and a trained vocabulary;
and the matching module is used for matching the words to be matched in the metadata fields to be matched by using the trained classifier and the trained vocabulary.
In a third aspect, the present application is directed to a storage medium storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform a term matching method for metadata fields as described above.
The advantages of the application are as follows: the second search term is classified and matched against the vocabulary in the vocabulary database to obtain the words to be retrieved that were not matched successfully; the words to be retrieved are segmented by using the conditional random field word segmentation algorithm and classified by using the trained classifier to obtain a plurality of third search terms; the third search terms are matched against the trained vocabulary, the matching degree is calculated, and the matching words are determined according to the matching threshold. In this way the matching accuracy of the words to be retrieved can be improved while matching remains automatic.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to denote like parts throughout the drawings. In the drawings:
FIG. 1 is a schematic diagram of the steps of a term matching method for metadata fields provided herein;
FIG. 2 is a flow chart illustrating a term matching method for metadata fields provided herein;
FIG. 3 is a schematic diagram of the Aho-Corasick algorithm preprocessing a pattern string into a finite state automaton in the term matching method for metadata fields provided herein;
FIG. 4 is a diagram illustrating an adjacency matrix data structure used in the term matching method for metadata fields provided herein;
FIG. 5 is a diagram illustrating an adjacency list data structure used to solve the Viterbi algorithm in the term matching method for metadata fields provided herein;
FIG. 6 is a schematic diagram of a term matching system for metadata fields as provided herein.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
In a first aspect, according to an embodiment of the present application, a term matching method for metadata fields is provided, as shown in fig. 1, including:
S101, preprocessing a first search term in a metadata training set to obtain a second search term;
S102, determining the type of the second search term, matching the second search term against the vocabulary to be trained in the vocabulary database to be trained, and taking the second search terms that are not matched successfully as words to be retrieved;
S103, performing word segmentation on the words to be retrieved by using a conditional random field word segmentation algorithm to obtain a plurality of third search terms for each word to be retrieved, matching the third search terms against the vocabulary to be trained, and determining matching words;
S104, updating the vocabulary database to be trained according to the matching words, and training the conditional random field word segmentation algorithm to obtain a trained classifier and a trained vocabulary;
and S105, matching the metadata fields to be matched by using the trained classifier and the trained vocabulary.
Preferably, before preprocessing the first search term in the metadata training set to obtain the second search term, the method further includes: collecting table data, cleaning illegal characters in the table data, and establishing a vocabulary table to be trained; and establishing a vocabulary database to be trained by using the vocabulary to be trained.
The vocabulary database to be trained consists of lookup tables that map each vocabulary word and/or its participles to the corresponding Chinese name, simple spelling code, full spelling code, English name, English abbreviation, or a mixture of these.
The collected table data includes: a client glossary, an industry standard field interpretation mapping table, a comparison table of metadata fields and interpretations finalized for a business system version, and the like. Each table comprises: Chinese name, English name, full spelling code, simple spelling code, etc.
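The lookup table described above can be pictured as a small record per vocabulary word plus an index over all of its forms. The following is only a minimal sketch: the class name `VocabEntry`, the helper `build_index` and the sample entry are illustrative assumptions, not structures defined in the application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VocabEntry:
    """One row of a (hypothetical) vocabulary table to be trained."""
    chinese_name: str
    english_name: str
    full_pinyin: str    # "full spelling code"
    short_pinyin: str   # "simple spelling code"


def build_index(entries):
    """Map every non-empty form of an entry (lower-cased) to the entries using it."""
    index = {}
    for entry in entries:
        forms = (entry.chinese_name, entry.english_name,
                 entry.full_pinyin, entry.short_pinyin)
        for form in forms:
            if form:
                index.setdefault(form.lower(), []).append(entry)
    return index


# Sample data made up for illustration.
entries = [VocabEntry("姓名", "name", "xingming", "xm")]
index = build_index(entries)
```

Indexing every form lets a direct lookup succeed whether the incoming field is written in Chinese, English, full pinyin or pinyin initials.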
Preprocessing a first search term in a metadata training set to obtain a second search term, wherein the preprocessing comprises the following steps: acquiring all first search terms in metadata in a metadata training set; and removing illegal characters in each first search word to obtain a second search word corresponding to each first search word.
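The preprocessing step can be sketched as a simple character filter. The application does not define which characters are "illegal", so the rule below (keep CJK ideographs, ASCII letters, digits and underscores; drop everything else) is purely an assumption for illustration, as is the function name `preprocess`.

```python
import re

# Assumption: anything outside these ranges counts as an "illegal character".
_ILLEGAL = re.compile(r"[^0-9A-Za-z_\u4e00-\u9fff]")


def preprocess(first_terms):
    """Return a mapping from each first search term to its cleaned second search term."""
    return {term: _ILLEGAL.sub("", term) for term in first_terms}


second_terms = preprocess(["xing ming#", "客户-名称", "cust_name"])
```

Keeping the mapping from first to second search term preserves the label attached to each training term.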
The first search terms in the metadata training set are labeled words, i.e. words whose correct matching word has already been determined. The types of the first search term and the second search term each include: Chinese, full pinyin, abbreviated pinyin, English abbreviation, and/or a mixture of these.
Judging the second search term, and matching the second search term with the vocabulary to be trained in the vocabulary database to be trained to obtain the search term, wherein the search term comprises the following steps:
determining the type of each second search term to obtain Chinese second search terms and non-Chinese second search terms; directly matching the Chinese second search terms in the vocabulary database to be trained, and identifying the Chinese second search terms for which no matching word is found; and taking the non-Chinese second search terms and the unmatched Chinese second search terms as the words to be retrieved. Direct matching means looking the term up as-is.
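The split into directly matched terms and words to be retrieved can be sketched as follows. The Unicode range used to detect Chinese and the function name `select_terms_to_retrieve` are assumptions for illustration; the vocabulary is reduced to a plain set of Chinese names.

```python
import re

# Assumption: a "Chinese" term consists only of CJK unified ideographs.
_CHINESE_ONLY = re.compile(r"[\u4e00-\u9fff]+")


def select_terms_to_retrieve(second_terms, vocab_chinese_names):
    """Directly match Chinese terms; everything else becomes a word to be retrieved."""
    matched, to_retrieve = {}, []
    for term in second_terms:
        if _CHINESE_ONLY.fullmatch(term) and term in vocab_chinese_names:
            matched[term] = term          # direct lookup succeeded
        else:
            to_retrieve.append(term)      # non-Chinese, or Chinese without a match
    return matched, to_retrieve


matched, to_retrieve = select_terms_to_retrieve(
    ["姓名", "地址", "xing_ming"], {"姓名", "性别"}
)
```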
Performing word segmentation on the words to be retrieved by using a conditional random field word segmentation algorithm to obtain a plurality of third search terms for each word to be retrieved, matching the third search terms against the vocabulary to be trained, and determining the matching words, comprises the following steps: performing word segmentation on a longer word to be retrieved by using the conditional random field word segmentation algorithm, and generating a plurality of third search terms for each participle of the word to be retrieved, wherein the third search terms comprise: full spelling codes, abbreviated spelling codes, English names and/or Chinese names; performing character matching between all third search terms of each word to be retrieved and the vocabulary to be trained, and calculating the matching degree of each third search term; and determining the matching words of the word to be retrieved according to the matching degree and the matching threshold. The third search terms of a word to be retrieved may also be generated directly, without segmentation. If the second search term is a Chinese word, its third search terms can be generated directly.
The third search terms are the Chinese form, simple spelling code, English abbreviation and/or a mixture of these forms of the second search term.
Character matching is measured as the ratio of the number of characters matched between the third search term and the corresponding item (word) in the vocabulary to be trained to the length of the longer of the two character strings. The matching degree CD is defined as:
$$CD = \frac{I_C}{L_{MAX}}$$
where $I_C$ denotes the number of matched characters, and $L_{MAX}$ denotes the larger of the number of characters in the third search term and the number of characters in the corresponding item in the vocabulary to be trained.
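The matching degree can be computed in a few lines. This sketch assumes that "matched characters" means the multiset intersection of the two strings (so character order is ignored) and that English letters are compared case-insensitively; the function name `matching_degree` is illustrative.

```python
from collections import Counter


def matching_degree(third_term, vocab_item):
    """CD = I_C / L_MAX for two strings; returns 0.0 if either string is empty."""
    a, b = third_term.lower(), vocab_item.lower()
    if not a or not b:
        return 0.0
    # I_C: characters common to both strings, counted with multiplicity.
    matched = sum((Counter(a) & Counter(b)).values())
    # L_MAX: length of the longer string.
    return matched / max(len(a), len(b))
```

For example, `matching_degree("xing", "xingming")` shares four characters with an eight-character item and therefore evaluates to 0.5.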
Matching the metadata fields to be matched by using the trained classifier and the trained vocabulary comprises: matching the metadata fields to be matched against the trained vocabulary, and taking the metadata fields that are not matched successfully as words to be matched; using the trained Conditional Random Field (CRF) word segmentation algorithm as the trained classifier, and performing word segmentation on each word to be matched to obtain a plurality of third search terms corresponding to each word to be matched; performing character matching of the third search terms in the trained vocabulary, and calculating the matching degree of each third search term; and determining the matching word corresponding to each word to be matched according to the matching degree and the matching threshold, wherein the matching word is the matching term of the word to be matched.
When the trained vocabulary database is updated with the matching words, each matching word is stored together with its corresponding first, second and/or third search terms. The vocabulary database includes a plurality of vocabularies.
Matching follows the decision suggestion and the matching degree ordering. The decision suggestion is to select, for each word to be matched, the matching word (matching term) with the maximum matching degree. The matching degree ordering applies when several classification results share the same maximum matching degree: in that case the matching word that is ranked first, or whose matching degree was calculated first, is selected as the matching word of the word to be matched.
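The decision suggestion and the tie-breaking rule can be sketched as a single pass over the ranked candidates. The function name `pick_matching_word` is illustrative, and treating a matching degree equal to the threshold as acceptable is an assumption; the application does not say whether the comparison is strict.

```python
def pick_matching_word(scored_candidates, threshold):
    """Pick the candidate with the highest matching degree at or above the threshold.

    scored_candidates: list of (matching_word, matching_degree) in ranking order.
    On a tie, the candidate that appears first in the list wins.
    """
    best = None
    for word, degree in scored_candidates:
        if degree < threshold:
            continue
        # Strict '>' keeps the earlier candidate when degrees are equal.
        if best is None or degree > best[1]:
            best = (word, degree)
    return best[0] if best else None
```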
The following examples are provided to further illustrate the present application.
As shown in FIG. 2, a vocabulary to be trained must first be established for matching against the words in the metadata training set. The table data collected for this purpose includes at least: a client glossary containing Chinese names, English names, full spelling codes and/or simple spelling codes, an industry standard field interpretation mapping table, a comparison table of metadata fields and interpretations finalized for a business system version, and the like. The model is then pre-trained on an existing industry-oriented data set and its labels to form a machine-learning-based classifier that provides decision suggestions and term allocation rules. A confidence score is determined by establishing a term matching degree between 0 and 1, and the model is built from the letter order within the keywords and the number of matched characters.
In establishing the vocabulary to be trained, the collected table data is first cleaned by removing illegal characters, and the vocabulary words and their participles are established. The vocabulary to be trained is formed from lookup tables that map each vocabulary word and/or its participles to the corresponding Chinese name, abbreviated spelling code, full spelling code, English name, English abbreviation, or a mixture of these. The vocabulary database to be trained is then established from the vocabulary to be trained.
Metadata whose search terms (first search terms) have known meanings is obtained to form the metadata training set, and illegal characters are removed from each first search term to obtain the corresponding second search term.
The type of each second search term is determined, separating the Chinese words from the non-Chinese words to obtain Chinese second search terms and non-Chinese second search terms. The types of non-Chinese second search terms include: full spelling codes, abbreviated spelling codes, English abbreviations, and/or mixtures. The Chinese second search terms are looked up directly in the vocabulary database to be trained; the Chinese second search terms for which no matching word is found, together with the non-Chinese second search terms, are taken as the words to be retrieved. For the Chinese second search terms that do find a matching word, the matching word is used directly to update the vocabulary database to be trained.
The words to be retrieved are segmented by using the conditional random field word segmentation algorithm to obtain a plurality of third search terms for each word to be retrieved, the third search terms are matched against the vocabulary to be trained, and the matching words are determined. Specifically, the third search terms of a word to be retrieved may be generated directly, or a longer word to be retrieved may first be segmented by the conditional random field word segmentation algorithm and third search terms generated for each participle. The third search terms include: full spelling codes, abbreviated spelling codes, English names and/or Chinese names. For the Chinese form of a word to be retrieved, the full spelling code and the simple spelling code are obtained according to the pinyin spelling rules, the English name is obtained by looking the word up in a Chinese-English dictionary, and the English abbreviation is then obtained from a lookup table of English words and their abbreviations.
If the word to be retrieved is the Chinese word for "name" (pinyin: xingming), its third search terms are generated directly and include: name, xingming, and/or xm, etc. If the word to be retrieved is xing_ming, it is segmented by the conditional random field word segmentation algorithm and illegal characters are removed from each participle, giving the two participles xing and ming, for which third search terms are then generated. The third search terms corresponding to the participle xing include: family name, surname, last name, x, etc.; the third search terms corresponding to the participle ming include: name, given name, first name, m, etc.
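The xing_ming example can be traced with a toy pipeline. Note that this sketch does not use a conditional random field: segmentation is replaced by a plain split on non-alphanumeric characters, and the candidate table `CANDIDATES` is a hypothetical stand-in for the dictionary lookups described in the text.

```python
import re

# Hypothetical candidate table; the entries echo the examples given in the text.
CANDIDATES = {
    "xing": ["family name", "surname", "last name", "x"],
    "ming": ["name", "given name", "first name", "m"],
}


def segment(term):
    """Stand-in for CRF segmentation: split on runs of illegal characters."""
    return [part for part in re.split(r"[^0-9A-Za-z\u4e00-\u9fff]+", term) if part]


def third_terms(term):
    """Return the third search terms for each participle of the term."""
    return {seg: CANDIDATES.get(seg.lower(), [seg]) for seg in segment(term)}


result = third_terms("xing_ming")
```

A participle without an entry in the table falls back to itself as its only third search term.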
The words in the vocabulary to be trained are traversed, character matching is performed for all third search terms of each word to be retrieved, and the matching degree between each third search term and each vocabulary word is calculated. The matching words of the word to be retrieved are determined according to the matching degree and the set matching threshold: the Chinese names of the vocabulary words whose matching degrees have the highest sum, where that sum is larger than the matching threshold, are selected as the new search terms. For a word to be retrieved whose third search terms are English, if the full spelling code is the same as the simple spelling code, the matching degree of the full spelling code is not 1, and the matching degree of the simple spelling code is 1, then all Chinese names of the vocabulary words whose simple spelling code has a matching degree of 1 are selected.
Take a longer Chinese word to be retrieved, "ABCD", as an example. Word segmentation yields the four participles "A", "B", "C" and "D", and the Chinese names, English names, full spelling codes, simple spelling codes, etc. corresponding to these four characters are obtained as third search terms.
The third search terms corresponding to each participle are then matched by character matching. Taking the Chinese names of "A", "B", "C" and "D" as an example, if the matched Chinese names are "turtle", "hole-ethyl-hexyl", "dipropyl" and "pudding" respectively, the matching degrees are 1/2, 1/3, 1/2 and 1/2 respectively. Character matching is then performed on the full spelling codes of "A", "B", "C" and "D", giving matching degrees of 1, 2/3, 4/5 and 0 respectively. With a matching threshold of 1, the combination with the highest sum of matching degrees is the full spelling codes of "A", "B" and "C" together with the Chinese name of "D": 1 + 2/3 + 4/5 + 1/2 ≈ 2.967. The Chinese names of the vocabulary words matched by the full spelling codes of "A", "B" and "C" and by the Chinese name of "D" are therefore used as the matching words of the word to be retrieved.
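The arithmetic in this example can be checked with exact fractions. The sketch below takes the matching degrees stated above as given, picks the better of the two forms for each participle, and sums the result; the variable names are illustrative.

```python
from fractions import Fraction as F

# Matching degrees from the example, per participle.
chinese_name = {"A": F(1, 2), "B": F(1, 3), "C": F(1, 2), "D": F(1, 2)}
full_pinyin = {"A": F(1), "B": F(2, 3), "C": F(4, 5), "D": F(0)}

best = {}
for seg in "ABCD":
    options = {
        "Chinese name": chinese_name[seg],
        "full spelling code": full_pinyin[seg],
    }
    form = max(options, key=options.get)   # form with the higher matching degree
    best[seg] = (form, options[form])

total = sum(degree for _, degree in best.values())
print(best)
print(float(total))
```

The full spelling code wins for "A", "B" and "C", the Chinese name wins for "D", and the sum is 89/30, which rounds to 2.967 and exceeds the threshold of 1.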
A confidence score is computed for each candidate matching word of every third search term to help select the correct matching word; if the matching confidence reaches 90%, the term is assigned to the metadata automatically. If matching is unsuccessful, manual matching is performed, and the vocabulary to be trained is updated with the result of the manual matching.
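The automatic-versus-manual decision can be sketched as a simple threshold check. The function name `assign_term`, the dictionary layout of the result and the reading of "reaches 90%" as "greater than or equal to 0.90" are assumptions for illustration.

```python
def assign_term(field, candidates, auto_threshold=0.90):
    """Assign the best candidate automatically, or flag the field for manual matching.

    candidates: dict mapping candidate matching word -> confidence in [0, 1].
    """
    if candidates:
        word = max(candidates, key=candidates.get)
        if candidates[word] >= auto_threshold:
            return {"field": field, "term": word, "mode": "auto"}
    # No candidate is confident enough: leave the decision to a person.
    return {"field": field, "term": None, "mode": "manual"}
```

Fields returned with mode "manual" would be matched by hand, and the outcome fed back into the vocabulary to be trained.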
The words to be retrieved are classified into: Chinese search terms, English search terms, mixed search terms, simple spelling search terms and English abbreviation search terms. Chinese search terms contain only Chinese characters, English search terms contain only English characters, and all others are mixed search terms. For a Chinese search term, the Chinese name is the search term itself and the English name is an empty string; for an English search term, the English name is the search term itself and the Chinese name is an empty string; for a mixed search term, both the Chinese name and the English name are the search term itself.
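The three character-based classes and their name assignments can be sketched as follows. The text lists five classes but gives character rules for only three, so this sketch covers Chinese, English and mixed terms only; under these rules a simple spelling or English abbreviation term made purely of letters is classified as English. The function name `classify` is illustrative.

```python
import re


def classify(term):
    """Return (kind, chinese_name, english_name) for a word to be retrieved."""
    if re.fullmatch(r"[\u4e00-\u9fff]+", term):
        return "chinese", term, ""    # Chinese characters only
    if re.fullmatch(r"[A-Za-z]+", term):
        return "english", "", term    # English letters only
    return "mixed", term, term        # anything else
```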
In character matching against a third search term, the character string of the corresponding item of each word in the vocabulary to be trained is compared with the third search term character by character from left to right and the number of matching characters is counted; the order in which the characters occur is ignored, and English characters are matched case-insensitively.
When traversing the words in the vocabulary to be trained: for a Chinese word to be retrieved, the full spelling codes, Chinese names, simple spelling codes, etc. are traversed in that order and the matching degree is calculated; for an English word to be retrieved, the English names, full spelling codes, simple spelling codes, etc. are traversed in that order and the matching degree is calculated; for a mixed word to be retrieved, the full spelling codes, English names, simple spelling codes, Chinese names, etc. are traversed in that order and the matching degree is calculated. If, during the traversal, a vocabulary word is found whose full spelling code or English name matches the full spelling code or English name of the word to be retrieved with a matching degree of 1, the Chinese name of that vocabulary word is taken as the new search term, the traversal ends, the matching of the word to be retrieved is complete, and the vocabulary to be trained is updated.
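The type-dependent traversal order and the early exit on an exact hit can be sketched as below. The loop nesting (vocabulary entries outside, fields inside), the dictionary layout of the entries, and the names `ORDER`, `degree` and `traverse` are assumptions for illustration; the matching degree is computed as a multiset overlap divided by the longer length.

```python
from collections import Counter

# Field order per type of word to be retrieved, as listed in the text.
ORDER = {
    "chinese": ("full_pinyin", "chinese_name", "short_pinyin"),
    "english": ("english_name", "full_pinyin", "short_pinyin"),
    "mixed": ("full_pinyin", "english_name", "short_pinyin", "chinese_name"),
}


def degree(a, b):
    """Matching degree: shared characters over the longer length, case-insensitive."""
    a, b = a.lower(), b.lower()
    if not a or not b:
        return 0.0
    return sum((Counter(a) & Counter(b)).values()) / max(len(a), len(b))


def traverse(forms, kind, vocabulary):
    """Return (chinese_name, degree) of the best vocabulary entry for the given forms."""
    best = (None, 0.0)
    for entry in vocabulary:
        for field in ORDER[kind]:
            d = degree(forms.get(field, ""), entry.get(field, ""))
            # An exact hit on full pinyin or English name ends the traversal.
            if d == 1.0 and field in ("full_pinyin", "english_name"):
                return entry["chinese_name"], 1.0
            if d > best[1]:
                best = (entry["chinese_name"], d)
    return best


# Sample vocabulary made up for illustration.
vocabulary = [
    {"chinese_name": "性别", "english_name": "gender",
     "full_pinyin": "xingbie", "short_pinyin": "xb"},
    {"chinese_name": "姓名", "english_name": "name",
     "full_pinyin": "xingming", "short_pinyin": "xm"},
]
exact = traverse({"english_name": "name"}, "english", vocabulary)
approx = traverse({"full_pinyin": "xingmin"}, "chinese", vocabulary)
```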
After all first search terms in the metadata training set have been matched, the conditional random field word segmentation algorithm is trained and updated according to the matching words that truly correspond to the first search terms and the matching words actually obtained. The trained conditional random field word segmentation algorithm serves as the trained classifier, so that the trained classifier and the trained, updated vocabulary are finally obtained. The metadata fields to be matched are then matched using the same procedure as in training, together with the trained classifier and the trained vocabulary.
Words to be searched that were not matched successfully are segmented using the conditional random field word segmentation algorithm and then classified using the trained classifier to obtain the classification result of each third search term. The third search terms are matched against the trained vocabulary, the matching degree is calculated, and the matching words are determined according to the matching threshold. In this way the matching accuracy for the words to be searched can be improved while the matching remains automatic.
For the vocabulary to be trained and the trained vocabulary, the dictionary storage data structure is as follows. Chinese has 7000 commonly used characters and 56000 commonly used words. Although it is easy to load this data into memory, it is difficult to support high-concurrency, millisecond-level operations on it. A Double-Array Trie structure is therefore adopted, which represents the trie using only two linear arrays. The structure combines the efficient retrieval time of a digital search tree with the compact space usage of a linked trie representation, and can complete single-pattern matching in O(n) time.
In essence, the double-array trie is a deterministic finite automaton (DFA). Each node represents a state of the automaton, state transitions are made according to the input, and a query operation is complete when an end state is reached or no further transition is possible. The relationships between the characters contained in all keys of the double array are expressed by simple addition, which not only improves retrieval speed but also eliminates the large number of pointers used in a linked structure and thus saves storage space.
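The "simple addition" transition can be illustrated with a small double-array trie. The sketch below is a simplified teaching version under stated assumptions, not the structure used in the application: it first builds an ordinary nested-dictionary trie, then places each state's children with a naive first-fit search; character codes start at 1 and code 0 is reserved as an end-of-word marker. The names `build`, `contains` and `END` are hypothetical. A transition from state `s` on code `c` goes to `t = base[s] + c` and is valid only if `check[t] == s`.

```python
from collections import deque

END = 0  # assumption: code 0 marks "a word ends here"


def build(words):
    """Build (base, check, code) arrays for a set of words."""
    alphabet = sorted({ch for w in words for ch in w})
    code = {ch: i + 1 for i, ch in enumerate(alphabet)}
    root = {}
    for w in words:                      # ordinary trie keyed by character code
        node = root
        for ch in w:
            node = node.setdefault(code[ch], {})
        node[END] = {}
    base, check = [0], [0]               # state 0 is the root and is occupied

    def ensure(n):
        while len(base) <= n:
            base.append(0)
            check.append(-1)             # -1 means the slot is free

    queue = deque([(0, root)])
    while queue:
        s, node = queue.popleft()
        if not node:
            continue
        b = 1
        while True:                      # first-fit: find a base where all children fit
            ensure(b + max(node))
            if all(check[b + c] == -1 for c in node):
                break
            b += 1
        base[s] = b
        for c, child in node.items():
            check[b + c] = s             # slot b + c now belongs to state s
            queue.append((b + c, child))
    return base, check, code


def contains(base, check, code, word):
    """Follow transitions t = base[s] + c, validating each with check[t] == s."""
    s = 0
    for ch in word:
        c = code.get(ch)
        if c is None:
            return False
        t = base[s] + c
        if t >= len(check) or check[t] != s:
            return False
        s = t
    t = base[s] + END
    return t < len(check) and check[t] == s
```

Lookup touches only two integer arrays and performs one addition and one comparison per character, which is why no child pointers are needed.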
To complete multi-pattern matching in O(n) time and construct a word graph, the Aho-Corasick algorithm is needed, which preprocesses the pattern strings into a finite state automaton. As shown in FIG. 3, the pattern strings are, for example, he/she/his/hers and the text is "ushers". When leaf node 5 is reached for the first time, the next match can continue directly from node 2, so all pattern strings can be identified in a single traversal.
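The he/she/his/hers example can be reproduced with a compact Aho-Corasick sketch. This is a generic textbook formulation using Python dictionaries rather than the double-array representation, and the function names are hypothetical; state numbers here follow insertion order and need not coincide with the node numbers in FIG. 3.

```python
from collections import deque


def build_automaton(patterns):
    """Return goto, fail and output tables for the given pattern strings."""
    goto, fail, out = [{}], [0], [[]]
    for p in patterns:                       # phase 1: build the trie
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].append(p)
    queue = deque(goto[0].values())          # depth-1 states fail to the root
    while queue:                             # phase 2: failure links, breadth-first
        s = queue.popleft()
        for ch, t in goto[s].items():
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] = out[t] + out[fail[t]]   # inherit patterns ending at the suffix state
            queue.append(t)
    return goto, fail, out


def find_all(text, patterns):
    """Return (start_index, pattern) for every occurrence, in one pass over text."""
    goto, fail, out = build_automaton(patterns)
    s, hits = 0, []
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]                      # fall back instead of restarting
        s = goto[s].get(ch, 0)
        for p in out[s]:
            hits.append((i - len(p) + 1, p))
    return hits


print(find_all("ushers", ["he", "she", "his", "hers"]))
# [(1, 'she'), (2, 'he'), (2, 'hers')]
```

After "she" has been read, the failure link leads to the state for "he", so "he" is reported at the same position and the scan continues to "hers" without re-reading any text.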
The conditional random field word segmentation algorithm is a character-based word segmentation algorithm. A character-based algorithm does not match words in a sentence against a dictionary in advance; instead, it treats word segmentation as a sequence labeling problem in which each character is labeled B (begin), I (inside), O (outside), E (end) or S (single). Segmentation can therefore be regarded as a classification problem for each character: the input is the features formed by the character and the characters before and after it, and the output is the classification label. The classification problem is solved with a statistical machine learning method. The conditional random field is a discriminative undirected graphical model that models the conditional probability of multiple variables given observed values: for a label sequence Y and an observation sequence X, it defines the conditional probability P(Y | X) rather than modeling the joint probability. The conditional random field word segmentation algorithm is currently among the most commonly used algorithms for word segmentation, part-of-speech tagging and entity recognition, and it recognizes out-of-vocabulary words well. In the embodiments of the application, the conditional random field word segmentation algorithm is used as a statistical machine learning method: the problem is abstracted through a series of algorithms to obtain a model, and the model is then used to solve similar problems. The model can be regarded as a function f: taking the input characters as X, a label f(X) = Y is obtained for each character. In machine learning, models are generally divided into two categories, generative models and discriminative models, whose essential difference lies in the assumed generative relationship between X and Y.
The generative model assumes that the output Y generates the input X according to some rule, and models the joint probability P(X, Y); the discriminative model assumes that Y is determined by X, and directly models the posterior probability P(Y | X). Each has advantages and disadvantages: the generative model describes the relationships between variables more clearly, whereas the discriminative model is easier to build and learn.
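To make the labeling scheme concrete, the sketch below turns a per-character label sequence back into words. It covers only the decoding step; the labels would come from a trained conditional random field, which is not shown. The function name `tags_to_words` and the handling of malformed sequences (an unfinished word is emitted as it stands, and characters labeled O are skipped) are assumptions.

```python
def tags_to_words(chars, tags):
    """Rebuild words from per-character labels B, I, E, S and O."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        if tag == "B":                 # a new multi-character word begins
            if current:
                words.append(current)
            current = ch
        elif tag == "I":               # inside the current word
            current += ch
        elif tag == "E":               # last character of the current word
            words.append(current + ch)
            current = ""
        else:                          # "S" or "O" closes any open word
            if current:
                words.append(current)
                current = ""
            if tag == "S":             # single-character word
                words.append(ch)
    if current:                        # assumption: keep an unfinished word
        words.append(current)
    return words


print(tags_to_words("用户名称", ["B", "E", "B", "E"]))
# ['用户', '名称']
```

The classifier therefore only has to predict one of five labels per character, and the segmentation follows mechanically from the label sequence.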
The embodiment of the application uses an adjacency matrix to store the matching degrees between the words to be searched, the third search terms and the vocabulary to be trained; when the words to be matched in the metadata fields to be matched are matched, the matching degrees between the words to be matched and the matching words in the trained vocabulary are stored in the same way. The adjacency matrix stores the relationships between single characters and between words. An adjacency list stores the results of the matching calculation and the result weights between the words to be searched and the matching words. These data structures hold the weights used in calculating the matching degree, that is, the weights between the search terms and the matching terms.
As shown in FIG. 4, the adjacency matrix uses array subscripts to represent nodes and array values to represent edge weights; that is, d[i][j] = v means that the weight of the edge between node i and node j is v.
The adjacency list establishes a singly linked list for each node in the graph, which can save a great deal of storage space for sparse graphs. The nodes in the i-th singly linked list represent the edges incident to vertex i, as shown in FIG. 5. In practical applications, especially when the Viterbi algorithm is used to find the optimal path, the graph is traversed breadth-first, so it is preferable to store the graph as an adjacency list, which makes it convenient to access all nodes under a given node.
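The two storage schemes can be compared with a short sketch. It uses Python lists in place of singly linked lists, and the edge weights stand in for matching degrees; the function names and the sample weights are hypothetical.

```python
from collections import deque


def to_matrix(n, edges):
    """Adjacency matrix: d[i][j] = v is the weight of the edge from i to j."""
    d = [[0.0] * n for _ in range(n)]
    for i, j, v in edges:
        d[i][j] = v
    return d


def to_adjacency_list(n, edges):
    """Adjacency list: entry i holds (neighbour, weight) pairs for the edges of node i."""
    adj = [[] for _ in range(n)]
    for i, j, v in edges:
        adj[i].append((j, v))
    return adj


def bfs_order(adj, start):
    """Visit nodes breadth-first; each step reads only the neighbours of one node."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for neighbour, _weight in adj[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return order
```

With `n` nodes the matrix always needs `n * n` cells, whereas the list needs one entry per edge, and a breadth-first pass reads exactly the neighbours of each node without scanning a whole row.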
In a second aspect, the present application provides a term matching system for metadata fields, as shown in fig. 6, including:
the training module 101, configured to: preprocess a first search term in a metadata training set to obtain a second search term; judge the second search term, match it with the vocabulary to be trained in the vocabulary database to be trained, and acquire the words to be retrieved for which the matching was unsuccessful; perform word segmentation on the words to be retrieved using a conditional random field word segmentation algorithm to obtain a plurality of third search terms for each word to be retrieved, match the third search terms with the vocabulary to be trained, and determine matching words; and update the vocabulary database to be trained according to the matching words, and train the conditional random field word segmentation algorithm to obtain a trained classifier and a trained vocabulary;
and the matching module 102, configured to match the metadata fields to be matched using the trained classifier and the trained vocabulary.
In a third aspect, the present application is directed to a storage medium storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the above-described term matching method for metadata fields.
In the method, the second search term is judged and matched against the vocabulary in the trained vocabulary database to obtain the words to be searched for which the matching was unsuccessful. These words are segmented using the conditional random field word segmentation algorithm to obtain a plurality of third search terms for each word to be searched; the third search terms are matched against the trained vocabulary, the matching degree is calculated, and the matching words are determined according to the matching threshold, so that the matching accuracy for the words to be searched can be improved while the matching remains automatic. In addition, the Double-Array Trie structure improves retrieval speed and saves storage space.
The above description is only for the preferred embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (10)
1. A method for term matching of metadata fields, comprising:
preprocessing a first search term in a metadata training set to obtain a second search term;
judging the second search term, matching the second search term with the vocabulary to be trained in the vocabulary database to be trained, and acquiring the search term to be searched which is not matched successfully;
performing word segmentation on the words to be retrieved by using a conditional random field word segmentation algorithm to obtain a plurality of third retrieval words of each word to be retrieved, matching the third retrieval words with the words to be trained, and determining matched words;
updating a vocabulary database to be trained according to the matched words, and training the conditional random field word segmentation algorithm to obtain a trained classifier and a trained vocabulary;
and matching the metadata fields to be matched by using the trained classifier and the trained vocabulary.
2. The method of claim 1, wherein before preprocessing the first term in the metadata training set to obtain the second term, the method further comprises:
collecting table data, cleaning illegal characters in the table data, and establishing a vocabulary table to be trained;
and establishing a vocabulary database to be trained by using the vocabulary to be trained.
3. The method of claim 1, wherein the preprocessing a first term in a metadata training set to obtain a second term comprises:
acquiring all first search terms in metadata in a metadata training set;
and removing illegal characters in each first search word to obtain a second search word corresponding to each first search word.
4. The method for term matching of metadata fields as claimed in claim 1, wherein said judging the second search term, matching the second search term with the vocabulary to be trained in the vocabulary database to be trained, and acquiring the search term to be searched which is not matched successfully comprises:
judging the second search term to obtain a Chinese second search term and a non-Chinese second search term;
directly matching the Chinese second search terms in a vocabulary database to be trained, and determining the Chinese second search terms which cannot be matched with the matched terms;
and taking the non-Chinese second search term and the Chinese second search term which is not matched with the matching term as the to-be-searched term.
5. The method for term matching of metadata fields as claimed in claim 1, wherein said performing word segmentation on the words to be retrieved by using a conditional random field word segmentation algorithm to obtain a plurality of third retrieval words of each word to be retrieved, matching the third retrieval words with the words to be trained, and determining matched words comprises:
performing word segmentation on the longer to-be-retrieved word by using a conditional random field word segmentation algorithm, and generating a plurality of third retrieval words of each word segmentation of the to-be-retrieved word, wherein the third retrieval words comprise: full spelling codes, abbreviated spelling codes, English names and/or Chinese names;
performing character matching on all the third search terms of each to-be-searched term and the to-be-trained vocabulary, and calculating the matching degree of each third search term corresponding to each to-be-searched term;
and determining the matching words of the words to be retrieved according to the matching degree and the matching threshold value.
6. The method for term matching of metadata fields according to claim 1, wherein said matching of metadata fields to be matched using the trained classifier and the trained vocabulary comprises:
matching the metadata fields to be matched by using a trained vocabulary, and acquiring the metadata fields with unsuccessful matching as words to be matched;
using a trained conditional random field word segmentation algorithm as a trained classifier, and performing word segmentation on each word to be matched to obtain a plurality of third search words corresponding to each word to be matched;
performing character matching on the plurality of third search terms in the trained vocabulary, and calculating the matching degree of each third search term;
and determining a matching word corresponding to each word to be matched according to the matching degree and the matching threshold, wherein the matching word is a matching term of the word to be matched.
7. The term matching method for metadata fields according to claim 2, wherein the table data comprises:
a client glossary, an industry-standard field interpretation mapping table, and a comparison table of metadata fields and interpretations approved for a business system version, wherein each table comprises: a Chinese name, an English name, a full spelling code and a simple spelling code.
8. The term matching method for metadata fields according to claim 3, wherein the types of the first search term and the second search term each comprise: Chinese, full pinyin, shorthand pinyin, English acronym, and/or mixture.
9. A term matching system for metadata fields, comprising:
the training module is used for preprocessing a first search term in the metadata training set to obtain a second search term; judging the second search term, matching the second search term with the vocabulary to be trained in the vocabulary database to be trained, and acquiring the search term to be searched which is not matched successfully; performing word segmentation on the words to be retrieved by using a conditional random field word segmentation algorithm to obtain a plurality of third retrieval words of each word to be retrieved, matching the third retrieval words with the words to be trained, and determining matched words; updating a vocabulary database to be trained according to the matched words, and training the conditional random field word segmentation algorithm to obtain a trained classifier and a trained vocabulary;
and the matching module is used for matching the metadata fields to be matched by using the trained classifier and the trained vocabulary.
10. A storage medium storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform a term matching method for a metadata field as claimed in any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011342621.XA CN112580691B (en) | 2020-11-25 | 2020-11-25 | Term matching method, matching system and storage medium for metadata field |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011342621.XA CN112580691B (en) | 2020-11-25 | 2020-11-25 | Term matching method, matching system and storage medium for metadata field |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112580691A true CN112580691A (en) | 2021-03-30 |
CN112580691B CN112580691B (en) | 2024-05-14 |
Family
ID=75123569
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011342621.XA Active CN112580691B (en) | 2020-11-25 | 2020-11-25 | Term matching method, matching system and storage medium for metadata field |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112580691B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114969001A (en) * | 2022-05-24 | 2022-08-30 | 浪潮卓数大数据产业发展有限公司 | Database metadata field matching method, device, equipment and medium |
Citations (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101751430A (en) * | 2008-12-12 | 2010-06-23 | 汉王科技股份有限公司 | Electronic dictionary fuzzy searching method |
CN103336850A (en) * | 2013-07-24 | 2013-10-02 | 昆明理工大学 | Method and device for confirming index word in database retrieval system |
CN103412858A (en) * | 2012-07-02 | 2013-11-27 | 清华大学 | Method for large-scale feature matching of text content or network content analyses |
CN103810168A (en) * | 2012-11-06 | 2014-05-21 | 深圳市世纪光速信息技术有限公司 | Search application method, device and terminal |
CN106933883A (en) * | 2015-12-31 | 2017-07-07 | 中移(苏州)软件技术有限公司 | Point of interest Ordinary search word sorting technique, device based on retrieval daily record |
CN108255813A (en) * | 2018-01-23 | 2018-07-06 | 重庆邮电大学 | A kind of text matching technique based on term frequency-inverse document and CRF |
CN108304375A (en) * | 2017-11-13 | 2018-07-20 | 广州腾讯科技有限公司 | A kind of information identifying method and its equipment, storage medium, terminal |
US20180232443A1 (en) * | 2017-02-16 | 2018-08-16 | Globality, Inc. | Intelligent matching system with ontology-aided relation extraction |
US10242320B1 (en) * | 2018-04-19 | 2019-03-26 | Maana, Inc. | Machine assisted learning of entities |
CN110309368A (en) * | 2018-03-26 | 2019-10-08 | 腾讯科技(深圳)有限公司 | Determination method, apparatus, storage medium and the electronic device of data address |
CN110931137A (en) * | 2018-09-19 | 2020-03-27 | 京东方科技集团股份有限公司 | Machine-assisted dialog system, method and device |
CN111291195A (en) * | 2020-01-21 | 2020-06-16 | 腾讯科技(深圳)有限公司 | Data processing method, device, terminal and readable storage medium |
CN111310456A (en) * | 2020-02-13 | 2020-06-19 | 支付宝(杭州)信息技术有限公司 | Entity name matching method, device and equipment |
CN111899829A (en) * | 2020-07-31 | 2020-11-06 | 青岛百洋智能科技股份有限公司 | Full-text retrieval matching engine based on ICD9/10 participle lexicon |
CN111967242A (en) * | 2020-08-17 | 2020-11-20 | 支付宝(杭州)信息技术有限公司 | Text information extraction method, device and equipment |
CN112148885A (en) * | 2020-09-04 | 2020-12-29 | 上海晏鼠计算机技术股份有限公司 | Intelligent searching method and system based on knowledge graph |
-
2020
- 2020-11-25 CN CN202011342621.XA patent/CN112580691B/en active Active
Patent Citations (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101751430A (en) * | 2008-12-12 | 2010-06-23 | 汉王科技股份有限公司 | Electronic dictionary fuzzy searching method |
CN103412858A (en) * | 2012-07-02 | 2013-11-27 | 清华大学 | Method for large-scale feature matching of text content or network content analyses |
CN103810168A (en) * | 2012-11-06 | 2014-05-21 | 深圳市世纪光速信息技术有限公司 | Search application method, device and terminal |
US20150234927A1 (en) * | 2012-11-06 | 2015-08-20 | Tencent Technology (Shenzhen) Company Limited | Application search method, apparatus, and terminal |
CN103336850A (en) * | 2013-07-24 | 2013-10-02 | 昆明理工大学 | Method and device for confirming index word in database retrieval system |
CN106933883A (en) * | 2015-12-31 | 2017-07-07 | 中移(苏州)软件技术有限公司 | Point of interest Ordinary search word sorting technique, device based on retrieval daily record |
US20180232443A1 (en) * | 2017-02-16 | 2018-08-16 | Globality, Inc. | Intelligent matching system with ontology-aided relation extraction |
CN108304375A (en) * | 2017-11-13 | 2018-07-20 | 广州腾讯科技有限公司 | A kind of information identifying method and its equipment, storage medium, terminal |
CN108255813A (en) * | 2018-01-23 | 2018-07-06 | 重庆邮电大学 | A kind of text matching technique based on term frequency-inverse document and CRF |
CN110309368A (en) * | 2018-03-26 | 2019-10-08 | 腾讯科技(深圳)有限公司 | Determination method, apparatus, storage medium and the electronic device of data address |
US10242320B1 (en) * | 2018-04-19 | 2019-03-26 | Maana, Inc. | Machine assisted learning of entities |
CN110931137A (en) * | 2018-09-19 | 2020-03-27 | 京东方科技集团股份有限公司 | Machine-assisted dialog system, method and device |
CN111291195A (en) * | 2020-01-21 | 2020-06-16 | 腾讯科技(深圳)有限公司 | Data processing method, device, terminal and readable storage medium |
CN111310456A (en) * | 2020-02-13 | 2020-06-19 | 支付宝(杭州)信息技术有限公司 | Entity name matching method, device and equipment |
CN111899829A (en) * | 2020-07-31 | 2020-11-06 | 青岛百洋智能科技股份有限公司 | Full-text retrieval matching engine based on ICD9/10 participle lexicon |
CN111967242A (en) * | 2020-08-17 | 2020-11-20 | 支付宝(杭州)信息技术有限公司 | Text information extraction method, device and equipment |
CN112148885A (en) * | 2020-09-04 | 2020-12-29 | 上海晏鼠计算机技术股份有限公司 | Intelligent searching method and system based on knowledge graph |
Non-Patent Citations (2)
Title |
---|
FANJIN MAI 等: "Improved Chinese Word Segmentation Disambiguation Model Based on Conditional Random Fields", 《PROCEEDINGS OF THE 4TH INTERNATIONAL CONFERENCE ON COMPUTER ENGINEERING AND NETWORKS 》, vol. 355, 31 December 2015 (2015-12-31) * |
蒋婷: "学科领域本体学习及学术资源语义标注研究", 《中国博士学位论文全文数据库 信息科技辑》, 15 June 2018 (2018-06-15) * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114969001A (en) * | 2022-05-24 | 2022-08-30 | 浪潮卓数大数据产业发展有限公司 | Database metadata field matching method, device, equipment and medium |
CN114969001B (en) * | 2022-05-24 | 2024-05-10 | 浪潮卓数大数据产业发展有限公司 | Database metadata field matching method, device, equipment and medium |
Also Published As
Publication number | Publication date |
---|---|
CN112580691B (en) | 2024-05-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US8001018B2 (en) | System and method for automated part-number mapping | |
CN112115238A (en) | Question-answering method and system based on BERT and knowledge base | |
CN104408173A (en) | Method for automatically extracting kernel keyword based on B2B platform | |
Reyes-Galaviz et al. | A supervised gradient-based learning algorithm for optimized entity resolution | |
CN113673252B (en) | Automatic join recommendation method for data table based on field semantics | |
CN109614493B (en) | Text abbreviation recognition method and system based on supervision word vector | |
CN114911945A (en) | Knowledge graph-based multi-value chain data management auxiliary decision model construction method | |
CN112036178A (en) | Distribution network entity related semantic search method | |
CN116737967B (en) | Knowledge graph construction and perfecting system and method based on natural language | |
CN111325018A (en) | Domain dictionary construction method based on web retrieval and new word discovery | |
JPH0816620A (en) | Data sorting device/method, data sorting tree generation device/method, derivative extraction device/method, thesaurus construction device/method, and data processing system | |
Singh et al. | SciDr at SDU-2020: IDEAS--Identifying and Disambiguating Everyday Acronyms for Scientific Domain | |
CN115982338A (en) | Query path ordering-based domain knowledge graph question-answering method and system | |
CN112862569B (en) | Product appearance style evaluation method and system based on image and text multi-modal data | |
CN112784049B (en) | Text data-oriented online social platform multi-element knowledge acquisition method | |
CN114239828A (en) | Supply chain affair map construction method based on causal relationship | |
CN114493783A (en) | Commodity matching method based on double retrieval mechanism | |
CN112580691B (en) | Term matching method, matching system and storage medium for metadata field | |
CN113963748A (en) | Protein knowledge map vectorization method | |
Islam et al. | Applications of corpus-based semantic similarity and word segmentation to database schema matching | |
CN117648984A (en) | Intelligent question-answering method and system based on domain knowledge graph | |
CN117150006A (en) | Intelligent import and export commodity classification method integrating knowledge patterns | |
AL-Khassawneh et al. | Improving triangle-graph based text summarization using hybrid similarity function | |
CN115544211A (en) | Method for external trade and external law indexing and industry risk assessment | |
CN112613318B (en) | Entity name normalization system, method thereof and computer readable medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||