CN116127956A - Self-adaptive term normalization method based on double-tower model


Info

Publication number: CN116127956A
Application number: CN202310018843.3A
Authority: CN (China)
Prior art keywords: term, standard, sentence, double, word
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 袁静, 赵俊博, 陈刚, 鲁鹏, 周显锞
Current and original assignee: Institute Of Computer Innovation Technology Zhejiang University
Application filed by Institute Of Computer Innovation Technology Zhejiang University
Priority date and filing date: 2023-01-06
Publication date: 2023-05-16


Classifications

    • G06F40/247 Thesauruses; Synonyms (natural language analysis; lexical tools)
    • G06F16/3344 Query execution using natural language analysis
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G06F16/35 Clustering; Classification (unstructured textual data)
    • G06N3/08 Learning methods (neural networks)
    • G06N5/046 Forward inferencing; Production systems


Abstract

The invention discloses a self-adaptive term normalization method based on a double-tower model. For a term original word, several retrieval modes are used to search the standard term dictionary and recall standard terms similar to the original word; sample pairs are constructed and balanced to obtain an equalized sample pair set; the equalized sample pairs are input into a Sentence-BERT double-tower model for training, with the model outputting similarity labels and a prediction of the number of corresponding standard terms; the trained Sentence-BERT double-tower model then processes the standard term dictionary, and the resulting sentence vectors are stored in an offline vector database; finally, a term original word to be predicted is processed with the trained Sentence-BERT double-tower model and, in combination with the offline vector database, the standard terms with high similarity are predicted and matched to it. The invention can normalize and match non-standard term text, is optimized for industrial scenarios, has a small computational load and a high running speed, and greatly improves matching and normalization efficiency.

Description

Self-adaptive term normalization method based on double-tower model
Technical Field
The invention relates to a natural language text data processing method in the field of artificial intelligence, in particular to a self-adaptive term normalization method based on a double-tower model.
Background
In actual industrial application scenarios, many colloquial expressions, short names, abbreviations and the like point to the same standard term; these are called term original words. For example, the item name a user enters in a search or question-answering scenario may be an alias of the item name in the standard library, or a doctor may write a disease or operation using a colloquial description. Term normalization converts non-canonical term original words into standard terms, and one term original word may correspond to several standard terms.
In current normalization technology, a pre-trained model is fine-tuned, pairs of a term original word and several standard terms are classified, and the standard term with the highest probability is taken directly. First, this cannot adaptively recognize how many standard terms a term original word corresponds to; second, online prediction is inefficient and inference is slow.
Disclosure of Invention
In order to solve the problems in the background art, the invention provides a self-adaptive term normalization method based on a double-tower model.
The technical solution adopted by the invention is as follows:
1) For term original words whose correct standard terms are known, search the standard term dictionary with several retrieval modes and recall several standard terms similar to each term original word;
The term original word is the word input by a user that needs to be processed. The standard term dictionary is typically a known data table composed of standard terms.
2) Combine each recalled standard term with the term original word to form negative sample pairs (one standard term and the term original word form one negative pair), and combine the term original word with the correct standard term known in advance to form a positive sample pair, thereby obtaining all sample pairs;
3) Balance the positive and negative samples of all constructed sample pairs to obtain an equalized sample pair set;
4) Input the equalized sample pairs into a Sentence-BERT double-tower model for training; the model outputs a label, i.e. the similar/dissimilar classification of a sample pair, and a prediction result, i.e. the number of standard terms corresponding to the term original word;
5) Use the trained Sentence-BERT double-tower model to run inference on all standard terms in the standard term dictionary, and store the resulting sentence vectors in an offline vector database;
6) For a term original word to be predicted that is input by a user, run inference on it with the trained Sentence-BERT double-tower model and, in combination with the offline vector database, predict and output the standard terms with high similarity;
7) Normalization is achieved by matching the highly similar standard terms obtained in step 6) to the term original word to be predicted; the calibrated term original word can then be used in search.
The similarity between the term original word and each standard term in the standard term dictionary is computed with each of several retrieval modes, each mode serving as one recall path; each path recalls the top-T incorrect standard terms ranked by similarity to the term original word, i.e. the standard terms that are correct for the term original word are excluded. The similarity measures used for multi-path recall and the recall number of each path can be extended and adjusted according to the actual use scenario.
The Sentence-BERT double-tower model comprises an original-word branch, a standard-word branch, a vector fusion module, a similarity classifier and a number-prediction classifier. Each branch consists of a semantic module followed by a pooling module; the semantic module of the original-word branch receives the term original word and the semantic module of the standard-word branch receives the standard term. The pooling module of the original-word branch outputs the original-word sentence vector u to the number-prediction classifier, which predicts the number of standard terms corresponding to the term original word; the pooling modules of the two branches also output the original-word sentence vector u and the standard-word sentence vector v to the vector fusion module, and the vector fusion module feeds its output to the similarity classifier for similarity probability judgment.
The Sentence-BERT double-tower model is a Sentence-Transformer structure. The semantic module is the encoder part of a standard Transformer: a Transformer encoder unit is a stack of Multi-Head Attention + Layer Normalization + Feed-Forward + Layer Normalization, and each layer of BERT consists of one such encoder unit.
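As a non-limiting illustration, a minimal PyTorch sketch of one such post-norm encoder unit follows; the dimensions (hidden size 768, 12 heads, feed-forward size 3072) are the usual BERT-base values and are assumptions here:

```python
import torch.nn as nn

class EncoderUnit(nn.Module):
    """Multi-Head Attention + Layer Normalization + Feed-Forward + Layer Normalization."""
    def __init__(self, hidden=768, heads=12, ff=3072, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden, heads, dropout=dropout, batch_first=True)
        self.norm1 = nn.LayerNorm(hidden)
        self.ffn = nn.Sequential(nn.Linear(hidden, ff), nn.GELU(), nn.Linear(ff, hidden))
        self.norm2 = nn.LayerNorm(hidden)

    def forward(self, x, key_padding_mask=None):
        # Post-norm residual blocks, as in the original Transformer encoder and BERT.
        attn_out, _ = self.attn(x, x, x, key_padding_mask=key_padding_mask)
        x = self.norm1(x + attn_out)
        x = self.norm2(x + self.ffn(x))
        return x
```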
In step 4), the loss function is set as the weighted fusion of the binary cross-entropy loss for whether a sample pair is similar and the softmax loss for predicting the number of standard terms corresponding to the term original word, expressed as:

loss = γ·loss_binary + (1 - γ)·loss_multiclass

loss_binary = -[ y·log(ŷ) + (1 - y)·log(1 - ŷ) ]

loss_multiclass = -∑_{j=1}^{T} y_j·log(S_j)

where loss is the total loss, γ is the weight of the binary cross-entropy loss for pair similarity, loss_binary is that binary cross-entropy loss, y is the true label of whether the two elements of a sample pair are similar, ŷ is the predicted probability that they are similar; loss_multiclass is the softmax loss for predicting the number of standard terms corresponding to the term original word, y_j is the one-hot label value of the true sample, j indexes the j-th category, T is the total number of categories, and S_j is the j-th value of the softmax output vector S.
Step 6) is specifically as follows:
6.1) Use the trained Sentence-BERT double-tower model to encode the term original word to be predicted and output its original-word sentence vector u;
6.2) Process the term original word to be predicted in the same way as step 1) to obtain several standard terms similar to it, and deduplicate them;
6.3) Look up, in the pre-stored offline vector database, the standard-term sentence vector v corresponding to each standard term obtained in step 6.2), and form sentence-vector pairs from the original-word sentence vector u and each standard-term sentence vector v;
6.4) Feed each pair into the similarity classifier of the Sentence-BERT double-tower model to judge whether the pair is similar, then sort the standard terms obtained in step 6.2) in descending order of similarity probability;
6.5) Feed the encoded term original word vector into the number-prediction classifier of the Sentence-BERT double-tower model to predict the number of corresponding standard terms;
6.6) Truncate the standard terms sorted in step 6.4) at the number predicted by the model in step 6.5), and output the top-ranked standard terms.
Steps 6.4) and 6.5) are performed in parallel.
According to the invention, for each of several retrieval modes, the standard terms ranked in the top T (T ≥ 1) by similarity to the original word are recalled from the standard term library; a training data set is constructed from the multi-path recall results and sample-balanced; similarity and the number of corresponding standard terms are jointly trained with adaptive truncation on the double-tower model; and the training results are then used to normalize the actual terms to be processed.
The beneficial effects of the invention are as follows:
the invention can quickly and accurately search and acquire a non-standard term text, and the corresponding accurate standard word is endowed to the term text, for example, the invention can be applied to industrial scenes such as text input processing of medical terms, can optimize the reasoning efficiency in the industrial scenes and shorten the online reasoning time.
The method has the advantages of small calculated amount and high operation speed, and greatly improves the matching normalization efficiency.
For example, normalization of clinical terms, hundreds of different writing methods can exist for the same operation, medicine, diagnosis and the like, and the problem to be solved by normalization is to find corresponding standard speaking methods for different writing methods. Based on the term normalization, related researchers can perform related statistical analysis on the electronic medical records. Because the expression modes of the original words are too various, a good result is difficult to obtain by using a single semantic similarity matching model. In addition, in terms of efficiency, whether the medical history is backlogged in a hospital or a large number of real-time electronic medical records are generated daily, a model with higher online reasoning speed is needed to be carried out. The conventional speed is far from the speed of new data generation.
Drawings
FIG. 1 is a flow chart of the adaptive medical term normalization based on the double-tower model in the present invention;
FIG. 2 is the core structure of the double-tower network model of the present invention;
FIG. 3 is a sample of term original word normalization data.
Detailed Description
In order to make the technical problems to be solved, the technical solutions and the beneficial effects of the embodiments of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings and embodiments. The specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
To achieve the above object, as shown in FIG. 1, an embodiment of the present invention and its implementation process include:
1) For term original words whose correct standard terms are known, search the standard term dictionary with several retrieval modes and recall several standard terms similar to each term original word;
Specifically, the similarity between the term original word and each standard term in the standard term dictionary is computed with each of several retrieval modes, each mode serving as one recall path; each path recalls the top-T incorrect standard terms ranked by similarity to the term original word, i.e. the standard terms that are correct for the term original word are excluded. The similarity measures used for multi-path recall and the recall number of each path can be extended and adjusted according to the actual use scenario.
In this implementation, three retrieval modes are used, tf-idf, Jaccard and BM25, each recalling its top-5 standard terms, so 15 similar standard terms are recalled over the three paths. With only one standard term correct for the term original word, the ratio of positive to negative sample pairs is then 1:15.
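As a non-limiting illustration, a minimal Python sketch of this three-path recall follows; the character-level tokenization is an assumption suited to Chinese terms, the helper name multi_path_recall is hypothetical, and rank_bm25 is a third-party package (pip install rank-bm25):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from rank_bm25 import BM25Okapi  # third-party: pip install rank-bm25

def multi_path_recall(original_word, standard_terms, correct_terms, top_t=5):
    """Recall the top_t most similar *incorrect* standard terms per path."""
    candidates = [t for t in standard_terms if t not in set(correct_terms)]

    # Path 1: character-level tf-idf vectors scored by cosine similarity.
    vec = TfidfVectorizer(analyzer="char")
    term_matrix = vec.fit_transform(candidates)
    tfidf_scores = cosine_similarity(vec.transform([original_word]), term_matrix)[0]

    # Path 2: Jaccard similarity over character sets.
    q = set(original_word)
    jaccard_scores = np.array(
        [len(q & set(t)) / len(q | set(t)) for t in candidates])

    # Path 3: BM25 over single-character tokens.
    bm25 = BM25Okapi([list(t) for t in candidates])
    bm25_scores = bm25.get_scores(list(original_word))

    recalled = set()
    for scores in (tfidf_scores, jaccard_scores, bm25_scores):
        for i in np.argsort(scores)[::-1][:top_t]:  # top-T of this path
            recalled.add(candidates[i])
    return sorted(recalled)  # union of the three paths, deduplicated
```

With top_t=5 the three paths recall at most 15 candidates; after deduplication of the union the count may be smaller, which matches the deduplication performed at prediction time in step 6.2).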
2) Combine each recalled standard term with the term original word to form negative sample pairs (one standard term and the term original word form one negative pair), and combine the term original word with each correct standard term known in advance to form positive sample pairs; several correct standard terms may be input for one term original word; all sample pairs are thus obtained;
3) Balance the positive and negative samples of all constructed sample pairs to reduce the influence of the positive/negative ratio imbalance on the model, and obtain an equalized sample pair set;
The equalization in step 3) is specifically an augmentation that up-samples the positive sample pairs.
4) Input the equalized sample pairs into a Sentence-BERT double-tower model for training; the model outputs a label, i.e. the similar/dissimilar classification of a sample pair, and a prediction result, i.e. the number of standard terms corresponding to the term original word;
Before training, negative sample pairs are labeled as dissimilar and positive sample pairs as similar.
Note that the sample pairs are trained jointly on the binary similarity classification and on the prediction of the number of standard terms corresponding to the term original word.
In step 4), the loss function is set as the weighted fusion of the binary cross-entropy loss for whether a sample pair is similar and the softmax loss for predicting the number of standard terms corresponding to the term original word, expressed as:

loss = γ·loss_binary + (1 - γ)·loss_multiclass

loss_binary = -[ y·log(ŷ) + (1 - y)·log(1 - ŷ) ]

loss_multiclass = -∑_{j=1}^{T} y_j·log(S_j)

where loss is the total loss, γ is the weight of the binary cross-entropy loss for pair similarity, loss_binary is that binary cross-entropy loss, y is the true label of whether the two elements of a sample pair are similar, ŷ is the predicted probability that they are similar; loss_multiclass is the softmax loss for predicting the number of standard terms corresponding to the term original word, y_j is the one-hot label value of the true sample, j indexes the j-th category, T is the total number of categories, and S_j is the j-th value of the softmax output vector S.
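As a non-limiting illustration, this joint loss can be written in PyTorch as follows, assuming the similarity classifier emits one logit per sample pair and the number-prediction classifier emits one logit per count category:

```python
import torch
import torch.nn.functional as F

def joint_loss(sim_logit, sim_label, count_logits, count_label, gamma=0.5):
    # Binary cross entropy for the similar/dissimilar pair classification.
    loss_binary = F.binary_cross_entropy_with_logits(sim_logit, sim_label.float())
    # Softmax cross entropy for the standard-term count prediction.
    loss_multiclass = F.cross_entropy(count_logits, count_label)
    # Weighted fusion, matching loss = γ·loss_binary + (1 - γ)·loss_multiclass.
    return gamma * loss_binary + (1.0 - gamma) * loss_multiclass
```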
The Sentence-BERT double-tower model comprises an original-word branch, a standard-word branch, a vector fusion module, a similarity classifier and a number-prediction classifier. Each branch consists of a semantic module followed by a pooling module; the semantic module of the original-word branch receives the term original word and the semantic module of the standard-word branch receives the standard term. The pooling module of the original-word branch outputs the original-word sentence vector u to the number-prediction classifier, which predicts the number of standard terms corresponding to the term original word; the pooling modules of the two branches also output the original-word sentence vector u and the standard-word sentence vector v to the vector fusion module, and the vector fusion module feeds its output to the similarity classifier for similarity probability judgment.
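As a non-limiting illustration, a minimal PyTorch sketch of this double-tower structure follows; the shared BERT encoder, the mean pooling, and the concatenation [u; v; |u - v|] in the vector fusion module are assumptions in the spirit of Sentence-BERT, since the text does not fix these details:

```python
import torch
import torch.nn as nn
from transformers import AutoModel  # Hugging Face transformers

class DoubleTower(nn.Module):
    def __init__(self, model_name="bert-base-chinese", hidden=768, max_count=5):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)  # shared semantic module
        self.similarity_head = nn.Linear(3 * hidden, 1)       # similarity classifier
        self.count_head = nn.Linear(hidden, max_count)        # number-prediction classifier

    def encode(self, batch):
        """Semantic module + mean pooling -> sentence vector."""
        out = self.encoder(**batch).last_hidden_state         # (B, L, H)
        mask = batch["attention_mask"].unsqueeze(-1).float()
        return (out * mask).sum(dim=1) / mask.sum(dim=1)      # (B, H)

    def forward(self, original_batch, standard_batch):
        u = self.encode(original_batch)                       # original-word sentence vector u
        v = self.encode(standard_batch)                       # standard-word sentence vector v
        fused = torch.cat([u, v, torch.abs(u - v)], dim=-1)   # vector fusion module
        sim_logit = self.similarity_head(fused).squeeze(-1)   # pair-similarity logit
        count_logits = self.count_head(u)                     # standard-term count logits
        return sim_logit, count_logits
```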
5) Use the trained Sentence-BERT double-tower model to run inference on all standard terms in the standard term dictionary, and store the resulting sentence vectors in an offline vector database; that is, the standard-word sentence vector v obtained for each standard term is stored in the offline vector database.
At each inference, sentence-vector inference is performed only on the term original words that need to be normalized; compared with interactive (cross-encoder) inference, storing the Sentence-BERT sentence vectors offline saves a great deal of inference time.
For a standard term dictionary with N standard terms, the sentence vectors of the N standard terms are computed in advance with the trained Sentence-BERT double-tower model and stored. Each time the standard terms corresponding to a term original word are needed, the model only computes the original-word sentence vector online; similarity (cosine/dot product) is then computed between this freshly computed vector and the pre-computed standard-term sentence vectors, or the vectors are fed into a simple classifier. Likewise, finding the corresponding standard words for x term original words requires only x online inferences.
The cosine similarity or simple classifier appended after the towers is far faster than Transformer inference, so the method greatly accelerates computation and improves processing efficiency.
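As a non-limiting illustration, a sketch of this offline/online split follows, reusing the encode method of the DoubleTower sketch above; the helper names and the plain-dictionary "vector database" are illustrative assumptions:

```python
import numpy as np
import torch

def encode_text(model, tokenizer, text):
    batch = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        return model.encode(batch).squeeze(0).numpy()  # one sentence vector

def build_offline_db(model, tokenizer, standard_terms):
    # N offline inferences, performed once and stored for later lookup.
    return {t: encode_text(model, tokenizer, t) for t in standard_terms}

def rank_by_cosine(model, tokenizer, original_word, candidates, offline_db):
    u = encode_text(model, tokenizer, original_word)   # the only online inference
    scored = []
    for term in candidates:
        v = offline_db[term]                           # table lookup, no inference
        cos = float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
        scored.append((term, cos))
    return sorted(scored, key=lambda kv: -kv[1])
```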
6) For the term original word to be predicted that is input by a user, run inference on it with the trained Sentence-BERT double-tower model and, in combination with the offline vector database, predict and output the standard terms with high similarity:
6.1) Use the trained Sentence-BERT double-tower model to encode the term original word to be predicted and output its original-word sentence vector u;
6.2) Using the multi-path recall method, process the term original word to be predicted in the same way as step 1) to obtain several standard terms similar to it, and deduplicate them;
6.3) Look up, in the pre-stored offline vector database, the standard-word sentence vector v corresponding to each standard term obtained in step 6.2), and form sentence-vector pairs from the original-word sentence vector u and each standard-word sentence vector v;
6.4) Feed each pair into the similarity classifier of the Sentence-BERT double-tower model to judge whether the pair is similar, then sort the standard terms obtained in step 6.2) in descending order of similarity probability;
6.5) Feed the encoded term original word vector into the number-prediction classifier of the Sentence-BERT double-tower model to predict the number of standard terms, which adaptively truncates the sorted result of step 6.4);
6.6) Truncate the standard terms sorted in step 6.4) at the number predicted by the model in step 6.5), and output the top-ranked standard terms.
Steps 6.4) and 6.5) above are performed in parallel.
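As a non-limiting illustration, the following sketch strings steps 6.1) to 6.6) together; it reuses the multi_path_recall, DoubleTower and offline-database sketches above, and the mapping from count class j to j + 1 standard terms is an illustrative assumption:

```python
import torch

def normalize_term(model, tokenizer, original_word, standard_terms, offline_db, top_t=5):
    # 6.2) multi-path recall and deduplication (nothing to exclude at prediction
    # time, so the correct-term list is empty).
    candidates = multi_path_recall(original_word, standard_terms, [], top_t=top_t)

    batch = tokenizer(original_word, return_tensors="pt")
    with torch.no_grad():
        # 6.1) encode the term original word once online.
        u = model.encode(batch)                                  # (1, H)

        # 6.3) + 6.4) pair u with each pre-stored v and sort by similarity probability.
        scored = []
        for term in candidates:
            v = torch.from_numpy(offline_db[term]).unsqueeze(0)  # table lookup
            fused = torch.cat([u, v, torch.abs(u - v)], dim=-1)
            prob = torch.sigmoid(model.similarity_head(fused)).item()
            scored.append((term, prob))
        scored.sort(key=lambda kv: -kv[1])

        # 6.5) predict the number of corresponding standard terms from u alone.
        k = int(model.count_head(u).argmax(dim=-1).item()) + 1   # class j -> j + 1 terms

    # 6.6) adaptive truncation: keep only the top-k candidates.
    return [term for term, _ in scored[:k]]
```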
7) Normalization is achieved by matching the highly similar standard terms obtained in step 6) to the term original word to be predicted, as shown in FIG. 3.

Claims (6)

1. A self-adaptive term normalization method based on a double-tower model, characterized by comprising the following steps:
1) For a term original word, search the standard term dictionary with several retrieval modes and recall several standard terms similar to it;
2) Combine each recalled standard term with the term original word to form negative sample pairs (one standard term and the term original word form one negative pair), and combine the term original word with the correct standard term to form a positive sample pair, thereby obtaining all sample pairs;
3) Balance the positive and negative samples of all constructed sample pairs to obtain an equalized sample pair set;
4) Input the equalized sample pairs into a Sentence-BERT double-tower model for training; the model outputs a label, i.e. the similar/dissimilar classification of a sample pair, and a prediction result, i.e. the number of standard terms corresponding to the term original word;
5) Use the trained Sentence-BERT double-tower model to run inference on all standard terms in the standard term dictionary, and store the resulting sentence vectors in an offline vector database;
6) For a term original word to be predicted that is input by a user, run inference on it with the trained Sentence-BERT double-tower model and, in combination with the offline vector database, predict and output the standard terms with high similarity;
7) Normalization is achieved by matching the highly similar standard terms obtained in step 6) to the term original word to be predicted.
2. The self-adaptive term normalization method based on a double-tower model according to claim 1, wherein: the similarity between the term original word and each standard term in the standard term dictionary is computed with each of several retrieval modes, and each retrieval mode recalls the top-T incorrect standard terms ranked by similarity to the term original word, i.e. the standard terms that are correct for the term original word are excluded.
3. The self-adaptive term normalization method based on a double-tower model according to claim 1, wherein: the Sentence-BERT double-tower model comprises an original-word branch, a standard-word branch, a vector fusion module, a similarity classifier and a number-prediction classifier; each branch consists of a semantic module followed by a pooling module; the semantic module of the original-word branch receives the term original word and the semantic module of the standard-word branch receives the standard term; the pooling module of the original-word branch outputs the original-word sentence vector u to the number-prediction classifier, which predicts the number of standard terms corresponding to the term original word; the pooling modules of the two branches also output the original-word sentence vector u and the standard-word sentence vector v to the vector fusion module, and the vector fusion module feeds its output to the similarity classifier for similarity probability judgment.
4. The self-adaptive term normalization method based on a double-tower model according to claim 1, wherein: in step 4), the loss function is set as the weighted fusion of the binary cross-entropy loss for whether a sample pair is similar and the softmax loss for predicting the number of standard terms corresponding to the term original word, expressed as:

loss = γ·loss_binary + (1 - γ)·loss_multiclass

loss_binary = -[ y·log(ŷ) + (1 - y)·log(1 - ŷ) ]

loss_multiclass = -∑_{j=1}^{T} y_j·log(S_j)

where loss is the total loss, γ is the weight of the binary cross-entropy loss for pair similarity, loss_binary is that binary cross-entropy loss, y is the true label of whether the two elements of a sample pair are similar, ŷ is the predicted probability that they are similar; loss_multiclass is the softmax loss for predicting the number of standard terms corresponding to the term original word, y_j is the one-hot label value of the true sample, j indexes the j-th category, T is the total number of categories, and S_j is the j-th value of the softmax output vector S.
5. The self-adaptive term normalization method based on a double-tower model according to claim 1, wherein step 6) is specifically as follows:
6.1) Use the trained Sentence-BERT double-tower model to encode the term original word to be predicted and output its original-word sentence vector u;
6.2) Process the term original word to be predicted in the same way as step 1) to obtain several standard terms similar to it, and deduplicate them;
6.3) Look up, in the pre-stored offline vector database, the standard-term sentence vector v corresponding to each standard term obtained in step 6.2), and form sentence-vector pairs from the original-word sentence vector u and each standard-term sentence vector v;
6.4) Feed each pair into the similarity classifier of the Sentence-BERT double-tower model to judge whether the pair is similar, then sort the standard terms obtained in step 6.2) in descending order of similarity probability;
6.5) Feed the encoded term original word vector into the number-prediction classifier of the Sentence-BERT double-tower model to predict the number of corresponding standard terms;
6.6) Truncate the standard terms sorted in step 6.4) at the number predicted by the model in step 6.5), and output the top-ranked standard terms.
6. The self-adaptive term normalization method based on a double-tower model according to claim 5, wherein: steps 6.4) and 6.5) are performed in parallel.
CN202310018843.3A (filed 2023-01-06, priority date 2023-01-06): Self-adaptive term normalization method based on double-tower model. Status: Pending. Published as CN116127956A (en).

Priority Applications (1)

CN202310018843.3A, priority date 2023-01-06, filing date 2023-01-06: Self-adaptive term normalization method based on double-tower model.

Publications (1)

CN116127956A, publication date 2023-05-16.

Family

ID=86304124; family application CN202310018843.3A (status: Pending).

Country Status (1)

CN: CN116127956A (en)

Cited By (2)

* Cited by examiner, † Cited by third party

Publication number | Priority date | Publication date | Assignee | Title
CN116663536A (en) * | 2023-08-01 | 2023-08-29 | 北京惠每云科技有限公司 | Matching method and device for clinical diagnosis standard words
CN116663536B (en) * | 2023-08-01 | 2023-10-24 | 北京惠每云科技有限公司 | Matching method and device for clinical diagnosis standard words

Similar Documents

Publication | Title
CN108829757B (en) Intelligent service method, server and storage medium for chat robot
WO2020177282A1 (en) Machine dialogue method and apparatus, computer device, and storage medium
WO2021077974A1 (en) Personalized dialogue content generating method
CN108932342A (en) A kind of method of semantic matches, the learning method of model and server
CN109032375A (en) Candidate text sort method, device, equipment and storage medium
CN110895559B (en) Model training method, text processing method, device and equipment
CN112925904B (en) Lightweight text classification method based on Tucker decomposition
CN111506732B (en) Text multi-level label classification method
TWI734085B (en) Dialogue system using intention detection ensemble learning and method thereof
CN116150335A (en) Text semantic retrieval method under military scene
CN113255366B (en) Aspect-level text emotion analysis method based on heterogeneous graph neural network
CN114676255A (en) Text processing method, device, equipment, storage medium and computer program product
CN110647919A (en) Text clustering method and system based on K-means clustering and capsule network
CN116127956A (en) Self-adaptive term normalization method based on double-tower model
CN112015760A (en) Automatic question-answering method and device based on candidate answer set reordering and storage medium
CN112989803B (en) Entity link prediction method based on topic vector learning
CN110674276A (en) Robot self-learning method, robot terminal, device and readable storage medium
CN118228734A (en) Medical term normalization method for data enhancement based on large language model
CN117034921B (en) Prompt learning training method, device and medium based on user data
CN117743548A (en) Large-model-based local knowledge base intelligent question-answering method, system, equipment and readable storage medium
CN111666375A (en) Matching method of text similarity, electronic equipment and computer readable medium
CN114154517B (en) Dialogue quality assessment method and system based on deep learning
CN109902174A (en) A kind of feeling polarities detection method of the memory network relied on based on aspect
CN116403608A (en) Speech emotion recognition method based on multi-label correction and space-time collaborative fusion
CN115618092A (en) Information recommendation method and information recommendation system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination