CN114841148A - Text recognition model training method, model training device and electronic equipment - Google Patents

Text recognition model training method, model training device and electronic equipment

Info

Publication number
CN114841148A
Authority
CN
China
Prior art keywords
training
model
text
word
corpus
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210504719.3A
Other languages
Chinese (zh)
Inventor
金力
李树超
刘庆
李晓宇
孙显
张雅楠
董鹏程
吕博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Aerospace Information Research Institute of CAS
Original Assignee
Aerospace Information Research Institute of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Aerospace Information Research Institute of CAS filed Critical Aerospace Information Research Institute of CAS
Priority to CN202210504719.3A priority Critical patent/CN114841148A/en
Publication of CN114841148A publication Critical patent/CN114841148A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/355Class or cluster creation or modification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods

Abstract

The invention discloses a text recognition model training method, which comprises the following steps: performing sample enhancement on a training text to obtain a plurality of training corpora, wherein the training corpora carry labels and the training text comprises natural language text; inputting the training corpora into a first model to obtain vectorized representations of the training corpora; calculating the weight information of each word in a single training corpus relative to that corpus and the similarity information between the vectorized representations of the training corpora; training the first model by using the similarity information and the weight information; inputting the vectorized representations into a second model to obtain the category scores of the training corpora; and training a third model by using the category scores and the vectorized representations, and determining the trained third model as a text recognition model. The invention also discloses a model training device, an electronic device, a storage medium and a computer program product.

Description

Text recognition model training method, model training device and electronic equipment
Technical Field
The invention belongs to the technical field of deep learning and small samples, and particularly relates to a text recognition model training method, a model training device, electronic equipment, a storage medium and a computer program product.
Background
Deep learning models have achieved state-of-the-art results in tasks such as image classification and text classification. The success of deep learning models, however, depends largely on a large amount of training data. In real-world scenes, some categories have only a small amount of data or a small amount of labeled data, and labeling unlabeled data consumes a lot of time and labor. In contrast, humans need only a small amount of data to learn quickly: a human can quickly judge newly given samples according to previously learned knowledge combined with the labels of the few samples given currently, and the goal of small sample learning is likewise to make the machine think more like a human. For example, given a sentence S = "the wife of X is Y", together with entities e1 = "X", e2 = "Y" and the relationship r = "wife", when a sentence to be classified, "the husband of A is B", with entities e1 = "B" and e2 = "A", is input, the machine can quickly make a judgment on the entities according to the knowledge obtained by unsupervised learning and the given example, and output r = "couple".
Small sample learning aims at learning problem-solving models from a small number of samples. Machine learning and deep learning have succeeded in many fields thanks to the superior performance of models trained on big data. However, in many real-world application scenarios, the sample size is small or labeled samples are few, and labeling a large amount of unlabeled samples consumes a lot of manpower. Therefore, how to learn with a small number of samples has become a problem that deserves attention. Given a base class with enough labeled samples, the small sample task is to identify new, unlabeled samples when each new class has only a few labeled samples. Most of the existing methods focus only on the relation between samples and their labels and do not fully supplement the class knowledge among samples; in addition, existing small sample learning methods do not consider the importance of different words in a sentence to that sentence.
Disclosure of Invention
To solve at least part of the technical problems in the above and other aspects of the prior art, according to an embodiment of an aspect of the present invention, there is provided a text recognition model training method, including:
performing sample enhancement on the training text to obtain a plurality of training corpora, wherein the training corpora are provided with labels, and the training text comprises a natural language text;
inputting the training corpus into a first model to obtain vectorization representation of the training corpus;
calculating weight information of each word in a single training corpus relative to the training corpus, and similarity information between vectorized representations of a plurality of training corpora;
training the first model by using the similarity information and the weight information;
inputting the vectorization representation into a second model to obtain the category score of the training corpus;
and
and training a third model by using the category score and the vectorized representation, and determining the trained third model as the text recognition model.
In some embodiments of the invention, sample enhancing the training text comprises:
and performing sample enhancement on the training text by using a keyword mask mechanism.
In some embodiments of the invention, the first model is a BERT model, the second model is a fully connected model, the first model and the second model together form a teacher model, and the third model is a student model.
In some embodiments of the present invention, calculating similarity information between vectorized representations of a plurality of training corpora comprises:
calculating the similarity information among the vectorized representations of the training corpora by using a word shift distance method.
In some embodiments of the present invention, calculating the weight information of each word in a single corpus with respect to the corpus comprises:
and calculating the weight information of each word in the single corpus relative to the corpus by using a bidirectional attention mechanism, and giving different weights to each word.
In some embodiments of the present invention, the second model is trained by a loss function consisting of a cross entropy function and a KL divergence.
According to an embodiment of an aspect of the present invention, there is provided a model training apparatus including:
the enhancement module is used for carrying out sample enhancement on the training text to obtain a training corpus, wherein the training corpus is provided with a label, and the training text comprises a natural language text;
the first processing module is used for inputting the training corpus into a first model to obtain vectorization representation of the training corpus;
the first training module is used for acquiring similarity information and weight information of the vectorization representation and training the first model by using the similarity information and the weight information;
the second processing module is used for inputting the vectorization representation into a second model to obtain the category score of the training corpus; and
and the second training module is used for training a third model by using the category score and the vectorized representation and determining the trained third model as the text recognition model.
According to an embodiment of an aspect of the present invention, there is provided an electronic apparatus including:
one or more processors; and
a memory for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method described above.
According to an embodiment of an aspect of the present invention, there is provided a computer-readable storage medium having stored thereon executable instructions that, when executed by a processor, cause the processor to implement the above-described method.
According to an embodiment of an aspect of the invention, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the method described above.
By the text recognition model training method, the first model and the second model serve as the teacher model to generate the class scores of the original data, and the third model serves as the student model trained under the class scores provided by the teacher model, so that class knowledge can be effectively supplemented. Similarity information among the training corpora is calculated, which can effectively solve the problem that the model judges unfamiliar words inaccurately. Text enhancement is performed on the training samples, so that the text recognition model can effectively learn more context information.
Drawings
FIG. 1 schematically shows a flow diagram of a text recognition model training method according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a text recognition model training method according to an embodiment of the present invention;
FIG. 3 schematically shows a structural diagram of a knowledge distillation method according to an embodiment of the invention;
FIG. 4 is a schematic diagram illustrating the structure of a word-shifting distance method according to an embodiment of the present invention;
FIG. 5 schematically shows a prototype network schematic according to an embodiment of the invention;
FIG. 6 schematically illustrates a word shifting scheme in accordance with an embodiment of the invention;
FIG. 7 is a diagram schematically illustrating word shift distance calculation text similarity according to an embodiment of the present invention;
FIG. 8 schematically illustrates a knowledge distillation schematic according to an embodiment of the invention;
FIG. 9 schematically illustrates a BERT model diagram according to an embodiment of the invention;
FIG. 10 schematically illustrates an input schematic of a BERT according to an embodiment of the present invention;
FIG. 11 is a diagram that schematically illustrates a word-shift distance weight calculation, in accordance with an embodiment of the present invention;
FIG. 12 schematically illustrates a loss calculation schematic of a knowledge distillation method according to an embodiment of the invention;
FIG. 13 is a schematic diagram illustrating a visualization of sentence-level computation of weights in conjunction with Cosine similarity according to an embodiment of the present invention;
FIG. 14 is a schematic diagram illustrating a visualization of the calculation of weights by batch in combination with Cosine similarity, according to an embodiment of the present invention;
FIG. 15 is a schematic diagram illustrating a visualization of sentence-level computation weights in combination with Euclidean distances, according to an embodiment of the invention;
FIG. 16 is a schematic diagram illustrating a visualization of the calculation of weights by the batch category in combination with Euclidean distances according to an embodiment of the present invention;
FIG. 17 schematically shows the effect on the vector space when knowledge distillation is not used, according to an embodiment of the present invention;
FIG. 18 schematically shows a diagram of the effect of using knowledge distillation on vector space according to an embodiment of the invention.
Detailed Description
In order that the objects, technical solutions and advantages of the present invention will become more apparent, the present invention will be further described in detail with reference to the accompanying drawings in conjunction with the following specific embodiments.
It is to be understood that such description is merely illustrative and not intended to limit the scope of the present invention. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the invention. It may be evident, however, that one or more embodiments may be practiced without these specific details. Furthermore, in the following description, descriptions of well-known technologies are omitted so as to avoid unnecessarily obscuring the concepts of the present invention.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. The term "comprising" as used herein indicates the presence of the features, steps, operations but does not preclude the presence or addition of one or more other features.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It is noted that the terms used herein should be interpreted as having a meaning that is consistent with the context of this specification and should not be interpreted in an idealized or overly formal sense.
In the technical scheme of the invention, the collection, storage, use, processing, transmission, provision, disclosure, application and other processing of the personal information of the related user all comply with the regulations of relevant laws and regulations, necessary confidentiality measures are taken, and public order and good customs are not violated.
In the technical scheme of the invention, before the personal information of the user is acquired or collected, the authorization or the consent of the user is acquired.
An embodiment of the present invention provides a method for training a text recognition model, and fig. 1 schematically illustrates a flow chart of the method for training a text recognition model according to an embodiment of the present invention.
As shown in fig. 1, the method includes operations S101 to S106.
In operation S101, a sample enhancement is performed on a training text to obtain a plurality of training corpora, where the training corpora have tags, and the training text includes a natural language text.
In some embodiments of the invention, a keyword mask mechanism is used for sample enhancement to improve the generalization capability of the model. A tf-idf algorithm is used to find the words with higher tf-idf values in the training text, and these words are randomly replaced with a certain probability, so that the model is forced to learn more context information. The tf-idf value is calculated as:
tf-idf = tf (term frequency) × idf (inverse document frequency); (1)
The term frequency is calculated as:
tf (term frequency) = number of times the word appears in the document / total number of words in the document; (2)
The inverse document frequency is:
idf (inverse document frequency) = log(total number of documents in the corpus / number of documents containing the word); (3)
By randomly masking the words with higher tf-idf values, the model is kept from relying excessively on such concept words, and the generalization of the model is improved.
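For illustration only, the following Python sketch shows one way such tf-idf-based keyword masking could be implemented; the function names, the top-k keyword selection and the default mask probability are assumptions of the example, not requirements of the embodiments.

```python
import math
import random
from collections import Counter

def tf_idf_scores(docs):
    """tf-idf per word for each document; docs is a list of token lists (formulas (1)-(3))."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))                      # document frequency of each word
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({w: (tf[w] / len(doc)) * math.log(n_docs / df[w]) for w in tf})
    return scores

def keyword_mask(tokens, scores, top_k=3, mask_prob=0.1, mask_token="[MASK]"):
    """Randomly replace the top-k highest tf-idf words of a sentence with a mask token."""
    keywords = {w for w, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]}
    return [mask_token if (w in keywords and random.random() < mask_prob) else w
            for w in tokens]

docs = [["ribavirin", "remains", "essential", "to", "chronic", "hepatitis", "therapy"],
        ["thiabendazole", "is", "effective", "against", "strongyloides", "infection"]]
scores = tf_idf_scores(docs)
print(keyword_mask(docs[0], scores[0], mask_prob=0.5))
```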
In some embodiments of the present invention, two special identifiers, <entity1> and <entity2>, are added in front of the two entities in the corpus to be processed, so that the model can recognize the two entities; the identifiers also serve as a prompt to the model.
In operation S102, the corpus is input into the first model, and a vectorized representation of the corpus is obtained.
In some embodiments of the present invention, a vectorized representation of the corpus is obtained by processing the corpus with the first model, that is, a BERT model (Bidirectional Encoder Representations from Transformers). The BERT model can output a sentence vector composed of the word vectors of each word in the sentence, with the formula:
H = [h_1, h_2, ..., h_k] = E([w_1, w_2, ..., w_k]); (4)
where H represents the vectorized representation of the training corpus, E represents the encoder, and w_1, ..., w_k represent the words of the input training corpus.
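A minimal sketch of this encoding step using the Hugging Face transformers library is given below for illustration; the checkpoint name and the mean pooling into a sentence vector are assumptions of the example, not part of the embodiments.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")

sentences = ["Ribavirin remains essential to chronic hepatitis therapy."]
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = encoder(**batch)

# H = [h_1, ..., h_k] as in formula (4): one hidden vector per (sub)word token.
H = outputs.last_hidden_state          # shape: (batch_size, seq_len, 768)
sentence_vec = H.mean(dim=1)           # a simple pooled sentence representation
print(H.shape, sentence_vec.shape)
```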
In operation S103, the weight information of each word in a single training corpus with respect to that corpus and the similarity information between the vectorized representations of a plurality of training corpora are calculated.
In some embodiments of the present invention, the importance of each word in the sentence for measuring similarity is calculated by a word shift distance method, and the weight of unimportant words is reduced. The weight of each word is obtained by solving the word shift distance minimization problem, with the formula:
min Σ_{i=1}^{m} Σ_{j=1}^{k} x_ij · c_ij; (5)
where x_ij is the weight occupied by the j-th word of the i-th sentence and c_ij is the cost computed from the word vector of the j-th word of the i-th sentence. The above formula can be converted into the following quadratic programming problem:
subject to x_ij ≥ 0, i = 1, ..., m, j = 1, ..., k; (6)
Σ_j x_ij = s_i, i = 1, ..., m; (7)
Σ_i x_ij = d_j, j = 1, ..., k; (8)
where s_i represents the weight information of the target training corpus and d_j represents the weight information of the comparison training corpus. The weight information of each word in a single target training corpus relative to the comparison training corpus is calculated by a bidirectional attention mechanism, each word is given a different weight, and the similarity information between the two training corpora is calculated from the weight information of the target training corpus and the weight information of the comparison training corpus.
In operation S104, the first model is trained using the similarity information and the weight information.
In some embodiments of the present invention, the obtained similarity information and weight information are input into a first model, that is, a BERT model, and the BERT model is trained to help the BERT model to more accurately process the training corpus, so as to obtain a more effective vectorization representation.
In operation S105, the vectorized representation is input into the second model, and the category score of the corpus is obtained.
In some embodiments of the present invention, the vectorization representation of the corpus generated by the BERT model is input into the second model, that is, the fully-connected model, so as to obtain the category score of the corpus, where the formula is as follows:
p_i(x; τ) = exp(s_i(x)/τ) / Σ_j exp(s_j(x)/τ); (9)
where τ is a hyperparameter used to control the class scores output by the second model, s_i(x) is the logit score of x on class i, p_i(x; τ) is the normalized class score, and softmax is a normalization function composed of exponential functions. After normalization, the probability of the class corresponding to the label is far higher than the probabilities of the other classes, and the second model learns the relations between different classes, some of which are similar to one another.
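A minimal sketch of this step is shown below, assuming the second model is a single fully connected layer; the hidden size, class count and temperature value are illustrative choices only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden_dim, num_classes, tau = 768, 10, 2.0       # tau: temperature hyperparameter

fc = nn.Linear(hidden_dim, num_classes)           # the fully connected "second model"

sentence_vec = torch.randn(4, hidden_dim)         # vectorized corpora from the first model
logits = fc(sentence_vec)                         # s_i(x): per-class logit scores
class_scores = F.softmax(logits / tau, dim=-1)    # p_i(x; tau) as in formula (9)
print(class_scores.sum(dim=-1))                   # each row sums to 1
```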
In operation S106, the third model is trained using the category score and the vectorized representation, and the trained third model is determined as the text recognition model.
FIG. 2 is a schematic structural diagram of a text recognition model training method according to an embodiment of the present invention.
In some embodiments of the present invention, as shown in fig. 2, the support set and the query set are two training texts whose similarity is to be calculated, the training texts are obtained through a data set, test data is generated after training of a Teacher model and a Student model, and finally query classification is performed by the Student model to generate a category score.
In some embodiments of the invention, the first model and the second model are used as a teacher model and the third model is used as a student model by a knowledge distillation method, and the third model is trained by using the output parameters of the first model and the second model, so that the third model can better represent the class characteristics of the training text.
By the text recognition model training method, the first model and the second model serve as the teacher model to generate the class scores of the original data, and the third model serves as the student model trained under the class scores provided by the teacher model, so that class knowledge can be effectively supplemented. Similarity information among the training corpora is calculated, which can effectively solve the problem that the model judges unfamiliar words inaccurately. Text enhancement is performed on the training samples, so that the text recognition model can effectively learn more context information.
In some embodiments of the invention, sample enhancing training text comprises: and performing sample enhancement on the training text by using a keyword mask mechanism, and improving the generalization capability of the first model.
Figure 3 schematically shows a structural diagram of the knowledge distillation method according to an embodiment of the invention.
In some embodiments of the present invention, as shown in fig. 3, the first model is a BERT model and the second model is an FFN (feed-forward network) fully connected model, where LR is a scoring function; the first model and the second model together form the teacher model, and the third model is the student model. The student model is trained through the teacher model by a knowledge distillation method; the teacher model provides soft labels to train the classification module of the student model, and finally the class score ψ is output by the student model.
In some embodiments of the present invention, calculating similarity information between vectorized representations of a plurality of training corpora comprises: calculating the similarity information among the vectorized representations of the training corpora by using a word shift distance (word mover's distance) method, which gives the similarity information between sentences and mitigates the adverse effect on similarity calculation when the student model represents new words incorrectly.
Fig. 4 schematically shows a structural diagram of a word-shifting distance method according to an embodiment of the present invention.
In some embodiments of the present invention, as shown in FIG. 4, the word shift distance is calculated between two example sentences from the medical domain, one concerning Ribavirin and chronic Hepatitis C therapy and the other concerning Thiabendazole and strongyloides infection. Here s_i and d_j denote the weight information of each word in the sentence where it is located: s_1, ..., s_7 are the weights of the words of the first sentence (for example, s_1 for "Ribavirin", s_5 for "Hepatitis" and s_7 for "therapy"), and d_1, ..., d_6 are the weights of the words of the second sentence (for example, d_1 for "Thiabendazole" and d_6 for "infection"). c_ij represents the Euclidean distance between a word of one sentence and a word of the other sentence, and x_ij represents the transfer weight that a word of one sentence and a word of the other sentence account for between the two sentences. The similarity information distance(Q, A) is finally obtained.
In some embodiments of the present invention, calculating weight information of each word in a single corpus with respect to the corpus comprises: and calculating the weight information of each word in a single training corpus relative to the training corpus by using a bidirectional attention mechanism, giving different weights to each word, and judging the importance of the word by calculating the similarity between each word and the target training corpus so as to further improve the accuracy of the category score.
In some embodiments of the invention, the second model is trained by a loss function consisting of a cross-entropy function and a KL divergence.
In some embodiments of the invention, the loss function of the knowledge distillation section may consist of a cross entropy function that allows the model to be trained under the traditional one-hot label, and a KL divergence that allows the student model's predictions for each category to be close to the teacher model's predictions for each category. The KL divergence is expressed by the following equation:
L_kd = Σ_{x∈D_x} Σ_{i=1}^{C} p_i^t(x; τ) · log( p_i^t(x; τ) / p_i^s(x; τ) ); (10)
where L_kd denotes the KL divergence between the predictions of the teacher model (t) and the student model (s). The loss function can then be expressed as:
L_stage2 = α·L_softmax + (1 − α)·L_kd; (11)
where L_stage2 denotes the loss function.
Fig. 5 schematically shows a prototype network schematic according to an embodiment of the invention.
In some embodiments of the present invention, the prototype network is a metric learning-based method proposed by snell et al, which maps text data into an embedding space by means of encoder coding, collects support set samples belonging to the same kind of relationship in the embedding space, disperses samples of different classes, and calculates a representation of each class in the embedding space using the aggregated samples, and refers to it as a prototype of each class, as shown in fig. 5. In the stage of query sample classification, the prototype network utilizes an encoder to map the query samples into an embedding space where the support set samples are located, then the distance between the center of each prototype and the support set samples is calculated in the embedding space, and finally the class represented by the prototype closest to the query samples in the embedding space is the relationship class of the query samples.
In some embodiments of the invention, let f_θ be the instance encoder, x_k^n be the k-th sample of the n-th class in the support set, and q_i be the i-th query sample. The instance encoder f_θ nonlinearly maps x_k^n into the embedding space, and the embedding vectors are then used to calculate the prototype:
c_n = (1/K) Σ_{k=1}^{K} f_θ(x_k^n); (12)
When classifying the query instances, the query sample is first mapped into the embedding space to which the prototypes belong, obtaining f_θ(q_i). With d(·,·) as the distance function, the prototype network calculates, on the basis of the softmax function, the probability that the query sample belongs to the relationship class represented by each prototype:
p(y = n | q_i) = exp(−d(f_θ(q_i), c_n)) / Σ_{n'} exp(−d(f_θ(q_i), c_{n'})); (13)
where p(y = n | q_i) is the probability that the i-th query sample belongs to the n-th class in the support set.
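For illustration, a minimal sketch of formulas (12) and (13) is given below, assuming squared Euclidean distance and randomly generated embeddings; the dimensions and the N-way/K-shot sizes are arbitrary.

```python
import torch

def prototypes(support_emb):
    """support_emb: (N_way, K_shot, dim) embeddings f_theta(x) of the support samples.
    Each prototype is the mean of its K support embeddings (formula (12))."""
    return support_emb.mean(dim=1)                    # (N_way, dim)

def classify(query_emb, protos):
    """query_emb: (Q, dim). Softmax over negative squared Euclidean distances to the
    prototypes gives the class probabilities of each query (formula (13))."""
    d = torch.cdist(query_emb, protos) ** 2           # (Q, N_way)
    return torch.softmax(-d, dim=-1)

support = torch.randn(5, 3, 64)    # a 5-way 3-shot support set of 64-d embeddings
query = torch.randn(2, 64)
probs = classify(query, prototypes(support))
print(probs.argmax(dim=-1))
```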
Fig. 6 schematically shows a word shift pattern diagram according to an embodiment of the present invention.
In some embodiments of the present invention, the Word Mover's Distance (WMD) is a method for measuring the distance between two sample instances; it can be used to calculate the similarity between two samples of different lengths, and the larger the word mover's distance, the lower the similarity between the samples, and vice versa. In particular, there are many ways to "move" one text to another in vector space, and the word mover's distance takes the way that minimizes the sum of distances. As shown in fig. 6, the words of one text are shifted to the corresponding words of the other text, and the smallest sum of the moving distances is the word shift distance.
FIG. 7 is a diagram schematically illustrating text similarity calculation based on word shift distance according to an embodiment of the present invention.
In some embodiments of the invention, a word2vec method is employed to encode each word into a vector. If the text distance is calculated only as the minimum distance sum, the result is very likely to be inaccurate. For example, suppose all of text I is closely related to sports; in text II only one word is closely related to sports and the other words are not; and in text III one word is closely related to sports and the other words are somewhat related to sports. By human judgment, the distance between text I and text III is clearly smaller than the distance between text I and text II. However, when the distance between text I and text II is calculated according to the minimum distance sum, all words of text I are moved to the position of the sports-related word of text II, and when the similarity between text I and text III is calculated, all words of text I are likewise moved to the position of the sports-related word of text III; at this point distance(text 1, text 2) = distance(text 1, text 3), which is obviously a wrong prediction. The word shift distance method therefore proposes to calculate the distance between each word of one text and each word of the other text according to different weights. As shown in FIG. 7, each word has a weight in the text where it is located, and this weight is shared by all words of the opposite text during the transfer. Define T ∈ R^{n×n}, where T_ij ≥ 0 denotes the weight with which word i of text d_1 is transferred to word j of text d_2, d_i denotes the weight of word i in document d_1, d'_j denotes the weight of word j in document d_2, and c(i, j) is the Euclidean distance between word i and word j. The optimal path is solved by linear programming, giving the word shift distance between text d_1 and text d_2 as follows:
min_{T ≥ 0} Σ_{i,j} T_ij · c(i, j);
subject to Σ_j T_ij = d_i, i = 1, ..., n;
Σ_i T_ij = d'_j, j = 1, ..., n;
The first constraint indicates that each word of the first text is transferred to the other text; an individual transfer weight may become 0 in the process of solving the optimal solution by linear programming. The second constraint indicates that the total weight transferred onto each word of the second text must equal that word's weight in the second text, which avoids the situation where all words of the first text are transferred onto a single word of the second text. The weights d_i and d'_j are obtained by calculating tf-idf or by a normalized bag-of-words model.
In some embodiments of the present invention, in order to speed up the model calculation, a method of reducing unnecessary operations by calculating a lower bound is used. The basic idea is that solving the word shift distance by linear programming is too time-consuming, so a lower bound with a faster calculation speed is selected to reduce the amount of calculation on texts that are far away when KNN classification is carried out; the word shift distance calculated in this way remains consistent with the requirement of the maximum transfer distance. The lower bound of the word shift distance is calculated as follows:
Σ_{i,j} T_ij · c(i, j) = Σ_{i,j} T_ij · ||x_i − x'_j||_2 ≥ || Σ_{i,j} T_ij (x_i − x'_j) ||_2
= || Σ_i d_i·x_i − Σ_j d'_j·x'_j ||_2 = || X·d − X·d' ||_2;
where x_i and x'_j are the word vectors of word i of the first text and word j of the second text, and X is the matrix of word vectors. The computational complexity is thereby reduced from O(p^3·log p) to O(d·p), and the word shift distance of the text is calculated in this lower-bound manner in the subsequent method.
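A small Python sketch of this word-centroid lower bound is shown below; the toy vocabulary, the random stand-in word vectors and the function names are assumptions of the example.

```python
import numpy as np

def nbow_weights(tokens, vocab):
    """Normalized bag-of-words weights (one entry per vocabulary word)."""
    d = np.zeros(len(vocab))
    for t in tokens:
        d[vocab[t]] += 1.0
    return d / d.sum()

def wcd_lower_bound(tokens_a, tokens_b, vocab, word_vecs):
    """Word-centroid distance ||X d_a - X d_b||_2: a cheap lower bound on the word
    shift distance, usable to skip distant candidates before solving the full LP."""
    d_a, d_b = nbow_weights(tokens_a, vocab), nbow_weights(tokens_b, vocab)
    X = np.stack([word_vecs[w] for w in vocab])       # (|V|, dim), rows follow vocab order
    return float(np.linalg.norm(X.T @ d_a - X.T @ d_b))

vocab = {"ribavirin": 0, "hepatitis": 1, "thiabendazole": 2, "infection": 3}
rng = np.random.default_rng(0)
word_vecs = {w: rng.standard_normal(50) for w in vocab}   # stand-in for word2vec vectors
print(wcd_lower_bound(["ribavirin", "hepatitis"], ["thiabendazole", "infection"],
                      vocab, word_vecs))
```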
In some embodiments of the invention, knowledge distillation is a model compression method based on a teacher-student mode; the knowledge distillation method based on the teacher-student mode can not only compress model parameters but also output soft labels to improve the generalization performance of the model. Compared with hard labels (one-hot vectors), the soft labels predicted by the teacher model can provide additional relationship information, and by mimicking the output of the teacher model the student model can learn relationship information from the source domain that cannot be learned with hard labels alone.
Figure 8 schematically shows a schematic view of a knowledge distillation according to an embodiment of the invention.
In some embodiments of the invention, as shown in FIG. 8, the soft target is obtained by dividing the predicted output of the teacher network by a temperature parameter (Temperature) and applying a softmax calculation. When the temperature equals 1 this reduces to the ordinary softmax; the larger the temperature value, the more even (smoother) the resulting distribution, and the smaller the temperature value, the steeper the distribution and the weaker the smoothing effect. The hard target is the true label of the sample and can be represented by a one-hot vector. The total loss is a weighted sum of the KL divergence loss and the cross-entropy loss (denoted KD loss and CE loss). The soft labels help the student network to identify simple samples early in training, and their proportion needs to be appropriately reduced later in training so that the real labels help the model to identify difficult samples. To obtain the soft label, temperature scaling is introduced to soften the peaked softmax distribution:
p_i(x; T) = exp(s_i(x)/T) / Σ_j exp(s_j(x)/T);
where x is the data instance, i is the relationship class, s_i(x) is the logit score obtained by x on relation i, and T is the temperature. The knowledge distillation loss L_kd measured by the KL divergence is:
L_kd = Σ_{x∈D_x} Σ_{i=1}^{C} p_i^t(x; T) · log( p_i^t(x; T) / p_i^s(x; T) );
where t and s denote the teacher model and the student model, C is the total number of relationship classes, and D_x is the training data set. The cross-entropy loss function for the student model to learn the hard labels is:
L_ce = − Σ_{x∈D_x} Σ_{i=1}^{C} y_i(x) · log p_i^s(x);
where y_i(x) is the hard (one-hot) label. The hard labels and soft labels are then combined to obtain the joint loss function for training the student network:
L = λ·L_ce + (1 − λ)·L_kd; (21)
Where λ is used to adjust the ratio of soft tags to hard tags.
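A minimal sketch of this joint distillation loss is given below; the temperature, the mixing weight and the T·T gradient scaling are common conventions assumed for the example rather than values fixed by the embodiments.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, lam=0.5):
    """Joint loss L = lam * CE(student, hard labels) + (1 - lam) * KL(teacher || student),
    with both distributions softened by the temperature T. The T*T scaling of the KD
    term is an added convention, not stated in the text."""
    ce = F.cross_entropy(student_logits, labels)                        # hard-label term
    p_teacher = F.softmax(teacher_logits / T, dim=-1)                   # soft targets
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    kd = F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T * T)
    return lam * ce + (1.0 - lam) * kd

student_logits = torch.randn(8, 5, requires_grad=True)
teacher_logits = torch.randn(8, 5)
labels = torch.randint(0, 5, (8,))
print(distillation_loss(student_logits, teacher_logits, labels))
```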
Fig. 9 schematically shows a BERT model diagram according to an embodiment of the present invention.
In some embodiments of the invention, as shown in FIG. 9, the model consists of four modules. First, a BERT (Bidirectional Encoder Representations from Transformers) pre-trained model is adopted to obtain the vector encoding of the text. Second, the teacher-student training mechanism of knowledge distillation is introduced: the teacher model learns the class distribution of the training set and teaches it to the student model, so that in the testing stage the student model can output the distribution of a text over the training-set classes. Then the prototype network layer adopts the word shift distance with adaptively adjusted word weights to calculate the distance between the query sample and each prototype. Finally, the classification layer classifies the query samples by combining the distance output by the prototype network layer with the distributions of the query samples and prototypes over the classes provided by the student model.
Fig. 10 schematically shows an input schematic of BERT according to an embodiment of the present invention.
In some embodiments of the present invention, as shown in FIG. 10, BERT is a pre-trained language model based on the bidirectional Transformer; it has a large number of parameters and a network depth of tens of layers. The BERT family contains several pre-trained models that can be divided by parameter size into TinyBERT, BERT Large and the like. The invention adopts the basic BERT model as the encoder and fine-tunes BERT on the FewRel2.0 data set to encode texts.
In some embodiments of the present invention, the input to the BERT consists of three parts:
word vector: firstly, segmenting a text by using a word list in BERT, finding each segmented text segment in the word list, and then converting a plaintext segment into a vector corresponding to the segment by looking up the table;
position vector: the Transformer, unlike sequence models such as RNN and LSTM, carries no positional information, so a fixed position vector representing the position of a word in the text needs to be added before the input enters BERT;
sentence vector: BERT also adds a next-sentence prediction task during pre-training, i.e., judging whether two sentences are adjacent; in order to distinguish the positions of the two sentences, two segment vectors E_A and E_B are introduced to distinguish the input sentences from one another.
In some embodiments of the invention, because the training set and the test set come from different domains, there may be a problem with so-called common-sense words. Common-sense words here refer to important words such that, when most texts of a certain category in the training set contain them, the model judges a sentence as that category as soon as it sees such a word, so the generalization capability of the model is poor.
In some embodiments of the present invention, to solve the above problem, the present invention employs a keyword mask method. Firstly, calculating high-frequency words of each category by using a tf-idf algorithm, then copying sentences containing the words according to preset probability, replacing common sense words in the sentences by using a [ mask ] identifier, and finally adding the processed sentences to a training stage, so that the capability of judging the relationship categories by using context information by a model is enhanced, and the process of pre-training model coding is represented as follows:
H = {h_1, ..., h_n, h_{n+1}} = E([w_1, ..., w_n, w_{n+1}]); (22)
where w_n represents the n-th word of the segmented sentence and E represents an encoder such as BERT or BERT Large. H denotes the encoded hidden-layer vectors, and h_n is the vector representation of the n-th word.
Fig. 11 schematically shows a word shift distance weight calculation diagram according to an embodiment of the present invention.
In some embodiments of the present invention, after the encoding layer outputs the vector encodings of the words, the prototype network layer calculates the similarity between texts by an adaptive word shift distance method, where the word shift distance is specifically expressed as:
Φ(U, V) = Σ_{i=1}^{ML} Σ_{j=1}^{ML} x_ij · ||u_i − v_j||_2; (23)
where ML is the maximum sentence length (sentences exceeding this length are truncated, and sentences shorter than it are padded with 0), u_i and v_j are the vector representations of the i-th word of the first text and the j-th word of the second text respectively, x_ij is the transfer weight, and s_i and d_j are the weights corresponding to u_i and v_j respectively. In calculating the word weights, as shown in fig. 11, the invention compares each word with the whole test-domain support set, judges the importance of the word in the test domain, and assigns larger weights to important words:
c = (1/(K·ML)) Σ_k Σ_j v_jk; (24)
where v_jk represents the j-th word of the k-th support-set text, that is, all word vectors of the support set are mean-pooled into the feature vector c of the test domain. The similarity between each word of the query set and this feature vector is then calculated to judge the importance of the word in the test domain:
ŝ_i = sim(u_i, c); (25)
where ŝ_i denotes the unnormalized weight of the i-th word u_i and sim(·,·) is computed with Euclidean distance or cosine similarity. Finally all weights are normalized:
s_i = ŝ_i / Σ_j ŝ_j; (26)
Next, the distance between the query sample and each support-set category is calculated by the prototype network in combination with the word shift distance Φ(U, V):
σ_n = − (1/K) Σ_{k=1}^{K} Φ(U, V_k^n); (27)
where σ_n represents the score between the query sample and the n-th class (a larger value represents higher similarity), and p_φ(y = n) represents the normalized probability distribution of the query sample over the support-set categories:
p_φ(y = n) = exp(σ_n) / Σ_{n'} exp(σ_{n'}); (28)
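As an illustration of the adaptive weighting idea in formulas (24)-(26), the sketch below mean-pools the support-set word vectors into a test-domain feature vector and scores each query word against it; the softmax over negative Euclidean distances used as the normalization is an assumed concrete choice, not the normalization fixed by the embodiments.

```python
import torch

def adaptive_word_weights(query_words, support_words):
    """query_words: (Lq, dim) vectors of the words of one query sentence;
    support_words: (N, L, dim) vectors of all support-set words in the batch.
    Mean-pool the support set into a test-domain feature vector, score each query
    word against it, then normalize the scores into word weights."""
    domain_vec = support_words.reshape(-1, support_words.shape[-1]).mean(dim=0)   # (dim,)
    dist = torch.norm(query_words - domain_vec, dim=-1)       # Euclidean distances
    return torch.softmax(-dist, dim=0)                        # assumed normalization into s_i

q = torch.randn(7, 64)           # 7 query words
s = torch.randn(15, 10, 64)      # 5-way 3-shot support set, 10 words per text
print(adaptive_word_weights(q, s).sum())                      # weights sum to 1
```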
FIG. 12 schematically shows a loss calculation diagram of the knowledge distillation method according to an embodiment of the invention.
In some embodiments of the present invention, the training process includes two stages. In the first stage, the pre-trained model is fine-tuned using metric learning, and a triplet loss function is used to improve the robustness of metric learning, as follows:
L_triplet = Σ_{k=1}^{C_s} max( d(E(a_k), E(p_k)) − d(E(a_k), E(n_k)) + δ, 0 ); (29)
where d(·) is the Euclidean distance function, E(·) denotes the instance encoding operation, C_s is the number of training classes, a_k is an anchor, p_k is a positive example, n_k is a negative example, and δ is an interval (margin) hyperparameter. The triplet loss alone does not converge easily, so the model is trained with a softmax cross-entropy loss added to the triplet loss; the cross-entropy loss is as follows:
L_softmax = − Σ_i y_i · log( softmax(z)_i ); (30)
where y_i is the true label and z_i denotes the i-th Euclidean distance d(E(a_i), E(p_i)). The cross-entropy loss and the triplet loss are then combined to obtain the overall loss function:
L_stage1 = L_softmax + L_triplet. (31)
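For illustration, a minimal sketch of the stage-1 loss is given below; the margin value and the use of separate class logits for the cross-entropy term are assumptions of the example.

```python
import torch
import torch.nn.functional as F

def stage1_loss(anchor, positive, negative, class_logits, labels, margin=1.0):
    """anchor/positive/negative: (B, dim) encoded instances E(a), E(p), E(n);
    class_logits: (B, C) scores for the softmax cross-entropy term.
    Sketch of L_stage1 = L_softmax + L_triplet with Euclidean distance (formulas (29)-(31))."""
    d_ap = F.pairwise_distance(anchor, positive)           # d(E(a_k), E(p_k))
    d_an = F.pairwise_distance(anchor, negative)           # d(E(a_k), E(n_k))
    l_triplet = torch.clamp(d_ap - d_an + margin, min=0).sum()
    l_softmax = F.cross_entropy(class_logits, labels)
    return l_softmax + l_triplet

anchor, pos, neg = (torch.randn(4, 768) for _ in range(3))
logits, labels = torch.randn(4, 5), torch.randint(0, 5, (4,))
print(stage1_loss(anchor, pos, neg, logits, labels))
```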
In some embodiments of the present invention, during the second stage of training, it is ensured that the information learned in the first stage is not lost, and the parameters of the feed-forward network and logistic regression (FFN + LR) are optimized, where FFN + LR provides the features between the classes of the training set. If the BERT parameters are fixed in the conventional training mode and only part of the FFN + LR parameters are updated, the features between classes cannot be extracted well, because there is similarity between classes; for example, samples under the two relations "competition" and "hostile" are distributed more closely in space than samples under the two relations "competition" and "cooperation". Therefore, in order to enable the model to better learn this distribution information, the invention adopts a knowledge distillation mechanism to train the FFN + LR parameters. First, the teacher model is trained with supervision on the relation extraction data set; the teacher model consists of the BERT model and FFN + LR. At this point the teacher model can score the classes of the samples in the training set and output the distribution of each sample over all training-set classes, and a temperature coefficient is then introduced to soften the peaked softmax output and generate the soft label shown in FIG. 12, where the softening function is expressed as:
p_i(x; τ) = exp(s_i(x)/τ) / Σ_j exp(s_j(x)/τ); (32)
where x represents the input text instance, i is the relationship class of the entities in the text, s_i(x) is the logit score obtained by x on relation i, and τ is the temperature coefficient. When there are m classes in the training set, the distribution of x over the training-set classes is defined as:
P_x = {p_1(x; τ), p_2(x; τ), ..., p_m(x; τ)}; (33)
where P_x corresponds to the Soft Label in fig. 12. The KL divergence between the student model and the teacher model is calculated as:
L_kd = Σ_{x∈D_x} Σ_{i=1}^{C} p_i^t(x; τ) · log( p_i^t(x; τ) / p_i^s(x; τ) ); (34)
where p_i^t(x; τ) and p_i^s(x; τ) represent the output probabilities of the teacher model and the student model, C is the total number of relationship classes, and D_x represents the training data set. The KL divergence and the cross-entropy loss are combined as the overall loss function for training the student model:
L_stage2 = α·L_softmax + (1 − α)·L_kd. (35)
In some embodiments of the invention, similarity is measured by combining two independent paths:
1. the prototype network layer outputs the distribution P_σ of the query sample over the test categories, calculated according to the metric;
2. FFN + LR outputs the distributions of the query sample and of the support-set samples over the training-set classes, P_query and P_support.
Using these two distributions, another distribution P̂ of the query sample over the test categories can also be calculated:
P̂ = {P̂_1, ..., P̂_n}, where P̂_k is obtained by comparing P_query with the training-class distributions of the support samples of the k-th test category and normalizing over the test categories; (36), (37)
where n is the total number of categories in the test set, P_query = {r_1, r_2, ..., r_m}, r_i is the score of the sample on the i-th training-set category, and P_support is obtained in the same way. r_i is calculated by FFN and LR:
r_i = σ(W_i·H + b_i); (38)
The more similar the distributions of the query sample and a support-set sample over the training-set classes are, the more similar the two samples are, so the similarity between the two samples can be calculated in this way. This similarity distribution P̂ can be combined with the metric distribution P_σ; when the metric distribution P_σ contains errors, the P̂ distribution can be used to correct them. The final classification score P_n is expressed as:
P_n = λ·P_σ + (1 − λ)·P̂; (39)
where λ is an adjustment coefficient used to adjust the weights of the metric distribution P_σ and the class distribution P̂ in the final classification.
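A minimal sketch of this combination step is given below. Because formulas (36) and (37) are not reproduced in the text, the comparison of training-class distributions via cosine similarity followed by a softmax is an assumed concrete choice; only the final mixing P_n = λ·P_σ + (1 − λ)·P̂ follows formula (39).

```python
import torch
import torch.nn.functional as F

def final_scores(p_sigma, p_query, p_support, lam=0.5):
    """p_sigma: (N,) prototype-network distribution of a query over the N test classes;
    p_query: (m,) FFN+LR distribution of the query over the m training classes;
    p_support: (N, m) the same distribution for one support sample per test class."""
    sim = F.cosine_similarity(p_query.unsqueeze(0), p_support, dim=-1)   # assumed comparison
    p_hat = torch.softmax(sim, dim=0)                    # distribution over test classes
    return lam * p_sigma + (1.0 - lam) * p_hat           # P_n = lam*P_sigma + (1-lam)*P_hat

p_sigma = torch.softmax(torch.randn(5), dim=0)           # metric-based distribution
p_query = torch.softmax(torch.randn(10), dim=0)          # query over 10 training classes
p_support = torch.softmax(torch.randn(5, 10), dim=-1)
print(final_scores(p_sigma, p_query, p_support).argmax())
```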
In some embodiments of the invention, as shown in table 1, experiments were performed on three small-sample relation extraction datasets: FewRel 1.0, FewRel 2.0 and DDI-13. FewRel 1.0 contains 100 relations with 700 instances per relation. FewRel 2.0 is a test set constructed from corpora in the biological field, and DDI-13 is a data set constructed from the pharmaceutical literature.
Table 1 data set statistics
In some embodiments of the present invention, small-sample relation extraction is evaluated using the precision P, based on whether each query sample is accurately classified.
In some embodiments of the invention, the experimental verification is completed based on the Python language; tools such as PyTorch, Numpy and matplotlib are used for data processing and network building, and the experiments are completed on a server equipped with an NVIDIA Tesla V100 16G GPU.
In some embodiments of the present invention, BERT-Base is adopted as the pre-trained LM; its hidden-layer dimension d is 768 and it consists of 12 Transformer layers. To prevent overfitting during training, a node retention rate of 0.30 is adopted for the model, an Adam optimizer is used to optimize the model loss, the number of training steps is set to 20000, and early stopping is adopted: the accuracy of the model is calculated on the validation set every 1000 steps, and if the model shows no improvement on the validation set for 3 consecutive epochs, training is stopped to prevent overfitting.
In some embodiments of the present invention, in the testing stage, 1000 validation runs are performed each time and their average accuracy is calculated; this is then repeated 3 times and the average accuracy of the 3 experiments is taken as the final testing accuracy, with the deviation of the 3 results from their mean lying within the range of accuracy fluctuation. Here λ is set to 0.5 to calculate the final classification score. The invention adds two special characters <entity1> and <entity2> in front of the two entities to highlight the importance of these two words.
In some embodiments of the present invention, the keyword mask mechanism replaces frequently appearing words and entities in each category with the [mask] identifier with a certain probability during the training phase. This is because some high-frequency words always appear in one category and rarely appear in other categories. For example, the word "lyrics" appears very frequently under the "making words" (lyricist) relation. In the sentence "The lyrics of the song of the love story of Romeo and Juliet", "Romeo" and "Juliet" are "lovers", but because of the word "lyrics" the model erroneously judges "Romeo" and "Juliet" to have the "making words" relation; after "The lyrics of" is removed, the model can correctly identify the relation of the entities. In order to relieve the influence of such common-sense words and let the model learn more context information, these words are replaced by the [mask] identifier through the keyword mask method, and the model judges the relation between the entities in the text according to the context information. The choice of the mask probability has a great influence on the result: too small a mask probability has no effect on the model, while too large a probability reduces the features of the text. In order to find the optimal keyword mask probability, the invention carried out a comparison experiment with mask probabilities from 5% to 20% on FewRel2.0 Pubmed; the experiment was completed on the basis of a prototype network, without the word shift distance or the adaptive word shift distance method. The accuracy of the model is highest at a mask probability of 10%. Analysis shows that when the mask probability is small, the data enhancement effect is not obvious and many high-frequency words are not yet masked, which still influences the classification of the model, while a larger mask probability reduces the text features so that the model cannot learn complete semantic information; therefore a 10% mask probability lets the model fully learn sentence context information.
Fig. 13 schematically shows a visualization diagram of the sentence level in combination with the Cosine similarity calculation weight according to the embodiment of the present invention.
FIG. 14 is a schematic diagram illustrating a visualization of calculating weights by batch in combination with Cosine similarity, according to an embodiment of the present invention.
Fig. 15 is a schematic diagram illustrating a visualization of sentence level combined with euclidean distance calculation weights according to an embodiment of the present invention.
Fig. 16 schematically shows a visualization diagram of calculating the weight by combining the batch rank with the euclidean distance according to the embodiment of the present invention.
In some embodiments of the present invention, the word shift distance with cosine similarity (sentence) means that the similarity between a word and a sentence of the support set is calculated with a cosine similarity function and used as the word weight s_i. The word shift distance with cosine similarity (batch) means that a support-set domain vector is obtained by computing over all support-set samples in the whole batch, and each word of a query instance judges its importance by calculating its similarity to this vector. Similarly, Euclidean distance refers to using the Euclidean distance to compute the weight of each word; the weights of the words in the support samples are computed in the same manner. As shown in figs. 13, 14, 15 and 16, the values in the graphs are obtained by multiplying the weights of the query words and the support words. As can be seen from figs. 13, 14, 15 and 16, compared with the cosine-similarity-based adaptive word shift distance, the Euclidean-distance-based adaptive word shift distance increases the weights of words such as "averaging" and "speech"; these words better express the central meaning of the sentence, which benefits entity relation classification. Domain-specific nouns never seen during the training phase, such as "Thiabendazole" and "Chronic-hepatitis-C", are given lower weights. In this way, the method can reduce misclassification caused by incorrect semantic analysis of domain-specific nouns. In addition, the weights computed for the words of a query sentence against the entire batch of support samples are more reliable than weights computed against just one support sentence, because a single support-set sentence cannot reliably judge the importance of a word of a query text in the test domain.
In addition, experiments were performed on the several ways of calculating weights described above, and the corresponding results are shown in table 2. These experiments did not use the keyword mask mechanism or knowledge distillation to introduce category information. As can be seen from table 2, the accuracy of calculating the word shift distance with the Euclidean distance is significantly higher than that of calculating it with cosine similarity. This is because, in a high-dimensional vector space, the Euclidean distance can widen the gap between instances and highlight the effect of the emphasized word on the entire sentence when calculating the weights. Therefore, calculating the word shift distance weights with the Euclidean distance has the highest precision, and all models provided by the invention calculate the word shift distance weights in this way.
TABLE 2 influence of different ways of computing weights for word-shift distances on the results
FIG. 17 schematically shows the effect on the vector space when knowledge distillation is not used, according to an embodiment of the invention.
FIG. 18 schematically shows a diagram of the effect of using knowledge distillation on vector space according to an embodiment of the invention.
In some embodiments of the present invention, figs. 17 and 18 are visualized with the t-SNE algorithm. It can be observed from the figures that when class features are not learned by the knowledge distillation method, after the samples are aggregated by class, the distances between classes are not obvious and some classes are interwoven; the distributions of the support-set samples and the query samples over this class space then cannot highlight the characteristics of the samples, and the intersections between classes easily cause errors when measuring the distribution of samples over the class space. The reason is that the BERT parameters are obtained by metric-learning training while the FFN + LR parameters are obtained by supervised training with the BERT parameters fixed; because the tasks during training are different, the FFN + LR module with fewer parameters has difficulty learning the feature representation of the classes, and when samples of the test classes are directly added to the training-set embedding space, the problem of feature-embedding incompatibility is more likely to occur. When the knowledge distillation method is introduced, the teacher model is trained jointly with the BERT pre-trained model and FFN + LR, so that the distribution of the training-set categories can be learned better, and the teacher network teaches the learned soft-label information to the FFN + LR module of the student model, so that FFN + LR can learn more information among categories and the categories are distinguished more clearly. The experimental effect is shown in figs. 17 and 18; by comparing the distributions of the samples under the two training modes, it can be seen that the knowledge distillation mode is more beneficial for learning the class distribution of the training set.
According to an embodiment of an aspect of the present invention, there is provided a model training apparatus including:
the enhancement module is used for carrying out sample enhancement on the training text to obtain a training corpus, wherein the training corpus is provided with a label, and the training text comprises a natural language text;
the first processing module is used for inputting the training corpus into the first model to obtain vectorization representation of the training corpus;
the first training module is used for acquiring similarity information and weight information of the vectorization representation, and training the first model by using the similarity information and the weight information;
the second processing module is used for inputting the vectorization representation into a second model to obtain the category scores of the training corpora; and
the second training module is used for training the third model by using the category score and the vectorization representation, and determining the trained third model as a text recognition model.
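The arrangement of the above modules can be illustrated with the following Python sketch. The class name, method names and constructor arguments are placeholders introduced for illustration; the actual modules are defined only by their functions as described above.

class ModelTrainingApparatus:
    # Illustrative composition of the five modules listed above; the callables
    # passed to the constructor stand in for the enhancement, processing and
    # training logic described in the embodiments.
    def __init__(self, enhance, encode, train_encoder, score, train_student):
        self.enhance = enhance              # enhancement module
        self.encode = encode                # first processing module
        self.train_encoder = train_encoder  # first training module
        self.score = score                  # second processing module
        self.train_student = train_student  # second training module

    def run(self, training_text):
        corpus = self.enhance(training_text)        # labeled training corpus
        vectors = self.encode(corpus)               # vectorized representation
        self.train_encoder(vectors)                 # uses similarity and weight information
        scores = self.score(vectors)                # category scores
        return self.train_student(scores, vectors)  # returns the text recognition model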
According to an embodiment of an aspect of the present invention, there is provided an electronic apparatus including:
one or more processors; and
a memory for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method described above.
According to an embodiment of an aspect of the present invention, there is provided a computer-readable storage medium having stored thereon executable instructions that, when executed by a processor, cause the processor to implement the above-described method.
According to an embodiment of an aspect of the invention, a computer program product is provided, comprising a computer program which, when executed by a processor, performs the method described above.
So far, the embodiments of the present invention have been described in detail with reference to the accompanying drawings. It should be noted that implementations not shown or described in the drawings or in the specification are in forms well known to those of ordinary skill in the art and are not described in detail. In addition, the above definitions of the components are not limited to the specific structures, shapes or manners mentioned in the embodiments, which may be easily modified or replaced by those skilled in the art.
It should also be noted that, in the specific examples of the invention, unless otherwise indicated, the numerical parameters set forth in the specification and the appended claims are approximations that can vary depending on the desired properties sought to be obtained by the teachings of the present invention. In particular, all numbers expressing dimensions, ranges, conditions and so forth used in the specification and claims are to be understood as being modified in all instances by the term "about". Generally, this expression is meant to encompass a variation from the specified value of ±10% in some embodiments, ±5% in some embodiments, ±1% in some embodiments, and ±0.5% in some embodiments.
It will be appreciated by persons skilled in the art that the features described in the various embodiments and/or claims of the invention may be combined and/or integrated in various ways, even if such combinations or integrations are not explicitly described in the invention. In particular, the features recited in the various embodiments and/or claims of the present invention may be combined and/or integrated in various ways without departing from the spirit or teaching of the invention. All such combinations and/or integrations fall within the scope of the present invention.
The above-mentioned embodiments are intended to further illustrate the objects, technical solutions and advantages of the present invention in detail. It should be understood that they are only exemplary embodiments of the present invention and are not intended to limit the present invention; any modifications, equivalents, improvements and the like made within the spirit and principles of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A text recognition model training method comprises the following steps:
carrying out sample enhancement on a training text to obtain a plurality of training corpora, wherein the training corpora are provided with labels, and the training text comprises a natural language text;
inputting the training corpus into a first model to obtain vectorization representation of the training corpus;
calculating weight information of each word in a single training corpus relative to the training corpus, and similarity information between the vectorization representations of the plurality of training corpora;
training the first model by using the similarity information and the weight information;
inputting the vectorization representation into a second model to obtain the category score of the training corpus; and
training a third model by using the category score and the vectorization representation, and determining the trained third model as the text recognition model.
2. The training method of claim 1, wherein performing sample enhancement on the training text comprises:
performing sample enhancement on the training text by using a keyword mask mechanism.
3. The training method of claim 1, wherein the first model is a BERT model, the second model is a fully connected model, the first model and the second model together comprise a teacher model, and the third model is a student model.
4. The training method according to claim 1, wherein calculating the similarity information between the vectorization representations of the plurality of training corpora comprises:
calculating the similarity information among the vectorization representations of the training corpora by using a word mover's distance method.
5. The training method according to claim 1, wherein calculating the weight information of each word in a single training corpus relative to the training corpus comprises:
calculating the weight information of each word in the single training corpus relative to the training corpus by using a bidirectional attention mechanism, so as to give a different weight to each word.
6. The training method according to claim 1, wherein the second model is trained with a loss function consisting of a cross-entropy function and a KL divergence.
7. A model training apparatus comprising:
the enhancement module is used for carrying out sample enhancement on a training text to obtain a training corpus, wherein the training corpus is provided with a label, and the training text comprises a natural language text;
the first processing module is used for inputting the training corpus into a first model to obtain vectorization representation of the training corpus;
the first training module is used for acquiring similarity information and weight information of the vectorization representation and training the first model by using the similarity information and the weight information;
the second processing module is used for inputting the vectorization representation into a second model to obtain the category score of the training corpus; and
the second training module is used for training a third model by using the category score and the vectorization representation, and determining the trained third model as the text recognition model.
8. An electronic device, comprising:
one or more processors; and
a memory for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-6.
9. A computer readable storage medium having stored thereon executable instructions which, when executed by a processor, cause the processor to carry out the method of any one of claims 1 to 6.
10. A computer program product comprising a computer program which, when executed by a processor, implements the method of any one of claims 1 to 6.
CN202210504719.3A 2022-05-10 2022-05-10 Text recognition model training method, model training device and electronic equipment Pending CN114841148A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210504719.3A CN114841148A (en) 2022-05-10 2022-05-10 Text recognition model training method, model training device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210504719.3A CN114841148A (en) 2022-05-10 2022-05-10 Text recognition model training method, model training device and electronic equipment

Publications (1)

Publication Number Publication Date
CN114841148A true CN114841148A (en) 2022-08-02

Family

ID=82570583

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210504719.3A Pending CN114841148A (en) 2022-05-10 2022-05-10 Text recognition model training method, model training device and electronic equipment

Country Status (1)

Country Link
CN (1) CN114841148A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115081452A (en) * 2022-08-22 2022-09-20 军工保密资格审查认证中心 Method for extracting entity relationship
CN115081452B (en) * 2022-08-22 2022-11-01 军工保密资格审查认证中心 Method for extracting entity relationship

Similar Documents

Publication Publication Date Title
CN110298037B (en) Convolutional neural network matching text recognition method based on enhanced attention mechanism
CN111738003B (en) Named entity recognition model training method, named entity recognition method and medium
CN109800437A (en) A kind of name entity recognition method based on Fusion Features
CN112487820B (en) Chinese medical named entity recognition method
CN113673254B (en) Knowledge distillation position detection method based on similarity maintenance
CN112528676A (en) Document-level event argument extraction method
CN111563166A (en) Pre-training model method for mathematical problem classification
CN111368058B (en) Question-answer matching method based on transfer learning
CN111241807A (en) Machine reading understanding method based on knowledge-guided attention
CN112966117A (en) Entity linking method
CN112559734A (en) Presentation generation method and device, electronic equipment and computer readable storage medium
CN114417851A (en) Emotion analysis method based on keyword weighted information
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
CN110852089B (en) Operation and maintenance project management method based on intelligent word segmentation and deep learning
CN115964459B (en) Multi-hop reasoning question-answering method and system based on food safety cognition spectrum
CN113934835B (en) Retrieval type reply dialogue method and system combining keywords and semantic understanding representation
CN115659947A (en) Multi-item selection answering method and system based on machine reading understanding and text summarization
CN111145914A (en) Method and device for determining lung cancer clinical disease library text entity
CN114841148A (en) Text recognition model training method, model training device and electronic equipment
CN113869055A (en) Power grid project characteristic attribute identification method based on deep learning
CN113836891A (en) Method and device for extracting structured information based on multi-element labeling strategy
CN116522165B (en) Public opinion text matching system and method based on twin structure
CN116720498A (en) Training method and device for text similarity detection model and related medium thereof
CN115344668A (en) Multi-field and multi-disciplinary science and technology policy resource retrieval method and device
CN114781375A (en) Military equipment relation extraction method based on BERT and attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination