WO2021155705A1

WO2021155705A1 - Text prediction model training method and apparatus

Info

Publication number: WO2021155705A1
Application number: PCT/CN2020/132617
Authority: WO
Inventors: 李扬名; 姚开盛
Original assignee: 支付宝(杭州)信息技术有限公司
Priority date: 2020-02-06
Filing date: 2020-11-30
Publication date: 2021-08-12
Also published as: CN111274789A; CN111274789B

Abstract

Disclosed are a text prediction model training method executed by a computer, and a text prediction model training apparatus. A text prediction model comprises a first prediction network (11) based on a time sequence, a buffer (12), and a second prediction network (13) based on the buffer (12). The training method comprises: inputting a t-th word in training text into a first prediction network (11), such that the first prediction network determines a first prediction probability for the next word according a state vector obtained by means of time sequence processing; in addition, reading, from a buffer (12), several fragment vectors formed on the basis of the previous text, and a second prediction network (13) obtaining a second prediction probability for the next word according to these fragment vectors; then, by taking an interpolation weight coefficient λ as a weighting coefficient of the second prediction probability, and taking one minus λ as a weighting coefficient of the first prediction probability, weighing and synthesizing the second prediction probability and the first prediction probability in order to obtain a comprehensive prediction probability; and at least according to the comprehensive prediction probability and a (t+1)th word, determining a prediction loss regarding the t-th word, and thereby training a text prediction model.

Description

Training method and device of text prediction model

Technical field

One or more embodiments of this specification relate to the field of machine learning, and in particular to a training method and device for a text prediction model.

Background technique

With the rapid development of artificial intelligence and machine learning, various natural language processing tasks have been widely used in a variety of business implementation scenarios. For example, the text classification task can be used in the intelligent question answering customer service system to classify the question raised by the user as input text for user intention recognition, automatic question answering, or manual customer service dispatch. Text classification can also be used in various application scenarios, such as document data classification, public opinion analysis, spam identification, and so on. For another example, machine translation tasks in different languages are widely used in various automatic translation systems.

Generally, the language model is the basic model for performing the above-mentioned various specific natural language processing tasks. Language models need to be trained based on a lot of expectations. Among them, text prediction, that is, predicting subsequent texts based on existing texts, is a basic task for training language models.

Therefore, it is hoped that there will be an improved scheme that can be more effectively trained for text prediction tasks.

Summary of the invention

One or more embodiments of this specification describe a text prediction model and its training method, in which local context and long-range context are comprehensively used for prediction, thereby comprehensively improving the text prediction model's ability to understand text and predicting accuracy for subsequent text.

According to a first aspect, there is provided a method for training a text prediction model, the text prediction model including a first prediction network based on time series, a buffer, and a second prediction network based on the buffer, and the method includes: After sequentially inputting the first t-1 words in the current training text, the t-th word is input into the first prediction network, so that the first prediction network processes the state vector after the t-1th word, and State the word vector of the t-th word, determine the state vector after processing the t-th word as the first latent vector; and determine the first prediction probability for the next word according to the first latent vector; from the buffer Read several existing segment vectors, which are formed based on the text before the t-th word in the current training text, and each segment vector corresponds to a text segment with a length of L words The second prediction network determines the second prediction probability for the next word according to the several segment vectors; uses the interpolation weight coefficient as the weight coefficient of the second prediction probability, and subtracts the interpolation weight coefficient from 1. As the weighting coefficient of the first prediction probability, the first prediction probability and the second prediction probability are interpolated and weighted and integrated to obtain the comprehensive prediction probability for the next word; at least according to the comprehensive prediction probability and For the t+1th word in the training text, determine the prediction loss for the tth word; and train the text prediction model according to the prediction loss for each word in the current training text.

In one embodiment, the first prediction network includes a recurrent neural network RNN or a long short-term memory network LSTM.

According to an embodiment, the several segment vectors stored in the buffer include a first segment vector corresponding to any first text segment, where the first text segment includes the i-th word to the j-th word of the current training text, where i and j are both smaller than t, the first segment vector is obtained based on the difference between the first state vector and the second state vector, wherein the first state vector is the jth word processed by the first prediction network The second state vector is the state vector after the first prediction network processes the (i-1)th word.

According to an embodiment, the above method further includes, if the t-th word is the last word of the current text segment, determining a new segment vector according to the difference between the first latent vector and the second latent vector, where The second latent vector is the state vector after the first prediction network processes the tL-th word; the newly added segment vector is added to the buffer.

In one embodiment, the buffer has a limited storage capacity. In this case, before adding a new segment vector to the buffer, it is first determined whether the number of several fragment vectors already in the buffer reaches a predetermined threshold. Number; if the predetermined threshold number is reached, the earliest stored segment vector is deleted, and the newly added segment vector is stored in the buffer.

According to an embodiment, the second prediction network determines the second prediction probability for the next word in the following manner: determining several attention coefficients corresponding to the several segment vectors; taking the several attention coefficients as weighting factors, The several segment vectors are weighted and combined to obtain a context vector; and the second prediction probability is obtained according to the context vector and the linear transformation matrix.

According to an embodiment, the first prediction network obtains the first prediction probability according to the first hidden vector and the linear transformation matrix.

In a more specific embodiment, the second prediction network determines the attention coefficient in the following manner: according to the similarity between any i-th segment vector in the plurality of segment vectors and the first latent vector, determine The i-th attention coefficient.

In another more specific embodiment, the second prediction network determines the attention coefficient by using a first transformation matrix to transform any i-th segment vector in the plurality of segment vectors into a first intermediate vector; Use the second transformation matrix to transform the first hidden vector into a second intermediate vector; determine the similarity between the sum vector of the first intermediate vector and the second intermediate vector and the third vector; determine according to the similarity The i-th attention coefficient; wherein, the first transformation matrix, the second transformation matrix and the third vector are all trainable network parameters in the second prediction network.

According to an embodiment, the text prediction model further includes a strategy network; before performing interpolation weighted synthesis on the first prediction probability and the second prediction probability, the method further includes: the strategy network according to the first prediction probability A latent vector, outputting the interpolation weight coefficient; and the step of determining the prediction loss specifically includes: according to the comprehensive prediction probability, the t+1th word, the first prediction probability and the second prediction probability , And the interpolation weight coefficient to determine the prediction loss.

In one embodiment, the strategy network determines the interpolation weight coefficients by applying at least a strategy transformation matrix to the first latent vector to obtain a strategy vector, wherein the strategy transformation matrix is in the strategy network Trainable model parameters; determine the interpolation weight coefficient according to the element value of the predetermined dimension in the strategy vector.

In a further embodiment, the strategy network obtains the strategy vector by: determining the training strategy coefficient according to the current training text; applying the strategy transformation matrix to the first implicit vector and dividing by the training strategy Coefficient to obtain the strategy vector.

Furthermore, in an example, the training strategy coefficient may be determined according to the training sequence number of the current training text in the training sample set, so that the training strategy coefficient is negatively correlated with the training sequence number.

In another example, the training strategy coefficient may be determined according to the total text length of the current training text, so that the training strategy coefficient is negatively correlated with the total text length.

In an embodiment, the step of determining the prediction loss specifically includes: determining a first loss item according to the comprehensive prediction probability and the t+1th word; determining a second loss item according to the interpolation weight coefficient, The second loss term is negatively correlated with the interpolation weight coefficient; the reward term is determined according to the ratio of the second predicted probability and the first predicted probability to the probability values of the t+1th word , The reward item is positively correlated with the ratio; taking the reward item as the coefficient of the second loss item, sum the first loss item and the second loss item to determine the predicted loss .

According to a second aspect, there is provided a training device for a text prediction model, the text prediction model including a first prediction network based on time series, a buffer, and a second prediction network based on the buffer, the device including: A prediction unit configured to input the t-th word into the first prediction network after sequentially inputting the first t-1 words in the current training text, so that the first prediction network processes the t-1th word according to After the state vector, and the word vector of the t-th word, the state vector after processing the t-th word is determined as the first hidden vector; and according to the first hidden vector, the first predicted probability for the next word is determined Reading unit, configured to read several existing segment vectors from the buffer, the existing several segment vectors are formed based on the text before the t-th word in the current training text, and each A segment vector corresponds to a text segment with a length of L words; a second prediction unit configured to enable the second prediction network to determine a second prediction probability for the next word according to the plurality of segment vectors; a synthesis unit, configured In order to use the interpolation weight coefficient as the weighting coefficient of the second prediction probability, the difference of 1 minus the interpolation weight coefficient is used as the weighting coefficient of the first prediction probability. The predicted probability is interpolated and weighted and integrated to obtain the comprehensive predicted probability for the next word; the loss determination unit is configured to determine the target for the t-th word at least according to the comprehensive predicted probability and the t+1-th word in the training text The training unit is configured to train the text prediction model according to the prediction loss for each word in the current training text.

According to a third aspect, there is provided a computer-readable storage medium having a computer program stored thereon, and when the computer program is executed in a computer, the computer is caused to execute the method of the first aspect.

According to a fourth aspect, there is provided a computing device, including a memory and a processor, characterized in that executable code is stored in the memory, and when the processor executes the executable code, the method of the first aspect is implemented .

According to the text prediction model provided by the embodiment of this specification, on the basis of using the first prediction network based on time series to predict the next word, the segment vector of the previous text segment in the buffer is also used as long-range context information, and the first Second, the prediction network makes predictions based on the long-range context. When performing interpolation synthesis on the prediction results of the first prediction network and the second prediction network, the strategy network can be used to generate an interpolation weight coefficient for the current word. When training the above-mentioned text prediction model, by introducing reward items and interpolation weight coefficients into the loss function, the exploration and utilization of long-range context are conditionally encouraged, thereby further improving the prediction accuracy of the text prediction model.

Description of the drawings

In order to explain the technical solutions of the embodiments of the present invention more clearly, the following will briefly introduce the drawings used in the description of the embodiments. Obviously, the drawings in the following description are only some embodiments of the present invention. A person of ordinary skill in the art can obtain other drawings based on these drawings without creative work.

FIG. 1 is a schematic diagram of a text prediction model according to an embodiment disclosed in this specification;

Fig. 2 shows a flowchart of a method for training a text prediction model according to an embodiment;

Figure 3 shows an example of performing prediction processing for a specific training text;

Fig. 4 shows a schematic diagram of determining a segment vector of a text segment according to an embodiment;

Fig. 5 shows a flow of steps for determining a second predicted probability according to an embodiment;

Fig. 6 shows a schematic block diagram of a training device for a text prediction model according to an embodiment.

Detailed ways

The following describes the solutions provided in this specification with reference to the accompanying drawings.

As mentioned earlier, text prediction is a basic task of natural language processing. Accordingly, it is hoped to train a text prediction model with higher prediction accuracy.

Taking into account the order of words in the text and the importance of context to the semantic understanding of the text, in one solution, a neural network model based on time sequence is used, such as recurrent neural network RNN, long short-term memory neural network LSTM, and gated recurrent unit GRU_RNN , As the basic network of the text prediction model. However, when text prediction is based only on temporal neural networks, especially when LSTM is used for prediction, only local contexts that are very close to the current word can often be captured, so that it is trapped in the local understanding of the text, and it is difficult to capture the distance from the current word. Long-range context that is far away but helpful for its semantic understanding.

In order to better capture and utilize the long-range context, thereby improving the accuracy of text prediction, in the embodiments of this specification, a new text prediction model and its training method are proposed. The model divides the input text into text fragments, and stores the characterization vector of the text fragments in the buffer as a long-range context. When predicting the next word for the current word, the implicit vector corresponding to the current word and the representation vector stored in the buffer are comprehensively considered for prediction.

Fig. 1 is a schematic diagram of a text prediction model according to an embodiment disclosed in this specification. As shown in FIG. 1, the text prediction model includes a first prediction network 11 based on a time sequence, a buffer 12, a second prediction network 13 based on a buffer, and optionally a strategy network 14.

The first prediction network 11 includes a time series neural network, such as RNN, LSTM, and GRU_RNN. According to the working mode of the time series neural network, when the training text is input into the text prediction model, the first prediction network 11 reads the words in the training text in turn, and performs iterative processing on each word in turn. When performing iterative processing on each word W _t , according to the state vector h _t-1 after processing the previous word W _t-1 and the word vector of the current word, the state vector h _t after the iterative processing of the current word is obtained. The first prediction network 11 may also include a multi-layer perceptron MLP, which obtains the first prediction result p for the next word _{based on the state vector h t corresponding to the current word.}

The buffer 12 is used to store the characterization vector of the text segment (span) before the current word, that is, the segment vector. The length L of the text segment can be a predetermined length, for example, 2 words, 3 words, 5 words, and so on. In an embodiment, for a text segment formed from the i-th word to the j-th word (j=i+L-1), the segment vector may be the state vector corresponding to the j-th word output by the first prediction network 11 It is obtained by the difference of the state vector corresponding to the i-1th word.

The second prediction network 13 performs prediction operations based on the existing segment vectors stored in the buffer 12 to obtain the second prediction result q for the next word. The second prediction result q reflects the prediction result based on the long-range context.

Then, the first prediction result p and the second prediction result q are integrated. The interpolation weight coefficient λ can be used to interpolate and synthesize the two to obtain a comprehensive prediction result.

The above interpolation weight coefficients can be preset hyperparameters or trainable parameters. Optionally and preferably, the interpolation weight coefficient is different for each word, and is determined by the policy network 14. Specifically, the strategy network 14 obtains the state vector h _t corresponding to the current word from the first prediction network 11, and performs operations based on the state vector to obtain the interpolation weight coefficient λ for the current word, which is used for the first prediction result and the second prediction Synthesis of results.

It can be seen that the text prediction model shown in Figure 1 has at least the following characteristics. First, on the basis of using the temporal neural network for prediction, the segment vectors corresponding to the text segment before the current word are also stored in the buffer, and these segment vectors are used as the long-range context to perform prediction based on the long-range context. The final prediction result is a combination of the two parts of the prediction. Furthermore, the strategy network can be used to dynamically adjust the proportion of long-range prediction results, thereby further improving the accuracy of prediction.

The training process of the above-mentioned text prediction model will be described in detail below.

Fig. 2 shows a flowchart of a method for training a text prediction model according to an embodiment. It can be understood that the text prediction model has the structure shown in FIG. 1, and the method can be executed by any device, device, platform, or device cluster with computing and processing capabilities.

Before performing the steps shown in Figure 2, the following preparatory process can be performed in advance. First of all, get the training expectation, that is, the training sample set, which includes a large amount of training text. Before inputting a certain training text into the text prediction model, firstly, word embedding is performed on the training text, and each word in the training text is converted into a word vector, thereby converting the training text into a word vector sequence. In one embodiment, word embedding can be realized by one-hot encoding. At this time, the dimension of each word vector corresponds to the number V of words in the lexicon. In other embodiments, the conversion of word vectors can also be realized by other word embedding methods, for example, the word2vec method, and so on.

In one embodiment, the training text is Chinese text. At this point, in an example, the training text can be segmented first, and then word embedding can be performed for each word after the segmentation. In another example, each Chinese character is directly processed as a word. Therefore, the "word" in the following includes the case of single Chinese characters.

After word embedding, the training text can be input to the text prediction model for prediction and training. As mentioned above, the basic network of the text prediction model is still a time-series neural network. Therefore, for the current training text, each word (more specifically, a word vector) is input into the text prediction model in turn. Correspondingly, the text prediction model performs prediction processing on each input word in turn. The following describes the prediction processing process and training process of the text prediction model in combination with any t-th word in the training text.

As shown in FIG. 2, in step 21, the t-th word in the current training text is input into the first prediction network in the text prediction model. It can be understood that before this, the first t-1 words in the current training text have been sequentially input into the text prediction model.

As mentioned above, the first prediction network includes a time-series neural network, which jointly determines the state at the next moment according to the state at the previous moment and the current input. At the last moment, the t-1th word has been processed and the word vector x _t _{of the t-th word W t} is currently input, the first prediction network is based on the state vector h _{t- after processing the t-1th word. 1} , and the word vector x _{t of} the t-th word, determine the state vector h _t after processing the t-th word. This process can be expressed by the following formula (1):

h _t ＝Φ(x _t , h _t-1 ) (1)

Among them, Φ is a state transition function, and the specific function form depends on the network form of the sequential neural network, such as RNN or LSTM. The dimension of the state vector is denoted as d dimension.

Hereinafter, for simplicity and clarity, the state vector h _t after processing the current t-th word is called the first hidden vector.

The first prediction network may also include a multilayer perceptron MLP, which is used to determine the first prediction probability p for the next word _{according to the first latent vector h t.} More specifically, the first predicted probability p may include that the next word is the probability distribution of each word in the vocabulary. Assuming that the number of words in the lexicon is V, then the first predicted probability p can be expressed as a V-dimensional vector.

In one embodiment, in order to determine the first prediction probability p, MLP first applies a linear transformation matrix O _t+1 _{to the first hidden vector h t} . The linear transformation matrix is a trainable parameter matrix. A hidden vector h _{t is} transformed or projected into a V-dimensional vector. Optionally, apply the softmax function after this to obtain the probability distribution for each word. Specifically, the first predicted probability p for the next word can be expressed as:

in,

Represents the transposition of _{h t.}

Figure 3 shows an example of performing prediction processing on a specific training text. In the example in Figure 3, assume that the current input is the 92nd word "no" in the training text. Then, in the first prediction network, the temporal neural network _{obtains the state corresponding to the 92nd word according to the state vector h 91} after processing the 91st word "have" and the word vector corresponding to the 92nd word "no" The vector h ₉₂ . Then, according to the state vector h ₉₂ , the MLP obtains the first predicted probability p for the next word, that is, the 93rd word.

It can be understood that, in general, the prediction result obtained according to the state vector of the time series neural network more reflects the influence of the local context closer to the current word on the understanding of the current word meaning. For example, in the example in Figure 3, since the local context of the current word "no" is "i have", based on this, the prediction results of the first prediction network will tend to be common collocation words in the local context, such as "trouble ", "idea", output a higher prediction probability.

In order to make better use of the long-range context information, in step 22, read several existing segment vectors from the buffer, which are formed based on the text before the t-th word in the current training text, and Each segment vector corresponds to a text segment with a length of L consecutive words. In other words, in the process of sequentially processing the first t-1 words, several text fragments can be formed according to the length L, and the characterization vectors of these text fragments, that is, the fragment vectors, are stored in the buffer as long-range context information .

Specifically, the length L of the text segment can be preset according to needs. For example, for a longer training text, you can set a longer segment length, such as 8 words, 10 words, etc., for a shorter training text, you can Set a shorter segment length, such as 2 words, 3 words, and so on.

In this way, in the process of processing the text before the t-th word, the first t-1 words can form several text fragments m _ij according to the preset length L, where i is the word sequence number at the beginning of the text fragment, and j is the end of the text fragment The word sequence number of, i and j are both smaller than t, and j=i+L-1. The characterization vector of the text segment, that is, the segment vector, can be obtained based on the state vector when the first prediction network processes each preceding word.

_{Specifically, in one embodiment, for the text segment m ij} formed from the i-th word to the j-th word, the segment vector is obtained based on the difference between the first state vector and the second state vector, where the first _{The state vector is the state vector h j} after the first prediction network processes the jth word, that is, the state vector after the end word (jth word) of the text segment m _ij is processed; the second state vector is the first _{The prediction network processes the state vector hi-1} after the (i-1)th word, that is, processes the state vector before the start word (i-th word) of _{the text segment m ij.}

Fig. 4 shows a schematic diagram of determining a segment vector of a text segment according to an embodiment. In the example of FIG. 4, a text segment is formed with 2 words as the segment length. _{For the current text fragment m 12-13} formed by the 12th word and the 13th word shown in the box in Figure 4, the fragment vector can be determined by h ₁₃ -h ₁₁ , where h ₁₃ is a temporal neural network processing The state vector after the 13th word, h ₁₁ is the state vector after the sequential neural network processes the 11th word (that is, before the 12th word), or in other words, is the state vector at the end of the previous text segment.

In another embodiment, for the text segment m _ij formed from the i-th word to the j-th word, the state vector after the first prediction network processes each word from the i-th word to the j-th word is obtained, Obtain L state vectors, sum or average the L state vectors, and use them as the segment vector corresponding to the _{text segment m ij.}

The fragment vector can also be obtained in other ways. However, preferably, the time-series neural network is used to process the state vector of each preceding word, and each segment vector is calculated. In this way, the processing result of the first prediction network can be reused and the calculation method of the segment vector can be simplified.

Using the above various segment vector calculation methods, several segment vectors can be obtained in the process of sequentially iteratively processing each word of the current training text by the first prediction network. Specifically, a counter with L as a loop can be set to count the number of words processed by the first prediction network. When the processed words are accumulated one by one, the counter is incremented. Each time L words are accumulated, a new text fragment is formed. The counter is cleared and counted again. At this time, the fragment vector of the newly added text fragment is calculated and stored in the buffer.

Correspondingly, for the currently processed t-th word, it can be determined whether the t-th word is the last word of the current text segment. Specifically, it can be judged whether the count of the counter reaches L. If it is the last word of the current text segment, the current text segment is regarded as a newly-added text segment, and the segment vector of the newly-added text segment is calculated. Specifically, in an embodiment, the newly added segment vector may be determined according to the difference between the aforementioned first hidden vector h _t and the second hidden vector h _tL , where the second hidden vector h _tL is the tL-th one processed by the first prediction network. The state vector after the word. Then, the newly added segment vector is added to the buffer.

In one embodiment, the buffer used to store each segment vector of the previous text has a limited capacity size B. Accordingly, the buffer can only store a limited number of N segment vectors. In this case, the buffer can be made to store the segment vectors of the N text segments closest to the currently processed word. Specifically, in one embodiment, when adding a newly-added segment vector to the buffer, it is first determined whether the number of several segment vectors already in the buffer reaches the above-mentioned threshold number N, and if it does not reach the threshold number N, the newly-added segment vector is directly added. The fragment vector is added to the buffer; if the number of existing fragment vectors has reached the threshold number N, the earliest stored fragment vector is deleted, and the newly added fragment vector is stored in the buffer.

Continue the example in Figure 3. The current input is the 92nd word "no" in the training text. At this time, multiple segment vectors based on the text before the 92nd word have been stored in the buffer, where each segment vector corresponds to a text segment formed by three consecutive words. The text fragment closest to the current word is the text fragment m _89-91 from the 89th word to the 91st word. Due to the limited capacity of the buffer, the earliest segment vector stored therein corresponds to the text segment m _16-18 , that is, the text segment formed from the 16th word to the 18th word.

It can be seen that the segment vectors stored in the buffer can represent text segments that are far away from the current word. Therefore, these segment vectors can be used as long-range context information to help understand the semantics of the current word, and then help predict the next word.

Therefore, in step 23, the second prediction network is used to determine the second prediction probability q for the next word according to several segment vectors stored in the buffer. Specifically, the second prediction network can use the attention mechanism to integrate several existing segment vectors into a context vector, and then determine the second prediction probability q based on the context vector.

Fig. 5 shows a flow of steps for determining the second predicted probability according to an embodiment. First, in step 51, several attention coefficients corresponding to several segment vectors are determined. Specifically, for any i-th segment vector _si _{among several segment vectors, the corresponding attention coefficient α t,i} can be determined based on the similarity measurement.

In an embodiment, the similarity γ _t,i between the i-th segment vector s _i and the first hidden vector h _t can be determined, and the similarity can be cosine similarity, similarity determined based on Euclidean distance, etc. Wait. Then, according to the similarity γ _t,i , the i-th attention coefficient α _{t,i is determined} . Specifically, the softmax function can be used to normalize the similarity corresponding to each segment vector to obtain the corresponding attention coefficient. For example, the i-th attention coefficient α _t,i can be determined as:

α _t,i ∝exp(γ _t,i ) (3)

In another embodiment, for the i-th segment vector _si , the corresponding similarity is determined in the following manner. Specifically, the first transformation matrix W _s can be used to transform the i-th segment vector s _i into a first intermediate vector W _s s _i ; and the second transformation matrix W _{h can} be used to transform the first hidden vector h _{t into the first intermediate vector W s s i.} Two intermediate vectors W _h h _t _{; Then, determine the similarity γ t,i} between the sum vector of the first intermediate vector and the second intermediate vector and the third vector v, namely:

γ _t,i =v ^T (W _h h _t +W _s s _i ) (4)

Wherein the first transformation matrix W _s, a second transform matrix, and the third vector v s _h are predicted second network may be trained network parameters.

Then, according to the similarity γ _t,i , the i-th attention coefficient α _t,i can be determined similarly using formula (3).

Next, in step 52, using each attention coefficient corresponding to each segment vector as a weighting factor, weighted combination of the foregoing several segment vectors, to obtain a context vector ξ _t .

In an example, the segment vectors s _i stored in the buffer can be sequentially arranged into a vector sequence C _t _{, and the attention coefficients α t,i} corresponding to each segment vector can be arranged into an attention vector α _t , In this way, the context vector ξ _t can be expressed as:

Therefore, in step 53, the second predicted probability q is obtained _{according to the context vector ξ t and a linear transformation matrix.} It can be understood that, similar to the first predicted probability p, the second predicted probability q may include the probability distribution of each word in the dictionary as the next word, so q is also a V-dimensional vector. Correspondingly, the linear transformation matrix used in step 53 is used to transform _{or project the d-dimensional context vector ξ t} into a V-dimensional vector. Specifically, the second predicted probability q can be expressed as:

Among them, O _t+1 is the linear transformation matrix for the context vector.

In one embodiment, the linear transformation matrix for the context vector in formula (6) is the same matrix as the linear transformation matrix for the first implicit vector in formula (2). In another embodiment, the second predictive network maintains the linear transformation matrix for the context vector in formula (6), which is independent of the linear transformation matrix used by the first predictive network in formula (2).

In the above manner, the second prediction network obtains the second prediction probability q for the next word according to the segment vector stored in the buffer. As mentioned above, the segment vector stored in the buffer area reflects the long-range context information. Therefore, the second prediction probability q obtained based on the segment vector can reflect the prediction of the next word based on the long-range context.

Continue the example in Figure 3. The buffer stores fragment vectors of previous text fragments. These previous text fragments contain text fragments relatively far from the current word, such as m _16-18 . Based on these segment vectors, the attention mechanism is used to obtain the second predicted probability q for the next word, which is made with more consideration of the long-range context. For example, since the text fragment m _16-18 contains the long-range context "good restaurant", the second predicted probability q tends to be related words in the long-range context, such as "appetite", which outputs a higher predicted probability.

On the basis of obtaining the first prediction probability p and the second prediction probability q respectively, in step 24 of FIG. 2, the interpolation weight coefficient λ is used as the weighting coefficient of the second prediction probability q, and 1 minus λ is used as the first prediction probability The weighting coefficient of p is to perform interpolation weighted synthesis on the first prediction probability and the second prediction probability to obtain the comprehensive prediction probability Pr for the next word, namely:

Pr=λ*q+(1-λ)*p (7)

Then, in step 25, the prediction loss for the t-th word is determined at least according to the above-mentioned comprehensive prediction probability Pr and the t+1-th word in the current training text.

In an embodiment, the above-mentioned interpolation weight coefficient is a preset hyperparameter or a trainable model parameter. At this time, the true next word in the training text, that is, the t+1th word, can be used as a label, and the prediction loss for the current word can be determined according to the comparison of the comprehensive prediction probability Pr and the label. For example, the cross entropy loss function can be used to determine the prediction loss Loss:

Loss=-logPr(x _t+1 |x _{1: t} ) (8)

Therefore, in step 26, a text prediction model is trained based on the total prediction loss for each word in the current training text. Specifically, the first prediction network and the second prediction network are updated in the direction in which the total prediction loss is reduced.

Further, the inventor found that for a piece of text, in most cases, the understanding of the current word and the prediction of the next word depend more on the local context, and only in a few cases, it depends on the long-range context. Therefore, when the first prediction probability and the second prediction probability are interpolated and integrated, it is preferable that the interpolation weight coefficient is not fixed, but is different from word to word depending on each word.

To this end, as shown in Figure 1, in one embodiment, on the basis of the first prediction network and the second prediction network, the text prediction model further includes a strategy network, which is used to determine the corresponding word for the current word. Interpolate the weight coefficient λ. The following describes how the strategy network determines the interpolation weight coefficients and its training methods.

Specifically, in order to determine the corresponding interpolation weight coefficient λ _t _{for the current t-th word, the strategy network may obtain the first latent vector h t} obtained by the first prediction network processing the t-th word, and according to the first latent vector h _t , calculate the interpolation weight coefficient λ _t .

In a specific embodiment, the policy network may apply a policy transformation matrix W _g _{to the above-mentioned first hidden vector h t to} obtain a policy vector W _g h _t , where the policy transformation matrix W _g is a trainable model maintained in the policy network The parameter can be an M*d-dimensional matrix, so that the d-dimensional first hidden vector is transformed into an M-dimensional strategy vector, where M is the preset number of dimensions. _{Then, the interpolation weight coefficient λ t} can be determined according to the element value of the predetermined dimension in the M-dimensional strategy vector. For example, the element value of a certain dimension after normalization of the strategy vector can be used as the interpolation weight coefficient λ _t , namely:

λ _t ∝exp(W _g h _t ) (9)

For example, typically, M=2 can be taken, and then a two-dimensional strategy vector can be obtained through the strategy transformation matrix. Then, based on the element value of one dimension in the 2-dimensional vector, the interpolation weight coefficient λ _{t can be obtained} . In a more simplified example, M=1 can be taken, then the strategy transformation matrix W _g degenerates to a vector, and the strategy vector degenerates to a value, and the interpolation weight coefficient λ _t can be obtained based on the value.

Further, in order to better control the size of the output interpolation weight coefficient, a training strategy coefficient T is also set in the strategy network. The training strategy coefficient T can be a hyperparameter that can be adjusted during the training process, more specifically according to Each training text is determined, so as to better adjust the output of the interpolation weight coefficient.

In this case, the above formula (9) can be modified to the following formula (10):

That is, on the basis of applying the strategy transformation matrix W _g to the first implicit vector, it is divided by the training strategy coefficient T mentioned above to obtain the strategy vector; then based on the strategy vector, the interpolation weight coefficient λ _{t is obtained} .

It can be seen from formula (10) that the smaller the training strategy coefficient T, the larger the interpolation weight coefficient obtained. According to formula (7), the interpolation weight coefficient is a weight coefficient applied to the second prediction probability. Therefore, the larger the interpolation weight coefficient, it means that the use of the remote context is encouraged.

Therefore, in one embodiment, a process similar to "annealing" may be used to set and adjust the aforementioned training strategy coefficients. Specifically, a larger training strategy coefficient T, or a higher temperature T, can be set at the beginning of training; then, as the training progresses, the training strategy coefficient T, or the temperature T, is gradually reduced. This means that as training progresses, text prediction models are encouraged to explore the use of long-range context.

In a specific example, the training strategy coefficient T can be determined according to the training sequence number of the current training text in the training sample set, so that the training strategy coefficient T is negatively correlated with the training sequence number. In other words, the smaller the training sequence number is, the closer it is to the beginning of training. At this time, the larger the training strategy coefficient T, the higher the temperature T; as the training sequence number increases, the temperature decreases, and the training strategy coefficient decreases.

On the other hand, the training strategy coefficient T for the current training text can also be determined according to the total text length of the current training text. Specifically, the training strategy coefficient T can be negatively correlated with the total text length. Therefore, for a longer training text, a smaller coefficient T can be set to obtain a larger interpolation weight coefficient, thereby more encouraging the use of long-range context.

Through the above multiple methods, the strategy network determines the corresponding interpolation weight coefficient λ _t for the current t-th word in the current training text. The interpolation weight coefficient is applied to the above formula (7) to obtain the comprehensive predicted probability Pr.

Continue the example in Figure 3. In Fig. 3, the first prediction network _{obtains the first prediction probability p for the 93rd word according to the current state vector h 92} of the 92nd word, that is, the first hidden vector; the second prediction network obtains the first prediction probability p for the 93rd word according to the storage The segment vector of, get the second predicted probability q. The strategy network obtains the interpolation weight coefficient according to the above-mentioned first hidden vector h ₉₂ and the training strategy coefficient T (shown as the "annealing" temperature in the figure). Therefore, the interpolation weight coefficient can be used to perform interpolation synthesis on the first prediction probability p and the second prediction probability q to obtain the comprehensive prediction probability Pr.

In order to train the strategy network, the aforementioned method of determining the prediction loss needs to be modified. When determining the prediction loss loss, not only the comprehensive prediction probabilities obtained from the first and second prediction networks are considered, but the output of the strategy network is also considered. Therefore, according to an embodiment, in the foregoing step 25, according to the comprehensive prediction probability and the t+1th word, and according to the first prediction probability p, the second prediction probability q, and the interpolation weight coefficients, determine Forecast loss Loss.

In an embodiment, in the case of combining a policy network, the prediction loss can be determined in the following manner. On the one hand, the first loss term L1 can be determined according to the comprehensive prediction probability Pr and the t+1th word. The first loss term L1 can take the form of cross-entropy loss, as shown in formula (8). In other words, the loss shown in formula (8) can be used as the first loss item L1 here.

On the other hand, according to the interpolation weight coefficient λ _t , the second loss term L2 is determined so that the second loss term is negatively related to the interpolation weight coefficient. For example, in an example, the second loss term can be set as:

L2=-logλ _t (11)

In other examples, the second loss term L2 can also be set to other forms of negative correlation, for example, 1/λ _t .

Further, according to a first and a second prediction probability q p are predicted probability for the value of the t + 1 terms of the probability of the ratio, r _t term prize is determined, the bonus items r _t positively correlated with the ratio; and, The reward term r _{t is} used as the coefficient of the second loss term L2, and the first loss term and the second loss term are summed to determine the predicted loss Loss.

In the case that the second loss term adopts the form of formula (11), the predicted loss Loss can be expressed as:

Loss=-logPr(x _t+1 |x _1:t )-η*r _t *logλ _t (12)

Among them, η is an optional adjustment coefficient, η>0.

As shown in formula (12), the first term in the loss function expression corresponds to the first loss term, which aims to increase the probability of correctly predicting the next word. The second term is the product of the reward term and the second loss term, which aims to conditionally encourage the exploration and use of the long-range context.

It can be seen that r _t *logλ _t is very similar in form to the policy gradient in reinforcement learning. In fact, encouraging exploration and use of long-range context can be embodied by the second loss term L2 itself, because a smaller value of the second loss term corresponds to a larger λ _t . However, as mentioned earlier, in fact, only a few cases need to rely on long-range context for prediction. Therefore, the encouragement of the long-range context should be carried out conditionally, and the condition is reflected _{by the reward item r t.} The adjustment of the reward term means that only when the prediction probability of the second prediction network for the correct next word is significantly higher than the prediction probability of the first prediction network, a larger interpolation weight coefficient λ _t is encouraged.

Specifically, the second prediction network outputs the second prediction probability q, where the probability value for the real t+1th word (that is, the correct next word) is q(x _t+1 |x _1:t ); The probability value of the first prediction network for the t+1th word is p(x _t+1 |x _1:t ). The ratio of the two can be defined as R:

R=q(x _t+1 |x _{1: t} )/p(x _t+1 |x _{1: t} ) (13)

The above ratio R may reflect the relative prediction accuracy of the second prediction network and the first prediction network for the correct next word. Setting the reward item r _t is positively related to the ratio R, that is, the larger the ratio R, the greater the reward item r _t . Moreover, during the training process, the correct next word, that is, the t+1th word, is known. Therefore, the size of the reward item can be clearly and uniquely determined. Therefore, this reward item can also be called Intrinsic Rewards.

The reward item r _t can be determined in a variety of ways based on the above ratio R.

In a specific example, the reward item r _{t is} determined in the following way (14):

Among them, ∈ is a minimum value, which is set in order to avoid _{mathematical problems caused when p(x t+1} |x _1:t ) is 0. Therefore, it can be considered

Approximately equal to the above ratio R.

More specifically, in an example, the above function f(z) can adopt the ReLU function:

The κ in formula (14) is used to amplify the effect of R in exponential form, and the β in formula (15) is used for linear amplification. These parameters can be set according to needs and practice. For example, in one example, κ=5 and β=3. In addition, the parameter a in formula (14) is the cutoff threshold, and the parameter b is the reference threshold. These thresholds can also be set according to needs and practice. For example, in an example, take a=10 and b=1.

In other examples, other specific forms may also be used to determine the reward item r _t according to the aforementioned ratio R, as long as the reward item r _t and the ratio R are positively correlated.

When the prediction loss is determined according to formula (12), if the prediction loss is to be reduced, on the basis of increasing the prediction probability of the correct word according to the first loss term, the second term is also required to be as small as possible. For this reason, when the prediction probability of the second prediction network for the next correct word is significantly higher than that of the first prediction network, that is, when the above-mentioned ratio R is larger, a larger reward term r _{t is obtained} , which forces the second loss term to be smaller , That is, the strategy network outputs a larger λ _t , so as to conditionally encourage a larger interpolation weight coefficient λ _t , that is, conditionally encourage the purpose of long-range context.

In this way, after the prediction loss is determined in step 25 according to the loss function of formula (12), in step 26, the text prediction model is trained according to the total prediction loss of each word, that is, in the direction in which the total prediction loss decreases, the first The model parameters in the prediction network, the second prediction network, and the strategy network achieve the above training goals.

Recalling the above process, according to the text prediction model of the embodiment of this specification, on the basis of using the first prediction network based on time sequence to predict the next word, it also uses the segment vector of the previous text segment stored in the buffer as the long-range context information, And use the second prediction network to make predictions based on the long-range context. When performing interpolation synthesis on the prediction results of the first prediction network and the second prediction network, the strategy network can be used to generate an interpolation weight coefficient for the current word. When training the above-mentioned text prediction model, by introducing reward items and interpolation weight coefficients into the loss function, the exploration and utilization of long-range context are conditionally encouraged, thereby further improving the accuracy of prediction.

According to another embodiment, a training device for a text prediction model is provided. The text prediction model includes a first prediction network based on time series, a buffer, a second prediction network based on the buffer, and the training device It can be deployed in any device, platform or device cluster with computing and processing capabilities. Fig. 6 shows a schematic block diagram of a training device for a text prediction model according to an embodiment. As shown in FIG. 6, the training device 600 includes: a first prediction unit 61 configured to input the t-th word into the first prediction network after sequentially inputting the first t-1 words in the current training text, so that The first prediction network determines the state vector after processing the t-th word as the first latent vector according to the state vector after processing the t-1th word and the word vector of the t-th word; An implicit vector determines the first prediction probability for the next word; the reading unit 62 is configured to read several existing segment vectors from the buffer, and the existing several segment vectors are based on the current training The text before the t-th word in the text is formed, and each segment vector corresponds to a text segment with a length of L words; the second prediction unit 63 is configured to make the second prediction network according to the plurality of segment vectors , Determine the second prediction probability for the next word; the synthesis unit 64 is configured to use the interpolation weight coefficient as the weight coefficient of the second prediction probability, and use the difference of 1 minus the interpolation weight coefficient as the first A weighting coefficient of prediction probability, interpolating and weighting the first prediction probability and the second prediction probability to obtain a comprehensive prediction probability for the next word; the loss determination unit 65 is configured to at least according to the comprehensive prediction probability and The t+1th word in the training text determines the prediction loss for the tth word; the training unit 66 is configured to train the text prediction model according to the prediction loss for each word in the current training text.

According to an embodiment, the several segment vectors stored in the buffer include a first segment vector corresponding to any first text segment, and the first text segment includes the i-th word to the j-th word of the current training text, Wherein i and j are both less than t, the first segment vector is obtained based on the difference between the first state vector and the second state vector, and the first state vector is the first prediction network processing the jth The state vector after the word, and the second state vector is the state vector after the first prediction network processes the (i-1)th word.

According to an embodiment, the device 600 further includes a storage unit (not shown), configured to: if the t-th word is the last word of the current text segment, then according to the difference between the first latent vector and the second latent vector The difference determines the newly added segment vector, where the second implicit vector is the state vector after the first prediction network processes the tLth word; the newly added segment vector is added to the buffer.

In one embodiment, the buffer has a limited storage capacity. In this case, the storage unit is further configured to: determine whether the number of fragment vectors already in the buffer reaches a predetermined threshold number; The predetermined threshold number is deleted, the earliest stored segment vector is deleted, and the newly added segment vector is stored in the buffer.

According to one embodiment, the second prediction network obtains the second prediction probability by: determining a number of attention coefficients corresponding to the several segment vectors; using the several attention coefficients as weighting factors, A number of segment vectors are weighted and combined to obtain a context vector; and the second prediction probability is obtained according to the context vector and the linear transformation matrix.

In an embodiment, when the first prediction network determines the first prediction probability, the first prediction probability is obtained according to the first hidden vector and the same linear transformation matrix as the second prediction network.

In a more specific embodiment, the second prediction network determines the attention coefficient in the following manner: according to the similarity between any i-th segment vector in the plurality of segment vectors and the first latent vector, determine the i-th segment vector Attention coefficient.

In another more specific embodiment, the second prediction network determines the attention coefficient by using a first transformation matrix to transform any i-th segment vector in the plurality of segment vectors into a first intermediate vector; A second transformation matrix, transforming the first hidden vector into a second intermediate vector; determining the similarity between the sum vector of the first intermediate vector and the second intermediate vector and the third vector; determining the i-th vector according to the similarity Attention coefficient; wherein, the first transformation matrix, the second transformation matrix and the third vector are all trainable network parameters in the second prediction network.

According to an embodiment, the text prediction model further includes a strategy network for outputting the interpolation weight coefficient according to the first latent vector; in this case, the loss determination unit 65 is further configured to The comprehensive prediction probability, the t+1th word, the first prediction probability and the second prediction probability, and the interpolation weight coefficient determine the prediction loss.

In one embodiment, the strategy network determines the interpolation weight coefficients in the following manner: at least a strategy transformation matrix is applied to the first implicit vector to obtain a strategy vector, wherein the strategy transformation matrix is Trainable model parameters; determine the interpolation weight coefficient according to the element value of the predetermined dimension in the strategy vector.

In a further embodiment, the strategy network obtains the strategy vector in the following manner: determining the training strategy coefficient according to the current training text; applying the strategy transformation matrix to the first implicit vector and dividing by the training strategy Coefficient to obtain the strategy vector.

Furthermore, in an example, the strategy network determining the training strategy coefficient specifically includes: determining the training strategy coefficient according to the training sequence number of the current training text in the training sample set, so that the training strategy coefficient is the same as the training strategy coefficient. The number of the training sequence is negatively correlated.

In another example, determining the training strategy coefficient by the strategy network specifically includes: determining the training strategy coefficient according to the total text length of the current training text, so that the training strategy coefficient is negatively correlated with the total text length.

In an embodiment, the loss determination unit 65 is specifically configured to: determine a first loss item according to the comprehensive predicted probability and the t+1th word; and determine a second loss item according to the interpolation weight coefficient Item, wherein the second loss item is negatively correlated with the interpolation weight coefficient; according to the ratio of the second prediction probability and the first prediction probability to the probability value of the t+1th word, the determination of the Reward item, the reward item is positively related to the ratio; taking the reward item as the coefficient of the second loss item, sum the first loss item and the second loss item to determine the Forecast loss.

Through the above device, the training of the text prediction model is realized.

According to another embodiment, there is also provided a computer-readable storage medium having a computer program stored thereon, and when the computer program is executed in a computer, the computer is caused to execute the method described in conjunction with FIG. 2.

According to an embodiment of still another aspect, there is also provided a computing device, including a memory and a processor, the memory is stored with executable code, and when the processor executes the executable code, it implements the method described in conjunction with FIG. 2 method.

Those skilled in the art should be aware that, in one or more of the above examples, the functions described in the present invention can be implemented by hardware, software, firmware, or any combination thereof. When implemented by software, these functions can be stored in a computer-readable medium or transmitted as one or more instructions or codes on the computer-readable medium.

The specific embodiments described above further describe the purpose, technical solutions and beneficial effects of the present invention in detail. It should be understood that the above are only specific embodiments of the present invention, and are not intended to limit the scope of the present invention. The protection scope, any modification, equivalent replacement, improvement, etc. made on the basis of the technical solution of the present invention shall be included in the protection scope of the present invention.

Claims

A method for training a text prediction model, the text prediction model comprising a first prediction network based on time series, a buffer, and a second prediction network based on the buffer, the method comprising:

After sequentially inputting the first t-1 words in the current training text, input the t-th word into the first prediction network, so that the first prediction network processes the state vector of the t-1th word according to the state vector, and The word vector of the t-th word, determining the state vector after processing the t-th word as the first latent vector; and determining the first predicted probability for the next word according to the first latent vector;

Read several existing segment vectors from the buffer, the existing several segment vectors are formed based on the text before the t-th word in the current training text, and each segment vector corresponds to a length of Text fragment of L words;

The second prediction network determines the second prediction probability for the next word according to the several segment vectors;

The interpolation weight coefficient is used as the weighting coefficient of the second prediction probability, and the difference of 1 minus the interpolation weight coefficient is used as the weighting coefficient of the first prediction probability. Probability is interpolated and weighted and integrated to obtain the comprehensive predicted probability for the next word;

Determine the prediction loss for the t-th word at least according to the comprehensive prediction probability and the t+1-th word in the training text;

Training the text prediction model according to the prediction loss for each word in the current training text.
The method according to claim 1, wherein the first prediction network comprises a recurrent neural network (RNN) or a long short-term memory network (LSTM).
The method according to claim 1, wherein the plurality of segment vectors includes a first segment vector corresponding to a first text segment, and the first text segment includes the i-th word to the j-th word of the current training text , Wherein i and j are both smaller than t, the first segment vector is obtained based on the difference between the first state vector and the second state vector, and the first state vector is the first prediction network processing the jth A state vector after a word, and the second state vector is a state vector after the first prediction network processes the (i-1)th word.
The method according to claim 1 or 3, further comprising:

If the t-th word is the last word of the current text segment, a new segment vector is determined according to the difference between the first latent vector and the second latent vector, where the second latent vector is the first prediction network Process the state vector after the tL word;

Adding the newly added segment vector to the buffer.
The method according to claim 4, wherein adding the newly added segment vector to the buffer comprises:

Judging whether the number of several segment vectors already in the buffer reaches a predetermined threshold number;

If the predetermined threshold number is reached, the earliest stored segment vector is deleted, and the newly added segment vector is stored in the buffer.
The method according to claim 1, wherein the second prediction network determines the second prediction probability for the next word according to the plurality of segment vectors, comprising:

Determining several attention coefficients respectively corresponding to the several segment vectors;

Using the several attention coefficients as weighting factors, weighting and combining the several segment vectors to obtain a context vector;

According to the context vector and the linear transformation matrix, the second predicted probability is obtained.
The method according to claim 6, wherein determining the first predicted probability for the next word according to the first hidden vector comprises:

According to the first hidden vector and the linear transformation matrix, the first predicted probability is obtained.
8. The method according to claim 6, wherein determining a plurality of attention coefficients respectively corresponding to the plurality of segment vectors comprises:

Determine the i-th attention coefficient according to the similarity between any i-th segment vector in the several segment vectors and the first hidden vector.
8. The method according to claim 6, wherein determining a plurality of attention coefficients respectively corresponding to the plurality of segment vectors comprises:

Using the first transformation matrix to transform any i-th segment vector in the plurality of segment vectors into a first intermediate vector;

Using a second transformation matrix to transform the first hidden vector into a second intermediate vector;

Determine the similarity between the sum vector of the first intermediate vector and the second intermediate vector and the third vector;

Determine the i-th attention coefficient according to the similarity;

Wherein, the first transformation matrix, the second transformation matrix and the third vector are all trainable network parameters in the second prediction network.
The method according to claim 1, wherein the text prediction model further comprises a strategy network; before performing interpolation weighted synthesis on the first prediction probability and the second prediction probability, the method further comprises:

The strategy network outputs the interpolation weight coefficient according to the first hidden vector;

Determining the prediction loss at least according to the comprehensive prediction probability and the t+1th word in the training text includes: according to the comprehensive prediction probability, the t+1th word, the first prediction probability and the t+1th word Second, the prediction probability, and the interpolation weight coefficient, determine the prediction loss.
The method according to claim 10, wherein the strategy network outputting the interpolation weight coefficient according to the first hidden vector comprises:

Applying at least a strategy transformation matrix to the first implicit vector to obtain a strategy vector, wherein the strategy transformation matrix is a trainable model parameter in the strategy network;

The interpolation weight coefficient is determined according to the element value of the predetermined dimension in the strategy vector.
The method according to claim 11, wherein applying at least a strategy transformation matrix to the first implicit vector to obtain a strategy vector comprises:

Determine the training strategy coefficient according to the current training text;

The strategy transformation matrix is applied to the first implicit vector and divided by the training strategy coefficient to obtain the strategy vector.
The method according to claim 12, wherein determining the training strategy coefficient according to the current training text comprises:

The training strategy coefficient is determined according to the training sequence number of the current training text in the training sample set, so that the training strategy coefficient is negatively correlated with the training sequence number.
The method according to claim 12, wherein determining the training strategy coefficient according to the current training text comprises:

The training strategy coefficient is determined according to the total text length of the current training text, so that the training strategy coefficient is negatively correlated with the total text length.
The method according to claim 10, wherein, according to the first prediction probability, the second prediction probability, the comprehensive prediction probability, the t+1th word, and the interpolation weight coefficient, the Forecast loss, including:

Determine the first loss item according to the comprehensive predicted probability and the t+1th word;

Determine a second loss term according to the interpolation weight coefficient, wherein the second loss term is negatively related to the interpolation weight coefficient;

Determine the reward item according to the ratio of the second predicted probability and the first predicted probability to the probability value of the t+1th word, and the reward item is positively related to the ratio;

The reward item is used as a coefficient of the second loss item, and the first loss item and the second loss item are summed to determine the predicted loss.
A training device for a text prediction model. The text prediction model includes a first prediction network based on time series, a buffer, and a second prediction network based on the buffer. The device includes:

The first prediction unit is configured to input the t-th word into the first prediction network after sequentially inputting the first t-1 words in the current training text, so that the first prediction network processes the t-1th word according to The state vector after the word, and the word vector of the t-th word, determine the state vector after processing the t-th word as the first hidden vector; and determine the first prediction for the next word according to the first hidden vector Probability

The reading unit is configured to read several existing segment vectors from the buffer, the existing several segment vectors are formed based on the text before the t-th word in the current training text, and each The segment vector corresponds to a text segment with a length of L words;

A second prediction unit configured to enable the second prediction network to determine a second prediction probability for the next word according to the several segment vectors;

The synthesis unit is configured to use the interpolation weight coefficient as the weighting coefficient of the second prediction probability, and use the difference value of 1 minus the interpolation weight coefficient as the weighting coefficient of the first prediction probability, and perform the calculation on the first prediction probability. The probability and the second predicted probability are interpolated and weighted and integrated to obtain the comprehensive predicted probability for the next word;

A loss determining unit, configured to determine a prediction loss for the t-th word at least according to the comprehensive prediction probability and the t+1-th word in the training text;

The training unit is configured to train the text prediction model according to the prediction loss for each word in the current training text.
The apparatus according to claim 16, wherein the first prediction network comprises a recurrent neural network (RNN) or a long short-term memory network (LSTM).
The device according to claim 16, wherein the plurality of segment vectors includes a first segment vector corresponding to a first text segment, and the first text segment includes the i-th word to the j-th word of the current training text , Wherein i and j are both smaller than t, the first segment vector is obtained based on the difference between the first state vector and the second state vector, and the first state vector is the first prediction network processing the jth A state vector after a word, and the second state vector is a state vector after the first prediction network processes the (i-1)th word.
The device according to claim 16 or 18, further comprising a storage unit configured to:

If the t-th word is the last word of the current text segment, a new segment vector is determined according to the difference between the first latent vector and the second latent vector, where the second latent vector is the first prediction network Process the state vector after the tL word;

Adding the newly added segment vector to the buffer.
The device according to claim 19, wherein the storage unit is further configured to:

Judging whether the number of several segment vectors already in the buffer reaches a predetermined threshold number;

If the predetermined threshold number is reached, the earliest stored segment vector is deleted, and the newly added segment vector is stored in the buffer.
The apparatus according to claim 16, wherein the second prediction network is specifically configured to:

Determining several attention coefficients respectively corresponding to the several segment vectors;

Using the several attention coefficients as weighting factors, weighting and combining the several segment vectors to obtain a context vector;

According to the context vector and the linear transformation matrix, the second predicted probability is obtained.
The device according to claim 21, wherein the first prediction network is specifically configured to:

According to the first hidden vector and the linear transformation matrix, the first predicted probability is obtained.
The device according to claim 21, wherein the second prediction network is specifically used for:

Determine the i-th attention coefficient according to the similarity between any i-th segment vector in the several segment vectors and the first hidden vector.
The device according to claim 21, wherein the second prediction network is specifically used for:

Using the first transformation matrix to transform any i-th segment vector in the plurality of segment vectors into a first intermediate vector;

Using a second transformation matrix to transform the first hidden vector into a second intermediate vector;

Determine the similarity between the sum vector of the first intermediate vector and the second intermediate vector and the third vector;

Determine the i-th attention coefficient according to the similarity;

Wherein, the first transformation matrix, the second transformation matrix and the third vector are all trainable network parameters in the second prediction network.
The device according to claim 16, wherein the text prediction model further comprises a strategy network for outputting the interpolation weight coefficient according to the first implicit vector;

The loss determination unit is configured to determine the prediction loss according to the comprehensive prediction probability, the t+1th word, the first prediction probability and the second prediction probability, and the interpolation weight coefficient.
The device according to claim 25, wherein the policy network is specifically used for:

Applying at least a strategy transformation matrix to the first implicit vector to obtain a strategy vector, wherein the strategy transformation matrix is a trainable model parameter in the strategy network;

The interpolation weight coefficient is determined according to the element value of the predetermined dimension in the strategy vector.
The device according to claim 26, wherein the obtaining of the policy vector by the policy network specifically comprises:

Determine the training strategy coefficient according to the current training text;

The strategy transformation matrix is applied to the first implicit vector and divided by the training strategy coefficient to obtain the strategy vector.
The device according to claim 27, wherein the determination of the training strategy coefficient by the strategy network specifically comprises:

The training strategy coefficient is determined according to the training sequence number of the current training text in the training sample set, so that the training strategy coefficient is negatively correlated with the training sequence number.
The device according to claim 27, wherein the determination of the training strategy coefficient by the strategy network specifically comprises:

The training strategy coefficient is determined according to the total text length of the current training text, so that the training strategy coefficient is negatively correlated with the total text length.
The apparatus according to claim 25, wherein the loss determination unit is configured to:

Determine the first loss item according to the comprehensive predicted probability and the t+1th word;

Determine a second loss term according to the interpolation weight coefficient, wherein the second loss term is negatively related to the interpolation weight coefficient;

Determine the reward item according to the ratio of the second predicted probability and the first predicted probability to the probability value of the t+1th word, and the reward item is positively related to the ratio;

The reward item is used as a coefficient of the second loss item, and the first loss item and the second loss item are summed to determine the predicted loss.
A computer-readable storage medium having a computer program stored thereon, and when the computer program is executed in a computer, the computer is caused to execute the method according to any one of claims 1-15.
A computing device, comprising a memory and a processor, characterized in that executable code is stored in the memory, and when the processor executes the executable code, the method described in any one of claims 1-15 is implemented. method.