WO2021155705A1 - Text prediction model training method and apparatus - Google Patents

Text prediction model training method and apparatus

Info

Publication number
WO2021155705A1
Authority
WO
WIPO (PCT)
Prior art keywords
vector
prediction
word
text
segment
Prior art date
Application number
PCT/CN2020/132617
Other languages
French (fr)
Chinese (zh)
Inventor
李扬名
姚开盛
Original Assignee
支付宝(杭州)信息技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 支付宝(杭州)信息技术有限公司 filed Critical 支付宝(杭州)信息技术有限公司
Publication of WO2021155705A1 publication Critical patent/WO2021155705A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Definitions

  • One or more embodiments of this specification relate to the field of machine learning, and in particular to a training method and device for a text prediction model.
  • the text classification task can be used in the intelligent question answering customer service system to classify the question raised by the user as input text for user intention recognition, automatic question answering, or manual customer service dispatch.
  • Text classification can also be used in various application scenarios, such as document data classification, public opinion analysis, spam identification, and so on.
  • machine translation tasks in different languages are widely used in various automatic translation systems.
  • the language model is the basic model for performing the above-mentioned various specific natural language processing tasks.
  • Language models need to be trained based on a large corpus.
  • Text prediction, that is, predicting subsequent text based on existing text, is a basic task for training language models.
  • One or more embodiments of this specification describe a text prediction model and a training method therefor, in which local context and long-range context are used together for prediction, comprehensively improving the text prediction model's ability to understand text and its prediction accuracy for subsequent text.
  • According to a first aspect, a method for training a text prediction model is provided, the text prediction model including a first prediction network based on time sequence, a buffer, and a second prediction network based on the buffer. The method includes: after sequentially inputting the first t-1 words of the current training text, inputting the t-th word into the first prediction network, so that the first prediction network determines, according to the state vector after processing the (t-1)-th word and the word vector of the t-th word, the state vector after processing the t-th word as a first hidden vector, and determines a first prediction probability for the next word according to the first hidden vector; reading several existing segment vectors from the buffer, the existing segment vectors being formed based on the text before the t-th word in the current training text, each segment vector corresponding to a text segment with a length of L words; determining, by the second prediction network, a second prediction probability for the next word according to the several segment vectors; using an interpolation weight coefficient as the weighting coefficient of the second prediction probability and 1 minus the interpolation weight coefficient as the weighting coefficient of the first prediction probability, performing interpolation weighted synthesis on the first and second prediction probabilities to obtain a comprehensive prediction probability for the next word; determining a prediction loss for the t-th word at least according to the comprehensive prediction probability and the (t+1)-th word in the training text; and training the text prediction model according to the prediction losses for the respective words in the current training text.
  • the first prediction network includes a recurrent neural network RNN or a long short-term memory network LSTM.
  • According to an embodiment, the several segment vectors stored in the buffer include a first segment vector corresponding to any first text segment, where the first text segment includes the i-th word to the j-th word of the current training text, i and j both being smaller than t. The first segment vector is obtained based on the difference between a first state vector and a second state vector, where the first state vector is the state vector after the first prediction network processes the j-th word, and the second state vector is the state vector after the first prediction network processes the (i-1)-th word.
  • According to an embodiment, the above method further includes: if the t-th word is the last word of the current text segment, determining a newly added segment vector according to the difference between the first hidden vector and a second hidden vector, where the second hidden vector is the state vector after the first prediction network processes the (t-L)-th word; and adding the newly added segment vector to the buffer.
  • In one embodiment, the buffer has a limited storage capacity. In this case, before a newly added segment vector is added to the buffer, it is first determined whether the number of segment vectors already in the buffer reaches a predetermined threshold number; if the predetermined threshold number is reached, the earliest stored segment vector is deleted, and the newly added segment vector is then stored in the buffer.
  • the second prediction network determines the second prediction probability for the next word in the following manner: determining several attention coefficients corresponding to the several segment vectors; taking the several attention coefficients as weighting factors, The several segment vectors are weighted and combined to obtain a context vector; and the second prediction probability is obtained according to the context vector and the linear transformation matrix.
  • the first prediction network obtains the first prediction probability according to the first hidden vector and the linear transformation matrix.
  • the second prediction network determines the attention coefficient in the following manner: according to the similarity between any i-th segment vector in the plurality of segment vectors and the first latent vector, determine The i-th attention coefficient.
  • the second prediction network determines the attention coefficient by using a first transformation matrix to transform any i-th segment vector in the plurality of segment vectors into a first intermediate vector; Use the second transformation matrix to transform the first hidden vector into a second intermediate vector; determine the similarity between the sum vector of the first intermediate vector and the second intermediate vector and the third vector; determine according to the similarity The i-th attention coefficient; wherein, the first transformation matrix, the second transformation matrix and the third vector are all trainable network parameters in the second prediction network.
  • the text prediction model further includes a strategy network; before performing interpolation weighted synthesis on the first prediction probability and the second prediction probability, the method further includes: the strategy network according to the first prediction probability A latent vector, outputting the interpolation weight coefficient; and the step of determining the prediction loss specifically includes: according to the comprehensive prediction probability, the t+1th word, the first prediction probability and the second prediction probability , And the interpolation weight coefficient to determine the prediction loss.
  • In a further embodiment, the strategy network determines the interpolation weight coefficient in the following manner: applying at least a strategy transformation matrix to the first hidden vector to obtain a strategy vector, where the strategy transformation matrix is a trainable model parameter in the strategy network; and determining the interpolation weight coefficient according to the element value of a predetermined dimension in the strategy vector.
  • Further, in an embodiment, the strategy network obtains the strategy vector in the following manner: determining a training strategy coefficient according to the current training text; and applying the strategy transformation matrix to the first hidden vector and dividing by the training strategy coefficient to obtain the strategy vector.
  • the training strategy coefficient may be determined according to the training sequence number of the current training text in the training sample set, so that the training strategy coefficient is negatively correlated with the training sequence number.
  • the training strategy coefficient may be determined according to the total text length of the current training text, so that the training strategy coefficient is negatively correlated with the total text length.
  • In an embodiment combining the strategy network, the step of determining the prediction loss specifically includes: determining a first loss term according to the comprehensive prediction probability and the (t+1)-th word; determining a second loss term according to the interpolation weight coefficient, the second loss term being negatively correlated with the interpolation weight coefficient; determining a reward term according to the ratio between the probability values that the second prediction probability and the first prediction probability respectively assign to the (t+1)-th word, the reward term being positively correlated with the ratio; and, taking the reward term as the coefficient of the second loss term, summing the first loss term and the second loss term to determine the prediction loss.
  • According to a second aspect, a training device for a text prediction model is provided, the text prediction model including a first prediction network based on time sequence, a buffer, and a second prediction network based on the buffer. The device includes: a first prediction unit, configured to input the t-th word into the first prediction network after the first t-1 words of the current training text have been sequentially input, so that the first prediction network determines, according to the state vector after processing the (t-1)-th word and the word vector of the t-th word, the state vector after processing the t-th word as a first hidden vector, and determines a first prediction probability for the next word according to the first hidden vector; a reading unit, configured to read several existing segment vectors from the buffer, the existing segment vectors being formed based on the text before the t-th word in the current training text, each segment vector corresponding to a text segment with a length of L words; a second prediction unit, configured to enable the second prediction network to determine a second prediction probability for the next word according to the several segment vectors; a synthesis unit, configured to use an interpolation weight coefficient as the weighting coefficient of the second prediction probability and 1 minus the interpolation weight coefficient as the weighting coefficient of the first prediction probability, and to perform interpolation weighted synthesis on the first and second prediction probabilities to obtain a comprehensive prediction probability for the next word; a loss determination unit, configured to determine the prediction loss for the t-th word at least according to the comprehensive prediction probability and the (t+1)-th word in the training text; and a training unit, configured to train the text prediction model according to the prediction losses for the respective words in the current training text.
  • a computer-readable storage medium having a computer program stored thereon, and when the computer program is executed in a computer, the computer is caused to execute the method of the first aspect.
  • According to another aspect, a computing device is provided, including a memory and a processor, where executable code is stored in the memory, and when the processor executes the executable code, the method of the first aspect is implemented.
  • In the text prediction model provided by the embodiments of this specification, on the basis of predicting the next word with the time-sequence-based first prediction network, the segment vectors of previous text segments stored in the buffer are also used as long-range context information, and the second prediction network makes predictions based on this long-range context.
  • the strategy network can be used to generate an interpolation weight coefficient for the current word.
  • FIG. 1 is a schematic diagram of a text prediction model according to an embodiment disclosed in this specification
  • Fig. 2 shows a flowchart of a method for training a text prediction model according to an embodiment
  • Figure 3 shows an example of performing prediction processing for a specific training text
  • Fig. 4 shows a schematic diagram of determining a segment vector of a text segment according to an embodiment
  • Fig. 5 shows a flow of steps for determining a second predicted probability according to an embodiment
  • Fig. 6 shows a schematic block diagram of a training device for a text prediction model according to an embodiment.
  • text prediction is a basic task of natural language processing. Accordingly, it is hoped to train a text prediction model with higher prediction accuracy.
  • a neural network model based on time sequence is used, such as recurrent neural network RNN, long short-term memory neural network LSTM, and gated recurrent unit GRU_RNN .
  • a new text prediction model and its training method are proposed.
  • the model divides the input text into text fragments, and stores the characterization vector of the text fragments in the buffer as a long-range context.
  • the implicit vector corresponding to the current word and the representation vector stored in the buffer are comprehensively considered for prediction.
  • Fig. 1 is a schematic diagram of a text prediction model according to an embodiment disclosed in this specification.
  • the text prediction model includes a first prediction network 11 based on a time sequence, a buffer 12, a second prediction network 13 based on a buffer, and optionally a strategy network 14.
  • the first prediction network 11 includes a time series neural network, such as RNN, LSTM, and GRU_RNN. According to the working mode of the time series neural network, when the training text is input into the text prediction model, the first prediction network 11 reads the words in the training text in turn, and performs iterative processing on each word in turn. When performing iterative processing on each word W t , according to the state vector h t-1 after processing the previous word W t-1 and the word vector of the current word, the state vector h t after the iterative processing of the current word is obtained.
  • the first prediction network 11 may also include a multi-layer perceptron MLP, which obtains the first prediction result p for the next word based on the state vector h t corresponding to the current word.
  • the buffer 12 is used to store the characterization vector of the text segment (span) before the current word, that is, the segment vector.
  • the length L of the text segment can be a predetermined length, for example, 2 words, 3 words, 5 words, and so on.
  • Specifically, for a text segment consisting of the i-th to the j-th words, the segment vector may be obtained as the difference between the state vector corresponding to the j-th word and the state vector corresponding to the (i-1)-th word, both output by the first prediction network 11.
  • the second prediction network 13 performs prediction operations based on the existing segment vectors stored in the buffer 12 to obtain the second prediction result q for the next word.
  • the second prediction result q reflects the prediction result based on the long-range context.
  • An interpolation weight coefficient λ can be used to interpolate and synthesize the two to obtain a comprehensive prediction result.
  • the above interpolation weight coefficients can be preset hyperparameters or trainable parameters.
  • In one embodiment, the interpolation weight coefficient is different for each word and is determined by the strategy network 14. Specifically, the strategy network 14 obtains the state vector h_t corresponding to the current word from the first prediction network 11, and performs operations based on this state vector to obtain the interpolation weight coefficient λ for the current word, which is used for the synthesis of the first prediction result and the second prediction result.
  • the text prediction model shown in Figure 1 has at least the following characteristics.
  • the segment vectors corresponding to the text segment before the current word are also stored in the buffer, and these segment vectors are used as the long-range context to perform prediction based on the long-range context.
  • the final prediction result is a combination of the two parts of the prediction.
  • the strategy network can be used to dynamically adjust the proportion of long-range prediction results, thereby further improving the accuracy of prediction.
  • Fig. 2 shows a flowchart of a method for training a text prediction model according to an embodiment. It can be understood that the text prediction model has the structure shown in Fig. 1, and the method can be executed by any apparatus, device, platform, or device cluster with computing and processing capabilities.
  • the following preparatory process can be performed in advance.
  • First, a training corpus, that is, a training sample set, is obtained, which includes a large amount of training text.
  • word embedding is performed on the training text, and each word in the training text is converted into a word vector, thereby converting the training text into a word vector sequence.
  • word embedding can be realized by one-hot encoding.
  • the dimension of each word vector corresponds to the number V of words in the lexicon.
  • the conversion of word vectors can also be realized by other word embedding methods, for example, the word2vec method, and so on.
  • the training text is Chinese text.
  • the training text can be segmented first, and then word embedding can be performed for each word after the segmentation.
  • In another embodiment, each Chinese character is directly processed as a word; therefore, "word" in the following also covers the case of a single Chinese character.
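  • As a concrete illustration of the preprocessing described above, the following is a minimal Python sketch of one-hot word embedding; the tiny vocabulary and the sample text are hypothetical, used only for illustration.

```python
import numpy as np

# Hypothetical lexicon; in practice the lexicon contains V words.
vocab = {"have": 0, "no": 1, "good": 2, "restaurant": 3}
V = len(vocab)

def one_hot(word: str) -> np.ndarray:
    """One-hot embedding: a V-dimensional vector with a 1 at the word's index."""
    vec = np.zeros(V)
    vec[vocab[word]] = 1.0
    return vec

# A training text is converted into a sequence of word vectors.
word_vectors = [one_hot(w) for w in ["have", "no", "good", "restaurant"]]
```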
  • the training text can be input to the text prediction model for prediction and training.
  • the basic network of the text prediction model is still a time-series neural network. Therefore, for the current training text, each word (more specifically, a word vector) is input into the text prediction model in turn.
  • the text prediction model performs prediction processing on each input word in turn. The following describes the prediction processing process and training process of the text prediction model in combination with any t-th word in the training text.
  • In step 21, the t-th word in the current training text is input into the first prediction network of the text prediction model. It can be understood that, before this, the first t-1 words of the current training text have been sequentially input into the text prediction model.
  • the first prediction network includes a time-series neural network, which jointly determines the state at the next moment according to the state at the previous moment and the current input.
  • Specifically, the first prediction network determines the state vector h_t after processing the t-th word according to the state vector h_{t-1} after processing the (t-1)-th word and the word vector x_t of the t-th word. This process can be expressed by the following formula (1):

    h_t = f(h_{t-1}, x_t)    (1)

  • Here, f is a state transition function, and its specific form depends on the network form of the time-sequence neural network, such as RNN or LSTM.
  • the dimension of the state vector is denoted as d dimension.
  • the state vector h t after processing the current t-th word is called the first hidden vector.
  • In one embodiment, the first prediction network may also include a multilayer perceptron MLP, which is used to determine the first prediction probability p for the next word according to the first hidden vector h_t. More specifically, the first prediction probability p may be the probability distribution over the words of the lexicon for the next word. Assuming that the number of words in the lexicon is V, the first prediction probability p can be expressed as a V-dimensional vector.
  • In order to determine the first prediction probability p, the MLP first applies a linear transformation matrix O to the first hidden vector h_t. The linear transformation matrix O is a trainable parameter matrix, by which the d-dimensional hidden vector h_t is transformed, or projected, into a V-dimensional vector. After normalization, the first prediction probability p for the next word can be expressed as:

    p = softmax(O h_t)    (2)
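  • To make the above steps concrete, the following is a minimal NumPy sketch of formulas (1) and (2); the plain tanh-RNN transition, the matrix shapes, and the random initialization are illustrative assumptions, not the patent's prescribed implementation.

```python
import numpy as np

d, V = 64, 4  # assumed hidden dimension d and lexicon size V (tiny for illustration)

rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.01, size=(d, V))  # input-to-hidden weights (assumed simple RNN)
W_hh = rng.normal(scale=0.01, size=(d, d))  # hidden-to-hidden weights
O = rng.normal(scale=0.01, size=(V, d))     # linear transformation matrix of formula (2)

def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_step(h_prev: np.ndarray, x_t: np.ndarray) -> np.ndarray:
    """Formula (1): h_t = f(h_{t-1}, x_t), here with a tanh transition."""
    return np.tanh(W_hh @ h_prev + W_xh @ x_t)

def first_prediction(h_t: np.ndarray) -> np.ndarray:
    """Formula (2): first prediction probability p = softmax(O h_t)."""
    return softmax(O @ h_t)
```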
  • Figure 3 shows an example of performing prediction processing on a specific training text.
  • the current input is the 92nd word "no" in the training text.
  • the temporal neural network obtains the state corresponding to the 92nd word according to the state vector h 91 after processing the 91st word "have” and the word vector corresponding to the 92nd word "no" The vector h 92 .
  • the MLP obtains the first predicted probability p for the next word, that is, the 93rd word.
  • the prediction result obtained according to the state vector of the time series neural network more reflects the influence of the local context closer to the current word on the understanding of the current word meaning.
  • Therefore, the prediction result of the first prediction network will tend to assign higher prediction probabilities to common collocation words in the local context, such as "trouble" or "idea".
  • In step 22, several existing segment vectors are read from the buffer. These segment vectors are formed based on the text before the t-th word in the current training text, and each segment vector corresponds to a text segment with a length of L consecutive words.
  • several text fragments can be formed according to the length L, and the characterization vectors of these text fragments, that is, the fragment vectors, are stored in the buffer as long-range context information .
  • the length L of the text segment can be preset according to needs. For example, for a longer training text, you can set a longer segment length, such as 8 words, 10 words, etc., for a shorter training text, you can Set a shorter segment length, such as 2 words, 3 words, and so on.
  • Specifically, the first t-1 words can form several text segments m_ij according to the preset length L, where i is the sequence number of the word at the beginning of the text segment, and j is the sequence number of the word at the end of the text segment.
  • the characterization vector of the text segment that is, the segment vector, can be obtained based on the state vector when the first prediction network processes each preceding word.
  • In one embodiment, the segment vector is obtained based on the difference between a first state vector and a second state vector, where the first state vector is the state vector h_j after the first prediction network processes the j-th word, that is, the state vector after the end word (the j-th word) of the text segment m_ij is processed; and the second state vector is the state vector h_{i-1} after the first prediction network processes the (i-1)-th word, that is, the state vector before the start word (the i-th word) of the text segment m_ij is processed.
  • Fig. 4 shows a schematic diagram of determining a segment vector of a text segment according to an embodiment.
  • a text segment is formed with 2 words as the segment length.
  • In this case, for the text segment consisting of the 12th and 13th words, the segment vector can be determined as h_13 - h_11, where h_13 is the state vector after the time-sequence neural network processes the 13th word, and h_11 is the state vector after it processes the 11th word (that is, before the 12th word), or in other words, the state vector at the end of the previous text segment.
  • In another embodiment, the state vectors after the first prediction network processes each word from the i-th word to the j-th word are obtained, yielding L state vectors; these L state vectors are summed or averaged to serve as the segment vector corresponding to the text segment m_ij.
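  • Both ways of forming a segment vector can be sketched as follows, continuing the NumPy sketch above; the list `states`, holding h_0, h_1, and so on, is an assumed bookkeeping structure.

```python
import numpy as np

def segment_vector_diff(states: list, i: int, j: int) -> np.ndarray:
    """Segment vector of m_ij as the state-vector difference h_j - h_{i-1}.

    states[k] is assumed to be the state vector h_k after the first prediction
    network processed the k-th word, with states[0] = h_0 the initial state.
    """
    return states[j] - states[i - 1]

def segment_vector_mean(states: list, i: int, j: int) -> np.ndarray:
    """Alternative embodiment: average of the L state vectors h_i .. h_j."""
    return np.mean(states[i:j + 1], axis=0)
```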
  • the fragment vector can also be obtained in other ways.
  • In the above embodiments, each segment vector is calculated from the state vectors produced when the time-sequence neural network processes the preceding words. In this way, the processing results of the first prediction network can be reused, and the calculation of the segment vectors is simplified.
  • segment vectors can be obtained in the process of sequentially iteratively processing each word of the current training text by the first prediction network.
  • For example, a counter with a cycle of L can be set to count the words processed by the first prediction network. As words are processed one by one, the counter is incremented; each time L words are accumulated, a new text segment is formed, the counter is cleared and restarted, and at this time the segment vector of the newly added text segment is calculated and stored in the buffer.
  • For the current t-th word, it can thus be judged whether the t-th word is the last word of the current text segment, specifically, whether the count of the counter reaches L. If it is the last word of the current text segment, the current text segment is regarded as a newly added text segment, and its segment vector is calculated. Specifically, in an embodiment, the newly added segment vector may be determined according to the difference between the aforementioned first hidden vector h_t and a second hidden vector h_{t-L}, where the second hidden vector h_{t-L} is the state vector after the first prediction network processes the (t-L)-th word. Then, the newly added segment vector is added to the buffer.
  • the buffer used to store each segment vector of the previous text has a limited capacity size B. Accordingly, the buffer can only store a limited number of N segment vectors. In this case, the buffer can be made to store the segment vectors of the N text segments closest to the currently processed word. Specifically, in one embodiment, when adding a newly-added segment vector to the buffer, it is first determined whether the number of several segment vectors already in the buffer reaches the above-mentioned threshold number N, and if it does not reach the threshold number N, the newly-added segment vector is directly added. The fragment vector is added to the buffer; if the number of existing fragment vectors has reached the threshold number N, the earliest stored fragment vector is deleted, and the newly added fragment vector is stored in the buffer.
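  • The counter-based segmentation and the bounded buffer might be sketched as follows, reusing `rnn_step` and `word_vectors` from the sketches above; the segment length L, the capacity N, and the use of a deque are illustrative assumptions.

```python
from collections import deque
import numpy as np

L = 3   # assumed segment length
N = 25  # assumed threshold number of segment vectors the buffer can hold

# A deque with maxlen automatically drops the oldest entry when full,
# matching the described deletion of the earliest stored segment vector.
buffer = deque(maxlen=N)

states = [np.zeros(d)]  # h_0: assumed zero initial state
counter = 0
for x_t in word_vectors:
    states.append(rnn_step(states[-1], x_t))
    counter += 1
    if counter == L:  # the current word closes a text segment
        t = len(states) - 1
        buffer.append(states[t] - states[t - L])  # new segment vector h_t - h_{t-L}
        counter = 0   # clear the counter and count again
```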
  • the current input is the 92nd word "no" in the training text.
  • multiple segment vectors based on the text before the 92nd word have been stored in the buffer, where each segment vector corresponds to a text segment formed by three consecutive words.
  • the text fragment closest to the current word is the text fragment m 89-91 from the 89th word to the 91st word. Due to the limited capacity of the buffer, the earliest segment vector stored therein corresponds to the text segment m 16-18 , that is, the text segment formed from the 16th word to the 18th word.
  • segment vectors stored in the buffer can represent text segments that are far away from the current word. Therefore, these segment vectors can be used as long-range context information to help understand the semantics of the current word, and then help predict the next word.
  • Next, in step 23, the second prediction network is used to determine the second prediction probability q for the next word according to the several segment vectors stored in the buffer.
  • the second prediction network can use the attention mechanism to integrate several existing segment vectors into a context vector, and then determine the second prediction probability q based on the context vector.
  • Fig. 5 shows a flow of steps for determining the second predicted probability according to an embodiment.
  • In step 51, several attention coefficients corresponding to the several segment vectors are determined. Specifically, for any i-th segment vector s_i among the several segment vectors, the corresponding attention coefficient α_{t,i} can be determined based on a similarity measurement.
  • In one embodiment, the similarity β_{t,i} between the i-th segment vector s_i and the first hidden vector h_t can be determined, where the similarity can be a cosine similarity, a similarity determined based on Euclidean distance, and so on. Then, according to the similarity β_{t,i}, the i-th attention coefficient α_{t,i} is determined.
  • the softmax function can be used to normalize the similarity corresponding to each segment vector to obtain the corresponding attention coefficient.
  • In one example, the i-th attention coefficient α_{t,i} can be determined as:

    α_{t,i} = exp(β_{t,i}) / Σ_k exp(β_{t,k})    (3)
  • the corresponding similarity is determined in the following manner.
  • Specifically, a first transformation matrix W_s can be used to transform the i-th segment vector s_i into a first intermediate vector W_s s_i, and a second transformation matrix W_h can be used to transform the first hidden vector h_t into a second intermediate vector W_h h_t. The similarity β_{t,i} is then determined between the sum vector (W_s s_i + W_h h_t) and a third vector v. The first transformation matrix W_s, the second transformation matrix W_h, and the third vector v are all trainable network parameters of the second prediction network.
  • the i-th attention coefficient ⁇ t,i can be determined similarly using formula (3).
  • In step 52, using the attention coefficients corresponding to the respective segment vectors as weighting factors, the aforementioned several segment vectors are weighted and combined to obtain a context vector c_t.
  • Specifically, the segment vectors s_i stored in the buffer can be arranged in order into a vector sequence C_t, and the attention coefficients α_{t,i} corresponding to the segment vectors can be arranged into an attention vector α_t.
  • Then, the context vector c_t can be expressed as:

    c_t = C_t α_t = Σ_i α_{t,i} s_i    (5)
  • In step 53, the second prediction probability q is obtained according to the context vector c_t and a linear transformation matrix.
  • the second predicted probability q may include the probability distribution of each word in the dictionary as the next word, so q is also a V-dimensional vector.
  • The linear transformation matrix used in step 53 serves to transform, or project, the d-dimensional context vector c_t into a V-dimensional vector.
  • Specifically, the second prediction probability q can be expressed as:

    q = softmax(O c_t)    (6)

  • Here, O is the linear transformation matrix applied to the context vector.
  • In one embodiment, the linear transformation matrix for the context vector in formula (6) is the same matrix as the linear transformation matrix for the first hidden vector in formula (2).
  • In another embodiment, the second prediction network maintains its own linear transformation matrix for the context vector in formula (6), independent of the linear transformation matrix used by the first prediction network in formula (2).
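  • Steps 51 to 53 can be sketched as follows, continuing the NumPy sketch above; the dot product as the concrete similarity between the sum vector and the third vector, and the parameter shapes, are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
W_s = rng.normal(scale=0.01, size=(d, d))  # first transformation matrix
W_h = rng.normal(scale=0.01, size=(d, d))  # second transformation matrix
v = rng.normal(scale=0.01, size=d)         # third vector

def second_prediction(h_t, segment_vectors, O2):
    """Attention over segment vectors, then q = softmax(O2 c_t) per formula (6).

    O2 may be the same matrix O as in formula (2) or an independently
    maintained matrix, corresponding to the two embodiments above.
    """
    # Step 51: similarity beta_{t,i} between (W_s s_i + W_h h_t) and v,
    # taken here as a dot product, then softmax-normalized per formula (3).
    betas = np.array([v @ (W_s @ s_i + W_h @ h_t) for s_i in segment_vectors])
    alphas = softmax(betas)
    # Step 52: context vector c_t as the attention-weighted combination (formula (5)).
    c_t = sum(a * s for a, s in zip(alphas, segment_vectors))
    # Step 53: second prediction probability (formula (6)).
    return softmax(O2 @ c_t)
```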
  • the second prediction network obtains the second prediction probability q for the next word according to the segment vector stored in the buffer.
  • the segment vector stored in the buffer area reflects the long-range context information. Therefore, the second prediction probability q obtained based on the segment vector can reflect the prediction of the next word based on the long-range context.
  • As shown in the example of Figure 3, the buffer stores segment vectors of previous text segments, and these segments include text relatively far from the current word, such as m_16-18. Based on these segment vectors, the attention mechanism is used to obtain the second prediction probability q for the next word, so that the prediction gives more consideration to the long-range context. For example, since the text segment m_16-18 contains the long-range context "good restaurant", the second prediction probability q tends to assign a higher prediction probability to words related to the long-range context, such as "appetite".
  • Next, in step 24, the interpolation weight coefficient λ is used as the weighting coefficient of the second prediction probability q, and 1 minus λ is used as the weighting coefficient of the first prediction probability p; interpolation weighted synthesis is performed on the first and second prediction probabilities to obtain the comprehensive prediction probability Pr for the next word, namely:

    Pr = λ q + (1 - λ) p    (7)
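  • In the running sketch, the synthesis of formula (7) is element-wise over the two V-dimensional probability vectors; `lam` below stands for the interpolation weight coefficient λ.

```python
# Comprehensive prediction probability, formula (7); p and q from the sketches above.
Pr = lam * q + (1 - lam) * p
```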
  • In step 25, the prediction loss for the t-th word is determined at least according to the above comprehensive prediction probability Pr and the (t+1)-th word in the current training text.
  • the above-mentioned interpolation weight coefficient is a preset hyperparameter or a trainable model parameter.
  • The true next word in the training text, that is, the (t+1)-th word, can be taken as a label, and the prediction loss for the current word can be determined by comparing the comprehensive prediction probability Pr against this label.
  • Specifically, the cross-entropy loss function can be used to determine the prediction loss Loss:

    Loss = -log Pr(x_{t+1})    (8)

  • Here, Pr(x_{t+1}) denotes the probability value that the comprehensive prediction probability assigns to the true (t+1)-th word x_{t+1}.
  • In step 26, the text prediction model is trained based on the total prediction loss for the words in the current training text. Specifically, the first prediction network and the second prediction network are updated in the direction in which the total prediction loss decreases.
  • As mentioned above, in one embodiment, the text prediction model further includes a strategy network, which is used to determine the corresponding interpolation weight coefficient λ for the current word.
  • The following describes how the strategy network determines the interpolation weight coefficient, and the corresponding training method.
  • the strategy network may obtain the first latent vector h t obtained by the first prediction network processing the t-th word, and according to the first latent vector h t , calculate the interpolation weight coefficient ⁇ t .
  • Specifically, the strategy network may apply a strategy transformation matrix W_g to the above first hidden vector h_t to obtain a strategy vector W_g h_t, where the strategy transformation matrix W_g is a trainable model parameter maintained in the strategy network. W_g may be an M*d-dimensional matrix, so that the d-dimensional first hidden vector is transformed into an M-dimensional strategy vector, where M is a preset number of dimensions.
  • Then, the interpolation weight coefficient λ_t can be determined according to the element value of a predetermined dimension in the M-dimensional strategy vector. For example, the element value of a certain dimension k after normalization of the strategy vector can be used as the interpolation weight coefficient λ_t, namely:

    λ_t = softmax(W_g h_t)[k]    (9)
  • a training strategy coefficient T is also set in the strategy network.
  • The training strategy coefficient T can be a hyperparameter that is adjusted during the training process, and more specifically, determined according to each training text, so as to better adjust the output of the interpolation weight coefficient.
  • As described above, the interpolation weight coefficient is the weight coefficient applied to the second prediction probability; therefore, the larger the interpolation weight coefficient, the more the use of the long-range context is encouraged.
  • a process similar to "annealing” may be used to set and adjust the aforementioned training strategy coefficients. Specifically, a larger training strategy coefficient T, or a higher temperature T, can be set at the beginning of training; then, as the training progresses, the training strategy coefficient T, or the temperature T, is gradually reduced. This means that as training progresses, text prediction models are encouraged to explore the use of long-range context.
  • the training strategy coefficient T can be determined according to the training sequence number of the current training text in the training sample set, so that the training strategy coefficient T is negatively correlated with the training sequence number. In other words, the smaller the training sequence number is, the closer it is to the beginning of training. At this time, the larger the training strategy coefficient T, the higher the temperature T; as the training sequence number increases, the temperature decreases, and the training strategy coefficient decreases.
  • the training strategy coefficient T for the current training text can also be determined according to the total text length of the current training text. Specifically, the training strategy coefficient T can be negatively correlated with the total text length. Therefore, for a longer training text, a smaller coefficient T can be set to obtain a larger interpolation weight coefficient, thereby more encouraging the use of long-range context.
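  • A minimal sketch of the strategy network with the annealing-style training strategy coefficient follows, continuing the sketch above; the dimension M, the selected dimension k, and the concrete decay schedule are illustrative assumptions.

```python
import numpy as np

M = 2  # assumed number of dimensions of the strategy vector
rng = np.random.default_rng(2)
W_g = rng.normal(scale=0.01, size=(M, d))  # strategy transformation matrix

def interpolation_weight(h_t: np.ndarray, T: float, k: int = 0) -> float:
    """Strategy vector W_g h_t divided by the training strategy coefficient T,
    then the element of an assumed dimension k after softmax, per formula (9)."""
    return float(softmax(W_g @ h_t / T)[k])

def training_strategy_coefficient(seq_no: int, T0: float = 2.0,
                                  decay: float = 0.999, T_min: float = 0.5) -> float:
    """Assumed annealing schedule: T is negatively correlated with the
    training sequence number, starting high and decreasing over training."""
    return max(T0 * decay ** seq_no, T_min)
```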
  • the strategy network determines the corresponding interpolation weight coefficient ⁇ t for the current t-th word in the current training text.
  • the interpolation weight coefficient is applied to the above formula (7) to obtain the comprehensive predicted probability Pr.
  • As shown in Figure 3, the first prediction network obtains the first prediction probability p for the 93rd word according to the state vector h_92 of the current 92nd word, that is, the first hidden vector; the second prediction network obtains the second prediction probability q according to the segment vectors stored in the buffer.
  • the strategy network obtains the interpolation weight coefficient according to the above-mentioned first hidden vector h 92 and the training strategy coefficient T (shown as the "annealing" temperature in the figure). Therefore, the interpolation weight coefficient can be used to perform interpolation synthesis on the first prediction probability p and the second prediction probability q to obtain the comprehensive prediction probability Pr.
  • the aforementioned method of determining the prediction loss needs to be modified.
  • In determining the prediction loss Loss, not only the comprehensive prediction probability obtained from the first and second prediction networks is considered, but also the output of the strategy network. Therefore, according to an embodiment, in the foregoing step 25, the prediction loss Loss is determined according to the comprehensive prediction probability and the (t+1)-th word, and further according to the first prediction probability p, the second prediction probability q, and the interpolation weight coefficient.
  • Specifically, in the case of incorporating the strategy network, the prediction loss can be determined in the following manner.
  • the first loss term L1 can be determined according to the comprehensive prediction probability Pr and the t+1th word.
  • the first loss term L1 can take the form of cross-entropy loss, as shown in formula (8). In other words, the loss shown in formula (8) can be used as the first loss item L1 here.
  • the second loss term L2 is determined so that the second loss term is negatively related to the interpolation weight coefficient.
  • For example, the second loss term can be set as:

    L2 = -log λ_t
  • It can be understood that the second loss term L2 can also be set in other forms negatively correlated with λ_t, for example, 1/λ_t.
  • In addition, a reward term r_t is determined according to the ratio between the probability values that the second prediction probability q and the first prediction probability p respectively assign to the (t+1)-th word, with the reward term r_t positively correlated with this ratio; then, taking the reward term r_t as the coefficient of the second loss term L2, the first loss term and the second loss term are summed to determine the prediction loss Loss.
  • Thus, in a specific example, the prediction loss Loss can be expressed as:

    Loss = -log Pr(x_{t+1}) - γ · r_t · log λ_t    (12)

  • Here, γ is an optional adjustment coefficient, γ > 0.
  • the first term in the loss function expression corresponds to the first loss term, which aims to increase the probability of correctly predicting the next word.
  • the second term is the product of the reward term and the second loss term, which aims to conditionally encourage the exploration and use of the long-range context.
  • It can be noted that the term r_t · log λ_t is very similar in form to the policy gradient in reinforcement learning.
  • encouraging exploration and use of long-range context can be embodied by the second loss term L2 itself, because a smaller value of the second loss term corresponds to a larger ⁇ t .
  • the encouragement of the long-range context should be carried out conditionally, and the condition is reflected by the reward item r t.
  • the adjustment of the reward term means that only when the prediction probability of the second prediction network for the correct next word is significantly higher than the prediction probability of the first prediction network, a larger interpolation weight coefficient ⁇ t is encouraged.
  • Specifically, the second prediction network outputs the second prediction probability q, in which the probability value for the true (t+1)-th word (that is, the correct next word) is q(x_{t+1}|x_{≤t}); correspondingly, the probability value that the first prediction network assigns to the (t+1)-th word is p(x_{t+1}|x_{≤t}). The ratio of the two can be defined as R:

    R = q(x_{t+1}|x_{≤t}) / ( p(x_{t+1}|x_{≤t}) + ε )    (13)
  • the above ratio R may reflect the relative prediction accuracy of the second prediction network and the first prediction network for the correct next word.
  • The reward term r_t is set to be positively correlated with the ratio R, that is, the larger the ratio R, the larger the reward term r_t.
  • During training, the correct next word, that is, the (t+1)-th word, is known, so the size of the reward term can be clearly and uniquely determined; this reward term can therefore also be called an intrinsic reward (Intrinsic Reward).
  • the reward item r t can be determined in a variety of ways based on the above ratio R.
  • In a specific example, the reward term r_t is determined in the following way:

    r_t = min( f(R^φ - b), a )    (14)
  • Here, ε is a minimum value, which is set in order to avoid mathematical problems caused when p(x_{t+1}|x_{≤t}) approaches zero.
  • The above function f(z) can adopt the ReLU function:

    f(z) = κ · max(0, z)    (15)

  • The exponent φ in formula (14) is used to amplify the effect of R in exponential form, and the factor κ in formula (15) is used for linear amplification.
  • the parameter a in formula (14) is the cutoff threshold, and the parameter b is the reference threshold.
  • In the case where the prediction loss is determined according to formula (12), reducing the prediction loss requires, on the basis of increasing the prediction probability of the correct word according to the first loss term, that the second term also be as small as possible. When the prediction probability of the second prediction network for the correct next word is significantly higher than that of the first prediction network, that is, when the above ratio R is larger, a larger reward term r_t is obtained, which pushes the second loss term to be smaller, that is, pushes the strategy network to output a larger λ_t. In this way, a larger interpolation weight coefficient λ_t, and thus the use of the long-range context, is encouraged conditionally.
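  • The full loss of formula (12), including the intrinsic reward, might be computed as in the following sketch; the hyperparameter values, and the exact composition of formulas (14) and (15) (reconstructed here as a thresholded, linearly and exponentially amplified ReLU), are assumptions for illustration.

```python
import numpy as np

def prediction_loss(Pr, p, q, lam, target_id,
                    gamma=0.5, phi=2.0, kappa=1.0, a=5.0, b=1.0, eps=1e-8):
    """Prediction loss per formula (12) for one word position.

    Pr, p, q: comprehensive, first, and second prediction probability vectors;
    lam: interpolation weight coefficient lambda_t;
    target_id: lexicon index of the true (t+1)-th word.
    """
    L1 = -np.log(Pr[target_id] + eps)             # first loss term, formula (8)
    L2 = -np.log(lam + eps)                       # second loss term, -log lambda_t
    R = q[target_id] / (p[target_id] + eps)       # ratio of formula (13)
    r_t = min(kappa * max(R ** phi - b, 0.0), a)  # reward term, one reading of (14)/(15)
    return L1 + gamma * r_t * L2                  # formula (12)
```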
  • Then, in step 26, the text prediction model is trained according to the total prediction loss for the words, that is, the model parameters in the first prediction network, the second prediction network, and the strategy network are adjusted in the direction in which the total prediction loss decreases, so as to achieve the above training goals.
  • In summary, the text prediction model of the embodiments of this specification, on the basis of using the time-sequence-based first prediction network to predict the next word, also uses the segment vectors of previous text segments stored in the buffer as long-range context information, and uses the second prediction network to make predictions based on the long-range context.
  • the strategy network can be used to generate an interpolation weight coefficient for the current word.
  • According to an embodiment of another aspect, a training device for a text prediction model is provided, the text prediction model including a first prediction network based on time sequence, a buffer, and a second prediction network based on the buffer; the training device can be deployed in any device, platform, or device cluster with computing and processing capabilities.
  • Fig. 6 shows a schematic block diagram of a training device for a text prediction model according to an embodiment. As shown in Fig. 6, the training device 600 includes the following units.
  • The first prediction unit 61 is configured to input the t-th word into the first prediction network after the first t-1 words of the current training text have been sequentially input, so that the first prediction network determines, according to the state vector after processing the (t-1)-th word and the word vector of the t-th word, the state vector after processing the t-th word as the first hidden vector, and determines the first prediction probability for the next word according to the first hidden vector.
  • The reading unit 62 is configured to read several existing segment vectors from the buffer, the existing segment vectors being formed based on the text before the t-th word in the current training text, each segment vector corresponding to a text segment with a length of L words.
  • The second prediction unit 63 is configured to enable the second prediction network to determine the second prediction probability for the next word according to the several segment vectors.
  • The synthesis unit 64 is configured to use the interpolation weight coefficient as the weighting coefficient of the second prediction probability and the difference of 1 minus the interpolation weight coefficient as the weighting coefficient of the first prediction probability, and to perform interpolation weighted synthesis on the first and second prediction probabilities to obtain the comprehensive prediction probability for the next word.
  • The loss determination unit 65 is configured to determine the prediction loss for the t-th word at least according to the comprehensive prediction probability and the (t+1)-th word in the training text.
  • The training unit is configured to train the text prediction model according to the prediction loss for each word in the current training text.
  • the first prediction network includes a recurrent neural network RNN or a long short-term memory network LSTM.
  • the several segment vectors stored in the buffer include a first segment vector corresponding to any first text segment, and the first text segment includes the i-th word to the j-th word of the current training text, Wherein i and j are both less than t, the first segment vector is obtained based on the difference between the first state vector and the second state vector, and the first state vector is the first prediction network processing the jth The state vector after the word, and the second state vector is the state vector after the first prediction network processes the (i-1)th word.
  • In one embodiment, the device 600 further includes a storage unit (not shown), configured to: if the t-th word is the last word of the current text segment, determine a newly added segment vector according to the difference between the first hidden vector and a second hidden vector, where the second hidden vector is the state vector after the first prediction network processes the (t-L)-th word; and add the newly added segment vector to the buffer.
  • the buffer has a limited storage capacity.
  • In this case, the storage unit is further configured to: determine whether the number of segment vectors already in the buffer reaches a predetermined threshold number; and if the predetermined threshold number is reached, delete the earliest stored segment vector and then store the newly added segment vector in the buffer.
  • the second prediction network obtains the second prediction probability by: determining a number of attention coefficients corresponding to the several segment vectors; using the several attention coefficients as weighting factors, A number of segment vectors are weighted and combined to obtain a context vector; and the second prediction probability is obtained according to the context vector and the linear transformation matrix.
  • the first prediction probability is obtained according to the first hidden vector and the same linear transformation matrix as the second prediction network.
  • the second prediction network determines the attention coefficient in the following manner: according to the similarity between any i-th segment vector in the plurality of segment vectors and the first latent vector, determine the i-th segment vector Attention coefficient.
  • the second prediction network determines the attention coefficient by using a first transformation matrix to transform any i-th segment vector in the plurality of segment vectors into a first intermediate vector; A second transformation matrix, transforming the first hidden vector into a second intermediate vector; determining the similarity between the sum vector of the first intermediate vector and the second intermediate vector and the third vector; determining the i-th vector according to the similarity Attention coefficient; wherein, the first transformation matrix, the second transformation matrix and the third vector are all trainable network parameters in the second prediction network.
  • the text prediction model further includes a strategy network for outputting the interpolation weight coefficient according to the first latent vector; in this case, the loss determination unit 65 is further configured to The comprehensive prediction probability, the t+1th word, the first prediction probability and the second prediction probability, and the interpolation weight coefficient determine the prediction loss.
  • In an embodiment, the strategy network determines the interpolation weight coefficient in the following manner: applying at least a strategy transformation matrix to the first hidden vector to obtain a strategy vector, where the strategy transformation matrix is a trainable model parameter in the strategy network; and determining the interpolation weight coefficient according to the element value of a predetermined dimension in the strategy vector.
  • Further, the strategy network obtains the strategy vector in the following manner: determining the training strategy coefficient according to the current training text; and applying the strategy transformation matrix to the first hidden vector and dividing by the training strategy coefficient to obtain the strategy vector.
  • In an embodiment, the strategy network determining the training strategy coefficient specifically includes: determining the training strategy coefficient according to the training sequence number of the current training text in the training sample set, so that the training strategy coefficient is negatively correlated with the training sequence number.
  • determining the training strategy coefficient by the strategy network specifically includes: determining the training strategy coefficient according to the total text length of the current training text, so that the training strategy coefficient is negatively correlated with the total text length.
  • In an embodiment combining the strategy network, the loss determination unit 65 is specifically configured to: determine a first loss term according to the comprehensive prediction probability and the (t+1)-th word; determine a second loss term according to the interpolation weight coefficient, where the second loss term is negatively correlated with the interpolation weight coefficient; determine the reward term according to the ratio between the probability values that the second prediction probability and the first prediction probability respectively assign to the (t+1)-th word, where the reward term is positively correlated with the ratio; and, taking the reward term as the coefficient of the second loss term, sum the first loss term and the second loss term to determine the prediction loss.
  • a computer-readable storage medium having a computer program stored thereon, and when the computer program is executed in a computer, the computer is caused to execute the method described in conjunction with FIG. 2.
  • a computing device including a memory and a processor, the memory is stored with executable code, and when the processor executes the executable code, it implements the method described in conjunction with FIG. 2 method.
  • the functions described in the present invention can be implemented by hardware, software, firmware, or any combination thereof.
  • these functions can be stored in a computer-readable medium or transmitted as one or more instructions or codes on the computer-readable medium.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Disclosed are a text prediction model training method executed by a computer, and a text prediction model training apparatus. A text prediction model comprises a first prediction network (11) based on a time sequence, a buffer (12), and a second prediction network (13) based on the buffer (12). The training method comprises: inputting a t-th word in training text into the first prediction network (11), such that the first prediction network determines a first prediction probability for the next word according to a state vector obtained by means of time sequence processing; in addition, reading, from the buffer (12), several segment vectors formed on the basis of the previous text, and the second prediction network (13) obtaining a second prediction probability for the next word according to these segment vectors; then, by taking an interpolation weight coefficient λ as a weighting coefficient of the second prediction probability, and taking one minus λ as a weighting coefficient of the first prediction probability, weighting and synthesizing the second prediction probability and the first prediction probability to obtain a comprehensive prediction probability; and, at least according to the comprehensive prediction probability and the (t+1)-th word, determining a prediction loss for the t-th word, and thereby training the text prediction model.

Description

Text prediction model training method and apparatus

Technical Field

One or more embodiments of this specification relate to the field of machine learning, and in particular to a training method and apparatus for a text prediction model.

Background

With the rapid development of artificial intelligence and machine learning, various natural language processing tasks have been widely applied in a variety of business scenarios. For example, the text classification task can be used in an intelligent question-answering customer service system to classify the question raised by a user as input text, for user intention recognition, automatic question answering, or manual customer service dispatch. Text classification can also be used in many other application scenarios, such as document classification, public opinion analysis, and spam identification. As another example, machine translation tasks between different languages are widely used in various automatic translation systems.

Generally, the language model is the basic model for performing the above various specific natural language processing tasks. Language models need to be trained based on a large corpus. Text prediction, that is, predicting subsequent text based on existing text, is a basic task for training language models.

Therefore, an improved solution that can train for text prediction tasks more effectively is desired.
发明内容Summary of the invention
本说明书一个或多个实施例描述了一种文本预测模型及其训练方法,其中综合利用局部上下文和长程上下文进行预测,全面提高文本预测模型对文本的理解能力和针对后续文本的预测准确性。One or more embodiments of this specification describe a text prediction model and its training method, in which local context and long-range context are comprehensively used for prediction, thereby comprehensively improving the text prediction model's ability to understand text and predicting accuracy for subsequent text.
根据第一方面,提供了一种文本预测模型的训练方法,所述文本预测模型包括基于时序的第一预测网络,缓存器,基于所述缓存器的第二预测网络,所述方法包括:在依次输入当前训练文本中的前t-1个词之后,将第t个词输入所述第一预测网络,使得所述第一预测网络根据处理第t-1个词后的状态向量,以及所述第t个词的词向量,确定处理第t个词后的状态向量作为第一隐向量;并根据该第一隐向量,确定对于下一个词的第一预测概率;从所述缓存器中读取已有的若干片段向量,所述已有的若干片段向量基于所述当前训练文本中所述第t个词之前的文本形成,且每个片段向量对应于长度为L个词的文本片段;所述第二预测网络根据所述若干片段向量,确定对于下一个词的第 二预测概率;以内插权重系数作为所述第二预测概率的加权系数,以1减去所述内插权重系数的差值作为所述第一预测概率的加权系数,对所述第一预测概率和第二预测概率进行内插加权综合,得到对于下一个词的综合预测概率;至少根据所述综合预测概率和所述训练文本中第t+1个词,确定针对第t个词的预测损失;根据所述当前训练文本中针对各个词的预测损失,训练所述文本预测模型。According to a first aspect, there is provided a method for training a text prediction model, the text prediction model including a first prediction network based on time series, a buffer, and a second prediction network based on the buffer, and the method includes: After sequentially inputting the first t-1 words in the current training text, the t-th word is input into the first prediction network, so that the first prediction network processes the state vector after the t-1th word, and State the word vector of the t-th word, determine the state vector after processing the t-th word as the first latent vector; and determine the first prediction probability for the next word according to the first latent vector; from the buffer Read several existing segment vectors, which are formed based on the text before the t-th word in the current training text, and each segment vector corresponds to a text segment with a length of L words The second prediction network determines the second prediction probability for the next word according to the several segment vectors; uses the interpolation weight coefficient as the weight coefficient of the second prediction probability, and subtracts the interpolation weight coefficient from 1. As the weighting coefficient of the first prediction probability, the first prediction probability and the second prediction probability are interpolated and weighted and integrated to obtain the comprehensive prediction probability for the next word; at least according to the comprehensive prediction probability and For the t+1th word in the training text, determine the prediction loss for the tth word; and train the text prediction model according to the prediction loss for each word in the current training text.
In one embodiment, the first prediction network includes a recurrent neural network RNN or a long short-term memory network LSTM.

According to an embodiment, the segment vectors stored in the buffer include a first segment vector corresponding to an arbitrary first text segment, where the first text segment consists of the i-th to j-th words of the current training text, with both i and j smaller than t. The first segment vector is obtained from the difference between a first state vector and a second state vector, where the first state vector is the state vector after the first prediction network processes the j-th word, and the second state vector is the state vector after the first prediction network processes the (i-1)-th word.

According to an implementation, the method further includes: if the t-th word is the last word of the current text segment, determining a new segment vector from the difference between the first hidden vector and a second hidden vector, where the second hidden vector is the state vector after the first prediction network processes the (t-L)-th word; and adding the new segment vector to the buffer.

In one embodiment, the buffer has a limited storage capacity. In this case, before a new segment vector is added to the buffer, it is first determined whether the number of segment vectors already in the buffer has reached a predetermined threshold number; if so, the earliest stored segment vector is deleted, and the new segment vector is then stored in the buffer.

According to an embodiment, the second prediction network determines the second prediction probability for the next word as follows: determining attention coefficients corresponding to the several segment vectors; weighting and combining the segment vectors, with the attention coefficients as weight factors, to obtain a context vector; and obtaining the second prediction probability from the context vector and a linear transformation matrix.

According to an embodiment, the first prediction network obtains the first prediction probability from the first hidden vector and the linear transformation matrix.

In a more specific embodiment, the second prediction network determines the attention coefficients as follows: determining the i-th attention coefficient from the similarity between an arbitrary i-th segment vector among the several segment vectors and the first hidden vector.

In another more specific embodiment, the second prediction network determines the attention coefficients as follows: transforming an arbitrary i-th segment vector among the several segment vectors into a first intermediate vector using a first transformation matrix; transforming the first hidden vector into a second intermediate vector using a second transformation matrix; determining the similarity between the sum of the first and second intermediate vectors and a third vector; and determining the i-th attention coefficient from that similarity. The first transformation matrix, the second transformation matrix, and the third vector are all trainable network parameters of the second prediction network.
According to an implementation, the text prediction model further includes a policy network. Before the interpolation of the first and second prediction probabilities, the method further includes: outputting, by the policy network, the interpolation weight coefficient according to the first hidden vector. The step of determining the prediction loss then specifically includes: determining the prediction loss from the combined prediction probability, the (t+1)-th word, the first and second prediction probabilities, and the interpolation weight coefficient.

In one embodiment, the policy network determines the interpolation weight coefficient as follows: applying at least a policy transformation matrix to the first hidden vector to obtain a policy vector, where the policy transformation matrix is a trainable model parameter of the policy network; and determining the interpolation weight coefficient from the element value of a predetermined dimension of the policy vector.

In a further embodiment, the policy network obtains the policy vector as follows: determining a training policy coefficient according to the current training text; and applying the policy transformation matrix to the first hidden vector and dividing by the training policy coefficient to obtain the policy vector.

Further, in one example, the training policy coefficient may be determined according to the training order number of the current training text in the training sample set, such that the training policy coefficient is negatively correlated with the training order number.

In another example, the training policy coefficient may be determined according to the total text length of the current training text, such that the training policy coefficient is negatively correlated with the total text length.

In one embodiment, determining the prediction loss specifically includes: determining a first loss term from the combined prediction probability and the (t+1)-th word; determining a second loss term from the interpolation weight coefficient, where the second loss term is negatively correlated with the interpolation weight coefficient; determining a reward term from the ratio of the probability values that the second and first prediction probabilities respectively assign to the (t+1)-th word, the reward term being positively correlated with the ratio; and summing the first loss term and the second loss term, with the reward term as the coefficient of the second loss term, to determine the prediction loss.
According to a second aspect, a training apparatus for a text prediction model is provided. The text prediction model includes a time-series-based first prediction network, a buffer, and a second prediction network based on the buffer. The apparatus includes: a first prediction unit configured to, after the first t-1 words of the current training text have been sequentially input, input the t-th word into the first prediction network, so that the first prediction network determines, from the state vector after processing the (t-1)-th word and the word vector of the t-th word, the state vector after processing the t-th word as a first hidden vector, and determines a first prediction probability for the next word from the first hidden vector; a reading unit configured to read several existing segment vectors from the buffer, the segment vectors being formed from the text preceding the t-th word in the current training text, each segment vector corresponding to a text segment of length L words; a second prediction unit configured to cause the second prediction network to determine a second prediction probability for the next word from the several segment vectors; a combining unit configured to take an interpolation weight coefficient as the weight of the second prediction probability and one minus the interpolation weight coefficient as the weight of the first prediction probability, and to interpolate the first and second prediction probabilities into a combined prediction probability for the next word; a loss determining unit configured to determine a prediction loss for the t-th word at least from the combined prediction probability and the (t+1)-th word of the training text; and a training unit configured to train the text prediction model according to the prediction losses for the words of the current training text.

According to a third aspect, a computer-readable storage medium is provided, on which a computer program is stored. When the computer program is executed in a computer, the computer is caused to perform the method of the first aspect.

According to a fourth aspect, a computing device is provided, including a memory and a processor. Executable code is stored in the memory, and when the processor executes the executable code, the method of the first aspect is implemented.

According to the text prediction model provided in the embodiments of this specification, in addition to predicting the next word with the time-series-based first prediction network, a buffer stores the segment vectors of preceding text segments as long-range context information, and the second prediction network makes predictions based on this long-range context. When the prediction results of the first and second prediction networks are interpolated, a policy network can be used to generate an interpolation weight coefficient for the current word. When the text prediction model is trained, a reward term and the interpolation weight coefficient are introduced into the loss function to conditionally encourage the exploration and use of long-range context, further improving the prediction accuracy of the model.
Description of the drawings

To explain the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are merely some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a schematic diagram of a text prediction model according to an embodiment disclosed in this specification;

Fig. 2 is a flowchart of a method for training a text prediction model according to an embodiment;

Fig. 3 shows an example of prediction processing on a specific training text;

Fig. 4 is a schematic diagram of determining the segment vector of a text segment according to an embodiment;

Fig. 5 shows the steps of determining a second prediction probability according to an embodiment;

Fig. 6 is a schematic block diagram of a training apparatus for a text prediction model according to an embodiment.
Detailed description

The solutions provided in this specification are described below with reference to the accompanying drawings.

As mentioned above, text prediction is a basic task of natural language processing, and it is accordingly desirable to train a text prediction model with higher prediction accuracy.

Considering the order of words in text and the importance of context to semantic understanding, one solution uses a time-series neural network, such as a recurrent neural network RNN, a long short-term memory network LSTM, or a gated recurrent unit network GRU_RNN, as the basic network of the text prediction model. However, text prediction based only on a time-series neural network, and LSTM-based prediction in particular, often captures only the local context very close to the current word. The model thus remains trapped in a local understanding of the text and can hardly capture long-range context that is far from the current word yet helpful for understanding its semantics.

To better capture and exploit long-range context and thereby improve the accuracy of text prediction, the embodiments of this specification propose a new text prediction model and a training method therefor. The model divides the already input text into text segments and stores the representation vectors of these segments in a buffer as long-range context. When predicting the next word after the current word, the model jointly considers the hidden vector corresponding to the current word and the representation vectors stored in the buffer.
Fig. 1 is a schematic diagram of a text prediction model according to an embodiment disclosed in this specification. As shown in Fig. 1, the text prediction model includes a time-series-based first prediction network 11, a buffer 12, a second prediction network 13 based on the buffer, and optionally a policy network 14.

The first prediction network 11 includes a time-series neural network, such as an RNN, LSTM, or GRU_RNN. According to the way a time-series neural network works, when a training text is input into the text prediction model, the first prediction network 11 reads the words of the training text one by one and processes them iteratively. When processing each word W_t, the state vector h_t after processing the current word is obtained from the state vector h_{t-1} after processing the previous word W_{t-1} and the word vector of the current word. The first prediction network 11 may further include a multilayer perceptron MLP, which obtains a first prediction result p for the next word based on the state vector h_t corresponding to the current word.

The buffer 12 stores the representation vectors of the text segments (spans) preceding the current word, i.e., the segment vectors. The segment length L may be a predetermined length, e.g., 2 words, 3 words, or 5 words. In one embodiment, for a text segment consisting of the i-th to j-th words (j = i+L-1), the segment vector can be obtained as the difference between the state vector corresponding to the j-th word output by the first prediction network 11 and the state vector corresponding to the (i-1)-th word.

The second prediction network 13 performs a prediction operation based on the existing segment vectors stored in the buffer 12 to obtain a second prediction result q for the next word. The second prediction result q reflects the prediction based on long-range context.

The first prediction result p and the second prediction result q are then combined: an interpolation weight coefficient λ can be used to interpolate the two, yielding a combined prediction result.

The interpolation weight coefficient above may be a preset hyperparameter or a trainable parameter. Optionally and preferably, the interpolation weight coefficient differs for each word and is determined by the policy network 14. Specifically, the policy network 14 obtains the state vector h_t corresponding to the current word from the first prediction network 11 and computes from it an interpolation weight coefficient λ for the current word, which is used to combine the first and second prediction results.

It can thus be seen that the text prediction model shown in Fig. 1 has at least the following features. First, on top of prediction with a time-series neural network, the buffer stores the segment vectors of the text segments preceding the current word, and these segment vectors serve as long-range context for long-range-context-based prediction. The final prediction is a combination of the two predictions. Further, the policy network can dynamically adjust the weight of the long-range prediction result, further improving prediction accuracy.
The training process of the above text prediction model is described in detail below.

Fig. 2 shows a flowchart of a method for training a text prediction model according to an embodiment. The text prediction model has the structure shown in Fig. 1, and the method can be executed by any apparatus, device, platform, or device cluster with computing and processing capabilities.

Before the steps shown in Fig. 2 are performed, the following preparation can be carried out. First, a training corpus, i.e., a training sample set containing a large number of training texts, is obtained. Before a training text is input into the text prediction model, word embedding is first performed on it: each word is converted into a word vector, so that the training text becomes a sequence of word vectors. In one embodiment, word embedding can be implemented by one-hot encoding, in which case the dimension of each word vector equals the number V of words in the vocabulary. In other embodiments, the word vectors can be obtained with other embedding methods, e.g., word2vec.
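As an illustration of the preprocessing described above, a minimal sketch of one-hot word embedding might look as follows; the toy vocabulary and tokenization are hypothetical, and a real corpus would use a vocabulary of size V built from the training sample set:

```python
import numpy as np

def one_hot_sequence(words, vocab):
    """Convert a tokenized training text into a sequence of one-hot word vectors.

    vocab: dict mapping each word to an index in [0, V), where V = len(vocab).
    Returns an array of shape (len(words), V), one V-dimensional word vector per word.
    """
    V = len(vocab)
    vectors = np.zeros((len(words), V), dtype=np.float32)
    for t, w in enumerate(words):
        vectors[t, vocab[w]] = 1.0
    return vectors

# Hypothetical toy example
vocab = {"i": 0, "have": 1, "no": 2, "appetite": 3}
x = one_hot_sequence(["i", "have", "no"], vocab)  # shape (3, 4)
```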
In one embodiment, the training text is Chinese text. In that case, in one example, the training text can first be segmented into words, and word embedding is then performed for each resulting word. In another example, each Chinese character is treated directly as a word. Therefore, "word" below also covers the case of a single Chinese character.

After word embedding, the training text can be input into the text prediction model for prediction and training. As mentioned above, the basic network of the text prediction model is still a time-series neural network. Therefore, the words of the current training text (more precisely, their word vectors) are input into the model one by one, and the model performs prediction processing on each input word in turn. The prediction and training process of the text prediction model is described below with respect to an arbitrary t-th word of the training text.
As shown in Fig. 2, in step 21, the t-th word of the current training text is input into the first prediction network of the text prediction model. It can be understood that, before this, the first t-1 words of the current training text have already been sequentially input into the model.

As mentioned above, the first prediction network includes a time-series neural network, which determines the state at the next time step jointly from the state at the previous time step and the current input. Given that the (t-1)-th word has been processed at the previous step and the word vector x_t of the t-th word W_t is now input, the first prediction network determines the state vector h_t after processing the t-th word from the state vector h_{t-1} after processing the (t-1)-th word and the word vector x_t. This process can be expressed by the following formula (1):

h_t = Φ(x_t, h_{t-1})    (1)

where Φ is the state transition function, whose specific form depends on the type of the time-series neural network, e.g., RNN or LSTM. The dimension of the state vector is denoted d.

Hereinafter, for simplicity and clarity, the state vector h_t after processing the current t-th word is called the first hidden vector.

The first prediction network may further include a multilayer perceptron MLP for determining the first prediction probability p for the next word from the first hidden vector h_t. More specifically, the first prediction probability p may include the probability distribution of the next word over the words of the vocabulary. If the vocabulary contains V words, p can be expressed as a V-dimensional vector.

In one embodiment, to determine the first prediction probability p, the MLP first applies a linear transformation matrix O_{t+1} to the first hidden vector h_t. This linear transformation matrix is a trainable parameter matrix that transforms or projects the d-dimensional first hidden vector h_t into a V-dimensional vector. Optionally, a softmax function is then applied to obtain a probability distribution over the words. Specifically, the first prediction probability p for the next word can be expressed as:
p = softmax(h_t^T O_{t+1})    (2)

where h_t^T denotes the transpose of h_t.
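As a concrete illustration of formulas (1) and (2), a minimal sketch of the first prediction network using an LSTM cell as one possible choice of Φ might look as follows; the PyTorch modules and dimension names are illustrative assumptions rather than the prescribed implementation:

```python
import torch
import torch.nn as nn

class FirstPredictionNetwork(nn.Module):
    """Time-series prediction network: formula (1) via an LSTM cell, formula (2) via a projection."""

    def __init__(self, embed_dim, state_dim, vocab_size):
        super().__init__()
        self.cell = nn.LSTMCell(embed_dim, state_dim)   # state transition function Φ in formula (1)
        self.proj = nn.Linear(state_dim, vocab_size)    # linear transformation matrix O

    def step(self, x_t, state):
        # formula (1): h_t = Φ(x_t, h_{t-1}); the LSTM state also carries a cell vector c
        h_t, c_t = self.cell(x_t, state)
        # formula (2): p = softmax(h_t^T O), a distribution over the V vocabulary words
        p = torch.softmax(self.proj(h_t), dim=-1)
        return p, (h_t, c_t)
```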
Fig. 3 shows an example of prediction processing on a specific training text. In the example of Fig. 3, suppose the current input is the 92nd word, "no", of the training text. In the first prediction network, the time-series neural network obtains the state vector h_92 corresponding to the 92nd word from the state vector h_91 after processing the 91st word, "have", and the word vector corresponding to the 92nd word, "no". The MLP then obtains from the state vector h_92 the first prediction probability p for the next word, i.e., the 93rd word.

It can be understood that, in general, a prediction obtained from the state vector of a time-series neural network mostly reflects the influence of the local context close to the current word on the understanding of its semantics. For example, in Fig. 3, since the local context of the current word "no" is "i have", the first prediction network tends to output higher prediction probabilities for common collocations of this local context, such as "trouble" or "idea".
To make better use of long-range context information, in step 22, several existing segment vectors are read from the buffer. These segment vectors are formed from the text preceding the t-th word in the current training text, and each segment vector corresponds to a text segment of L consecutive words. In other words, while the first t-1 words are processed one by one, text segments of length L can be formed, and the representation vectors of these segments, i.e., the segment vectors, are stored in the buffer as long-range context information.

Specifically, the segment length L can be preset as needed. For example, a longer segment length, e.g., 8 or 10 words, can be set for longer training texts, and a shorter segment length, e.g., 2 or 3 words, for shorter training texts.

Thus, while the text before the t-th word is processed, the first t-1 words form text segments m_ij of preset length L, where i is the index of the first word of the segment and j the index of its last word, with i and j both smaller than t and j = i+L-1. The representation vector of a text segment, i.e., its segment vector, can be obtained from the state vectors of the first prediction network when processing the preceding words.

Specifically, in one embodiment, for a text segment m_ij consisting of the i-th to j-th words, the segment vector is obtained from the difference between a first state vector and a second state vector, where the first state vector is the state vector h_j after the first prediction network processes the j-th word, i.e., the state vector after the last word (the j-th word) of the segment m_ij, and the second state vector is the state vector h_{i-1} after the first prediction network processes the (i-1)-th word, i.e., the state vector before the first word (the i-th word) of the segment m_ij.
Fig. 4 is a schematic diagram of determining the segment vector of a text segment according to an embodiment. In the example of Fig. 4, text segments are formed with a segment length of 2 words. For the current text segment m_{12-13} formed by the 12th and 13th words (shown boxed in Fig. 4), the segment vector can be determined as h_13 - h_11, where h_13 is the state vector of the time-series neural network after processing the 13th word, and h_11 is its state vector after processing the 11th word (i.e., before processing the 12th word), in other words, the state vector at the end of the previous text segment.

In another embodiment, for a text segment m_ij consisting of the i-th to j-th words, the L state vectors obtained by the first prediction network after processing each of the i-th to j-th words are collected, and their sum or average is taken as the segment vector corresponding to m_ij.

Segment vectors can also be obtained in other ways. Preferably, however, the segment vectors are computed from the state vectors of the time-series neural network when processing the preceding words; in this way, the processing results of the first prediction network are reused and the computation of the segment vectors is simplified.
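As a brief sketch of the two segment-vector computations just described, assuming a list `states` in which `states[t]` holds the state vector h_t of the first prediction network (with words indexed from 1):

```python
import torch

def span_vector_diff(states, i, j):
    """Segment vector of m_ij as the state-vector difference h_j - h_{i-1} (first embodiment)."""
    return states[j] - states[i - 1]

def span_vector_mean(states, i, j):
    """Alternative embodiment: average of the L state vectors h_i, ..., h_j."""
    return torch.stack(states[i:j + 1]).mean(dim=0)
```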
Using any of the segment-vector computations above, segment vectors can be obtained while the first prediction network iteratively processes the words of the current training text. Specifically, a counter cycling with period L can be set up to count the words processed by the first prediction network. The counter is incremented as the processed words accumulate; every L accumulated words form a new text segment, the counter is reset, the segment vector of the new text segment is computed, and it is stored in the buffer.

Correspondingly, for the currently processed t-th word, it can be determined whether the t-th word is the last word of the current text segment, specifically, whether the counter has reached L. If it is the last word, the current text segment is taken as a new text segment and its segment vector is computed. Specifically, in one embodiment, the new segment vector can be determined from the difference between the aforementioned first hidden vector h_t and a second hidden vector h_{t-L}, where h_{t-L} is the state vector after the first prediction network processes the (t-L)-th word. The new segment vector is then added to the buffer.

In one embodiment, the buffer storing the segment vectors of the preceding text has a limited capacity B and can accordingly store only a limited number N of segment vectors. In this case, the buffer can be made to store the segment vectors of the N text segments closest to the currently processed word. Specifically, in one embodiment, when a new segment vector is to be added, it is first determined whether the number of segment vectors already in the buffer has reached the threshold number N. If not, the new segment vector is added directly; if it has reached N, the earliest stored segment vector is deleted and the new one is stored in the buffer.
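A minimal sketch of such a bounded buffer, combined with the period-L counter described above, might look as follows; the class and method names are illustrative assumptions:

```python
from collections import deque

class SpanBuffer:
    """Stores at most N segment vectors; the earliest one is evicted when the buffer is full."""

    def __init__(self, capacity_n, span_length_l):
        # a deque with maxlen drops the oldest entry automatically on append
        self.spans = deque(maxlen=capacity_n)
        self.L = span_length_l
        self.counter = 0

    def observe(self, t, states):
        """Call after the first prediction network has processed word t; states[t] holds h_t."""
        self.counter += 1
        if self.counter == self.L:  # the t-th word closes the current text segment
            self.counter = 0
            # new segment vector from the difference h_t - h_{t-L}
            self.spans.append(states[t] - states[t - self.L])

    def read(self):
        return list(self.spans)
```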
Continuing the example of Fig. 3, the current input is the 92nd word, "no", of the training text. At this point, the buffer already stores multiple segment vectors formed from the text before the 92nd word, each corresponding to a text segment of 3 consecutive words. The text segment closest to the current word is m_{89-91}, formed by the 89th to 91st words. Due to the limited capacity of the buffer, the earliest segment vector stored in it corresponds to the segment m_{16-18}, formed by the 16th to 18th words.

It can be seen that the segment vectors stored in the buffer can represent text segments far from the current word. These segment vectors can therefore serve as long-range context information to aid the understanding of the current word's semantics and, in turn, the prediction of the next word.

Therefore, in step 23, the second prediction network determines the second prediction probability q for the next word from the segment vectors stored in the buffer. Specifically, the second prediction network can use an attention mechanism to aggregate the existing segment vectors into a single context vector and then determine the second prediction probability q from that context vector.
Fig. 5 shows the steps of determining the second prediction probability according to an embodiment. First, in step 51, the attention coefficients corresponding to the segment vectors are determined. Specifically, for an arbitrary i-th segment vector s_i, its attention coefficient α_{t,i} can be determined based on a similarity measure.

In one embodiment, the similarity γ_{t,i} between the i-th segment vector s_i and the first hidden vector h_t can be determined, where the similarity may be a cosine similarity, a similarity based on the Euclidean distance, and so on. The i-th attention coefficient α_{t,i} is then determined from the similarity γ_{t,i}. Specifically, a softmax function can be applied to normalize the similarities of the segment vectors into the corresponding attention coefficients. For example, the i-th attention coefficient α_{t,i} can be determined as:

α_{t,i} ∝ exp(γ_{t,i})    (3)

In another embodiment, for the i-th segment vector s_i, the corresponding similarity is determined as follows. A first transformation matrix W_s transforms the i-th segment vector s_i into a first intermediate vector W_s s_i, and a second transformation matrix W_h transforms the first hidden vector h_t into a second intermediate vector W_h h_t. The similarity γ_{t,i} between the sum of the first and second intermediate vectors and a third vector v is then determined, namely:

γ_{t,i} = v^T (W_h h_t + W_s s_i)    (4)

where the first transformation matrix W_s, the second transformation matrix W_h, and the third vector v are all trainable network parameters of the second prediction network.

The i-th attention coefficient α_{t,i} can then be determined from the similarity γ_{t,i}, again using formula (3).

Next, in step 52, with the attention coefficients of the segment vectors as weight factors, the segment vectors are weighted and combined into a context vector ξ_t.

In one example, the segment vectors s_i stored in the buffer can be arranged in order into a matrix C_t (one segment vector per row), and their attention coefficients α_{t,i} into an attention vector α_t. The context vector ξ_t can then be expressed as:
ξ_t = C_t^T α_t    (5)
Then, in step 53, the second prediction probability q is obtained from the context vector ξ_t and a linear transformation matrix. It can be understood that, like the first prediction probability p, the second prediction probability q may include the probability distribution of the next word over the words of the vocabulary, so q is also a V-dimensional vector. Accordingly, the linear transformation matrix used in step 53 transforms or projects the d-dimensional context vector ξ_t into a V-dimensional vector. Specifically, the second prediction probability q can be expressed as:
q = softmax(ξ_t^T O_{t+1})    (6)
where O_{t+1} is the linear transformation matrix applied to the context vector.

In one embodiment, the linear transformation matrix applied to the context vector in formula (6) is the same matrix as the one applied to the first hidden vector in formula (2). In another embodiment, the second prediction network maintains its own linear transformation matrix for formula (6), independent of the one used by the first prediction network in formula (2).
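Putting steps 51 to 53 together, a minimal sketch of the second prediction network using the additive scoring of formula (4) might look as follows; the module and parameter names are illustrative assumptions, and the projection matrix here is the network's own, the second of the two options just described:

```python
import torch
import torch.nn as nn

class SecondPredictionNetwork(nn.Module):
    """Attention over cached segment vectors: formulas (3) to (6)."""

    def __init__(self, state_dim, vocab_size):
        super().__init__()
        self.W_s = nn.Linear(state_dim, state_dim, bias=False)  # first transformation matrix W_s
        self.W_h = nn.Linear(state_dim, state_dim, bias=False)  # second transformation matrix W_h
        self.v = nn.Parameter(torch.randn(state_dim))           # third vector v
        self.proj = nn.Linear(state_dim, vocab_size)            # linear transformation matrix O

    def forward(self, h_t, spans):
        C_t = torch.stack(spans)                            # (N, d): one segment vector per row
        scores = (self.W_h(h_t) + self.W_s(C_t)) @ self.v   # formula (4): gamma_{t,i}
        alpha = torch.softmax(scores, dim=0)                # formula (3): attention coefficients
        xi_t = alpha @ C_t                                  # formula (5): context vector
        q = torch.softmax(self.proj(xi_t), dim=-1)          # formula (6): second prediction probability
        return q
```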
In the above manner, the second prediction network obtains the second prediction probability q for the next word from the segment vectors stored in the buffer. As mentioned above, the segment vectors in the buffer reflect long-range context information; the second prediction probability q obtained from them can therefore reflect a prediction of the next word based on long-range context.

Continuing the example of Fig. 3, the buffer stores the segment vectors of preceding text segments, including segments relatively far from the current word, such as m_{16-18}. Based on these segment vectors and the attention mechanism, the second prediction probability q for the next word is obtained, a prediction that takes more account of long-range context. For example, since the segment m_{16-18} contains the long-range context "good restaurant", the second prediction probability q tends to output higher probabilities for words related to that long-range context, such as "appetite".
Having obtained the first prediction probability p and the second prediction probability q, in step 24 of Fig. 2, with the interpolation weight coefficient λ as the weight of q and 1-λ as the weight of p, the first and second prediction probabilities are interpolated into the combined prediction probability Pr for the next word, namely:

Pr = λ*q + (1-λ)*p    (7)

Then, in step 25, the prediction loss for the t-th word is determined at least from the combined prediction probability Pr and the (t+1)-th word of the current training text.

In one embodiment, the interpolation weight coefficient is a preset hyperparameter or a trainable model parameter. In this case, the true next word of the training text, i.e., the (t+1)-th word, can serve as the label, and the prediction loss for the current word is determined by comparing the combined prediction probability Pr with the label. For example, a cross-entropy loss function can be used to determine the prediction loss Loss:

Loss = -log Pr(x_{t+1} | x_{1:t})    (8)

Then, in step 26, the text prediction model is trained according to the total prediction loss over the words of the current training text. Specifically, the first and second prediction networks are updated in the direction that reduces the total prediction loss.
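A minimal sketch of formulas (7) and (8) for a single position t might look as follows; a fixed interpolation weight is assumed here, while the per-word policy-network variant is sketched further below:

```python
import torch

def interpolated_loss(p, q, lam, next_word_id):
    """Formula (7): Pr = lam*q + (1-lam)*p; formula (8): cross-entropy on the true next word."""
    prob = lam * q + (1.0 - lam) * p         # combined prediction probability Pr
    return -torch.log(prob[next_word_id])    # Loss = -log Pr(x_{t+1} | x_{1:t})
```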
Further, the inventors found that, for a piece of text, in most cases the understanding of the current word and the prediction of the next word depend mostly on local context, and only in a minority of cases on long-range context. Therefore, when interpolating the first and second prediction probabilities, it is preferable that the interpolation weight coefficient is not fixed but differs from word to word.

To this end, as shown in Fig. 1, in one embodiment, in addition to the first and second prediction networks, the text prediction model further includes a policy network, which determines for the current word its corresponding interpolation weight coefficient λ. The way the policy network determines the interpolation weight coefficient, and the way it is trained, are described below.

Specifically, to determine the interpolation weight coefficient λ_t for the current t-th word, the policy network can obtain the first hidden vector h_t produced by the first prediction network for the t-th word and compute λ_t from it.

In a specific embodiment, the policy network can apply a policy transformation matrix W_g to the first hidden vector h_t to obtain a policy vector W_g h_t, where W_g is a trainable model parameter maintained by the policy network. It can be an M×d matrix, transforming the d-dimensional first hidden vector into an M-dimensional policy vector, where M is a preset number of dimensions. The interpolation weight coefficient λ_t can then be determined from the element value of a predetermined dimension of the M-dimensional policy vector. For example, the element value of a certain dimension after normalizing the policy vector can be taken as λ_t, namely:

λ_t ∝ exp(W_g h_t)    (9)

For example, one can typically take M = 2, so that the policy transformation matrix yields a 2-dimensional policy vector, and λ_t is obtained from the element value of one of its two dimensions. In a more simplified example, M = 1 can be taken, in which case the policy transformation matrix W_g degenerates into a vector, the policy vector degenerates into a scalar, and λ_t can be obtained from that scalar.
Further, to better control the magnitude of the output interpolation weight coefficient, a training policy coefficient T is also set in the policy network. The coefficient T can be a hyperparameter adjustable during training; more specifically, it can be determined per training text, so as to better regulate the output interpolation weight coefficient.

In this case, formula (9) above can be modified into the following formula (10):
λ_t ∝ exp(W_g h_t / T)    (10)
That is, the policy transformation matrix W_g is applied to the first hidden vector and the result is divided by the training policy coefficient T to obtain the policy vector; the interpolation weight coefficient λ_t is then obtained from the policy vector.

As formula (10) shows, the smaller the training policy coefficient T, the larger the resulting interpolation weight coefficient. According to formula (7), the interpolation weight coefficient is the weight applied to the second prediction probability; a larger interpolation weight coefficient therefore means encouraging the use of long-range context.

Therefore, in one embodiment, a process similar to annealing can be used to set and adjust the training policy coefficient. Specifically, a larger training policy coefficient T, i.e., a higher temperature T, can be set at the beginning of training; then, as training proceeds, T is gradually lowered. This means that, as training proceeds, the text prediction model is increasingly encouraged to explore the use of long-range context.

In a specific example, the training policy coefficient T can be determined according to the training order number of the current training text in the training sample set, such that T is negatively correlated with the order number. In other words, the smaller the order number, i.e., the closer to the beginning of training, the larger the coefficient T and the higher the temperature; as the order number grows, the temperature drops and the coefficient decreases.

On the other hand, the training policy coefficient T for the current training text can also be determined according to its total text length; specifically, T can be made negatively correlated with the total text length. Thus, for longer training texts a smaller coefficient T can be set, yielding a larger interpolation weight coefficient and thereby encouraging the use of long-range context more strongly.

Through the above approaches, the policy network determines the interpolation weight coefficient λ_t corresponding to the current t-th word of the current training text. This coefficient is applied in formula (7) above to obtain the combined prediction probability Pr.
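A minimal sketch of the policy network with the temperature-scaled normalization of formula (10) might look as follows, taking M = 2 as in the example above; the linear annealing schedule shown is an illustrative assumption, since the specification only requires that T decrease as training proceeds:

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Outputs the interpolation weight coefficient lambda_t from h_t, per formula (10)."""

    def __init__(self, state_dim, m_dims=2):
        super().__init__()
        self.W_g = nn.Linear(state_dim, m_dims, bias=False)  # policy transformation matrix W_g

    def forward(self, h_t, temperature):
        policy = torch.softmax(self.W_g(h_t) / temperature, dim=-1)  # formula (10)
        return policy[0]  # lambda_t taken from a predetermined dimension of the policy vector

# Illustrative annealing schedule: temperature decreases with the training order number k
def temperature_schedule(k, t_max=2.0, t_min=0.5, decay=1e-4):
    return max(t_min, t_max - decay * k)
```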
Continuing the example of Fig. 3, the first prediction network obtains, from the state vector h_92 of the current 92nd word, i.e., the first hidden vector, the first prediction probability p for the 93rd word; the second prediction network obtains the second prediction probability q from the segment vectors stored in the buffer. The policy network obtains the interpolation weight coefficient from the first hidden vector h_92 and the training policy coefficient T (shown in the figure as the "annealing" temperature). The first prediction probability p and the second prediction probability q are then interpolated with this coefficient to obtain the combined prediction probability Pr.

To train the policy network, the way the prediction loss is determined must be modified: when determining the prediction loss Loss, not only the combined prediction probability obtained from the first and second prediction networks is considered, but also the output of the policy network. Therefore, according to an implementation, in step 25 described above, the prediction loss Loss is determined jointly from the combined prediction probability and the (t+1)-th word, and from the first prediction probability p, the second prediction probability q, and the interpolation weight coefficient.
In one embodiment, when the policy network is used, the prediction loss can be determined as follows. On the one hand, a first loss term L1 can be determined from the combined prediction probability Pr and the (t+1)-th word. The first loss term L1 can take the form of a cross-entropy loss, as in formula (8); in other words, the loss of formula (8) can serve as the first loss term L1 here.

On the other hand, a second loss term L2 is determined from the interpolation weight coefficient λ_t such that L2 is negatively correlated with λ_t. For example, in one example, the second loss term can be set to:

L2 = -log λ_t    (11)

In other examples, the second loss term L2 can also be set to other forms negatively correlated with λ_t, e.g., 1/λ_t.

In addition, a reward term r_t is determined from the ratio of the probability values that the second prediction probability q and the first prediction probability p respectively assign to the (t+1)-th word, the reward term being positively correlated with the ratio. The first loss term and the second loss term, with the reward term r_t as the coefficient of the second loss term L2, are then summed to determine the prediction loss Loss.

With the second loss term in the form of formula (11), the prediction loss Loss can be expressed as:

Loss = -log Pr(x_{t+1} | x_{1:t}) - η * r_t * log λ_t    (12)

where η is an optional adjustment coefficient, η > 0.
如公式(12)所示,损失函数表达式中的第一项对应于第一损失项,该第一损失项旨在增大正确预测下一个词的可能性。第二项为奖励项和第二损失项的乘积,旨在有条件地鼓励对长程上下文的探索和使用。As shown in formula (12), the first term in the loss function expression corresponds to the first loss term, which aims to increase the probability of correctly predicting the next word. The second term is the product of the reward term and the second loss term, which aims to conditionally encourage the exploration and use of the long-range context.
可以看到,r t*logλ t在形式上非常类似于强化学习中的策略梯度。实际上,鼓励探索和使用长程上下文可以通过第二损失项L2本身来体现,因为第二损失项的较小值对应于较大的λ t。然而,如前所述,事实上,仅在少数情况下需要依赖于长程上下文进行预测。因此,对长程上下文的鼓励应是有条件地进行,该条件通过奖励项r t来体现。奖励项的调节意味着,仅在第二预测网络针对正确的下一个词的预测概率显著高于第一预测网络的预测概率时,才鼓励较大的内插权重系数λ tIt can be seen that r t *logλ t is very similar in form to the policy gradient in reinforcement learning. In fact, encouraging exploration and use of long-range context can be embodied by the second loss term L2 itself, because a smaller value of the second loss term corresponds to a larger λ t . However, as mentioned earlier, in fact, only a few cases need to rely on long-range context for prediction. Therefore, the encouragement of the long-range context should be carried out conditionally, and the condition is reflected by the reward item r t. The adjustment of the reward term means that only when the prediction probability of the second prediction network for the correct next word is significantly higher than the prediction probability of the first prediction network, a larger interpolation weight coefficient λ t is encouraged.
具体的,第二预测网络输出第二预测概率q,其中针对真实的第t+1个词(也就是正确的下一个词)的概率值为q(x t+1|x 1:t);第一预测网络针对第t+1个词的概率值为p(x t+1|x 1:t)。可以定义二者的比值为R: Specifically, the second prediction network outputs the second prediction probability q, where the probability value for the real t+1th word (that is, the correct next word) is q(x t+1 |x 1:t ); The probability value of the first prediction network for the t+1th word is p(x t+1 |x 1:t ). The ratio of the two can be defined as R:
R = q(x_{t+1}|x_{1:t}) / p(x_{t+1}|x_{1:t})        (13)
The ratio R reflects the relative prediction accuracy of the second and first prediction networks for the correct next word. The reward term r_t is set to be positively correlated with R: the larger the ratio R, the larger the reward term r_t. Moreover, during training the correct next word, that is, the (t+1)-th word, is known, so the magnitude of the reward term can be determined explicitly and uniquely. For this reason, the reward term may also be called an intrinsic reward.
The reward term r_t can be determined from the ratio R in a variety of ways.
In one specific example, the reward term r_t is determined by the following formula (14):
r_t = f( min( (q(x_{t+1}|x_{1:t}) / (p(x_{t+1}|x_{1:t}) + ε))^κ, a ) - b )        (14)

where ε is a very small value, set to avoid the mathematical problem caused when p(x_{t+1}|x_{1:t}) is 0; the quotient q(x_{t+1}|x_{1:t}) / (p(x_{t+1}|x_{1:t}) + ε) can therefore be regarded as approximately equal to the ratio R above.
More specifically, in one example, the above function f(z) can take the form of a ReLU function:
f(z) = β·max(z, 0)        (15)
In formula (14), κ serves to amplify the effect of R exponentially, and in formula (15), β performs a linear amplification; these parameters can be set according to need and practice. For example, in one example, κ = 5 and β = 3. In addition, the parameter a in formula (14) is a truncation threshold and the parameter b is a baseline threshold; these thresholds can likewise be set according to need and practice. For example, in one example, a = 10 and b = 1.
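Under this parameterization, the reward computation of formulas (13)-(15) can be sketched as follows (a minimal sketch; the exact composition of the exponent, truncation, and baseline follows the reconstruction of formula (14) above and should be read as an assumption; parameter values follow the example in the text):

    def reward(q_next, p_next, kappa=5.0, beta=3.0, a=10.0, b=1.0, eps=1e-8):
        """Intrinsic reward r_t per formulas (13)-(15).

        q_next, p_next: probability values that the second and first
                        prediction networks assign to the true (t+1)-th word
        """
        ratio = q_next / (p_next + eps)    # R of formula (13); eps avoids /0
        z = min(ratio ** kappa, a) - b     # exponential gain kappa, truncation
                                           # threshold a, baseline threshold b
        return beta * max(z, 0.0)          # f(z): ReLU with linear gain beta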
In other examples, the reward term r_t can also be determined from the ratio R in other specific forms, as long as r_t is positively correlated with R.
When the prediction loss is determined according to formula (12), reducing it requires not only increasing the prediction probability of the correct word through the first loss term, but also making the second term as small as possible. Consequently, when the second prediction network's prediction probability for the correct next word is significantly higher than that of the first prediction network, that is, when the ratio R is large, a larger reward term r_t is obtained, which forces the second loss term to be smaller, i.e., pushes the policy network to output a larger λ_t. This achieves the goal of conditionally encouraging a larger interpolation weight coefficient λ_t, that is, of conditionally encouraging use of the long-range context.
In this way, after the prediction loss is determined in step 25 according to the loss function of formula (12), the text prediction model is trained in step 26 according to the total prediction loss over the words: the model parameters in the first prediction network, the second prediction network, and the policy network are adjusted in the direction that reduces the total prediction loss, thereby achieving the training objective described above.
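In practice, step 26 can be carried out with any gradient-based optimizer. A hedged sketch, assuming a PyTorch implementation of the model and an assumed per_word_losses interface (not fixed by the specification) that returns the formula-(12) loss at every position:

    def train_step(model, optimizer, batch):
        """One update in the direction of decreasing total prediction loss.

        model.per_word_losses(batch) is assumed to return a tensor holding
        the formula-(12) loss for every position t of the training text.
        """
        optimizer.zero_grad()
        total_loss = model.per_word_losses(batch).sum()
        total_loss.backward()     # gradients flow into the first prediction
        optimizer.step()          # network, the second prediction network,
        return total_loss.item()  # and the policy network

    # Usage (illustrative): optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)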
Reviewing the above process: according to the text prediction model of the embodiments of this specification, on top of predicting the next word with the time-series-based first prediction network, a buffer stores the segment vectors of earlier text segments as long-range context information, and the second prediction network makes predictions based on that long-range context. When the prediction results of the two networks are combined by interpolation, a policy network can be used to generate the interpolation weight coefficient for the current word. When training this text prediction model, introducing the reward term and the interpolation weight coefficient into the loss function conditionally encourages the exploration and use of the long-range context, thereby further improving prediction accuracy.
According to an embodiment of another aspect, a training apparatus for a text prediction model is provided, where the text prediction model includes a time-series-based first prediction network, a buffer, and a second prediction network based on the buffer; the training apparatus can be deployed on any device, platform, or device cluster with computing and processing capability. Fig. 6 shows a schematic block diagram of a training apparatus for a text prediction model according to one embodiment. As shown in Fig. 6, the training apparatus 600 includes: a first prediction unit 61, configured to input the t-th word into the first prediction network after the first t-1 words of the current training text have been input in sequence, so that the first prediction network determines, from the state vector after processing the (t-1)-th word and the word vector of the t-th word, the state vector after processing the t-th word as a first latent vector, and determines a first prediction probability for the next word from that first latent vector; a reading unit 62, configured to read several existing segment vectors from the buffer, where those segment vectors are formed based on the text preceding the t-th word of the current training text and each segment vector corresponds to a text segment of length L words; a second prediction unit 63, configured to cause the second prediction network to determine a second prediction probability for the next word from the several segment vectors; a synthesis unit 64, configured to take an interpolation weight coefficient as the weighting coefficient of the second prediction probability and 1 minus that coefficient as the weighting coefficient of the first prediction probability, and to interpolate the two prediction probabilities to obtain a comprehensive prediction probability for the next word; a loss determination unit 65, configured to determine the prediction loss for the t-th word at least according to the comprehensive prediction probability and the (t+1)-th word of the training text; and a training unit 66, configured to train the text prediction model according to the prediction losses for the words of the current training text.
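For concreteness, the interpolated synthesis performed by the synthesis unit 64 can be sketched as follows (names are illustrative assumptions; p, q may be probability vectors over the vocabulary):

    def interpolate(p, q, lam):
        """Comprehensive prediction probability Pr of the synthesis unit 64.

        p:   first prediction probability over the vocabulary (unit 61)
        q:   second prediction probability over the vocabulary (unit 63)
        lam: interpolation weight coefficient lambda_t in [0, 1]
        """
        return lam * q + (1.0 - lam) * p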
In one embodiment, the first prediction network includes a recurrent neural network (RNN) or a long short-term memory network (LSTM).
According to one embodiment, the several segment vectors stored in the buffer include a first segment vector corresponding to an arbitrary first text segment, where the first text segment includes the i-th word to the j-th word of the current training text, with both i and j smaller than t; the first segment vector is obtained based on the difference between a first state vector and a second state vector, the first state vector being the state vector of the first prediction network after processing the j-th word, and the second state vector being the state vector of the first prediction network after processing the (i-1)-th word.
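As an illustrative sketch, such a segment vector can be computed from recorded state vectors as follows (the bookkeeping convention that states[k] is the state vector after the k-th word, with states[0] the initial state, is an assumption):

    def segment_vector(states, i, j):
        """Vector for the text segment spanning words i..j (i <= j < t).

        states: list in which states[k] is the first prediction network's
                state vector after processing the k-th word, so the result
                is the first state vector minus the second state vector.
        """
        return states[j] - states[i - 1]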
According to one embodiment, the apparatus 600 further includes a storage unit (not shown), configured to: if the t-th word is the last word of the current text segment, determine a new segment vector according to the difference between the first latent vector and a second latent vector, where the second latent vector is the state vector of the first prediction network after processing the (t-L)-th word; and add the new segment vector to the buffer.
In one embodiment, the buffer has a limited storage capacity. In such a case, the storage unit is further configured to: judge whether the number of segment vectors already in the buffer reaches a predetermined threshold number; and, if that threshold is reached, delete the earliest stored segment vector and store the new segment vector in the buffer.
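One possible realization of such a fixed-capacity buffer is sketched below (using a deque with a maximum length, which drops the earliest stored vector once the predetermined threshold is reached; this is one implementation choice, not mandated by the specification):

    from collections import deque

    class SegmentBuffer:
        """Fixed-capacity buffer of segment vectors; oldest evicted first."""

        def __init__(self, capacity):
            # A deque with maxlen automatically deletes the earliest
            # stored vector when a new one is appended at capacity.
            self._vectors = deque(maxlen=capacity)

        def add(self, seg_vec):
            self._vectors.append(seg_vec)

        def read(self):
            return list(self._vectors)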
According to one implementation, the second prediction network obtains the second prediction probability by: determining several attention coefficients respectively corresponding to the several segment vectors; combining the segment vectors in a weighted manner, with the attention coefficients as weighting factors, to obtain a context vector; and obtaining the second prediction probability according to the context vector and a linear transformation matrix.
In one embodiment, when determining the first prediction probability, the first prediction network obtains it from the first latent vector and the same linear transformation matrix as used by the second prediction network.
In a more specific embodiment, the second prediction network determines the attention coefficients as follows: the i-th attention coefficient is determined according to the similarity between an arbitrary i-th segment vector among the several segment vectors and the first latent vector.
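Combining the attention scheme above with the dot-product similarity of this embodiment, the second prediction network's computation can be sketched as follows (a minimal sketch; shapes and names are assumptions):

    import torch
    import torch.nn.functional as F

    def second_prediction(seg_vectors, h_t, W_out):
        """Second prediction probability from the cached segment vectors.

        seg_vectors: (n, d) tensor of the n segment vectors from the buffer
        h_t:         (d,) first latent vector
        W_out:       (vocab, d) linear transformation matrix (may be shared
                     with the first prediction network, as described above)
        """
        scores = seg_vectors @ h_t                # dot-product similarities
        alphas = F.softmax(scores, dim=0)         # attention coefficients
        context = alphas @ seg_vectors            # weighted combination
        return F.softmax(W_out @ context, dim=0)  # second prediction probability q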
In another more specific embodiment, the second prediction network determines the attention coefficients as follows: an arbitrary i-th segment vector among the several segment vectors is transformed into a first intermediate vector using a first transformation matrix; the first latent vector is transformed into a second intermediate vector using a second transformation matrix; the similarity between the sum vector of the first and second intermediate vectors and a third vector is determined; and the i-th attention coefficient is determined according to that similarity. The first transformation matrix, the second transformation matrix, and the third vector are all trainable network parameters of the second prediction network.
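A sketch of this additive variant, assuming the similarity is realized as a dot product with the third vector v (an assumption; the specification leaves the similarity measure open):

    import torch

    def additive_attention(seg_vectors, h_t, W1, W2, v):
        """Attention coefficients per the additive variant.

        W1, W2 (matrices) and v (the third vector) are trainable
        parameters of the second prediction network.
        """
        m = seg_vectors @ W1.T         # first intermediate vectors, (n, d')
        s = h_t @ W2.T                 # second intermediate vector, (d',)
        sims = (m + s) @ v             # similarity of each sum vector with v
        return torch.softmax(sims, dim=0)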
According to one implementation, the text prediction model further includes a policy network for outputting the interpolation weight coefficient according to the first latent vector. In such a case, the loss determination unit 65 is further configured to determine the prediction loss according to the comprehensive prediction probability, the (t+1)-th word, the first and second prediction probabilities, and the interpolation weight coefficient.
In one embodiment, the policy network determines the interpolation weight coefficient as follows: at least a policy transformation matrix is applied to the first latent vector to obtain a policy vector, where the policy transformation matrix is a trainable model parameter of the policy network; and the interpolation weight coefficient is determined according to the element value of a predetermined dimension of the policy vector.
In a further embodiment, the policy network obtains the policy vector as follows: a training strategy coefficient is determined according to the current training text; and the policy transformation matrix is applied to the first latent vector, with the result divided by the training strategy coefficient, yielding the policy vector.
Furthermore, in one example, determining the training strategy coefficient by the policy network specifically includes: determining the training strategy coefficient according to the training order number of the current training text in the training sample set, such that the coefficient is negatively correlated with the training order number.
In another example, determining the training strategy coefficient by the policy network specifically includes: determining the training strategy coefficient according to the total text length of the current training text, such that the coefficient is negatively correlated with that total length.
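A sketch of the policy network's computation, assuming the chosen element of the policy vector is normalized by a softmax over its dimensions (the specification leaves the exact normalization open, so this is an assumption):

    import torch

    def interpolation_weight(h_t, W_pi, tau, dim=0):
        """Interpolation weight coefficient lambda_t from the policy network.

        W_pi: trainable policy transformation matrix
        tau:  training strategy coefficient (e.g. decreasing with the
              training order number or the total text length)
        dim:  the predetermined dimension of the policy vector (assumed)
        """
        policy_vec = (W_pi @ h_t) / tau            # policy vector
        return torch.softmax(policy_vec, dim=0)[dim]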
In one embodiment, the loss determination unit 65 is specifically configured to: determine a first loss term according to the comprehensive prediction probability and the (t+1)-th word; determine a second loss term according to the interpolation weight coefficient, where the second loss term is negatively correlated with that coefficient; determine a reward term according to the ratio between the probability values that the second and first prediction probabilities respectively assign to the (t+1)-th word, where the reward term is positively correlated with that ratio; and, using the reward term as the coefficient of the second loss term, sum the first and second loss terms to determine the prediction loss.
The above apparatus realizes the training of the text prediction model.
According to an embodiment of another aspect, a computer-readable storage medium is also provided, on which a computer program is stored; when the computer program is executed in a computer, the computer is caused to perform the method described in conjunction with Fig. 2.
According to an embodiment of yet another aspect, a computing device is also provided, including a memory and a processor; the memory stores executable code, and when the processor executes the executable code, the method described in conjunction with Fig. 2 is implemented.
Those skilled in the art should appreciate that, in one or more of the above examples, the functions described in the present invention can be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions can be stored in a computer-readable medium or transmitted as one or more instructions or pieces of code on a computer-readable medium.
The specific embodiments described above further explain the objectives, technical solutions, and beneficial effects of the present invention in detail. It should be understood that the above are merely specific embodiments of the present invention and are not intended to limit its protection scope; any modification, equivalent replacement, improvement, and the like made on the basis of the technical solutions of the present invention shall fall within the protection scope of the present invention.

Claims (32)

  1. A method for training a text prediction model, the text prediction model comprising a time-series-based first prediction network, a buffer, and a second prediction network based on the buffer, the method comprising:
    after sequentially inputting the first t-1 words of a current training text, inputting the t-th word into the first prediction network, so that the first prediction network determines, according to the state vector after processing the (t-1)-th word and the word vector of the t-th word, the state vector after processing the t-th word as a first latent vector, and determines a first prediction probability for the next word according to the first latent vector;
    reading several existing segment vectors from the buffer, the several existing segment vectors being formed based on the text preceding the t-th word in the current training text, each segment vector corresponding to a text segment of length L words;
    determining, by the second prediction network, a second prediction probability for the next word according to the several segment vectors;
    taking an interpolation weight coefficient as the weighting coefficient of the second prediction probability and the difference of 1 minus the interpolation weight coefficient as the weighting coefficient of the first prediction probability, and performing interpolated weighted synthesis on the first prediction probability and the second prediction probability to obtain a comprehensive prediction probability for the next word;
    determining a prediction loss for the t-th word at least according to the comprehensive prediction probability and the (t+1)-th word in the training text; and
    training the text prediction model according to the prediction losses for the respective words in the current training text.
  2. The method according to claim 1, wherein the first prediction network comprises a recurrent neural network (RNN) or a long short-term memory network (LSTM).
  3. The method according to claim 1, wherein the several segment vectors comprise a first segment vector corresponding to a first text segment, the first text segment comprising the i-th word to the j-th word of the current training text, wherein both i and j are smaller than t, and the first segment vector is obtained based on the difference between a first state vector and a second state vector, the first state vector being the state vector of the first prediction network after processing the j-th word, and the second state vector being the state vector of the first prediction network after processing the (i-1)-th word.
  4. The method according to claim 1 or 3, further comprising:
    if the t-th word is the last word of a current text segment, determining a new segment vector according to the difference between the first latent vector and a second latent vector, wherein the second latent vector is the state vector of the first prediction network after processing the (t-L)-th word; and
    adding the new segment vector to the buffer.
  5. The method according to claim 4, wherein adding the new segment vector to the buffer comprises:
    judging whether the number of segment vectors already in the buffer reaches a predetermined threshold number; and
    if the predetermined threshold number is reached, deleting the earliest stored segment vector and storing the new segment vector in the buffer.
  6. The method according to claim 1, wherein the second prediction network determining the second prediction probability for the next word according to the several segment vectors comprises:
    determining several attention coefficients respectively corresponding to the several segment vectors;
    weighting and combining the several segment vectors, with the several attention coefficients as weighting factors, to obtain a context vector; and
    obtaining the second prediction probability according to the context vector and a linear transformation matrix.
  7. The method according to claim 6, wherein determining the first prediction probability for the next word according to the first latent vector comprises:
    obtaining the first prediction probability according to the first latent vector and the linear transformation matrix.
  8. The method according to claim 6, wherein determining the several attention coefficients respectively corresponding to the several segment vectors comprises:
    determining an i-th attention coefficient according to the similarity between an arbitrary i-th segment vector among the several segment vectors and the first latent vector.
  9. The method according to claim 6, wherein determining the several attention coefficients respectively corresponding to the several segment vectors comprises:
    transforming an arbitrary i-th segment vector among the several segment vectors into a first intermediate vector using a first transformation matrix;
    transforming the first latent vector into a second intermediate vector using a second transformation matrix;
    determining the similarity between the sum vector of the first intermediate vector and the second intermediate vector and a third vector; and
    determining an i-th attention coefficient according to the similarity;
    wherein the first transformation matrix, the second transformation matrix, and the third vector are all trainable network parameters in the second prediction network.
  10. The method according to claim 1, wherein the text prediction model further comprises a policy network, and before performing interpolated weighted synthesis on the first prediction probability and the second prediction probability, the method further comprises:
    outputting, by the policy network, the interpolation weight coefficient according to the first latent vector;
    wherein determining the prediction loss at least according to the comprehensive prediction probability and the (t+1)-th word in the training text comprises: determining the prediction loss according to the comprehensive prediction probability, the (t+1)-th word, the first prediction probability and the second prediction probability, and the interpolation weight coefficient.
  11. The method according to claim 10, wherein the policy network outputting the interpolation weight coefficient according to the first latent vector comprises:
    applying at least a policy transformation matrix to the first latent vector to obtain a policy vector, wherein the policy transformation matrix is a trainable model parameter in the policy network; and
    determining the interpolation weight coefficient according to the element value of a predetermined dimension in the policy vector.
  12. The method according to claim 11, wherein applying at least the policy transformation matrix to the first latent vector to obtain the policy vector comprises:
    determining a training strategy coefficient according to the current training text; and
    applying the policy transformation matrix to the first latent vector and dividing by the training strategy coefficient to obtain the policy vector.
  13. The method according to claim 12, wherein determining the training strategy coefficient according to the current training text comprises:
    determining the training strategy coefficient according to the training order number of the current training text in a training sample set, such that the training strategy coefficient is negatively correlated with the training order number.
  14. The method according to claim 12, wherein determining the training strategy coefficient according to the current training text comprises:
    determining the training strategy coefficient according to the total text length of the current training text, such that the training strategy coefficient is negatively correlated with the total text length.
  15. The method according to claim 10, wherein determining the prediction loss according to the first prediction probability, the second prediction probability, the comprehensive prediction probability, the (t+1)-th word, and the interpolation weight coefficient comprises:
    determining a first loss term according to the comprehensive prediction probability and the (t+1)-th word;
    determining a second loss term according to the interpolation weight coefficient, wherein the second loss term is negatively correlated with the interpolation weight coefficient;
    determining a reward term according to the ratio of the probability values of the second prediction probability and the first prediction probability for the (t+1)-th word, the reward term being positively correlated with the ratio; and
    summing the first loss term and the second loss term, with the reward term as the coefficient of the second loss term, thereby determining the prediction loss.
  16. A training apparatus for a text prediction model, the text prediction model comprising a time-series-based first prediction network, a buffer, and a second prediction network based on the buffer, the apparatus comprising:
    a first prediction unit, configured to input, after the first t-1 words of a current training text have been sequentially input, the t-th word into the first prediction network, so that the first prediction network determines, according to the state vector after processing the (t-1)-th word and the word vector of the t-th word, the state vector after processing the t-th word as a first latent vector, and determines a first prediction probability for the next word according to the first latent vector;
    a reading unit, configured to read several existing segment vectors from the buffer, the several existing segment vectors being formed based on the text preceding the t-th word in the current training text, each segment vector corresponding to a text segment of length L words;
    a second prediction unit, configured to cause the second prediction network to determine a second prediction probability for the next word according to the several segment vectors;
    a synthesis unit, configured to take an interpolation weight coefficient as the weighting coefficient of the second prediction probability and the difference of 1 minus the interpolation weight coefficient as the weighting coefficient of the first prediction probability, and to perform interpolated weighted synthesis on the first prediction probability and the second prediction probability to obtain a comprehensive prediction probability for the next word;
    a loss determination unit, configured to determine a prediction loss for the t-th word at least according to the comprehensive prediction probability and the (t+1)-th word in the training text; and
    a training unit, configured to train the text prediction model according to the prediction losses for the respective words in the current training text.
  17. The apparatus according to claim 16, wherein the first prediction network comprises a recurrent neural network (RNN) or a long short-term memory network (LSTM).
  18. The apparatus according to claim 16, wherein the several segment vectors comprise a first segment vector corresponding to a first text segment, the first text segment comprising the i-th word to the j-th word of the current training text, wherein both i and j are smaller than t, and the first segment vector is obtained based on the difference between a first state vector and a second state vector, the first state vector being the state vector of the first prediction network after processing the j-th word, and the second state vector being the state vector of the first prediction network after processing the (i-1)-th word.
  19. The apparatus according to claim 16 or 18, further comprising a storage unit configured to:
    if the t-th word is the last word of a current text segment, determine a new segment vector according to the difference between the first latent vector and a second latent vector, wherein the second latent vector is the state vector of the first prediction network after processing the (t-L)-th word; and
    add the new segment vector to the buffer.
  20. The apparatus according to claim 19, wherein the storage unit is further configured to:
    judge whether the number of segment vectors already in the buffer reaches a predetermined threshold number; and
    if the predetermined threshold number is reached, delete the earliest stored segment vector and store the new segment vector in the buffer.
  21. The apparatus according to claim 16, wherein the second prediction network is specifically configured to:
    determine several attention coefficients respectively corresponding to the several segment vectors;
    weight and combine the several segment vectors, with the several attention coefficients as weighting factors, to obtain a context vector; and
    obtain the second prediction probability according to the context vector and a linear transformation matrix.
  22. The apparatus according to claim 21, wherein the first prediction network is specifically configured to:
    obtain the first prediction probability according to the first latent vector and the linear transformation matrix.
  23. The apparatus according to claim 21, wherein the second prediction network is specifically configured to:
    determine an i-th attention coefficient according to the similarity between an arbitrary i-th segment vector among the several segment vectors and the first latent vector.
  24. The apparatus according to claim 21, wherein the second prediction network is specifically configured to:
    transform an arbitrary i-th segment vector among the several segment vectors into a first intermediate vector using a first transformation matrix;
    transform the first latent vector into a second intermediate vector using a second transformation matrix;
    determine the similarity between the sum vector of the first intermediate vector and the second intermediate vector and a third vector; and
    determine an i-th attention coefficient according to the similarity;
    wherein the first transformation matrix, the second transformation matrix, and the third vector are all trainable network parameters in the second prediction network.
  25. The apparatus according to claim 16, wherein the text prediction model further comprises a policy network for outputting the interpolation weight coefficient according to the first latent vector; and
    the loss determination unit is configured to determine the prediction loss according to the comprehensive prediction probability, the (t+1)-th word, the first prediction probability and the second prediction probability, and the interpolation weight coefficient.
  26. The apparatus according to claim 25, wherein the policy network is specifically configured to:
    apply at least a policy transformation matrix to the first latent vector to obtain a policy vector, wherein the policy transformation matrix is a trainable model parameter in the policy network; and
    determine the interpolation weight coefficient according to the element value of a predetermined dimension in the policy vector.
  27. The apparatus according to claim 26, wherein the policy network obtaining the policy vector specifically comprises:
    determining a training strategy coefficient according to the current training text; and
    applying the policy transformation matrix to the first latent vector and dividing by the training strategy coefficient to obtain the policy vector.
  28. The apparatus according to claim 27, wherein the policy network determining the training strategy coefficient specifically comprises:
    determining the training strategy coefficient according to the training order number of the current training text in a training sample set, such that the training strategy coefficient is negatively correlated with the training order number.
  29. The apparatus according to claim 27, wherein the policy network determining the training strategy coefficient specifically comprises:
    determining the training strategy coefficient according to the total text length of the current training text, such that the training strategy coefficient is negatively correlated with the total text length.
  30. The apparatus according to claim 25, wherein the loss determination unit is configured to:
    determine a first loss term according to the comprehensive prediction probability and the (t+1)-th word;
    determine a second loss term according to the interpolation weight coefficient, wherein the second loss term is negatively correlated with the interpolation weight coefficient;
    determine a reward term according to the ratio of the probability values of the second prediction probability and the first prediction probability for the (t+1)-th word, the reward term being positively correlated with the ratio; and
    sum the first loss term and the second loss term, with the reward term as the coefficient of the second loss term, thereby determining the prediction loss.
  31. A computer-readable storage medium, on which a computer program is stored, wherein when the computer program is executed in a computer, the computer is caused to perform the method of any one of claims 1-15.
  32. A computing device, comprising a memory and a processor, wherein executable code is stored in the memory, and when the processor executes the executable code, the method of any one of claims 1-15 is implemented.
PCT/CN2020/132617 2020-02-06 2020-11-30 Text prediction model training method and apparatus WO2021155705A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010081187.8 2020-02-06
CN202010081187.8A CN111274789B (en) 2020-02-06 2020-02-06 Training method and device of text prediction model

Publications (1)

Publication Number Publication Date
WO2021155705A1 true WO2021155705A1 (en) 2021-08-12

Family

ID=71000235

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/132617 WO2021155705A1 (en) 2020-02-06 2020-11-30 Text prediction model training method and apparatus

Country Status (2)

Country Link
CN (1) CN111274789B (en)
WO (1) WO2021155705A1 (en)

Cited By (2)

Publication number Priority date Publication date Assignee Title
CN116362418A (en) * 2023-05-29 2023-06-30 天能电池集团股份有限公司 Online prediction method for application-level manufacturing capacity of intelligent factory of high-end battery
CN117540326A (en) * 2024-01-09 2024-02-09 深圳大学 Construction state abnormality identification method and system for tunnel construction equipment by drilling and blasting method

Families Citing this family (5)

Publication number Priority date Publication date Assignee Title
CN111274789B (en) * 2020-02-06 2021-07-06 支付宝(杭州)信息技术有限公司 Training method and device of text prediction model
CN111597819B (en) * 2020-05-08 2021-01-26 河海大学 Dam defect image description text generation method based on keywords
CN111767708A (en) * 2020-07-09 2020-10-13 北京猿力未来科技有限公司 Training method and device of problem solving model and generation method and device of problem solving formula
CN113095040A (en) * 2021-04-16 2021-07-09 支付宝(杭州)信息技术有限公司 Coding network training method, text coding method and system
CN116861258B (en) * 2023-08-31 2023-12-01 腾讯科技(深圳)有限公司 Model processing method, device, equipment and storage medium

Citations (6)

Publication number Priority date Publication date Assignee Title
CN104813275A (en) * 2012-09-27 2015-07-29 谷歌公司 Methods and systems for predicting a text
CN105279552A (en) * 2014-06-18 2016-01-27 清华大学 Character based neural network training method and device
CN108984526A (en) * 2018-07-10 2018-12-11 北京理工大学 A kind of document subject matter vector abstracting method based on deep learning
CN110457674A (en) * 2019-06-25 2019-11-15 西安电子科技大学 A kind of text prediction method of theme guidance
US20190354850A1 (en) * 2018-05-17 2019-11-21 International Business Machines Corporation Identifying transfer models for machine learning tasks
CN111274789A (en) * 2020-02-06 2020-06-12 支付宝(杭州)信息技术有限公司 Training method and device of text prediction model

Family Cites Families (17)

Publication number Priority date Publication date Assignee Title
US7478171B2 (en) * 2003-10-20 2009-01-13 International Business Machines Corporation Systems and methods for providing dialog localization in a distributed environment and enabling conversational communication using generalized user gestures
GB201418402D0 (en) * 2014-10-16 2014-12-03 Touchtype Ltd Text prediction integration
US9607616B2 (en) * 2015-08-17 2017-03-28 Mitsubishi Electric Research Laboratories, Inc. Method for using a multi-scale recurrent neural network with pretraining for spoken language understanding tasks
WO2018065158A1 (en) * 2016-10-06 2018-04-12 Siemens Aktiengesellschaft Computer device for training a deep neural network
US10803252B2 (en) * 2018-06-30 2020-10-13 Wipro Limited Method and device for extracting attributes associated with centre of interest from natural language sentences
CN108984745B (en) * 2018-07-16 2021-11-02 福州大学 Neural network text classification method fusing multiple knowledge maps
CN109597997B (en) * 2018-12-07 2023-05-02 上海宏原信息科技有限公司 Comment entity and aspect-level emotion classification method and device and model training thereof
CN109858031B (en) * 2019-02-14 2023-05-23 北京小米智能科技有限公司 Neural network model training and context prediction method and device
CN110032630B (en) * 2019-03-12 2023-04-18 创新先进技术有限公司 Dialectical recommendation device and method and model training device
CN109992771B (en) * 2019-03-13 2020-05-05 北京三快在线科技有限公司 Text generation method and device
CN110096698B (en) * 2019-03-20 2020-09-29 中国地质大学(武汉) Topic-considered machine reading understanding model generation method and system
CN110059262B (en) * 2019-04-19 2021-07-02 武汉大学 Project recommendation model construction method and device based on hybrid neural network and project recommendation method
CN110427466B (en) * 2019-06-12 2023-05-26 创新先进技术有限公司 Training method and device for neural network model for question-answer matching
CN110413753B (en) * 2019-07-22 2020-09-22 阿里巴巴集团控股有限公司 Question-answer sample expansion method and device
CN110704890A (en) * 2019-08-12 2020-01-17 上海大学 Automatic text causal relationship extraction method fusing convolutional neural network and cyclic neural network
CN110442723B (en) * 2019-08-14 2020-05-15 山东大学 Method for multi-label text classification based on multi-step discrimination Co-Attention model
CN110705294B (en) * 2019-09-11 2023-06-23 苏宁云计算有限公司 Named entity recognition model training method, named entity recognition method and named entity recognition device

Also Published As

Publication number Publication date
CN111274789B (en) 2021-07-06
CN111274789A (en) 2020-06-12


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 20917616; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 20917616; Country of ref document: EP; Kind code of ref document: A1)