WO2021155705A1 - Method and apparatus for training a text prediction model - Google Patents

Method and apparatus for training a text prediction model Download PDF

Info

Publication number
WO2021155705A1
WO2021155705A1 PCT/CN2020/132617 CN2020132617W WO2021155705A1 WO 2021155705 A1 WO2021155705 A1 WO 2021155705A1 CN 2020132617 W CN2020132617 W CN 2020132617W WO 2021155705 A1 WO2021155705 A1 WO 2021155705A1
Authority
WO
WIPO (PCT)
Prior art keywords
vector
prediction
word
text
segment
Prior art date
Application number
PCT/CN2020/132617
Other languages
English (en)
Chinese (zh)
Inventor
李扬名
姚开盛
Original Assignee
支付宝(杭州)信息技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 支付宝(杭州)信息技术有限公司
Publication of WO2021155705A1

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Definitions

  • One or more embodiments of this specification relate to the field of machine learning, and in particular to a training method and device for a text prediction model.
  • For example, the text classification task can be used in an intelligent question-answering customer service system to classify the question raised by a user, taken as the input text, for user intention recognition, automatic question answering, or dispatch to manual customer service.
  • Text classification can also be used in various application scenarios, such as document data classification, public opinion analysis, spam identification, and so on.
  • Machine translation tasks between different languages are widely used in various automatic translation systems.
  • the language model is the basic model for performing the above-mentioned various specific natural language processing tasks.
  • Language models need to be trained based on a large amount of corpus data.
  • Text prediction, that is, predicting subsequent text based on existing text, is a basic task for training language models.
  • One or more embodiments of this specification describe a text prediction model and its training method, in which the local context and the long-range context are jointly used for prediction, thereby comprehensively improving the text prediction model's ability to understand text and its prediction accuracy for subsequent text.
  • According to a first aspect, a method for training a text prediction model is provided, where the text prediction model includes a first prediction network based on time series, a buffer, and a second prediction network based on the buffer. The method includes: after sequentially inputting the first t-1 words in the current training text, inputting the t-th word into the first prediction network, so that the first prediction network determines, according to the state vector after processing the (t-1)-th word and the word vector of the t-th word, the state vector after processing the t-th word as a first hidden vector, and determines a first prediction probability for the next word according to the first hidden vector; reading several existing segment vectors from the buffer, where the existing segment vectors are formed based on the text before the t-th word in the current training text, and each segment vector corresponds to a text segment with a length of L words; the second prediction network determining a second prediction probability for the next word according to the several segment vectors; using an interpolation weight coefficient as the weight coefficient of the second prediction probability, and one minus the interpolation weight coefficient as the weight coefficient of the first prediction probability,
  • performing interpolation weighted synthesis on the first prediction probability and the second prediction probability to obtain a comprehensive prediction probability for the next word; determining the prediction loss for the t-th word at least according to the comprehensive prediction probability and the (t+1)-th word in the training text; and training the text prediction model according to the prediction loss for each word in the current training text.
  • the first prediction network includes a recurrent neural network RNN or a long short-term memory network LSTM.
  • In one embodiment, the several segment vectors stored in the buffer include a first segment vector corresponding to any first text segment, where the first text segment includes the i-th word to the j-th word of the current training text, with i and j both smaller than t. The first segment vector is obtained based on the difference between a first state vector and a second state vector, where the first state vector is the state vector after the first prediction network processes the j-th word,
  • and the second state vector is the state vector after the first prediction network processes the (i-1)-th word.
  • According to one embodiment, the above method further includes: if the t-th word is the last word of the current text segment, determining a newly added segment vector according to the difference between the first latent vector and a second latent vector, where the second latent vector is the state vector after the first prediction network processes the (t-L)-th word; and adding the newly added segment vector to the buffer.
  • In one embodiment, the buffer has a limited storage capacity. In this case, before adding a newly added segment vector to the buffer, it is first determined whether the number of segment vectors already in the buffer reaches a predetermined threshold number; if the predetermined threshold number is reached, the earliest stored segment vector is deleted, and the newly added segment vector is stored in the buffer.
  • According to one embodiment, the second prediction network determines the second prediction probability for the next word in the following manner: determining several attention coefficients corresponding to the several segment vectors; taking the several attention coefficients as weighting factors, weighting and combining the several segment vectors to obtain a context vector; and obtaining the second prediction probability according to the context vector and a linear transformation matrix.
  • the first prediction network obtains the first prediction probability according to the first hidden vector and the linear transformation matrix.
  • Further, in an embodiment, the second prediction network determines the attention coefficient in the following manner: according to the similarity between any i-th segment vector in the several segment vectors and the first latent vector, determining the i-th attention coefficient.
  • In another embodiment, the second prediction network determines the attention coefficient in the following manner: using a first transformation matrix to transform any i-th segment vector in the several segment vectors into a first intermediate vector; using a second transformation matrix to transform the first hidden vector into a second intermediate vector; determining the similarity between the sum vector of the first intermediate vector and the second intermediate vector, and a third vector; and determining the i-th attention coefficient according to the similarity; where the first transformation matrix, the second transformation matrix, and the third vector are all trainable network parameters in the second prediction network.
  • According to one implementation, the text prediction model further includes a strategy network; before performing interpolation weighted synthesis on the first prediction probability and the second prediction probability, the method further includes: the strategy network outputting the interpolation weight coefficient according to the first latent vector. In this case, the step of determining the prediction loss specifically includes: determining the prediction loss according to the comprehensive prediction probability, the (t+1)-th word, the first prediction probability, the second prediction probability, and the interpolation weight coefficient.
  • Further, in an embodiment, the strategy network determines the interpolation weight coefficient in the following manner: applying at least a strategy transformation matrix to the first latent vector to obtain a strategy vector, where the strategy transformation matrix is a trainable model parameter in the strategy network; and determining the interpolation weight coefficient according to the element value of a predetermined dimension in the strategy vector.
  • Further, in an embodiment, the strategy network obtains the strategy vector by: determining a training strategy coefficient according to the current training text; and applying the strategy transformation matrix to the first implicit vector and dividing by the training strategy coefficient to obtain the strategy vector.
  • the training strategy coefficient may be determined according to the training sequence number of the current training text in the training sample set, so that the training strategy coefficient is negatively correlated with the training sequence number.
  • the training strategy coefficient may be determined according to the total text length of the current training text, so that the training strategy coefficient is negatively correlated with the total text length.
  • According to one embodiment, the step of determining the prediction loss specifically includes: determining a first loss term according to the comprehensive prediction probability and the (t+1)-th word; determining a second loss term according to the interpolation weight coefficient, where the second loss term is negatively correlated with the interpolation weight coefficient; determining a reward term according to the ratio of the probability values that the second prediction probability and the first prediction probability assign to the (t+1)-th word, where the reward term is positively correlated with the ratio; and, taking the reward term as the coefficient of the second loss term, summing the first loss term and the second loss term to determine the prediction loss.
  • According to another aspect, a training device for a text prediction model is provided, where the text prediction model includes a first prediction network based on time series, a buffer, and a second prediction network based on the buffer. The device includes: a first prediction unit configured to input the t-th word into the first prediction network after the first t-1 words in the current training text have been sequentially input, so that the first prediction network determines, according to the state vector after processing the (t-1)-th word and the word vector of the t-th word, the state vector after processing the t-th word as the first hidden vector, and determines the first prediction probability for the next word according to the first hidden vector; a reading unit configured to read several existing segment vectors from the buffer, where the existing segment vectors are formed based on the text before the t-th word in the current training text, and each segment vector corresponds to a text segment with a length of L words; a second prediction unit configured to enable the second prediction network to determine a second prediction probability for the next word according to the several segment vectors;
  • a synthesis unit configured to use the interpolation weight coefficient as the weight coefficient of the second prediction probability, and one minus the interpolation weight coefficient as the weight coefficient of the first prediction probability, so that the first prediction probability and the second prediction probability are interpolated, weighted and integrated to obtain the comprehensive predicted probability for the next word;
  • a loss determination unit configured to determine the prediction loss for the t-th word at least according to the comprehensive predicted probability and the (t+1)-th word in the training text;
  • and a training unit configured to train the text prediction model according to the prediction loss for each word in the current training text.
  • a computer-readable storage medium having a computer program stored thereon, and when the computer program is executed in a computer, the computer is caused to execute the method of the first aspect.
  • a computing device including a memory and a processor, characterized in that executable code is stored in the memory, and when the processor executes the executable code, the method of the first aspect is implemented.
  • In this way, the segment vectors of the previous text segments in the buffer are also used as long-range context information, and the second prediction network makes predictions based on the long-range context.
  • the strategy network can be used to generate an interpolation weight coefficient for the current word.
  • FIG. 1 is a schematic diagram of a text prediction model according to an embodiment disclosed in this specification
  • Fig. 2 shows a flowchart of a method for training a text prediction model according to an embodiment
  • Figure 3 shows an example of performing prediction processing for a specific training text
  • Fig. 4 shows a schematic diagram of determining a segment vector of a text segment according to an embodiment
  • Fig. 5 shows a flow of steps for determining a second predicted probability according to an embodiment
  • Fig. 6 shows a schematic block diagram of a training device for a text prediction model according to an embodiment.
  • text prediction is a basic task of natural language processing. Accordingly, it is hoped to train a text prediction model with higher prediction accuracy.
  • For this purpose, a neural network model based on time sequence is typically used, such as the recurrent neural network RNN, the long short-term memory network LSTM, and the gated recurrent unit GRU_RNN.
  • a new text prediction model and its training method are proposed.
  • the model divides the input text into text fragments, and stores the characterization vector of the text fragments in the buffer as a long-range context.
  • the implicit vector corresponding to the current word and the representation vector stored in the buffer are comprehensively considered for prediction.
  • Fig. 1 is a schematic diagram of a text prediction model according to an embodiment disclosed in this specification.
  • the text prediction model includes a first prediction network 11 based on a time sequence, a buffer 12, a second prediction network 13 based on a buffer, and optionally a strategy network 14.
  • The first prediction network 11 includes a time-series neural network, such as an RNN, LSTM, or GRU_RNN. According to the working mode of the time-series neural network, when the training text is input into the text prediction model, the first prediction network 11 reads the words in the training text in turn and performs iterative processing on each word in turn. When performing iterative processing on each word W_t, the state vector h_t after the iterative processing of the current word is obtained according to the state vector h_{t-1} after processing the previous word W_{t-1} and the word vector of the current word.
  • The first prediction network 11 may also include a multi-layer perceptron MLP, which obtains the first prediction result p for the next word based on the state vector h_t corresponding to the current word.
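  • For illustration only, the following sketch shows one possible form of this iterative computation; it assumes an LSTM cell as the time-series network and a single linear layer followed by softmax as the MLP, and all names (FirstPredictionNetwork, vocab_size, hidden_dim) are illustrative rather than part of the embodiments.

```python
import torch
import torch.nn as nn

class FirstPredictionNetwork(nn.Module):
    """Sketch of the time-series prediction network 11: an LSTM cell plays the
    role of the state transition, and a linear layer + softmax plays the MLP."""

    def __init__(self, vocab_size: int, embed_dim: int, hidden_dim: int):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)  # word vectors
        self.cell = nn.LSTMCell(embed_dim, hidden_dim)        # state transition
        self.output = nn.Linear(hidden_dim, vocab_size)       # projection to V dimensions

    def step(self, word_id, state=None):
        """Process the t-th word given the state after the (t-1)-th word.
        Returns the first prediction probability p and the new state."""
        x_t = self.embedding(word_id)                 # word vector of the t-th word
        h_t, c_t = self.cell(x_t, state)              # state vector h_t (first hidden vector)
        p = torch.softmax(self.output(h_t), dim=-1)   # first prediction probability p
        return p, (h_t, c_t)
```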
  • the buffer 12 is used to store the characterization vector of the text segment (span) before the current word, that is, the segment vector.
  • the length L of the text segment can be a predetermined length, for example, 2 words, 3 words, 5 words, and so on.
  • For a text segment consisting of the i-th word to the j-th word, the segment vector may be obtained as the difference between the state vector corresponding to the j-th word output by the first prediction network 11 and the state vector corresponding to the (i-1)-th word.
  • the second prediction network 13 performs prediction operations based on the existing segment vectors stored in the buffer 12 to obtain the second prediction result q for the next word.
  • the second prediction result q reflects the prediction result based on the long-range context.
  • The interpolation weight coefficient λ can be used to interpolate and synthesize the two, to obtain a comprehensive prediction result.
  • the above interpolation weight coefficients can be preset hyperparameters or trainable parameters.
  • In one embodiment, the interpolation weight coefficient is different for each word and is determined by the strategy network 14. Specifically, the strategy network 14 obtains the state vector h_t corresponding to the current word from the first prediction network 11 and performs operations based on this state vector to obtain the interpolation weight coefficient λ for the current word, which is used for the synthesis of the first prediction result and the second prediction result.
  • the text prediction model shown in Figure 1 has at least the following characteristics.
  • the segment vectors corresponding to the text segment before the current word are also stored in the buffer, and these segment vectors are used as the long-range context to perform prediction based on the long-range context.
  • the final prediction result is a combination of the two parts of the prediction.
  • the strategy network can be used to dynamically adjust the proportion of long-range prediction results, thereby further improving the accuracy of prediction.
  • Fig. 2 shows a flowchart of a method for training a text prediction model according to an embodiment. It can be understood that the text prediction model has the structure shown in Fig. 1, and the method can be executed by any apparatus, device, platform, or device cluster with computing and processing capabilities.
  • the following preparatory process can be performed in advance.
  • The training corpus, that is, the training sample set, is obtained, which includes a large amount of training text.
  • word embedding is performed on the training text, and each word in the training text is converted into a word vector, thereby converting the training text into a word vector sequence.
  • word embedding can be realized by one-hot encoding.
  • the dimension of each word vector corresponds to the number V of words in the lexicon.
  • the conversion of word vectors can also be realized by other word embedding methods, for example, the word2vec method, and so on.
  • the training text is Chinese text.
  • the training text can be segmented first, and then word embedding can be performed for each word after the segmentation.
  • each Chinese character is directly processed as a word. Therefore, the "word” in the following includes the case of single Chinese characters.
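  • As a minimal sketch of this preparation step (treating each Chinese character as a word and using one-hot encoding; the function names are illustrative only):

```python
import numpy as np

def build_vocab(texts):
    """Assign each distinct character (treated as a word) an index in the lexicon."""
    vocab = {}
    for text in texts:
        for ch in text:
            vocab.setdefault(ch, len(vocab))
    return vocab

def one_hot_sequence(text, vocab):
    """Convert a training text into a sequence of V-dimensional one-hot word vectors."""
    V = len(vocab)
    seq = np.zeros((len(text), V), dtype=np.float32)
    for t, ch in enumerate(text):
        seq[t, vocab[ch]] = 1.0
    return seq
```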
  • the training text can be input to the text prediction model for prediction and training.
  • the basic network of the text prediction model is still a time-series neural network. Therefore, for the current training text, each word (more specifically, a word vector) is input into the text prediction model in turn.
  • the text prediction model performs prediction processing on each input word in turn. The following describes the prediction processing process and training process of the text prediction model in combination with any t-th word in the training text.
  • In step 21, the t-th word in the current training text is input into the first prediction network of the text prediction model. It can be understood that, before this, the first t-1 words in the current training text have been sequentially input into the text prediction model.
  • the first prediction network includes a time-series neural network, which jointly determines the state at the next moment according to the state at the previous moment and the current input.
  • Specifically, the first prediction network determines the state vector h_t after processing the t-th word according to the state vector h_{t-1} after processing the (t-1)-th word and the word vector x_t of the t-th word. This process can be expressed by the following formula (1):
h_t = f(h_{t-1}, x_t)    (1)
  • where f is the state transition function, whose specific form depends on the network form of the time-series neural network, such as RNN or LSTM.
  • the dimension of the state vector is denoted as d dimension.
  • the state vector h t after processing the current t-th word is called the first hidden vector.
  • The first prediction network may also include a multilayer perceptron MLP, which is used to determine the first prediction probability p for the next word according to the first hidden vector h_t. More specifically, the first predicted probability p may be the probability distribution, over each word in the vocabulary, of the next word. Assuming that the number of words in the lexicon is V, the first predicted probability p can be expressed as a V-dimensional vector.
  • In order to determine the first prediction probability p, the MLP first applies a linear transformation matrix O_{t+1} to the first hidden vector h_t.
  • The linear transformation matrix is a trainable parameter matrix.
  • Through it, the d-dimensional hidden vector h_t is transformed, or projected, into a V-dimensional vector.
  • After normalization, the first predicted probability p for the next word can be expressed as:
p = softmax(O_{t+1} h_t)    (2)
  • Figure 3 shows an example of performing prediction processing on a specific training text.
  • the current input is the 92nd word "no" in the training text.
  • Based on the state vector h_91 after processing the 91st word "have" and the word vector corresponding to the 92nd word "no", the temporal neural network obtains the state vector h_92 corresponding to the 92nd word.
  • the MLP obtains the first predicted probability p for the next word, that is, the 93rd word.
  • the prediction result obtained according to the state vector of the time series neural network more reflects the influence of the local context closer to the current word on the understanding of the current word meaning.
  • In this example, the prediction results of the first prediction network will tend to output higher prediction probabilities for common collocation words in the local context, such as "trouble" or "idea".
  • In step 22, several existing segment vectors are read from the buffer; these segment vectors are formed based on the text before the t-th word in the current training text, and each segment vector corresponds to a text segment with a length of L consecutive words.
  • several text fragments can be formed according to the length L, and the characterization vectors of these text fragments, that is, the fragment vectors, are stored in the buffer as long-range context information .
  • The length L of the text segment can be preset according to needs. For example, for longer training texts a longer segment length can be set, such as 8 words or 10 words, while for shorter training texts a shorter segment length can be set, such as 2 words or 3 words.
  • Thus, the first t-1 words can form several text segments m_{ij} according to the preset length L, where i is the word sequence number at the beginning of the text segment and j is the word sequence number at the end of the text segment.
  • the characterization vector of the text segment that is, the segment vector, can be obtained based on the state vector when the first prediction network processes each preceding word.
  • In one embodiment, the segment vector is obtained based on the difference between a first state vector and a second state vector, where the first state vector is the state vector h_j after the first prediction network processes the j-th word, that is, the state vector after the end word (the j-th word) of the text segment m_{ij} is processed, and the second state vector is the state vector h_{i-1} after the first prediction network processes the (i-1)-th word, that is, the state vector before the start word (the i-th word) of the text segment m_{ij} is processed.
  • Fig. 4 shows a schematic diagram of determining a segment vector of a text segment according to an embodiment.
  • a text segment is formed with 2 words as the segment length.
  • For example, for the text segment formed by the 12th and 13th words, the segment vector can be determined as h_13 - h_11, where h_13 is the state vector after the time-series neural network processes the 13th word,
  • and h_11 is the state vector after the time-series neural network processes the 11th word (that is, before the 12th word), in other words, the state vector at the end of the previous text segment.
  • In another embodiment, the state vectors after the first prediction network processes each word from the i-th word to the j-th word are obtained, yielding L state vectors; these L state vectors are summed or averaged and used as the segment vector corresponding to the text segment m_{ij}.
  • the fragment vector can also be obtained in other ways.
  • In the above embodiments, each segment vector is calculated based on the state vectors obtained when the time-series neural network processes the preceding words. In this way, the processing results of the first prediction network can be reused and the calculation of segment vectors is simplified.
  • segment vectors can be obtained in the process of sequentially iteratively processing each word of the current training text by the first prediction network.
  • Specifically, a counter with a cycle of L can be set to count the words processed by the first prediction network. As the processed words accumulate one by one, the counter is incremented. Each time L words are accumulated, a new text segment is formed, the counter is cleared and counting starts again; at this time, the segment vector of the newly added text segment is calculated and stored in the buffer.
  • In other words, it can be determined whether the t-th word is the last word of the current text segment; specifically, it can be judged whether the count of the counter reaches L. If the t-th word is the last word of the current text segment, the current text segment is regarded as a newly added text segment, and the segment vector of the newly added text segment is calculated. Specifically, in an embodiment, the newly added segment vector may be determined according to the difference between the aforementioned first hidden vector h_t and a second hidden vector h_{t-L}, where the second hidden vector h_{t-L} is the state vector after the first prediction network processes the (t-L)-th word. Then, the newly added segment vector is added to the buffer.
  • the buffer used to store each segment vector of the previous text has a limited capacity size B. Accordingly, the buffer can only store a limited number of N segment vectors. In this case, the buffer can be made to store the segment vectors of the N text segments closest to the currently processed word. Specifically, in one embodiment, when adding a newly-added segment vector to the buffer, it is first determined whether the number of several segment vectors already in the buffer reaches the above-mentioned threshold number N, and if it does not reach the threshold number N, the newly-added segment vector is directly added. The fragment vector is added to the buffer; if the number of existing fragment vectors has reached the threshold number N, the earliest stored fragment vector is deleted, and the newly added fragment vector is stored in the buffer.
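  • Purely as an illustration of the buffer maintenance described above (the class name SegmentBuffer and the use of a deque are assumptions, not part of the embodiments), the counter-driven segmentation and the eviction of the earliest segment vector could be sketched as follows.

```python
from collections import deque

class SegmentBuffer:
    """Holds at most `capacity` segment vectors, one per L-word text segment,
    each computed as h_t - h_{t-L} (one of the embodiments described above)."""

    def __init__(self, segment_length: int, capacity: int):
        self.L = segment_length
        self.counter = 0                        # words seen since the last segment closed
        self.vectors = deque(maxlen=capacity)   # appending when full drops the earliest vector

    def observe(self, t, state_history):
        """Call after the first prediction network has processed word t.
        `state_history[k]` is the state vector h_k (index 0 holds the initial state)."""
        self.counter += 1
        if self.counter == self.L:              # the t-th word closes a text segment
            new_segment_vector = state_history[t] - state_history[t - self.L]
            self.vectors.append(new_segment_vector)
            self.counter = 0

    def read(self):
        """Return the existing segment vectors, i.e. the long-range context."""
        return list(self.vectors)
```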
  • the current input is the 92nd word "no" in the training text.
  • multiple segment vectors based on the text before the 92nd word have been stored in the buffer, where each segment vector corresponds to a text segment formed by three consecutive words.
  • the text fragment closest to the current word is the text fragment m 89-91 from the 89th word to the 91st word. Due to the limited capacity of the buffer, the earliest segment vector stored therein corresponds to the text segment m 16-18 , that is, the text segment formed from the 16th word to the 18th word.
  • segment vectors stored in the buffer can represent text segments that are far away from the current word. Therefore, these segment vectors can be used as long-range context information to help understand the semantics of the current word, and then help predict the next word.
  • the second prediction network is used to determine the second prediction probability q for the next word according to several segment vectors stored in the buffer.
  • the second prediction network can use the attention mechanism to integrate several existing segment vectors into a context vector, and then determine the second prediction probability q based on the context vector.
  • Fig. 5 shows a flow of steps for determining the second predicted probability according to an embodiment.
  • In step 51, several attention coefficients corresponding to the several segment vectors are determined. Specifically, for any i-th segment vector s_i among the several segment vectors, the corresponding attention coefficient α_{t,i} can be determined based on a similarity measure.
  • In one embodiment, the similarity β_{t,i} between the i-th segment vector s_i and the first hidden vector h_t can be determined, where the similarity can be, for example, cosine similarity or a similarity determined based on the Euclidean distance. Then, the i-th attention coefficient α_{t,i} is determined according to the similarity β_{t,i}.
  • the softmax function can be used to normalize the similarity corresponding to each segment vector to obtain the corresponding attention coefficient.
  • For example, the i-th attention coefficient α_{t,i} can be determined as:
α_{t,i} = exp(β_{t,i}) / Σ_k exp(β_{t,k})    (3)
  • In another embodiment, the corresponding similarity is determined in the following manner.
  • A first transformation matrix W_s is used to transform the i-th segment vector s_i into a first intermediate vector W_s s_i;
  • a second transformation matrix W_h is used to transform the first hidden vector h_t into a second intermediate vector W_h h_t; the similarity β_{t,i} between the sum vector W_s s_i + W_h h_t of the two intermediate vectors and a third vector v is then determined.
  • Here, the first transformation matrix W_s, the second transformation matrix W_h, and the third vector v are all trainable network parameters of the second prediction network.
  • Based on this similarity, the i-th attention coefficient α_{t,i} can then be determined, for example using formula (3).
  • In step 52, with the attention coefficients corresponding to the segment vectors as weighting factors, the foregoing several segment vectors are weighted and combined to obtain a context vector μ_t.
  • Specifically, the segment vectors s_i stored in the buffer can be arranged in sequence into C_t, and the attention coefficients α_{t,i} corresponding to the segment vectors can be arranged into an attention vector α_t;
  • the context vector μ_t can then be expressed as:
μ_t = C_t α_t
  • In step 53, the second predicted probability q is obtained according to the context vector μ_t and a linear transformation matrix.
  • Similar to the first predicted probability p, the second predicted probability q may be the probability distribution, over each word in the lexicon, of the next word, so q is also a V-dimensional vector.
  • Accordingly, the linear transformation matrix used in step 53 is used to transform or project the d-dimensional context vector μ_t into a V-dimensional vector.
  • Thus, the second predicted probability q can be expressed as:
q = softmax(O_{t+1} μ_t)    (6)
  • where O_{t+1} is the linear transformation matrix for the context vector.
  • the linear transformation matrix for the context vector in formula (6) is the same matrix as the linear transformation matrix for the first implicit vector in formula (2).
  • the second predictive network maintains the linear transformation matrix for the context vector in formula (6), which is independent of the linear transformation matrix used by the first predictive network in formula (2).
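  • The following sketch illustrates one possible realization of this attention-based prediction (class and parameter names are illustrative; the dot-product scoring against v and the use of a separate output matrix are assumptions consistent with, but not mandated by, the embodiments above).

```python
import torch
import torch.nn as nn

class SecondPredictionNetwork(nn.Module):
    """Sketch: attention over buffered segment vectors, then projection + softmax."""

    def __init__(self, hidden_dim: int, vocab_size: int):
        super().__init__()
        self.W_s = nn.Linear(hidden_dim, hidden_dim, bias=False)  # first transformation matrix
        self.W_h = nn.Linear(hidden_dim, hidden_dim, bias=False)  # second transformation matrix
        self.v = nn.Parameter(torch.randn(hidden_dim))            # third vector
        self.output = nn.Linear(hidden_dim, vocab_size)           # projection of mu_t to V dims

    def forward(self, segment_vectors, h_t):
        """segment_vectors: (n, d) buffered vectors s_i; h_t: (d,) first hidden vector."""
        scores = (self.W_s(segment_vectors) + self.W_h(h_t)) @ self.v  # similarity beta_{t,i}
        alpha = torch.softmax(scores, dim=0)                           # attention coefficients
        mu_t = alpha @ segment_vectors                                 # context vector mu_t
        q = torch.softmax(self.output(mu_t), dim=-1)                   # second prediction probability q
        return q, alpha
```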
  • the second prediction network obtains the second prediction probability q for the next word according to the segment vector stored in the buffer.
  • the segment vector stored in the buffer area reflects the long-range context information. Therefore, the second prediction probability q obtained based on the segment vector can reflect the prediction of the next word based on the long-range context.
  • In the example above, the buffer stores segment vectors of previous text segments, and these previous segments include ones relatively far from the current word, such as m_16-18. Based on these segment vectors, the attention mechanism is used to obtain the second predicted probability q for the next word, which therefore takes more account of the long-range context. For example, since the text segment m_16-18 contains the long-range context "good restaurant", the second predicted probability q tends to output a higher prediction probability for related words in the long-range context, such as "appetite".
  • Next, the interpolation weight coefficient λ is used as the weighting coefficient of the second prediction probability q, and 1-λ is used as the weighting coefficient of the first prediction probability p,
  • so that interpolation weighted synthesis is performed on the first prediction probability and the second prediction probability to obtain the comprehensive prediction probability Pr for the next word, namely:
Pr = λ·q + (1-λ)·p    (7)
  • In step 25, the prediction loss for the t-th word is determined at least according to the above-mentioned comprehensive prediction probability Pr and the (t+1)-th word in the current training text.
  • the above-mentioned interpolation weight coefficient is a preset hyperparameter or a trainable model parameter.
  • In the training text, the true next word, that is, the (t+1)-th word, is known and can serve as the label.
  • the prediction loss for the current word can be determined according to the comparison of the comprehensive prediction probability Pr and the label.
  • For example, the cross-entropy loss function can be used to determine the prediction loss Loss:
Loss = -log Pr(x_{t+1})    (8)
  • where Pr(x_{t+1}) denotes the probability value assigned by the comprehensive prediction probability Pr to the true (t+1)-th word x_{t+1}.
  • In step 26, the text prediction model is trained based on the total prediction loss for each word in the current training text. Specifically, the first prediction network and the second prediction network are updated in the direction in which the total prediction loss decreases.
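  • A minimal sketch of the interpolation of formula (7) and the per-word loss of formula (8), with a fixed, hyperparameter-style interpolation weight lam (all names are illustrative):

```python
import torch

def per_word_loss(p, q, next_word_id, lam=0.5):
    """Pr = lam * q + (1 - lam) * p, then cross-entropy against the true next word."""
    Pr = lam * q + (1.0 - lam) * p                 # comprehensive prediction probability
    return -torch.log(Pr[next_word_id] + 1e-12)    # prediction loss for the current word

# Training (sketch): sum the per-word losses over the current training text and
# update the networks in the direction that reduces the total loss, e.g.
# total_loss = sum(losses); total_loss.backward(); optimizer.step()
```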
  • As mentioned above, in another embodiment, the text prediction model further includes a strategy network, which is used to determine the corresponding interpolation weight coefficient λ_t for the current word.
  • The following describes the manner in which the strategy network determines the interpolation weight coefficient, and the corresponding training method.
  • Specifically, the strategy network may obtain the first latent vector h_t obtained by the first prediction network processing the t-th word, and calculate the interpolation weight coefficient λ_t according to the first latent vector h_t.
  • Specifically, the strategy network may apply a strategy transformation matrix W_g to the above-mentioned first hidden vector h_t to obtain a strategy vector W_g h_t, where the strategy transformation matrix W_g is a trainable model parameter maintained in the strategy network.
  • W_g can be an M*d-dimensional matrix, so that the d-dimensional first hidden vector is transformed into an M-dimensional strategy vector, where M is a preset number of dimensions.
  • Then, the interpolation weight coefficient λ_t can be determined according to the element value of a predetermined dimension in the M-dimensional strategy vector. For example, the element value of a certain dimension k after normalizing the strategy vector (for example, with softmax) can be used as the interpolation weight coefficient λ_t, namely:
λ_t = [softmax(W_g h_t)]_k
  • a training strategy coefficient T is also set in the strategy network.
  • The training strategy coefficient T can be a hyperparameter that is adjusted during the training process, and more specifically can be determined for each training text, so as to better adjust the output interpolation weight coefficient.
  • As mentioned above, the interpolation weight coefficient is the weight coefficient applied to the second prediction probability; therefore, the larger the interpolation weight coefficient, the more the use of the long-range context is encouraged.
  • a process similar to "annealing” may be used to set and adjust the aforementioned training strategy coefficients. Specifically, a larger training strategy coefficient T, or a higher temperature T, can be set at the beginning of training; then, as the training progresses, the training strategy coefficient T, or the temperature T, is gradually reduced. This means that as training progresses, text prediction models are encouraged to explore the use of long-range context.
  • In one embodiment, the training strategy coefficient T can be determined according to the training sequence number of the current training text in the training sample set, so that the training strategy coefficient T is negatively correlated with the training sequence number. In other words, the smaller the training sequence number, the closer the text is to the beginning of training and the larger the training strategy coefficient T, that is, the higher the temperature; as the training sequence number increases, the temperature decreases and the training strategy coefficient decreases.
  • the training strategy coefficient T for the current training text can also be determined according to the total text length of the current training text. Specifically, the training strategy coefficient T can be negatively correlated with the total text length. Therefore, for a longer training text, a smaller coefficient T can be set to obtain a larger interpolation weight coefficient, thereby more encouraging the use of long-range context.
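  • The following sketch shows one way such a strategy network and annealing-style schedule could be realized (the softmax normalization, the linear decay of T, and all names are assumptions for illustration only).

```python
import torch
import torch.nn as nn

class StrategyNetwork(nn.Module):
    """Sketch: lambda_t from the first hidden vector h_t and a temperature-like
    training strategy coefficient T."""

    def __init__(self, hidden_dim: int, strategy_dim: int, target_dim: int = 0):
        super().__init__()
        self.W_g = nn.Linear(hidden_dim, strategy_dim, bias=False)  # strategy transformation matrix
        self.k = target_dim                                         # predetermined dimension

    def forward(self, h_t, T: float):
        g = self.W_g(h_t) / T                      # strategy vector, divided by the coefficient T
        return torch.softmax(g, dim=-1)[self.k]    # interpolation weight coefficient lambda_t

def training_strategy_coefficient(sample_index: int, T_max=5.0, T_min=0.5, decay=1e-4):
    """Annealing-style schedule: T starts high and decreases with the training sequence number."""
    return max(T_min, T_max - decay * sample_index)
```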
  • In the above manner, the strategy network determines the corresponding interpolation weight coefficient λ_t for the current t-th word in the current training text.
  • the interpolation weight coefficient is applied to the above formula (7) to obtain the comprehensive predicted probability Pr.
  • The first prediction network obtains the first prediction probability p for the 93rd word according to the current state vector h_92 of the 92nd word, that is, the first hidden vector; the second prediction network obtains the second prediction probability q for the 93rd word according to the segment vectors stored in the buffer.
  • The strategy network obtains the interpolation weight coefficient according to the above-mentioned first hidden vector h_92 and the training strategy coefficient T (shown as the "annealing" temperature in the figure). The interpolation weight coefficient can then be used to perform interpolation synthesis on the first prediction probability p and the second prediction probability q, to obtain the comprehensive prediction probability Pr.
  • When the strategy network is included, the aforementioned method of determining the prediction loss needs to be modified accordingly.
  • That is, when determining the prediction loss, not only the comprehensive prediction probability obtained from the first and second prediction networks is considered, but also the output of the strategy network. Therefore, according to an embodiment, in the foregoing step 25, the prediction loss Loss is determined according to the comprehensive prediction probability and the (t+1)-th word, and further according to the first prediction probability p, the second prediction probability q, and the interpolation weight coefficient.
  • Specifically, in the case of combining a strategy network, the prediction loss can be determined in the following manner.
  • the first loss term L1 can be determined according to the comprehensive prediction probability Pr and the t+1th word.
  • the first loss term L1 can take the form of cross-entropy loss, as shown in formula (8). In other words, the loss shown in formula (8) can be used as the first loss item L1 here.
  • the second loss term L2 is determined so that the second loss term is negatively related to the interpolation weight coefficient.
  • For example, the second loss term can be set as:
L2 = -log λ_t
  • The second loss term L2 can also be set to other forms of negative correlation, for example, 1/λ_t.
  • Further, according to the ratio of the probability values that the second prediction probability q and the first prediction probability p assign to the (t+1)-th word, a reward term r_t is determined, where the reward term r_t is positively correlated with this ratio; then, taking the reward term r_t as the coefficient of the second loss term L2, the first loss term and the second loss term are summed to determine the prediction loss Loss.
  • In one specific example, the predicted loss Loss can be expressed as:
Loss = L1 + γ·r_t·L2 = -log Pr(x_{t+1}) - γ·r_t·log λ_t    (12)
  • where γ is an optional adjustment coefficient, γ > 0.
  • the first term in the loss function expression corresponds to the first loss term, which aims to increase the probability of correctly predicting the next word.
  • the second term is the product of the reward term and the second loss term, which aims to conditionally encourage the exploration and use of the long-range context.
  • It can be noted that the term r_t·log λ_t is very similar in form to the policy gradient in reinforcement learning.
  • Encouraging the exploration and use of the long-range context can be embodied by the second loss term L2 itself, because a smaller value of the second loss term corresponds to a larger λ_t.
  • On the other hand, the encouragement of the long-range context should be carried out conditionally, and the condition is reflected by the reward term r_t.
  • The adjustment by the reward term means that only when the prediction probability of the second prediction network for the correct next word is significantly higher than the prediction probability of the first prediction network is a larger interpolation weight coefficient λ_t encouraged.
  • Specifically, the second prediction network outputs the second prediction probability q, in which the probability value for the real (t+1)-th word (that is, the correct next word) is q(x_{t+1});
  • similarly, the probability value of the first prediction network for the (t+1)-th word is p(x_{t+1}).
  • The ratio of the two can be defined as R:
R = q(x_{t+1}) / p(x_{t+1})
  • the above ratio R may reflect the relative prediction accuracy of the second prediction network and the first prediction network for the correct next word.
  • The reward term r_t is set to be positively correlated with the ratio R, that is, the larger the ratio R, the larger the reward term r_t.
  • During training, the correct next word, that is, the (t+1)-th word, is known; therefore, the size of the reward term can be clearly and uniquely determined. For this reason, this reward term can also be called an intrinsic reward.
  • The reward term r_t can be determined in a variety of ways based on the above ratio R.
  • In one specific example, the reward term r_t is determined according to formula (14):
  • In formula (14), a minimum value is introduced in order to avoid mathematical problems caused when p(x_{t+1}) approaches 0.
  • the above function f(z) can adopt the ReLU function:
  • The exponent parameter in formula (14) is used to amplify the effect of R in exponential form, and the coefficient in formula (15) is used for linear amplification.
  • the parameter a in formula (14) is the cutoff threshold, and the parameter b is the reference threshold.
  • In the case where the prediction loss is determined according to formula (12),
  • reducing the prediction loss requires, on the basis of increasing the prediction probability of the correct word through the first loss term, that the second term also be as small as possible.
  • When the prediction probability of the second prediction network for the correct next word is significantly higher than that of the first prediction network, that is, when the above-mentioned ratio R is larger, a larger reward term r_t is obtained, which pushes the second loss term to be smaller, that is, pushes the strategy network to output a larger λ_t; in this way, a larger interpolation weight coefficient λ_t is conditionally encouraged, that is, the use of the long-range context is conditionally encouraged.
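  • Only as a hedged illustration of the above loss composition: the sketch below takes L2 = -log(lambda_t), treats the reward as a constant coefficient, and instantiates the reward with a ReLU applied to min(R, a) - b, since formulas (14) and (15) are not reproduced here; all function names and default values are assumptions.

```python
import torch

def intrinsic_reward(q_true, p_true, a=2.0, b=1.0, eps=1e-6):
    """Reward r_t, positively correlated with R = q(x_{t+1}) / p(x_{t+1}).
    The min(R, a) - b form under ReLU is an assumed instantiation of the
    cutoff threshold a and reference threshold b mentioned in the text."""
    R = q_true / (p_true + eps)                    # eps avoids division by a vanishing p
    return torch.relu(torch.clamp(R, max=a) - b)

def loss_with_strategy_network(Pr, p, q, lam_t, next_word_id, gamma=1.0):
    """Loss = L1 + gamma * r_t * L2, following the structure of formula (12)."""
    L1 = -torch.log(Pr[next_word_id] + 1e-12)      # first loss term: cross-entropy of Pr
    L2 = -torch.log(lam_t + 1e-12)                 # second loss term, decreases as lam_t grows
    r_t = intrinsic_reward(q[next_word_id], p[next_word_id]).detach()  # reward as a constant coefficient
    return L1 + gamma * r_t * L2
```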
  • In this case, in step 26, the text prediction model is trained according to the total prediction loss of each word; that is, the model parameters in the first prediction network, the second prediction network, and the strategy network are adjusted in the direction in which the total prediction loss decreases, so as to achieve the above training goals.
  • Reviewing the above, the text prediction model of the embodiments of this specification, on the basis of using the time-series-based first prediction network to predict the next word, also uses the segment vectors of previous text segments stored in the buffer as long-range context information, and uses the second prediction network to make predictions based on the long-range context.
  • the strategy network can be used to generate an interpolation weight coefficient for the current word.
  • According to an embodiment of another aspect, a training device for a text prediction model is provided, where the text prediction model includes a first prediction network based on time series, a buffer, and a second prediction network based on the buffer; the training device can be deployed in any apparatus, platform, or device cluster with computing and processing capabilities.
  • Fig. 6 shows a schematic block diagram of a training device for a text prediction model according to an embodiment. As shown in Fig. 6,
  • the training device 600 includes: a first prediction unit 61, configured to input the t-th word into the first prediction network after the first t-1 words in the current training text have been sequentially input, so that the first prediction network determines, according to the state vector after processing the (t-1)-th word and the word vector of the t-th word, the state vector after processing the t-th word as the first latent vector, and determines the first prediction probability for the next word according to the first latent vector;
  • a reading unit 62, configured to read several existing segment vectors from the buffer, where the existing segment vectors are formed based on the text before the t-th word in the current training text, and each segment vector corresponds to a text segment with a length of L words;
  • a second prediction unit 63, configured to enable the second prediction network to determine the second prediction probability for the next word according to the several segment vectors;
  • a synthesis unit 64, configured to use the interpolation weight coefficient as the weight coefficient of the second prediction probability, and one minus the interpolation weight coefficient as the weight coefficient of the first prediction probability, to perform interpolation weighted synthesis on the first prediction probability and the second prediction probability to obtain the comprehensive prediction probability for the next word; a loss determination unit 65, configured to determine the prediction loss for the t-th word at least according to the comprehensive prediction probability and the (t+1)-th word in the training text; and a training unit, configured to train the text prediction model according to the prediction loss for each word in the current training text.
  • the first prediction network includes a recurrent neural network RNN or a long short-term memory network LSTM.
  • the several segment vectors stored in the buffer include a first segment vector corresponding to any first text segment, and the first text segment includes the i-th word to the j-th word of the current training text, Wherein i and j are both less than t, the first segment vector is obtained based on the difference between the first state vector and the second state vector, and the first state vector is the first prediction network processing the jth The state vector after the word, and the second state vector is the state vector after the first prediction network processes the (i-1)th word.
  • In one embodiment, the device 600 further includes a storage unit (not shown), configured to: if the t-th word is the last word of the current text segment, determine a newly added segment vector according to the difference between the first latent vector and a second latent vector, where the second latent vector is the state vector after the first prediction network processes the (t-L)-th word; and add the newly added segment vector to the buffer.
  • the buffer has a limited storage capacity.
  • In this case, the storage unit is further configured to: determine whether the number of segment vectors already in the buffer reaches a predetermined threshold number; and if the predetermined threshold number is reached, delete the earliest stored segment vector and store the newly added segment vector in the buffer.
  • the second prediction network obtains the second prediction probability by: determining a number of attention coefficients corresponding to the several segment vectors; using the several attention coefficients as weighting factors, A number of segment vectors are weighted and combined to obtain a context vector; and the second prediction probability is obtained according to the context vector and the linear transformation matrix.
  • the first prediction probability is obtained according to the first hidden vector and the same linear transformation matrix as the second prediction network.
  • the second prediction network determines the attention coefficient in the following manner: according to the similarity between any i-th segment vector in the plurality of segment vectors and the first latent vector, determine the i-th segment vector Attention coefficient.
  • In another embodiment, the second prediction network determines the attention coefficient in the following manner: using a first transformation matrix to transform any i-th segment vector in the several segment vectors into a first intermediate vector; using a second transformation matrix to transform the first hidden vector into a second intermediate vector; determining the similarity between the sum vector of the first intermediate vector and the second intermediate vector, and the third vector; and determining the i-th attention coefficient according to the similarity; where the first transformation matrix, the second transformation matrix, and the third vector are all trainable network parameters in the second prediction network.
  • the text prediction model further includes a strategy network for outputting the interpolation weight coefficient according to the first latent vector; in this case, the loss determination unit 65 is further configured to The comprehensive prediction probability, the t+1th word, the first prediction probability and the second prediction probability, and the interpolation weight coefficient determine the prediction loss.
  • Further, in an embodiment, the strategy network determines the interpolation weight coefficient in the following manner: applying at least a strategy transformation matrix to the first implicit vector to obtain a strategy vector, where the strategy transformation matrix is a trainable model parameter in the strategy network; and determining the interpolation weight coefficient according to the element value of a predetermined dimension in the strategy vector.
  • Further, the strategy network obtains the strategy vector in the following manner: determining a training strategy coefficient according to the current training text; and applying the strategy transformation matrix to the first implicit vector and dividing by the training strategy coefficient to obtain the strategy vector.
  • In one embodiment, the strategy network determining the training strategy coefficient specifically includes: determining the training strategy coefficient according to the training sequence number of the current training text in the training sample set, so that the training strategy coefficient is negatively correlated with the training sequence number.
  • determining the training strategy coefficient by the strategy network specifically includes: determining the training strategy coefficient according to the total text length of the current training text, so that the training strategy coefficient is negatively correlated with the total text length.
  • According to one embodiment, the loss determination unit 65 is specifically configured to: determine a first loss term according to the comprehensive predicted probability and the (t+1)-th word; determine a second loss term according to the interpolation weight coefficient, where the second loss term is negatively correlated with the interpolation weight coefficient; determine a reward term according to the ratio of the probability values that the second prediction probability and the first prediction probability assign to the (t+1)-th word, where the reward term is positively correlated with the ratio; and, taking the reward term as the coefficient of the second loss term, sum the first loss term and the second loss term to determine the prediction loss.
  • a computer-readable storage medium having a computer program stored thereon, and when the computer program is executed in a computer, the computer is caused to execute the method described in conjunction with FIG. 2.
  • a computing device including a memory and a processor, where executable code is stored in the memory, and when the processor executes the executable code, the method described in conjunction with Fig. 2 is implemented.
  • the functions described in the present invention can be implemented by hardware, software, firmware, or any combination thereof.
  • these functions can be stored in a computer-readable medium or transmitted as one or more instructions or codes on the computer-readable medium.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A computer-executed method for training a text prediction model, and a text prediction model training apparatus, are provided. A text prediction model includes a first prediction network (11) based on a time sequence, a buffer (12), and a second prediction network (13) based on the buffer (12). The training method includes: inputting a t-th word of a training text into the first prediction network (11), so that the first prediction network determines a first prediction probability for the next word according to a state vector obtained by time-sequence processing; in addition, reading, from the buffer (12), several segment vectors formed on the basis of the preceding text, and the second prediction network (13) obtaining a second prediction probability for the next word according to these segment vectors; then, taking an interpolation weight coefficient λ as the weight coefficient of the second prediction probability and one minus λ as the weight coefficient of the first prediction probability, weighting and synthesizing the second prediction probability and the first prediction probability to obtain a comprehensive prediction probability; and determining, at least according to the comprehensive prediction probability and a (t+1)-th word, a prediction loss for the t-th word, and thereby training the text prediction model.
PCT/CN2020/132617 2020-02-06 2020-11-30 Method and apparatus for training a text prediction model WO2021155705A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010081187.8 2020-02-06
CN202010081187.8A CN111274789B (zh) 2020-02-06 2020-02-06 Training method and device for a text prediction model

Publications (1)

Publication Number Publication Date
WO2021155705A1 true WO2021155705A1 (fr) 2021-08-12

Family

ID=71000235

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/132617 WO2021155705A1 (fr) 2020-02-06 2020-11-30 Method and apparatus for training a text prediction model

Country Status (2)

Country Link
CN (1) CN111274789B (fr)
WO (1) WO2021155705A1 (fr)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116362418A (zh) * 2023-05-29 2023-06-30 天能电池集团股份有限公司 Online prediction method for application-level manufacturing capability of a high-end battery intelligent factory
CN117540326A (zh) * 2024-01-09 2024-02-09 深圳大学 Construction state anomaly identification method and system for drill-and-blast tunnel construction equipment

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111274789B (zh) * 2020-02-06 2021-07-06 支付宝(杭州)信息技术有限公司 文本预测模型的训练方法及装置
CN111597819B (zh) * 2020-05-08 2021-01-26 河海大学 一种基于关键词的大坝缺陷图像描述文本生成方法
CN111767708A (zh) * 2020-07-09 2020-10-13 北京猿力未来科技有限公司 解题模型的训练方法及装置、解题公式生成方法及装置
CN113095040A (zh) * 2021-04-16 2021-07-09 支付宝(杭州)信息技术有限公司 一种编码网络的训练方法、文本编码方法和系统
CN116861258B (zh) * 2023-08-31 2023-12-01 腾讯科技(深圳)有限公司 模型处理方法、装置、设备及存储介质

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104813275A (zh) * 2012-09-27 2015-07-29 谷歌公司 Method and system for predicting text
CN105279552A (zh) * 2014-06-18 2016-01-27 清华大学 Character-based neural network training method and device
CN108984526A (zh) * 2018-07-10 2018-12-11 北京理工大学 Deep-learning-based document topic vector extraction method
CN110457674A (zh) * 2019-06-25 2019-11-15 西安电子科技大学 Topic-guided text prediction method
US20190354850A1 (en) * 2018-05-17 2019-11-21 International Business Machines Corporation Identifying transfer models for machine learning tasks
CN111274789A (zh) * 2020-02-06 2020-06-12 支付宝(杭州)信息技术有限公司 Training method and device for a text prediction model

Family Cites Families (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7478171B2 (en) * 2003-10-20 2009-01-13 International Business Machines Corporation Systems and methods for providing dialog localization in a distributed environment and enabling conversational communication using generalized user gestures
GB201418402D0 (en) * 2014-10-16 2014-12-03 Touchtype Ltd Text prediction integration
US9607616B2 (en) * 2015-08-17 2017-03-28 Mitsubishi Electric Research Laboratories, Inc. Method for using a multi-scale recurrent neural network with pretraining for spoken language understanding tasks
EP3500979A1 (fr) * 2016-10-06 2019-06-26 Siemens Aktiengesellschaft Dispositif informatique pour l'apprentissage d'un réseau neuronal profond
US10803252B2 (en) * 2018-06-30 2020-10-13 Wipro Limited Method and device for extracting attributes associated with centre of interest from natural language sentences
CN108984745B (zh) * 2018-07-16 2021-11-02 福州大学 Neural network text classification method fusing multiple knowledge graphs
CN109597997B (zh) * 2018-12-07 2023-05-02 上海宏原信息科技有限公司 Comment-entity- and aspect-level sentiment classification method and device, and model training therefor
CN109858031B (zh) * 2019-02-14 2023-05-23 北京小米智能科技有限公司 Neural network model training and context prediction method and device
CN110032630B (zh) * 2019-03-12 2023-04-18 创新先进技术有限公司 Dialogue script recommendation device and method, and model training device
CN109992771B (zh) * 2019-03-13 2020-05-05 北京三快在线科技有限公司 Text generation method and device
CN110096698B (zh) * 2019-03-20 2020-09-29 中国地质大学(武汉) Topic-aware machine reading comprehension model generation method and system
CN110059262B (zh) * 2019-04-19 2021-07-02 武汉大学 Method and device for constructing an item recommendation model based on a hybrid neural network, and item recommendation method
CN110427466B (zh) * 2019-06-12 2023-05-26 创新先进技术有限公司 Training method and device for a neural network model for question-answer matching
CN110413753B (zh) * 2019-07-22 2020-09-22 阿里巴巴集团控股有限公司 Method and device for expanding question-answer samples
CN110704890A (zh) * 2019-08-12 2020-01-17 上海大学 Automatic text causality extraction method fusing a convolutional neural network and a recurrent neural network
CN110442723B (zh) * 2019-08-14 2020-05-15 山东大学 Method for multi-label text classification using a multi-step discrimination-based Co-Attention model
CN110705294B (zh) * 2019-09-11 2023-06-23 苏宁云计算有限公司 Named entity recognition model training method, named entity recognition method and device

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104813275A (zh) * 2012-09-27 2015-07-29 谷歌公司 Method and system for predicting text
CN105279552A (zh) * 2014-06-18 2016-01-27 清华大学 Character-based neural network training method and device
US20190354850A1 (en) * 2018-05-17 2019-11-21 International Business Machines Corporation Identifying transfer models for machine learning tasks
CN108984526A (zh) * 2018-07-10 2018-12-11 北京理工大学 Deep-learning-based document topic vector extraction method
CN110457674A (zh) * 2019-06-25 2019-11-15 西安电子科技大学 Topic-guided text prediction method
CN111274789A (zh) * 2020-02-06 2020-06-12 支付宝(杭州)信息技术有限公司 Training method and device for a text prediction model

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116362418A (zh) * 2023-05-29 2023-06-30 天能电池集团股份有限公司 Online prediction method for application-level manufacturing capability of a high-end battery intelligent factory
CN116362418B (zh) * 2023-05-29 2023-08-22 天能电池集团股份有限公司 Online prediction method for application-level manufacturing capability of a high-end battery intelligent factory
CN117540326A (zh) * 2024-01-09 2024-02-09 深圳大学 Construction state anomaly identification method and system for drill-and-blast tunnel construction equipment
CN117540326B (zh) * 2024-01-09 2024-04-12 深圳大学 Construction state anomaly identification method and system for drill-and-blast tunnel construction equipment

Also Published As

Publication number Publication date
CN111274789B (zh) 2021-07-06
CN111274789A (zh) 2020-06-12

Similar Documents

Publication Publication Date Title
WO2021155705A1 (fr) 2021-08-12 Method and apparatus for training a text prediction model
US10762891B2 (en) Binary and multi-class classification systems and methods using connectionist temporal classification
CN110674880B (zh) 用于知识蒸馏的网络训练方法、装置、介质与电子设备
JP6741357B2 (ja) マルチ関連ラベルを生成する方法及びシステム
WO2021143396A1 (fr) Procédé et appareil pour effectuer une prédiction de classification à l'aide d'un modèle de classification de texte
US10720151B2 (en) End-to-end neural networks for speech recognition and classification
WO2021204269A1 (fr) Apprentissage de modèle de classification et classification d'objets
US11915686B2 (en) Speaker adaptation for attention-based encoder-decoder
Jung et al. Adaptive detrending to accelerate convolutional gated recurrent unit training for contextual video recognition
US10762417B2 (en) Efficient connectionist temporal classification for binary classification
JPWO2012105231A1 (ja) モデル適応化装置、モデル適応化方法およびモデル適応化用プログラム
KR20220130565A (ko) 키워드 검출 방법 및 장치
US11087213B2 (en) Binary and multi-class classification systems and methods using one spike connectionist temporal classification
JP2022543245A (ja) 学習を転移させるための学習のためのフレームワーク
WO2024007619A1 (fr) Procédé et appareil d'entraînement de décodeur, procédé et appareil de détection de cible, et support d'enregistrement
US20180197082A1 (en) Learning apparatus and method for bidirectional learning of predictive model based on data sequence
CN111557010A (zh) 学习装置和方法以及程序
JP5288378B2 (ja) 音響モデルの話者適応装置及びそのためのコンピュータプログラム
JP7047849B2 (ja) 識別装置、識別方法、および識別プログラム
US20220222435A1 (en) Task-Specific Text Generation Based On Multimodal Inputs
EP3971782A2 (fr) Sélection de réseau de neurones artificiels
US11593621B2 (en) Information processing apparatus, information processing method, and computer program product
US11107460B2 (en) Adversarial speaker adaptation
WO2020044755A1 (fr) Dispositif de reconnaissance vocale, procédé de reconnaissance vocale et programme
JP7364228B2 (ja) 情報処理装置、その制御方法、プログラム、ならびに、学習済モデル

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20917616

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20917616

Country of ref document: EP

Kind code of ref document: A1