Disclosure of Invention
The invention aims to provide a neural machine translation method and device based on a word vector connection technology, so as to strengthen the connection and mapping between word vectors, enhance the performance of the translation system, and improve the translation quality.
In order to solve the technical problems, the invention provides the following technical scheme:
a neural machine translation method based on a word vector connection technology comprises the following steps:
in the encoding stage, an encoder encodes the read source sentence to obtain a word vector sequence x = <x_1, x_2, …, x_j, …, x_T> of the source sentence; wherein x_j is the word vector of the jth source word in the source sentence, and T represents the number of source words contained in the source sentence;
the forward recurrent neural network (RNN) of the encoder determines, from the word vector sequence, a forward vector sequence <h→_1, h→_2, …, h→_T> composed of hidden vectors; wherein h→_j = f(x_j, h→_{j-1}) is the forward hidden layer state of the jth source word, and f is a nonlinear activation function;
the reverse RNN of the encoder determines, from the word vector sequence, a reverse vector sequence <h←_1, h←_2, …, h←_T> composed of hidden vectors; wherein h←_j = f(x_j, h←_{j+1}) is the reverse hidden layer state of the jth source word;
determining the hidden layer vector sequence <h_1, h_2, …, h_j, …, h_T> corresponding to the source sentence according to the forward vector sequence and the reverse vector sequence; wherein h_j = [h→_j; h←_j; x_j] is the vector containing the context information corresponding to each source word in the source sentence;
obtaining a context vector c_t = q({h_1, h_2, …, h_j, …, h_T}) by using an attention network according to the hidden layer vector sequence; wherein q is a nonlinear activation function;
in the decoding stage, the decoder predicts the target word y_t of the corresponding source word according to the context vector c_t and the currently predicted target words {y_1, y_2, …, y_{t-1}}, and generates the target sentence y = <y_1, y_2, …, y_t, …, y_{T_y}> of the source sentence; T_y represents the number of target words contained in the target sentence.
In one embodiment of the present invention, the probability p(y) of generating the target sentence y of the source sentence is:
p(y) = ∏_{t=1}^{T_y} p(y_t | {y_1, y_2, …, y_{t-1}}, c_t)
wherein p(y_t | {y_1, y_2, …, y_{t-1}}, c_t) = g(y_{t-1}, s_t, c_t), g is a nonlinear activation function, and s_t is a hidden state in the decoder RNN.
In one embodiment of the present invention, s_t = f(y_{t-1}, s_{t-1}, c_t, x_{t*}); wherein t* = argmax_j(α_{tj}), and α_{tj} are the weights calculated by the attention network in the conventional NMT model.
In one embodiment of the present invention, after the target sentence y = <y_1, y_2, …, y_{T_y}> of the source sentence is generated, the method further comprises the following steps:
determining a training set D = {(x_n, y_n)}_{n=1}^N; wherein N represents the number of sentence pairs contained in the training corpus, and (x_n, y_n) represents a sentence pair;
and based on the training set, performing model training according to a preset target training function.
In an embodiment of the present invention, the target training function is:
L(θ, w) = Σ_{n=1}^N [ -log p(y_n | x_n; θ) + l_emb(x_n, y_n; w) ]
wherein l_emb is the word vector loss function (word embedding loss), and w is the transformation matrix.
A neural machine translation device based on a word vector connection technology, comprising:
a word vector sequence obtaining module, configured to, in the encoding stage, have the encoder encode the read source sentence to obtain the word vector sequence x = <x_1, x_2, …, x_j, …, x_T> of the source sentence; wherein x_j is the word vector of the jth source word in the source sentence, and T represents the number of source words contained in the source sentence;
a forward vector sequence determination module, configured to have the forward recurrent neural network (RNN) of the encoder determine, from the word vector sequence, the forward vector sequence <h→_1, h→_2, …, h→_T> composed of hidden vectors; wherein h→_j = f(x_j, h→_{j-1}) is the forward hidden layer state of the jth source word, and f is a nonlinear activation function;
a reverse vector sequence determination module, configured to have the reverse RNN of the encoder determine, from the word vector sequence, the reverse vector sequence <h←_1, h←_2, …, h←_T> composed of hidden vectors; wherein h←_j = f(x_j, h←_{j+1}) is the reverse hidden layer state of the jth source word;
a hidden vector sequence determining module, configured to determine the hidden layer vector sequence <h_1, h_2, …, h_j, …, h_T> corresponding to the source sentence according to the forward vector sequence and the reverse vector sequence; wherein h_j = [h→_j; h←_j; x_j] is the vector containing the context information corresponding to each source word in the source sentence;
a context vector obtaining module, configured to obtain the context vector c_t = q({h_1, h_2, …, h_j, …, h_T}) by using an attention network according to the hidden layer vector sequence; wherein q is a nonlinear activation function;
a target sentence generation module, configured to, in the decoding stage, have the decoder predict the target word y_t of the corresponding source word according to the context vector c_t and the currently predicted target words {y_1, y_2, …, y_{t-1}}, and generate the target sentence y = <y_1, y_2, …, y_t, …, y_{T_y}> of the source sentence; T_y represents the number of target words contained in the target sentence.
In one embodiment of the present invention, the probability p(y) of generating the target sentence y of the source sentence is:
p(y) = ∏_{t=1}^{T_y} p(y_t | {y_1, y_2, …, y_{t-1}}, c_t)
wherein p(y_t | {y_1, y_2, …, y_{t-1}}, c_t) = g(y_{t-1}, s_t, c_t), g is a nonlinear activation function, and s_t is a hidden state in the decoder RNN.
In one embodiment of the present invention, s_t = f(y_{t-1}, s_{t-1}, c_t, x_{t*}); wherein t* = argmax_j(α_{tj}), and α_{tj} are the weights calculated by the attention network in the conventional NMT model.
In an embodiment of the present invention, the device further includes a training module, configured to:
after the target sentence y = <y_1, y_2, …, y_{T_y}> of the source sentence is generated, determine a training set D = {(x_n, y_n)}_{n=1}^N; wherein N represents the number of sentence pairs contained in the training corpus, and (x_n, y_n) represents a sentence pair;
and based on the training set, carrying out model training according to a preset target training function.
In an embodiment of the present invention, the target training function is:
L(θ, w) = Σ_{n=1}^N [ -log p(y_n | x_n; θ) + l_emb(x_n, y_n; w) ]
wherein l_emb is the word vector loss function (word embedding loss), and w is the transformation matrix.
By applying the technical scheme provided by the embodiment of the invention, in the encoding stage, the encoder encodes the read source sentence to obtain the word vector sequence of the source sentence; the forward RNN of the encoder determines the forward vector sequence, and the reverse RNN determines the reverse vector sequence; the hidden vector sequence corresponding to the source sentence is determined according to the forward vector sequence and the reverse vector sequence, the vector containing the context information corresponding to each source word being the connection of the forward hidden state, the reverse hidden state, and the word vector of that source word; the context vector can then be obtained by using the attention network according to the hidden vector sequence; and, in the decoding stage, the decoder predicts the target word of the corresponding source word according to the context vector and the currently predicted target words, generating the target sentence of the source sentence. This scheme shortens the information channel between the source-end word vectors and the target-end word vectors, strengthens the connection and mapping between the word vectors, enhances the performance of the translation system, and improves the translation quality.
Detailed Description
In order that those skilled in the art will better understand the disclosure, reference will now be made in detail to the embodiments of the disclosure as illustrated in the accompanying drawings. It should be apparent that the described embodiments are only some embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, which is a flowchart illustrating an implementation of a neural machine translation method based on a word vector connection technology according to an embodiment of the present invention, the method includes the following steps:
S110: in the encoding stage, the encoder encodes the read source sentence to obtain the word vector sequence x = <x_1, x_2, …, x_j, …, x_T> of the source sentence.
Wherein x_j is the word vector of the jth source word in the source sentence, and T represents the number of source words contained in the source sentence.
The technical scheme provided by the embodiment of the invention is based on a basic NMT system. The encoder is formed by a bidirectional RNN, i.e., it includes a forward RNN and a reverse RNN, and the decoder is formed by an RNN.
In practical applications, each word of each sentence in the training corpus may be initialized to a word vector in advance, and the word vectors of all words constitute a word vector dictionary. A word vector is generally a multi-dimensional vector, each dimension of which is a real number; the dimension can be finally determined according to the results obtained in the experimental process. For example, the word vector corresponding to a given word may be <0.12, -0.23, …, 0.99>.
In the encoding stage, the encoder may encode the read source sentence, that is, encode the source sentence into a series of vectors, obtaining the word vector sequence x = <x_1, x_2, …, x_j, …, x_T> of the source sentence. Wherein x_j, the word vector of the jth source word in the source sentence, may be an m-dimensional vector, and T represents the number of source words contained in the source sentence.
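As a minimal sketch of this lookup (not the embodiment's implementation), the snippet below maps words to their word vectors; the vocabulary, the dimension m = 4, and the random initialization are illustrative assumptions.

```python
# A minimal sketch of the word-vector lookup: the vocabulary, the dimension
# m = 4, and the random initialization are illustrative assumptions, not
# values from the embodiment.
import numpy as np

rng = np.random.default_rng(0)
m = 4                                        # word-vector dimension (fixed experimentally)
vocab = ["<unk>", "the", "cat", "sat"]       # hypothetical word-vector dictionary keys
E = rng.standard_normal((len(vocab), m))     # one m-dimensional real vector per word

def encode_sentence(words):
    """Map each source word to its word vector, giving x = <x_1, ..., x_T>."""
    ids = [vocab.index(w) if w in vocab else 0 for w in words]
    return E[ids]                            # shape (T, m)

x = encode_sentence(["the", "cat", "sat"])   # T = 3 source words
```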
S120: the forward recurrent neural network (RNN) of the encoder determines, from the word vector sequence, the forward vector sequence <h→_1, h→_2, …, h→_T> composed of hidden vectors.
Wherein h→_j = f(x_j, h→_{j-1}) is the forward hidden state of the jth source word, and f is a nonlinear activation function.
The forward RNN of the encoder can determine, from the word vector sequence, the forward vector sequence <h→_1, h→_2, …, h→_T> composed of hidden vectors, where h→_j = f(x_j, h→_{j-1}) is the forward hidden state of the jth source word and f is a nonlinear activation function; specifically, a GRU or an LSTM may be used.
S130: the reverse RNN of the encoder determines, from the word vector sequence, the reverse vector sequence <h←_1, h←_2, …, h←_T> composed of hidden vectors.
Wherein h←_j = f(x_j, h←_{j+1}) is the reverse hidden state of the jth source word.
Following the same principle as step S120, the reverse RNN of the encoder may determine, from the word vector sequence, the reverse vector sequence <h←_1, h←_2, …, h←_T> composed of hidden vectors, where h←_j = f(x_j, h←_{j+1}) is the reverse hidden state of the jth source word.
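The following sketch illustrates steps S120 and S130 under stated assumptions: a plain tanh cell stands in for the nonlinear activation f (the embodiment notes a GRU or LSTM may be used), the two directions share weights for brevity, and all dimensions and initializations are illustrative.

```python
# A sketch of steps S120 and S130: a plain tanh cell stands in for f (the
# embodiment allows a GRU or LSTM); weights are shared between the two
# directions only for brevity, and all shapes are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
m, d = 4, 5                                  # word-vector and hidden-state dimensions (assumed)
W_in = rng.standard_normal((d, m)) * 0.1
W_rec = rng.standard_normal((d, d)) * 0.1

def rnn_step(x_j, h_prev):
    """One recurrent update h_j = f(x_j, h_prev)."""
    return np.tanh(W_in @ x_j + W_rec @ h_prev)

def bidirectional_encode(x):
    """Return forward states <h->_1..T> and reverse states <h<-_1..T>."""
    T = len(x)
    h_fwd, h_bwd = np.zeros((T, d)), np.zeros((T, d))
    h = np.zeros(d)
    for j in range(T):                       # forward RNN reads left to right
        h = rnn_step(x[j], h)
        h_fwd[j] = h
    h = np.zeros(d)
    for j in reversed(range(T)):             # reverse RNN reads right to left
        h = rnn_step(x[j], h)
        h_bwd[j] = h
    return h_fwd, h_bwd
```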
S140: determining the hidden layer vector sequence <h_1, h_2, …, h_j, …, h_T> corresponding to the source sentence according to the forward vector sequence and the reverse vector sequence.
Wherein h_j = [h→_j; h←_j; x_j] is the vector representation containing the context information corresponding to each source word in the source sentence.
After the forward vector sequence and the reverse vector sequence are determined, the forward hidden state h→_j and the reverse hidden state h←_j corresponding to each source word can be connected, and on this basis the hidden vector sequence <h_1, h_2, …, h_T> corresponding to the source sentence is determined, where h_j = [h→_j; h←_j; x_j] is the vector representation containing the context information corresponding to each source word in the source sentence. That is, the vector containing the context information corresponding to each source word is the connection of the forward hidden state, the reverse hidden state, and the word vector of that source word, forming the source hidden layer state connection model. A schematic diagram of an NMT model fusing the source hidden layer state connection model is shown in fig. 2. In this model, the word vector of each source word is simply concatenated behind the hidden state vector produced for that source word by the bidirectional RNN encoder (BiRNN Encoder). The word vector of the source word may thus be used not only to calculate the weights of the attention network but also to predict the target word.
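Continuing the encoder sketch above, the connection step itself is a single concatenation per source position:

```python
# A minimal sketch of the source hidden-layer state connection: each h_j
# concatenates the forward hidden state, the reverse hidden state, and the
# word vector of the jth source word.
import numpy as np

def connect_source_states(h_fwd, h_bwd, x):
    """h_j = [h->_j ; h<-_j ; x_j] for every source position j."""
    return np.concatenate([h_fwd, h_bwd, x], axis=1)   # shape (T, 2d + m)

# h = connect_source_states(*bidirectional_encode(x), x) would yield the
# sequence <h_1, ..., h_T> consumed by the attention network below.
```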
S150: obtaining the context vector c_t = q({h_1, h_2, …, h_j, …, h_T}) by using an attention network according to the hidden vector sequence.
Wherein q is a nonlinear activation function.
After the hidden vector sequence corresponding to the source sentence is determined in step S140, the context vector c_t = q({h_1, h_2, …, h_j, …, h_T}) can be obtained from it by using the attention network, where q is a nonlinear activation function.
Specifically, the context vector may be obtained using the attention network as:
c_t = Σ_{j=1}^T α_{tj} h_j, with α_{tj} = exp(e_{tj}) / Σ_{k=1}^T exp(e_{tk}) and e_{tj} = a(s_{t-1}, h_j)
wherein a is a one-layer feedforward network, and α_{tj} is the weight of each hidden state h_j of the encoder.
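A sketch of one possible realization of this attention network, assuming the standard additive form written out above; the alignment dimension d_a and all weights are illustrative assumptions.

```python
# A sketch of the attention network of step S150. The one-layer network a is
# realized as v_a . tanh(U_s s_{t-1} + U_h h_j); d_a and all weights are
# illustrative assumptions.
import numpy as np

rng = np.random.default_rng(2)
d_s, d_h, d_a = 5, 14, 6                     # decoder state, h_j (= 2d + m), alignment dims (assumed)
U_s = rng.standard_normal((d_a, d_s)) * 0.1
U_h = rng.standard_normal((d_a, d_h)) * 0.1
v_a = rng.standard_normal(d_a) * 0.1

def context_vector(s_prev, h):
    """c_t = sum_j alpha_tj * h_j, with alpha_tj = softmax_j(e_tj), e_tj = a(s_{t-1}, h_j)."""
    e = np.array([v_a @ np.tanh(U_s @ s_prev + U_h @ h_j) for h_j in h])
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                     # softmax over the T source positions
    return alpha @ h, alpha                  # context vector c_t and the weights alpha_t
```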
S160: in the decoding stage, the decoder predicts the target word y_t of the corresponding source word according to the context vector c_t and the currently predicted target words {y_1, y_2, …, y_{t-1}}, generating the target sentence y = <y_1, y_2, …, y_{T_y}> of the source sentence.
T_y indicates the number of target words contained in the target sentence.
In the decoding stage, the decoder may predict the target word y_t of the corresponding source word according to the context vector c_t and the currently predicted target words {y_1, y_2, …, y_{t-1}}, and the target sentence y = <y_1, y_2, …, y_{T_y}> of the source sentence can thereby be generated. That is, given the context vector c_t and all previously predicted target words y_1, y_2, …, y_{t-1}, the decoder can continue to predict y_t, thereby generating the target sentence of the source sentence.
In the embodiment of the invention, the encoder and the decoder both adopt RNN networks, mainly because the RNN networks have the following characteristics: the hidden state is determined by the current input and the last hidden state. For example, in the encoding stage, the hidden state is determined by the word vector of the current word at the source end and the previous hidden state, and in the decoding stage, the hidden state is determined by the word vector at the target end calculated in the previous step and the previous hidden state.
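The following sketch shows one decoding step under stated assumptions: a tanh cell stands in for f, a softmax readout for g, and the target vocabulary size V and all weights are illustrative.

```python
# A sketch of one decoding step of S160: s_t = f(y_{t-1}, s_{t-1}, c_t), then
# p(y_t | y_<t, c_t) = g(y_{t-1}, s_t, c_t). Cell choice, vocabulary size, and
# weights are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(3)
m_y, d_s, d_h, V = 4, 5, 14, 8               # target embedding, state, context, vocab sizes (assumed)
E_y  = rng.standard_normal((V, m_y)) * 0.1   # target-side word-vector dictionary
W_y  = rng.standard_normal((d_s, m_y)) * 0.1
W_ss = rng.standard_normal((d_s, d_s)) * 0.1
W_c  = rng.standard_normal((d_s, d_h)) * 0.1
W_o  = rng.standard_normal((V, m_y + d_s + d_h)) * 0.1

def decode_step(y_prev_id, s_prev, c_t):
    """Return p(y_t | y_<t, c_t) = g(y_{t-1}, s_t, c_t) and the new hidden state s_t."""
    y_prev = E_y[y_prev_id]
    s_t = np.tanh(W_y @ y_prev + W_ss @ s_prev + W_c @ c_t)   # s_t = f(y_{t-1}, s_{t-1}, c_t)
    logits = W_o @ np.concatenate([y_prev, s_t, c_t])          # g sees y_{t-1}, s_t, and c_t
    p = np.exp(logits - logits.max())
    p /= p.sum()                                               # softmax over the target vocabulary
    return p, s_t
```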
By applying the method provided by the embodiment of the invention, in the encoding stage, the encoder encodes the read source sentence to obtain the word vector sequence of the source sentence; the forward RNN of the encoder determines the forward vector sequence, and the reverse RNN determines the reverse vector sequence; the hidden vector sequence corresponding to the source sentence is determined according to the forward vector sequence and the reverse vector sequence, the vector containing the context information corresponding to each source word being the connection of the forward hidden state, the reverse hidden state, and the word vector of that source word; the context vector can then be obtained by using the attention network according to the hidden vector sequence; and, in the decoding stage, the decoder predicts the target word of the corresponding source word according to the context vector and the currently predicted target words, thereby generating the target sentence of the source sentence. The method shortens the information channel between the source-end word vectors and the target-end word vectors, strengthens the connection and mapping between the word vectors, enhances the performance of the translation system, and improves the translation quality.
In the embodiment of the present invention, the probability p(y) of generating the target sentence y of the source sentence is:
p(y) = ∏_{t=1}^{T_y} p(y_t | {y_1, y_2, …, y_{t-1}}, c_t)
wherein p(y_t | {y_1, y_2, …, y_{t-1}}, c_t) = g(y_{t-1}, s_t, c_t); g is a nonlinear activation function, and specifically a softmax function may be adopted; s_t is a hidden state in the decoder RNN.
In the conventional NMT model, s_t = f(y_{t-1}, s_{t-1}, c_t).
In one embodiment of the present invention, s_t = f(y_{t-1}, s_{t-1}, c_t, x_{t*}); wherein t* = argmax_j(α_{tj}), and α_{tj} are the weights calculated by the attention network in the conventional NMT model.
From the weights α_{tj} calculated by the attention network in the conventional NMT model, it can be seen that when the NMT model generates the current target word y_t, it mainly uses the information of the source word x_{t*}. Following this principle, the embodiment of the present invention uses x_{t*} to enhance the target-end hidden state s_t, and uses it to predict the target word y_t to be generated. That is, x_{t*} and y_t are connected through the target hidden state s_t; this may be called the target-end state connection model. A schematic diagram of the NMT model fusing the target-end hidden layer state connection model is shown in fig. 3. In this NMT model, it can be seen from the attention weight information that the current target word y_t is generated mainly using the information of the source word x_{t*}; the information of x_{t*} is fused into the formula for computing the hidden state s_t, and the connection between the words x_{t*} and y_t is made through the hidden state s_t.
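Continuing the decode_step sketch above (reusing its E_y, W_y, W_ss, W_c, and W_o), the sketch below adds the target-end state connection: the source word vector x_{t*} selected by the largest attention weight enters the state update. The extra input path W_x is an assumption of the sketch.

```python
# A sketch of the target-end state connection model, continuing decode_step
# above: s_t = f(y_{t-1}, s_{t-1}, c_t, x_{t*}). The matrix W_x is an assumed
# way of feeding x_{t*} into the state update.
import numpy as np

rng = np.random.default_rng(4)
W_x = rng.standard_normal((5, 4)) * 0.1      # maps the m-dim x_{t*} into the d_s-dim state (assumed)

def decode_step_connected(y_prev_id, s_prev, c_t, x, alpha):
    """As decode_step, but the update also sees x_{t*}, where t* = argmax_j(alpha_tj)."""
    t_star = int(np.argmax(alpha))           # the source position attended to most strongly
    y_prev = E_y[y_prev_id]
    s_t = np.tanh(W_y @ y_prev + W_ss @ s_prev + W_c @ c_t + W_x @ x[t_star])
    logits = W_o @ np.concatenate([y_prev, s_t, c_t])
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return p, s_t
```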
In one embodiment of the invention, after the target sentence y = <y_1, y_2, …, y_{T_y}> of the source sentence is generated, the method may further comprise the following steps:
Step one: determining a training set D = {(x_n, y_n)}_{n=1}^N; wherein N represents the number of sentence pairs contained in the corpus, and (x_n, y_n) represents a sentence pair;
step two: and based on the training set, carrying out model training according to a preset target training function.
In practical application, the model is generally trained iteratively by taking the minimized negative log-likelihood as the loss function and stochastic gradient descent as the training method. After the training set D = {(x_n, y_n)}_{n=1}^N is determined, model training can be performed on it according to a preset target training function.
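As a sketch of this procedure, the loop below performs stochastic gradient descent over the sentence pairs; the `gradient` callable stands in for backpropagation through the full NMT network and is an assumption of the sketch, not an API of the embodiment.

```python
# A sketch of the iterative training described above: stochastic gradient
# descent on the negative log-likelihood over the training set D. `gradient`
# is a placeholder for backpropagation through the full NMT network.
import random

def sgd_train(corpus, theta, gradient, lr=0.1, epochs=5):
    """corpus: list of sentence pairs (x_n, y_n); theta: dict of parameter arrays."""
    for _ in range(epochs):
        random.shuffle(corpus)
        for x_n, y_n in corpus:              # one stochastic update per sentence pair
            grad = gradient(theta, x_n, y_n) # d(-log p(y_n | x_n; theta)) / d(theta)
            for name in theta:
                theta[name] -= lr * grad[name]
    return theta
```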
In one embodiment of the present invention, the target training function may be:
L(θ) = -Σ_{n=1}^N log p(y_n | x_n; θ)
In another embodiment of the present invention, the target training function may be:
L(θ, w) = Σ_{n=1}^N [ -log p(y_n | x_n; θ) + l_emb(x_n, y_n; w) ]
wherein l_emb is the word vector loss function (word embedding loss), and w is the transformation matrix.
In the embodiment of the invention, the source-end word vectors and the target-end word vectors are connected through the transformation matrix w, which reduces the gap between the word vectors at the two ends, so that the word vectors learned at the two ends by the NMT model can be converted into each other. If a source word x_{t*} corresponds to the target word y_t, the transformation matrix w can reduce the gap between x_{t*} and y_t. This model may be referred to as the direct connection model.
The direct connection model is an extension of the source hidden state connection model: a transformation matrix is added on the basis of the source hidden layer state connection model to reduce the gap between the word vectors at the two ends. A schematic diagram of the NMT model fusing the direct connection model is shown in fig. 4. In this model, it can be seen from the attention weight information that the target word y_t is generated mainly using the source word x_{t*} (obtained in the same manner as in the target-end hidden state connection model); the mapping between the source word x_{t*} and the target word y_t is performed through the transformation matrix w, reducing the gap between the two words.
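A sketch of a word vector loss that the transformation matrix w could realize follows; the squared Euclidean distance between w·x_{t*} and the target word vector is an assumed concrete form, since the text only states that w reduces the gap between the two word vectors.

```python
# A sketch of the word vector loss (word embedding loss) of the direct
# connection model. The squared Euclidean distance is an assumed concrete
# form of "reducing the gap" between w @ x_{t*} and the word vector of y_t.
import numpy as np

def word_vector_loss(w, x_stars, y_vecs):
    """l_emb = sum_t || w @ x_{t*} - y_t ||^2 over the target positions of one pair."""
    return float(sum(np.sum((w @ x_s - y_v) ** 2) for x_s, y_v in zip(x_stars, y_vecs)))

# Per the second target training function, this term would be added to the
# negative log-likelihood of each sentence pair during training.
```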
Corresponding to the above method embodiment, an embodiment of the present invention further provides a neural machine translation device based on a word vector connection technology; the device described below and the neural machine translation method based on a word vector connection technology described above may be referred to correspondingly.
Referring to fig. 5, the apparatus may include the following modules:
a word vector sequence obtaining module 510, configured to, in the encoding stage, have the encoder encode the read source sentence to obtain the word vector sequence x = <x_1, x_2, …, x_j, …, x_T> of the source sentence; wherein x_j is the word vector of the jth source word in the source sentence, and T represents the number of source words contained in the source sentence;
a forward vector sequence determination module 520, configured to have the forward recurrent neural network (RNN) of the encoder determine, from the word vector sequence, the forward vector sequence <h→_1, h→_2, …, h→_T> composed of hidden vectors; wherein h→_j = f(x_j, h→_{j-1}) is the forward hidden layer state of the jth source word, and f is a nonlinear activation function;
a reverse vector sequence determination module 530, configured to have the reverse RNN of the encoder determine, from the word vector sequence, the reverse vector sequence <h←_1, h←_2, …, h←_T> composed of hidden vectors; wherein h←_j = f(x_j, h←_{j+1}) is the reverse hidden layer state of the jth source word;
a hidden vector sequence determining module 540, configured to determine the hidden layer vector sequence <h_1, h_2, …, h_j, …, h_T> corresponding to the source sentence according to the forward vector sequence and the reverse vector sequence; wherein h_j = [h→_j; h←_j; x_j] is the vector containing the context information corresponding to each source word in the source sentence;
a context vector obtaining module 550, configured to obtain the context vector c_t = q({h_1, h_2, …, h_j, …, h_T}) by using an attention network according to the hidden vector sequence; wherein q is a nonlinear activation function;
a target sentence generation module 560, configured to, in the decoding stage, have the decoder predict the target word y_t of the corresponding source word according to the context vector c_t and the currently predicted target words {y_1, y_2, …, y_{t-1}}, and generate the target sentence y = <y_1, y_2, …, y_t, …, y_{T_y}> of the source sentence; T_y indicates the number of target words contained in the target sentence.
By applying the device provided by the embodiment of the invention, in the encoding stage, the encoder encodes the read source sentence to obtain the word vector sequence of the source sentence; the forward RNN of the encoder determines the forward vector sequence, and the reverse RNN determines the reverse vector sequence; the hidden vector sequence corresponding to the source sentence is determined according to the forward vector sequence and the reverse vector sequence, the vector containing the context information corresponding to each source word being the connection of the forward hidden state, the reverse hidden state, and the word vector of that source word; the context vector can then be obtained by using the attention network according to the hidden vector sequence; and, in the decoding stage, the decoder predicts the target word of the corresponding source word according to the context vector and the currently predicted target words, thereby generating the target sentence of the source sentence. The device shortens the information channel between the source-end word vectors and the target-end word vectors, strengthens the connection and mapping between the word vectors, enhances the performance of the translation system, and improves the translation quality.
In one embodiment of the present invention, the probability p(y) of generating the target sentence y of the source sentence is:
p(y) = ∏_{t=1}^{T_y} p(y_t | {y_1, y_2, …, y_{t-1}}, c_t)
wherein p(y_t | {y_1, y_2, …, y_{t-1}}, c_t) = g(y_{t-1}, s_t, c_t), g is a nonlinear activation function, and s_t is a hidden state in the decoder RNN.
In one embodiment of the present invention, s_t = f(y_{t-1}, s_{t-1}, c_t, x_{t*}); wherein t* = argmax_j(α_{tj}), and α_{tj} are the weights calculated by the attention network in the conventional NMT model.
In an embodiment of the present invention, the device further includes a training module, configured to:
after the target sentence y = <y_1, y_2, …, y_{T_y}> of the source sentence is generated, determine a training set D = {(x_n, y_n)}_{n=1}^N; wherein N represents the number of sentence pairs contained in the training corpus, and (x_n, y_n) represents a sentence pair;
and based on the training set, performing model training according to a preset target training function.
In one embodiment of the present invention, the target training function is:
L(θ, w) = Σ_{n=1}^N [ -log p(y_n | x_n; θ) + l_emb(x_n, y_n; w) ]
wherein l_emb is the word vector loss function (word embedding loss), and w is the transformation matrix.
The embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same or similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), flash memory, Read-Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The principle and the embodiment of the present invention are explained by applying specific examples, and the above description of the embodiments is only used to help understanding the technical solution and the core idea of the present invention. It should be noted that, for those skilled in the art, without departing from the principle of the present invention, it is possible to make various improvements and modifications to the present invention, and those improvements and modifications also fall within the scope of the claims of the present invention.