CN110263352B - Method and device for training deep neural machine translation model - Google Patents


Info

Publication number
CN110263352B
CN110263352B
Authority
CN
China
Prior art keywords
representation
training
network
layer
machine translation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910528250.5A
Other languages
Chinese (zh)
Other versions
CN110263352A (en)
Inventor
黄辉
刘学博
周沁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Um Zhuhai Research Institute
University of Macau
Original Assignee
Um Zhuhai Research Institute
University of Macau
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Um Zhuhai Research Institute, University of Macau filed Critical Um Zhuhai Research Institute
Priority to CN201910528250.5A priority Critical patent/CN110263352B/en
Publication of CN110263352A publication Critical patent/CN110263352A/en
Application granted granted Critical
Publication of CN110263352B publication Critical patent/CN110263352B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Abstract

The embodiment of the invention provides a method and a device for training a deep neural machine translation model, wherein the method comprises the following steps: obtaining a first joint input representation from a training sample, wherein the training sample comprises a source sentence and a target sentence; inputting the first joint input representation into M layers of training networks connected in sequence to obtain a final output representation, wherein each layer of the training network comprises a self-crossing attention network and a feed-forward network connected in sequence; and updating the model training parameters with a back propagation algorithm according to the final output representation and the target sentence. By training the deep neural machine translation model with M sequentially connected layers, each composed of a self-crossing attention network and a feed-forward network, the method and the device provide a smooth gradient flow, make the training of deep neural machine translation models feasible, and thereby improve the translation quality of the neural machine translation model.

Description

Method and device for training deep neural machine translation model
Technical Field
The invention relates to the technical field of natural language processing, in particular to a method and a device for training a deep neural machine translation model.
Background
Machine translation refers to the process of automatically converting one natural language (hereinafter, the source language) into another natural language (hereinafter, the target language) by computer technology. With the development of deep learning, neural machine translation has become the new generation of machine translation technology; in particular, the Transformer framework proposed by Google in 2017 is currently the most popular and most effective neural machine translation framework.
The Transformer architecture consists of an encoder, which maps sentences in the source language (hereinafter, source sentences) to hidden representations, and a decoder, which reads these hidden representations and generates sentences in the target language (hereinafter, target sentences). The Transformer architecture trains the model with a back propagation algorithm, in which the error signal must pass through the whole encoder. As a result, the Transformer architecture cannot be used to build a deep neural machine translation model, such as one with more than 12 layers, because increasing the depth of the encoder leads to an unstable gradient flow.
The deep neural model has shown powerful capability in computer vision and other natural language processing tasks, but due to limitations of the Transformer architecture, the neural machine translation field still cannot enjoy many advantages brought by the deep neural model so far, such as better translation effect.
In summary, providing a framework capable of training deep neural machine translation models has become an important research topic in the field of neural machine translation.
Disclosure of Invention
To solve the problems in the prior art, embodiments of the present invention provide a method and an apparatus for training a deep neural machine translation model.
In a first aspect, an embodiment of the present invention provides a method for training a deep neural machine translation model, including: obtaining a first joint input representation from a training sample, wherein the training sample comprises a source sentence and a target sentence; inputting the first joint input representation into M layers of training networks connected in sequence to obtain a final output representation, wherein each layer of the training network comprises a self-crossing attention network and a feed-forward network connected in sequence; and updating the model training parameters with a back propagation algorithm according to the final output representation and the target sentence.
In a second aspect, an embodiment of the present invention provides an apparatus for training a deep neural machine translation model, including: a joint input representation acquisition module, configured to obtain a first joint input representation from a training sample, wherein the training sample comprises a source sentence and a target sentence; an output representation acquisition module, configured to input the first joint input representation into M layers of training networks connected in sequence to obtain a final output representation, wherein each layer of the training network comprises a self-crossing attention network and a feed-forward network connected in sequence; and a model training parameter updating module, configured to update the model training parameters with a back propagation algorithm according to the final output representation and the target sentence.
In a third aspect, an embodiment of the present invention provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and the processor implements the steps of the method provided in the first aspect when executing the program.
In a fourth aspect, an embodiment of the present invention provides a non-transitory computer readable storage medium, on which a computer program is stored, which when executed by a processor, implements the steps of the method as provided in the first aspect.
According to the method and the device for training the deep neural machine translation model provided by the embodiment of the invention, the deep neural machine translation model is trained with M sequentially connected layers of training networks, each formed by a self-crossing attention network and a feed-forward network. This provides a smooth gradient flow, makes the training of the deep neural machine translation model feasible, and improves the translation quality of the neural machine translation model.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a flow chart of a method for training a deep neural machine translation model according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an architecture of a deep neural machine translation model in a method for training the deep neural machine translation model according to an embodiment of the present invention;
FIG. 3 is a schematic diagram illustrating a process of obtaining a first joint input representation in a method for training a deep neural machine translation model according to an embodiment of the present invention;
fig. 4 is a schematic diagram of a mask matrix of a deep neural machine translation model in a method for training the deep neural machine translation model according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of an apparatus for training a deep neural machine translation model according to an embodiment of the present invention;
fig. 6 is a schematic physical structure diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a flowchart of a method for training a deep neural machine translation model according to an embodiment of the present invention. Fig. 2 is a schematic diagram of an architecture of a deep neural machine translation model in a method for training the deep neural machine translation model according to an embodiment of the present invention. As shown in fig. 1 and fig. 2, the method includes:
101, obtaining a first combined input representation according to a training sample, wherein the training sample comprises a source sentence and a target sentence;
the training samples may be taken from a parallel corpus in a training set. The parallel corpus is composed of a plurality of sentences which have the same meaning and belong to different languages, such as a Chinese sentence "Beijing welcome you! "corresponds to the English sentence" welome to Beijing! ", then" Beijing welcome you! "is the Source sentence," Welcome to Beijing! "is the target statement.
Each training sample comprises a source sentence and the corresponding target sentence, and a training set contains a large number of training samples. For each training sample, the training method provided by the embodiment of the invention can be used to train the deep neural machine translation model; after training on a large number of samples, an optimized deep neural machine translation model is obtained.
The apparatus for training the deep neural machine translation model derives a first joint input representation from the training samples, i.e., the first joint input representation is derived from both the source sentence and the target sentence in the training samples.
102, inputting the first joint input representation into M layers of training networks connected in sequence to obtain a final output representation; wherein each layer of the training network comprises a self-crossing attention network and a feedforward network which are connected in sequence;
and inputting the first joint input representation into M layers of training networks which are connected in sequence to obtain a final output representation. As shown in fig. 2, each layer of the training network includes a self-crossing attention network and a feedforward network connected in sequence. According to the input direction represented by the first joint input, in each layer of the training network, the self-crossing attention network is in front of the feedforward network. I.e. the first joint input represents the self-crossing attention network input to the training network of the first layer, the feed forward network of each layer connecting the self-crossing attention network of the present layer and the self-crossing attention network of the next layer. The input of the feedforward network of each layer is the output of the self-crossing attention network of the layer, and the output of the feedforward network of each layer is the input of the self-crossing attention network of the next layer.
Therefore, after the first joint input representation is input into the self-crossing attention network of the first layer, the final output representation can be obtained after the processing of the training network of M layers.
Where M also refers to the number of layers of the deep neural machine translation model.
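This layer chaining can be expressed as a plain loop; the names `attention` and `feed_forward` are hypothetical stand-ins for each layer's self-crossing attention network and feed-forward network, so this is an illustrative sketch of the wiring, not the patented implementation:

```python
def encode(H0, layers):
    """Pass a joint input representation through M sequentially connected
    training layers and return the final output representation."""
    H = H0
    for attention, feed_forward in layers:  # one (attention, FFN) pair per layer
        H = feed_forward(attention(H))      # each FFN output feeds the next layer's attention
    return H
```

Each layer has the same structure but its own parameters, so `layers` would hold M distinct (attention, feed-forward) pairs.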
And 103, updating model training parameters by using a back propagation algorithm according to the final output representation and the target statement.
The target sentence is predicted from the output representation, and the model training parameters are updated with a back propagation algorithm so as to maximize the probability of the target sentence; this completes the training of the deep neural machine translation model on this training sample.
Compared with the prior art, the training process of this method has a smoother gradient flow than the currently best Transformer architecture. With it, a deep (30-layer) neural machine translation model has been trained successfully, achieving better translation quality than the currently most popular Transformer-based neural machine translation models, which shows great potential for large-scale commercialization.
The embodiment of the invention trains the deep neural machine translation model with M sequentially connected layers of training networks, each consisting of a self-crossing attention network and a feed-forward network; this yields a smooth gradient flow, makes the training of the deep neural machine translation model feasible, and improves the translation quality of the neural machine translation model.
Fig. 3 is a schematic diagram of the process of obtaining the first joint input representation in the method for training the deep neural machine translation model according to the embodiment of the present invention. As shown in fig. 3, obtaining the first joint input representation from the training sample specifically includes: preprocessing the source sentence and the target sentence, wherein the preprocessing comprises word segmentation; mapping each subword in the source sentence and the target sentence to a word vector; and adding each word vector to the corresponding language vector and position vector to obtain the first joint input representation.
The source sentence and the target sentence are preprocessed; the preprocessing comprises word segmentation and may also comprise normalization such as case conversion. For example, the Chinese sentence "北京欢迎你！" is segmented into the four subwords "北京" ("Beijing"), "欢迎" ("welcome"), "你" ("you") and "！", and the English sentence "Welcome to Beijing!" is segmented into the five subwords "welcome", "to", "Bei@@", "jing" and "!".
And mapping each subword in the source sentence and the target sentence into a word vector respectively, and adding the word vector with the corresponding language vector and the corresponding position vector respectively to obtain the first joint input representation.
Given a source sentence $x = (x_1, \ldots, x_i, \ldots, x_I)$ with $I$ subwords and a target sentence $y = (y_1, \ldots, y_j, \ldots, y_J)$ with $J$ subwords, each subword is first mapped to a word vector, giving $X = (x_1, \ldots, x_i, \ldots, x_I)$ and $Y = (y_1, \ldots, y_j, \ldots, y_J)$.
Then each word vector is added to the corresponding language vector $L$ and position vector $P$ to obtain the joint input representation $H^0$ of the model:

$$H^0 = \left(x_1 + L_1 + P^x_1,\ \ldots,\ x_I + L_1 + P^x_I,\ y_0 + L_2 + P^y_0,\ y_1 + L_2 + P^y_1,\ \ldots,\ y_{J-1} + L_2 + P^y_{J-1}\right)$$

where $H^0$ is the first joint input representation and its entries are the individual input vectors; $X = (x_1, \ldots, x_i, \ldots, x_I)$ is the set of word vectors of the $I$ subwords of the source sentence; $y_0$ is an all-zero vector; $y_1, \ldots, y_{J-1}$ are the word vectors of the first $J-1$ of the $J$ subwords of the target sentence; $L_1$ is the language vector of the source sentence; $L_2$ is the language vector of the target sentence; $P^x_1, \ldots, P^x_I$ are the position vectors of $x_1, \ldots, x_I$; and $P^y_0, P^y_1, \ldots, P^y_{J-1}$ are the position vectors of $y_0, y_1, \ldots, y_{J-1}$.
On the basis of the embodiment, each subword in the segmented source sentence and target sentence is mapped into a word vector, and the word vectors are added with the corresponding language vectors and position vectors to obtain a first combined input representation, so that a basis is provided for training of a deep neural machine translation model.
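As a concrete illustration of the construction above, the following minimal NumPy sketch builds a joint input representation. The hidden dimension, sentence lengths, random embeddings, and the sinusoidal form of the position vectors are illustrative assumptions, not values prescribed by the patent:

```python
import numpy as np

d = 8          # hidden dimension (illustrative)
I, J = 4, 5    # source / target subword counts (illustrative)
rng = np.random.default_rng(0)

# Word vectors for the I source subwords and the J target subwords.
X = rng.normal(size=(I, d))
Y = rng.normal(size=(J, d))

# Language vectors: L1 for the source language, L2 for the target language.
L1 = rng.normal(size=(d,))
L2 = rng.normal(size=(d,))

def position_vectors(n, d):
    """Sinusoidal position vectors, one row per position (an assumed choice)."""
    pos = np.arange(n)[:, None]
    dim = np.arange(d)[None, :]
    angle = pos / np.power(10000.0, (2 * (dim // 2)) / d)
    return np.where(dim % 2 == 0, np.sin(angle), np.cos(angle))

# Target side is shifted right: y_0 is an all-zero vector, y_J is dropped.
Y_shifted = np.vstack([np.zeros((1, d)), Y[:-1]])

# H0 concatenates source and shifted target, each plus its
# language vector and position vector.
H0 = np.vstack([
    X + L1 + position_vectors(I, d),
    Y_shifted + L2 + position_vectors(J, d),
])
assert H0.shape == (I + J, d)
```

The first I rows carry the source sentence and the remaining J rows carry the shifted target sentence, matching the concatenated form of $H^0$ above.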
Further, based on the above embodiment, inputting the first joint input representation into the M layers of training networks connected in sequence to obtain the final output representation specifically includes: taking the first joint input representation as the input of the first layer of the training network, i.e., feeding it to the first self-crossing attention network; performing, in each layer of the training network, a preset processing that yields the output representation of that layer, with the output representation of each layer serving as the joint input representation of the next layer; and finally, after the preset processing of all M layers, obtaining the final output representation. For each layer of the training network, the preset processing specifically includes: after receiving the joint input representation of the corresponding layer, the self-crossing attention network derives a query representation, a key representation and a value representation from the joint input representation; the query, key and value representations are combined by a dot-product attention mechanism to obtain an intermediate representation; and the intermediate representation is processed by the feed-forward network connected to the self-crossing attention network to obtain the output representation of the layer.
After the first joint input representation $H^0$ is obtained, the self-crossing attention network of the first layer first converts it into the query representation $Q^1$, the key representation $K^1$ and the value representation $V^1$ of the first layer:

$$Q^1 = H^0 W_Q^1, \qquad K^1 = H^0 W_K^1, \qquad V^1 = H^0 W_V^1$$

where $W_Q^1$, $W_K^1$ and $W_V^1$ are all model training parameters of the first layer of the training network.
Then the dot-product attention mechanism is applied to the query representation $Q^1$, the key representation $K^1$ and the value representation $V^1$ to obtain the intermediate representation $H^{1'}$ of the first layer:

$$H^{1'} = \mathrm{softmax}\!\left(\frac{Q^1 (K^1)^{\top}}{\sqrt{d}} + B\right) V^1$$

where $d$ is the dimension of the hidden representations of the deep neural machine translation model, $B$ is the mask matrix of the deep neural machine translation model, and softmax is the normalized exponential function.
Fig. 4 is a schematic diagram of the mask matrix of the deep neural machine translation model in the method for training the deep neural machine translation model according to an embodiment of the present invention. The mask matrix has two effects: first, it makes each source-side representation attend only to the representations of the source side; second, it makes each target-side representation attend only to the representations of the source side and to the representations of target words that have already been generated, but not to representations of target words that have not yet been generated. As shown in fig. 4, nodes drawn with dashed lines indicate that the connection from the corresponding horizontal-axis node to the vertical-axis node is masked. In fig. 4, the horizontal axis carries the input representation of each word vector, and the vertical axis carries the intermediate representation of each word vector.
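A minimal sketch of such a mask matrix, with 0 for allowed connections and a large negative constant for masked ones, so that the softmax in the attention formula drives the masked weights to zero. The constant -1e9 is a common convention assumed here, not a value taken from the patent:

```python
import numpy as np

def build_mask(I, J, neg=-1e9):
    """Mask for a joint sequence of I source and J target positions.

    Source positions attend to all source positions; target position t
    attends to all source positions and to target positions <= t.
    """
    n = I + J
    B = np.full((n, n), neg)
    B[:I, :I] = 0.0                  # source -> all source positions
    B[I:, :I] = 0.0                  # target -> all source positions
    tgt = np.tril(np.ones((J, J)))   # target -> already generated targets only
    B[I:, I:][tgt == 1] = 0.0
    return B
```

Adding this matrix inside the softmax leaves allowed attention weights unchanged and suppresses masked ones.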
After the intermediate representation $H^{1'}$ is obtained, it is processed by the feed-forward network of the first layer, yielding the output representation $H^1$ of the first layer:

$$H^1 = \max\!\left(0,\ H^{1'} W_1^1 + b_1^1\right) W_2^1 + b_2^1$$

where $W_1^1$, $b_1^1$, $W_2^1$ and $b_2^1$ are all model training parameters of the first layer of the training network.
The output representation $H^1$ of the first layer is also the input representation of the second layer; every layer has the same network structure but its own model training parameters. After the joint input representation $H^0$ has traversed all $M$ layers, the model obtains the final output representation $H^M$.
In general, for the training network at layer $m$ ($1 \le m \le M$), the query, key and value representations are:

$$Q^m = H^{m-1} W_Q^m, \qquad K^m = H^{m-1} W_K^m, \qquad V^m = H^{m-1} W_V^m$$

where $Q^m$, $K^m$ and $V^m$ are the query, key and value representations of the $m$-th layer; $W_Q^m$, $W_K^m$ and $W_V^m$ are all model training parameters of the $m$-th layer; and $H^{m-1}$ is the $m$-th joint input representation, i.e., the joint input representation of the $m$-th layer of the training network.

The intermediate representation is:

$$H^{m'} = \mathrm{softmax}\!\left(\frac{Q^m (K^m)^{\top}}{\sqrt{d}} + B\right) V^m$$

where $H^{m'}$ is the intermediate representation of the $m$-th layer; $d$ is the dimension of the hidden representations of the deep neural machine translation model; $B$ is the mask matrix of the deep neural machine translation model; and softmax is the normalized exponential function.

The output representation is:

$$H^m = \max\!\left(0,\ H^{m'} W_1^m + b_1^m\right) W_2^m + b_2^m$$

where $H^m$ is the output representation of the $m$-th layer; $W_1^m$, $b_1^m$, $W_2^m$ and $b_2^m$ are all model training parameters of the $m$-th layer; and $\max$ takes the element-wise maximum of $0$ and its argument. For example, if $H^{m'} W_1^m + b_1^m = [-1, 2, 3, -4]$, the output of the max function is $[0, 2, 3, 0]$.
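Putting the per-layer formulas together, a single training-network layer can be sketched as follows (single attention head, plain NumPy; the parameter names and shapes are illustrative assumptions, not the patented implementation):

```python
import numpy as np

def softmax(z):
    """Row-wise normalized exponential function."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_forward(H_prev, params, B, d):
    """One self-crossing-attention + feed-forward layer.

    H_prev: joint input representation H^{m-1}, shape (n, d).
    params: this layer's training parameters (hypothetical key names).
    B: mask matrix, shape (n, n).
    """
    # Query, key and value representations of this layer.
    Q = H_prev @ params["WQ"]
    K = H_prev @ params["WK"]
    V = H_prev @ params["WV"]
    # Scaled dot-product attention, with the mask added before the softmax.
    H_mid = softmax(Q @ K.T / np.sqrt(d) + B) @ V
    # Feed-forward network with the element-wise max(0, .) activation.
    return np.maximum(0.0, H_mid @ params["W1"] + params["b1"]) @ params["W2"] + params["b2"]
```

Stacking M such layers, each with its own `params` and with each layer's output as the next layer's input, yields the final output representation $H^M$.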
On the basis of the above embodiment, the embodiment of the invention derives the query, key and value representations from the joint input representation, obtains the intermediate representation and then the output representation, and advances layer by layer to the final output representation, thereby improving the accuracy of model training.
Further, based on the above embodiment, updating the model training parameters with a back propagation algorithm according to the final output representation and the target sentence specifically includes: updating the model training parameters with a back propagation algorithm according to the final output representations corresponding to the target sentence, so that the probability that the predicted values are the word vectors of the target sentence is maximized.
As shown in fig. 3, $y_0$ is an all-zero vector used to obtain $H^0_{I+1}$ (the joint input representation corresponding to $y_0$), from which the model finally obtains $H^M_{I+1}$ (the output representation corresponding to $y_0$), which is used to predict the first word $y_1$ of the target sentence. Similarly, $y_1$ is used to obtain $H^0_{I+2}$ (the joint input representation corresponding to $y_1$), from which the model finally obtains $H^M_{I+2}$ (the output representation corresponding to $y_1$), which is used to predict the second word $y_2$ of the target sentence, and so on, until $y_{J-1}$ obtains its corresponding output representation $H^M_{I+J}$, which is used to predict the last word $y_J$ of the target sentence.

Thus, the apparatus for training the deep neural machine translation model predicts all positions synchronously from the slice matrix $\left(H^M_{I+1}, \ldots, H^M_{I+J}\right)$ of $H^M$ and updates all parameters of the model with a back propagation algorithm so that the probability that the predicted values equal $(y_1, \ldots, y_J)$ is maximized; at this point, the training of the deep neural machine translation model on this training sample is finished.

Here $H^M$ is the output representation of the $M$-th layer of the training network, and $\left(H^M_{I+1}, \ldots, H^M_{I+J}\right)$ is the final output representation corresponding to the target sentence.
On the basis of the above embodiment, the embodiment of the invention updates the model training parameters with a back propagation algorithm according to the final output representations corresponding to the target sentence, so that the probability that the predicted values are the word vectors of the target sentence is maximized; this ensures that optimized model training parameters are obtained and further improves the accuracy of model training.
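Maximizing the probability that the predicted values are the target words is equivalent to minimizing a cross-entropy loss over the target-side slice of the final output representation. A minimal sketch under the assumption of a softmax output projection (the name `W_out` and the vocabulary handling are hypothetical, not from the patent):

```python
import numpy as np

def softmax(z):
    """Row-wise normalized exponential function."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def target_slice_loss(H_M, W_out, target_ids, I):
    """Cross-entropy over the target-side slice (H^M_{I+1}, ..., H^M_{I+J}).

    H_M: final output representation, shape (I + J, d).
    W_out: assumed output projection to vocabulary logits, shape (d, vocab).
    target_ids: the J gold subword ids (y_1, ..., y_J).
    """
    logits = H_M[I:] @ W_out            # only target positions predict words
    probs = softmax(logits)
    J = len(target_ids)
    # Mean negative log-likelihood of the gold words; back-propagating this
    # loss is what updates all model training parameters.
    return -np.log(probs[np.arange(J), target_ids]).mean()
```

Minimizing this loss with gradient descent maximizes the probability that the predicted values equal $(y_1, \ldots, y_J)$.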
Further, based on the above embodiment, the method further includes: translating the source language into the target language with a beam-search-based decoding method, using the trained deep neural machine translation model.
During decoding, at each step the several target words with the highest probability are kept as the candidate set, and their probability values are used as the scores of the current words; after every beam has generated a complete sentence, the target sentence in the beam with the highest score is selected as the final translation result.
On the basis of the above embodiment, the embodiment of the invention translates the source language into the target language with a beam-search-based decoding method, thereby improving translation accuracy.
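A minimal sketch of beam-search decoding as described above; `step_log_probs`, the beam size, and the end-of-sentence id are hypothetical placeholders standing in for the trained model's next-word prediction:

```python
import math

def beam_search(step_log_probs, beam_size, eos_id, max_len):
    """Keep the beam_size highest-scoring partial target sentences;
    return the finished hypothesis with the best score.

    step_log_probs(words) -> list of (next_word_id, log_probability).
    """
    beams = [([], 0.0)]                     # (target words so far, score)
    finished = []
    for _ in range(max_len):
        candidates = []
        for words, score in beams:
            for word, logp in step_log_probs(words):  # most probable next words
                candidates.append((words + [word], score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for words, score in candidates[:beam_size]:
            # A beam that generated end-of-sentence is a complete sentence.
            (finished if words[-1] == eos_id else beams).append((words, score))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])
```

Here the score of a hypothesis is the sum of the log probabilities of its words, so the highest-scoring finished beam corresponds to the most probable target sentence found.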
Fig. 5 is a schematic structural diagram of an apparatus for training a deep neural machine translation model according to an embodiment of the present invention. As shown in fig. 5, the apparatus includes a joint input representation acquisition module 10, an output representation acquisition module 20, and a model training parameter updating module 30. The joint input representation acquisition module 10 is configured to obtain a first joint input representation from a training sample, wherein the training sample comprises a source sentence and a target sentence. The output representation acquisition module 20 is configured to input the first joint input representation into M layers of training networks connected in sequence to obtain a final output representation, wherein each layer of the training network comprises a self-crossing attention network and a feed-forward network connected in sequence. The model training parameter updating module 30 is configured to update the model training parameters with a back propagation algorithm according to the final output representation and the target sentence.
The embodiment of the invention trains the deep neural machine translation model with M sequentially connected layers of training networks, each consisting of a self-crossing attention network and a feed-forward network; this yields a smooth gradient flow, makes the training of the deep neural machine translation model feasible, and improves the translation quality of the neural machine translation model.
Further, based on the above embodiment, the joint input representation acquisition module 10 is specifically configured to: preprocess the source sentence and the target sentence, wherein the preprocessing comprises word segmentation; map each subword in the source sentence and the target sentence to a word vector; and add each word vector to the corresponding language vector and position vector to obtain the first joint input representation.
On the basis of the above embodiment, the embodiment of the invention maps each subword of the segmented source sentence and target sentence to a word vector and adds each word vector to the corresponding language vector and position vector to obtain the first joint input representation, thereby providing a foundation for training the deep neural machine translation model.
Further, based on the above embodiment, the output representation acquisition module 20 is specifically configured to: take the first joint input representation as the input of the first layer of the training network, i.e., feed it to the first self-crossing attention network; perform, in each layer of the training network, a preset processing that yields the output representation of that layer, with the output representation of each layer serving as the joint input representation of the next layer; and finally, after the preset processing of all M layers, obtain the final output representation. For each layer of the training network, the preset processing specifically includes: after receiving the joint input representation of the corresponding layer, the self-crossing attention network derives a query representation, a key representation and a value representation from the joint input representation; the query, key and value representations are combined by a dot-product attention mechanism to obtain an intermediate representation; and the intermediate representation is processed by the feed-forward network connected to the self-crossing attention network to obtain the output representation of the layer.
On the basis of the above embodiment, the embodiment of the invention obtains the query representation, the key representation and the value representation from the joint input representation, computes the intermediate representation and then the output representation, and advances layer by layer to the final output representation, thereby improving the accuracy of model training.
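The per-layer preset processing described above can be sketched as a single-head forward pass: project the joint input to query/key/value, apply scaled dot-product attention with a mask matrix B, then a two-layer feed-forward network. The parameter shapes, the ReLU activation, and the single-head simplification are illustrative assumptions, not the patented implementation.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def layer_forward(H_prev, Wq, Wk, Wv, W1, b1, W2, b2, B):
    d = H_prev.shape[-1]
    Q, K, V = H_prev @ Wq, H_prev @ Wk, H_prev @ Wv       # query/key/value
    H_mid = softmax(Q @ K.T / np.sqrt(d) + B) @ V         # dot-product attention
    return np.maximum(0.0, H_mid @ W1 + b1) @ W2 + b2     # feed-forward (ReLU)

rng = np.random.default_rng(1)
n, d, d_ff = 5, 8, 16
H0 = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
W1, b1 = rng.normal(size=(d, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d)), np.zeros(d)
B = np.zeros((n, n))  # no masking in this toy example
H1 = layer_forward(H0, Wq, Wk, Wv, W1, b1, W2, b2, B)
print(H1.shape)  # (5, 8): same shape, ready to feed the next layer
```

Because the output keeps the input's shape, stacking M such layers (feeding each layer's output as the next layer's joint input) is a simple loop.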
Further, based on the above embodiment, the updating of the model training parameters by using a back propagation algorithm according to the final output representation and the target sentence specifically includes: updating the model training parameters by using a back propagation algorithm according to the final output representation corresponding to the target sentence, so that the probability that the predicted value is the word vector of the target sentence is maximized.
On the basis of the above embodiment, the embodiment of the invention updates the model training parameters by back propagation so that the probability that the predicted value is the word vector of the target sentence is maximized, thereby improving the accuracy of model training.
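The update described above can be sketched in a hedged way: the final output representation at each target position is projected to vocabulary logits, and a parameter is adjusted by gradient descent so that the probability of the gold target subwords increases (equivalently, cross-entropy decreases). The output projection `W_out` and the single manually derived gradient stand in for a full back-propagation pass; all sizes and values are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def loss(H, W, targets):
    """Mean negative log-probability of the gold subwords (cross-entropy)."""
    p = softmax(H @ W)
    return -np.log(p[np.arange(len(targets)), targets]).mean()

rng = np.random.default_rng(2)
d, vocab = 8, 20
H_final = rng.normal(size=(3, d))      # representations at 3 target positions
targets = np.array([2, 7, 11])         # gold subword ids (toy values)
W_out = 0.1 * rng.normal(size=(d, vocab))

before = loss(H_final, W_out, targets)
for _ in range(200):
    probs = softmax(H_final @ W_out)
    grad = probs.copy()
    grad[np.arange(3), targets] -= 1.0        # d(cross-entropy)/d(logits)
    W_out -= 0.05 * (H_final.T @ grad) / 3    # gradient-descent step
after = loss(H_final, W_out, targets)
print(after < before)  # True: the target-word probability has been pushed up
```

In the full model, the same gradient flows back through all M layers to update the attention and feed-forward parameters as well.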
Further, based on the above embodiment, the method further includes: translating the source language into the target language by using a beam-search-based decoding method based on the deep neural machine translation model.
On the basis of the above embodiment, the embodiment of the invention translates the source language into the target language by using a beam-search-based decoding method, thereby improving translation accuracy.
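The beam-search decoding mentioned above can be sketched as follows. Here `toy_model`, its three-token vocabulary, and its probabilities are invented stand-ins for the trained model's next-subword distribution; only the search procedure itself is being illustrated.

```python
import math

def beam_search(step_log_probs, beam_size=2, max_len=3, eos=0):
    """Keep the beam_size highest-scoring partial translations at each step."""
    beams = [([], 0.0)]  # each hypothesis: (sequence, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == eos:
                candidates.append((seq, score))  # finished: carry forward as-is
                continue
            for tok, lp in step_log_probs(seq).items():
                candidates.append((seq + [tok], score + lp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams[0][0]

def toy_model(prefix):
    """Toy next-token log-probabilities conditioned on the last token."""
    table = {0: {1: math.log(0.6), 2: math.log(0.4)},
             1: {2: math.log(0.7), 0: math.log(0.3)},
             2: {0: math.log(0.9), 1: math.log(0.1)}}
    last = prefix[-1] if prefix else 0
    return table[last]

print(beam_search(toy_model))  # [1, 2, 0]
```

Note that the greedy path (always taking the locally best token) would also start with 1, but beam search additionally keeps the runner-up prefix [2] alive until it is outscored, which is what makes it more accurate than greedy decoding.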
The device provided by the embodiment of the invention is used to implement the above method; for its specific functions, reference may be made to the method flow described above, which is not repeated here.
Fig. 6 is a schematic physical structure diagram of an electronic device according to an embodiment of the present invention. As shown in fig. 6, the electronic device may include: a processor (processor) 810, a communication Interface 820, a memory 830 and a communication bus 840, wherein the processor 810, the communication Interface 820 and the memory 830 communicate with each other via the communication bus 840. The processor 810 may call logic instructions in the memory 830 to perform the following method: obtaining a first joint input representation according to a training sample, wherein the training sample comprises a source sentence and a target sentence; inputting the first joint input representation into M layers of training networks connected in sequence to obtain a final output representation, wherein each layer of the training network comprises a self-crossing attention network and a feedforward network which are connected in sequence; and updating model training parameters by using a back propagation algorithm according to the final output representation and the target sentence.
In addition, the logic instructions in the memory 830 may be implemented in the form of software functional units and, when sold or used as an independent product, stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention, in essence or the part contributing to the prior art, may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
In another aspect, an embodiment of the present invention further provides a non-transitory computer-readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the method provided in the foregoing embodiments, for example including: obtaining a first joint input representation according to a training sample, wherein the training sample comprises a source sentence and a target sentence; inputting the first joint input representation into M layers of training networks connected in sequence to obtain a final output representation, wherein each layer of the training network comprises a self-crossing attention network and a feedforward network which are connected in sequence; and updating model training parameters by using a back propagation algorithm according to the final output representation and the target sentence.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment may be implemented by software plus a necessary general hardware platform, or by hardware. Based on this understanding, the above technical solutions, in essence or the part contributing to the prior art, may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disk, and which includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute the method according to the various embodiments or parts thereof.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (7)

1. A method for training a deep neural machine translation model, comprising:
obtaining a first joint input representation according to a training sample, wherein the training sample comprises a source sentence and a target sentence;
inputting the first joint input representation into M layers of training networks connected in sequence to obtain a final output representation; wherein each layer of the training network comprises a self-crossing attention network and a feedforward network which are connected in sequence;
updating model training parameters by using a back propagation algorithm according to the final output representation and the target sentence;
inputting the first joint input representation into an M-layer training network connected in sequence to obtain a final output representation, specifically including:
taking the first joint input representation as the input of the first layer of the training network, that is, inputting the first joint input representation into the first self-crossing attention network; performing the preset processing in the training network of each layer to obtain the output representation of the corresponding layer, the output representation of the previous layer serving as the joint input representation of the next layer; and finally, after the preset processing of the M layers of training networks, obtaining the final output representation;
wherein, for each layer of the training network, the preset processing specifically includes:
after receiving the joint input representation of the corresponding layer, the self-crossing attention network obtains a query representation, a key representation and a value representation from the joint input representation;
calculating the query representation, the key representation and the value representation by using a dot-product attention mechanism to obtain an intermediate representation; and
processing the intermediate representation using the feed-forward network connected to the self-crossing attention network to obtain the output representation of the layer;
the expression of the first joint input representation is:

$H^{0} = (x_{1}+L_{1}+p^{x}_{1},\ \ldots,\ x_{I}+L_{1}+p^{x}_{I},\ y_{0}+L_{2}+p^{y}_{0},\ \ldots,\ y_{J-1}+L_{2}+p^{y}_{J-1})$

wherein $H^{0}$ is the first joint input representation and $x_{i}+L_{1}+p^{x}_{i}$ and $y_{j}+L_{2}+p^{y}_{j}$ are its constituent vectors; $X=(x_{1},\ldots,x_{i},\ldots,x_{I})$ is the set of word vectors of the $I$ subwords of the source sentence; $y_{0}$ is an all-zero vector; $y_{1},\ldots,y_{J-1}$ are the word vectors of the first $J-1$ of the $J$ subwords of the target sentence; $L_{1}$ is the language vector corresponding to the source sentence; $L_{2}$ is the language vector corresponding to the target sentence; $p^{x}_{1},\ldots,p^{x}_{I}$ are the position vectors of $x_{1},\ldots,x_{I}$; and $p^{y}_{0},p^{y}_{1},\ldots,p^{y}_{J-1}$ are the position vectors of $y_{0},y_{1},\ldots,y_{J-1}$;

for the training network at the $m$-th layer, the expressions for the query representation, the key representation and the value representation are:

$Q^{m} = H^{m-1}W^{m}_{Q}$

$K^{m} = H^{m-1}W^{m}_{K}$

$V^{m} = H^{m-1}W^{m}_{V}$

wherein $Q^{m}$, $K^{m}$ and $V^{m}$ are respectively the query representation, the key representation and the value representation of the training network at the $m$-th layer; $W^{m}_{Q}$, $W^{m}_{K}$ and $W^{m}_{V}$ are all model training parameters of the training network at the $m$-th layer; and $H^{m-1}$ is the $m$-th joint input representation, namely the joint input representation corresponding to the training network at the $m$-th layer;

the expression of the intermediate representation is:

$H^{m'} = \mathrm{softmax}\!\left(\dfrac{Q^{m}(K^{m})^{\mathsf{T}}}{\sqrt{d}} + B\right)V^{m}$

wherein $H^{m'}$ is the intermediate representation of the training network at the $m$-th layer; $d$ is the dimension of the hidden representation of the deep neural machine translation model; $B$ is the mask matrix of the deep neural machine translation model; and softmax is the normalized exponential function;

the expression of the output representation is:

$H^{m} = \max(0,\ H^{m'}W^{m}_{1}+b^{m}_{1})\,W^{m}_{2}+b^{m}_{2}$

wherein $H^{m}$ is the output representation of the training network at the $m$-th layer; $W^{m}_{1}$, $b^{m}_{1}$, $W^{m}_{2}$ and $b^{m}_{2}$ are all model training parameters of the training network at the $m$-th layer; $\max$ takes the element-wise maximum of $0$ and $H^{m'}W^{m}_{1}+b^{m}_{1}$; and $1 \le m \le M$.
2. The method for training the deep neural machine translation model of claim 1, wherein the deriving the first joint input representation from the training samples comprises:
preprocessing the source sentence and the target sentence, wherein the preprocessing comprises word segmentation processing;
mapping each subword in the source sentence and the target sentence into a word vector respectively;
and adding the word vector with the corresponding language vector and the corresponding position vector respectively to obtain the first joint input representation.
3. The method of claim 2, wherein updating model training parameters using a back propagation algorithm according to the final output representation and the target sentence comprises:
updating the model training parameters by using a back propagation algorithm according to the final output representation corresponding to the target sentence, so that the probability that the predicted value is the word vector of the target sentence is maximized.
4. The method for training the deep neural machine translation model according to any one of claims 1 to 3, further comprising:
translating the source language into the target language by using a beam-search-based decoding method based on the deep neural machine translation model.
5. An apparatus for training a deep neural machine translation model, comprising:
a joint input representation acquisition module configured to obtain a first joint input representation according to a training sample, wherein the training sample comprises a source sentence and a target sentence;
an output representation acquisition module configured to input the first joint input representation into M layers of training networks connected in sequence to obtain a final output representation, wherein each layer of the training network comprises a self-crossing attention network and a feedforward network which are connected in sequence;
a model training parameter update module configured to update the model training parameters by using a back propagation algorithm according to the final output representation and the target sentence;
the output representation acquisition module is specifically configured to: take the first joint input representation as the input of the first layer of the training network, that is, input the first joint input representation into the first self-crossing attention network; perform the preset processing in the training network of each layer to obtain the output representation of the corresponding layer, the output representation of the previous layer serving as the joint input representation of the next layer; and finally, after the preset processing of the M layers of training networks, obtain the final output representation; for each layer of the training network, the preset processing specifically includes: after receiving the joint input representation of the corresponding layer, the self-crossing attention network obtains a query representation, a key representation and a value representation from the joint input representation; the query representation, the key representation and the value representation are processed by a dot-product attention mechanism to obtain an intermediate representation; and the intermediate representation is processed by the feed-forward network connected to the self-crossing attention network to obtain the output representation of the layer;
the expression of the first joint input representation is:

$H^{0} = (x_{1}+L_{1}+p^{x}_{1},\ \ldots,\ x_{I}+L_{1}+p^{x}_{I},\ y_{0}+L_{2}+p^{y}_{0},\ \ldots,\ y_{J-1}+L_{2}+p^{y}_{J-1})$

wherein $H^{0}$ is the first joint input representation and $x_{i}+L_{1}+p^{x}_{i}$ and $y_{j}+L_{2}+p^{y}_{j}$ are its constituent vectors; $X=(x_{1},\ldots,x_{i},\ldots,x_{I})$ is the set of word vectors of the $I$ subwords of the source sentence; $y_{0}$ is an all-zero vector; $y_{1},\ldots,y_{J-1}$ are the word vectors of the first $J-1$ of the $J$ subwords of the target sentence; $L_{1}$ is the language vector corresponding to the source sentence; $L_{2}$ is the language vector corresponding to the target sentence; $p^{x}_{1},\ldots,p^{x}_{I}$ are the position vectors of $x_{1},\ldots,x_{I}$; and $p^{y}_{0},p^{y}_{1},\ldots,p^{y}_{J-1}$ are the position vectors of $y_{0},y_{1},\ldots,y_{J-1}$;

for the training network at the $m$-th layer, the expressions for the query representation, the key representation and the value representation are:

$Q^{m} = H^{m-1}W^{m}_{Q}$

$K^{m} = H^{m-1}W^{m}_{K}$

$V^{m} = H^{m-1}W^{m}_{V}$

wherein $Q^{m}$, $K^{m}$ and $V^{m}$ are respectively the query representation, the key representation and the value representation of the training network at the $m$-th layer; $W^{m}_{Q}$, $W^{m}_{K}$ and $W^{m}_{V}$ are all model training parameters of the training network at the $m$-th layer; and $H^{m-1}$ is the $m$-th joint input representation, namely the joint input representation corresponding to the training network at the $m$-th layer;

the expression of the intermediate representation is:

$H^{m'} = \mathrm{softmax}\!\left(\dfrac{Q^{m}(K^{m})^{\mathsf{T}}}{\sqrt{d}} + B\right)V^{m}$

wherein $H^{m'}$ is the intermediate representation of the training network at the $m$-th layer; $d$ is the dimension of the hidden representation of the deep neural machine translation model; $B$ is the mask matrix of the deep neural machine translation model; and softmax is the normalized exponential function;

the expression of the output representation is:

$H^{m} = \max(0,\ H^{m'}W^{m}_{1}+b^{m}_{1})\,W^{m}_{2}+b^{m}_{2}$

wherein $H^{m}$ is the output representation of the training network at the $m$-th layer; $W^{m}_{1}$, $b^{m}_{1}$, $W^{m}_{2}$ and $b^{m}_{2}$ are all model training parameters of the training network at the $m$-th layer; $\max$ takes the element-wise maximum of $0$ and $H^{m'}W^{m}_{1}+b^{m}_{1}$; and $1 \le m \le M$.
6. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor when executing the program performs the steps of the method for training a deep neural machine translation model according to any one of claims 1 to 4.
7. A non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method for training a deep neural machine translation model according to any one of claims 1 to 4.
CN201910528250.5A 2019-06-18 2019-06-18 Method and device for training deep neural machine translation model Active CN110263352B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910528250.5A CN110263352B (en) 2019-06-18 2019-06-18 Method and device for training deep neural machine translation model


Publications (2)

Publication Number Publication Date
CN110263352A CN110263352A (en) 2019-09-20
CN110263352B true CN110263352B (en) 2023-04-07

Family

ID=67919168

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910528250.5A Active CN110263352B (en) 2019-06-18 2019-06-18 Method and device for training deep neural machine translation model

Country Status (1)

Country Link
CN (1) CN110263352B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112308370B (en) * 2020-09-16 2024-03-05 湘潭大学 Automatic subjective question scoring method for thinking courses based on Transformer

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103810999A (en) * 2014-02-27 2014-05-21 清华大学 Linguistic model training method and system based on distributed neural networks
JP2016218995A * 2015-05-25 2016-12-22 Panasonic Intellectual Property Corporation of America Machine translation method, machine translation system and program
CN108197123A (en) * 2018-02-07 2018-06-22 云南衍那科技有限公司 A kind of cloud translation system and method based on smartwatch
CN108563640A (en) * 2018-04-24 2018-09-21 中译语通科技股份有限公司 A kind of multilingual pair of neural network machine interpretation method and system
CN108647214A (en) * 2018-03-29 2018-10-12 中国科学院自动化研究所 Coding/decoding method based on deep-neural-network translation model
CN109145315A (en) * 2018-09-05 2019-01-04 腾讯科技(深圳)有限公司 Text interpretation method, device, storage medium and computer equipment


Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Full-wave modeling of antennas by elementary sources based on spherical waves translation; Juan F. Izquierdo et al.; 2012 6th European Conference on Antennas and Propagation (EUCAP); full text *
Design of a knowledge question-answering system for intelligent mechanical manufacturing based on deep learning; Zhu Jiannan et al.; Computer Integrated Manufacturing Systems, Issue 05; full text *
Application of an improved HMM model to feature extraction; Chen Changhao et al.; Computer Measurement & Control, Issue 04; full text *
Template-driven neural machine translation; Li Qiang et al.; Chinese Journal of Computers, Vol. 42, Issue 3; full text *
Research on domain adaptation in statistical machine translation; Liu Hao; China Master's Theses Full-text Database; full text *

Also Published As

Publication number Publication date
CN110263352A (en) 2019-09-20

Similar Documents

Publication Publication Date Title
KR102382499B1 (en) Translation method, target information determination method, related apparatus and storage medium
US11556713B2 (en) System and method for performing a meaning search using a natural language understanding (NLU) framework
CN106484682B (en) Machine translation method, device and electronic equipment based on statistics
CN108052588B (en) Method for constructing automatic document question-answering system based on convolutional neural network
CN109117483B (en) Training method and device of neural network machine translation model
CN106502985B (en) neural network modeling method and device for generating titles
CN110309287B (en) Retrieval type chatting dialogue scoring method for modeling dialogue turn information
CN102968989B (en) Improvement method of Ngram model for voice recognition
CN108734276A (en) A kind of learning by imitation dialogue generation method generating network based on confrontation
CN107967262A (en) A kind of neutral net covers Chinese machine translation method
CN110070855B (en) Voice recognition system and method based on migrating neural network acoustic model
CN111144140B (en) Zhongtai bilingual corpus generation method and device based on zero-order learning
Khan et al. RNN-LSTM-GRU based language transformation
CN101458681A (en) Voice translation method and voice translation apparatus
CN108932232A (en) A kind of illiteracy Chinese inter-translation method based on LSTM neural network
CN111191468B (en) Term replacement method and device
Mandal et al. Futurity of translation algorithms for neural machine translation (NMT) and its vision
CN110263352B (en) Method and device for training deep neural machine translation model
CN111428518A (en) Low-frequency word translation method and device
CN111178097B (en) Method and device for generating Zhongtai bilingual corpus based on multistage translation model
Riou et al. Online adaptation of an attention-based neural network for natural language generation
CN115017924B (en) Construction of neural machine translation model for cross-language translation and translation method thereof
CN114254657B (en) Translation method and related equipment thereof
CN115860015A (en) Translation memory-based transcribed text translation method and computer equipment
CN111104806A (en) Construction method and device of neural machine translation model, and translation method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant