CN110263352B - Method and device for training deep neural machine translation model - Google Patents


Info

Publication number
CN110263352B
CN110263352B
Authority
CN
China
Prior art keywords
representation
training
network
layer
machine translation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910528250.5A
Other languages
Chinese (zh)
Other versions
CN110263352A (en)
Inventor
黄辉
刘学博
周沁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Um Zhuhai Research Institute
University of Macau
Original Assignee
Um Zhuhai Research Institute
University of Macau
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Um Zhuhai Research Institute, University of Macau filed Critical Um Zhuhai Research Institute
Priority to CN201910528250.5A priority Critical patent/CN110263352B/en
Publication of CN110263352A publication Critical patent/CN110263352A/en
Application granted granted Critical
Publication of CN110263352B publication Critical patent/CN110263352B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Abstract

The embodiment of the invention provides a method and a device for training a deep neural machine translation model, wherein the method comprises the following steps: obtaining a first joint input representation from a training sample, wherein the training sample comprises a source sentence and a target sentence; inputting the first joint input representation into M layers of training networks connected in sequence to obtain a final output representation, wherein each layer of the training network comprises a self-crossing attention network and a feed-forward network connected in sequence; and updating the model training parameters with a back propagation algorithm according to the final output representation and the target sentence. By training the deep neural machine translation model with M sequentially connected layers, each composed of a self-crossing attention network and a feed-forward network, the method and the device provide a smooth gradient flow, make the training of deep neural machine translation models feasible, and thereby improve the translation quality of the neural machine translation model.

Description

Method and device for training deep neural machine translation model
Technical Field
The invention relates to the technical field of natural language processing, in particular to a method and a device for training a deep neural machine translation model.
Background
Machine translation refers to the process of automatically converting one natural language (hereinafter, the source language) into another natural language (hereinafter, the target language) by computer technology. With the development of deep learning, neural machine translation has become the new generation of machine translation technology; in particular, the Transformer framework proposed by Google in 2017 is currently the most popular and most effective neural machine translation framework.
The Transformer architecture consists of an encoder, which maps sentences in the source language (hereinafter, source sentences) to hidden representations, and a decoder, which reads these hidden representations and generates sentences in the target language (hereinafter, target sentences). The Transformer architecture trains the model with a back propagation algorithm, in which the error signal must pass through the whole encoder. As a result, the Transformer architecture cannot be used to build a deep neural machine translation model, such as one with more than 12 layers, because increasing the depth of the encoder leads to an unstable gradient flow.
The deep neural model has shown powerful capability in computer vision and other natural language processing tasks, but due to limitations of the Transformer architecture, the neural machine translation field still cannot enjoy many advantages brought by the deep neural model so far, such as better translation effect.
In summary, providing a framework capable of training deep neural machine translation models has become an important research topic in the field of neural machine translation.
Disclosure of Invention
To solve the problems in the prior art, embodiments of the present invention provide a method and an apparatus for training a deep neural machine translation model.
In a first aspect, an embodiment of the present invention provides a method for training a deep neural machine translation model, including: obtaining a first joint input representation from a training sample, wherein the training sample comprises a source sentence and a target sentence; inputting the first joint input representation into M layers of training networks connected in sequence to obtain a final output representation, wherein each layer of the training network comprises a self-crossing attention network and a feed-forward network connected in sequence; and updating the model training parameters with a back propagation algorithm according to the final output representation and the target sentence.
In a second aspect, an embodiment of the present invention provides an apparatus for training a deep neural machine translation model, including: a joint input representation acquisition module, configured to obtain a first joint input representation from a training sample, wherein the training sample comprises a source sentence and a target sentence; an output representation acquisition module, configured to input the first joint input representation into M layers of training networks connected in sequence to obtain a final output representation, wherein each layer of the training network comprises a self-crossing attention network and a feed-forward network connected in sequence; and a model training parameter updating module, configured to update the model training parameters with a back propagation algorithm according to the final output representation and the target sentence.
In a third aspect, an embodiment of the present invention provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and the processor implements the steps of the method provided in the first aspect when executing the program.
In a fourth aspect, an embodiment of the present invention provides a non-transitory computer readable storage medium, on which a computer program is stored, which when executed by a processor, implements the steps of the method as provided in the first aspect.
According to the method and the device for training the deep neural machine translation model provided by the embodiment of the invention, the deep neural machine translation model is trained with M sequentially connected layers of training networks, each formed by a self-crossing attention network and a feed-forward network. This provides a smooth gradient flow, makes the training of the deep neural machine translation model feasible, and improves the translation quality of the neural machine translation model.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a flow chart of a method for training a deep neural machine translation model according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an architecture of a deep neural machine translation model in a method for training the deep neural machine translation model according to an embodiment of the present invention;
FIG. 3 is a schematic diagram illustrating a process of obtaining a first joint input representation in a method for training a deep neural machine translation model according to an embodiment of the present invention;
fig. 4 is a schematic diagram of a mask matrix of a deep neural machine translation model in a method for training the deep neural machine translation model according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of an apparatus for training a deep neural machine translation model according to an embodiment of the present invention;
fig. 6 is a schematic physical structure diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a flowchart of a method for training a deep neural machine translation model according to an embodiment of the present invention. Fig. 2 is a schematic diagram of an architecture of a deep neural machine translation model in a method for training the deep neural machine translation model according to an embodiment of the present invention. As shown in fig. 1 and fig. 2, the method includes:
101, obtaining a first combined input representation according to a training sample, wherein the training sample comprises a source sentence and a target sentence;
the training samples may be taken from a parallel corpus in a training set. The parallel corpus is composed of a plurality of sentences which have the same meaning and belong to different languages, such as a Chinese sentence "Beijing welcome you! "corresponds to the English sentence" welome to Beijing! ", then" Beijing welcome you! "is the Source sentence," Welcome to Beijing! "is the target statement.
Each training sample comprises a source sentence and the corresponding target sentence, and a training set contains a large number of training samples. For each training sample, the training method provided by the embodiment of the invention can be used to train the deep neural machine translation model; after training on a large number of samples, an optimized deep neural machine translation model is obtained.
The apparatus for training the deep neural machine translation model derives a first joint input representation from the training samples, i.e., the first joint input representation is derived from both the source sentence and the target sentence in the training samples.
102, inputting the first joint input representation into M layers of training networks connected in sequence to obtain a final output representation; wherein each layer of the training network comprises a self-crossing attention network and a feedforward network which are connected in sequence;
and inputting the first joint input representation into M layers of training networks which are connected in sequence to obtain a final output representation. As shown in fig. 2, each layer of the training network includes a self-crossing attention network and a feedforward network connected in sequence. According to the input direction represented by the first joint input, in each layer of the training network, the self-crossing attention network is in front of the feedforward network. I.e. the first joint input represents the self-crossing attention network input to the training network of the first layer, the feed forward network of each layer connecting the self-crossing attention network of the present layer and the self-crossing attention network of the next layer. The input of the feedforward network of each layer is the output of the self-crossing attention network of the layer, and the output of the feedforward network of each layer is the input of the self-crossing attention network of the next layer.
Therefore, after the first joint input representation is input into the self-crossing attention network of the first layer, the final output representation can be obtained after the processing of the training network of M layers.
Where M also refers to the number of layers of the deep neural machine translation model.
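This layer chaining can be expressed as a plain loop; the names `attention` and `feed_forward` are hypothetical stand-ins for each layer's self-crossing attention network and feed-forward network, so this is an illustrative sketch of the wiring, not the patented implementation:

```python
def encode(H0, layers):
    """Pass a joint input representation through M sequentially connected
    training layers and return the final output representation."""
    H = H0
    for attention, feed_forward in layers:  # one (attention, FFN) pair per layer
        H = feed_forward(attention(H))      # each FFN output feeds the next layer's attention
    return H
```

Each layer has the same structure but its own parameters, so `layers` would hold M distinct (attention, feed-forward) pairs.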
And 103, updating model training parameters by using a back propagation algorithm according to the final output representation and the target statement.
The target sentence is predicted from the output representation, and the model training parameters are updated with a back propagation algorithm so as to maximize the probability of the target sentence; this completes the training of the deep neural machine translation model on this training sample.
Compared with the prior art, the training process of this method has a smoother gradient flow than the currently best Transformer architecture. With it, a deep (30-layer) neural machine translation model has been trained successfully, achieving better translation quality than the currently most popular Transformer-based neural machine translation models, which shows great potential for large-scale commercialization.
The embodiment of the invention trains the deep neural machine translation model with M sequentially connected layers of training networks, each consisting of a self-crossing attention network and a feed-forward network; this yields a smooth gradient flow, makes the training of the deep neural machine translation model feasible, and improves the translation quality of the neural machine translation model.
Fig. 3 is a schematic diagram of the process of obtaining the first joint input representation in the method for training the deep neural machine translation model according to the embodiment of the present invention. As shown in fig. 3, obtaining the first joint input representation from the training sample specifically includes: preprocessing the source sentence and the target sentence, wherein the preprocessing comprises word segmentation; mapping each subword in the source sentence and the target sentence to a word vector; and adding each word vector to the corresponding language vector and position vector to obtain the first joint input representation.
The source sentence and the target sentence are preprocessed; the preprocessing comprises word segmentation and may also comprise normalization such as case conversion. For example, the Chinese sentence "北京欢迎你！" is segmented into the four subwords "北京" ("Beijing"), "欢迎" ("welcome"), "你" ("you") and "！", and the English sentence "Welcome to Beijing!" is segmented into the five subwords "welcome", "to", "Bei@@", "jing" and "!".
And mapping each subword in the source sentence and the target sentence into a word vector respectively, and adding the word vector with the corresponding language vector and the corresponding position vector respectively to obtain the first joint input representation.
Given a source sentence $x = (x_1, \ldots, x_i, \ldots, x_I)$ with $I$ subwords and a target sentence $y = (y_1, \ldots, y_j, \ldots, y_J)$ with $J$ subwords, each subword is first mapped to a word vector, giving $X = (x_1, \ldots, x_i, \ldots, x_I)$ and $Y = (y_1, \ldots, y_j, \ldots, y_J)$.
Then each word vector is added to the corresponding language vector $L$ and position vector $P$ to obtain the joint input representation $H^0$ of the model:

$$H^0 = \left(x_1 + L_1 + P^x_1,\ \ldots,\ x_I + L_1 + P^x_I,\ y_0 + L_2 + P^y_0,\ y_1 + L_2 + P^y_1,\ \ldots,\ y_{J-1} + L_2 + P^y_{J-1}\right)$$

where $H^0$ is the first joint input representation and its entries are the individual input vectors; $X = (x_1, \ldots, x_i, \ldots, x_I)$ is the set of word vectors of the $I$ subwords of the source sentence; $y_0$ is an all-zero vector; $y_1, \ldots, y_{J-1}$ are the word vectors of the first $J-1$ of the $J$ subwords of the target sentence; $L_1$ is the language vector of the source sentence; $L_2$ is the language vector of the target sentence; $P^x_1, \ldots, P^x_I$ are the position vectors of $x_1, \ldots, x_I$; and $P^y_0, P^y_1, \ldots, P^y_{J-1}$ are the position vectors of $y_0, y_1, \ldots, y_{J-1}$.
On the basis of the embodiment, each subword in the segmented source sentence and target sentence is mapped into a word vector, and the word vectors are added with the corresponding language vectors and position vectors to obtain a first combined input representation, so that a basis is provided for training of a deep neural machine translation model.
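As a concrete illustration of the construction above, the following minimal NumPy sketch builds a joint input representation. The hidden dimension, sentence lengths, random embeddings, and the sinusoidal form of the position vectors are illustrative assumptions, not values prescribed by the patent:

```python
import numpy as np

d = 8          # hidden dimension (illustrative)
I, J = 4, 5    # source / target subword counts (illustrative)
rng = np.random.default_rng(0)

# Word vectors for the I source subwords and the J target subwords.
X = rng.normal(size=(I, d))
Y = rng.normal(size=(J, d))

# Language vectors: L1 for the source language, L2 for the target language.
L1 = rng.normal(size=(d,))
L2 = rng.normal(size=(d,))

def position_vectors(n, d):
    """Sinusoidal position vectors, one row per position (an assumed choice)."""
    pos = np.arange(n)[:, None]
    dim = np.arange(d)[None, :]
    angle = pos / np.power(10000.0, (2 * (dim // 2)) / d)
    return np.where(dim % 2 == 0, np.sin(angle), np.cos(angle))

# Target side is shifted right: y_0 is an all-zero vector, y_J is dropped.
Y_shifted = np.vstack([np.zeros((1, d)), Y[:-1]])

# H0 concatenates source and shifted target, each plus its
# language vector and position vector.
H0 = np.vstack([
    X + L1 + position_vectors(I, d),
    Y_shifted + L2 + position_vectors(J, d),
])
assert H0.shape == (I + J, d)
```

The first I rows carry the source sentence and the remaining J rows carry the shifted target sentence, matching the concatenated form of $H^0$ above.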
Further, based on the above embodiment, inputting the first joint input representation into the M layers of training networks connected in sequence to obtain the final output representation specifically includes: taking the first joint input representation as the input of the first layer of the training network, i.e., feeding it to the first self-crossing attention network; performing, in each layer of the training network, a preset processing that yields the output representation of that layer, with the output representation of each layer serving as the joint input representation of the next layer; and finally, after the preset processing of all M layers, obtaining the final output representation. For each layer of the training network, the preset processing specifically includes: after receiving the joint input representation of the corresponding layer, the self-crossing attention network derives a query representation, a key representation and a value representation from the joint input representation; the query, key and value representations are combined by a dot-product attention mechanism to obtain an intermediate representation; and the intermediate representation is processed by the feed-forward network connected to the self-crossing attention network to obtain the output representation of the layer.
After the first joint input representation $H^0$ is obtained, the self-crossing attention network of the first layer first converts it into the query representation $Q^1$, the key representation $K^1$ and the value representation $V^1$ of the first layer:

$$Q^1 = H^0 W_Q^1, \qquad K^1 = H^0 W_K^1, \qquad V^1 = H^0 W_V^1$$

where $W_Q^1$, $W_K^1$ and $W_V^1$ are all model training parameters of the first layer of the training network.
Then the dot-product attention mechanism is applied to the query representation $Q^1$, the key representation $K^1$ and the value representation $V^1$ to obtain the intermediate representation $H^{1'}$ of the first layer:

$$H^{1'} = \mathrm{softmax}\!\left(\frac{Q^1 (K^1)^{\top}}{\sqrt{d}} + B\right) V^1$$

where $d$ is the dimension of the hidden representations of the deep neural machine translation model, $B$ is the mask matrix of the deep neural machine translation model, and softmax is the normalized exponential function.
Fig. 4 is a schematic diagram of the mask matrix of the deep neural machine translation model in the method for training the deep neural machine translation model according to an embodiment of the present invention. The mask matrix has two effects: first, it makes each source-side representation attend only to the representations of the source side; second, it makes each target-side representation attend only to the representations of the source side and to the representations of target words that have already been generated, but not to representations of target words that have not yet been generated. As shown in fig. 4, nodes drawn with dashed lines indicate that the connection from the corresponding horizontal-axis node to the vertical-axis node is masked. In fig. 4, the horizontal axis carries the input representation of each word vector, and the vertical axis carries the intermediate representation of each word vector.
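A minimal sketch of such a mask matrix, with 0 for allowed connections and a large negative constant for masked ones, so that the softmax in the attention formula drives the masked weights to zero. The constant -1e9 is a common convention assumed here, not a value taken from the patent:

```python
import numpy as np

def build_mask(I, J, neg=-1e9):
    """Mask for a joint sequence of I source and J target positions.

    Source positions attend to all source positions; target position t
    attends to all source positions and to target positions <= t.
    """
    n = I + J
    B = np.full((n, n), neg)
    B[:I, :I] = 0.0                  # source -> all source positions
    B[I:, :I] = 0.0                  # target -> all source positions
    tgt = np.tril(np.ones((J, J)))   # target -> already generated targets only
    B[I:, I:][tgt == 1] = 0.0
    return B
```

Adding this matrix inside the softmax leaves allowed attention weights unchanged and suppresses masked ones.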
After the intermediate representation $H^{1'}$ is obtained, it is processed by the feed-forward network of the first layer, yielding the output representation $H^1$ of the first layer:

$$H^1 = \max\!\left(0,\ H^{1'} W_1^1 + b_1^1\right) W_2^1 + b_2^1$$

where $W_1^1$, $b_1^1$, $W_2^1$ and $b_2^1$ are all model training parameters of the first layer of the training network.
The output representation $H^1$ of the first layer is also the input representation of the second layer; every layer has the same network structure but its own model training parameters. After the joint input representation $H^0$ has traversed all $M$ layers, the model obtains the final output representation $H^M$.
In general, for the training network at layer $m$ ($1 \le m \le M$), the query, key and value representations are:

$$Q^m = H^{m-1} W_Q^m, \qquad K^m = H^{m-1} W_K^m, \qquad V^m = H^{m-1} W_V^m$$

where $Q^m$, $K^m$ and $V^m$ are the query, key and value representations of the $m$-th layer; $W_Q^m$, $W_K^m$ and $W_V^m$ are all model training parameters of the $m$-th layer; and $H^{m-1}$ is the $m$-th joint input representation, i.e., the joint input representation of the $m$-th layer of the training network.

The intermediate representation is:

$$H^{m'} = \mathrm{softmax}\!\left(\frac{Q^m (K^m)^{\top}}{\sqrt{d}} + B\right) V^m$$

where $H^{m'}$ is the intermediate representation of the $m$-th layer; $d$ is the dimension of the hidden representations of the deep neural machine translation model; $B$ is the mask matrix of the deep neural machine translation model; and softmax is the normalized exponential function.

The output representation is:

$$H^m = \max\!\left(0,\ H^{m'} W_1^m + b_1^m\right) W_2^m + b_2^m$$

where $H^m$ is the output representation of the $m$-th layer; $W_1^m$, $b_1^m$, $W_2^m$ and $b_2^m$ are all model training parameters of the $m$-th layer; and $\max$ takes the element-wise maximum of $0$ and its argument. For example, if $H^{m'} W_1^m + b_1^m = [-1, 2, 3, -4]$, the output of the max function is $[0, 2, 3, 0]$.
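Putting the per-layer formulas together, a single training-network layer can be sketched as follows (single attention head, plain NumPy; the parameter names and shapes are illustrative assumptions, not the patented implementation):

```python
import numpy as np

def softmax(z):
    """Row-wise normalized exponential function."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_forward(H_prev, params, B, d):
    """One self-crossing-attention + feed-forward layer.

    H_prev: joint input representation H^{m-1}, shape (n, d).
    params: this layer's training parameters (hypothetical key names).
    B: mask matrix, shape (n, n).
    """
    # Query, key and value representations of this layer.
    Q = H_prev @ params["WQ"]
    K = H_prev @ params["WK"]
    V = H_prev @ params["WV"]
    # Scaled dot-product attention, with the mask added before the softmax.
    H_mid = softmax(Q @ K.T / np.sqrt(d) + B) @ V
    # Feed-forward network with the element-wise max(0, .) activation.
    return np.maximum(0.0, H_mid @ params["W1"] + params["b1"]) @ params["W2"] + params["b2"]
```

Stacking M such layers, each with its own `params` and with each layer's output as the next layer's input, yields the final output representation $H^M$.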
On the basis of the above embodiment, the embodiment of the invention derives the query, key and value representations from the joint input representation, obtains the intermediate representation and then the output representation, and advances layer by layer to the final output representation, thereby improving the accuracy of model training.
Further, based on the above embodiment, updating the model training parameters with a back propagation algorithm according to the final output representation and the target sentence specifically includes: updating the model training parameters with a back propagation algorithm according to the final output representations corresponding to the target sentence, so that the probability that the predicted values are the word vectors of the target sentence is maximized.
As shown in fig. 3, $y_0$ is an all-zero vector used to obtain $H^0_{I+1}$ (the joint input representation corresponding to $y_0$), from which the model finally obtains $H^M_{I+1}$ (the output representation corresponding to $y_0$), which is used to predict the first word $y_1$ of the target sentence. Similarly, $y_1$ is used to obtain $H^0_{I+2}$ (the joint input representation corresponding to $y_1$), from which the model finally obtains $H^M_{I+2}$ (the output representation corresponding to $y_1$), which is used to predict the second word $y_2$ of the target sentence, and so on, until $y_{J-1}$ obtains its corresponding output representation $H^M_{I+J}$, which is used to predict the last word $y_J$ of the target sentence.

Thus, the apparatus for training the deep neural machine translation model predicts all positions synchronously from the slice matrix $\left(H^M_{I+1}, \ldots, H^M_{I+J}\right)$ of $H^M$ and updates all parameters of the model with a back propagation algorithm so that the probability that the predicted values equal $(y_1, \ldots, y_J)$ is maximized; at this point, the training of the deep neural machine translation model on this training sample is finished.

Here $H^M$ is the output representation of the $M$-th layer of the training network, and $\left(H^M_{I+1}, \ldots, H^M_{I+J}\right)$ is the final output representation corresponding to the target sentence.
On the basis of the above embodiment, the embodiment of the invention updates the model training parameters with a back propagation algorithm according to the final output representations corresponding to the target sentence, so that the probability that the predicted values are the word vectors of the target sentence is maximized; this ensures that optimized model training parameters are obtained and further improves the accuracy of model training.
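Maximizing the probability that the predicted values are the target words is equivalent to minimizing a cross-entropy loss over the target-side slice of the final output representation. A minimal sketch under the assumption of a softmax output projection (the name `W_out` and the vocabulary handling are hypothetical, not from the patent):

```python
import numpy as np

def softmax(z):
    """Row-wise normalized exponential function."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def target_slice_loss(H_M, W_out, target_ids, I):
    """Cross-entropy over the target-side slice (H^M_{I+1}, ..., H^M_{I+J}).

    H_M: final output representation, shape (I + J, d).
    W_out: assumed output projection to vocabulary logits, shape (d, vocab).
    target_ids: the J gold subword ids (y_1, ..., y_J).
    """
    logits = H_M[I:] @ W_out            # only target positions predict words
    probs = softmax(logits)
    J = len(target_ids)
    # Mean negative log-likelihood of the gold words; back-propagating this
    # loss is what updates all model training parameters.
    return -np.log(probs[np.arange(J), target_ids]).mean()
```

Minimizing this loss with gradient descent maximizes the probability that the predicted values equal $(y_1, \ldots, y_J)$.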
Further, based on the above embodiment, the method further includes: translating the source language into the target language with a beam-search-based decoding method, using the trained deep neural machine translation model.
During decoding, at each step the several target words with the highest probability are kept as the candidate set, and their probability values are used as the scores of the current words; after every beam has generated a complete sentence, the target sentence in the beam with the highest score is selected as the final translation result.
On the basis of the above embodiment, the embodiment of the invention translates the source language into the target language with a beam-search-based decoding method, thereby improving translation accuracy.
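A minimal sketch of beam-search decoding as described above; `step_log_probs`, the beam size, and the end-of-sentence id are hypothetical placeholders standing in for the trained model's next-word prediction:

```python
import math

def beam_search(step_log_probs, beam_size, eos_id, max_len):
    """Keep the beam_size highest-scoring partial target sentences;
    return the finished hypothesis with the best score.

    step_log_probs(words) -> list of (next_word_id, log_probability).
    """
    beams = [([], 0.0)]                     # (target words so far, score)
    finished = []
    for _ in range(max_len):
        candidates = []
        for words, score in beams:
            for word, logp in step_log_probs(words):  # most probable next words
                candidates.append((words + [word], score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for words, score in candidates[:beam_size]:
            # A beam that generated end-of-sentence is a complete sentence.
            (finished if words[-1] == eos_id else beams).append((words, score))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])
```

Here the score of a hypothesis is the sum of the log probabilities of its words, so the highest-scoring finished beam corresponds to the most probable target sentence found.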
Fig. 5 is a schematic structural diagram of an apparatus for training a deep neural machine translation model according to an embodiment of the present invention. As shown in fig. 5, the apparatus includes a joint input representation acquisition module 10, an output representation acquisition module 20, and a model training parameter updating module 30. The joint input representation acquisition module 10 is configured to obtain a first joint input representation from a training sample, wherein the training sample comprises a source sentence and a target sentence. The output representation acquisition module 20 is configured to input the first joint input representation into M layers of training networks connected in sequence to obtain a final output representation, wherein each layer of the training network comprises a self-crossing attention network and a feed-forward network connected in sequence. The model training parameter updating module 30 is configured to update the model training parameters with a back propagation algorithm according to the final output representation and the target sentence.
The embodiment of the invention trains the deep neural machine translation model with M sequentially connected layers of training networks, each consisting of a self-crossing attention network and a feed-forward network; this yields a smooth gradient flow, makes the training of the deep neural machine translation model feasible, and improves the translation quality of the neural machine translation model.
Further, based on the above embodiment, the joint input representation acquisition module 10 is specifically configured to: preprocess the source sentence and the target sentence, wherein the preprocessing comprises word segmentation; map each subword in the source sentence and the target sentence to a word vector; and add each word vector to the corresponding language vector and position vector to obtain the first joint input representation.
On the basis of the above embodiment, the embodiment of the invention maps each subword of the segmented source sentence and target sentence to a word vector and adds each word vector to the corresponding language vector and position vector to obtain the first joint input representation, thereby providing a foundation for training the deep neural machine translation model.
Further, based on the above embodiment, the output representation acquisition module 20 is specifically configured to: take the first joint input representation as the input of the first layer of the training network, i.e., feed it to the first self-crossing attention network; perform, in each layer of the training network, a preset processing that yields the output representation of that layer, with the output representation of each layer serving as the joint input representation of the next layer; and finally, after the preset processing of all M layers, obtain the final output representation. For each layer of the training network, the preset processing specifically includes: after receiving the joint input representation of the corresponding layer, the self-crossing attention network derives a query representation, a key representation and a value representation from the joint input representation; the query, key and value representations are combined by a dot-product attention mechanism to obtain an intermediate representation; and the intermediate representation is processed by the feed-forward network connected to the self-crossing attention network to obtain the output representation of the layer.
On the basis of the above embodiment, the embodiment of the invention obtains the query representation, the key representation and the value representation from the joint input representation, computes the intermediate representation and then the output representation, and advances layer by layer to the final output representation, thereby improving the accuracy of model training.
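The per-layer preset processing described above can be sketched as a single-head forward pass: project the joint input to query/key/value, apply scaled dot-product attention with a mask matrix B, then a two-layer feed-forward network. The parameter shapes, the ReLU activation, and the single-head simplification are illustrative assumptions, not the patented implementation.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def layer_forward(H_prev, Wq, Wk, Wv, W1, b1, W2, b2, B):
    d = H_prev.shape[-1]
    Q, K, V = H_prev @ Wq, H_prev @ Wk, H_prev @ Wv       # query/key/value
    H_mid = softmax(Q @ K.T / np.sqrt(d) + B) @ V         # dot-product attention
    return np.maximum(0.0, H_mid @ W1 + b1) @ W2 + b2     # feed-forward (ReLU)

rng = np.random.default_rng(1)
n, d, d_ff = 5, 8, 16
H0 = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
W1, b1 = rng.normal(size=(d, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d)), np.zeros(d)
B = np.zeros((n, n))  # no masking in this toy example
H1 = layer_forward(H0, Wq, Wk, Wv, W1, b1, W2, b2, B)
print(H1.shape)  # (5, 8): same shape, ready to feed the next layer
```

Because the output keeps the input's shape, stacking M such layers (feeding each layer's output as the next layer's joint input) is a simple loop.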
Further, based on the above embodiment, the updating of the model training parameters by using a back propagation algorithm according to the final output representation and the target sentence specifically includes: updating the model training parameters by using a back propagation algorithm according to the final output representation corresponding to the target sentence, so that the probability that the predicted value is the word vector of the target sentence is maximized.
On the basis of the above embodiment, the embodiment of the invention updates the model training parameters by back propagation so that the probability that the predicted value is the word vector of the target sentence is maximized, thereby improving the accuracy of model training.
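The update described above can be sketched in a hedged way: the final output representation at each target position is projected to vocabulary logits, and a parameter is adjusted by gradient descent so that the probability of the gold target subwords increases (equivalently, cross-entropy decreases). The output projection `W_out` and the single manually derived gradient stand in for a full back-propagation pass; all sizes and values are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def loss(H, W, targets):
    """Mean negative log-probability of the gold subwords (cross-entropy)."""
    p = softmax(H @ W)
    return -np.log(p[np.arange(len(targets)), targets]).mean()

rng = np.random.default_rng(2)
d, vocab = 8, 20
H_final = rng.normal(size=(3, d))      # representations at 3 target positions
targets = np.array([2, 7, 11])         # gold subword ids (toy values)
W_out = 0.1 * rng.normal(size=(d, vocab))

before = loss(H_final, W_out, targets)
for _ in range(200):
    probs = softmax(H_final @ W_out)
    grad = probs.copy()
    grad[np.arange(3), targets] -= 1.0        # d(cross-entropy)/d(logits)
    W_out -= 0.05 * (H_final.T @ grad) / 3    # gradient-descent step
after = loss(H_final, W_out, targets)
print(after < before)  # True: the target-word probability has been pushed up
```

In the full model, the same gradient flows back through all M layers to update the attention and feed-forward parameters as well.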
Further, based on the above embodiment, the method further includes: translating the source language into the target language by using a beam-search-based decoding method based on the deep neural machine translation model.
On the basis of the above embodiment, the embodiment of the invention translates the source language into the target language by using a beam-search-based decoding method, thereby improving translation accuracy.
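The beam-search decoding mentioned above can be sketched as follows. Here `toy_model`, its three-token vocabulary, and its probabilities are invented stand-ins for the trained model's next-subword distribution; only the search procedure itself is being illustrated.

```python
import math

def beam_search(step_log_probs, beam_size=2, max_len=3, eos=0):
    """Keep the beam_size highest-scoring partial translations at each step."""
    beams = [([], 0.0)]  # each hypothesis: (sequence, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == eos:
                candidates.append((seq, score))  # finished: carry forward as-is
                continue
            for tok, lp in step_log_probs(seq).items():
                candidates.append((seq + [tok], score + lp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams[0][0]

def toy_model(prefix):
    """Toy next-token log-probabilities conditioned on the last token."""
    table = {0: {1: math.log(0.6), 2: math.log(0.4)},
             1: {2: math.log(0.7), 0: math.log(0.3)},
             2: {0: math.log(0.9), 1: math.log(0.1)}}
    last = prefix[-1] if prefix else 0
    return table[last]

print(beam_search(toy_model))  # [1, 2, 0]
```

Note that the greedy path (always taking the locally best token) would also start with 1, but beam search additionally keeps the runner-up prefix [2] alive until it is outscored, which is what makes it more accurate than greedy decoding.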
The device provided by the embodiment of the invention is used to implement the above method; for its specific functions, reference may be made to the method flow described above, which is not repeated here.
Fig. 6 is a schematic physical structure diagram of an electronic device according to an embodiment of the present invention. As shown in fig. 6, the electronic device may include: a processor (processor) 810, a communication Interface 820, a memory 830 and a communication bus 840, wherein the processor 810, the communication Interface 820 and the memory 830 communicate with each other via the communication bus 840. The processor 810 may call logic instructions in the memory 830 to perform the following method: obtaining a first joint input representation according to a training sample, wherein the training sample comprises a source sentence and a target sentence; inputting the first joint input representation into M layers of training networks connected in sequence to obtain a final output representation, wherein each layer of the training network comprises a self-crossing attention network and a feedforward network which are connected in sequence; and updating model training parameters by using a back propagation algorithm according to the final output representation and the target sentence.
In addition, the logic instructions in the memory 830 may be implemented in the form of software functional units and, when sold or used as an independent product, stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention, in essence or the part contributing to the prior art, may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
In another aspect, an embodiment of the present invention further provides a non-transitory computer-readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the method provided in the foregoing embodiments, for example including: obtaining a first joint input representation according to a training sample, wherein the training sample comprises a source sentence and a target sentence; inputting the first joint input representation into M layers of training networks connected in sequence to obtain a final output representation, wherein each layer of the training network comprises a self-crossing attention network and a feedforward network which are connected in sequence; and updating model training parameters by using a back propagation algorithm according to the final output representation and the target sentence.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment may be implemented by software plus a necessary general hardware platform, or by hardware. Based on this understanding, the above technical solutions, in essence or the part contributing to the prior art, may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disk, and which includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute the method according to the various embodiments or parts thereof.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (7)

1. A method for training a deep neural machine translation model, comprising:
obtaining a first joint input representation according to a training sample, wherein the training sample comprises a source sentence and a target sentence;
inputting the first joint input representation into M layers of training networks connected in sequence to obtain a final output representation; wherein each layer of the training network comprises a self-crossing attention network and a feedforward network which are connected in sequence;
updating model training parameters by using a back propagation algorithm according to the final output representation and the target sentence;
inputting the first joint input representation into an M-layer training network connected in sequence to obtain a final output representation, specifically including:
taking the first joint input representation as the input of the first layer of the training network, that is, inputting the first joint input representation into the first self-crossing attention network; performing the preset processing in the training network of each layer to obtain the output representation of the corresponding layer, the output representation of the previous layer serving as the joint input representation of the next layer; and finally, after the preset processing of the M layers of training networks, obtaining the final output representation;
wherein, for each layer of the training network, the preset processing specifically includes:
after receiving the joint input representation of the corresponding layer, the self-crossing attention network obtains a query representation, a key representation and a value representation from the joint input representation;
calculating the query representation, the key representation and the value representation by using a dot-product attention mechanism to obtain an intermediate representation; and
processing the intermediate representation using the feed-forward network connected to the self-crossing attention network to obtain the output representation of the layer;
the expression of the first joint input representation is:

$H^{0} = (x_{1}+L_{1}+p^{x}_{1},\ \ldots,\ x_{I}+L_{1}+p^{x}_{I},\ y_{0}+L_{2}+p^{y}_{0},\ \ldots,\ y_{J-1}+L_{2}+p^{y}_{J-1})$

wherein $H^{0}$ is the first joint input representation and $x_{i}+L_{1}+p^{x}_{i}$ and $y_{j}+L_{2}+p^{y}_{j}$ are its constituent vectors; $X=(x_{1},\ldots,x_{i},\ldots,x_{I})$ is the set of word vectors of the $I$ subwords of the source sentence; $y_{0}$ is an all-zero vector; $y_{1},\ldots,y_{J-1}$ are the word vectors of the first $J-1$ of the $J$ subwords of the target sentence; $L_{1}$ is the language vector corresponding to the source sentence; $L_{2}$ is the language vector corresponding to the target sentence; $p^{x}_{1},\ldots,p^{x}_{I}$ are the position vectors of $x_{1},\ldots,x_{I}$; and $p^{y}_{0},p^{y}_{1},\ldots,p^{y}_{J-1}$ are the position vectors of $y_{0},y_{1},\ldots,y_{J-1}$;

for the training network at the $m$-th layer, the expressions for the query representation, the key representation and the value representation are:

$Q^{m} = H^{m-1}W^{m}_{Q}$

$K^{m} = H^{m-1}W^{m}_{K}$

$V^{m} = H^{m-1}W^{m}_{V}$

wherein $Q^{m}$, $K^{m}$ and $V^{m}$ are respectively the query representation, the key representation and the value representation of the training network at the $m$-th layer; $W^{m}_{Q}$, $W^{m}_{K}$ and $W^{m}_{V}$ are all model training parameters of the training network at the $m$-th layer; and $H^{m-1}$ is the $m$-th joint input representation, namely the joint input representation corresponding to the training network at the $m$-th layer;

the expression of the intermediate representation is:

$H^{m'} = \mathrm{softmax}\!\left(\dfrac{Q^{m}(K^{m})^{\mathsf{T}}}{\sqrt{d}} + B\right)V^{m}$

wherein $H^{m'}$ is the intermediate representation of the training network at the $m$-th layer; $d$ is the dimension of the hidden representation of the deep neural machine translation model; $B$ is the mask matrix of the deep neural machine translation model; and softmax is the normalized exponential function;

the expression of the output representation is:

$H^{m} = \max(0,\ H^{m'}W^{m}_{1}+b^{m}_{1})\,W^{m}_{2}+b^{m}_{2}$

wherein $H^{m}$ is the output representation of the training network at the $m$-th layer; $W^{m}_{1}$, $b^{m}_{1}$, $W^{m}_{2}$ and $b^{m}_{2}$ are all model training parameters of the training network at the $m$-th layer; $\max$ takes the element-wise maximum of $0$ and $H^{m'}W^{m}_{1}+b^{m}_{1}$; and $1 \le m \le M$.
2. The method for training the deep neural machine translation model of claim 1, wherein the deriving the first joint input representation from the training samples comprises:
preprocessing the source sentence and the target sentence, wherein the preprocessing comprises word segmentation processing;
mapping each subword in the source sentence and the target sentence into a word vector respectively;
and adding the word vector with the corresponding language vector and the corresponding position vector respectively to obtain the first joint input representation.
3. The method of claim 2, wherein updating model training parameters using a back propagation algorithm according to the final output representation and the target sentence comprises:
updating the model training parameters by using a back propagation algorithm according to the final output representation corresponding to the target sentence, so that the probability that the predicted value is the word vector of the target sentence is maximized.
4. The method for training the deep neural machine translation model according to any one of claims 1 to 3, further comprising:
translating the source language into the target language by using a beam-search-based decoding method based on the deep neural machine translation model.
5. An apparatus for training a deep neural machine translation model, comprising:
a joint input representation acquisition module configured to obtain a first joint input representation according to a training sample, wherein the training sample comprises a source sentence and a target sentence;
an output representation acquisition module configured to input the first joint input representation into M layers of training networks connected in sequence to obtain a final output representation, wherein each layer of the training network comprises a self-crossing attention network and a feedforward network which are connected in sequence;
a model training parameter update module configured to update the model training parameters by using a back propagation algorithm according to the final output representation and the target sentence;
the output representation acquisition module is specifically configured to: take the first joint input representation as the input of the first layer of the training network, that is, input the first joint input representation into the first self-crossing attention network; perform the preset processing in the training network of each layer to obtain the output representation of the corresponding layer, the output representation of the previous layer serving as the joint input representation of the next layer; and finally, after the preset processing of the M layers of training networks, obtain the final output representation; for each layer of the training network, the preset processing specifically includes: after receiving the joint input representation of the corresponding layer, the self-crossing attention network obtains a query representation, a key representation and a value representation from the joint input representation; the query representation, the key representation and the value representation are processed by a dot-product attention mechanism to obtain an intermediate representation; and the intermediate representation is processed by the feed-forward network connected to the self-crossing attention network to obtain the output representation of the layer;
the expression of the first joint input representation is:

$H^{0} = (x_{1}+L_{1}+p^{x}_{1},\ \ldots,\ x_{I}+L_{1}+p^{x}_{I},\ y_{0}+L_{2}+p^{y}_{0},\ \ldots,\ y_{J-1}+L_{2}+p^{y}_{J-1})$

wherein $H^{0}$ is the first joint input representation and $x_{i}+L_{1}+p^{x}_{i}$ and $y_{j}+L_{2}+p^{y}_{j}$ are its constituent vectors; $X=(x_{1},\ldots,x_{i},\ldots,x_{I})$ is the set of word vectors of the $I$ subwords of the source sentence; $y_{0}$ is an all-zero vector; $y_{1},\ldots,y_{J-1}$ are the word vectors of the first $J-1$ of the $J$ subwords of the target sentence; $L_{1}$ is the language vector corresponding to the source sentence; $L_{2}$ is the language vector corresponding to the target sentence; $p^{x}_{1},\ldots,p^{x}_{I}$ are the position vectors of $x_{1},\ldots,x_{I}$; and $p^{y}_{0},p^{y}_{1},\ldots,p^{y}_{J-1}$ are the position vectors of $y_{0},y_{1},\ldots,y_{J-1}$;

for the training network at the $m$-th layer, the expressions for the query representation, the key representation and the value representation are:

$Q^{m} = H^{m-1}W^{m}_{Q}$

$K^{m} = H^{m-1}W^{m}_{K}$

$V^{m} = H^{m-1}W^{m}_{V}$

wherein $Q^{m}$, $K^{m}$ and $V^{m}$ are respectively the query representation, the key representation and the value representation of the training network at the $m$-th layer; $W^{m}_{Q}$, $W^{m}_{K}$ and $W^{m}_{V}$ are all model training parameters of the training network at the $m$-th layer; and $H^{m-1}$ is the $m$-th joint input representation, namely the joint input representation corresponding to the training network at the $m$-th layer;

the expression of the intermediate representation is:

$H^{m'} = \mathrm{softmax}\!\left(\dfrac{Q^{m}(K^{m})^{\mathsf{T}}}{\sqrt{d}} + B\right)V^{m}$

wherein $H^{m'}$ is the intermediate representation of the training network at the $m$-th layer; $d$ is the dimension of the hidden representation of the deep neural machine translation model; $B$ is the mask matrix of the deep neural machine translation model; and softmax is the normalized exponential function;

the expression of the output representation is:

$H^{m} = \max(0,\ H^{m'}W^{m}_{1}+b^{m}_{1})\,W^{m}_{2}+b^{m}_{2}$

wherein $H^{m}$ is the output representation of the training network at the $m$-th layer; $W^{m}_{1}$, $b^{m}_{1}$, $W^{m}_{2}$ and $b^{m}_{2}$ are all model training parameters of the training network at the $m$-th layer; $\max$ takes the element-wise maximum of $0$ and $H^{m'}W^{m}_{1}+b^{m}_{1}$; and $1 \le m \le M$.
6. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor when executing the program performs the steps of the method for training a deep neural machine translation model according to any one of claims 1 to 4.
7. A non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method for training a deep neural machine translation model according to any one of claims 1 to 4.
CN201910528250.5A 2019-06-18 2019-06-18 Method and device for training deep neural machine translation model Active CN110263352B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910528250.5A CN110263352B (en) 2019-06-18 2019-06-18 Method and device for training deep neural machine translation model


Publications (2)

Publication Number Publication Date
CN110263352A CN110263352A (en) 2019-09-20
CN110263352B true CN110263352B (en) 2023-04-07

Family

ID=67919168

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910528250.5A Active CN110263352B (en) 2019-06-18 2019-06-18 Method and device for training deep neural machine translation model

Country Status (1)

Country Link
CN (1) CN110263352B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112308370B (en) * 2020-09-16 2024-03-05 湘潭大学 Automatic subjective question scoring method for thinking courses based on Transformer

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103810999A (en) * 2014-02-27 2014-05-21 清华大学 Linguistic model training method and system based on distributed neural networks
JP2016218995A * 2015-05-25 2016-12-22 Panasonic Intellectual Property Corporation of America Machine translation method, machine translation system and program
CN108197123A (en) * 2018-02-07 2018-06-22 云南衍那科技有限公司 A kind of cloud translation system and method based on smartwatch
CN108563640A (en) * 2018-04-24 2018-09-21 中译语通科技股份有限公司 A kind of multilingual pair of neural network machine interpretation method and system
CN108647214A (en) * 2018-03-29 2018-10-12 中国科学院自动化研究所 Coding/decoding method based on deep-neural-network translation model
CN109145315A (en) * 2018-09-05 2019-01-04 腾讯科技(深圳)有限公司 Text interpretation method, device, storage medium and computer equipment


Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Full-wave modeling of antennas by elementary sources based on spherical waves translation; Juan F. Izquierdo et al.; 2012 6th European Conference on Antennas and Propagation (EUCAP); full text *
Design of a knowledge question-answering system for intelligent mechanical manufacturing based on deep learning; Zhu Jiannan et al.; Computer Integrated Manufacturing Systems, Issue 05; full text *
Application of an improved HMM model to feature extraction; Chen Changhao et al.; Computer Measurement & Control, Issue 04; full text *
Template-driven neural machine translation; Li Qiang et al.; Chinese Journal of Computers, Vol. 42, Issue 3; full text *
Research on domain adaptation in statistical machine translation; Liu Hao; China Master's Theses Full-text Database; full text *

Also Published As

Publication number Publication date
CN110263352A (en) 2019-09-20

Similar Documents

Publication Publication Date Title
KR102382499B1 (en) Translation method, target information determination method, related apparatus and storage medium
US11556713B2 (en) System and method for performing a meaning search using a natural language understanding (NLU) framework
CN106484682B (en) Machine translation method, device and electronic equipment based on statistics
CN108052588B (en) Method for constructing automatic document question-answering system based on convolutional neural network
CN109117483B (en) Training method and device of neural network machine translation model
CN106502985B (en) neural network modeling method and device for generating titles
CN110309287B (en) Retrieval type chatting dialogue scoring method for modeling dialogue turn information
CN102968989B (en) Improvement method of Ngram model for voice recognition
CN108734276A (en) A kind of learning by imitation dialogue generation method generating network based on confrontation
CN107967262A (en) A kind of neutral net covers Chinese machine translation method
CN110070855B (en) Voice recognition system and method based on migrating neural network acoustic model
CN111144140B (en) Zhongtai bilingual corpus generation method and device based on zero-order learning
Khan et al. RNN-LSTM-GRU based language transformation
CN101458681A (en) Voice translation method and voice translation apparatus
CN108932232A (en) A kind of illiteracy Chinese inter-translation method based on LSTM neural network
CN111191468B (en) Term replacement method and device
Mandal et al. Futurity of translation algorithms for neural machine translation (NMT) and its vision
CN110263352B (en) Method and device for training deep neural machine translation model
CN111428518A (en) Low-frequency word translation method and device
CN111178097B (en) Method and device for generating Zhongtai bilingual corpus based on multistage translation model
Riou et al. Online adaptation of an attention-based neural network for natural language generation
CN115017924B (en) Construction of neural machine translation model for cross-language translation and translation method thereof
CN114254657B (en) Translation method and related equipment thereof
CN115860015A (en) Translation memory-based transcribed text translation method and computer equipment
CN111104806A (en) Construction method and device of neural machine translation model, and translation method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant