CN112990434A - Training method of machine translation model and related device - Google Patents

Training method of machine translation model and related device

Info

Publication number
CN112990434A
CN112990434A, CN202110255893.4A, CN202110255893A
Authority
CN
China
Prior art keywords
word
attention
self
sequence
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110255893.4A
Other languages
Chinese (zh)
Other versions
CN112990434B (en)
Inventor
魏文琦
王健宗
张之勇
程宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202110255893.4A priority Critical patent/CN112990434B/en
Publication of CN112990434A publication Critical patent/CN112990434A/en
Application granted granted Critical
Publication of CN112990434B publication Critical patent/CN112990434B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Machine Translation (AREA)

Abstract

The embodiments of the present application provide a training method for a machine translation model and a related apparatus, wherein the method comprises the following steps: calculating, through a self-attention layer, the similarity between a word to be encoded and each word in a preset first sequence, wherein the word to be encoded is the word input at the ith moment in a preset second sequence, the second sequence is a preset word sequence to be input over k moments, the first sequence is the sequence of words of the second sequence input before the ith moment, i and k are positive integers, and i is smaller than k; calculating the self-attention of the word to be encoded according to the similarity; inputting the self-attention into a feedforward neural network to obtain an output result; calculating a loss value between the output result and the self-attention; and adjusting the network parameters of the machine translation model according to the loss value. Through the embodiments of the present application, the training speed of the model can be improved.

Description

Training method of machine translation model and related device
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a training method of a machine translation model and a related device.
Background
In natural language generation tasks, most systems are implemented based on the Seq2Seq model, for example generative dialogue, machine translation, text summarization, and so on. Seq2Seq is a network with an Encoder-Decoder architecture whose input is a sequence and whose output is also a sequence. The Encoder turns a variable-length signal sequence into a fixed-length vector representation, and the Decoder turns the fixed-length vector into a variable-length target signal sequence. Both the Encoder and the Decoder can be built from the Transformer structure, and the attention mechanism in the Transformer structure allows the Seq2Seq model to focus on all of the input information that is important for the next target word, which greatly improves the effect of the Seq2Seq model.
However, in the process of training the model, when the input and output sequences are long, the amount of computation is large and the training speed is slow, which makes training time-consuming and inefficient.
Disclosure of Invention
The application provides a training method of a machine translation model, which can improve the training speed of the model.
A first aspect of the present application provides a method for training a machine translation model, where the machine translation model includes an encoder, the encoder includes a self-attention layer and a feedforward neural network, and the method may include: calculating the similarity between a word to be coded and each word in a preset first sequence through a self-attention layer, wherein the word to be coded is a word input at the ith moment in a preset second sequence, the second sequence is a preset word sequence which needs to be input through k moments, the first sequence is a word sequence input before the ith moment in the words of the second sequence, i and k are positive integers, and i is smaller than k; calculating according to the similarity to obtain the self-attention of the word to be coded; inputting the self-attention into a feedforward neural network to obtain an output result; calculating a loss value between the output result and the self attention; and adjusting the network parameters of the machine translation model according to the loss value.
According to the first aspect, in one possible implementation manner, calculating, by the self-attention layer, a similarity between a word to be encoded and each word in a preset first sequence includes: obtaining a < Key, Value > data pair for each word in the first sequence; and calculating the similarity between the Query of the word to be coded and each Key, wherein the similarity is a weight coefficient of Value corresponding to each Key.
According to the first aspect, in one possible implementation manner, obtaining the self-attention of the word to be encoded according to similarity calculation includes: acquiring a random function value; and if the random function Value is larger than or equal to a first threshold Value, carrying out weighted summation on the similarity and the Value of the word represented by the similarity to obtain the self attention of the word to be coded.
According to the first aspect, in a possible implementation manner, the method further includes: and if the random function value is smaller than the first threshold value, taking the self-attention of the word at the i-1 th moment in the first sequence as the self-attention of the word to be coded.
According to the first aspect, in one possible implementation, a feedforward neural network includes an input layer, a hidden layer, and an output layer, and inputting self-attention into the feedforward neural network to obtain an output result includes: inputting the self-attention into the input layer to obtain a first output; inputting the first output to the hidden layer to obtain a second output; and inputting the second output to the output layer to obtain an output result.
According to the first aspect, in one possible implementation, calculating a loss value between the output result and the self-attention includes: obtaining a closed-form expression of the self-attention through recursion of the likelihood function, the closed-form expression being α_{i,j} = p_{i,j} · Σ_{k=1}^{j} ( α_{i-1,k} · Π_{l=k}^{j-1} (1 - p_{i,l}) ), where p_{i,j} denotes the similarity and α_{i,j} denotes the self-attention; and calculating a loss value between the closed-form expression of the self-attention and the output result by using a loss function.
According to the first aspect, in a possible implementation manner, adjusting the network parameters of the encoder in the machine translation model according to the loss value includes: calculating the partial derivative of the loss value with respect to each network parameter in the machine translation model; calculating the gradient of the loss value with respect to the network parameters according to the derivative chain rule; and updating the network parameters according to the gradient values so that the loss value converges to the global optimum.
The second aspect of the present application provides a training apparatus for a machine translation model, including:
the self-attention unit is used for calculating the similarity between a word to be coded and each word in a preset first sequence through a self-attention layer, wherein the word to be coded is a word input at the ith moment in a preset second sequence, the second sequence is a preset word sequence needing to be input through k moments, the first sequence is a word sequence input before the ith moment in the words of the second sequence, i and k are positive integers, and i is smaller than k;
the self-attention unit is used for calculating and obtaining the self-attention of the word to be coded according to the similarity;
the feedforward neural network unit is used for inputting the self attention into the feedforward neural network to obtain an output result;
a calculation unit for calculating a loss value between the output result and the self-attention;
and the adjusting unit is used for adjusting the network parameters of the machine translation model according to the loss value.

A third aspect of the present application provides an electronic device comprising a processor, a memory, a communication interface, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the processor, the programs comprising instructions for performing the steps of the method of any of the first aspects of the present application.
A fourth aspect of the present application provides a computer readable storage medium having a computer program stored thereon, the computer program being executable by a processor to perform some or all of the steps described in any of the methods of the first aspect of the present application.
It can be seen that, with the training method of a machine translation model and the related apparatus provided in the present application, when the self-attention of the word input at the ith moment is calculated, only the sequence input at the first i moments needs to be obtained; compared with the prior art, in which the entire sequence needs to be obtained, the time required to calculate the self-attention of the word input at the ith moment can be reduced. The self-attention is input into the feedforward neural network to obtain an output result, and the output result is used to update the machine translation model. Because the self-attention layer can output the self-attention more quickly, the time required to update the machine translation model can be shortened, and the training speed of the machine translation model can be improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments or the background art of the present application, the drawings required to be used in the embodiments or the background art of the present application will be described below.
FIG. 1 is a schematic structural diagram of a machine translation model provided by an embodiment of the present application;
fig. 2 is a schematic flowchart of a training method of a machine translation model according to an embodiment of the present disclosure;
fig. 3 is a schematic structural diagram of a training apparatus for a machine translation model according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The embodiments of the present application will be described below with reference to the drawings.
The terms "first," "second," "third," and "fourth," etc. in the description and claims of this application and in the accompanying drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
As used in this specification, the terms "component," "module," "system," and the like are intended to refer to a computer-related entity, either hardware, firmware, a combination of hardware and software, or software in execution. For example, a component may be, but is not limited to being, a process running on a processor, an object, an executable, a thread of execution, a program, and/or a computer. By way of illustration, both an application running on a computing device and the computing device can be a component. One or more components can reside within a process and/or thread of execution and a component can be localized on one computer and/or distributed between 2 or more computers. In addition, these components can execute from various computer readable media having various data structures stored thereon. The components may communicate by way of local and/or remote processes such as in accordance with a signal having one or more data packets (e.g., data from two components interacting with another component in a local system, distributed system, and/or across a network such as the internet with other systems by way of the signal).
Referring to fig. 1, fig. 1 is a schematic structural diagram of a machine translation model according to an embodiment of the present disclosure. In a machine translation task, the machine translation model receives sentences in one language and outputs corresponding translations in another language, or receives input in one modality, such as speech, and outputs a corresponding translation in another modality, such as text. As can be seen from fig. 1, the encoding component is a stack of encoders comprising n sequentially connected encoding layers (a stack of 6 encoders is shown in fig. 1, but the number of encoders is not limited to 6), and the decoding component is a stack of the same number of decoders, likewise comprising n sequentially connected decoding layers. Each encoder has the same structure and can be decomposed into two sub-modules, namely a Self-Attention layer and a Feed-Forward Neural Network. As can be seen from fig. 1, in the encoding part, the input of each encoding layer is the output of the previous encoding layer; in the decoding part, the input of each decoding layer is not only the output of the previous decoding layer, but also the output of the whole encoding part.
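By way of illustration only (not part of the patent), the stacked structure described above can be sketched roughly as follows in Python/NumPy. The class names, dimensions and the single-head attention used here are assumptions for the example, and residual connections and layer normalization, which the text does not discuss, are omitted.

import numpy as np

class EncoderLayer:
    """One encoding layer: a self-attention sub-module followed by a feed-forward sub-module."""
    def __init__(self, d_model, d_ff, seed=0):
        rng = np.random.default_rng(seed)
        self.w_ff1 = rng.normal(scale=0.02, size=(d_model, d_ff))
        self.w_ff2 = rng.normal(scale=0.02, size=(d_ff, d_model))

    def self_attention(self, x):
        # Simplified single-head self-attention over the inputs x of shape (seq_len, d_model).
        scores = x @ x.T / np.sqrt(x.shape[-1])
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights = weights / weights.sum(axis=-1, keepdims=True)
        return weights @ x

    def feed_forward(self, z):
        return np.maximum(0.0, z @ self.w_ff1) @ self.w_ff2  # one ReLU hidden layer

    def __call__(self, x):
        return self.feed_forward(self.self_attention(x))

class Encoder:
    """A stack of n sequentially connected encoding layers; each layer consumes the previous layer's output."""
    def __init__(self, n_layers=6, d_model=64, d_ff=256):
        self.layers = [EncoderLayer(d_model, d_ff, seed=i) for i in range(n_layers)]

    def __call__(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

# Example: encode a sequence of 4 word vectors.
print(Encoder()(np.random.default_rng(1).normal(size=(4, 64))).shape)  # (4, 64)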
For the encoder in the machine translation model, each word in the first sequence is first converted into a vector using word embedding. To address the problem of word order in the first sequence, the machine translation model adds a vector to each word embedding of the first sequence; these vectors follow a specific pattern learned by the model, which helps determine the position of each word, or the distance between different words. The encoder then passes the list of vectors into the self-attention layer, then into the feed-forward neural network, and then passes the output to the next encoder.
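Purely as an illustrative sketch (not taken from the patent), the embedding step could look as follows in Python/NumPy. The vocabulary size, model dimension and the sinusoidal position pattern are assumptions; the text only states that the added vectors follow a pattern that helps determine word positions.

import numpy as np

def embed_with_positions(word_ids, vocab_size=10000, d_model=512, seed=0):
    """Convert each word of the first sequence into a vector and add a position-dependent vector."""
    rng = np.random.default_rng(seed)
    embedding_table = rng.normal(scale=0.02, size=(vocab_size, d_model))  # word embedding matrix
    x = embedding_table[word_ids]                      # (seq_len, d_model) word vectors

    # Add a vector following a fixed pattern so the model can use the position of each word.
    positions = np.arange(len(word_ids))[:, None]
    dims = np.arange(d_model)[None, :]
    angle = positions / np.power(10000.0, (2 * (dims // 2)) / d_model)
    pos_vec = np.where(dims % 2 == 0, np.sin(angle), np.cos(angle))
    return x + pos_vec

# Example: four hypothetical word ids, e.g. the words input at the first i moments.
print(embed_with_positions(np.array([12, 7, 304, 55])).shape)  # (4, 512)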
First, the similarity between a word to be encoded and each word in a preset first sequence is calculated through the self-attention layer, where the word to be encoded is the word input at the ith moment in a preset second sequence, the second sequence is a preset word sequence to be input over k moments, the first sequence is the sequence of words of the second sequence input before the ith moment, i and k are positive integers, and i is smaller than k. For example, assume that the machine translation model needs to translate "we eat pizza, fried chicken and hamburgers today." into English. When the word to be encoded input at moment i is "pizza", the first sequence at moment i-1 is "we eat today". When the self-attention of the word "pizza" at moment i is calculated through the self-attention layer, each word input at the first i moments needs to be scored against this word, that is, the similarity between "pizza" and each of "today", "we", "eat" and "pizza" is calculated. The calculated similarities can then be normalized through softmax, and finally the vector of each word in the first sequence is multiplied by its corresponding softmax score and the results are added up to obtain the self-attention of the word to be encoded. The self-attention of the word to be encoded output by the self-attention layer is then input into the feedforward neural network to obtain an output result; that is, the input of the feedforward neural network is the output of the self-attention layer. As can be seen from fig. 1, since there are 6 encoding layers, repeating the above steps 6 times yields the output of the entire encoding part.
The encoder starts working by processing the input sequence, and the output of the top encoder is then converted into a set of attention vectors to be used by each decoder in its encoder-decoder attention layer, which helps the decoder focus on the appropriate positions of the input sequence. After the encoding phase is completed, the decoding phase begins.
For machine learning, the parameters of the model need to be continually optimized through training data to improve the accuracy of the model. Therefore, the output result of the feedforward neural network is compared with the self-attention of the word to be encoded to obtain a loss value, and the parameters of the machine translation model are adjusted according to the loss value; that is, the encoder in the machine translation model is adjusted according to the loss value, and training of the machine translation model continues until the training stop condition is reached. After the encoder is updated, the self-attention of the word to be encoded can be updated through the updated encoder.
It should be noted that the self-attention used in the embodiments of the present application may also be multi-head attention, which is not limited in the embodiments of the present application.
Referring to fig. 2, fig. 2 is a flowchart illustrating a method for training a machine translation model according to an embodiment of the present disclosure, where the method for training a machine translation model is applicable to the network structure in fig. 1. As shown in fig. 2, a method for training a machine translation model provided in an embodiment of the present application may include:
in step S201, the similarity between the word to be encoded and each word in the first sequence is calculated through the self-attention layer.
Specifically, the input of the self-attention layer is word vectors, that is, the input of the entire machine translation model is in the form of word vectors. First, the word vector of each word input at the first i moments is multiplied by three matrices to obtain three new vectors. The reason for multiplying by three parameter matrices instead of directly using the original word vectors is that this introduces more parameters, which improves the model's performance. That is, the input word vector X1 is multiplied by the three matrices to obtain Q1, K1 and V1; the input word vector X2 is multiplied by the three matrices to obtain Q2, K2 and V2; and so on, so that the Q, K and V corresponding to each word input at the first i moments are obtained, where Q represents the Query, K represents the Key, and V represents the Value.
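A minimal sketch of this Q/K/V projection in Python/NumPy follows; the matrix names W_Q, W_K, W_V, the dimensions and the random initialization are illustrative assumptions, not values from the patent.

import numpy as np

d_model, d_k = 512, 64
rng = np.random.default_rng(0)
W_Q = rng.normal(scale=0.02, size=(d_model, d_k))   # parameter matrix producing the Query
W_K = rng.normal(scale=0.02, size=(d_model, d_k))   # parameter matrix producing the Key
W_V = rng.normal(scale=0.02, size=(d_model, d_k))   # parameter matrix producing the Value

# X holds the word vectors X1..Xi input at the first i moments, one row per word.
X = rng.normal(size=(5, d_model))

# Multiplying each word vector by the three parameter matrices yields Q, K and V for every word.
Q = X @ W_Q   # Q1..Qi (Query vectors)
K = X @ W_K   # K1..Ki (Key vectors)
V = X @ W_V   # V1..Vi (Value vectors)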
Then, the similarity between the word to be encoded input at moment i and each word in the first sequence input at the first i moments is calculated by the self-attention layer; the similarity can be a numerical value. Suppose the machine translation model needs to translate "we eat pizza, fried chicken and hamburgers today." into English. When the word to be encoded input at moment i is "pizza", the first sequence at moment i-1 is "we eat today". When the self-attention of the word "pizza" at moment i is being calculated, each word input at the first i moments in the first sequence needs to be scored against this word, that is, the similarity between "pizza" and each of "today", "we", "eat" and "pizza" is calculated. The similarity between "pizza" and "fried chicken" or "hamburger" does not need to be calculated. Therefore, the machine translation model can be trained in an online manner without knowledge of future information. When the machine translation model processes the word at a certain position, these scores determine how much focus (attention) is placed on the other parts of the first sequence.
The similarity here can be obtained by taking the dot product of the "Query" vector with the "Key" vector of the word being scored, by the cosine similarity of the two vectors, or by introducing an additional neural network. That is, "Query" is the vector of the word to be encoded input at moment i, and "Key" is the vector of each word in the first sequence at moment i. To calculate the self-attention of the word "pizza" input at moment i in "we eat pizza, fried chicken and hamburgers today", the "Key" vector of each word in the first sequence at moment i and the "Query" vector Q1 of the word input at moment i need to be obtained; that is, the "Key" vector of "today" is K1, the "Key" vector of "we" is K2, the "Key" vector of "eat" is K3, and the "Key" vector of "pizza" is K4. Therefore, the first score of "pizza" is the dot product of Q1 and K1, the second score is the dot product of Q1 and K2, the third score is the dot product of Q1 and K3, and the fourth score is the dot product of Q1 and K4.
And step S202, obtaining the self attention of the word to be coded according to the similarity.
Specifically, the numerical values corresponding to the similarities obtained above are divided by a specific value, namely the square root of the dimension of the "Key" vector; since the dimension of the "Key" vector is usually 64, the specific value may be 8. The results are then passed through softmax, which normalizes the values so that they are all positive and add up to 1. Then, the "Value" vector of each word in the first sequence at the first i moments is multiplied by its softmax score, mainly to keep the Value of the words that should be focused on unchanged while masking out the irrelevant words. Finally, the weighted "Value" vectors are added up; at this point, the self-attention of the word to be encoded input at moment i is output by the self-attention layer. For example, let the self-attention of the word to be encoded be Z, let the "Value" vectors of the words in the first sequence at the first i-1 moments be V1, V2, V3 and V4, respectively, and let the "Value" vector of the word input at moment i be V5; let the softmax results for the above words be Z1, Z2, Z3, Z4 and Z5, respectively. Then the self-attention of the word to be encoded is Z = Z1·V1 + Z2·V2 + Z3·V3 + Z4·V4 + Z5·V5.
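The scoring, scaling, softmax and weighted-sum operations of steps S201-S202 can be sketched as follows in Python/NumPy. This is only an illustration of the computation, not the patented implementation; the function name and the example arrays are assumptions.

import numpy as np

def self_attention_at_moment_i(Q, K, V):
    """Self-attention of the word to be encoded (last row) over the words input at the first i moments."""
    d_k = K.shape[-1]                       # dimension of the "Key" vectors, e.g. 64
    q_i = Q[-1]                             # "Query" vector of the word input at moment i
    scores = K @ q_i                        # dot products of q_i with K1..Ki (the similarities)
    scores = scores / np.sqrt(d_k)          # divide by the square root of the Key dimension (8 when d_k = 64)
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()       # softmax: all values are positive and sum to 1
    return weights @ V                      # weighted sum of "Value" vectors, i.e. Z = Z1*V1 + ... + Zi*Vi

# Example with random Q, K, V for five words (the words input at the first i moments).
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 5, 64))
print(self_attention_at_moment_i(Q, K, V).shape)  # (64,)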
In one possible implementation, a random function value (which may be any value in the range of 0 to 1) is obtained by a random function (for example, a Random function), and the self-attention of the word to be encoded is then obtained according to the random function value as follows. Case one: if the random function value is smaller than the first threshold, the computation of the self-attention of the word to be encoded is skipped, and the self-attention of the word at the (i-1)th moment in the first sequence is taken as the self-attention of the word to be encoded. Case two: if the random function value is larger than the first threshold, the self-attention of the word to be encoded is output by the self-attention layer according to the similarities. The first threshold is a reference value that may be set according to experience, or obtained by training or learning from a number of historical values, and can be any value in the range of 0 to 1. Further, the first threshold may be the similarity, after the softmax operation, between the word to be encoded and any word in the first sequence.
In one possible implementation, a random function value (which may be any value in the range of 0 to 1) is obtained by a random function (for example, a Random function), and if the random function value is smaller than the first threshold, the Value of the nth word in the first sequence is set to 0 by random sampling. Alternatively, several words may be selected from the words of the first sequence at the first i-1 moments, and the "Value" vector of any one of these words may be set to 0. As a result, the contribution of that word to the self-attention of the word to be encoded is 0; that is, the attention corresponding to the nth vector is not included in the self-attention of the word to be encoded. For example, if V3 is set to 0, then Z = Z1·V1 + Z2·V2 + Z3·0 + Z4·V4 + Z5·V5.
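The two random-skipping variants described above might be sketched as follows, reusing the self_attention_at_moment_i function from the earlier sketch; the threshold value and helper names are assumptions made for illustration only.

import numpy as np

def maybe_skip_self_attention(Q, K, V, prev_self_attention, first_threshold=0.5, rng=None):
    """Compare a random value in [0, 1) with the first threshold before computing self-attention."""
    rng = rng if rng is not None else np.random.default_rng()
    r = rng.random()                                   # random function value in the range 0..1
    if r < first_threshold:
        # Case one: skip the computation and reuse the self-attention of the word at moment i-1.
        return prev_self_attention
    # Case two: compute the self-attention from the similarities as usual.
    return self_attention_at_moment_i(Q, K, V)

def zero_out_one_value(V, rng=None):
    """Alternative variant: randomly set the Value vector of one word in the first sequence to zero,
    so that word contributes nothing to the self-attention of the word to be encoded."""
    rng = rng if rng is not None else np.random.default_rng()
    n = rng.integers(len(V))
    V = V.copy()
    V[n] = 0.0
    return V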
In step S203, the self-attention is input into the feedforward neural network to obtain an output result.
Specifically, the feedforward neural network has two main elements: a loss function and an activation function. To solve nonlinear classification or regression problems, the activation function must be a nonlinear function; in addition, the machine translation model is trained in a gradient-based manner, so the activation function must be differentiable. The loss function represents the error between the predicted value and the true value, and training the machine translation model is the process of minimizing the loss function by a gradient-based method. The neurons in the feedforward neural network are arranged in layers; each layer of neurons is connected only to the neurons in the previous layer, receives the output of the previous layer, and passes its output to the next layer, with no feedback between layers. The 0th layer is called the input layer, the last layer is called the output layer, and the intermediate layers are called hidden layers. After the self-attention is input into the feedforward neural network, the input layer produces a first output, the first output is fed into one or more hidden layers to obtain a second output, and the second output is passed to the output layer, whose output is taken as the output of the whole network. The values of the hidden layers and the output layer are obtained by substituting the weighted sum of the previous layer into the activation function.
Common activation functions include the Sigmoid activation function, the Tanh activation function, the ReLU activation function, and the Leaky ReLU (LReLU) activation function.
Common loss functions include the mean square error loss function and the cross-entropy loss function.
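A minimal feedforward network matching the input/hidden/output description above is sketched below in Python/NumPy. The layer sizes and the choice of ReLU for the hidden layer and Sigmoid for the output layer are assumptions for the example, not the patent's prescribed configuration.

import numpy as np

class FeedForwardNetwork:
    def __init__(self, d_in=64, d_hidden=256, d_out=64, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(scale=0.02, size=(d_in, d_hidden))   # input layer -> hidden layer
        self.W2 = rng.normal(scale=0.02, size=(d_hidden, d_out))  # hidden layer -> output layer

    def forward(self, z):
        # The value of each layer is the weighted sum of the previous layer fed into an activation function.
        first_output = z                                  # output of the input layer
        second_output = np.maximum(0.0, first_output @ self.W1)   # ReLU activation in the hidden layer
        return 1.0 / (1.0 + np.exp(-(second_output @ self.W2)))   # Sigmoid activation in the output layer

# Example: feed the self-attention vector Z of the word to be encoded through the network.
ffn = FeedForwardNetwork()
Z = np.random.default_rng(1).normal(size=64)
output_result = ffn.forward(Z)
print(output_result.shape)  # (64,)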
In step S204, a loss value between the output result and the self-attention is calculated.
Specifically, for machine learning, the parameters of the model need to be continually optimized through training data to improve the accuracy of the model. Therefore, the output result is compared with the self-attention of the word to be encoded, through back-propagation derivation, to obtain a loss value; the parameters of the machine translation model are adjusted according to the loss value, and training of the machine translation model continues until the training stop condition is reached.
Since the self-attention of the word to be encoded calculated in steps S201 to S203 is determined according to a binomial distribution, and a calculation determined at each step according to a binomial distribution is an operation in probability space, which differs from the Euclidean space required for back-propagation derivation, a mapping needs to be found so that the self-attention of the word to be encoded can be differentiated in Euclidean space.
In one possible implementation, a closed-form expression of the self-attention of the word to be encoded is obtained through a recursive likelihood function. First, let the similarity of the word to be encoded be p_{i,j} and let the self-attention of the word to be encoded be α_{i,j}; α_{i,j} can then be expressed recursively through α_{i-1,j} and α_{i,j-1}. Through sequence transformation and accumulation, letting

q_{i,j} = (1 - p_{i,j-1}) · q_{i,j-1} + α_{i-1,j}

gives

α_{i,j} = p_{i,j} · q_{i,j}.

Accumulating over j yields the closed-form solution

q_{i,j} = Σ_{k=1}^{j} ( α_{i-1,k} · Π_{l=k}^{j-1} (1 - p_{i,l}) ).

According to α_{i,j} = p_{i,j} · q_{i,j}, the closed-form expression of the self-attention is obtained:

α_{i,j} = p_{i,j} · Σ_{k=1}^{j} ( α_{i-1,k} · Π_{l=k}^{j-1} (1 - p_{i,l}) ).

Then, the loss value between the closed-form expression of the self-attention and the output result is calculated by a loss function. For example, if Sigmoid is used as the activation function in the output layer, the derivative of the Sigmoid is sigmoid · (1 - sigmoid), so the loss value is (output result - self-attention) · sigmoid · (1 - sigmoid) = (output value - self-attention) · output value · (1 - output value).
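To make the recursion concrete, a sketch of the closed-form computation in Python/NumPy follows, written directly from the recursion q_{i,j} = (1 - p_{i,j-1}) · q_{i,j-1} + α_{i-1,j} and α_{i,j} = p_{i,j} · q_{i,j}. It is an illustration under the assumption that q_{i,0} = 0, not the patented code.

import numpy as np

def self_attention_closed_form(p, alpha_prev):
    """Compute the row alpha_{i, :} from the similarities p_{i, :} and the previous row alpha_{i-1, :}.

    Implements alpha_{i,j} = p_{i,j} * sum_{k<=j} alpha_{i-1,k} * prod_{l=k}^{j-1} (1 - p_{i,l}).
    """
    j_max = len(p)
    alpha = np.zeros(j_max)
    q = 0.0
    for j in range(j_max):
        # Recursion: q_{i,j} = (1 - p_{i,j-1}) * q_{i,j-1} + alpha_{i-1,j}.
        q = (1.0 - (p[j - 1] if j > 0 else 0.0)) * q + alpha_prev[j]
        alpha[j] = p[j] * q
    return alpha

# Example with 4 positions: similarities p in (0, 1) and a normalized previous attention row.
p = np.array([0.9, 0.4, 0.7, 0.2])
alpha_prev = np.array([0.1, 0.3, 0.4, 0.2])
print(self_attention_closed_form(p, alpha_prev))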
The parameters of the machine translation model are then updated according to the loss value, that is, the encoder in the machine translation model is updated according to the loss value, and training of the machine translation model continues until the training stop condition is reached.
And step S205, adjusting the network parameters of the machine translation model according to the loss value.
Specifically, the Back Propagation (BP) Algorithm is a learning algorithm suitable for multi-layer neuron networks and is based on the gradient descent method. The input-output relationship of a back-propagation network is essentially a mapping: an n-input, m-output BP neural network performs a continuous mapping from n-dimensional Euclidean space to a finite field in m-dimensional Euclidean space, and this mapping is highly nonlinear. The BP algorithm repeatedly iterates over two phases (excitation propagation and weight updating) until the response of the network to the input reaches the preset target range.
The BP algorithm is mainly divided into two stages: a forward stage, in which the output for a given input is computed; and a backward stage, in which the gradient is calculated and the weight coefficients are updated. First, the partial derivative of the loss value with respect to each network parameter in the machine translation model is calculated based on the BP algorithm; then, the gradient of the loss value with respect to the network parameters is calculated according to the derivative chain rule; finally, the network parameters are updated according to the gradient values so that the loss value converges to the global optimum.
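As an illustrative sketch of the two BP stages on a tiny one-layer model in Python/NumPy: the model, the Tanh activation, the learning rate and the mean-square-error loss are assumptions for the example, not the patent's actual network.

import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(64, 64))           # a single network parameter matrix
z = rng.normal(size=64)                            # input, e.g. the self-attention of the word to be encoded
target = np.tanh(rng.normal(size=64))              # value the output is compared against

learning_rate = 0.1
for step in range(100):
    # Forward stage: compute the output for the given input.
    output = np.tanh(z @ W)
    loss = 0.5 * np.mean((output - target) ** 2)   # mean-square-error loss value

    # Backward stage: partial derivatives via the chain rule, then the gradient of the loss w.r.t. W.
    d_output = (output - target) / len(output)     # dL/d(output)
    d_pre = d_output * (1.0 - output ** 2)         # chain rule through the Tanh activation
    grad_W = np.outer(z, d_pre)                    # dL/dW

    # Update the network parameter along the negative gradient so the loss value decreases.
    W -= learning_rate * grad_W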
Referring to fig. 3, fig. 3 is a schematic diagram of a training apparatus for a machine translation model according to an embodiment of the present disclosure, which may be applied to an encoder. As shown in fig. 3, the training device 30 for machine translation model may include:
the self-attention unit 301 calculates similarity between a word to be encoded and each word in a preset first sequence through a self-attention layer, where the word to be encoded is a word input at the ith time in a preset second sequence, the second sequence is a preset word sequence which needs to be input through k times, the first sequence is a word sequence input before the ith time in the words of the second sequence, i and k are positive integers, and i is smaller than k;
the self-attention unit 301 is further configured to calculate and obtain self-attention of the word to be encoded according to the similarity;
an input unit 302, configured to input the self-attention into the feedforward neural network to obtain an output result;
a calculation unit 303 for calculating a loss value between the output result and the self-attention;
and an adjusting unit 304, configured to adjust a network parameter of the machine translation model according to the loss value.
For specific implementation of the training apparatus for a machine translation model in the embodiment of the present application, reference may be made to the above embodiments of the training method for a machine translation model, and details are not described herein again.
It should be noted that the implementation of each unit may also correspond to the corresponding description of the method embodiment shown in fig. 2, and is not described herein again.
Referring to fig. 4, fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 4, the electronic device 40 may include: one or more processors 401, one or more memories 402, and one or more communication interfaces 403. These components may be connected by a bus 404 or in other ways, a communication bus being shown in fig. 4 as an example. Wherein:
the communication interface 403 can be used for the processing device 40 of the service data to communicate with other communication devices, such as other electronic devices. In particular, the communication interface 403 may be a wired interface.
The memory 402 may be coupled to the processor 401 via the bus 404 or an input/output port, and the memory 402 may be integrated with the processor 401. The memory 402 is used to store various software programs and/or sets of instructions or data. Specifically, the Memory 402 may be a Read-Only Memory (ROM) or other types of static storage devices that can store static information and instructions, a Random Access Memory (RAM) or other types of dynamic storage devices that can store information and instructions, an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Compact Disc Read-Only Memory (CD-ROM) or other optical Disc storage, optical Disc storage (including Compact Disc, laser Disc, optical Disc, digital versatile Disc, blu-ray Disc, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code resources in the form of instructions or data structures and that can be accessed by a computer, but is not limited to such. The memory 402 may include high-speed random access memory and may also include non-volatile memory, such as one or more magnetic disk storage devices, flash memory devices, or other non-volatile solid-state storage devices. The memory 402 may store an operating system (hereinafter, referred to as a system), such as an embedded operating system like uCOS, VxWorks, RTLinux, etc. The memory 402 may also store a network communication program that may be used to communicate with one or more additional devices, one or more user devices, one or more electronic devices. The memory may be self-contained and coupled to the processor via a bus. The memory may also be integral to the processor.
The memory 402 is used for storing application code resources for executing the above scheme, and is controlled by the processor 401. The processor 401 is configured to execute application code resources stored in the memory 402.
The processor 401 may be a central processing unit, general purpose processor, digital signal processor, application specific integrated circuit, field programmable gate array or other programmable logic device, transistor logic device, hardware component, or any combination thereof. Which may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with the disclosure. The processor may also be a combination of certain functions, including for example one or more microprocessors, a combination of digital signal processors and microprocessors, or the like.
Processor 401 may be configured to invoke an application program stored in memory 402 to implement the steps of the method for training a machine translation model in the embodiment corresponding to fig. 2; in particular implementations, one or more instructions in the computer storage medium are loaded by processor 401 and perform the following steps:
calculating the similarity between a word to be coded and each word in a preset first sequence through a self-attention layer, wherein the word to be coded is a word input at the ith moment in a preset second sequence, the second sequence is a preset word sequence which needs to be input through k moments, the first sequence is a word sequence input before the ith moment in the words of the second sequence, i and k are positive integers, and i is smaller than k;
calculating and obtaining the self-attention of the word to be coded according to the similarity;
inputting the self-attention into a feedforward neural network to obtain an output result;
calculating a loss value between the output result and the self attention;
and adjusting the network parameters of the machine translation model according to the loss value.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
For the specific implementation of the computer-readable storage medium in the embodiments of the present application, reference may be made to the foregoing embodiments of the training method for a machine translation model, and details are not described herein again.
It should also be noted that, while for simplicity of explanation the foregoing method embodiments are described as a series of actions or combinations of actions, those skilled in the art will appreciate that the present application is not limited by the order of the actions described, as some steps may, in accordance with the present application, be performed in other orders or concurrently. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments, and that the actions and modules involved are not necessarily required by this application.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.

Claims (10)

1. A method of training a machine translation model, the machine translation model comprising an encoder, the encoder comprising a self attention layer and a feed forward neural network, the method comprising:
calculating the similarity between a word to be coded and each word in a preset first sequence through the self-attention layer, wherein the word to be coded is a word input at the ith moment in a preset second sequence, the second sequence is a preset word sequence which needs to be input at k moments, the first sequence is a word sequence input before the ith moment in the words of the second sequence, i and k are positive integers, and i is smaller than k;
obtaining the self-attention of the word to be coded according to the similarity;
inputting the self attention into the feedforward neural network to obtain an output result;
calculating a loss value between the output result and the self-attention;
and adjusting the network parameters of the machine translation model according to the loss value.
2. The method according to claim 1, wherein the calculating, by the self-attention layer, the similarity between the word to be encoded and each word in the preset first sequence comprises:
obtaining a < Key, Value > data pair for each word in the first sequence;
and calculating the similarity between the Query of the word to be coded and each Key, wherein the similarity is a weight coefficient of Value corresponding to each Key.
3. The method according to claim 2, wherein said calculating the self-attention of the word to be encoded according to the similarity comprises:
acquiring a random function value;
and if the random function Value is larger than or equal to a first threshold Value, carrying out weighted summation on the similarity and the Value of the word represented by the similarity to obtain the self-attention of the word to be coded.
4. The method of claim 3, further comprising:
and if the random function value is smaller than the first threshold value, taking the self-attention of the word at the i-1 th moment in the first sequence as the self-attention of the word to be coded.
5. The method of claim 1, wherein the feedforward neural network comprises an input layer, a hidden layer, and an output layer, and wherein inputting the self-attention into the feedforward neural network results in an output comprising:
inputting the self-attention to the input layer to obtain a first output;
inputting the first output to the hidden layer to obtain a second output;
and inputting the second output to the output layer to obtain an output result.
6. The method of claim 1, wherein said calculating a loss value between said output result and said self-attention comprises:
obtaining a closed form expression of the self attention through a recursion of a likelihood function, wherein the closed form expression is as follows:
α_{i,j} = p_{i,j} · Σ_{k=1}^{j} ( α_{i-1,k} · Π_{l=k}^{j-1} (1 - p_{i,l}) ), wherein p_{i,j} denotes the similarity and α_{i,j} denotes the self-attention of the word to be coded;
calculating a loss value between the closed form expression of self-attention and the output result by a loss function.
7. The method of claim 6, wherein said adjusting network parameters of the machine translation model based on the loss values comprises:
calculating partial derivatives of each network parameter in the machine translation model by the loss values based on a back propagation derivation algorithm;
calculating the gradient value of the loss value to the network parameter according to a derivative chain rule;
and updating the network parameters according to the gradient values so that the loss values converge to a global optimum.
8. An apparatus for training a machine translation model, comprising:
the self-attention unit is used for calculating the similarity between a word to be coded and each word in a preset first sequence through the self-attention layer, wherein the word to be coded is a word input at the ith moment in a preset second sequence, the second sequence is a preset word sequence which needs to be input through k moments, the first sequence is a word sequence input before the ith moment in the words of the second sequence, i and k are positive integers, and i is smaller than k;
the self-attention unit is further used for obtaining the self-attention of the word to be coded according to the similarity;
the feedforward neural network unit is used for inputting the self attention into the feedforward neural network to obtain an output result;
a calculation unit for calculating a loss value between the output result and the self-attention;
and the adjusting unit is used for adjusting the network parameters of the machine translation model according to the loss value.
9. An electronic device comprising a processor, a memory, a communication interface, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the processor, the programs comprising instructions for performing the steps of the method of any of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to carry out the method according to any one of claims 1 to 7.
CN202110255893.4A 2021-03-09 2021-03-09 Training method of machine translation model and related device Active CN112990434B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110255893.4A CN112990434B (en) 2021-03-09 2021-03-09 Training method of machine translation model and related device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110255893.4A CN112990434B (en) 2021-03-09 2021-03-09 Training method of machine translation model and related device

Publications (2)

Publication Number Publication Date
CN112990434A true CN112990434A (en) 2021-06-18
CN112990434B CN112990434B (en) 2023-06-20

Family

ID=76336293

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110255893.4A Active CN112990434B (en) 2021-03-09 2021-03-09 Training method of machine translation model and related device

Country Status (1)

Country Link
CN (1) CN112990434B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113569582A (en) * 2021-07-02 2021-10-29 中译语通科技股份有限公司 Method for improving zero sample translation capability of multi-language neural machine translation model
CN113569582B (en) * 2021-07-02 2024-07-05 中译语通科技股份有限公司 Method for improving zero-sample translation capacity of multilingual neural machine translation model

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109271646A (en) * 2018-09-04 2019-01-25 腾讯科技(深圳)有限公司 Text interpretation method, device, readable storage medium storing program for executing and computer equipment
CN110209801A (en) * 2019-05-15 2019-09-06 华南理工大学 A kind of text snippet automatic generation method based on from attention network
CN110543640A (en) * 2019-08-09 2019-12-06 沈阳雅译网络技术有限公司 attention mechanism-based neural machine translation inference acceleration method
CN111401081A (en) * 2018-12-14 2020-07-10 波音公司 Neural network machine translation method, model and model forming method
CN111931518A (en) * 2020-10-15 2020-11-13 北京金山数字娱乐科技有限公司 Translation model training method and device
CN112380863A (en) * 2020-10-29 2021-02-19 国网天津市电力公司 Sequence labeling method based on multi-head self-attention mechanism

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109271646A (en) * 2018-09-04 2019-01-25 腾讯科技(深圳)有限公司 Text interpretation method, device, readable storage medium storing program for executing and computer equipment
CN111382584A (en) * 2018-09-04 2020-07-07 腾讯科技(深圳)有限公司 Text translation method and device, readable storage medium and computer equipment
CN111401081A (en) * 2018-12-14 2020-07-10 波音公司 Neural network machine translation method, model and model forming method
CN110209801A (en) * 2019-05-15 2019-09-06 华南理工大学 A kind of text snippet automatic generation method based on from attention network
CN110543640A (en) * 2019-08-09 2019-12-06 沈阳雅译网络技术有限公司 attention mechanism-based neural machine translation inference acceleration method
CN111931518A (en) * 2020-10-15 2020-11-13 北京金山数字娱乐科技有限公司 Translation model training method and device
CN112380863A (en) * 2020-10-29 2021-02-19 国网天津市电力公司 Sequence labeling method based on multi-head self-attention mechanism

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113569582A (en) * 2021-07-02 2021-10-29 中译语通科技股份有限公司 Method for improving zero sample translation capability of multi-language neural machine translation model
CN113569582B (en) * 2021-07-02 2024-07-05 中译语通科技股份有限公司 Method for improving zero-sample translation capacity of multilingual neural machine translation model

Also Published As

Publication number Publication date
CN112990434B (en) 2023-06-20

Similar Documents

Publication Publication Date Title
JP7285895B2 (en) Multitask learning as question answering
CN110413785B (en) Text automatic classification method based on BERT and feature fusion
Gehring et al. Convolutional sequence to sequence learning
CN111382582B (en) Neural machine translation decoding acceleration method based on non-autoregressive
CN109858044B (en) Language processing method and device, and training method and device of language processing system
Dai et al. Incremental learning using a grow-and-prune paradigm with efficient neural networks
WO2020088330A1 (en) Latent space and text-based generative adversarial networks (latext-gans) for text generation
US11715008B2 (en) Neural network training utilizing loss functions reflecting neighbor token dependencies
CN111354333B (en) Self-attention-based Chinese prosody level prediction method and system
JP7070653B2 (en) Learning devices, speech recognition ranking estimators, their methods, and programs
US11353833B2 (en) Systems and methods for learning and predicting time-series data using deep multiplicative networks
WO2020204904A1 (en) Learning compressible features
US20210232753A1 (en) Ml using n-gram induced input representation
CN116737938A (en) Fine granularity emotion detection method and device based on fine tuning large model online data network
CN112395888A (en) Machine translation apparatus and method
CN113282707A (en) Data prediction method and device based on Transformer model, server and storage medium
CN113157919A (en) Sentence text aspect level emotion classification method and system
CN115809464A (en) Knowledge distillation-based light-weight source code vulnerability detection method
EP4035273B1 (en) Design and training of binary neurons and binary neural networks with error correcting codes
US20230042327A1 (en) Self-supervised learning with model augmentation
CN113159072B (en) Online ultralimit learning machine target identification method and system based on consistency regularization
Chen et al. Research on neural machine translation model
CN113297374A (en) Text classification method based on BERT and word feature fusion
CN112990434B (en) Training method of machine translation model and related device
CN114925197B (en) Deep learning text classification model training method based on topic attention

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant