CN113449529A - Translation model training method and device, and translation method and device


Info

Publication number
CN113449529A
Authority
CN
China
Prior art keywords
vector
decoding
coding
language
language model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010215046.0A
Other languages
Chinese (zh)
Inventor
李长亮
郭馨泽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Kingsoft Interactive Entertainment Technology Co ltd
Beijing Kingsoft Software Co Ltd
Kingsoft Corp Ltd
Beijing Kingsoft Digital Entertainment Co Ltd
Original Assignee
Chengdu Kingsoft Interactive Entertainment Technology Co ltd
Beijing Kingsoft Software Co Ltd
Beijing Kingsoft Digital Entertainment Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Kingsoft Interactive Entertainment Technology Co ltd, Beijing Kingsoft Software Co Ltd, Beijing Kingsoft Digital Entertainment Co Ltd filed Critical Chengdu Kingsoft Interactive Entertainment Technology Co ltd
Priority to CN202010215046.0A
Publication of CN113449529A
Legal status: Pending

Abstract

The application provides a translation model training method and apparatus, and a translation method and apparatus. The translation model comprises an encoder and a language model, wherein the language model is obtained by pre-training with a monolingual corpus of the target language. The training method comprises: inputting a source language sample sentence into the encoder to obtain a first encoding vector corresponding to the source language sample sentence; inputting the first encoding vector corresponding to the source language sample sentence and a target language sample sentence into the language model to obtain a first decoding vector output by the language model and an error corresponding to the first decoding vector; and adjusting parameters of the language model and the encoder based on the error corresponding to the first decoding vector until a training stop condition is reached. This effectively solves the problems that the translation model is insufficiently trained and the quality of the obtained translation result is low when bilingual corpus resources are scarce, so that the translation model performs better on low-resource translation tasks and the quality of the translation result is improved.

Description

Translation model training method and device, and translation method and device
Technical Field
The present application relates to the field of internet technologies, and in particular, to a method and an apparatus for training a translation model, a method and an apparatus for translation, a computing device, and a computer-readable storage medium.
Background
With the improvement of computer computing power, neural networks are applied more and more widely; for example, end-to-end translation models are built to convert a source language into a target language. Generally, a translation model adopts an encoder-decoder architecture: the encoder encodes a source sentence to be translated to generate a vector, and the decoder decodes the vector of the source sentence to generate the corresponding target sentence.
At present, a typical neural machine translation task relies only on the encoder and decoder of the end-to-end translation model itself, for example the Transformer model. Such a translation model requires a large-scale bilingual corpus for training; when training corpora are scarce, the translation model is difficult to train effectively, and the quality of the resulting translation is low.
Disclosure of Invention
In view of this, embodiments of the present application provide a method and an apparatus for training a translation model, a computing device, and a computer-readable storage medium, so as to solve technical defects in the prior art.
The embodiment of the application provides a training method of a translation model, wherein the translation model comprises an encoder and a language model, and the language model is obtained by pre-training with a monolingual corpus of a target language;
the training method comprises the following steps:
inputting a source language sample sentence into an encoder to obtain a first encoding vector corresponding to the source language sample sentence;
inputting a first coding vector corresponding to the source language sample sentence and a target language sample sentence into a language model to obtain a first decoding vector based on a target language output by the language model and an error corresponding to the first decoding vector output by the language model;
adjusting parameters of the language model and the encoder based on an error of a first decoded vector output by the language model until a training stop condition is reached.
Optionally, the encoder includes m sequentially connected encoding layers, where m is a positive integer;
inputting a source language sample sentence into an encoder, and obtaining a first encoding vector corresponding to the source language sample sentence, wherein the method comprises the following steps:
s102, inputting the source language sample sentence into a first coding layer to generate a first coding vector of the first coding layer;
s104, inputting the first coding vector of the (j-1) th coding layer to the (j) th coding layer to obtain the first coding vector output by the (j) th coding layer, wherein j is more than or equal to 2 and is less than or equal to m;
s106, judging whether j is equal to m, if so, executing a step S108, and if not, executing a step S110;
s108, obtaining a first coding vector corresponding to the source language sample sentence based on the first coding vectors of the m coding layers;
s110, increasing j by 1 and continuing to execute the step S104.
Optionally, obtaining a first code vector corresponding to the source language sample sentence based on the first code vectors of the m code layers includes:
taking the first coding vector of the mth coding layer as the first coding vector corresponding to the source language sample sentence; or
And carrying out weighted summation on the first coding vectors of the m coding layers to obtain a first coding vector corresponding to the source language sample sentence.
Optionally, the language model includes n decoding layers connected in sequence, where n is a positive integer;
inputting a first coding vector corresponding to the source language sample sentence and the target language sample sentence into a language model to obtain a first decoding vector based on a target language output by the language model, wherein the first decoding vector comprises:
s202, generating a corresponding first reference vector according to the input target language sample statement;
s204, inputting the first reference vector and a first coding vector corresponding to the source language sample sentence into a first decoding layer to obtain a first decoding vector of the first decoding layer;
s206, inputting the first decoding vector of the (i-1) th decoding layer and the first coding vector corresponding to the source language sample sentence into the ith decoding layer to obtain the first decoding vector of the ith decoding layer, wherein i is more than or equal to 2 and less than or equal to n;
s208, judging whether i is equal to n, if so, executing a step S210, and if not, executing a step S212;
s210, obtaining a first decoding vector based on a target language output by the language model based on the first decoding vectors of the n decoding layers;
s212, increasing i by 1, and executing step S206.
Optionally, obtaining a first decoding vector with a target language output by the language model based on the first decoding vectors of the n decoding layers includes:
taking the first decoding vector of the n-th decoding layer as a first decoding vector with a target language output by the language model; or
And carrying out weighted summation on the first decoding vectors of the n decoding layers to obtain the first decoding vector with the target language output by the language model.
Optionally, the obtaining an error corresponding to the first decoding vector output by the language model includes: comparing the first decoding vector output by the language model with a preset vector verification set to obtain the error of the first decoding vector output by the language model;
the training stop condition includes: the rate of change of the error of the first decoded vector output by the language model is less than a stability threshold.
The embodiment of the application provides a translation method, which is applied to a translation model obtained by the method, and the translation method comprises the following steps:
inputting a statement to be translated into an encoder to obtain a second coding vector corresponding to the statement to be translated;
inputting a second coding vector corresponding to the statement to be translated into a language model to obtain a second decoding vector which is output by the language model and is based on a target language;
and obtaining a word unit corresponding to each second decoding vector based on the second decoding vectors output by the language model, and obtaining a translation statement according to the word unit.
Optionally, the encoder includes m sequentially connected encoding layers, where m is a positive integer;
inputting the statement to be translated into an encoder to obtain a second coding vector corresponding to the statement to be translated, wherein the method comprises the following steps:
s302, embedding the statement to be translated to obtain a corresponding statement vector, inputting the statement vector of the statement to be translated to a first coding layer, and generating a second coding vector of the first coding layer;
s304, inputting the second coding vector of the j-1 th coding layer to the j-th coding layer to obtain the second coding vector output by the j-th coding layer, wherein j is more than or equal to 2 and is less than or equal to m;
s306, judging whether j is equal to m, if so, executing the step S308, otherwise, executing the step S310;
s308, obtaining a second coding vector corresponding to the statement to be translated based on the second coding vectors of the m coding layers;
s310, increasing j by 1 and continuing to execute the step S304.
Optionally, obtaining a second coding vector corresponding to the sentence to be translated based on the second coding vectors of the m coding layers includes:
taking the second coding vector of the mth coding layer as the second coding vector corresponding to the statement to be translated; or carrying out weighted summation on the second coding vectors of the m coding layers to obtain the second coding vector corresponding to the statement to be translated.
Optionally, the language model includes n decoding layers connected in sequence, where n is a positive integer;
inputting the second coding vector corresponding to the statement to be translated into a language model to obtain a second decoding vector output by the language model and based on a target language, wherein the second decoding vector comprises:
s402, inputting a second reference vector and a second coding vector corresponding to the statement to be translated into a first decoding layer to obtain a second decoding vector of the first decoding layer;
s404, inputting a second decoding vector of an i-1 th decoding layer and a second coding vector corresponding to the statement to be translated into the ith decoding layer to obtain a second decoding vector of the ith decoding layer, wherein i is more than or equal to 2 and less than or equal to n;
s406, judging whether i is equal to n, if so, executing a step S408, and if not, executing a step S410;
s408, obtaining a second decoding vector based on the target language output by the language model based on the second decoding vectors of the n decoding layers;
s410, increasing i by 1, and executing the step S404.
Optionally, obtaining a second decoding vector based on the target language output by the language model based on the second decoding vectors of the n decoding layers includes:
using the second decoding vector of the n-th decoding layer as a second decoding vector based on the target language output by the language model; or
And carrying out weighted summation on the second decoding vectors of the n decoding layers to obtain a second decoding vector which is output by the language model and is based on the target language.
The embodiment of the application provides a training device of a translation model, wherein the translation model comprises an encoder and a language model, and the language model is obtained by pre-training with a monolingual corpus of a target language;
the training apparatus includes:
the encoding method comprises the steps that a first encoding module is configured to input a source language sample sentence into an encoder, and a first encoding vector corresponding to the source language sample sentence is obtained;
a first decoding module, configured to input a first coding vector corresponding to the source language sample sentence and a target language sample sentence into a language model, and obtain a first decoding vector output by the language model and based on a target language and an error corresponding to the first decoding vector output by the language model;
a parameter tuning module configured to adjust parameters of the language model and the encoder based on an error of a first decoding vector output by the language model until a training stop condition is reached.
An embodiment of the present application provides a translation apparatus, including:
the second coding module is configured to input the statement to be translated into the coder to obtain a second coding vector corresponding to the statement to be translated;
the second decoding module is configured to input a second coding vector corresponding to the statement to be translated into a language model, and obtain a second decoding vector which is output by the language model and is based on a target language;
and the translation statement generation module is configured to obtain a word unit corresponding to each second decoding vector based on the second decoding vectors output by the language model, and obtain a translation statement according to the word unit.
Embodiments of the present application provide a computing device, which includes a memory, a processor, and computer instructions stored on the memory and executable on the processor, wherein the processor, when executing the instructions, implements the steps of the translation model training method or the translation method described above.
Embodiments of the present application provide a computer-readable storage medium storing computer instructions which, when executed by a processor, implement the steps of the translation model training method or the translation method described above.
According to the method and device for training the translation model, a language model pre-trained with a monolingual corpus of the target language is obtained, and the whole translation model is then trained with bilingual sentence pairs formed by a small number of source language sample sentences and target language sample sentences, so that the language model acquires the capability of continuous decoding. This effectively solves the problems that the translation model is insufficiently trained and the quality of the obtained translation result is low when bilingual corpus resources are scarce, and enables the translation model to perform better on low-resource translation tasks.
According to the translation method and the translation device, the trained end-to-end translation model is utilized, the sentence to be translated is input to the encoder to obtain the corresponding second coding vector, the second coding vector is input to the language model to be continuously decoded to obtain the second decoding vector of the target language, and finally the translated sentence is obtained based on the second decoding vector, so that the accuracy of the translation result can be effectively improved.
Drawings
FIG. 1 is an architectural diagram of a translation model according to an embodiment of the present application;
FIG. 2 is a flowchart illustrating a method for training a translation model according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of an encoder according to an embodiment of the present application;
FIG. 4 is a flowchart illustrating a process of generating a first code vector according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a language model according to an embodiment of the present application;
FIG. 6 is a flowchart illustrating a process of generating a first decoded vector according to an embodiment of the present application;
FIG. 7 is a schematic flow chart diagram of a translation method according to yet another embodiment of the present application;
FIG. 8 is a schematic flow chart of generating a second code vector according to yet another embodiment of the present application;
FIG. 9 is a schematic diagram of a process for generating a second decoded vector according to yet another embodiment of the present application;
FIG. 10 is a block diagram of an apparatus for training a translation model according to an embodiment of the present application;
FIG. 11 is a block diagram of a translation device according to an embodiment of the present application;
fig. 12 is a schematic structural diagram of a computing device according to another embodiment of the present application.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. However, the application can be implemented in many other ways than those described herein, and those skilled in the art can make similar extensions without departing from the spirit of the application; the application is therefore not limited to the specific implementations disclosed below.
The terminology used in the description of the one or more embodiments is for the purpose of describing the particular embodiments only and is not intended to be limiting of the description of the one or more embodiments. As used in one or more embodiments of the present specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present specification refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It should be understood that, although the terms first, second, etc. may be used in one or more embodiments herein to describe various information, this information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, without departing from the scope of one or more embodiments of the present specification, a first may also be referred to as a second and, similarly, a second may also be referred to as a first. Depending on the context, the word "if" as used herein may be interpreted as "when", "upon" or "in response to determining".
First, the terms involved in one or more embodiments of the present application are explained.
End-to-end (end to end): a traditional machine learning pipeline usually consists of multiple independent modules. For example, a typical natural language processing (NLP) problem involves several independent steps such as word segmentation, part-of-speech tagging, syntactic analysis and semantic analysis; each step is an independent task whose result quality affects the next step and therefore the result of the whole training. This is a non-end-to-end model. An end-to-end model, by contrast, is not split into multiple modules: the result is obtained directly from the raw input, and the neural network between the input end and the output end is treated as a whole.
Transformer model: the Encoder is essentially a structure of an Encoder (Encoder) -Decoder (Decoder), wherein the Encoder is formed by sequentially connecting 6 encoding layers, and the Decoder is formed by sequentially connecting 6 decoding layers. As with all generative models, the encoder receives the original input text and outputs the encoded vectors to the decoder, which generates the decoded vectors and results in the final output text.
Encoder (encoder): converts the sentence to be translated from word units into an encoding vector.
Decoder (decoder): generates a decoding vector from the encoding vector and converts the decoding vector into a translation sentence.
GPT (generative pre-training) model: a very large language model, equivalent to the decoder of the Transformer model.
one-hot encoding: encodes N states with an N-bit status register; each state has its own independent register bit, and only one bit is active at any time. For example, sentence 1 -> [0,1,1,0,0,0,1,0,0], sentence 2 -> [1,0,0,1,0,0,0,1,0].
Sample sentence: a sentence input to the translation model during the training phase.
Source language: the language of the text input to the translation model.
Target language: the language of the text output by the translation model.
Monolingual corpus: a corpus containing only one language.
Encoding vector: the vector output by the encoder of the translation model; in this application, the first encoding vector in the training phase and the second encoding vector in the use phase.
Decoding vector: the vector output by the decoder of the translation model; in this application, the first decoding vector in the training phase and the second decoding vector in the use phase.
Model parameters: configuration variables inside the model whose values define a usable model. Parameters are the key of a machine learning algorithm; their values are estimated from training sample data during the training phase, and the model makes predictions according to these parameters during the use phase.
In the present application, a training method and apparatus for a translation model, a translation method and apparatus, a computing device, and a computer-readable storage medium are provided, and details are described in the following embodiments one by one.
First, the translation model of the present embodiment will be schematically described.
Unlike the conventional Transformer model, the translation model of the present embodiment is an end-to-end model which, referring to fig. 1, includes an encoder and a language model that serves as the decoder. The language model may be a model with a GPT-like structure, pre-trained in advance with a monolingual corpus, and is then added to the end-to-end architecture to form a complete translation model together with the encoder. At this point the language model does not yet have the capability of continuous decoding and needs to be trained within the translation model framework.
The embodiment discloses a method for training a translation model, which trains the translation model consisting of an encoder and a language model so as to enable the language model to have the capability of continuous decoding. The language model is obtained by adopting monolingual corpus pre-training of a target language.
Taking the GPT model as an example, its pre-training actually applies a unidirectional language model to the monolingual corpus. "Unidirectional" means that the training objective of the language model is to correctly predict a word unit W_i based on the word units that precede it: the word unit sequence before W_i is referred to as the context-before, and the word unit sequence after W_i is referred to as the context-after. The GPT model uses only the context-before of a word for prediction and discards the context-after.
Specifically, a monolingual corpus including a plurality of word units is input to the GPT model, which predicts the current i-th word unit W_i from the first i-1 word units, compares the predicted word unit W_i with the i-th word unit W_i in the monolingual corpus to calculate a loss value, and then adjusts the parameters of the GPT model until the loss value of the prediction result is less than a threshold.
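For illustration, a minimal sketch of this unidirectional pre-training objective is given below. It assumes a generic next-word-unit prediction setup; the function name, the callable language-model interface and the cross-entropy loss are illustrative assumptions rather than details taken from the application.

```python
import numpy as np

def pretrain_step(language_model, corpus_ids):
    """One pre-training pass over a monolingual target-language corpus.

    corpus_ids: list of word-unit ids, e.g. [12, 7, 90, 3, ...].
    language_model(prefix_ids) is assumed to return a probability
    distribution over the vocabulary for the next word unit.
    """
    loss = 0.0
    # Predict the i-th word unit from the first i-1 word units only
    # (the context-before); the context-after is never used.
    for i in range(1, len(corpus_ids)):
        probs = language_model(corpus_ids[:i])   # shape: (vocab_size,)
        target = corpus_ids[i]                   # ground-truth W_i
        loss += -np.log(probs[target] + 1e-9)    # cross-entropy term
    return loss / (len(corpus_ids) - 1)

# The parameters of the language model would then be adjusted (for example by
# gradient descent) until this loss falls below the chosen threshold.
```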
Referring to FIG. 2, the training method comprises the following steps 202-206:
202. and inputting the source language sample sentence into an encoder to obtain a first encoding vector corresponding to the source language sample sentence.
Specifically, the encoder includes m coding layers, and referring to fig. 3 and 4, fig. 3 shows a schematic diagram of an encoder having 6 coding layers, and fig. 4 shows a flowchart of generating the first coding vector. Specifically, the step 202 comprises the following steps 402-410:
402. and inputting the source language sample sentence into a first coding layer to generate a first coding vector of the first coding layer.
Specifically, source language sample sentences are subjected to embedding layer processing to generate corresponding sentence vectors, and then the sentence vectors are input into a first coding layer to obtain first coding vectors of the first coding layer.
Performing embedding layer processing on a source language sample sentence, more specifically, segmenting the source language sample sentence to obtain a plurality of word units, then performing word embedding processing on each word unit, and finally obtaining a word vector of each word unit.
Word embedding is actually a type of technique that represents individual word units as real-valued vectors in a predetermined vector space. Each word unit is mapped to a vector (initial randomization).
When using an embedding layer, the source language sample sentence is usually preprocessed first, converting each word unit into a one-hot code. The word vector corresponding to each word unit is part of the model itself: it has a predefined dimension and is randomly initialized. Here, the embedding layer is in effect the input layer of the translation model.
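As a rough illustration of this input layer, the following sketch converts word units to one-hot codes and then to randomly initialized word vectors of a predefined dimension; the toy vocabulary and the dimension of 512 are assumptions for illustration only.

```python
import numpy as np

vocab = {"<pad>": 0, "我": 1, "爱": 2, "中": 3, "国": 4}   # toy vocabulary (assumed)
embed_dim = 512                                            # predefined dimension

# The embedding table is part of the model: randomly initialized here and
# updated together with the other model parameters during training.
embedding_table = np.random.randn(len(vocab), embed_dim) * 0.01

def embed(word_units):
    """Convert word units to one-hot codes, then to word vectors."""
    ids = [vocab[w] for w in word_units]
    one_hot = np.eye(len(vocab))[ids]      # shape: (seq_len, vocab_size)
    return one_hot @ embedding_table       # shape: (seq_len, embed_dim)

sentence_vector = embed(["我", "爱", "中", "国"])
print(sentence_vector.shape)               # (4, 512)
```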
404. And inputting the first coding vector of the j-1 th coding layer into the j-th coding layer to obtain the first coding vector output by the j-th coding layer, wherein j is more than or equal to 2 and is less than or equal to m.
Specifically, each encoding layer includes one multi-head self-attention layer and one fully connected feed-forward network (FFN).
For the first encoding vector input to the multi-head self-attention layer, each word unit corresponds to three different vectors: the word vectors Q (query), K (key) and V (value). The multi-head attention layer projects the word vectors Q, K and V through h different linear transformations, computes attention for each projection, and finally concatenates the different attention results.
In the attention calculation of the encoder, the word vectors Q (query), K (key) and V (value) are all equal: they are the first encoding vector output by the previous encoding layer. For the first encoding layer, the word vectors Q, K and V are obtained by multiplying the vector output by the embedding layer (word embedding) by weight matrices.
Specifically, the multi-head attention layer is calculated as follows:
head_i = Attention(Q·W_i^Q, K·W_i^K, V·W_i^V)    (1)
Multihead(Q, K, V) = Concat(head_1, …, head_h)·W^O    (2)
where Q, K and V are the word vectors corresponding to the input first encoding vector;
head_i is the self-attention result of each head of the multi-head attention layer;
Multihead is the output result of the multi-head attention layer;
Concat is the concatenation function;
W_i^Q, W_i^K and W_i^V are the weight matrices used for the linear transformation of the word vectors Q, K and V. For example, each word unit corresponds to three different 64-dimensional word vectors Q, K and V, obtained by multiplying the embedding vector by the three different weight matrices W_i^Q, W_i^K and W_i^V, each of dimension 512 × 64;
W^O is the weight matrix required for the linear transformation of the concatenated result, of dimension 512 × 512.
The output of the multi-head attention layer is then input to the fully connected feed-forward network (FFN), which is calculated as follows:
FFN(x) = max(0, x·H1 + b1)·H2 + b2    (3)
where H1 and H2 are parameter matrices obtained by training;
b1 and b2 are parameters obtained by training;
x is the output result of the multi-head attention layer;
FFN(x) is the output result of the feed-forward network.
The first encoding vector output by the j-th encoding layer is obtained from the output result of the feed-forward network.
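A compact sketch of one encoding layer following formulas (1) to (3) is shown below. The head count h = 8 and per-head width of 64 match the dimensions given above; the inner feed-forward width of 2048 and the omission of residual connections and layer normalization are simplifications assumed for readability.

```python
import numpy as np

d_model, h, d_k, d_ff = 512, 8, 64, 2048   # model width, heads, per-head width, FFN width (assumed)

rng = np.random.default_rng(0)
W_Q = [rng.standard_normal((d_model, d_k)) for _ in range(h)]
W_K = [rng.standard_normal((d_model, d_k)) for _ in range(h)]
W_V = [rng.standard_normal((d_model, d_k)) for _ in range(h)]
W_O = rng.standard_normal((h * d_k, d_model))
H1, b1 = rng.standard_normal((d_model, d_ff)), np.zeros(d_ff)
H2, b2 = rng.standard_normal((d_ff, d_model)), np.zeros(d_model)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def encoding_layer(x):                        # x: (seq_len, d_model)
    # Formula (1): one attention result per head; in the encoder Q = K = V = x.
    heads = [attention(x @ W_Q[i], x @ W_K[i], x @ W_V[i]) for i in range(h)]
    # Formula (2): concatenate the heads and apply the output weight matrix W^O.
    multihead = np.concatenate(heads, axis=-1) @ W_O
    # Formula (3): fully connected feed-forward network.
    return np.maximum(0, multihead @ H1 + b1) @ H2 + b2

first_encoding_vector = encoding_layer(np.random.randn(4, d_model))
print(first_encoding_vector.shape)            # (4, 512)
```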
406. It is determined whether j is equal to m, if so, step 408 is performed, and if not, step 410 is performed.
408. And obtaining a first coding vector corresponding to the source language sample sentence based on the first coding vectors of the m coding layers.
The first coding vector is generated in various ways, for example, the first coding vector of the mth coding layer is used as the first coding vector corresponding to the source language sample sentence; or
And carrying out weighted summation on the first coding vectors of the m coding layers to obtain a first coding vector corresponding to the source language sample sentence.
In addition, weighted summation can be performed according to the first coding vectors of several coding layers of the m layers to obtain the first coding vector corresponding to the source language sample sentence, and the number of the coding layers can be selected according to actual requirements.
It should be noted that the weighting factor for the first encoding vector of each encoding layer can also be selected according to actual requirements. Referring to fig. 3, the encoder is formed by six sequentially connected encoding layers, arranged from low to high in the order in which they process the input vector. The first encoding vectors of lower encoding layers contain more semantic information, while those of higher encoding layers contain more syntactic information; by choosing the weights of the first encoding vectors of different encoding layers, the proportion of semantic and syntactic information in the finally generated first encoding vector corresponding to the source language sample sentence can be controlled.
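The two ways of obtaining the first encoding vector corresponding to the source language sample sentence can be sketched as follows; the example weights are purely illustrative.

```python
import numpy as np

def combine_layer_outputs(layer_vectors, weights=None):
    """layer_vectors: list of m first encoding vectors, one per encoding layer."""
    if weights is None:
        return layer_vectors[-1]                 # use the m-th (top) layer only
    # Weighted summation: heavier weights on lower layers keep more semantic
    # information, heavier weights on higher layers keep more syntactic
    # information (per the description above).
    return sum(w * v for w, v in zip(weights, layer_vectors))

m = 6
outputs = [np.random.randn(4, 512) for _ in range(m)]
combined = combine_layer_outputs(outputs, weights=[0.05, 0.05, 0.1, 0.1, 0.2, 0.5])
```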
410. Incrementing j by 1, and continuing to step 404.
The steps 402-410 are encoding processes by an encoder including a plurality of encoding layers.
For the case that the encoder only includes one coding layer, the first coding vector output by the coding layer is directly used as the first coding vector corresponding to the source language sample sentence.
204. And inputting a first coding vector corresponding to the source language sample sentence and a target language sample sentence into a language model to obtain a first decoding vector based on a target language output by the language model and an error corresponding to the first decoding vector output by the language model.
The language model is a model similar to a GPT structure and comprises n decoding layers which are connected in sequence, wherein n is a positive integer. Referring to fig. 5 and 6, fig. 5 shows a language model containing 6 decoding layers. Fig. 6 shows a flow diagram for decoding using a language model.
Referring to FIG. 6, step 204 includes the following steps 602-612:
602. and generating a corresponding first reference vector according to the input target language sample statement.
Referring to fig. 5, in the training phase, the first reference vector corresponding to the target language sample sentence is used as the input vector of the first decoding layer.
Specifically, the target language sample statement is processed by the corresponding embedding layer of the decoder to generate a first reference vector.
The processing procedure of the embedded layer corresponding to the decoder is the same as the processing procedure of the embedded layer corresponding to the encoder described in the foregoing step 402, and the description is not repeated here.
604. And inputting the first reference vector and a first coding vector corresponding to the source language sample sentence into a first decoding layer to obtain a first decoding vector of the first decoding layer.
606. And inputting the first decoding vector of the (i-1) th decoding layer and the first coding vector corresponding to the source language sample sentence into the ith decoding layer to obtain the first decoding vector of the ith decoding layer, wherein i is more than or equal to 2 and less than or equal to n.
Each decoding layer includes three sub-layers: the first is a masked multi-head self-attention layer, the second is a multi-head attention layer, and the third is a feed-forward network. The multi-head attention layer and the feed-forward layer have been described in detail in the process of generating the first encoding vector in step 404 above and are not described again here.
It should be noted that, in the attention calculation of the decoding layers, the word vectors Q, K and V have equal dimensions. For the first decoding layer, the word vector Q is obtained by multiplying the sentence vector output by the embedding layer (word embedding) corresponding to the decoder by a weight matrix, while the word vectors K and V come from the first encoding vector, corresponding to the source language sample sentence, output by the encoder; for the decoding layers other than the first, the word vector Q comes from the first decoding vector output by the previous decoding layer, and the word vectors K and V come from the first encoding vector, corresponding to the source language sample sentence, output by the encoder.
Comparing the structure of the decoding layer with that of the encoding layer, the decoding layer has one additional masked multi-head self-attention layer. Its operation is basically the same as that of the ordinary multi-head attention layer, except that a masking operation is added. The role of the mask is to prevent future output words from being used during training; for example, when training, the first word cannot refer to the result of the second word. The mask sets this information to 0, ensuring that the output at each position i depends only on the positions before i (position i itself is excluded because the input is shifted right by one position and masked).
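A small sketch of the masking operation is given below. It assumes one common realization in which positions after i are given a very large negative score before normalization, so that their attention weights become 0.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def masked_attention_weights(scores):
    """scores: (seq_len, seq_len) raw attention scores (e.g. Q·K^T / sqrt(d_k)).

    Future positions j > i are masked so that the output at position i depends
    only on positions before i; their weights become 0 after normalization.
    """
    seq_len = scores.shape[0]
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    return softmax(np.where(future, -1e9, scores))

weights = masked_attention_weights(np.random.randn(4, 4))
print(np.round(weights, 2))    # row i has zero weight on every column j > i
```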
For example, suppose the source language sample sentence includes 4 word units D_1, D_2, D_3 and D_4, and the corresponding target language sample sentence includes 4 word units E_1, E_2, E_3 and E_4, whose corresponding first reference vectors are e_0, e_1, e_2 and e_3.
When generating word unit E_1, the first reference vector is the initial first reference vector e_0;
when generating word unit E_2, the first reference vector is the vector e_1 corresponding to word unit E_1;
when generating word unit E_3, the first reference vector is the vector e_2 corresponding to word unit E_2;
when generating word unit E_4, the first reference vector is the vector e_3 corresponding to word unit E_3.
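In code, this shift of the reference vectors could be sketched as follows; the helper name and the use of e_0 as a start vector are assumptions for illustration.

```python
def build_first_reference_vectors(target_vectors, start_vector):
    """target_vectors: [e_1, e_2, e_3, ...] corresponding to word units E_1, E_2, E_3, ...

    Returns the first reference vector used when generating each word unit:
    E_1 uses the initial vector e_0, E_2 uses e_1, E_3 uses e_2, and so on.
    """
    return [start_vector] + list(target_vectors[:-1])

# Example with the four word units E_1..E_4 from the description above:
e0, e1, e2, e3, e4 = "e0", "e1", "e2", "e3", "e4"             # placeholders for vectors
print(build_first_reference_vectors([e1, e2, e3, e4], e0))    # ['e0', 'e1', 'e2', 'e3']
```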
608. It is determined whether i is equal to n, if so, go to step 610, otherwise, go to step 612.
610. And obtaining a first decoding vector based on the target language output by the language model based on the first decoding vectors of the n decoding layers.
Specifically, the first decoding vector may be generated in a variety of ways, including:
taking the first decoding vector of the n-th decoding layer as a first decoding vector with a target language output by the language model; or
And carrying out weighted summation on the first decoding vectors of the n decoding layers to obtain the first decoding vector with the target language output by the language model.
In addition, the first decoding vectors of several decoding layers of the n layers can be weighted and summed to obtain the first decoding vector of the target language, and the number of the decoding layers can be selected according to actual requirements.
612. Step 606 is performed by incrementing i by 1.
The steps 602-612 are performed by a decoder comprising a plurality of decoding layers.
For the case that the decoder comprises only one decoding layer, the first decoding vector output by that decoding layer is directly used as the first decoding vector of the target language obtained by translating the source language sample sentence.
Specifically, obtaining an error corresponding to a first decoding vector output by the language model includes: comparing the first decoding vector output by the language model with a preset vector verification set to obtain the error of the first decoding vector output by the language model;
the training stop conditions include: the rate of change of the error of the first decoded vector output by the language model is less than a stability threshold.
It should be noted that, in this embodiment, the error is not calculated by directly comparing the first decoding vector output by the language model with the vector corresponding to the original target language sample sentence; instead, a vector verification set is introduced. Directly comparing the first decoding vector output by the language model with the vector corresponding to the original target language sample sentence would cause overfitting, so that the translation model would perform poorly when translating other sentences and the translation effect would be degraded.
206. Adjusting parameters of the language model and the encoder based on an error of a first decoded vector output by the language model until a training stop condition is reached.
It should be explained that, for the language model and the encoder, the parameters are configuration variables inside the model whose values define a usable model. Parameters are the key of a machine learning algorithm, and their values can be estimated from training sample data during the training phase.
The parameters for adjusting the language model and the encoder described in this embodiment include not only parameter values of the parameters of the language model and the encoder, but also the number of layers of the model, the number of parameters in each layer, and the like.
Specifically, the training stop conditions include: the rate of change of the error of the first decoded vector output by the language model is less than a stability threshold.
The stability threshold may be set according to actual requirements, for example to 1%. When the rate of change of the error falls below this threshold, the error has become stable and the model can be considered trained.
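A minimal sketch of this stop check is given below, assuming the rate of change is computed from the errors recorded over successive training iterations.

```python
def should_stop(error_history, stability_threshold=0.01):
    """error_history: errors of the first decoding vector from past iterations."""
    if len(error_history) < 2:
        return False
    prev, curr = error_history[-2], error_history[-1]
    change_rate = abs(curr - prev) / max(abs(prev), 1e-9)
    # Training stops once the error has become stable, i.e. its rate of
    # change is smaller than the stability threshold (here 1%).
    return change_rate < stability_threshold

print(should_stop([0.52, 0.47, 0.41, 0.409]))   # True once the error flattens out
```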
Through the parameter adjustment in step 206, not only the encoding accuracy of the encoder can be improved, but also the language model obtained by the monolingual corpus pre-training can have the continuous decoding capability, so that the translation model generated by combining the encoder and the language model has the continuous translation capability after the training is finished.
The method for training the translation model provided by this embodiment obtains a language model pre-trained by using a monolingual corpus of a target language, and then trains the whole translation model through a bilingual sentence composed of a small amount of source language sample sentences and target language sample sentences, so that the language model has a continuous decoding capability, thereby effectively solving the problems of insufficient training of the translation model and low quality of the obtained translation result under the condition of scarce bilingual corpus resources, and enabling the translation model to have better performance under a low-resource translation task.
In the present embodiment, the source language is Chinese and the target language is English, and the embodiment is schematically illustrated with a set of training sentences: the source language sample sentence is the Chinese sentence "我爱中国" ("I love China"), and the target language sample sentence is "I love China".
The training method comprises the following steps:
1) Train a language model with the monolingual corpus of the target language, and form an end-to-end network structure from the obtained language model and the encoder.
In this embodiment, the language model is obtained by training the monolingual corpus, so as to overcome the problem of insufficient training of the translation model caused by the scarcity of bilingual corpus resources.
Then, the obtained language model and the encoder form an end-to-end network structure to build a complete translation model, and parameters of the language model and the encoder can be synchronously adjusted in the subsequent training step.
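A high-level sketch of this end-to-end structure is given below; the object interfaces (encode, decode, parameters) are illustrative assumptions. The point is that once the pre-trained language model is plugged in as the decoder, its parameters and those of the encoder are updated together in the subsequent training steps.

```python
class TranslationModel:
    """End-to-end structure: encoder + language model pre-trained on the target language."""

    def __init__(self, encoder, language_model):
        self.encoder = encoder                 # trained jointly below
        self.language_model = language_model   # pre-trained on a monolingual corpus

    def forward(self, source_sample_sentence, target_sample_sentence):
        first_encoding_vector = self.encoder.encode(source_sample_sentence)
        return self.language_model.decode(first_encoding_vector, target_sample_sentence)

    def parameters(self):
        # Both parameter sets are adjusted synchronously during training.
        return self.encoder.parameters() + self.language_model.parameters()
```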
2) Generate a corresponding sentence vector from the first source language sample sentence "我爱中国", and input the sentence vector into the encoder to obtain the first encoding vector output by the encoder.
For example, for the first source language sample sentence "我爱中国", the sentence vector generated through the embedding layer is X = (x_0, x_1, x_2, x_3), where x_0 corresponds to "我" (I), x_1 to "爱" (love), x_2 to "中" and x_3 to "国".
In this embodiment, the encoder includes 6 sequentially connected encoding layers, each encoding layer generates a corresponding first encoding vector, and the first encoding vector output by the last encoding layer is used as the first encoding vector of the encoder.
For a specific generation process of the first coded vector, refer to the detailed description of the foregoing steps in this embodiment, and will not be repeated here.
3) Input the first encoding vector and the first target language sample sentence "I love China" into the language model to obtain a first decoding vector, based on the target language, output by the language model.
Specifically, the language model includes 6 decoding layers connected in sequence.
The target language sample sentence "I love China" is input to the embedding layer corresponding to the decoder to generate a first reference vector, and the first reference vector and the first encoding vector are input to the first decoding layer to generate the first decoding vector output by the first decoding layer.
The first decoding vector of the first decoding layer and the first encoding vector output by the encoder are input to the second decoding layer to obtain the first decoding vector output by the second decoding layer.
By analogy, the first decoding vector output by the sixth decoding layer is taken as the first decoding vector, based on the target language, output by the decoder.
For the specific generation process of the first decoding vector, refer to the detailed description of the foregoing embodiment, and will not be repeated here.
4) Compare the first decoding vector output by the decoder with a preset vector verification set to obtain an error of 0.4 for the first decoding vector, and adjust the parameters of the language model and the encoder based on this error of 0.4.
The rate of change of the errors of the first decoding vectors obtained over all previous training iterations is then calculated, and it is determined whether this rate of change is smaller than the stability threshold. When the rate of change of the error of the first decoding vector is smaller than the stability threshold, the error has become stable and training can be considered finished.
In this embodiment, the stability threshold may be set according to actual requirements, for example to 0.2; once the rate of change falls below it, the error of the first decoding vector is considered stable.
Another embodiment of the present application further discloses a translation method applied to the translation model described above. Referring to fig. 7, the translation method includes the following steps 702 to 706:
702. and inputting the statement to be translated into an encoder to obtain a second encoding vector corresponding to the statement to be translated.
Taking the example that the encoder includes m sequentially connected coding layers, referring to fig. 8, step 702 includes the following steps 802 to 810:
802. and embedding the statement to be translated to obtain a corresponding statement vector, inputting the statement vector of the statement to be translated to a first coding layer, and generating a second coding vector of the first coding layer.
And inputting the statement to be translated into the embedding layer to obtain a statement vector of the statement to be translated.
For the embedded layer corresponding to the encoder and the specific calculation process of each encoded layer, refer to the description of the foregoing embodiments, and are not described herein again.
804. And inputting the second coding vector of the j-1 th coding layer to the j-th coding layer to obtain the second coding vector output by the j-th coding layer, wherein j is more than or equal to 2 and is less than or equal to m.
806. It is determined whether j is equal to m, if so, go to step 808, otherwise, go to step 810.
808. And obtaining a second coding vector corresponding to the statement to be translated based on the second coding vectors of the m coding layers.
The second coding vector is generated in multiple ways, for example, the second coding vector of the mth coding layer is used as the second coding vector corresponding to the statement to be translated; or
And carrying out weighted summation on the second coding vectors of the m coding layers to obtain a second coding vector corresponding to the statement to be translated.
In addition, weighted summation can be performed according to the second coding vectors of several coding layers of the m layers to obtain the second coding vectors corresponding to the sentence to be translated, and the number of the coding layers can be selected according to actual requirements.
810. Incrementing j by 1, and continuing to step 804.
In the case that the encoder includes one coding layer, the second coding vector output by the coding layer is directly used as the second coding vector corresponding to the sentence to be translated.
The above steps 802 to 810 are encoding processes by an encoder including a plurality of encoding layers. Through the steps 802-810, a coding vector generated by coding the input sentence to be translated by the coder is obtained, so that preparation is made for decoding in the subsequent steps.
704. And inputting a second coding vector corresponding to the statement to be translated into a language model to obtain a second decoding vector which is output by the language model and is based on the target language.
For the case where the language model includes n decoding layers connected in series, see FIG. 9, step 704 includes the following steps 902-910:
902. and inputting the second reference vector and a second coding vector corresponding to the statement to be translated into the first decoding layer to obtain a second decoding vector of the first decoding layer.
For the embedded layer corresponding to the decoder and the specific calculation process of each decoded layer, refer to the description of the foregoing embodiments, and are not repeated herein.
Specifically, it should be noted that, for the sentence to be translated, at least one word to be translated is included. And in the translation process, obtaining the translation words corresponding to each word to be translated in sequence. The second decoded vector generated for each translated term may be input to the first decoding layer as the second reference vector for the next translated term.
Specifically, for an initial first word to be translated, the second reference vector is a set initial value, which may be 0; and for other to-be-translated terms except the first to-be-translated term, the second reference vector is a translated term corresponding to a previous to-be-translated term of the current to-be-translated term.
For example, for an input sentence to be translated, i.e., "i is a student", if the current sentence to be translated is "i", the second reference vector is a preset initial value; if the current word to be translated is 'yes', the second reference vector is the translation word 'I' corresponding to 'I'; if the current word to be translated is "one", then the second reference vector is "yes" for the corresponding translated word "am"; if the current word to be translated is "student", then the second reference vector is "one" corresponding translated word "a".
It should be noted that, after training is completed, the translation model has the ability to segment the sentence to be translated into words to be translated. After the second encoding vector of the whole sentence to be translated is obtained from the encoder, the second decoding vector corresponding to each word to be translated can be obtained through the language model, and the corresponding translated word is then obtained from the second decoding vector of each word to be translated.
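Under these assumptions, the inference loop could be sketched as follows; greedy word selection, the end marker, the maximum length and the object interfaces are illustrative assumptions rather than details taken from the application.

```python
def translate(encoder, language_model, sentence_to_translate, max_len=50):
    # Second encoding vector of the whole sentence to be translated.
    second_encoding_vector = encoder.encode(sentence_to_translate)

    second_reference_vector = 0           # set initial value for the first word
    translated_words = []
    for _ in range(max_len):
        # Second decoding vector for the current word to be translated.
        second_decoding_vector = language_model.decode(
            second_reference_vector, second_encoding_vector)
        word_unit = language_model.to_word_unit(second_decoding_vector)
        if word_unit == "<end>":
            break
        translated_words.append(word_unit)
        # The result for the word just translated becomes the reference
        # vector for the next word to be translated.
        second_reference_vector = second_decoding_vector
    return " ".join(translated_words)
```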
904. And inputting the second decoding vector of the (i-1) th decoding layer and the second coding vector corresponding to the statement to be translated into the ith decoding layer to obtain the second decoding vector of the ith decoding layer, wherein i is more than or equal to 2 and less than or equal to n.
906. It is determined whether i is equal to n, if so, go to step 908, otherwise, go to step 910.
908. And obtaining a second decoding vector based on the target language output by the language model based on the second decoding vectors of the n decoding layers.
Specifically, the second decoding vector may be generated in a variety of ways, including:
using the second decoding vector of the n-th decoding layer as a second decoding vector based on the target language output by the language model; or
And carrying out weighted summation on the second decoding vectors of the n decoding layers to obtain a second decoding vector which is output by the language model and is based on the target language.
In addition, the second decoding vectors of several decoding layers of the n layers can be subjected to weighted summation to obtain the second decoding vectors output by the language model and based on the target language, and the number of the decoding layers can be selected according to actual requirements.
910. Step 904 is performed by incrementing i by 1.
And in case that the decoder comprises one decoding layer, directly taking the second decoding vector output by the decoding layer as the second decoding vector based on the target language output by the language model.
Through the steps 902-910, the trained language model is used to continuously decode the input second coding vector to obtain a second decoding vector, so that preparation is made for generating a translation statement according to the second decoding vector in the subsequent steps, and the translation task is performed.
706. And obtaining a word unit corresponding to each second decoding vector based on the second decoding vectors output by the language model, and obtaining a translation statement according to the word unit.
Specifically, the language model further includes a normalization layer (softmax), the second decoding vectors are normalized through the normalization layer to obtain word units corresponding to each second decoding vector, and the translated sentences are obtained according to the word units.
For the normalization layer, it can compress a multi-dimensional vector containing arbitrary real numbers such that each element of the vector ranges between (0, 1).
Specifically, the formula for softmax is as follows:
S_i = e^(V_i) / Σ_k e^(V_k)    (4)
where S_i represents the softmax value corresponding to the i-th second decoding vector V_i;
i represents the i-th second decoding vector;
the sum in the denominator runs over all j second decoding vectors, j being their total number.
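A direct numerical sketch of this normalization is shown below; the projection from the second decoding vector to per-word-unit scores is omitted here and the example scores are arbitrary.

```python
import numpy as np

def softmax(scores):
    """Compress arbitrary real-valued scores into the range (0, 1), summing to 1."""
    e = np.exp(scores - scores.max())
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
word_unit_index = int(np.argmax(probs))      # index of the chosen word unit
print(np.round(probs, 3), word_unit_index)   # [0.659 0.242 0.099] 0
```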
According to the translation method provided by the embodiment, the trained end-to-end translation model is utilized, the sentence to be translated is input to the encoder to obtain the corresponding second coding vector, the second coding vector is input to the language model to be continuously decoded to obtain the second decoding vector output by the language model and based on the target language, and finally the translated sentence is obtained based on the second decoding vector output by the language model, so that the accuracy of the translation result can be effectively improved.
The embodiment discloses a training device of a translation model, wherein the translation model comprises an encoder and a language model, and the language model is obtained by pre-training with a monolingual corpus of a target language;
referring to fig. 10, the training apparatus includes:
a first encoding module 1002 configured to input a source language sample sentence to an encoder, resulting in a first encoding vector corresponding to the source language sample sentence;
a first decoding module 1004 configured to input a first encoding vector corresponding to the source language sample sentence and a target language sample sentence into a language model, and obtain a first decoding vector output by the language model and based on a target language and an error corresponding to the first decoding vector output by the language model;
a parameter tuning module 1006 configured to adjust parameters of the language model and the encoder based on an error of a first decoding vector output by the language model until a training stop condition is reached.
Optionally, the encoder includes m sequentially connected encoding layers, where m is a positive integer;
the first encoding module 1002 is specifically configured to:
a first encoding unit configured to input the source language sample sentence to a first encoding layer, generating a first encoding vector of the first encoding layer;
the second coding unit is configured to input the first coding vector of the j-1 th coding layer to the j-th coding layer to obtain the first coding vector output by the j-th coding layer, wherein j is more than or equal to 2 and less than or equal to m;
a first judging unit configured to judge whether j is equal to m, if so, execute the first coding vector generating unit, and if not, execute the first self-increasing unit;
a first code vector generation unit configured to obtain a first code vector corresponding to the source language sample sentence based on first code vectors of m code layers;
and the first self-increment unit is configured to self-increment j by 1 and continue to execute the second coding unit.
Optionally, the first encoding vector generating unit is specifically configured to:
taking the first coding vector of the m-th coding layer as the first coding vector corresponding to the source language sample sentence; or
performing a weighted summation of the first coding vectors of the m coding layers to obtain the first coding vector corresponding to the source language sample sentence (a sketch of this encoding procedure follows below).
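For readers who want a concrete picture, here is a minimal sketch, under assumed dimensions and module names (StackedEncoder, with nn.TransformerEncoderLayer as a stand-in for a coding layer), of an encoder with m sequentially connected coding layers whose per-layer outputs are combined either by taking the m-th layer or by weighted summation. It is not the implementation disclosed in this application.

```python
import torch
import torch.nn as nn

class StackedEncoder(nn.Module):
    """Hypothetical encoder with m sequentially connected coding layers."""

    def __init__(self, d_model: int = 512, num_layers: int = 6, combine: str = "weighted"):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
            for _ in range(num_layers)
        )
        self.combine = combine
        # Learnable weights for the optional weighted summation of all layer outputs.
        self.layer_weights = nn.Parameter(torch.ones(num_layers) / num_layers)

    def forward(self, sentence_vectors: torch.Tensor) -> torch.Tensor:
        outputs = []
        hidden = sentence_vectors            # input to the 1st coding layer
        for layer in self.layers:            # j = 1 .. m
            hidden = layer(hidden)           # first coding vector of the j-th layer
            outputs.append(hidden)
        if self.combine == "last":
            return outputs[-1]               # use the m-th layer's coding vector
        weights = torch.softmax(self.layer_weights, dim=0)
        stacked = torch.stack(outputs, dim=0)                  # (m, batch, seq, d_model)
        return (weights.view(-1, 1, 1, 1) * stacked).sum(0)    # weighted summation

# Usage with a dummy batch: 2 sentences, 7 token positions, d_model = 512.
encoder = StackedEncoder()
first_coding_vector = encoder(torch.randn(2, 7, 512))
```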
Optionally, the language model includes n decoding layers connected in sequence, where n is a positive integer;
the first decoding module 1004 specifically includes:
a first reference vector generation unit configured to generate a corresponding first reference vector according to the input target language sample statement;
a first decoding unit, configured to input the first reference vector and a first encoding vector corresponding to the source language sample sentence into a first decoding layer, resulting in a first decoding vector of the first decoding layer;
a second decoding unit configured to input the first decoding vector of the (i-1)-th decoding layer and the first coding vector corresponding to the source language sample sentence into the i-th decoding layer to obtain the first decoding vector of the i-th decoding layer, where 2 ≤ i ≤ n;
a second judging unit configured to judge whether i equals n; if so, the first decoding vector generation unit is executed, and if not, the second incrementing unit is executed;
a first decoding vector generation unit configured to obtain the first decoding vector in the target language output by the language model based on the first decoding vectors of the n decoding layers;
and a second incrementing unit configured to increment i by 1 and continue executing the second decoding unit.
Optionally, the first decoding vector generating unit is specifically configured to:
taking the first decoding vector of the n-th decoding layer as the first decoding vector in the target language output by the language model; or
performing a weighted summation of the first decoding vectors of the n decoding layers to obtain the first decoding vector in the target language output by the language model (a sketch of this decoding procedure follows below).
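Similarly, a hedged sketch of the decoding side is given below: each of the n decoding layers receives the previous layer's first decoding vector together with the first coding vector of the source language sample sentence, and the layer outputs are combined in the same two ways as on the encoder side. The class name LMDecoder, the use of nn.TransformerDecoderLayer as a stand-in decoding layer, and the omission of attention masks are assumptions, not the disclosed model.

```python
import torch
import torch.nn as nn

class LMDecoder(nn.Module):
    """Hypothetical language-model decoder with n sequentially connected decoding layers."""

    def __init__(self, d_model: int = 512, num_layers: int = 6, combine: str = "last"):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerDecoderLayer(d_model=d_model, nhead=8, batch_first=True)
            for _ in range(num_layers)
        )
        self.combine = combine
        self.layer_weights = nn.Parameter(torch.ones(num_layers) / num_layers)

    def forward(self, reference_vector: torch.Tensor, coding_vector: torch.Tensor) -> torch.Tensor:
        outputs = []
        hidden = reference_vector                # first reference vector from the target sample sentence
        for layer in self.layers:                # i = 1 .. n
            # Each decoding layer receives the previous decoding vector and the coding vector.
            # Causal/padding masks are omitted for brevity in this sketch.
            hidden = layer(tgt=hidden, memory=coding_vector)
            outputs.append(hidden)
        if self.combine == "last":
            return outputs[-1]                   # n-th layer's decoding vector
        weights = torch.softmax(self.layer_weights, dim=0)
        return (weights.view(-1, 1, 1, 1) * torch.stack(outputs, dim=0)).sum(0)

# Usage: reference vectors for 2 target sentences of 9 tokens, coding vectors of 7 source tokens.
decoder = LMDecoder()
first_decoding_vector = decoder(torch.randn(2, 9, 512), torch.randn(2, 7, 512))
```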
Optionally, the first decoding module 1004 is specifically configured to compare the first decoding vector output by the language model with a preset vector verification set to obtain the error of the first decoding vector output by the language model (a training-loop sketch follows below);
the training stop condition includes: the rate of change of the error of the first decoded vector output by the language model is less than a stability threshold.
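One possible reading of this parameter-tuning procedure is sketched below; the loss function, optimizer, batching helper, and call signatures are assumptions, and only the stop criterion (error change rate below a stability threshold) mirrors the text.

```python
import torch

def train(encoder, language_model, loss_fn, batches, validation_fn,
          stability_threshold: float = 1e-3, lr: float = 1e-4):
    """Adjust encoder and language-model parameters until the error stabilizes."""
    params = list(encoder.parameters()) + list(language_model.parameters())
    optimizer = torch.optim.Adam(params, lr=lr)
    previous_error = None
    while True:
        # `batches()` is an assumed helper yielding (source_batch, target_batch) pairs.
        for source_batch, target_batch in batches():
            first_coding_vector = encoder(source_batch)
            # Assumed interface: the language model continues decoding from the coding vector.
            first_decoding_vector = language_model(target_batch, first_coding_vector)
            loss = loss_fn(first_decoding_vector, target_batch)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        # e.g. an error measured against a preset vector verification set.
        error = validation_fn(encoder, language_model)
        if previous_error is not None:
            change_rate = abs(previous_error - error) / max(previous_error, 1e-12)
            if change_rate < stability_threshold:
                return  # training stop condition reached
        previous_error = error
```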
The training apparatus for the translation model provided by this embodiment obtains a language model pre-trained on monolingual corpus of the target language and then trains the whole translation model on bilingual sentence pairs composed of a small number of source language sample sentences and target language sample sentences, giving the language model the ability to continue decoding. This effectively alleviates the problems of insufficient training and low-quality translation results when bilingual corpus resources are scarce, so that the translation model performs better on low-resource translation tasks.
The present embodiment discloses a translation apparatus, and referring to fig. 11, the apparatus includes:
the second encoding module 1102 is configured to input a statement to be translated to an encoder, so as to obtain a second encoding vector corresponding to the statement to be translated;
a second decoding module 1104 configured to input a second coding vector corresponding to the sentence to be translated into the language model, to obtain a second decoding vector in the target language output by the language model;
a translated sentence generation module 1106 configured to obtain a word unit corresponding to each second decoding vector based on the second decoding vectors output by the language model, and obtain the translated sentence from the word units.
Optionally, the encoder includes m sequentially connected encoding layers, where m is a positive integer;
the second encoding module 1102 specifically includes:
a third coding unit configured to perform embedding processing on the statement to be translated to obtain a corresponding statement vector, input the statement vector of the statement to be translated to the first coding layer, and generate a second coding vector of the first coding layer (an embedding sketch follows after this module description);
a fourth coding unit configured to input the second coding vector of the (j-1)-th coding layer to the j-th coding layer to obtain the second coding vector output by the j-th coding layer, where 2 ≤ j ≤ m;
a third judging unit configured to judge whether j equals m; if so, the second coding vector generation unit is executed, and if not, the third incrementing unit is executed;
a second coding vector generation unit configured to obtain the second coding vector corresponding to the statement to be translated based on the second coding vectors of the m coding layers;
and a third incrementing unit configured to increment j by 1 and continue executing the fourth coding unit.
Optionally, the second encoding vector generating unit is specifically configured to:
taking the second coding vector of the m-th coding layer as the second coding vector corresponding to the statement to be translated; or performing a weighted summation of the second coding vectors of the m coding layers to obtain the second coding vector corresponding to the statement to be translated.
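For the embedding processing mentioned above, a minimal sketch of turning the token ids of the sentence to be translated into a statement vector for the first coding layer might look like this; the vocabulary size, model dimension, and sinusoidal positional encoding are assumptions, not requirements of this application.

```python
import math
import torch
import torch.nn as nn

class SentenceEmbedding(nn.Module):
    """Hypothetical embedding processing: token ids -> statement vector for the first coding layer."""

    def __init__(self, vocab_size: int = 32000, d_model: int = 512, max_len: int = 512):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        # Fixed sinusoidal positional encoding so the coding layers see word order.
        position = torch.arange(max_len).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div)
        pe[:, 1::2] = torch.cos(position * div)
        self.register_buffer("pe", pe)
        self.d_model = d_model

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        x = self.token_embedding(token_ids) * math.sqrt(self.d_model)
        return x + self.pe[: token_ids.size(1)]

# Usage: a batch of 2 sentences, each 7 token ids long.
embed = SentenceEmbedding()
statement_vector = embed(torch.randint(0, 32000, (2, 7)))
```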
Optionally, the language model includes n decoding layers connected in sequence, where n is a positive integer;
the second decoding module 1104 specifically includes:
the third decoding unit is configured to input the second reference vector and a second coding vector corresponding to the statement to be translated into the first decoding layer to obtain a second decoding vector of the first decoding layer;
a fourth decoding unit configured to input the second decoding vector of the (i-1)-th decoding layer and the second coding vector corresponding to the statement to be translated into the i-th decoding layer to obtain the second decoding vector of the i-th decoding layer, where 2 ≤ i ≤ n;
a fourth judging unit configured to judge whether i equals n; if so, the second decoding vector generation unit is executed, and if not, the fourth incrementing unit is executed;
a second decoding vector generation unit configured to obtain the second decoding vector in the target language output by the language model based on the second decoding vectors of the n decoding layers;
and a fourth incrementing unit configured to increment i by 1 and continue executing the fourth decoding unit.
Optionally, the second decoding vector generating unit is configured to:
taking the second decoding vector of the n-th decoding layer as the second decoding vector in the target language output by the language model; or
performing a weighted summation of the second decoding vectors of the n decoding layers to obtain the second decoding vector in the target language output by the language model.
According to the translation apparatus provided in this embodiment, the trained end-to-end translation model is used as follows: the sentence to be translated is input to the encoder to obtain the corresponding second coding vector; the second coding vector is input to the language model, which continues decoding to produce second decoding vectors in the target language; and the translated sentence is finally obtained from those second decoding vectors. This effectively improves the accuracy of the translation result.
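To show how these modules could fit together at inference time, here is a hedged end-to-end sketch with greedy selection; embed, encoder, language_model, output_proj, and the token-id handling are assumed interfaces for illustration rather than the disclosed apparatus.

```python
import torch

@torch.no_grad()
def translate(sentence_ids, embed, encoder, language_model, output_proj,
              bos_id: int, eos_id: int, max_len: int = 64):
    """Greedy inference sketch: encode, keep decoding with the language model, read off word units."""
    source_vectors = embed(sentence_ids)                  # statement vector of the sentence to be translated
    second_coding_vector = encoder(source_vectors)        # second coding vector from the m coding layers

    generated = [bos_id]
    for _ in range(max_len):
        reference = embed(torch.tensor([generated]))      # second reference vector from tokens so far
        second_decoding_vector = language_model(reference, second_coding_vector)
        vocab_scores = output_proj(second_decoding_vector[:, -1])        # scores for the next word unit
        next_id = int(torch.softmax(vocab_scores, dim=-1).argmax())      # normalization layer + pick
        if next_id == eos_id:
            break
        generated.append(next_id)
    return generated[1:]                                   # ids of the translated word units
```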
The above is a schematic scheme of a translation apparatus of the present embodiment. It should be noted that the technical solution of the translation apparatus and the technical solution of the translation method described above belong to the same concept, and details that are not described in detail in the technical solution of the translation apparatus can be referred to the description of the technical solution of the translation method described above.
An embodiment of the present application further provides a computing device, which includes a memory, a processor, and computer instructions stored on the memory and executable on the processor; when the processor executes the instructions, the steps of the translation model training method or the translation method described above are implemented.
Fig. 12 is a block diagram illustrating a structure of a computing device 1200 according to an embodiment of the present description. The components of the computing device 1200 include, but are not limited to, the memory 110 and the processor 120. The processor 120 is coupled to the memory 110 via a bus 130 and a database 150 is used to store data.
Computing device 1200 also includes an access device 140 that enables computing device 1200 to communicate via one or more networks 160. Examples of these networks include the Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or a combination of communication networks such as the Internet. The access device 140 may include one or more of any type of network interface, wired or wireless (e.g., a Network Interface Card (NIC)), such as an IEEE 802.11 Wireless Local Area Network (WLAN) wireless interface, a Worldwide Interoperability for Microwave Access (WiMAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, a Near Field Communication (NFC) interface, and so on.
In one embodiment of the present description, the above-described components of computing device 1200 and other components not shown in FIG. 12 may also be connected to each other, such as by a bus. It should be understood that the block diagram of the computing device architecture shown in FIG. 12 is for purposes of example only and is not limiting as to the scope of the description. Those skilled in the art may add or replace other components as desired.
Computing device 1200 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), mobile phone (e.g., smartphone), wearable computing device (e.g., smartwatch, smartglasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or PC. Computing device 1200 may also be a mobile or stationary server.
An embodiment of the present application further provides a computer-readable storage medium, which stores computer instructions, when executed by a processor, for implementing the method for training a translation model or the steps of the translation method as described above.
The above is an illustrative scheme of a computer-readable storage medium of the present embodiment. It should be noted that the technical solution of the storage medium belongs to the same concept as the above-mentioned technical solution of the translation model training method or the translation method, and details of the technical solution of the storage medium, which are not described in detail, can be referred to the above-mentioned description of the technical solution of the translation model training method or the translation method.
The computer instructions include computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or apparatus capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable medium may be increased or decreased as appropriate according to the requirements of legislation and patent practice in a jurisdiction; for example, in some jurisdictions, in accordance with legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunications signals.
It should be noted that, for the sake of simplicity, the above-mentioned method embodiments are described as a series of acts or combinations, but those skilled in the art should understand that the present application is not limited by the described order of acts, as some steps may be performed in other orders or simultaneously according to the present application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The preferred embodiments of the present application disclosed above are intended only to aid in the explanation of the application. Alternative embodiments are not exhaustive and do not limit the invention to the precise embodiments described. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the application and the practical application, to thereby enable others skilled in the art to best understand and utilize the application. The application is limited only by the claims and their full scope and equivalents.

Claims (15)

1. A method for training a translation model, the translation model comprising an encoder and a language model, wherein the language model is obtained by pre-training with monolingual corpus of a target language;
the training method comprises the following steps:
inputting a source language sample sentence into the encoder to obtain a first encoding vector corresponding to the source language sample sentence;
inputting a first coding vector corresponding to the source language sample sentence and a target language sample sentence into a language model to obtain a first decoding vector in the target language output by the language model and an error corresponding to the first decoding vector;
adjusting parameters of the language model and the encoder based on an error of a first decoded vector output by the language model until a training stop condition is reached.
2. The method of claim 1, wherein the encoder comprises m sequentially connected encoding layers, wherein m is a positive integer;
inputting a source language sample sentence into an encoder, and obtaining a first encoding vector corresponding to the source language sample sentence, wherein the method comprises the following steps:
S102, inputting the source language sample sentence into a first coding layer to generate a first coding vector of the first coding layer;
S104, inputting the first coding vector of the (j-1)-th coding layer to the j-th coding layer to obtain the first coding vector output by the j-th coding layer, where 2 ≤ j ≤ m;
S106, judging whether j is equal to m; if so, executing step S108, and if not, executing step S110;
S108, obtaining a first coding vector corresponding to the source language sample sentence based on the first coding vectors of the m coding layers;
S110, increasing j by 1 and continuing to execute step S104.
3. The method of claim 2, wherein deriving the first code vector corresponding to the source language sample sentence based on the first code vectors of the m coding layers comprises:
taking the first coding vector of the m-th coding layer as the first coding vector corresponding to the source language sample sentence; or
performing a weighted summation of the first coding vectors of the m coding layers to obtain the first coding vector corresponding to the source language sample sentence.
4. The method of claim 1, wherein the language model comprises n decoding layers connected in sequence, where n is a positive integer;
inputting a first coding vector corresponding to the source language sample sentence and the target language sample sentence into a language model to obtain a first decoding vector in the target language output by the language model comprises:
S202, generating a corresponding first reference vector according to the input target language sample sentence;
S204, inputting the first reference vector and a first coding vector corresponding to the source language sample sentence into a first decoding layer to obtain a first decoding vector of the first decoding layer;
S206, inputting the first decoding vector of the (i-1)-th decoding layer and the first coding vector corresponding to the source language sample sentence into the i-th decoding layer to obtain the first decoding vector of the i-th decoding layer, where 2 ≤ i ≤ n;
S208, judging whether i is equal to n; if so, executing step S210, and if not, executing step S212;
S210, obtaining a first decoding vector in the target language output by the language model based on the first decoding vectors of the n decoding layers;
S212, increasing i by 1, and executing step S206.
5. The method of claim 4, wherein deriving the first decoded vector in the target language output by the language model based on the first decoded vectors in the n decoded layers comprises:
taking the first decoding vector of the n-th decoding layer as the first decoding vector in the target language output by the language model; or
performing a weighted summation of the first decoding vectors of the n decoding layers to obtain the first decoding vector in the target language output by the language model.
6. The method of claim 1, wherein obtaining the error corresponding to the first decoded vector output by the language model comprises: comparing the first decoding vector output by the language model with a preset vector verification set to obtain the error of the first decoding vector output by the language model;
the training stop condition includes: the rate of change of the error of the first decoded vector output by the language model is less than a stability threshold.
7. A translation method applied to the translation model obtained by the method according to any one of claims 1 to 6, the translation method comprising:
inputting a statement to be translated into an encoder to obtain a second coding vector corresponding to the statement to be translated;
inputting a second coding vector corresponding to the statement to be translated into a language model to obtain a second decoding vector which is output by the language model and is based on a target language;
and obtaining a word unit corresponding to each second decoding vector based on the second decoding vectors output by the language model, and obtaining a translation statement according to the word unit.
8. The method of claim 7, wherein the encoder comprises m sequentially connected encoding layers, wherein m is a positive integer;
inputting the statement to be translated into an encoder to obtain a second coding vector corresponding to the statement to be translated, wherein the method comprises the following steps:
S302, embedding the statement to be translated to obtain a corresponding statement vector, inputting the statement vector of the statement to be translated to a first coding layer, and generating a second coding vector of the first coding layer;
S304, inputting the second coding vector of the (j-1)-th coding layer to the j-th coding layer to obtain the second coding vector output by the j-th coding layer, where 2 ≤ j ≤ m;
S306, judging whether j is equal to m; if so, executing step S308, and if not, executing step S310;
S308, obtaining a second coding vector corresponding to the statement to be translated based on the second coding vectors of the m coding layers;
S310, increasing j by 1 and continuing to execute step S304.
9. The method of claim 8, wherein deriving a second codevector corresponding to the sentence to be translated based on second codevectors of m coding layers comprises:
taking the second coding vector of the m-th coding layer as the second coding vector corresponding to the statement to be translated; or performing a weighted summation of the second coding vectors of the m coding layers to obtain the second coding vector corresponding to the statement to be translated.
10. The method of claim 7, wherein the language model comprises n decoding layers connected in sequence, where n is a positive integer;
inputting the second coding vector corresponding to the statement to be translated into a language model to obtain a second decoding vector in the target language output by the language model comprises:
S402, inputting a second reference vector and a second coding vector corresponding to the statement to be translated into a first decoding layer to obtain a second decoding vector of the first decoding layer;
S404, inputting the second decoding vector of the (i-1)-th decoding layer and the second coding vector corresponding to the statement to be translated into the i-th decoding layer to obtain the second decoding vector of the i-th decoding layer, where 2 ≤ i ≤ n;
S406, judging whether i is equal to n; if so, executing step S408, and if not, executing step S410;
S408, obtaining a second decoding vector in the target language output by the language model based on the second decoding vectors of the n decoding layers;
S410, increasing i by 1, and executing step S404.
11. The method of claim 10, wherein deriving a second decoding vector based on the target language output by the language model based on the second decoding vectors of the n decoding layers comprises:
taking the second decoding vector of the n-th decoding layer as the second decoding vector in the target language output by the language model; or
performing a weighted summation of the second decoding vectors of the n decoding layers to obtain the second decoding vector in the target language output by the language model.
12. An apparatus for training a translation model, the translation model comprising an encoder and a language model, wherein the language model is obtained by pre-training with monolingual corpus of a target language;
the training apparatus includes:
a first encoding module configured to input a source language sample sentence to the encoder, resulting in a first encoding vector corresponding to the source language sample sentence;
a first decoding module configured to input a first coding vector corresponding to the source language sample sentence and a target language sample sentence into a language model, to obtain a first decoding vector in the target language output by the language model and an error corresponding to the first decoding vector;
a parameter tuning module configured to adjust parameters of the language model and the encoder based on an error of a first decoding vector output by the language model until a training stop condition is reached.
13. A translation apparatus, comprising:
the second coding module is configured to input the statement to be translated into the coder to obtain a second coding vector corresponding to the statement to be translated;
the second decoding module is configured to input a second coding vector corresponding to the statement to be translated into a language model, and obtain a second decoding vector which is output by the language model and is based on a target language;
and the translation statement generation module is configured to obtain a word unit corresponding to each second decoding vector based on the second decoding vectors output by the language model, and obtain a translation statement according to the word unit.
14. A computing device comprising a memory, a processor, and computer instructions stored on the memory and executable on the processor, wherein the processor implements the steps of the method of any of claims 1-6 or 7-11 when executing the instructions.
15. A computer-readable storage medium storing computer instructions, which when executed by a processor, perform the steps of the method of any one of claims 1-6 or 7-11.
CN202010215046.0A 2020-03-24 2020-03-24 Translation model training method and device, and translation method and device Pending CN113449529A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010215046.0A CN113449529A (en) 2020-03-24 2020-03-24 Translation model training method and device, and translation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010215046.0A CN113449529A (en) 2020-03-24 2020-03-24 Translation model training method and device, and translation method and device

Publications (1)

Publication Number Publication Date
CN113449529A true CN113449529A (en) 2021-09-28

Family

ID=77806691

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010215046.0A Pending CN113449529A (en) 2020-03-24 2020-03-24 Translation model training method and device, and translation method and device

Country Status (1)

Country Link
CN (1) CN113449529A (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108304390A (en) * 2017-12-15 2018-07-20 腾讯科技(深圳)有限公司 Training method, interpretation method, device based on translation model and storage medium
WO2019114695A1 (en) * 2017-12-15 2019-06-20 腾讯科技(深圳)有限公司 Translation model-based training method, translation method, computer device and storage medium
CN109359309A (en) * 2018-12-11 2019-02-19 成都金山互动娱乐科技有限公司 A kind of interpretation method and device, the training method of translation model and device
CN109902312A (en) * 2019-03-01 2019-06-18 北京金山数字娱乐科技有限公司 A kind of interpretation method and device, the training method of translation model and device
CN109902313A (en) * 2019-03-01 2019-06-18 北京金山数字娱乐科技有限公司 A kind of interpretation method and device, the training method of translation model and device
CN110263348A (en) * 2019-03-06 2019-09-20 腾讯科技(深圳)有限公司 Interpretation method, device, computer equipment and storage medium
CN110598222A (en) * 2019-09-12 2019-12-20 北京金山数字娱乐科技有限公司 Language processing method and device, and training method and device of language processing system

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113821604A (en) * 2021-10-11 2021-12-21 京东科技控股股份有限公司 Text generation method and device
WO2023160472A1 (en) * 2022-02-22 2023-08-31 华为技术有限公司 Model training method and related device


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination