WO2023061107A1

WO2023061107A1 - Language translation method and apparatus based on layer prediction, and device and medium

Info

Publication number: WO2023061107A1
Application number: PCT/CN2022/117230
Authority: WO
Inventors: 周浩; 黄晨阳; 牟力立; 李磊; 扎安·奥斯马尔
Original assignee: 北京有竹居网络技术有限公司
Priority date: 2021-10-13
Filing date: 2022-09-06
Publication date: 2023-04-20
Also published as: CN113935338A; CN113935338B

Abstract

According to an implementation the present disclosure, provided are a language translation method and apparatus based on layer prediction, and a device and a medium. One method comprises: on a first implicit layer among a plurality of implicit layers of a decoder of a translation model, determining, on the basis of a code of input data comprised in training data, a first implicit state associated with the first implicit layer, wherein the training data comprises the input data represented by a source language and output data represented by a target language; determining prediction information associated with the output data; generating an updated first implicit state on the basis of the first implicit state and the prediction information; and outputting the updated first implicit state to a second implicit layer behind the first implicit layer among the plurality of implicit layers, such that the updated first implicit state is used as a second implicit state associated with the second implicit layer. In this way, by means of providing prediction information for each implicit layer, a translation model can be realized in a more effective and accurate manner.

Description

Method, device, device and medium for language translation based on layer prediction

This application claims the priority of the Chinese invention patent application entitled "Method, device, device and medium for language translation based on layer prediction" and application number CN202111191528.8 submitted on October 13, 2021, all of which have been disclosed and approved Incorporated herein by reference.

technical field

Exemplary implementations of the present disclosure generally relate to the computer field, and in particular, relate to a method, device, apparatus, and computer-readable storage medium for language translation based on layer prediction.

Background technique

Language translation involves translating content expressed in a source language into content expressed in a target language. Various translation schemes have been proposed at present, but the translation speed and accuracy of the existing technical schemes are not satisfactory. Thus, it is desirable to be able to perform language translation in a more efficient and accurate manner.

Contents of the invention

According to an exemplary implementation manner of the present disclosure, a solution for language translation based on layer prediction is provided.

In a first aspect of the present disclosure, a method for language translation based on layer prediction is provided. In the method, at a first hidden layer of a plurality of hidden layers of a decoder of a translation model, a first hidden layer associated with the first hidden layer is determined based on an encoding of input data included in the training data. state, the training data includes input data expressed in the source language and output data expressed in the target language, and the translation model is used to translate the input data into output data. Predictive information associated with the output data is determined. An updated first implicit state is generated based on the first implicit state and the prediction information. outputting the updated first implicit state to a second implicit layer subsequent to the first implicit layer among the plurality of implicit layers, such that the updated first implicit state is taken as the first implicit state associated with the second implicit layer Two implicit states.

In a second aspect of the present disclosure, there is provided an electronic device comprising: at least one processing unit; and at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit, The instructions, when executed by at least one processing unit, cause the device to perform actions. The actions include, at a first of a plurality of hidden layers of a decoder of the translation model, determining a first hidden value associated with the first hidden layer based on an encoding of input data included in the training data. state, the training data includes input data expressed in the source language and output data expressed in the target language, the translation model is used to translate the input data into output data; determine the prediction information associated with the output data; based on the first implicit state and predicting information, generating an updated first implicit state; and outputting the updated first implicit state to a second implicit layer following the first implicit layer among the plurality of implicit layers, so that the updated first implicit state The state is taken as a second implicit state associated with the second implicit layer.

In a third aspect of the present disclosure, a method for language translation based on layer prediction is provided. In the method, an encoding of data to be translated expressed in a source language is received, a first implicit state associated with a first implicit layer of a plurality of implicit layers in a translation model is determined, and the translation model is used to convert the data in the The data to be translated expressed in the source language is translated into a translation result expressed in the target language. Predictive information associated with the translation result is determined based on the first implicit state. An updated first implicit state update is generated based on the first implicit state and the prediction information. inputting an updated first implicit state to a second implicit layer following the first implicit layer of the plurality of implicit layers, so that the translation model associates the updated first implicit state with the second implicit layer the second implicit state of .

In a fourth aspect of the present disclosure, there is provided an electronic device comprising: at least one processing unit; and at least one memory, the at least one memory being coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit, The instructions, when executed by at least one processing unit, cause the device to perform actions. The actions include receiving an encoding of data to be translated expressed in a source language, determining a first implicit state associated with a first of a plurality of implicit layers in a translation model for translating data in a source language into Translating the data to be translated expressed in the language into a translation result expressed in the target language; determining prediction information associated with the translation result based on the first implicit state; generating an updated first implicit state based on the first implicit state and the prediction information update; and output an updated first implicit state to a second implicit layer after the first implicit layer of the plurality of implicit layers, so that the translation model uses the updated first implicit state as the same as the second implicit The second implicit state associated with the layer.

In a fifth aspect of the present disclosure, an apparatus for language translation based on layer prediction is provided, including: a determining unit configured to, at a first implicit layer among a plurality of implicit layers of a decoder of a translation model, Determining a first implicit state associated with the first implicit layer based on an encoding of input data included in training data comprising input data in a source language and output data in a target language, the translation model for translating input data into output data; a prediction unit configured to determine prediction information associated with the output data; a generation unit configured to generate an updated first implicit state based on the first implicit state and the prediction information; and an output unit configured to output an updated first implicit state to a second implicit layer following the first implicit layer among the plurality of implicit layers, so that the updated first implicit state is used as the same as the second implicit state The second implicit state associated with the formula layer.

In a sixth aspect of the present disclosure, an apparatus for language translation based on layer prediction is provided, including: a receiving unit configured to receive the encoding of the data to be translated expressed in the source language, and determine and translate multiple latent elements in the translation model. The first implicit state associated with the first implicit layer in the formula layer, the translation model is used to translate the data to be translated expressed in the source language into the translation result expressed in the target language; the determining unit is configured to be based on the first The implicit state determines prediction information associated with the translation result; a generating unit configured to generate an updated first implicit state update based on the first implicit state and the prediction information; and output the first implicit state update to the plurality of implicit layers The second implicit layer following the implicit layer outputs the updated first implicit state, so that the translation model uses the updated first implicit state as the second implicit state associated with the second implicit layer.

In a seventh aspect of the present disclosure, a computer readable storage medium is provided. A computer program is stored on the medium, and when the program is executed by the processor, the method in the first aspect is realized.

In an eighth aspect of the present disclosure, a computer readable storage medium is provided. A computer program is stored on the medium, and when the program is executed by the processor, the method of the third aspect is realized.

It should be understood that what is described in the Summary of the Invention is not intended to limit the key or important features of the implementations of the present disclosure, nor is it intended to limit the scope of the present disclosure. Other features of the present disclosure will be readily understood through the following description.

Description of drawings

Hereinafter, the above and other features, advantages and aspects of various implementations of the present disclosure will become more apparent with reference to the following detailed description in conjunction with the accompanying drawings. In the drawings, identical or similar reference numerals denote identical or similar elements, wherein:

1 shows a block diagram of an example environment in which implementations of the present disclosure can be implemented;

2 shows a block diagram of a translation model for translating a source language into a target language, according to some implementations of the present disclosure;

Figure 3 shows a block diagram of a decoder in a translation model according to some implementations of the present disclosure;

4 shows a block diagram of a training process for training a decoder according to some implementations of the present disclosure;

Fig. 5 shows a block diagram of prediction information provided to each node in an implicit layer according to some implementations of the present disclosure;

6 shows a block diagram of nodes in an implicit layer in a decoder, according to some implementations of the present disclosure;

FIG. 7 shows a flowchart of a method for language translation based on layer prediction according to some implementations of the present disclosure;

FIG. 8 shows a flowchart of a method for language translation based on layer prediction according to some implementations of the present disclosure;

FIG. 9A shows a block diagram of an apparatus for language translation based on layer prediction according to some implementations of the present disclosure;

FIG. 9B shows a block diagram of an apparatus for language translation based on layer prediction according to some implementations of the present disclosure; and

Figure 10 shows a block diagram of a device capable of implementing various implementations of the present disclosure.

Detailed ways

Implementations of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although certain implementations of the disclosure are shown in the drawings, it should be understood that the disclosure may be embodied in various forms and should not be construed as limited to the implementations set forth herein; It is for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and implementation manners of the present disclosure are for exemplary purposes only, and are not intended to limit the protection scope of the present disclosure.

In the description of the implementation of the present disclosure, the term "comprising" and its similar expressions should be interpreted as an open inclusion, that is, "including but not limited to". The term "based on" should be understood as "based at least in part on". The term "an implementation" or "the implementation" should be understood as "at least one implementation". The term "some implementations" should be read as "at least some implementations". Other definitions, both express and implied, may also be included below.

As used herein, the term "model" can learn the relationship between the corresponding input and output from the training data, so that the corresponding output can be generated for the given input after the training is completed. The generation of the model may be based on machine learning techniques. Deep learning is a machine learning algorithm that uses multiple layers of processing units to process input and provide corresponding output. A neural network model is an example of a deep learning based model. A "model" may also be referred to herein as a "machine learning model," "learning model," "machine learning network," or "learning network," and these terms are used interchangeably herein.

A "neural network" is a machine learning network based on deep learning. A neural network is capable of processing input and providing a corresponding output, and typically includes an input layer and an output layer and one or more hidden layers between the input layer and the output layer. Neural networks used in deep learning applications often include many hidden layers, increasing the depth of the network. The layers of the neural network are connected in sequence so that the output of the previous layer is provided as the input of the subsequent layer, where the input layer receives the input of the neural network, and the output of the output layer serves as the final output of the neural network. Each layer of a neural network consists of one or more nodes (also known as processing nodes or neurons), each of which processes input from the previous layer.

Generally, machine learning can roughly include three phases, namely training phase, testing phase and application phase (also known as inference phase). In the training phase, a given model can be trained using a large amount of training data, and the parameter values are updated iteratively until the model can obtain consistent inferences that meet the expected goals from the training data. Through training, a model can be thought of as being able to learn associations from inputs to outputs (also known as input-to-output mappings) from the training data. The parameter values of the trained model are determined. In the testing phase, the performance of the model is determined by applying test inputs to the trained model to test whether the model can provide the correct output. In the application stage, the model can be used to process the actual input and determine the corresponding output based on the parameter values obtained by training.

In the field of language translation, a training data set including a large amount of training data can be used to train a translation model, so that the translation model can translate input content expressed in a source language into content expressed in a target language. Various technical solutions for language translation have been proposed so far.

In the technical solution based on automatic regression, each word in the sentence expressed in the source language can be predicted one by one. Specifically, the translation of the first word can be predicted in the first processing, and the translations of other words can be predicted one by one in the subsequent processing. Although the above technical solution can achieve higher accuracy, the translation speed is not satisfactory due to multiple processing procedures. At present, a technical solution based on non-automatic regression has also been proposed, which can output the translation results of all words in a sentence in one processing. However, translating each word at the same time, the translation model does not know which words in the sentence have been translated and which words have not been translated, which leads to inaccurate translation results. Therefore, it is expected that such translation models can be further improved in order to provide more accurate translation results.

example environment

FIG. 1 shows a block diagram of an example environment 100 in which implementations of the present disclosure can be implemented. In environment 100 of FIG. 1 , it is desirable to train and use a model (ie, translation model 130 ) configured to translate content expressed in a source language to content expressed in a target language. As shown in FIG. 1 , the environment 100 includes a model training system 150 and a model application system 152 , and a translation model 130 can be implemented using a transformer encoder/decoder-based architecture. The upper part of Fig. 1 shows the process of the model training phase, and the lower part shows the process of the model application phase. Before training, the parameter values of the translation model 130 may have initial values, or may have pre-trained parameter values obtained through a pre-training process. Through the training process, the parameter values of the translation model 130 can be updated and adjusted. The translation model 130' can be obtained after the training is completed. At this point, the parameter values of the translation model 130' have been updated, and based on the updated parameter values, the translation model 130 can be used to implement translation tasks in the application phase.

In the model training phase, the translation model 130 can be trained using a model training system 150 based on a training data set 110 including a plurality of training data 112 . In this case, each training data 112 may refer to a 2-tuple format, for example comprising input data 120 and output data 122 . In the context of this disclosure, only Chinese and English will be used as examples of source and target languages to describe specific details about the translation process. According to an exemplary implementation of the present disclosure, the source language and the target language may also be any two different languages among the following: Japanese, French, Russian, Spanish, and so on.

In the context of the present disclosure, input data 120 may include character strings in a source language, such as "who is she," and output data 122 may include character strings in a target language, such as "who is she." Translation model 130 may be trained using training data 112 including input data 120 and output data 122 . Specifically, the training process can be performed iteratively using a large amount of training data. After the training is completed, the translation model 130' can convert the data to be translated expressed in the source language into a translation result expressed in the target language. In the model application stage, the translation model 130' can be invoked by the model application system 152 (the translation model 130' at this time has the parameter values after training), and the above-mentioned translation tasks can be performed. For example, data to be translated 140 may be received and translation results 142 output.

In FIG. 1 , the model training system 150 and the model application system 152 may include any computing system with computing capabilities, such as various computing devices/systems, terminal devices, servers, and the like. Terminal equipment may refer to any type of mobile terminal, fixed terminal or portable terminal, including mobile phones, desktop computers, laptop computers, notebook computers, netbook computers, tablet computers, media computers, multimedia tablets, or any combination of the foregoing , including accessories and peripherals for these devices, or any combination thereof. Servers include, but are not limited to, mainframes, edge computing nodes, computing devices in cloud environments, and the like.

It should be understood that the components and arrangement shown in FIG. 1 in environment 100 are examples only, and that a computing system suitable for implementing the example implementations described in this disclosure may include one or more different components, other components, and/or or a different arrangement. For example, although shown as separate, model training system 150 and model application system 152 may be integrated in the same system or device. Implementations of the present disclosure are not limited in this regard. Exemplary implementations of model training and model application will be described below with continued reference to the accompanying drawings.

Architecture of the translation model

According to the implementation of the present disclosure, a method for language translation based on layer prediction is proposed. Specifically, the translation model 130 shown in FIG. 1 can be implemented based on the transformer encoder/decoder architecture. An overview of this architecture is first described with reference to FIG. 2, which shows a block diagram 200 of a translation model for translating a source language into a target language, according to some implementations of the present disclosure. The translation model 130 according to an exemplary implementation of the present disclosure may be implemented using a transformer-based encoder/decoder architecture. For example, the translation model 130 may be implemented using various encoders/decoders or variations thereof that are currently known and/or will be developed in the future.

As shown in FIG. 2 , transformer 210 may include encoder 220 and decoder 230. The encoder 220 can map the input data 120 to appropriate codes via a plurality of layers 222 , 224 , . . . , and 226 . Further, the codes output by the encoder 220 may be input to one or more layers 232 , 234 , .

In this architecture, the encoder 220 and the decoder 230 may respectively include a plurality of layers (for example, 6 layers or other numbers), and the plurality of layers operate in coordination to respectively implement encoding and decoding functions. It will be understood that the various layers here are respectively located inside the encoder and the decoder, so these layers can be called implicit layers. Each hidden layer may have a corresponding hidden state (eg, represented by a multidimensional vector). Inside the encoder 220 and the decoder 230, the implicit state of each implicit layer may be processed so as to use the processed updated implicit state as the implicit state of the next layer.

According to an exemplary implementation of the present disclosure, a decoder for layer prediction is proposed. At each layer of the decoder, translation results can be predicted based on the implicit state associated with each layer. Further, the predicted translation results based on the previous layer can be fed to the next layer for subsequent processing. With the exemplary implementation of the present disclosure, a decoder implemented based on layer prediction can know the predicted translation result at the upper layer, and thus can realize the decoding task in a more accurate manner. Further, the implicit layers of the decoder obtained in this way work in harmony, and the final translation is output at the last layer. At this time, the translation model 130 can output the translation of each word at the same time, which can greatly improve the translation speed.

Hereinafter, the architecture of a decoder according to an exemplary implementation of the present disclosure will be described with reference to FIG. 3 . FIG. 3 shows a block diagram 300 of a decoder 310 in the translation model 130 according to some implementations of the disclosure. As shown in FIG. 3, the decoder 310 may include a plurality of layers 312, 314, ..., and 316, among others. As shown in FIG. 3, each layer may include a certain number of nodes, where the number of nodes is determined based on the length of the output data. In the training phase, the length is determined based on the real length of the output data in the training data; and in the application phase, the length can be predicted based on various techniques currently known and/or to be developed in the future. It will be understood that the number of nodes included in each layer is equal, assuming that the length of the output data in the training data is 3, then each layer includes 3 nodes and each node corresponds to a word in the output data.

Node 320 schematically shows a node in a layer in decoder 310 (eg, the last node in layer 1 ), and more details of node 320 are shown on the right side of FIG. 3 . Here, node 320 corresponds to a position in the output data (that is, corresponds to a word in the translation result). In the node 320 shown on the right side of FIG. 3 , except for the implicit state 322 corresponding to the node , input the prediction information 324 corresponding to the node 320 to the node 320 . Here, the prediction information 324 represents prediction information corresponding to the position. Further, an updated implicit state 326 may be generated based on the prediction information 324 and the implicit state 322 corresponding to the position in the translation result. Here, the updated implicit state 326 includes not only the implicit state information of the current layer, but also the prediction information of the translated word at this position.

According to an exemplary implementation of the present disclosure, in an implicit layer, the node 320 described above may be implemented at one or more locations. Alternatively and/or additionally, node 320 described above may be implemented at one or more locations in one or more hidden layers. In this way, at each hidden layer, the decoder 310 can know the prediction information corresponding to each translated word, which helps the decoder 310 to train the translation model in a more accurate manner.

According to an exemplary implementation of the present disclosure, the node 320 described above may be implemented at each position at each layer of the encoder 310 . In this way, the accuracy of the translation model can be further improved based on the translation information corresponding to each translated word. Using the exemplary implementation of the present disclosure, each node in each layer in the decoder 310 can know the word corresponding to the position of the node in the translation result, which can help to eliminate repeated translations, Missing words or inaccurate positions of translated words. In this way, the accuracy of the translation model 130 can be improved.

Model training process

Hereinafter, more details about the training process will be described with reference to the accompanying drawings. FIG. 4 shows a block diagram 400 of a training process for training the decoder 310 according to some implementations of the present disclosure. During training, a training data set 110 may be obtained, where the training data set 110 may include a large amount of labeled training data 112 . Specifically, each training data 112 may include input data 120 in a source language (eg, "who is she") and output data 122 in a target language (eg, "who is she"). The translation model 130 may be trained using the training data 112 .

As shown in FIG. 4 , input data 120 may be input into various encoders (eg, encoder 220 ) that are currently known and/or will be developed in the future to extract an encoding 410 of input data 120 . The codes 410 may be stored in a variety of formats, for example, the extracted codes 410 may be stored as multi-dimensional vectors (eg, 512 dimensions or other dimensions). The encoding 410 may then be provided to a layer prediction based decoder 310 . During training, the number of nodes in each layer is determined based on the number of output data 122 , which for the output data 122 in the training data 112 has a length of three. Thus, layer 312 may include three nodes 420 , 422 and 320 . It will be understood that FIG. 4 only uses one layer 312 in the decoder 310 as an example for description, and similar processing can be performed on other layers in the decoder 310 . Implicit states associated with layers 312 in decoder 310 may be determined based on encoding 410 .

Suppose the input data 120 is

Where T _x is the length of the input data 120 . The encoder 220 can take the input data 120 as input, and based on the embedding function

to map it to multiple discrete values. Specifically, at each layer, a multi-head attention operation can be performed based on Equation 1 as follows in order to determine the deep text representation:

in

Denoting the nth encoder layer, the above operations can be repeated for each layer in the encoder 220, and denote the output by the last layer in the encoder 220 as E.

Further, the length T _y of the output data can be determined. At present, various technical solutions have been proposed to determine the length of the output data. In the training phase, the length can be determined based on the real length of the output data. Further, the code 410 output by the encoder 220 may be input to the decoder 310 . For the decoder 310, assume that the implicit state at layer t is expressed as

At each implicit layer t of the conventional decoder 230, the implicit state at the last layer of the decoder can be predicted based on Equation 2 as follows:

in

Denotes the nth layer of the decoder,

Denotes the implicit state of length T _y at layer n-1 of the decoder, and E denotes the output of the encoder. Further, the softmax function can be used to predict the target word in the translation result based on the last implicit state of the decoder.

where _yt represents words at various positions in the translation result, N represents the number of layers of the decoder, and W represents predetermined parameters. In this way, each word in the translation result can be obtained simultaneously

At this time, when predicting the word y _t , the words at other positions are not known

Prediction. Further, training can be performed based on the training objective defined by Equation 4:

in

Represents the training target, ∑ represents the sum function, log represents the logarithmic function, and the meanings of other symbols are the same as the descriptions of the above formulas.

On the basis of the training process of the conventional decoder described above, in the context of the present disclosure, a layer prediction process based on the implicit state of each layer is added to the training process. The implicit state associated with each implicit layer can be determined by referring to the procedures described above. For example, the length of the output data can be used in the training process to determine the number of nodes included at each layer, thereby determining multiple positions associated with the output data. As shown in FIG. 4 , it can be determined that three nodes 420 , 422 and 320 are included at layer 312 . Further, a plurality of parts respectively corresponding to a plurality of positions may be determined based on the code 410 output from the encoder. For example, the implicit state corresponding to the t-th position of the n-th layer in the decoder 310 can be expressed as

Further, predictive information associated with the output data may be determined. Specifically, the prediction information of the translation result can be generated based on the implicit state of each layer, and the generated prediction information can be fed to the next layer as a tentative translation for further processing. According to an exemplary implementation of the present disclosure, the prediction information corresponding to the tth position of the nth layer can be generated based on various methods

For example, the prediction information can be generated based on the prediction of the translation model or directly based on the truth value of the output data

According to an exemplary implementation manner of the present disclosure, for a given position among the plurality of positions, prediction information corresponding to the given position may be determined based on a translation model. For example, for position t in multiple positions, the tth word in the translation result of the current translation model can be used as prediction information

For the node 320 in FIG. 4 , assuming that the translation result output by the decoder 310 is "who is she", the word "she" at the third position can be used as the prediction information for the position of the node 320

It will be understood that the above only uses "who is she" as an example of the translation result to show how to determine the prediction information

However, the translation model may output other translation results, for example, under insufficient training and/or other abnormal conditions, for example, the translation model may output "who is he", in which case the word "he" can be used as prediction information

It will be understood that although the prediction information may not be completely accurate at this time, however, the prediction information carries the information of the location-related translation words determined based on the implicit state of the previous layer, thus helping to improve Accuracy of the translation model.

According to an exemplary implementation of the present disclosure, the prediction information can also be determined directly based on the ground truth in the training data

For example, for a given position among the plurality of positions, the ground-truth data corresponding to the position in the output data 122 of the training data 112 may be used as prediction information. At this time, the three words “who”, “is” and “she” in the output data 122 “who is she” can be directly used as the prediction information for the three nodes 420 , 422 and 320 respectively.

According to an exemplary implementation manner of the present disclosure, each hidden layer may include multiple positions, and during the training process, prediction information determined based on the translation model may be input to some of the multiple positions. According to an exemplary implementation of the present disclosure, the translation model can be used to determine prediction information based on the following formula 5

in

Represents the prediction information, argmax represents the prediction function,

represents the implicit state,

Indicates the translated word. In other words, the Equation 5 expresses,

is the prediction of the most likely translated word based on the current implicit state.

According to an exemplary implementation of the present disclosure, prediction information determined based on ground-truth data may be input to other locations among the plurality of locations. FIG. 5 shows a block diagram 500 of prediction information provided to various nodes in an implicit layer according to some implementations of the present disclosure. Assume that the training data set 110 further includes training data 510, wherein the input data 512 is "who are you" and the output data 514 is "who are you". At this time, it can be determined that the length of the output data 514 is 3, and for the 3 positions in the translation data, the word "who" from the true value and the

predictions

522 and 524 from the translation model can be used as prediction information respectively. At this time, the prediction information will involve the mixture of the two prediction methods, and the mixture ratio at this time is 1:2.

Although FIG. 5 only shows that the prediction information from the ground truth is input to the first position, according to an exemplary implementation of the present disclosure, the prediction information from the ground truth can be used for any other one or more of the multiple positions. Forecast information. Assuming that the length of the output data of another training data is 10, the prediction information from the true value can be used for the 1st, 2nd, and 4th positions, and the prediction information from the translation model can be used for the other 7 positions. At this time, the mixing ratio is 3:7. As another example, at least a portion of locations may be randomly selected from a plurality of locations, and the prediction information from the ground truth is applied to the selected locations.

Using the exemplary implementation of the present disclosure, using the prediction information from the ground truth can make the training process more consistent with the training data with human annotations, thereby making the translation model more accurate. Furthermore, using the prediction information output from the translation model can ensure that the training process takes into account the prediction results of the current training model, and then distinguish the trained translation model more in line with the training goal.

It will be appreciated that the prediction information can be determined at each hidden layer following the same rules. In other words, assuming that the prediction information from the ground truth is applied to the 1st position at the 1st layer, and the prediction information from the translation model is applied to other positions, then at other implicit layers such as the subsequent 2nd, 3rd, etc. , should also apply predictions from the ground truth to the 1st position, and apply predictions from the translation model to the other positions. In this way, each hidden layer can be trained based on the same prediction information, thereby improving the accuracy of the translation model.

has described how to obtain forecast information

How to generate the updated implicit state will be described below with reference to FIG. 6 . This Fig. 6 shows a block diagram 600 of nodes in an implicit layer in a decoder according to some implementations of the present disclosure. Assuming that the decoder 310 includes N implicit layers, at each implicit layer, the implicit state of the layer can be based on

and associated forecast information

Both, to generate an updated implicit state for each position. As shown in FIG. 6 , a vector 610 (e.g., from the output of an encoder) can be input to a node 320 as an implicit state 322, and a vector 620 can be input to a location 320 (e.g., from the truth value "she" in output data 122 ) as prediction information 324. Further, the predicted word with the greatest possibility can be determined based on the following formula 6

The meanings of the symbols in Formula 6 are the same as those in the above formulas, so they will not be repeated here. Further, the updated implicit state can be determined based on both Equations 5 and 6:

in

represents the updated implicit state (to be output to the next implicit layer as the implicit state associated with the next implicit layer), _Wc represents the weight matrix, and

Indicates forecast information

The embedding. In other words, Equation 7 expresses that the

and associated forecast information

The concatenation operation of , and

has the same dimensions as the previous vector representation.

According to an exemplary implementation of the present disclosure, when providing prediction information in a hybrid manner (that is, applying prediction information from the translation model to some positions and applying prediction information from the ground truth to some positions), it can be based on the following Equation 8 to determine the implicit state of the update:

in

_st depends on the hyperparameter λ used to control the mixing ratio, where _st does not depend on the number n of hidden layers.

According to an exemplary implementation of the present disclosure, similar processing can be performed at each implicit layer based on the formula described above, so as to generate an updated first implicit state. Further, an updated implicit state may be output to a subsequent implicit layer after the current implicit layer among the plurality of implicit layers, so that the updated implicit state is used as the second implicit state associated with the subsequent implicit layer. state. Specifically, the process described above may be firstly performed at the first implicit layer of the decoder 310, and the updated implicit state obtained at the first layer is output to the second implicit layer of the decoder 310 , as the implicit state of the second implicit layer. Further, the process described above may be performed at the second implicit layer, and the updated implicit state obtained at the second layer is output to the third implicit layer of the decoder 310 as the third implicit state The implicit state of the first implicit layer, and so on, until similar processing is performed at all implicit layers.

According to an exemplary implementation of the present disclosure, the decoder 310 constructed in the manner described above may be trained based on the training data 112 . Specifically, a loss function describing the training objective may be generated for the hidden layer 312 . In an implicit layer, a loss function can be generated based on a portion of the hidden state. For example, the loss function can be generated based on any one or more of the 3 locations. For example, the loss function can be expressed as the following formula 9:

in

Indicates the loss function used as the training target, T _y indicates the length of the output data, t indicates the position of each word in the output data,

represents the output data,

denotes an updated implicit state, and n denotes any one of a plurality of implicit layers. In other words, the formula represents the difference between the prediction obtained based on the updated implicit state and the output data in the training data.

According to an exemplary implementation of the present disclosure, the translation model is trained using the input data and output data in the training data, so that the loss function in Formula 9 satisfies a predetermined condition (for example, reaches a predetermined convergence state). With the exemplary implementation of the present disclosure, the prediction information for each position can be introduced at one or some implicit layers. In this way, word predictions at other locations can be made to be considered at various implicit layers in order to improve the accuracy of the translation model.

According to an exemplary implementation of the present disclosure, similar operations may be performed for each hidden layer in the decoder 310 . For example, similar operations can be performed at the first hidden layer, the second hidden layer, ..., the nth hidden layer, and corresponding loss functions can be constructed. Further, an overall loss function can be constructed based on the above-mentioned multiple corresponding loss functions. Specifically, the overall loss function can be constructed based on the following formula 10:

where the symbols in Equation 10 have the same meanings as those in the equations described above, and N represents the number of hidden layers in the decoder. According to an exemplary implementation of the present disclosure, the translation model may be trained based on Formula 10, and the loss function determined based on Formula 10 satisfies a predetermined condition (for example, reaches a predetermined convergence state). With the exemplary implementation of the present disclosure, prediction information for each position can be introduced at all implicit layers. In this way, word predictions elsewhere can be made to be considered in a more accurate manner at the various hidden layers in order to improve the accuracy of the translation model.

With exemplary implementations of the present disclosure, the decoder 310 at different implicit layers may be aware of predictions determined based on the implicit state associated with each implicit layer. In this way, the loss at each hidden layer can be further considered and deeply supervised when constructing the loss function, thereby improving the accuracy of the translation model.

Model application process

The training of the translation model 130 discussed above, the trained translation model 130' can be provided to the model application system 152 shown in FIG. 1 for use in processing the data 140 to be translated. Specifically, after the model training phase has been completed, the received data to be translated can be processed using the already trained translation model 130' with the trained parameter values. Return to Figure 1 to describe more information about the model application process. Data to be translated 140 may be input to the trained translation model 130'. At this point, the fully trained translation model 130' can translate the data to be translated 140 from the source language to the target language. For example, translation model 130' may output translation data 142.

According to an exemplary implementation of the present disclosure, data to be translated 140 expressed in a source language may be received, and an encoder in the translation model 130' may extract a corresponding code from the data to be translated 140. The decoder 310 in the translation model 130' can receive this encoding, and the various hidden layers in the decoder 310 can operate in a similar manner to the training process described above. Specifically, at a first implicit layer among the plurality of implicit layers of the decoder 310, a first implicit state associated with the first implicit layer may be determined. Further, prediction information associated with the translation result may be determined based on the first implicit state (for example, based on formula 5 described above). Further, an updated first implicit state update may be generated based on the first implicit state and the prediction information (eg, based on equations 6 and 7 described above). Next, the updated first implicit state may be input to the second implicit layer after the first implicit layer of the plurality of implicit layers, so that the translation model uses the updated first implicit state as the same as the second implicit The second implicit state associated with the layer.

The process described above may be repeated at each hidden layer until at the last hidden layer of decoder 310 the implicit state associated with the last hidden layer is obtained. According to an exemplary implementation of the present disclosure, the final translation result may be determined based on the implicit state associated with the last implicit layer. Using the exemplary implementation of the present disclosure, it is possible to make full use of the translation model based on layer-by-layer prediction training to obtain more accurate translation results. When using the translation model 130', at each hidden layer of the decoder, the output from the previous layer is no longer directly used as input, but prediction information based on the implicit state of the current hidden layer is added. In this way, the translation result of the translation model can be made more consistent with the translation rules between the two languages. Further, the translation model 130' can output all words in the translation result at the same time, in this way, the translation speed can be greatly improved.

example process

FIG. 7 shows a flowchart of a method 700 for layer prediction based language translation according to some implementations of the present disclosure. Specifically, at block 710, at the first implicit layer among the plurality of implicit layers of the decoder of the translation model, based on the encoding of the input data included in the training data, determine the In the first implicit state, the training data includes input data expressed in the source language and output data expressed in the target language, and the translation model is used to translate the input data into output data.

According to an exemplary implementation of the present disclosure, in order to determine the first implicit state associated with the first implicit layer, a plurality of positions associated with the first implicit state may be determined based on the length of the output data. Further, a plurality of parts respectively corresponding to a plurality of positions in the first implicit state may be determined.

At block 720, predictive information associated with the output data may be determined. According to an exemplary implementation of the present disclosure, for a given position among the plurality of positions, the prediction information for the given position may be determined based on any of the following: a translation model; true value data.

At block 730, an updated first implicit state may be generated based on the first implicit state and the prediction information. According to an exemplary implementation of the present disclosure, in order to generate an updated first implicit state, an updated first implicit state may be generated based on a part of the first implicit state corresponding to a given position and prediction information for a given position. The portion of an implicit state corresponding to a given position.

According to an exemplary implementation manner of the present disclosure, in order to generate the updated first implicit state, a mixing ratio of prediction information determined based on the translation model and prediction information determined based on output data may be obtained. Further, an updated first implicit state may be generated based on the mixing ratio, the first implicit state and prediction information.

At block 740, an updated first implicit state is output to a second implicit layer following the first implicit layer of the plurality of implicit layers, such that the updated first implicit state is used as the same as the second implicit The second implicit state associated with the layer.

According to an exemplary implementation of the present disclosure, an updated second implicit state may be generated based on the second implicit state and prediction information. Further, the updated second implicit state may be output to a third implicit layer following the second implicit layer among the plurality of implicit layers, so that the updated second implicit state is regarded as being related to the third implicit layer The third implicit state of the association.

According to an exemplary implementation of the present disclosure, in order to train the translation model, a first training target associated with the first hidden layer may be generated. The translation model may be trained by using the input data and the output data, so that the first training target satisfies the first predetermined condition.

According to an exemplary implementation of the present disclosure, in order to generate a first training target associated with a first hidden layer, a difference between output data and a prediction based on the first hidden state may be determined. Further, the first training target may be generated based on the difference.

According to an exemplary implementation of the present disclosure, in order to train the translation model, a second training target associated with the second hidden layer may be generated. Further, the translation model may be trained by using the input data and the output data, so that the second training target satisfies the second predetermined condition.

According to an exemplary implementation manner of the present disclosure, in order to train the translation model, the training target of the translation model may be determined based on the first training target and the second training target. Further, the translation model can be trained by using the input data and the output data, so that the training target satisfies a predetermined condition.

FIG. 8 shows a flowchart of a method 800 for layer prediction based language translation according to some implementations of the present disclosure. At block 810, an encoding of data to be translated expressed in a source language is received, a first implicit state associated with a first implicit layer of a plurality of implicit layers in a translation model is determined, and the translation model is used to translate the The data to be translated expressed in the source language is translated into a translation result expressed in the target language. At block 820, predictive information associated with the translation result is determined based on the first implicit state. At block 830, an updated first implicit state update is generated based on the first implicit state and the prediction information. At block 840, an updated first implicit state is input to a second implicit layer following the first implicit layer of the plurality of implicit layers, so that the translation model uses the updated first implicit state as the same as the second implicit state The second implicit state associated with the implicit layer.

According to an exemplary implementation of the present disclosure, in the method 800, at the last implicit layer among the plurality of implicit layers, a translation result is determined based on an implicit state associated with the last implicit layer.

Example Apparatus and Equipment

Fig. 9A shows a block diagram of an apparatus 900A for language translation based on layer prediction according to some implementations of the present disclosure. As shown in FIG. 9A , an apparatus 900A includes a determination unit 910A, a prediction unit 920A, a generation unit 930A, and an output unit 940A.

According to an exemplary implementation of the present disclosure, the determining unit 910A is configured to determine, at the first implicit layer among the plurality of implicit layers of the decoder of the translation model, based on the encoding of the input data included in the training data, and The first implicit state associated with the first implicit layer, the training data includes input data expressed in the source language and output data expressed in the target language, and the translation model is used to translate the input data into output data; the prediction unit 920A is configured for The generating unit 930A is configured to generate an updated first implicit state based on the first implicit state and the predictive information; and the output unit 940A is configured to send to a plurality of implicit layers The second implicit layer following the first implicit layer outputs an updated first implicit state such that the updated first implicit state is taken as the second implicit state associated with the second implicit layer.

According to an exemplary implementation of the present disclosure, the determination unit 910A is further configured to: determine a plurality of positions associated with the first implicit state based on the length of the output data; Multiple parts corresponding to the position.

According to an exemplary implementation of the present disclosure, further comprising a training unit configured to train the translation model based on: generating a first training target associated with the first hidden layer; and using input data and output data to train the translation A model such that the first training target satisfies the first predetermined condition.

According to an exemplary implementation of the present disclosure, the training unit is further configured to: determine a difference between the output data and the prediction based on the first implicit state; and generate the first training target based on the difference.

According to an exemplary implementation of the present disclosure, the training unit is further configured to: generate a second training target associated with the second hidden layer; and use input data and output data to train the translation model, so that the second training target Satisfy the second predetermined condition.

According to an exemplary implementation of the present disclosure, the training unit is further configured to: determine the training target of the translation model based on the first training target and the second training target; and use input data and output data to train the translation model, so that the training target Meet the predetermined conditions.

According to an exemplary implementation of the present disclosure, the prediction unit 920A is further configured to, for a given position among the plurality of positions, determine the prediction information for the given position based on any of the following: a translation model; and The ground truth data corresponding to the given position.

According to an exemplary implementation of the present disclosure, the generating unit 930A is further configured to: generate an updated first implicit The part of the state that corresponds to the given position.

According to an exemplary implementation of the present disclosure, the generation unit 930A is further configured to obtain a mixture ratio of the prediction information determined based on the translation model and the prediction information determined based on the output data; and based on the mixture ratio, the first implicit state, and the prediction information , generating the updated first implicit state.

According to an exemplary implementation manner of the present disclosure, the generating unit 930A is further configured to generate an updated second implicit state based on the second implicit state and prediction information. The output unit 940A is further configured to output the updated second implicit state to a third implicit layer following the second implicit layer among the plurality of implicit layers, so that the updated second implicit state is used as the same as the third implicit state The third implicit state associated with the implicit layer.

FIG. 9B shows a block diagram of an apparatus 900B for language translation according to some implementations of the present disclosure. As shown in FIG. 9B , the apparatus 900B includes a receiving unit 910B, a determining unit 920B, a generating unit 930B, and an output unit 940B.

According to an exemplary implementation of the present disclosure, the receiving unit 910B is configured to receive the encoding of the data to be translated expressed in the source language, and determine the first implicit layer associated with the first implicit layer in the translation model An implicit state, the translation model is used to translate the data to be translated expressed in the source language into a translation result expressed in the target language; the determining unit 920B is configured to determine prediction information associated with the translation result based on the first implicit state; The generating unit 930B is configured to generate an updated first implicit state update based on the first implicit state and the prediction information; The implicit layer inputs the updated first implicit state, so that the translation model uses the updated first implicit state as the second implicit state associated with the second implicit layer.

According to an exemplary implementation of the present disclosure, the device 900B further includes: a translation unit configured to, at the last implicit layer among the plurality of implicit layers, based on the implicit state associated with the last implicit layer , to determine the translation result.

FIG. 10 shows a block diagram of a device 1000 capable of implementing various implementations of the present disclosure. It should be understood that the computing device 1000 shown in FIG. 10 is exemplary only and should not constitute any limitation as to the functionality and scope of the implementations described herein. The computing device 1000 shown in FIG. 10 can be used to implement the model training system 150 shown in FIG. 1 , and can also be used to implement the model application system 152 shown in FIG. 1 .

As shown in FIG. 10, computing device 1000 is in the form of a general-purpose computing device. Components of computing device 1000 may include, but are not limited to, one or more processors or processing units 1010, memory 1020, storage devices 1030, one or more communication units 1040, one or more input devices 1050, and one or more output devices 1060. The processing unit 1010 may be an actual or virtual processor and can perform various processing according to programs stored in the memory 1020 . In a multi-processor system, multiple processing units execute computer-executable instructions in parallel to increase the parallel processing capability of the computing device 1000 .

Computing device 1000 typically includes a plurality of computer storage media. Such media can be any available media that is accessible by computing device 1000, including but not limited to, volatile and nonvolatile media, removable and non-removable media. Memory 1020 can be volatile memory (e.g., registers, cache, random access memory (RAM), nonvolatile memory (e.g., read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM) , flash memory) or some combination of them. Storage device 1030 may be removable or non-removable media, and may include machine-readable media, such as flash drives, magnetic disks, or any other media that may be capable of storing information and/or data (e.g., training data for training ) and can be accessed within computing device 1000.

The computing device 1000 may further include additional removable/non-removable, volatile/nonvolatile storage media. Although not shown in FIG. 10, a disk drive for reading from or writing to a removable, nonvolatile disk (such as a "floppy disk") and a disk drive for reading from a removable, nonvolatile disk may be provided. CD-ROM drive for reading or writing. In these cases, each drive may be connected to the bus (not shown) by one or more data media interfaces. The memory 1020 may include a computer program product 1025 having one or more program modules configured to perform the various methods or actions of the various implementations of the present disclosure.

The communication unit 1040 enables communication with other computing devices through communication media. Additionally, the functionality of the components of computing device 1000 may be implemented in a single computing cluster or as a plurality of computing machines capable of communicating via communication links. Accordingly, computing device 1000 may operate in a networked environment using logical connections to one or more other servers, a network personal computer (PC), or another network node.

The input device 1050 may be one or more input devices, such as a mouse, keyboard, trackball, and the like. Output device 1060 may be one or more output devices, such as a display, speakers, printer, or the like. The computing device 1000 can also communicate with one or more external devices (not shown) through the communication unit 1040 as needed, such as storage devices, display devices, etc., and one or more devices that enable the user to interact with the computing device 1000 communicate, or communicate with any device (eg, network card, modem, etc.) that enables computing device 1000 to communicate with one or more other computing devices. Such communication may be performed via an input/output (I/O) interface (not shown).

According to an exemplary implementation of the present disclosure, there is provided a computer-readable storage medium on which computer-executable instructions are stored, wherein the computer-executable instructions are executed by a processor to implement the methods described above. According to an exemplary implementation of the present disclosure, there is also provided a computer program product tangibly stored on a non-transitory computer-readable medium and comprising computer-executable instructions, and the computer-executable instructions are executed by a processor to implement the method described above. According to an exemplary implementation of the present disclosure, there is provided a computer program product on which a computer program is stored, the program implementing the method described above when executed by a processor.

Aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus, apparatus, and computer program products implemented according to the disclosure. It should be understood that each block of the flowcharts and/or block diagrams, and combinations of blocks in the flowcharts and/or block diagrams, can be implemented by computer-readable program instructions.

These computer-readable program instructions may be provided to a processing unit of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine such that when executed by the processing unit of the computer or other programmable data processing apparatus , producing an apparatus for realizing the functions/actions specified in one or more blocks in the flowchart and/or block diagram. These computer-readable program instructions can also be stored in a computer-readable storage medium, and these instructions cause computers, programmable data processing devices and/or other devices to work in a specific way, so that the computer-readable medium storing instructions includes An article of manufacture comprising instructions for implementing various aspects of the functions/acts specified in one or more blocks in the flowchart and/or block diagrams.

computer-readable program instructions can be loaded onto a computer, other programmable data processing apparatus, or other equipment, so that a series of operational steps are performed on the computer, other programmable data processing apparatus, or other equipment to produce a computer-implemented process, Instructions executed on computers, other programmable data processing devices, or other devices can thus implement the functions/actions specified in one or more blocks in the flowcharts and/or block diagrams.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various implementations of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of an instruction that contains one or more executable instruction. In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks in succession may, in fact, be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending upon the functionality involved. It should also be noted that each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by a dedicated hardware-based system that performs the specified function or action , or may be implemented by a combination of dedicated hardware and computer instructions.

Having described various implementations of the present disclosure above, the foregoing description is exemplary, not exhaustive, and is not limited to the disclosed implementations. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described implementations. The choice of terminology used herein aims to best explain the principle of each implementation, practical application or improvement of technology in the market, or to enable other ordinary skilled in the art to understand each implementation disclosed herein.

Claims

A method of language translation based on layer prediction, comprising: at a first implicit layer of a plurality of implicit layers of a decoder of a translation model,

determining a first implicit state associated with said first implicit layer based on an encoding of input data included in training data comprising input data in a source language and output data in a target language, said translation model for translating said input data into said output data;

determining predictive information associated with said output data;

generating an updated first implicit state based on the first implicit state and the prediction information; and

outputting the updated first implicit state to a second implicit layer subsequent to the first implicit layer of the plurality of implicit layers such that the updated first implicit state is used as the The second implicit state associated with the second implicit layer.
The method of claim 1, wherein determining the first implicit state associated with the first implicit layer comprises:

determining a plurality of locations associated with the first implicit state based on a length of the output data; and

A plurality of portions in the first implicit state corresponding to the plurality of positions respectively are determined.
The method according to claim 1 or 2, further comprising training the translation model based on:

generating a first training target associated with the first hidden layer; and

The translation model is trained by using the input data and the output data, so that the first training target satisfies a first predetermined condition.
The method of claim 3, wherein generating the first training target associated with the first hidden layer comprises:

determining a difference between the output data and a prediction based on the first implicit state; and

The first training target is generated based on the difference.
The method of claim 3, wherein training the translation model further comprises:

generating a second training target associated with the second hidden layer; and

The translation model is trained by using the input data and the output data, so that the second training target satisfies a second predetermined condition.
The method of claim 5, wherein training the translation model further comprises:

determining a training target for the translation model based on the first training target and the second training target; and

The translation model is trained by using the input data and the output data, so that the training target satisfies a predetermined condition.
The method of claim 2, wherein determining the predictive information associated with the output data comprises, for a given location of the plurality of locations, determining for the given location based on any of the following: Forecast information for:

said translation model; and

ground-truth data corresponding to the given location in the output data.
The method according to claim 7, wherein generating the updated first implicit state comprises: based on the portion of the first implicit state corresponding to the given position and the prediction information, generating a portion of the updated first implicit state corresponding to the given position.
The method of claim 7, wherein generating the updated first implicit state comprises:

obtaining a mixture ratio of the prediction information determined based on the translation model and the prediction information determined based on the output data; and

The updated first implicit state is generated based on the mixture ratio, the first implicit state, and the prediction information.
The method according to any one of claims 1 to 9, further comprising:

generating an updated second implicit state based on the second implicit state and the prediction information; and

outputting the updated second implicit state to a third implicit layer subsequent to the second implicit layer of the plurality of implicit layers such that the updated second implicit state is used as the same as the The third implicit state associated with the third implicit layer.
An electronic device comprising:

at least one processing unit; and

at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit that, when executed by the at least one processing unit, cause the device The following actions are performed: at the first of the implicit layers of the decoder of the translation model,

determining a first implicit state associated with said first implicit layer based on an encoding of input data included in training data comprising input data in a source language and output data in a target language, said translation model for translating said input data into said output data;

determining predictive information associated with said output data;

generating an updated first implicit state based on the first implicit state and the prediction information; and

outputting the updated first implicit state to a second implicit layer subsequent to the first implicit layer of the plurality of implicit layers such that the updated first implicit state is used as the The second implicit state associated with the second implicit layer.
The apparatus of claim 11, wherein determining the first implicit state associated with the first implicit layer comprises:

determining a plurality of locations associated with the first implicit state based on a length of the output data; and

A plurality of portions in the first implicit state corresponding to the plurality of positions respectively are determined.
The apparatus of claim 11 or 12, wherein the actions further comprise training the translation model based on:

generating a first training target associated with the first hidden layer; and

The translation model is trained by using the input data and the output data, so that the first training target satisfies a first predetermined condition.
The apparatus of claim 13, wherein generating the first training target associated with the first hidden layer comprises:

determining a difference between the output data and a prediction based on the first implicit state; and

The first training target is generated based on the difference.
The apparatus of claim 13, wherein training the translation model further comprises:

generating a second training target associated with the second hidden layer; and

The translation model is trained by using the input data and the output data, so that the second training target satisfies a second predetermined condition.
The apparatus of claim 15, wherein training the translation model further comprises:

determining a training target for the translation model based on the first training target and the second training target; and

The translation model is trained by using the input data and the output data, so that the training target satisfies a predetermined condition.
The apparatus of claim 12, wherein determining the predictive information associated with the output data comprises, for a given location of the plurality of locations, determining for the given location based on any of the following: Forecast information for:

said translation model; and

ground-truth data corresponding to the given location in the output data.
The apparatus of claim 17, wherein generating the updated first implicit state comprises: based on the portion of the first implicit state corresponding to the given location and the prediction information, generating a portion of the updated first implicit state corresponding to the given position.
The apparatus of claim 17, wherein generating the updated first implicit state comprises:

obtaining a mixture ratio of the prediction information determined based on the translation model and the prediction information determined based on the output data; and

The updated first implicit state is generated based on the mixture ratio, the first implicit state, and the prediction information.
Apparatus according to any one of claims 11 to 19, wherein said actions further comprise:

generating an updated second implicit state based on the second implicit state and the prediction information; and

outputting the updated second implicit state to a third implicit layer subsequent to the second implicit layer of the plurality of implicit layers such that the updated second implicit state is used as the same as the The third implicit state associated with the third implicit layer.
A method for language translation based on layer prediction, comprising:

receiving an encoding of data to be translated in a source language, determining a first implicit state associated with a first of a plurality of implicit layers in a translation model for translating data in the source language Translating the data to be translated expressed in a language into a translation result expressed in a target language;

determining predictive information associated with the translation result based on the first implicit state;

generating an updated first implicit state update based on the first implicit state and the prediction information; and

inputting the updated first implicit state to a second implicit layer following the first implicit layer of the plurality of implicit layers, such that the translation model will update the first implicit state state as a second implicit state associated with the second implicit layer.
The method of claim 21 , further comprising: at a last hidden layer of the plurality of hidden layers, determining the translation result based on an implicit state associated with the last hidden layer .
An electronic device comprising:

at least one processing unit; and

at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit that, when executed by the at least one processing unit, cause the device Perform the following actions:

receiving an encoding of data to be translated in a source language, determining a first implicit state associated with a first of a plurality of implicit layers in a translation model for translating data in the source language Translating the data to be translated expressed in a language into a translation result expressed in a target language;

determining predictive information associated with the translation result based on the first implicit state;

generating an updated first implicit state update based on the first implicit state and the prediction information; and

outputting the updated first implicit state to a second implicit layer following the first implicit layer of the plurality of implicit layers, such that the translation model will update the first implicit state state as a second implicit state associated with the second implicit layer.
The device of claim 23, wherein the actions further comprise: at a last hidden layer of the plurality of hidden layers, based on an implicit state associated with the last hidden layer, determining The translation result.
An apparatus for language translation based on layer prediction, comprising:

A determination unit configured to determine, at a first implicit layer among the plurality of implicit layers of a decoder of the translation model, based on encoding of input data included in the training data, a value associated with the first implicit layer In a first implicit state, the training data includes input data in a source language and output data in a target language, and the translation model is used to translate the input data into the output data;

a prediction unit configured to determine prediction information associated with said output data;

a generating unit configured to generate an updated first implicit state based on the first implicit state and the prediction information; and

an output unit configured to output the updated first hidden state to a second hidden layer following the first hidden layer among the plurality of hidden layers, such that the updated first hidden state The implicit state is taken as the second implicit state associated with the second implicit layer.
An apparatus for language translation based on layer prediction, comprising:

a receiving unit configured to receive an encoding of data to be translated expressed in a source language, to determine a first implicit state associated with a first implicit layer of a plurality of implicit layers in a translation model, said translation model using for translating the data to be translated expressed in the source language into a translation result expressed in the target language;

a determining unit configured to determine prediction information associated with the translation result based on the first implicit state;

a generating unit configured to generate an updated first implicit state update based on the first implicit state and the prediction information; and

outputting to a second hidden layer after the first hidden layer of the plurality of hidden layers, outputting the updated first hidden state such that the translation model will update the first hidden state The implicit state is used as the second implicit state associated with the second implicit layer.
A computer-readable storage medium, on which a computer program is stored, and when the program is executed by a processor, the method according to any one of claims 1 to 10 is implemented.
A computer-readable storage medium on which a computer program is stored, the program implements the method according to any one of claims 21 to 22 when executed by a processor.
A computer program product on which is stored a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 10.
A computer program product having stored thereon a computer program which, when executed by a processor, implements the method according to any one of claims 21 to 22.