WO2023061107A1 - Language translation method and apparatus based on layer prediction, and device and medium - Google Patents

Language translation method and apparatus based on layer prediction, and device and medium

Info

Publication number
WO2023061107A1
Authority
WO
WIPO (PCT)
Prior art keywords
implicit
layer
state
implicit state
updated
Prior art date
Application number
PCT/CN2022/117230
Other languages
French (fr)
Chinese (zh)
Inventor
周浩
黄晨阳
牟力立
李磊
扎安·奥斯马尔
Original Assignee
北京有竹居网络技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京有竹居网络技术有限公司 filed Critical 北京有竹居网络技术有限公司
Publication of WO2023061107A1 publication Critical patent/WO2023061107A1/en

Classifications

    • G06F40/42 Data-driven translation
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G06N3/04 Neural networks; Architecture, e.g. interconnection topology

Definitions

  • Exemplary implementations of the present disclosure generally relate to the computer field, and in particular, relate to a method, device, apparatus, and computer-readable storage medium for language translation based on layer prediction.
  • Language translation involves translating content expressed in a source language into content expressed in a target language.
  • Various translation schemes have been proposed at present, but the translation speed and accuracy of the existing technical schemes are not satisfactory. Thus, it is desirable to be able to perform language translation in a more efficient and accurate manner.
  • a solution for language translation based on layer prediction is provided.
  • a method for language translation based on layer prediction is provided.
  • a first hidden state associated with the first hidden layer is determined based on an encoding of input data included in the training data.
  • the training data includes input data expressed in the source language and output data expressed in the target language
  • the translation model is used to translate the input data into output data.
  • Predictive information associated with the output data is determined.
  • An updated first implicit state is generated based on the first implicit state and the prediction information. The updated first implicit state is output to a second implicit layer subsequent to the first implicit layer among the plurality of implicit layers, such that the updated first implicit state is taken as the second implicit state associated with the second implicit layer.
  • an electronic device comprising: at least one processing unit; and at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit, the instructions, when executed by the at least one processing unit, causing the device to perform actions.
  • the actions include: at a first hidden layer of a plurality of hidden layers of a decoder of the translation model, determining a first hidden state associated with the first hidden layer based on an encoding of input data included in the training data.
  • the training data includes input data expressed in the source language and output data expressed in the target language
  • the translation model is used to translate the input data into the output data; prediction information associated with the output data is determined; an updated first implicit state is generated based on the first implicit state and the prediction information; and the updated first implicit state is output to a second implicit layer following the first implicit layer among the plurality of implicit layers, so that the updated first implicit state is taken as a second implicit state associated with the second implicit layer.
  • a method for language translation based on layer prediction is provided.
  • an encoding of data to be translated expressed in a source language is received, and a first implicit state associated with a first implicit layer of a plurality of implicit layers in a translation model is determined; the translation model is used to translate the data to be translated expressed in the source language into a translation result expressed in the target language.
  • Predictive information associated with the translation result is determined based on the first implicit state.
  • An updated first implicit state is generated based on the first implicit state and the prediction information. The updated first implicit state is input to a second implicit layer following the first implicit layer of the plurality of implicit layers, so that the translation model uses the updated first implicit state as the second implicit state associated with the second implicit layer.
  • an electronic device comprising: at least one processing unit; and at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit, the instructions, when executed by the at least one processing unit, causing the device to perform actions.
  • the actions include: receiving an encoding of data to be translated expressed in a source language, and determining a first implicit state associated with a first implicit layer of a plurality of implicit layers in a translation model, the translation model being used to translate the data to be translated expressed in the source language into a translation result expressed in the target language; determining prediction information associated with the translation result based on the first implicit state; generating an updated first implicit state based on the first implicit state and the prediction information; and outputting the updated first implicit state to a second implicit layer following the first implicit layer of the plurality of implicit layers, so that the translation model uses the updated first implicit state as the second implicit state associated with the second implicit layer.
  • an apparatus for language translation based on layer prediction, including: a determining unit configured to, at a first implicit layer among a plurality of implicit layers of a decoder of a translation model, determine a first implicit state associated with the first implicit layer based on an encoding of input data included in training data, the training data comprising input data in a source language and output data in a target language, the translation model being used to translate the input data into the output data; a prediction unit configured to determine prediction information associated with the output data; a generation unit configured to generate an updated first implicit state based on the first implicit state and the prediction information; and an output unit configured to output the updated first implicit state to a second implicit layer following the first implicit layer among the plurality of implicit layers, so that the updated first implicit state is used as the second implicit state associated with the second implicit layer.
  • an apparatus for language translation based on layer prediction, including: a receiving unit configured to receive an encoding of data to be translated expressed in a source language and to determine a first implicit state associated with a first implicit layer of a plurality of implicit layers in a translation model, the translation model being used to translate the data to be translated expressed in the source language into a translation result expressed in the target language; a determining unit configured to determine prediction information associated with the translation result based on the first implicit state; a generating unit configured to generate an updated first implicit state based on the first implicit state and the prediction information; and an output unit configured to output the updated first implicit state to a second implicit layer following the first implicit layer of the plurality of implicit layers, so that the translation model uses the updated first implicit state as the second implicit state associated with the second implicit layer.
  • a computer readable storage medium is provided.
  • a computer program is stored on the medium, and when the program is executed by a processor, the method of the first aspect is implemented.
  • a computer readable storage medium is provided.
  • a computer program is stored on the medium, and when the program is executed by a processor, the method of the third aspect is implemented.
  • FIG. 1 shows a block diagram of an example environment in which implementations of the present disclosure can be implemented.
  • FIG. 2 shows a block diagram of a translation model for translating a source language into a target language, according to some implementations of the present disclosure.
  • FIG. 3 shows a block diagram of a decoder in a translation model, according to some implementations of the present disclosure.
  • FIG. 4 shows a block diagram of a training process for training a decoder, according to some implementations of the present disclosure.
  • FIG. 5 shows a block diagram of prediction information provided to each node in an implicit layer, according to some implementations of the present disclosure.
  • FIG. 6 shows a block diagram of nodes in an implicit layer in a decoder, according to some implementations of the present disclosure.
  • FIG. 7 shows a flowchart of a method for language translation based on layer prediction, according to some implementations of the present disclosure.
  • FIG. 8 shows a flowchart of a method for language translation based on layer prediction, according to some implementations of the present disclosure.
  • FIG. 9A shows a block diagram of an apparatus for language translation based on layer prediction, according to some implementations of the present disclosure.
  • FIG. 9B shows a block diagram of an apparatus for language translation based on layer prediction, according to some implementations of the present disclosure.
  • FIG. 10 shows a block diagram of a device capable of implementing various implementations of the present disclosure.
  • A model can learn the relationship between corresponding inputs and outputs from training data, so that after training is completed a corresponding output can be generated for a given input.
  • the generation of the model may be based on machine learning techniques.
  • Deep learning is a machine learning algorithm that uses multiple layers of processing units to process input and provide corresponding output.
  • a neural network model is an example of a deep learning based model.
  • a "model" may also be referred to herein as a "machine learning model", "learning model", "machine learning network", or "learning network", and these terms are used interchangeably herein.
  • a "neural network” is a machine learning network based on deep learning.
  • a neural network is capable of processing input and providing a corresponding output, and typically includes an input layer and an output layer and one or more hidden layers between the input layer and the output layer.
  • Neural networks used in deep learning applications often include many hidden layers, increasing the depth of the network.
  • the layers of the neural network are connected in sequence so that the output of the previous layer is provided as the input of the subsequent layer, where the input layer receives the input of the neural network, and the output of the output layer serves as the final output of the neural network.
  • Each layer of a neural network consists of one or more nodes (also known as processing nodes or neurons), each of which processes input from the previous layer.
  • machine learning can roughly include three phases, namely a training phase, a testing phase, and an application phase (also known as the inference phase).
  • In the training phase, a given model can be trained using a large amount of training data, and the parameter values are updated iteratively until the model can obtain consistent inferences that meet the expected goals from the training data.
  • a model can be thought of as being able to learn associations from inputs to outputs (also known as input-to-output mappings) from the training data.
  • the parameter values of the trained model are determined.
  • In the testing phase, the performance of the model is determined by applying test inputs to the trained model to test whether the model can provide the correct output.
  • In the application phase, the model can be used to process the actual input and determine the corresponding output based on the parameter values obtained by training.
  • a training data set including a large amount of training data can be used to train a translation model, so that the translation model can translate input content expressed in a source language into content expressed in a target language.
  • Various technical solutions for language translation have been proposed so far.
  • in a technical solution based on autoregression, each word in a sentence expressed in the source language can be predicted one by one.
  • for example, the translation of the first word can be predicted in the first processing pass, and the translations of the other words can be predicted one by one in subsequent passes.
  • although the above technical solution can achieve high accuracy, the translation speed is not satisfactory because multiple processing passes are required.
  • a technical solution based on non-autoregressive translation has also been proposed, which can output the translation results of all words in a sentence in a single processing pass.
  • however, the translation model does not know which words in the sentence have already been translated and which have not, which leads to inaccurate translation results. Therefore, it is expected that such translation models can be further improved in order to provide more accurate translation results.
  • FIG. 1 shows a block diagram of an example environment 100 in which implementations of the present disclosure can be implemented.
  • in the environment 100, a model (i.e., the translation model 130) can be trained and applied.
  • the environment 100 includes a model training system 150 and a model application system 152, and the translation model 130 can be implemented using a transformer encoder/decoder-based architecture.
  • the upper part of FIG. 1 shows the process of the model training phase, and the lower part shows the process of the model application phase.
  • the parameter values of the translation model 130 may have initial values, or may have pre-trained parameter values obtained through a pre-training process.
  • the parameter values of the translation model 130 can be updated and adjusted.
  • the translation model 130' can be obtained after the training is completed. At this point, the parameter values of the translation model 130' have been updated, and based on the updated parameter values, the translation model 130' can be used to implement translation tasks in the application phase.
  • the translation model 130 can be trained using a model training system 150 based on a training data set 110 including a plurality of training data 112 .
  • each training data 112 may take the form of a 2-tuple, for example comprising input data 120 and output data 122.
  • the source language and the target language may also be any two different languages among the following: Japanese, French, Russian, Spanish, and so on.
  • input data 120 may include character strings in a source language (e.g., the Chinese sentence "她是谁"), and output data 122 may include character strings in a target language (e.g., the English sentence "who is she").
  • Translation model 130 may be trained using training data 112 including input data 120 and output data 122 . Specifically, the training process can be performed iteratively using a large amount of training data. After the training is completed, the translation model 130' can convert the data to be translated expressed in the source language into a translation result expressed in the target language. In the model application stage, the translation model 130' can be invoked by the model application system 152 (the translation model 130' at this time has the parameter values after training), and the above-mentioned translation tasks can be performed. For example, data to be translated 140 may be received and translation results 142 output.
  • the model training system 150 and the model application system 152 may include any computing system with computing capabilities, such as various computing devices/systems, terminal devices, servers, and the like.
  • Terminal equipment may refer to any type of mobile terminal, fixed terminal, or portable terminal, including mobile phones, desktop computers, laptop computers, notebook computers, netbook computers, tablet computers, media computers, multimedia tablets, or any combination of the foregoing, including accessories and peripherals for these devices, or any combination thereof.
  • Servers include, but are not limited to, mainframes, edge computing nodes, computing devices in cloud environments, and the like.
  • model training system 150 and model application system 152 may be integrated in the same system or device. Implementations of the present disclosure are not limited in this regard. Exemplary implementations of model training and model application will be described below with continued reference to the accompanying drawings.
  • the translation model 130 shown in FIG. 1 can be implemented based on the transformer encoder/decoder architecture.
  • FIG. 2 shows a block diagram 200 of a translation model for translating a source language into a target language, according to some implementations of the present disclosure.
  • the translation model 130 according to an exemplary implementation of the present disclosure may be implemented using a transformer-based encoder/decoder architecture.
  • the translation model 130 may be implemented using various encoders/decoders or variations thereof that are currently known and/or will be developed in the future.
  • transformer 210 may include encoder 220 and decoder 230.
  • the encoder 220 can map the input data 120 to appropriate codes via a plurality of layers 222, 224, ..., and 226. Further, the codes output by the encoder 220 may be input to one or more layers 232, 234, ..., and 236 of the decoder 230.
  • the encoder 220 and the decoder 230 may respectively include a plurality of layers (for example, 6 layers or other numbers), and the plurality of layers operate in coordination to respectively implement encoding and decoding functions. It will be understood that the various layers here are respectively located inside the encoder and the decoder, so these layers can be called implicit layers. Each hidden layer may have a corresponding hidden state (eg, represented by a multidimensional vector). Inside the encoder 220 and the decoder 230, the implicit state of each implicit layer may be processed so as to use the processed updated implicit state as the implicit state of the next layer.
  • a decoder for layer prediction is proposed.
  • translation results can be predicted based on the implicit state associated with each layer. Further, the predicted translation results based on the previous layer can be fed to the next layer for subsequent processing.
  • a decoder implemented based on layer prediction can know the predicted translation result at the upper layer, and thus can realize the decoding task in a more accurate manner.
  • the implicit layers of the decoder obtained in this way work in harmony, and the final translation is output at the last layer.
  • the translation model 130 can output the translation of each word at the same time, which can greatly improve the translation speed.
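  • By way of illustration, the sketch below shows how such a layer-prediction decoder could be assembled in PyTorch; the class and member names (LayerPredictionDecoder, fuse, vocab_proj) are assumptions made for this sketch rather than identifiers from the present disclosure.

```python
# A minimal sketch of a layer-prediction decoder in PyTorch. The class and
# member names (LayerPredictionDecoder, fuse, vocab_proj) are illustrative
# assumptions and are not taken from the patent.
import torch
import torch.nn as nn

class LayerPredictionDecoder(nn.Module):
    def __init__(self, d_model=512, n_heads=8, n_layers=6, vocab_size=32000):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_layers)
        )
        self.embed = nn.Embedding(vocab_size, d_model)    # target-word embeddings
        self.vocab_proj = nn.Linear(d_model, vocab_size)  # implicit state -> vocabulary logits
        self.fuse = nn.Linear(2 * d_model, d_model)       # concat(state, prediction) -> state

    def forward(self, decoder_input, encoder_out):
        h = decoder_input                                 # (batch, T_y, d_model)
        for layer in self.layers:
            h = layer(h, encoder_out)                     # implicit state of this layer
            y_hat = self.vocab_proj(h).argmax(dim=-1)     # tentative translation per position
            # feed the layer-wise prediction forward: the updated implicit state
            # becomes the implicit state seen by the next implicit layer
            h = self.fuse(torch.cat([h, self.embed(y_hat)], dim=-1))
        return self.vocab_proj(h)                         # logits for the final translation

# usage: all T_y positions are decoded in one parallel pass (non-autoregressive)
decoder = LayerPredictionDecoder()
encoder_out = torch.randn(2, 7, 512)        # encoder output for a source of length 7
decoder_in = torch.randn(2, 3, 512)         # T_y = 3 target positions
logits = decoder(decoder_in, encoder_out)   # shape (2, 3, 32000)
```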
  • FIG. 3 shows a block diagram 300 of a decoder 310 in the translation model 130 according to some implementations of the disclosure.
  • the decoder 310 may include a plurality of layers 312, 314, ..., and 316, among others.
  • each layer may include a certain number of nodes, where the number of nodes is determined based on the length of the output data.
  • In the training phase, the length is determined based on the real length of the output data in the training data; in the application phase, the length can be predicted based on various techniques currently known and/or to be developed in the future. It will be understood that the number of nodes included in each layer is equal: assuming that the length of the output data in the training data is 3, each layer includes 3 nodes and each node corresponds to a word in the output data.
  • Node 320 schematically shows a node in a layer in the decoder 310 (e.g., the last node in the first layer), and more details of node 320 are shown on the right side of FIG. 3.
  • node 320 corresponds to a position in the output data (that is, corresponds to a word in the translation result).
  • the prediction information 324 represents prediction information corresponding to the position.
  • an updated implicit state 326 may be generated based on the prediction information 324 and the implicit state 322 corresponding to the position in the translation result.
  • the updated implicit state 326 includes not only the implicit state information of the current layer, but also the prediction information of the translated word at this position.
  • the node 320 described above may be implemented at one or more positions in one or more hidden layers. In this way, at each hidden layer, the decoder 310 can know the prediction information corresponding to each translated word, which helps the decoder 310 to train the translation model in a more accurate manner.
  • the node 320 described above may be implemented at each position at each layer of the decoder 310.
  • the accuracy of the translation model can be further improved based on the translation information corresponding to each translated word.
  • each node in each layer in the decoder 310 can know the word corresponding to the position of the node in the translation result, which can help to eliminate repeated translations, missing words, or inaccurate positions of translated words. In this way, the accuracy of the translation model 130 can be improved.
  • FIG. 4 shows a block diagram 400 of a training process for training the decoder 310 according to some implementations of the present disclosure.
  • a training data set 110 may be obtained, where the training data set 110 may include a large amount of labeled training data 112 .
  • each training data 112 may include input data 120 in a source language (e.g., the Chinese "她是谁") and output data 122 in a target language (e.g., "who is she").
  • the translation model 130 may be trained using the training data 112 .
  • input data 120 may be input into various encoders (eg, encoder 220 ) that are currently known and/or will be developed in the future to extract an encoding 410 of input data 120 .
  • the codes 410 may be stored in a variety of formats; for example, the extracted codes 410 may be stored as multi-dimensional vectors (e.g., 512 dimensions or other dimensions).
  • the encoding 410 may then be provided to a layer prediction based decoder 310 .
  • the number of nodes in each layer is determined based on the length of the output data 122, which for the output data 122 in the training data 112 is three.
  • layer 312 may include three nodes 420 , 422 and 320 .
  • FIG. 4 only uses one layer 312 in the decoder 310 as an example for description, and similar processing can be performed on other layers in the decoder 310 . Implicit states associated with layers 312 in decoder 310 may be determined based on encoding 410 .
  • the encoder 220 can take the input data 120 as input and map it to multiple discrete values based on an embedding function. Specifically, at each layer, a multi-head attention operation can be performed in order to determine the deep text representation, where each attention head computes, as in Equation 1: $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\big(QK^{\top}/\sqrt{d_k}\big)\,V$.
  • the above operations can be repeated for each layer in the encoder 220, and the output of the last layer in the encoder 220 is denoted as E.
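  • For reference, the scaled dot-product attention underlying a standard multi-head attention operation (assumed here to correspond to Equation 1) can be sketched as follows:

```python
# A short sketch of the scaled dot-product attention that each head of a
# standard multi-head attention computes; Equation 1 is assumed to follow
# the usual transformer formulation.
import math
import torch

def attention(q, k, v):
    # softmax(Q K^T / sqrt(d_k)) V
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    return torch.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(2, 5, 64)   # (batch, sequence length, d_k): self-attention
out = attention(q, k, v)            # shape (2, 5, 64)
```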
  • the length $T_y$ of the output data can be determined.
  • various technical solutions have been proposed to determine the length of the output data.
  • the length can be determined based on the real length of the output data.
  • the code 410 output by the encoder 220 may be input to the decoder 310 .
  • the implicit state at layer $n$ is expressed as $h^{(n)}$. At each implicit layer $n$ of the conventional decoder 230, the implicit state of the next layer can be computed based on Equation 2: $h^{(n+1)} = \mathrm{DecoderLayer}_{n+1}\big(h^{(n)}, E\big)$, where $E$ denotes the output of the encoder; the implicit state at the last layer of the decoder is thereby obtained.
  • the softmax function can be used to predict the target word in the translation result based on the last implicit state of the decoder.
  • a layer prediction process based on the implicit state of each layer is added to the training process.
  • the implicit state associated with each implicit layer can be determined by referring to the procedures described above. For example, the length of the output data can be used in the training process to determine the number of nodes included at each layer, thereby determining multiple positions associated with the output data. As shown in FIG. 4, it can be determined that three nodes 420, 422 and 320 are included at layer 312. Further, a plurality of parts respectively corresponding to the plurality of positions may be determined based on the code 410 output from the encoder. For example, the implicit state corresponding to the t-th position of the n-th layer in the decoder 310 can be expressed as $h_t^{(n)}$.
  • predictive information associated with the output data may be determined.
  • the prediction information of the translation result can be generated based on the implicit state of each layer, and the generated prediction information can be fed to the next layer as a tentative translation for further processing.
  • the prediction information corresponding to the t-th position of the n-th layer, denoted here as $\hat{y}_t^{(n)}$, can be generated based on various methods.
  • the prediction information can be generated based on the prediction of the translation model, or directly based on the ground truth of the output data.
  • prediction information corresponding to the given position may be determined based on the translation model. For example, for position t among the multiple positions, the t-th word in the translation result of the current translation model can be used as the prediction information $\hat{y}_t^{(n)}$. For the node 320 in FIG. 4, assuming that the translation result output by the decoder 310 is "who is she", the word "she" at the third position can be used as the prediction information for the position of the node 320.
  • the translation model may output other translation results, for example under insufficient training and/or other abnormal conditions; for example, the translation model may output "who is he", in which case the word "he" can be used as the prediction information.
  • the prediction information may not be completely accurate at this time; however, it carries information about the position-related translated words determined based on the implicit state of the previous layer, thus helping to improve the accuracy of the translation model.
  • the prediction information can also be determined directly based on the ground truth in the training data. For example, for a given position among the plurality of positions, the ground-truth data corresponding to that position in the output data 122 of the training data 112 may be used as the prediction information. At this time, the three words "who", "is" and "she" in the output data 122 "who is she" can be directly used as the prediction information for the three nodes 420, 422 and 320 respectively.
  • each hidden layer may include multiple positions, and during the training process, prediction information determined based on the translation model may be input to some of the multiple positions.
  • the translation model can be used to determine the prediction information based on Formula 5: $\hat{y}_t^{(n)} = \arg\max_{w}\; P\big(w \mid h_t^{(n)}\big)$.
  • that is, Formula 5 expresses that $\hat{y}_t^{(n)}$ is the prediction of the most likely translated word based on the current implicit state.
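  • The following is a minimal sketch of how Formula 5 can be realized, assuming a learned projection from the implicit state onto the vocabulary; the name vocab_proj is illustrative:

```python
# A minimal sketch of Formula 5: project the implicit state of layer n onto
# the vocabulary with softmax and take the most likely word at each of the
# T_y positions; vocab_proj is an assumed learned projection.
import torch
import torch.nn as nn

d_model, vocab_size = 512, 32000
vocab_proj = nn.Linear(d_model, vocab_size)

h_n = torch.randn(1, 3, d_model)                 # implicit states h_t^(n), T_y = 3
probs = torch.softmax(vocab_proj(h_n), dim=-1)   # distribution over the vocabulary
y_hat = probs.argmax(dim=-1)                     # prediction information per position
```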
  • FIG. 5 shows a block diagram 500 of prediction information provided to various nodes in an implicit layer according to some implementations of the present disclosure.
  • the training data set 110 further includes training data 510, wherein the input data 512 is a source-language sentence (e.g., the Chinese "你是谁") and the output data 514 is "who are you".
  • the length of the output data 514 is 3, and for the 3 positions in the translation result, the word "who" from the ground truth and the predictions 522 and 524 from the translation model can be used as the prediction information, respectively.
  • the prediction information will involve the mixture of the two prediction methods, and the mixture ratio at this time is 1:2.
  • FIG. 5 only shows that the prediction information from the ground truth is input to the first position
  • the prediction information from the ground truth can be used for any other one or more of the multiple positions.
  • assuming that the length of the output data of another training data is 10, the prediction information from the ground truth can be used for the 1st, 2nd, and 4th positions, and the prediction information from the translation model can be used for the other 7 positions. At this time, the mixing ratio is 3:7.
  • at least a portion of locations may be randomly selected from a plurality of locations, and the prediction information from the ground truth is applied to the selected locations.
  • using the prediction information from the ground truth can make the training process more consistent with the human-annotated training data, thereby making the translation model more accurate. Furthermore, using the prediction information output from the translation model ensures that the training process takes into account the prediction results of the model being trained, thereby making the trained translation model more consistent with the training goal.
  • the prediction information can be determined at each hidden layer following the same rules.
  • if the prediction information from the ground truth is applied to the 1st position at the 1st layer, and the prediction information from the translation model is applied to the other positions, then the other implicit layers (the subsequent 2nd, 3rd, and so on) should also apply the prediction information from the ground truth to the 1st position and the prediction information from the translation model to the other positions.
  • each hidden layer can be trained based on the same prediction information, thereby improving the accuracy of the translation model.
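  • The mixing described above can be sketched as follows; the random per-position selection and the ratio handling are assumptions for illustration, and the mask is computed once so the same selection can be reused at every implicit layer:

```python
# A sketch of mixing the two sources of prediction information during
# training: ground-truth words at a randomly chosen subset of positions and
# model predictions elsewhere. The random selection is an illustrative
# assumption; the mask is computed once so the same choice can be reused at
# every implicit layer, as described above.
import torch

def mix_prediction_info(y_true, y_model, ratio=1 / 3):
    # y_true, y_model: (batch, T_y) word ids; ratio: fraction of positions
    # that receive ground-truth prediction information
    mask = torch.rand_like(y_true, dtype=torch.float) < ratio
    return torch.where(mask, y_true, y_model), mask

y_true = torch.tensor([[11, 22, 33]])    # ids of the ground truth "who is she"
y_model = torch.tensor([[11, 22, 44]])   # the model predicted "he" at position 3
mixed, mask = mix_prediction_info(y_true, y_model)   # e.g., mixing ratio 1:2
```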
  • FIG. 6 shows a block diagram 600 of nodes in an implicit layer in a decoder according to some implementations of the present disclosure.
  • at each implicit layer, an updated implicit state can be generated for each position based on both the implicit state $h_t^{(n)}$ of the layer and the associated prediction information $\hat{y}_t^{(n)}$.
  • a vector 610 (e.g., derived from the output of the encoder) can be input to the node 320 as the implicit state 322, and a vector 620 (e.g., derived from the ground-truth word "she" in the output data 122) can be input to the node 320 as the prediction information 324.
  • the predicted word with the greatest probability can be determined based on Formula 6: $\hat{y}_t^{(n)} = \arg\max_{w}\; P\big(w \mid h_t^{(n)}\big)$.
  • Formula 7 expresses the concatenation of the implicit state $h_t^{(n)}$ and the embedding of the associated prediction information, projected so that the result has the same dimensions as the previous vector representation: $\bar{h}_t^{(n)} = W\big[h_t^{(n)};\, \mathrm{Emb}(\hat{y}_t^{(n)})\big]$.
  • $s_t$ depends on the hyperparameter $\lambda$ used to control the mixing ratio, and $s_t$ does not depend on the number $n$ of hidden layers.
  • Similar processing can be performed at each implicit layer based on the formula described above, so as to generate an updated first implicit state.
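  • The following sketch illustrates the update of Formula 7, assuming the prediction information for each position has already been chosen; the fuse projection that restores the original dimensionality is an assumed implementation detail:

```python
# A sketch of Formula 7, assuming the prediction information for each
# position has already been chosen (via Formula 5/6 or from the ground
# truth): concatenate the implicit state with the embedding of that word and
# project back so the updated state keeps the original dimensionality.
import torch
import torch.nn as nn

d_model, vocab_size = 512, 32000
embed = nn.Embedding(vocab_size, d_model)
fuse = nn.Linear(2 * d_model, d_model)   # restores d_model after the concatenation

h = torch.randn(1, 3, d_model)           # implicit states h_t^(n) of the current layer
y_info = torch.tensor([[11, 22, 33]])    # prediction information per position
h_updated = fuse(torch.cat([h, embed(y_info)], dim=-1))   # passed on to layer n + 1
```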
  • an updated implicit state may be output to a subsequent implicit layer after the current implicit layer among the plurality of implicit layers, so that the updated implicit state is used as the implicit state associated with that subsequent implicit layer.
  • the process described above may first be performed at the first implicit layer of the decoder 310, and the updated implicit state obtained at the first layer is output to the second implicit layer of the decoder 310 as the implicit state of the second implicit layer.
  • the process described above may then be performed at the second implicit layer, and the updated implicit state obtained at the second layer is output to the third implicit layer of the decoder 310 as the implicit state of the third implicit layer, and so on, until similar processing has been performed at all implicit layers.
  • the decoder 310 constructed in the manner described above may be trained based on the training data 112 .
  • a loss function describing the training objective may be generated for the hidden layer 312 .
  • a loss function can be generated based on a portion of the hidden state.
  • the loss function can be generated based on any one or more of the 3 locations.
  • the loss function can be expressed as Formula 9: $\mathcal{L}_n = -\sum_{t=1}^{T_y} \log P\big(y_t \mid \bar{h}_t^{(n)}\big)$
  • where $T_y$ indicates the length of the output data, $t$ indicates the position of each word in the output data, and $n$ denotes any one of the plurality of implicit layers.
  • the formula represents the difference between the prediction obtained based on the updated implicit state and the output data in the training data.
  • the translation model is trained using the input data and output data in the training data, so that the loss function in Formula 9 satisfies a predetermined condition (for example, reaches a predetermined convergence state).
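  • As an illustration, the per-layer loss of Formula 9 could be realized as a cross-entropy over the $T_y$ positions; the identifiers below are assumptions for this sketch:

```python
# Formula 9 realized, for example, as a cross-entropy between the predictions
# made from the updated implicit states of one layer and the ground-truth
# output data, summed over the T_y positions; all identifiers are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, vocab_size, T_y = 512, 32000, 3
vocab_proj = nn.Linear(d_model, vocab_size)

h_updated = torch.randn(1, T_y, d_model)   # updated implicit states of layer n
y = torch.tensor([[11, 22, 33]])           # ground-truth output data
logits = vocab_proj(h_updated)
loss_n = F.cross_entropy(logits.view(-1, vocab_size), y.view(-1), reduction="sum")
```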
  • the prediction information for each position can be introduced at one or some implicit layers. In this way, word predictions at other positions can be taken into account at the various implicit layers, so as to improve the accuracy of the translation model.
  • similar operations may be performed for each hidden layer in the decoder 310 .
  • similar operations can be performed at the first hidden layer, the second hidden layer, ..., the nth hidden layer, and corresponding loss functions can be constructed.
  • an overall loss function can be constructed based on the above-mentioned multiple corresponding loss functions. Specifically, the overall loss function can be constructed based on Formula 10: $\mathcal{L} = \sum_{n=1}^{N} \mathcal{L}_n$, where $N$ is the number of implicit layers.
  • the translation model may be trained based on Formula 10, so that the loss function determined based on Formula 10 satisfies a predetermined condition (for example, reaches a predetermined convergence state).
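  • Under the assumption of an unweighted sum over layers, Formula 10 could be sketched as follows:

```python
# A sketch of Formula 10 under the assumption of an unweighted sum over the
# per-layer losses (the patent text here does not fix a weighting).
import torch

layer_losses = [torch.tensor(2.3), torch.tensor(1.9), torch.tensor(1.5)]
total_loss = torch.stack(layer_losses).sum()   # minimized until convergence
```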
  • the decoder 310 at different implicit layers may be aware of predictions determined based on the implicit state associated with each implicit layer. In this way, the loss at each hidden layer can be further considered and deeply supervised when constructing the loss function, thereby improving the accuracy of the translation model.
  • the trained translation model 130' can be provided to the model application system 152 shown in FIG. 1 for use in processing the data 140 to be translated. Specifically, after the model training phase has been completed, the received data to be translated can be processed using the already trained translation model 130' with the trained parameter values. Returning to FIG. 1, more information about the model application process is described. Data to be translated 140 may be input to the trained translation model 130'. At this point, the fully trained translation model 130' can translate the data to be translated 140 from the source language to the target language. For example, the translation model 130' may output the translation result 142.
  • data to be translated 140 expressed in a source language may be received, and an encoder in the translation model 130' may extract a corresponding code from the data to be translated 140.
  • the decoder 310 in the translation model 130' can receive this encoding, and the various hidden layers in the decoder 310 can operate in a similar manner to the training process described above.
  • At a first implicit layer among the plurality of implicit layers of the decoder 310, a first implicit state associated with the first implicit layer may be determined.
  • prediction information associated with the translation result may be determined based on the first implicit state (for example, based on Formula 5 described above).
  • an updated first implicit state may be generated based on the first implicit state and the prediction information (e.g., based on Formulas 6 and 7 described above).
  • the updated first implicit state may be input to the second implicit layer after the first implicit layer of the plurality of implicit layers, so that the translation model uses the updated first implicit state as the second implicit state associated with the second implicit layer.
  • the process described above may be repeated at each hidden layer until at the last hidden layer of decoder 310 the implicit state associated with the last hidden layer is obtained.
  • the final translation result may be determined based on the implicit state associated with the last implicit layer.
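  • As an illustration of the application phase, the sketch below reuses the LayerPredictionDecoder assumed earlier; the length-prediction step is represented only by its result:

```python
# A sketch of the application phase, assuming the LayerPredictionDecoder
# sketched earlier: one parallel pass through all implicit layers, with the
# translation result read off the implicit state of the last layer.
import torch

@torch.no_grad()
def translate(decoder, encoder_out, predicted_len, d_model=512):
    # predicted_len: target length estimated by a length-prediction technique
    decoder_in = torch.zeros(encoder_out.size(0), predicted_len, d_model)
    logits = decoder(decoder_in, encoder_out)   # all positions decoded at once
    return logits.argmax(dim=-1)                # word ids of the translation result
```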
  • in the translation model 130', at each hidden layer of the decoder, the output from the previous layer is no longer directly used as the input; instead, prediction information based on the implicit state of the current hidden layer is added. In this way, the translation result of the translation model can be made more consistent with the translation rules between the two languages. Further, the translation model 130' can output all words in the translation result at the same time; in this way, the translation speed can be greatly improved.
  • FIG. 7 shows a flowchart of a method 700 for language translation based on layer prediction according to some implementations of the present disclosure. Specifically, at block 710, at the first implicit layer among the plurality of implicit layers of the decoder of the translation model, a first implicit state associated with the first implicit layer is determined based on the encoding of the input data included in the training data; the training data includes input data expressed in the source language and output data expressed in the target language, and the translation model is used to translate the input data into the output data.
  • a plurality of positions associated with the first implicit state may be determined based on the length of the output data. Further, a plurality of parts respectively corresponding to a plurality of positions in the first implicit state may be determined.
  • predictive information associated with the output data may be determined.
  • the prediction information for the given position may be determined based on either of the following: the translation model; or ground-truth data.
  • an updated first implicit state may be generated based on the first implicit state and the prediction information.
  • an updated first implicit state may be generated based on the part of the first implicit state corresponding to a given position and the prediction information for the given position.
  • a mixing ratio of prediction information determined based on the translation model and prediction information determined based on output data may be obtained. Further, an updated first implicit state may be generated based on the mixing ratio, the first implicit state and prediction information.
  • an updated first implicit state is output to a second implicit layer following the first implicit layer of the plurality of implicit layers, such that the updated first implicit state is used as the second implicit state associated with the second implicit layer.
  • an updated second implicit state may be generated based on the second implicit state and prediction information. Further, the updated second implicit state may be output to a third implicit layer following the second implicit layer among the plurality of implicit layers, so that the updated second implicit state is taken as the third implicit state associated with the third implicit layer.
  • a first training target associated with the first hidden layer may be generated.
  • the translation model may be trained by using the input data and the output data, so that the first training target satisfies the first predetermined condition.
  • a difference between output data and a prediction based on the first hidden state may be determined. Further, the first training target may be generated based on the difference.
  • a second training target associated with the second hidden layer may be generated. Further, the translation model may be trained by using the input data and the output data, so that the second training target satisfies the second predetermined condition.
  • the training target of the translation model may be determined based on the first training target and the second training target. Further, the translation model can be trained by using the input data and the output data, so that the training target satisfies a predetermined condition.
  • FIG. 8 shows a flowchart of a method 800 for layer prediction based language translation according to some implementations of the present disclosure.
  • an encoding of data to be translated expressed in a source language is received, and a first implicit state associated with a first implicit layer of a plurality of implicit layers in a translation model is determined; the translation model is used to translate the data to be translated expressed in the source language into a translation result expressed in the target language.
  • predictive information associated with the translation result is determined based on the first implicit state.
  • an updated first implicit state is generated based on the first implicit state and the prediction information.
  • the updated first implicit state is input to a second implicit layer following the first implicit layer of the plurality of implicit layers, so that the translation model uses the updated first implicit state as the second implicit state associated with the second implicit layer.
  • a translation result is determined based on an implicit state associated with the last implicit layer.
  • FIG. 9A shows a block diagram of an apparatus 900A for language translation based on layer prediction according to some implementations of the present disclosure.
  • an apparatus 900A includes a determination unit 910A, a prediction unit 920A, a generation unit 930A, and an output unit 940A.
  • the determining unit 910A is configured to determine, at the first implicit layer among the plurality of implicit layers of the decoder of the translation model, a first implicit state associated with the first implicit layer based on the encoding of the input data included in the training data, the training data including input data expressed in the source language and output data expressed in the target language, the translation model being used to translate the input data into the output data; the prediction unit 920A is configured to determine prediction information associated with the output data; the generating unit 930A is configured to generate an updated first implicit state based on the first implicit state and the prediction information; and the output unit 940A is configured to output the updated first implicit state to the second implicit layer following the first implicit layer among the plurality of implicit layers, such that the updated first implicit state is taken as the second implicit state associated with the second implicit layer.
  • the determining unit 910A is further configured to: determine a plurality of positions associated with the first implicit state based on the length of the output data; and determine, in the first implicit state, a plurality of parts respectively corresponding to the plurality of positions.
  • the apparatus 900A may further include a training unit configured to train the translation model by: generating a first training target associated with the first hidden layer; and using the input data and the output data to train the translation model, so that the first training target satisfies a first predetermined condition.
  • the training unit is further configured to: determine a difference between the output data and the prediction based on the first implicit state; and generate the first training target based on the difference.
  • the training unit is further configured to: generate a second training target associated with the second hidden layer; and use the input data and the output data to train the translation model, so that the second training target satisfies the second predetermined condition.
  • the training unit is further configured to: determine the training target of the translation model based on the first training target and the second training target; and use the input data and the output data to train the translation model, so that the training target satisfies the predetermined condition.
  • the prediction unit 920A is further configured to, for a given position among the plurality of positions, determine the prediction information for the given position based on either of the following: the translation model; or the ground-truth data corresponding to the given position.
  • the generating unit 930A is further configured to generate the updated first implicit state based on the part of the first implicit state corresponding to the given position and the prediction information for the given position.
  • the generating unit 930A is further configured to: obtain a mixing ratio of the prediction information determined based on the translation model to the prediction information determined based on the output data; and generate the updated first implicit state based on the mixing ratio, the first implicit state, and the prediction information.
  • the generating unit 930A is further configured to generate an updated second implicit state based on the second implicit state and prediction information.
  • the output unit 940A is further configured to output the updated second implicit state to a third implicit layer following the second implicit layer among the plurality of implicit layers, so that the updated second implicit state is used as the third implicit state associated with the third implicit layer.
  • FIG. 9B shows a block diagram of an apparatus 900B for language translation according to some implementations of the present disclosure.
  • the apparatus 900B includes a receiving unit 910B, a determining unit 920B, a generating unit 930B, and an output unit 940B.
  • the receiving unit 910B is configured to receive the encoding of the data to be translated expressed in the source language, and to determine a first implicit state associated with the first implicit layer among the plurality of implicit layers in the translation model, the translation model being used to translate the data to be translated expressed in the source language into a translation result expressed in the target language; the determining unit 920B is configured to determine prediction information associated with the translation result based on the first implicit state; the generating unit 930B is configured to generate an updated first implicit state based on the first implicit state and the prediction information; and the output unit 940B is configured to output the updated first implicit state to the second implicit layer following the first implicit layer among the plurality of implicit layers, so that the translation model uses the updated first implicit state as the second implicit state associated with the second implicit layer.
  • the apparatus 900B further includes: a translation unit configured to determine, at the last implicit layer among the plurality of implicit layers, the translation result based on the implicit state associated with the last implicit layer.
  • FIG. 10 shows a block diagram of a device 1000 capable of implementing various implementations of the present disclosure. It should be understood that the computing device 1000 shown in FIG. 10 is exemplary only and should not constitute any limitation as to the functionality and scope of the implementations described herein. The computing device 1000 shown in FIG. 10 can be used to implement the model training system 150 shown in FIG. 1 , and can also be used to implement the model application system 152 shown in FIG. 1 .
  • computing device 1000 is in the form of a general-purpose computing device.
  • Components of computing device 1000 may include, but are not limited to, one or more processors or processing units 1010, memory 1020, storage devices 1030, one or more communication units 1040, one or more input devices 1050, and one or more output devices 1060.
  • the processing unit 1010 may be an actual or virtual processor and can perform various processing according to programs stored in the memory 1020 . In a multi-processor system, multiple processing units execute computer-executable instructions in parallel to increase the parallel processing capability of the computing device 1000 .
  • Computing device 1000 typically includes a plurality of computer storage media. Such media can be any available media that is accessible by computing device 1000, including but not limited to, volatile and nonvolatile media, removable and non-removable media.
  • Memory 1020 can be volatile memory (e.g., registers, cache, random access memory (RAM)), nonvolatile memory (e.g., read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory), or some combination thereof.
  • Storage device 1030 may be removable or non-removable media, and may include machine-readable media, such as flash drives, magnetic disks, or any other media that may be capable of storing information and/or data (e.g., training data for training) and that can be accessed within computing device 1000.
  • the computing device 1000 may further include additional removable/non-removable, volatile/nonvolatile storage media.
  • a disk drive for reading from or writing to a removable, nonvolatile disk (such as a "floppy disk"), and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (such as a CD-ROM), may be provided.
  • each drive may be connected to the bus (not shown) by one or more data media interfaces.
  • the memory 1020 may include a computer program product 1025 having one or more program modules configured to perform the various methods or actions of the various implementations of the present disclosure.
  • the communication unit 1040 enables communication with other computing devices through communication media. Additionally, the functionality of the components of computing device 1000 may be implemented in a single computing cluster or as a plurality of computing machines capable of communicating via communication links. Accordingly, computing device 1000 may operate in a networked environment using logical connections to one or more other servers, a network personal computer (PC), or another network node.
  • the input device 1050 may be one or more input devices, such as a mouse, keyboard, trackball, and the like.
  • Output device 1060 may be one or more output devices, such as a display, speakers, printer, or the like.
  • the computing device 1000 can also communicate, as needed, through the communication unit 1040 with one or more external devices (not shown) such as storage devices and display devices, with one or more devices that enable the user to interact with the computing device 1000, or with any device (e.g., network card, modem, etc.) that enables the computing device 1000 to communicate with one or more other computing devices. Such communication may be performed via an input/output (I/O) interface (not shown).
  • a computer-readable storage medium on which computer-executable instructions are stored, wherein the computer-executable instructions are executed by a processor to implement the methods described above.
  • a computer program product tangibly stored on a non-transitory computer-readable medium and comprising computer-executable instructions, and the computer-executable instructions are executed by a processor to implement the method described above.
  • a computer program product on which a computer program is stored, the program implementing the method described above when executed by a processor.
  • These computer-readable program instructions may be provided to a processing unit of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processing unit of the computer or other programmable data processing apparatus, produce an apparatus for implementing the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.
  • These computer-readable program instructions can also be stored in a computer-readable storage medium, and these instructions cause computers, programmable data processing devices, and/or other devices to work in a specific way, so that the computer-readable medium storing the instructions includes an article of manufacture comprising instructions for implementing various aspects of the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.
  • computer-readable program instructions can be loaded onto a computer, other programmable data processing apparatus, or other equipment, so that a series of operational steps are performed on the computer, other programmable data processing apparatus, or other equipment to produce a computer-implemented process, such that the instructions executed on the computer, other programmable data processing apparatus, or other equipment implement the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.
  • each block in a flowchart or block diagram may represent a module, a program segment, or a portion of instructions that contains one or more executable instructions for implementing the specified logical functions.
  • the functions noted in the block may occur out of the order noted in the figures. For example, two blocks in succession may, in fact, be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending upon the functionality involved.
  • each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by a dedicated hardware-based system that performs the specified functions or actions, or may be implemented by a combination of dedicated hardware and computer instructions.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

According to an implementation of the present disclosure, provided are a language translation method and apparatus based on layer prediction, and a device and a medium. One method comprises: at a first implicit layer among a plurality of implicit layers of a decoder of a translation model, determining, on the basis of an encoding of input data comprised in training data, a first implicit state associated with the first implicit layer, wherein the training data comprises the input data represented in a source language and output data represented in a target language; determining prediction information associated with the output data; generating an updated first implicit state on the basis of the first implicit state and the prediction information; and outputting the updated first implicit state to a second implicit layer after the first implicit layer among the plurality of implicit layers, such that the updated first implicit state is used as a second implicit state associated with the second implicit layer. In this way, by providing prediction information to each implicit layer, a translation model can be realized in a more effective and accurate manner.

Description

Method, apparatus, device and medium for language translation based on layer prediction

This application claims priority to the Chinese invention patent application No. CN202111191528.8, entitled "Method, apparatus, device and medium for language translation based on layer prediction" and filed on October 13, 2021, the entire disclosure of which is incorporated herein by reference.
Technical Field

Exemplary implementations of the present disclosure generally relate to the field of computers, and in particular to a method, device, apparatus and computer-readable storage medium for language translation based on layer prediction.
Background Art

Language translation involves translating content expressed in a source language into content expressed in a target language. Various translation schemes have been proposed, but the translation speed and accuracy of the existing technical schemes are not satisfactory. It is therefore desirable to perform language translation in a more efficient and accurate manner.
Summary of the Invention

According to an exemplary implementation of the present disclosure, a solution for language translation based on layer prediction is provided.

In a first aspect of the present disclosure, a method for language translation based on layer prediction is provided. In the method, at a first implicit layer among a plurality of implicit layers of a decoder of a translation model, a first implicit state associated with the first implicit layer is determined based on an encoding of input data included in training data. The training data includes input data expressed in a source language and output data expressed in a target language, and the translation model is used to translate the input data into the output data. Prediction information associated with the output data is determined. An updated first implicit state is generated based on the first implicit state and the prediction information. The updated first implicit state is output to a second implicit layer after the first implicit layer among the plurality of implicit layers, so that the updated first implicit state is used as a second implicit state associated with the second implicit layer.

In a second aspect of the present disclosure, an electronic device is provided, including: at least one processing unit; and at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit. The instructions, when executed by the at least one processing unit, cause the device to perform actions. The actions include: at a first implicit layer among a plurality of implicit layers of a decoder of a translation model, determining a first implicit state associated with the first implicit layer based on an encoding of input data included in training data, the training data including input data expressed in a source language and output data expressed in a target language, the translation model being used to translate the input data into the output data; determining prediction information associated with the output data; generating an updated first implicit state based on the first implicit state and the prediction information; and outputting the updated first implicit state to a second implicit layer after the first implicit layer among the plurality of implicit layers, so that the updated first implicit state is used as a second implicit state associated with the second implicit layer.

In a third aspect of the present disclosure, a method for language translation based on layer prediction is provided. In the method, an encoding of data to be translated expressed in a source language is received, and a first implicit state associated with a first implicit layer among a plurality of implicit layers in a translation model is determined, the translation model being used to translate the data to be translated expressed in the source language into a translation result expressed in a target language. Prediction information associated with the translation result is determined based on the first implicit state. An updated first implicit state is generated based on the first implicit state and the prediction information. The updated first implicit state is input to a second implicit layer after the first implicit layer among the plurality of implicit layers, so that the translation model uses the updated first implicit state as a second implicit state associated with the second implicit layer.

In a fourth aspect of the present disclosure, an electronic device is provided, including: at least one processing unit; and at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit. The instructions, when executed by the at least one processing unit, cause the device to perform actions. The actions include: receiving an encoding of data to be translated expressed in a source language, and determining a first implicit state associated with a first implicit layer among a plurality of implicit layers in a translation model, the translation model being used to translate the data to be translated expressed in the source language into a translation result expressed in a target language; determining prediction information associated with the translation result based on the first implicit state; generating an updated first implicit state based on the first implicit state and the prediction information; and outputting the updated first implicit state to a second implicit layer after the first implicit layer among the plurality of implicit layers, so that the translation model uses the updated first implicit state as a second implicit state associated with the second implicit layer.

In a fifth aspect of the present disclosure, an apparatus for language translation based on layer prediction is provided, including: a determining unit configured to, at a first implicit layer among a plurality of implicit layers of a decoder of a translation model, determine a first implicit state associated with the first implicit layer based on an encoding of input data included in training data, the training data including input data expressed in a source language and output data expressed in a target language, the translation model being used to translate the input data into the output data; a prediction unit configured to determine prediction information associated with the output data; a generating unit configured to generate an updated first implicit state based on the first implicit state and the prediction information; and an output unit configured to output the updated first implicit state to a second implicit layer after the first implicit layer among the plurality of implicit layers, so that the updated first implicit state is used as a second implicit state associated with the second implicit layer.

In a sixth aspect of the present disclosure, an apparatus for language translation based on layer prediction is provided, including: a receiving unit configured to receive an encoding of data to be translated expressed in a source language and determine a first implicit state associated with a first implicit layer among a plurality of implicit layers in a translation model, the translation model being used to translate the data to be translated expressed in the source language into a translation result expressed in a target language; a determining unit configured to determine prediction information associated with the translation result based on the first implicit state; a generating unit configured to generate an updated first implicit state based on the first implicit state and the prediction information; and an output unit configured to output the updated first implicit state to a second implicit layer after the first implicit layer among the plurality of implicit layers, so that the translation model uses the updated first implicit state as a second implicit state associated with the second implicit layer.

In a seventh aspect of the present disclosure, a computer-readable storage medium is provided. A computer program is stored on the medium, and the program, when executed by a processor, implements the method of the first aspect.

In an eighth aspect of the present disclosure, a computer-readable storage medium is provided. A computer program is stored on the medium, and the program, when executed by a processor, implements the method of the third aspect.

It should be understood that what is described in this Summary is not intended to identify key or essential features of the implementations of the present disclosure, nor is it intended to limit the scope of the present disclosure. Other features of the present disclosure will become readily understandable from the following description.
Brief Description of the Drawings

The above and other features, advantages and aspects of various implementations of the present disclosure will become more apparent from the following detailed description taken in conjunction with the accompanying drawings. In the drawings, identical or similar reference numerals denote identical or similar elements, in which:

FIG. 1 shows a block diagram of an example environment in which implementations of the present disclosure can be implemented;

FIG. 2 shows a block diagram of a translation model for translating a source language into a target language according to some implementations of the present disclosure;

FIG. 3 shows a block diagram of a decoder in a translation model according to some implementations of the present disclosure;

FIG. 4 shows a block diagram of a training process for training the decoder according to some implementations of the present disclosure;

FIG. 5 shows a block diagram of prediction information provided to respective nodes in an implicit layer according to some implementations of the present disclosure;

FIG. 6 shows a block diagram of a node in an implicit layer of a decoder according to some implementations of the present disclosure;

FIG. 7 shows a flowchart of a method for language translation based on layer prediction according to some implementations of the present disclosure;

FIG. 8 shows a flowchart of a method for language translation based on layer prediction according to some implementations of the present disclosure;

FIG. 9A shows a block diagram of an apparatus for language translation based on layer prediction according to some implementations of the present disclosure;

FIG. 9B shows a block diagram of an apparatus for language translation based on layer prediction according to some implementations of the present disclosure; and

FIG. 10 shows a block diagram of a device capable of implementing various implementations of the present disclosure.
Detailed Description

Implementations of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although certain implementations of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the implementations set forth herein; rather, these implementations are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and implementations of the present disclosure are for exemplary purposes only and are not intended to limit the protection scope of the present disclosure.

In the description of the implementations of the present disclosure, the term "comprising" and similar expressions should be interpreted as open-ended inclusion, that is, "including but not limited to". The term "based on" should be understood as "based at least in part on". The term "an implementation" or "the implementation" should be understood as "at least one implementation". The term "some implementations" should be understood as "at least some implementations". Other definitions, both explicit and implicit, may also be included below.
As used herein, the term "model" refers to an entity that can learn associations between corresponding inputs and outputs from training data, so that after training is completed a corresponding output can be generated for a given input. The generation of the model may be based on machine learning techniques. Deep learning is a class of machine learning algorithms that processes inputs and provides corresponding outputs using multiple layers of processing units. A neural network model is an example of a deep-learning-based model. Herein, a "model" may also be referred to as a "machine learning model", "learning model", "machine learning network" or "learning network", and these terms are used interchangeably.

A "neural network" is a machine learning network based on deep learning. A neural network is capable of processing an input and providing a corresponding output, and typically includes an input layer, an output layer, and one or more implicit layers between the input layer and the output layer. Neural networks used in deep learning applications usually include many implicit layers, which increases the depth of the network. The layers of a neural network are connected in sequence, so that the output of a previous layer is provided as the input of a subsequent layer, where the input layer receives the input of the neural network and the output of the output layer serves as the final output of the neural network. Each layer of a neural network includes one or more nodes (also referred to as processing nodes or neurons), and each node processes the input from the previous layer.

Generally, machine learning can roughly include three stages, namely a training stage, a testing stage and an application stage (also referred to as an inference stage). In the training stage, a given model can be trained with a large amount of training data, iteratively updating parameter values until the model can consistently obtain, from the training data, inferences that meet the expected goal. Through training, the model can be considered to learn the association from input to output (also referred to as the input-to-output mapping) from the training data. The parameter values of the trained model are then fixed. In the testing stage, test inputs are applied to the trained model to test whether the model can provide correct outputs, thereby determining the performance of the model. In the application stage, the model can be used to process actual inputs based on the parameter values obtained by training and to determine the corresponding outputs.
In the field of language translation, a translation model can be trained with a training data set including a large amount of training data, so that the translation model can translate input content expressed in a source language into content expressed in a target language. Various technical solutions for language translation have been proposed.

In autoregressive solutions, the words of a sentence expressed in the source language are predicted one by one. Specifically, the translation of the first word is predicted in a first pass, and the translations of the other words are predicted one by one in subsequent passes. Although such solutions can achieve high accuracy, the multiple passes make the translation speed unsatisfactory. Non-autoregressive solutions have also been proposed, which can output the translations of all words in a sentence in a single pass. However, when translating every word at the same time, the translation model does not know which words in the sentence have already been translated and which have not, which leads to inaccurate translation results. It is therefore desirable to further improve such translation models so as to provide more accurate translation results.
Example Environment

FIG. 1 shows a block diagram of an example environment 100 in which implementations of the present disclosure can be implemented. In the environment 100 of FIG. 1, it is desirable to train and use a model (i.e., the translation model 130) configured to translate content expressed in a source language into content expressed in a target language. As shown in FIG. 1, the environment 100 includes a model training system 150 and a model application system 152, and the translation model 130 can be implemented using a transformer encoder/decoder based architecture. The upper part of FIG. 1 shows the process of the model training stage, and the lower part shows the process of the model application stage. Before training, the parameter values of the translation model 130 may have initial values, or may have pre-trained parameter values obtained through a pre-training process. Through the training process, the parameter values of the translation model 130 are updated and adjusted. After training is completed, the translation model 130' is obtained. At this point, the parameter values of the translation model 130' have been updated, and based on the updated parameter values, the translation model 130' can be used to perform translation tasks in the application stage.

In the model training stage, the translation model 130 can be trained by the model training system 150 based on a training data set 110 including a plurality of training data 112. Here, each training data 112 may take the form of a 2-tuple, for example including input data 120 and output data 122. In the context of the present disclosure, only Chinese and English are used as examples of the source and target languages to describe the specific details of the translation process. According to an exemplary implementation of the present disclosure, the source language and the target language may also be any two different languages among the following: Japanese, French, Russian, Spanish, and so on.

In the context of the present disclosure, the input data 120 may include a character string expressed in the source language, such as "她是谁", and the output data 122 may include a character string expressed in the target language, such as "who is she". The translation model 130 can be trained with the training data 112 including the input data 120 and the output data 122. Specifically, the training process can be performed iteratively with a large amount of training data. After training is completed, the translation model 130' can convert data to be translated expressed in the source language into a translation result expressed in the target language. In the model application stage, the model application system 152 can invoke the translation model 130' (which at this point has the trained parameter values) to perform the above translation task. For example, the data to be translated 140 can be received, and the translation result 142 can be output.

In FIG. 1, the model training system 150 and the model application system 152 may include any computing system with computing capability, such as various computing devices/systems, terminal devices, servers, and so on. A terminal device may be any type of mobile, fixed or portable terminal, including a mobile phone, desktop computer, laptop computer, notebook computer, netbook computer, tablet computer, media computer, or multimedia tablet, or any combination of the foregoing, including accessories and peripherals of these devices or any combination thereof. Servers include, but are not limited to, mainframes, edge computing nodes, computing devices in cloud environments, and so on.

It should be understood that the components and arrangement in the environment 100 shown in FIG. 1 are merely examples, and a computing system suitable for implementing the exemplary implementations described in the present disclosure may include one or more different components, other components, and/or a different arrangement. For example, although shown as separate, the model training system 150 and the model application system 152 may be integrated in the same system or device. Implementations of the present disclosure are not limited in this respect. Exemplary implementations of model training and model application will be described below with continued reference to the accompanying drawings.
Architecture of the Translation Model

According to implementations of the present disclosure, a method for language translation based on layer prediction is proposed. Specifically, the translation model 130 shown in FIG. 1 can be implemented based on a transformer encoder/decoder architecture. An overview of this architecture is first described with reference to FIG. 2, which shows a block diagram 200 of a translation model for translating a source language into a target language according to some implementations of the present disclosure. The translation model 130 according to an exemplary implementation of the present disclosure may be implemented using a transformer-based encoder/decoder architecture, for example using any of a variety of encoders/decoders, or variants thereof, that are currently known and/or will be developed in the future.

As shown in FIG. 2, the transformer 210 may include an encoder 220 and a decoder 230. The encoder 220 can map the input data 120 to an appropriate encoding via a plurality of layers 222, 224, ..., and 226. Further, the encoding output by the encoder 220 can be input to one or more layers 232, 234, ..., and 236 of the decoder 230, so as to convert the encoding into the corresponding output data 122.

In this architecture, the encoder 220 and the decoder 230 may each include a plurality of layers (for example, 6 layers or another number), and these layers operate in coordination to implement the encoding and decoding functions, respectively. It will be understood that these layers are located inside the encoder and the decoder, respectively, and can therefore be referred to as implicit layers. Each implicit layer can have a corresponding implicit state (for example, represented by a multidimensional vector). Inside the encoder 220 and the decoder 230, the implicit state of each implicit layer can be processed so that the processed, updated implicit state serves as the implicit state of the next layer.
According to an exemplary implementation of the present disclosure, a layer-prediction decoder is proposed. At each layer of the decoder, a translation result can be predicted based on the implicit state associated with that layer. Further, the translation result predicted at the previous layer can be fed to the next layer for subsequent processing (an illustrative sketch of this loop is provided at the end of this subsection). With the exemplary implementation of the present disclosure, a decoder implemented based on layer prediction knows the translation result predicted at the previous layer, and can therefore accomplish the decoding task in a more accurate manner. Further, the implicit layers of the decoder obtained in this way work in coordination, and the final translation result is output at the last layer. The translation model 130 can then output the translations of all words at the same time, which can greatly improve the translation speed.

In the following, the architecture of a decoder according to an exemplary implementation of the present disclosure will be described with reference to FIG. 3. FIG. 3 shows a block diagram 300 of the decoder 310 in the translation model 130 according to some implementations of the present disclosure. As shown in FIG. 3, the decoder 310 may include a plurality of layers 312, 314, ..., and 316. Each layer may include a certain number of nodes, where the number of nodes is determined based on the length of the output data. In the training stage, this length is determined based on the true length of the output data in the training data; in the application stage, this length can be predicted based on a variety of techniques that are currently known and/or will be developed in the future. It will be understood that every layer includes the same number of nodes; assuming the length of the output data in the training data is 3, each layer includes 3 nodes, and each node corresponds to one word in the output data.

The node 320 schematically shows one node in one layer of the decoder 310 (for example, the last node in layer 1), and the right side of FIG. 3 shows more details of the node 320. Here, the node 320 corresponds to one position in the output data (that is, to one word in the translation result). In the node 320 shown on the right side of FIG. 3, in addition to the implicit state 322 corresponding to the node, prediction information 324 corresponding to the node 320 is input to the node 320. Here, the prediction information 324 represents the prediction information corresponding to this position. Further, an updated implicit state 326 can be generated based on the implicit state 322 and the prediction information 324 corresponding to this position in the translation result. The updated implicit state 326 includes both the implicit state information of the current layer and the prediction information of the translated word at this position.

According to an exemplary implementation of the present disclosure, in one implicit layer, the node 320 described above may be implemented at one or more positions. Alternatively and/or additionally, the node 320 described above may be implemented at one or more positions in one or more implicit layers. In this way, at each implicit layer, the decoder 310 can know the prediction information corresponding to each translated word, which helps the decoder 310 train the translation model in a more accurate manner.

According to an exemplary implementation of the present disclosure, the node 320 described above may be implemented at every position of every layer of the decoder 310. In this way, the accuracy of the translation model can be further improved based on the translation information corresponding to each translated word. With the exemplary implementation of the present disclosure, each node in each layer of the decoder 310 can know the word in the translation result corresponding to the position of that node, which helps to eliminate repeated translations, missing words, or inaccurate positions of translated words in the translation result. In this way, the accuracy of the translation model 130 can be improved.
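The layer-by-layer flow described above can be illustrated with a short Python/PyTorch sketch. This is a non-normative illustration rather than the disclosed implementation: the module names (LayerPredictionDecoder, classifier, combine), the use of PyTorch's TransformerDecoderLayer, and all dimensions are assumptions made only for this example.

    import torch
    import torch.nn as nn

    class LayerPredictionDecoder(nn.Module):
        # Sketch: every implicit layer predicts the translated words and feeds
        # them, together with its implicit state, to the next implicit layer.
        def __init__(self, num_layers, d_model, num_heads, vocab_size):
            super().__init__()
            self.layers = nn.ModuleList(
                [nn.TransformerDecoderLayer(d_model, num_heads, batch_first=True)
                 for _ in range(num_layers)])
            self.classifier = nn.Linear(d_model, vocab_size)  # prediction head (W)
            self.embed = nn.Embedding(vocab_size, d_model)    # emb(.) of predictions
            self.combine = nn.Linear(2 * d_model, d_model)    # combination matrix (W_c)

        def forward(self, h, enc_out):
            # h: (batch, T_y, d_model) initial decoder states; enc_out: encoder output E
            for layer in self.layers:
                h = layer(h, enc_out)                  # implicit state of this layer
                y_hat = self.classifier(h).argmax(-1)  # layer-wise prediction
                # updated implicit state: concatenate state and prediction embedding
                h = self.combine(torch.cat([h, self.embed(y_hat)], dim=-1))
            return self.classifier(h)                  # logits for all T_y words at once

Because every position is handled in the same forward pass, the sketch outputs the logits of all T_y words simultaneously, mirroring the parallel decoding behavior described above.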
Model Training Process
In the following, more details of the training process will be described with reference to the accompanying drawings. FIG. 4 shows a block diagram 400 of a training process for training the decoder 310 according to some implementations of the present disclosure. During training, a training data set 110 can be obtained, where the training data set 110 may include a large amount of labeled training data 112. Specifically, each training data 112 may include input data 120 expressed in the source language (for example, "她是谁") and output data 122 expressed in the target language (for example, "who is she"). The translation model 130 can be trained with the training data 112.

As shown in FIG. 4, the input data 120 can be input into any of a variety of encoders (for example, the encoder 220) that are currently known and/or will be developed in the future, so as to extract an encoding 410 of the input data 120. The encoding 410 can be stored in a variety of formats; for example, the extracted encoding 410 can be stored as a multidimensional vector (for example, of 512 or another number of dimensions). The encoding 410 can then be provided to the layer-prediction-based decoder 310. During training, the number of nodes in each layer is determined based on the length of the output data 122; for the output data 122 in the training data 112, this length is 3. Therefore, the layer 312 may include 3 nodes 420, 422 and 320. It will be understood that FIG. 4 only takes one layer 312 in the decoder 310 as an example for description, and similar processing can be performed for the other layers in the decoder 310. The implicit state associated with the layer 312 in the decoder 310 can be determined based on the encoding 410.
Suppose the input data 120 is $x = (x_1, x_2, \ldots, x_{T_x})$, where $T_x$ is the length of the input data 120. The encoder 220 can take the input data 120 as input and, based on an embedding function $\mathrm{emb}(\cdot)$, map it to a sequence of vectors $E^{(0)} = \mathrm{emb}(x)$. Specifically, at each layer, a multi-head attention operation can be performed based on the following Formula 1 to determine a deep text representation:

$E^{(n)} = \operatorname{MultiHead}\big(E^{(n-1)}, E^{(n-1)}, E^{(n-1)}\big)$  (Formula 1)

where $E^{(n)}$ denotes the output of the n-th encoder layer. The above operation can be repeated for each layer in the encoder 220, and the output of the last layer in the encoder 220 is denoted as E.
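The per-layer attention of Formula 1 can be sketched as follows in Python/PyTorch. This is a minimal illustration, not the disclosed encoder: the residual connections, layer normalization and feed-forward sublayers of a full transformer encoder layer are omitted, the vocabulary size and dimensions are arbitrary, and a real encoder would use distinct attention parameters per layer rather than reusing one module.

    import torch
    import torch.nn as nn

    d_model, num_heads, vocab, T_x = 512, 8, 1000, 5
    emb = nn.Embedding(vocab, d_model)      # the embedding function emb(.)
    attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

    x = torch.randint(0, vocab, (1, T_x))   # input data x_1, ..., x_{T_x}
    E = emb(x)                              # E^(0) = emb(x)
    for n in range(6):                      # Formula 1 applied layer by layer
        E, _ = attn(E, E, E)                # E^(n) = MultiHead(E^(n-1), ...)
    # E now plays the role of the encoder output fed to the decoder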
Further, the length $T_y$ of the output data can be determined. Various techniques have been proposed for determining the length of the output data; in the training stage, this length can be determined based on the true length of the output data. Further, the encoding 410 output by the encoder 220 can be input to the decoder 310. For the decoder, suppose the implicit state at position t of layer n is denoted $h_t^{(n)}$, and write $H^{(n)} = \big(h_1^{(n)}, \ldots, h_{T_y}^{(n)}\big)$. At each implicit layer of the conventional decoder 230, the implicit states can be computed based on the following Formula 2, so that the implicit state at the last layer of the decoder is eventually obtained:

$H^{(n)} = \operatorname{Layer}^{(n)}\big(H^{(n-1)}, E\big)$  (Formula 2)

where $\operatorname{Layer}^{(n)}$ denotes the n-th decoder layer, $H^{(n-1)}$ denotes the implicit state of length $T_y$ at layer n-1 of the decoder, and E denotes the output of the encoder. Further, the softmax function can be used to predict the target words of the translation result based on the last implicit state of the decoder:

$P\big(y_t \mid x\big) = \operatorname{softmax}\big(W h_t^{(N)}\big)$  (Formula 3)

where $y_t$ denotes the word at each position of the translation result, N denotes the number of layers of the decoder, and W denotes a predetermined parameter. In this way, every word $y_1, \ldots, y_{T_y}$ of the translation result is obtained at the same time. Note that when predicting the word $y_t$, the predictions of the words at the other positions $y_1, \ldots, y_{t-1}, y_{t+1}, \ldots, y_{T_y}$ are not known. Further, training can be performed based on the training objective defined by Formula 4:

$\mathcal{L} = \sum_{t=1}^{T_y} \log P\big(y_t \mid x\big)$  (Formula 4)

where $\mathcal{L}$ denotes the training objective, $\sum$ denotes the summation function, $\log$ denotes the logarithm function, and the other symbols have the same meanings as described for the formulas above.
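A compact sketch of the parallel prediction of Formula 3 and the objective of Formula 4 follows; the randomly generated logits and the token ids are stand-ins chosen only for this example.

    import torch
    import torch.nn.functional as F

    T_y, vocab = 3, 1000
    logits = torch.randn(T_y, vocab)          # stand-in for W h_t^(N), all positions
    y = torch.tensor([11, 42, 7])             # stand-in reference words y_1..y_{T_y}

    log_probs = F.log_softmax(logits, dim=-1)           # Formula 3
    y_pred = log_probs.argmax(dim=-1)                   # all words decoded at once
    objective = log_probs[torch.arange(T_y), y].sum()   # Formula 4: sum_t log P(y_t|x)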
On the basis of the training process of the conventional decoder described above, in the context of the present disclosure a layer prediction process based on the implicit state of each layer is added to the training process. The implicit state associated with each implicit layer can be determined with reference to the process described above. For example, during training, the length of the output data can be used to determine the number of nodes included at each layer, and thus the plurality of positions associated with the output data. As shown in FIG. 4, it can be determined that the layer 312 includes 3 nodes 420, 422 and 320. Further, the portions corresponding to the respective positions can be determined based on the encoding 410 output from the encoder. For example, the implicit state corresponding to the t-th position of the n-th layer in the decoder 310 can be expressed as $h_t^{(n)}$.

Further, the prediction information associated with the output data can be determined. Specifically, prediction information for the translation result can be generated based on the implicit state of each layer, and the generated prediction information can be fed to the next layer as a tentative translation for further processing. According to an exemplary implementation of the present disclosure, the prediction information $\hat{y}_t^{(n)}$ corresponding to the t-th position of the n-th layer can be generated in a variety of ways; for example, the prediction information $\hat{y}_t^{(n)}$ can be generated based on the prediction of the translation model, or directly based on the true value of the output data.

According to an exemplary implementation of the present disclosure, for a given position among the plurality of positions, the prediction information corresponding to the given position can be determined based on the translation model. For example, for position t among the plurality of positions, the t-th word in the translation result of the current translation model can be used as the prediction information $\hat{y}_t^{(n)}$. For the node 320 in FIG. 4, assuming that the translation result output by the decoder 310 is "who is she", the word "she" at the third position can be used as the prediction information $\hat{y}_t^{(n)}$ for the position of the node 320.

It will be understood that the above only uses "who is she" as an example of the translation result to show how the prediction information $\hat{y}_t^{(n)}$ is determined. However, for example when insufficiently trained and/or in other abnormal situations, the translation model may output other translation results; for example, the translation model may output "who is he", in which case the word "he" can be used as the prediction information $\hat{y}_t^{(n)}$. It will be understood that although the prediction information may not be completely accurate in this case, it carries the information of the position-related translated word determined based on the implicit state of the previous layer, and therefore helps to improve the accuracy of the translation model to a certain extent.

According to an exemplary implementation of the present disclosure, the prediction information $\hat{y}_t^{(n)}$ can also be determined directly based on the true values in the training data. For example, for a given position among the plurality of positions, the true-value data corresponding to that position in the output data 122 of the training data 112 can be used as the prediction information. In this case, the three words "who", "is" and "she" in the output data 122 "who is she" can be used directly as the prediction information for the three nodes 420, 422 and 320, respectively.
According to an exemplary implementation of the present disclosure, each implicit layer may include a plurality of positions, and during training the prediction information determined based on the translation model can be input to a part of the plurality of positions. According to an exemplary implementation of the present disclosure, the translation model can be used to determine the prediction information based on the following Formula 5:

$\hat{y}_t^{(n)} = \operatorname*{argmax}_{y}\, P\big(y \mid h_t^{(n)}\big)$  (Formula 5)

where $\hat{y}_t^{(n)}$ denotes the prediction information, argmax denotes the prediction function, $h_t^{(n)}$ denotes the implicit state, and $y$ denotes a translated word. In other words, Formula 5 expresses that $\hat{y}_t^{(n)}$ is the prediction of the translated word with the greatest probability determined based on the current implicit state.
According to an exemplary implementation of the present disclosure, prediction information determined based on the true-value data can be input to the other positions among the plurality of positions. FIG. 5 shows a block diagram 500 of the prediction information provided to the respective nodes in an implicit layer according to some implementations of the present disclosure. Suppose the training data set 110 further includes training data 510, where the input data 512 is "你是谁" and the output data 514 is "who are you". In this case, it can be determined that the length of the output data 514 is 3, and for the 3 positions of the translation data, the word "who" from the true value and the predictions 522 and 524 from the translation model can be used as the prediction information, respectively. The prediction information then involves a mixture of the two prediction manners, and the mixing ratio here is 1:2.

Although FIG. 5 only shows inputting the prediction information from the true value at the first position, according to an exemplary implementation of the present disclosure, the prediction information from the true value can be used for any other one or more of the plurality of positions. Suppose the length of the output data of another training data is 10; the prediction information from the true value can be used for the 1st, 2nd and 4th positions, and the prediction information from the translation model can be used for the other 7 positions, in which case the mixing ratio is 3:7. As another example, at least a part of the positions can be selected randomly from the plurality of positions, and the prediction information from the true value can be applied to the selected positions.

With the exemplary implementation of the present disclosure, using the prediction information from the true values makes the training process conform better to the manually labeled training data, and thus makes the translation model more accurate. Further, using the prediction information output from the translation model ensures that the training process takes the prediction results of the model being trained into account, so that the trained translation model conforms better to the training objective.

It will be understood that the prediction information can be determined at each implicit layer according to the same rules. In other words, if at layer 1 the prediction information from the true value is applied to the first position and the prediction information from the translation model is applied to the other positions, then at the subsequent implicit layers, such as layer 2 and layer 3, the prediction information from the true value should also be applied to the first position and the prediction information from the translation model to the other positions. In this way, each implicit layer can be trained based on the same prediction information, thereby improving the accuracy of the translation model.
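The per-position choice between true-value words and model-predicted words can be sketched as a mask that is sampled once per position and then reused at every implicit layer, consistent with the same-rules-per-layer behavior above. The token ids and the value of the mixing hyperparameter below are assumptions made only for this illustration.

    import torch

    torch.manual_seed(0)
    T_y = 3
    lam = 1 / 3                                # mixing hyperparameter (e.g., ratio 1:2)
    y_true = torch.tensor([11, 42, 7])         # stand-in ids of "who are you" (true value)
    y_model = torch.tensor([11, 42, 9])        # stand-in ids of the model's predictions

    s = torch.bernoulli(torch.full((T_y,), lam)).bool()  # s_t, one draw per position
    y_mixed = torch.where(s, y_true, y_model)  # true value where s_t = 1, else prediction
    # y_mixed is reused as the prediction information at every implicit layer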
It has been described above how the prediction information $\hat{y}_t^{(n)}$ is obtained. In the following, how the updated implicit state is generated will be described with reference to FIG. 6. FIG. 6 shows a block diagram 600 of a node in an implicit layer of the decoder according to some implementations of the present disclosure. Suppose the decoder 310 includes N implicit layers. At each implicit layer, the updated implicit state for each position can be generated based on both the implicit state $h_t^{(n)}$ of that layer and the associated prediction information $\hat{y}_t^{(n)}$. As shown in FIG. 6, a vector 610 (for example, derived from the output of the encoder) can be input to the node 320 as the implicit state 322, and a vector 620 (for example, derived from the true value "she" in the output data 122) can be input to the node 320 as the prediction information 324. Further, the predicted word with the greatest probability can be determined based on the following Formula 6:

$\hat{y}_t^{(n)} = \operatorname*{argmax}_{y}\, \operatorname{softmax}\big(W h_t^{(n)}\big)_y$  (Formula 6)

The symbols in Formula 6 have the same meanings as in the formulas above and are therefore not repeated here. Further, the updated implicit state can be determined based on both Formulas 5 and 6:

$\tilde{h}_t^{(n)} = W_c\,\big[h_t^{(n)};\ \mathrm{emb}\big(\hat{y}_t^{(n)}\big)\big]$  (Formula 7)

where $\tilde{h}_t^{(n)}$ denotes the updated implicit state (which will be output to the next implicit layer as the implicit state associated with that layer), $W_c$ denotes a weight matrix, and $\mathrm{emb}\big(\hat{y}_t^{(n)}\big)$ denotes the embedding of the prediction information $\hat{y}_t^{(n)}$. In other words, Formula 7 expresses the concatenation operation of $h_t^{(n)}$ and the associated prediction information $\mathrm{emb}\big(\hat{y}_t^{(n)}\big)$, and the dimension of $\tilde{h}_t^{(n)}$ is the same as that of the previous vector representation.

According to an exemplary implementation of the present disclosure, when the prediction information is provided in a mixed manner (that is, the prediction information from the translation model is applied to some positions and the prediction information from the true value to others), the updated implicit state can be determined based on the following Formula 8:

$\tilde{h}_t^{(n)} = W_c\,\big[h_t^{(n)};\ \mathrm{emb}\big(s_t\, y_t + (1 - s_t)\, \hat{y}_t^{(n)}\big)\big]$  (Formula 8)

where $s_t \in \{0, 1\}$ depends on the hyperparameter λ used to control the mixing ratio; here $s_t$ does not depend on the layer number n of the implicit layer.
According to an exemplary implementation of the present disclosure, similar processing can be performed at each implicit layer based on the formulas described above, so as to generate the updated first implicit state based on the implicit state of the current implicit layer and the prediction information. Further, the updated implicit state can be output to the subsequent implicit layer after the current implicit layer among the plurality of implicit layers, so that the updated implicit state serves as the second implicit state associated with the subsequent implicit layer. Specifically, the process described above can first be performed at the first implicit layer of the decoder 310, and the updated implicit state obtained at the first layer is output to the second implicit layer of the decoder 310 as the implicit state of that second implicit layer. Further, the process described above can be performed at the second implicit layer, and the updated implicit state obtained at the second layer is output to the third implicit layer of the decoder 310 as the implicit state of that third implicit layer, and so on, until similar processing has been performed at all implicit layers.
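For a single position, Formulas 6 and 7 can be sketched as follows; the dimensions, the vocabulary size and the module names are assumptions made for the illustration only.

    import torch
    import torch.nn as nn

    d_model, vocab = 512, 1000
    W = nn.Linear(d_model, vocab, bias=False)   # prediction head W
    W_c = nn.Linear(2 * d_model, d_model)       # combination weight matrix W_c
    emb = nn.Embedding(vocab, d_model)          # embedding of the prediction info

    h = torch.randn(1, d_model)                      # implicit state h_t^(n)
    y_hat = W(h).softmax(dim=-1).argmax(dim=-1)      # Formula 6: most probable word
    h_new = W_c(torch.cat([h, emb(y_hat)], dim=-1))  # Formula 7: same dimension as h
    # h_new is passed to layer n+1 as the implicit state associated with that layer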
根据本公开的一个示例性实现方式,可以基于训练数据112来训练按照上文描述的方式构造的解码器310。具体地,可以针对隐式层312来生成描述训练目标的损失函数。在一个隐式层中,可以基于隐式状态中的一部分来生成损失函数。例如,可以基于针对3个位置的任何一个或者多个位置来生成该损失函数。例如,损失函数可以表示为如下公式9:According to an exemplary implementation of the present disclosure, the decoder 310 constructed in the manner described above may be trained based on the training data 112 . Specifically, a loss function describing the training objective may be generated for the hidden layer 312 . In an implicit layer, a loss function can be generated based on a portion of the hidden state. For example, the loss function can be generated based on any one or more of the 3 locations. For example, the loss function can be expressed as the following formula 9:
Figure PCTCN2022117230-appb-000043
Figure PCTCN2022117230-appb-000043
其中
Figure PCTCN2022117230-appb-000044
表示用作训练目标的损失函数,T y表示输出数据的长度,t表示输出数据中的各个字的位置,
Figure PCTCN2022117230-appb-000045
表示输出数据,
Figure PCTCN2022117230-appb-000046
表示更新的隐式状态,并且n表示多个隐式层中的任一隐式层。换言之,该公式表示基于更新的隐式状态获得的预测与训练数据中的输出数据之间的差异。
in
Figure PCTCN2022117230-appb-000044
Indicates the loss function used as the training target, T y indicates the length of the output data, t indicates the position of each word in the output data,
Figure PCTCN2022117230-appb-000045
represents the output data,
Figure PCTCN2022117230-appb-000046
denotes an updated implicit state, and n denotes any one of a plurality of implicit layers. In other words, the formula represents the difference between the prediction obtained based on the updated implicit state and the output data in the training data.
According to an exemplary implementation of the present disclosure, the translation model is trained using the input data and the output data in the training data, so that the loss function in Formula 9 satisfies a predetermined condition (for example, reaches a predetermined convergence state). With exemplary implementations of the present disclosure, prediction information for each position can be introduced at one or more implicit layers. In this way, word predictions for other positions are taken into account at each implicit layer, thereby improving the accuracy of the translation model.
According to an exemplary implementation of the present disclosure, a similar operation may be performed for each implicit layer in the decoder 310. For example, similar operations may be performed at the first implicit layer, the second implicit layer, ..., and the n-th implicit layer, and corresponding loss functions may be constructed. Further, an overall loss function may be constructed based on the plurality of corresponding loss functions. Specifically, the overall loss function may be constructed based on the following Formula 10:
$$\mathcal{L}_{\text{all}} = \sum_{n=1}^{N} \mathcal{L}^{n}_{\text{layer}} \qquad \text{(Formula 10)}$$
where the symbols in Formula 10 have the same meanings as in the formulas described above, and N denotes the number of implicit layers in the decoder. According to an exemplary implementation of the present disclosure, the translation model may be trained based on Formula 10 such that the loss function determined based on Formula 10 satisfies a predetermined condition (for example, reaches a predetermined convergence state). With exemplary implementations of the present disclosure, prediction information for each position can be introduced at all implicit layers. In this way, word predictions for other positions are taken into account at each implicit layer in a more accurate manner, thereby improving the accuracy of the translation model.
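As a rough illustration of Formulas 9 and 10, the sketch below computes a per-layer negative log-likelihood and sums it across all N decoder layers. The tensor shapes, the shared projection `vocab_proj`, and the mean reduction are assumptions made for this example rather than details taken from the disclosure.

```python
import torch
import torch.nn.functional as F

def deep_supervision_loss(layer_states, targets, vocab_proj):
    """Sum of per-layer negative log-likelihoods (in the spirit of Formulas 9 and 10).

    layer_states: list of N tensors, each (batch, T_y, d_model), holding the
        updated implicit state of one decoder layer.
    targets: (batch, T_y) tensor of target-language token ids.
    vocab_proj: nn.Linear mapping d_model -> vocab_size (assumed shared).
    """
    total = 0.0
    for state in layer_states:                   # one loss term per layer n
        logits = vocab_proj(state)               # (batch, T_y, vocab)
        total = total + F.cross_entropy(
            logits.transpose(1, 2), targets,     # per-position NLL over T_y
            reduction="mean",
        )
    return total                                 # overall training objective
```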
With exemplary implementations of the present disclosure, the decoder 310 is aware, at each implicit layer, of the prediction determined based on the implicit state associated with that layer. In this way, the loss at each implicit layer can be taken into account when constructing the loss function, providing deep supervision and thereby improving the accuracy of the translation model.
Model application process
The training of the translation model 130 has been discussed above. The trained translation model 130' can be provided to the model application system 152 shown in FIG. 1 for processing the data to be translated 140. Specifically, after the model training stage has been completed, the trained translation model 130' with its trained parameter values can be used to process the received data to be translated. Returning to FIG. 1, more information about the model application process is described. The data to be translated 140 may be input to the trained translation model 130'. At this point, the fully trained translation model 130' can translate the data to be translated 140 from the source language into the target language. For example, the translation model 130' may output the translation data 142.
According to an exemplary implementation of the present disclosure, the data to be translated 140 expressed in the source language may be received, and the encoder in the translation model 130' may extract a corresponding encoding from the data to be translated 140. The decoder 310 in the translation model 130' may receive the encoding, and each implicit layer in the decoder 310 may operate in a manner similar to the training process described above. Specifically, at a first implicit layer among the plurality of implicit layers of the decoder 310, a first implicit state associated with the first implicit layer may be determined. Further, prediction information associated with the translation result may be determined based on the first implicit state (for example, based on Formula 5 described above). Further, an updated first implicit state may be generated based on the first implicit state and the prediction information (for example, based on Formulas 6 and 7 described above). Next, the updated first implicit state may be input to a second implicit layer following the first implicit layer of the plurality of implicit layers, so that the translation model uses the updated first implicit state as the second implicit state associated with the second implicit layer.
The process described above may be repeated at each implicit layer until the implicit state associated with the last implicit layer of the decoder 310 is obtained. According to an exemplary implementation of the present disclosure, the final translation result may be determined based on the implicit state associated with the last implicit layer. With exemplary implementations of the present disclosure, a translation model trained with layer-by-layer prediction can be fully exploited to obtain more accurate translation results. When the translation model 130' is used, each implicit layer of the decoder no longer takes the output of the previous layer directly as its input; instead, prediction information based on the implicit state of the current implicit layer is incorporated. In this way, the translation results of the translation model better conform to the translation rules between the two languages. Further, the translation model 130' can output all words of the translation result simultaneously, which greatly improves the translation speed.
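A minimal inference sketch, reusing the hypothetical `LayerPredictionDecoder` from the earlier example: all implicit layers run once, and the translation is read from the last layer's state in a single parallel pass, so every output word is produced simultaneously. The `tokenizer` object and greedy argmax decoding are assumptions for illustration.

```python
import torch

@torch.no_grad()
def translate(decoder, encoder_out, init_state, tokenizer):
    """Sketch of applying the trained model (hypothetical names throughout)."""
    logits = decoder(init_state, encoder_out)  # runs all implicit layers in turn
    token_ids = logits.argmax(dim=-1)          # prediction from the last layer's state
    # All positions are emitted at once; there is no token-by-token loop.
    return tokenizer.decode(token_ids[0].tolist())
```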
Example process
FIG. 7 shows a flowchart of a method 700 for language translation based on layer prediction according to some implementations of the present disclosure. Specifically, at block 710, at a first implicit layer among a plurality of implicit layers of a decoder of a translation model, a first implicit state associated with the first implicit layer is determined based on an encoding of input data included in training data, the training data including input data expressed in a source language and output data expressed in a target language, the translation model being used to translate the input data into the output data.
According to an exemplary implementation of the present disclosure, in order to determine the first implicit state associated with the first implicit layer, a plurality of positions associated with the first implicit state may be determined based on the length of the output data. Further, a plurality of parts of the first implicit state respectively corresponding to the plurality of positions may be determined.
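For illustration only, the per-position parts of an implicit state can be pictured as slices along the position axis of a state tensor whose length matches the output data; the names below are assumptions, not an API from the disclosure.

```python
import torch

def state_parts(first_state: torch.Tensor, out_len: int):
    """Split an implicit state into per-position parts (illustrative sketch).

    first_state: (batch, out_len, d_model) implicit state of one layer,
        where out_len is the length T_y of the output data.
    Returns a list of (batch, d_model) parts, one per position t.
    """
    return [first_state[:, t, :] for t in range(out_len)]
```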
At block 720, prediction information associated with the output data may be determined. According to an exemplary implementation of the present disclosure, for a given position among the plurality of positions, the prediction information for the given position may be determined based on any one of the following: the translation model; and ground-truth data in the output data corresponding to the given position.
At block 730, an updated first implicit state may be generated based on the first implicit state and the prediction information. According to an exemplary implementation of the present disclosure, in order to generate the updated first implicit state, the part of the updated first implicit state corresponding to the given position may be generated based on the part of the first implicit state corresponding to the given position and the prediction information for the given position.
According to an exemplary implementation of the present disclosure, in order to generate the updated first implicit state, a mixing ratio of the prediction information determined based on the translation model and the prediction information determined based on the output data may be obtained. Further, the updated first implicit state may be generated based on the mixing ratio, the first implicit state, and the prediction information.
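One way to picture the mixing ratio is a per-position choice between the model's own prediction and the ground-truth token, as in the sketch below. The Bernoulli sampling strategy and the names `mix_ratio`, `model_tokens`, and `gold_tokens` are illustrative assumptions; the disclosure only requires that the two sources of prediction information be combined according to a mixing ratio.

```python
import torch

def mixed_predictions(model_tokens, gold_tokens, mix_ratio: float):
    """Per-position mix of model predictions and ground truth (a sketch).

    model_tokens, gold_tokens: (batch, T_y) token-id tensors.
    mix_ratio: probability of keeping the ground-truth token at a position;
        the remaining positions keep the model's own prediction.
    """
    keep_gold = torch.rand_like(gold_tokens, dtype=torch.float) < mix_ratio
    return torch.where(keep_gold, gold_tokens, model_tokens)
```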
At block 740, the updated first implicit state is output to a second implicit layer following the first implicit layer among the plurality of implicit layers, such that the updated first implicit state serves as a second implicit state associated with the second implicit layer.
According to an exemplary implementation of the present disclosure, an updated second implicit state may be generated based on the second implicit state and the prediction information. Further, the updated second implicit state may be output to a third implicit layer following the second implicit layer among the plurality of implicit layers, such that the updated second implicit state serves as a third implicit state associated with the third implicit layer.
According to an exemplary implementation of the present disclosure, in order to train the translation model, a first training target associated with the first implicit layer may be generated. The translation model may be trained using the input data and the output data such that the first training target satisfies a first predetermined condition.
According to an exemplary implementation of the present disclosure, in order to generate the first training target associated with the first implicit layer, a difference between the output data and a prediction based on the first implicit state may be determined. Further, the first training target may be generated based on the difference.
According to an exemplary implementation of the present disclosure, in order to train the translation model, a second training target associated with the second implicit layer may be generated. Further, the translation model may be trained using the input data and the output data such that the second training target satisfies a second predetermined condition.
According to an exemplary implementation of the present disclosure, in order to train the translation model, a training target of the translation model may be determined based on the first training target and the second training target. Further, the translation model may be trained using the input data and the output data such that the training target satisfies a predetermined condition.
FIG. 8 shows a flowchart of a method 800 for language translation based on layer prediction according to some implementations of the present disclosure. At block 810, an encoding of data to be translated expressed in a source language is received, and a first implicit state associated with a first implicit layer among a plurality of implicit layers in a translation model is determined, the translation model being used to translate the data to be translated, expressed in the source language, into a translation result expressed in a target language. At block 820, prediction information associated with the translation result is determined based on the first implicit state. At block 830, an updated first implicit state is generated based on the first implicit state and the prediction information. At block 840, the updated first implicit state is input to a second implicit layer following the first implicit layer of the plurality of implicit layers, such that the translation model uses the updated first implicit state as a second implicit state associated with the second implicit layer.
According to an exemplary implementation of the present disclosure, in the method 800, at the last implicit layer among the plurality of implicit layers, the translation result is determined based on an implicit state associated with the last implicit layer.
Example apparatus and devices
FIG. 9A shows a block diagram of an apparatus 900A for language translation based on layer prediction according to some implementations of the present disclosure. As shown in FIG. 9A, the apparatus 900A includes a determining unit 910A, a prediction unit 920A, a generating unit 930A, and an output unit 940A.
According to an exemplary implementation of the present disclosure, the determining unit 910A is configured to determine, at a first implicit layer among a plurality of implicit layers of a decoder of a translation model and based on an encoding of input data included in training data, a first implicit state associated with the first implicit layer, the training data including input data expressed in a source language and output data expressed in a target language, the translation model being used to translate the input data into the output data; the prediction unit 920A is configured to determine prediction information associated with the output data; the generating unit 930A is configured to generate an updated first implicit state based on the first implicit state and the prediction information; and the output unit 940A is configured to output the updated first implicit state to a second implicit layer following the first implicit layer among the plurality of implicit layers, such that the updated first implicit state serves as a second implicit state associated with the second implicit layer.
According to an exemplary implementation of the present disclosure, the determining unit 910A is further configured to: determine a plurality of positions associated with the first implicit state based on the length of the output data; and determine a plurality of parts of the first implicit state respectively corresponding to the plurality of positions.
According to an exemplary implementation of the present disclosure, the apparatus further includes a training unit configured to train the translation model based on: generating a first training target associated with the first implicit layer; and training the translation model using the input data and the output data such that the first training target satisfies a first predetermined condition.
According to an exemplary implementation of the present disclosure, the training unit is further configured to: determine a difference between the output data and a prediction based on the first implicit state; and generate the first training target based on the difference.
According to an exemplary implementation of the present disclosure, the training unit is further configured to: generate a second training target associated with the second implicit layer; and train the translation model using the input data and the output data such that the second training target satisfies a second predetermined condition.
According to an exemplary implementation of the present disclosure, the training unit is further configured to: determine a training target of the translation model based on the first training target and the second training target; and train the translation model using the input data and the output data such that the training target satisfies a predetermined condition.
According to an exemplary implementation of the present disclosure, the prediction unit 920A is further configured to, for a given position among the plurality of positions, determine the prediction information for the given position based on any one of the following: the translation model; and ground-truth data in the output data corresponding to the given position.
According to an exemplary implementation of the present disclosure, the generating unit 930A is further configured to: generate the part of the updated first implicit state corresponding to the given position based on the part of the first implicit state corresponding to the given position and the prediction information for the given position.
According to an exemplary implementation of the present disclosure, the generating unit 930A is further configured to: obtain a mixing ratio of the prediction information determined based on the translation model and the prediction information determined based on the output data; and generate the updated first implicit state based on the mixing ratio, the first implicit state, and the prediction information.
According to an exemplary implementation of the present disclosure, the generating unit 930A is further configured to generate an updated second implicit state based on the second implicit state and the prediction information. The output unit 940A is further configured to output the updated second implicit state to a third implicit layer following the second implicit layer among the plurality of implicit layers, such that the updated second implicit state serves as a third implicit state associated with the third implicit layer.
FIG. 9B shows a block diagram of an apparatus 900B for language translation according to some implementations of the present disclosure. As shown in FIG. 9B, the apparatus 900B includes a receiving unit 910B, a determining unit 920B, a generating unit 930B, and an output unit 940B.
According to an exemplary implementation of the present disclosure, the receiving unit 910B is configured to receive an encoding of data to be translated expressed in a source language and determine a first implicit state associated with a first implicit layer among a plurality of implicit layers in a translation model, the translation model being used to translate the data to be translated, expressed in the source language, into a translation result expressed in a target language; the determining unit 920B is configured to determine prediction information associated with the translation result based on the first implicit state; the generating unit 930B is configured to generate an updated first implicit state based on the first implicit state and the prediction information; and the output unit 940B is configured to input the updated first implicit state to a second implicit layer following the first implicit layer of the plurality of implicit layers, such that the translation model uses the updated first implicit state as a second implicit state associated with the second implicit layer.
According to an exemplary implementation of the present disclosure, the apparatus 900B further includes a translation unit configured to determine, at the last implicit layer among the plurality of implicit layers, the translation result based on an implicit state associated with the last implicit layer.
FIG. 10 shows a block diagram of a device 1000 capable of implementing various implementations of the present disclosure. It should be understood that the computing device 1000 shown in FIG. 10 is merely exemplary and should not constitute any limitation on the functionality and scope of the implementations described herein. The computing device 1000 shown in FIG. 10 may be used to implement the model training system 150 shown in FIG. 1, and may also be used to implement the model application system 152 shown in FIG. 1.
As shown in FIG. 10, the computing device 1000 takes the form of a general-purpose computing device. Components of the computing device 1000 may include, but are not limited to, one or more processors or processing units 1010, a memory 1020, a storage device 1030, one or more communication units 1040, one or more input devices 1050, and one or more output devices 1060. The processing unit 1010 may be a physical or virtual processor and can perform various kinds of processing according to programs stored in the memory 1020. In a multi-processor system, multiple processing units execute computer-executable instructions in parallel to increase the parallel processing capability of the computing device 1000.
The computing device 1000 typically includes a plurality of computer storage media. Such media may be any available media accessible to the computing device 1000, including but not limited to volatile and non-volatile media, and removable and non-removable media. The memory 1020 may be volatile memory (for example, registers, cache, or random access memory (RAM)), non-volatile memory (for example, read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), or flash memory), or some combination thereof. The storage device 1030 may be a removable or non-removable medium and may include a machine-readable medium, such as a flash drive, a magnetic disk, or any other medium that can be used to store information and/or data (for example, training data for training) and that can be accessed within the computing device 1000.
The computing device 1000 may further include additional removable/non-removable, volatile/non-volatile storage media. Although not shown in FIG. 10, a magnetic disk drive for reading from or writing to a removable, non-volatile magnetic disk (for example, a "floppy disk") and an optical disc drive for reading from or writing to a removable, non-volatile optical disc may be provided. In these cases, each drive may be connected to a bus (not shown) by one or more data media interfaces. The memory 1020 may include a computer program product 1025 having one or more program modules configured to perform the various methods or actions of the various implementations of the present disclosure.
The communication unit 1040 enables communication with other computing devices over communication media. Additionally, the functionality of the components of the computing device 1000 may be implemented as a single computing cluster or as multiple computing machines capable of communicating over communication connections. Thus, the computing device 1000 may operate in a networked environment using logical connections to one or more other servers, networked personal computers (PCs), or another network node.
The input device 1050 may be one or more input devices, such as a mouse, a keyboard, or a trackball. The output device 1060 may be one or more output devices, such as a display, speakers, or a printer. The computing device 1000 may also, as needed, communicate through the communication unit 1040 with one or more external devices (not shown) such as storage devices and display devices, with one or more devices that enable a user to interact with the computing device 1000, or with any device (for example, a network card or a modem) that enables the computing device 1000 to communicate with one or more other computing devices. Such communication may be performed via an input/output (I/O) interface (not shown).
According to an exemplary implementation of the present disclosure, a computer-readable storage medium is provided on which computer-executable instructions are stored, the computer-executable instructions being executed by a processor to implement the methods described above. According to an exemplary implementation of the present disclosure, a computer program product is also provided, the computer program product being tangibly stored on a non-transitory computer-readable medium and comprising computer-executable instructions that are executed by a processor to implement the methods described above. According to an exemplary implementation of the present disclosure, a computer program product is provided on which a computer program is stored, the program implementing the methods described above when executed by a processor.
Various aspects of the present disclosure are described herein with reference to flowcharts and/or block diagrams of methods, apparatus, devices, and computer program products implemented according to the present disclosure. It should be understood that each block of the flowcharts and/or block diagrams, and combinations of blocks in the flowcharts and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processing unit of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams. These computer-readable program instructions may also be stored in a computer-readable storage medium; these instructions cause a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium having the instructions stored thereon comprises an article of manufacture including instructions that implement various aspects of the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
The computer-readable program instructions may be loaded onto a computer, other programmable data processing apparatus, or other devices, causing a series of operational steps to be performed on the computer, other programmable data processing apparatus, or other devices to produce a computer-implemented process, such that the instructions executed on the computer, other programmable data processing apparatus, or other devices implement the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various implementations of the present disclosure. In this regard, each block in the flowcharts or block diagrams may represent a module, a program segment, or a portion of instructions, which comprises one or more executable instructions for implementing the specified logical functions. In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two consecutive blocks may, in fact, be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functionality involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or acts, or by a combination of dedicated hardware and computer instructions.
The implementations of the present disclosure have been described above. The foregoing description is exemplary, not exhaustive, and is not limited to the disclosed implementations. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described implementations. The terminology used herein was chosen to best explain the principles of the implementations, their practical applications, or improvements over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the implementations disclosed herein.

Claims (30)

  1. A method of language translation based on layer prediction, comprising: at a first implicit layer among a plurality of implicit layers of a decoder of a translation model,
    determining, based on an encoding of input data included in training data, a first implicit state associated with the first implicit layer, the training data comprising input data expressed in a source language and output data expressed in a target language, the translation model being used to translate the input data into the output data;
    determining prediction information associated with the output data;
    generating an updated first implicit state based on the first implicit state and the prediction information; and
    outputting the updated first implicit state to a second implicit layer following the first implicit layer among the plurality of implicit layers, such that the updated first implicit state serves as a second implicit state associated with the second implicit layer.
  2. The method of claim 1, wherein determining the first implicit state associated with the first implicit layer comprises:
    determining, based on a length of the output data, a plurality of positions associated with the first implicit state; and
    determining a plurality of parts of the first implicit state respectively corresponding to the plurality of positions.
  3. The method of claim 1 or 2, further comprising training the translation model based on:
    generating a first training target associated with the first implicit layer; and
    training the translation model using the input data and the output data such that the first training target satisfies a first predetermined condition.
  4. The method of claim 3, wherein generating the first training target associated with the first implicit layer comprises:
    determining a difference between the output data and a prediction based on the first implicit state; and
    generating the first training target based on the difference.
  5. The method of claim 3, wherein training the translation model further comprises:
    generating a second training target associated with the second implicit layer; and
    training the translation model using the input data and the output data such that the second training target satisfies a second predetermined condition.
  6. The method of claim 5, wherein training the translation model further comprises:
    determining a training target of the translation model based on the first training target and the second training target; and
    training the translation model using the input data and the output data such that the training target satisfies a predetermined condition.
  7. The method of claim 2, wherein determining the prediction information associated with the output data comprises: for a given position among the plurality of positions, determining the prediction information for the given position based on any one of:
    the translation model; and
    ground-truth data in the output data corresponding to the given position.
  8. The method of claim 7, wherein generating the updated first implicit state comprises: generating a part of the updated first implicit state corresponding to the given position based on a part of the first implicit state corresponding to the given position and the prediction information for the given position.
  9. The method of claim 7, wherein generating the updated first implicit state comprises:
    obtaining a mixing ratio of the prediction information determined based on the translation model and the prediction information determined based on the output data; and
    generating the updated first implicit state based on the mixing ratio, the first implicit state, and the prediction information.
  10. The method of any one of claims 1 to 9, further comprising:
    generating an updated second implicit state based on the second implicit state and the prediction information; and
    outputting the updated second implicit state to a third implicit layer following the second implicit layer among the plurality of implicit layers, such that the updated second implicit state serves as a third implicit state associated with the third implicit layer.
  11. An electronic device, comprising:
    at least one processing unit; and
    at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit, the instructions, when executed by the at least one processing unit, causing the device to perform actions comprising: at a first implicit layer among a plurality of implicit layers of a decoder of a translation model,
    determining, based on an encoding of input data included in training data, a first implicit state associated with the first implicit layer, the training data comprising input data expressed in a source language and output data expressed in a target language, the translation model being used to translate the input data into the output data;
    determining prediction information associated with the output data;
    generating an updated first implicit state based on the first implicit state and the prediction information; and
    outputting the updated first implicit state to a second implicit layer following the first implicit layer among the plurality of implicit layers, such that the updated first implicit state serves as a second implicit state associated with the second implicit layer.
  12. The device of claim 11, wherein determining the first implicit state associated with the first implicit layer comprises:
    determining, based on a length of the output data, a plurality of positions associated with the first implicit state; and
    determining a plurality of parts of the first implicit state respectively corresponding to the plurality of positions.
  13. The device of claim 11 or 12, wherein the actions further comprise training the translation model based on:
    generating a first training target associated with the first implicit layer; and
    training the translation model using the input data and the output data such that the first training target satisfies a first predetermined condition.
  14. The device of claim 13, wherein generating the first training target associated with the first implicit layer comprises:
    determining a difference between the output data and a prediction based on the first implicit state; and
    generating the first training target based on the difference.
  15. The device of claim 13, wherein training the translation model further comprises:
    generating a second training target associated with the second implicit layer; and
    training the translation model using the input data and the output data such that the second training target satisfies a second predetermined condition.
  16. The device of claim 15, wherein training the translation model further comprises:
    determining a training target of the translation model based on the first training target and the second training target; and
    training the translation model using the input data and the output data such that the training target satisfies a predetermined condition.
  17. The device of claim 12, wherein determining the prediction information associated with the output data comprises: for a given position among the plurality of positions, determining the prediction information for the given position based on any one of:
    the translation model; and
    ground-truth data in the output data corresponding to the given position.
  18. The device of claim 17, wherein generating the updated first implicit state comprises: generating a part of the updated first implicit state corresponding to the given position based on a part of the first implicit state corresponding to the given position and the prediction information for the given position.
  19. The device of claim 17, wherein generating the updated first implicit state comprises:
    obtaining a mixing ratio of the prediction information determined based on the translation model and the prediction information determined based on the output data; and
    generating the updated first implicit state based on the mixing ratio, the first implicit state, and the prediction information.
  20. The device of any one of claims 11 to 19, wherein the actions further comprise:
    generating an updated second implicit state based on the second implicit state and the prediction information; and
    outputting the updated second implicit state to a third implicit layer following the second implicit layer among the plurality of implicit layers, such that the updated second implicit state serves as a third implicit state associated with the third implicit layer.
  21. A method of language translation based on layer prediction, comprising:
    receiving an encoding of data to be translated expressed in a source language, and determining a first implicit state associated with a first implicit layer among a plurality of implicit layers in a translation model, the translation model being used to translate the data to be translated, expressed in the source language, into a translation result expressed in a target language;
    determining prediction information associated with the translation result based on the first implicit state;
    generating an updated first implicit state based on the first implicit state and the prediction information; and
    inputting the updated first implicit state to a second implicit layer following the first implicit layer of the plurality of implicit layers, such that the translation model uses the updated first implicit state as a second implicit state associated with the second implicit layer.
  22. The method of claim 21, further comprising: at a last implicit layer among the plurality of implicit layers, determining the translation result based on an implicit state associated with the last implicit layer.
  23. An electronic device, comprising:
    at least one processing unit; and
    at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit, the instructions, when executed by the at least one processing unit, causing the device to perform actions comprising:
    receiving an encoding of data to be translated expressed in a source language, and determining a first implicit state associated with a first implicit layer among a plurality of implicit layers in a translation model, the translation model being used to translate the data to be translated, expressed in the source language, into a translation result expressed in a target language;
    determining prediction information associated with the translation result based on the first implicit state;
    generating an updated first implicit state based on the first implicit state and the prediction information; and
    outputting the updated first implicit state to a second implicit layer following the first implicit layer of the plurality of implicit layers, such that the translation model uses the updated first implicit state as a second implicit state associated with the second implicit layer.
  24. The device of claim 23, wherein the actions further comprise: at a last implicit layer among the plurality of implicit layers, determining the translation result based on an implicit state associated with the last implicit layer.
  25. An apparatus for language translation based on layer prediction, comprising:
    a determining unit configured to determine, at a first implicit layer among a plurality of implicit layers of a decoder of a translation model and based on an encoding of input data included in training data, a first implicit state associated with the first implicit layer, the training data comprising input data expressed in a source language and output data expressed in a target language, the translation model being used to translate the input data into the output data;
    a prediction unit configured to determine prediction information associated with the output data;
    a generating unit configured to generate an updated first implicit state based on the first implicit state and the prediction information; and
    an output unit configured to output the updated first implicit state to a second implicit layer following the first implicit layer among the plurality of implicit layers, such that the updated first implicit state serves as a second implicit state associated with the second implicit layer.
  26. An apparatus for language translation based on layer prediction, comprising:
    a receiving unit configured to receive an encoding of data to be translated expressed in a source language and determine a first implicit state associated with a first implicit layer among a plurality of implicit layers in a translation model, the translation model being used to translate the data to be translated, expressed in the source language, into a translation result expressed in a target language;
    a determining unit configured to determine prediction information associated with the translation result based on the first implicit state;
    a generating unit configured to generate an updated first implicit state based on the first implicit state and the prediction information; and
    an output unit configured to output the updated first implicit state to a second implicit layer following the first implicit layer of the plurality of implicit layers, such that the translation model uses the updated first implicit state as a second implicit state associated with the second implicit layer.
  27. A computer-readable storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the method of any one of claims 1 to 10.
  28. A computer-readable storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the method of any one of claims 21 to 22.
  29. A computer program product having a computer program stored thereon, wherein the program, when executed by a processor, implements the method of any one of claims 1 to 10.
  30. A computer program product having a computer program stored thereon, wherein the program, when executed by a processor, implements the method of any one of claims 21 to 22.
PCT/CN2022/117230 2021-10-13 2022-09-06 Language translation method and apparatus based on layer prediction, and device and medium WO2023061107A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111191528.8A CN113935338B (en) 2021-10-13 2021-10-13 Method, device, apparatus and medium for language translation based on layer prediction
CN202111191528.8 2021-10-13

Publications (1)

Publication Number Publication Date
WO2023061107A1 true WO2023061107A1 (en) 2023-04-20

Family

ID=79279071

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/117230 WO2023061107A1 (en) 2021-10-13 2022-09-06 Language translation method and apparatus based on layer prediction, and device and medium

Country Status (2)

Country Link
CN (1) CN113935338B (en)
WO (1) WO2023061107A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113935338B (en) * 2021-10-13 2024-07-12 北京有竹居网络技术有限公司 Method, device, apparatus and medium for language translation based on layer prediction

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107729329A (en) * 2017-11-08 2018-02-23 苏州大学 A kind of neural machine translation method and device based on term vector interconnection technique
CN111401081A (en) * 2018-12-14 2020-07-10 波音公司 Neural network machine translation method, model and model forming method
US20200250384A1 (en) * 2019-02-01 2020-08-06 Electronics And Telecommunications Research Institute Method and apparatus for constructing translation model
CN111723547A (en) * 2020-05-25 2020-09-29 河海大学 Text automatic summarization method based on pre-training language model
CN113935338A (en) * 2021-10-13 2022-01-14 北京有竹居网络技术有限公司 Method, apparatus, device and medium for language translation based on layer prediction

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108621159B (en) * 2018-04-28 2020-05-19 首都师范大学 Robot dynamics modeling method based on deep learning
CN109376234B (en) * 2018-10-10 2020-09-01 京东数字科技控股有限公司 Method and device for training abstract generation model
CN112907969B (en) * 2021-02-02 2022-04-22 中国科学院计算技术研究所 Method and system for predicting road traffic flow
CN112699244A (en) * 2021-03-16 2021-04-23 成都信息工程大学 Deep learning-based method and system for classifying defect texts of power transmission and transformation equipment

Also Published As

Publication number Publication date
CN113935338A (en) 2022-01-14
CN113935338B (en) 2024-07-12


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22880051

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE