
Training method of translation model, translation method and device thereof, and electronic device

Info

Publication number
CN113705256A
Authority
CN
China
Prior art keywords
translation model
predicted
corpus
processing
translation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110341084.5A
Other languages
Chinese (zh)
Inventor
周楚伦
孟凡东
苏劲松
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202110341084.5A priority Critical patent/CN113705256A/en
Publication of CN113705256A publication Critical patent/CN113705256A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/126 Character encoding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The application provides a training method for a translation model, a corpus translation method and apparatus, an electronic device, and a computer-readable storage medium. The method includes: forward-propagating a corpus sample in a first translation model to obtain a first confidence of each corresponding first pre-marked target word; determining each first to-be-predicted position whose first confidence is lower than a confidence threshold as a second to-be-predicted position, and determining each first pre-marked target word whose first confidence is not lower than the confidence threshold as a context word for a second translation model; forward-propagating the context words and the corpus sample in the second translation model to obtain a second confidence of each corresponding second pre-marked target word; and updating parameters of the first translation model and the second translation model based on the first confidences of the corresponding first pre-marked target words and the second confidences of the corresponding second pre-marked target words. By means of the method and apparatus, the accuracy with which the translation model translates corpora can be improved.

Description

Training method of translation model, translation method and device thereof, and electronic device
Technical Field
The present application relates to artificial intelligence technologies, and in particular, to a method for training a translation model, a translation method and apparatus thereof, an electronic device, and a computer-readable storage medium.
Background
Artificial Intelligence (AI) is a comprehensive technology of computer science; by studying the design principles and implementation methods of various intelligent machines, it enables machines to have the functions of perception, reasoning and decision making. Artificial intelligence technology is a comprehensive subject covering a wide range of fields, for example natural language processing technology and machine learning/deep learning; with the development of the technology, artificial intelligence technology will be applied in more and more fields and play an increasingly important role.
In the related art, natural language processing technology is applied to provide corpus translation functions for users in various application products, but the translation accuracy of translation models trained with current natural language processing technology is low, and it is difficult to meet users' ever-increasing requirements for translation quality.
Content of application
The embodiment of the application provides a training method of a translation model, a translation method and device thereof, an electronic device and a computer readable storage medium, which can improve the accuracy of the translation model for corpus translation.
The technical scheme of the embodiment of the application is realized as follows:
the embodiment of the application provides a method for training a translation model, which comprises the following steps:
carrying out forward propagation on the corpus samples in a first translation model to obtain a first confidence coefficient of a first pre-marked target word corresponding to each first position to be predicted;
determining a first to-be-predicted position with the first confidence coefficient lower than a confidence coefficient threshold value as a second to-be-predicted position of a second translation model, and determining a first pre-marked target word corresponding to the first to-be-predicted position with the first confidence coefficient not lower than the confidence coefficient threshold value as a context word corresponding to the second translation model;
carrying out forward propagation on the context words and the corpus samples in a second translation model to obtain a second confidence coefficient of a second pre-marked target word corresponding to each second position to be predicted;
and updating parameters of the first translation model and the second translation model based on the first confidence degree of the first pre-marked target word corresponding to each first position to be predicted and the second confidence degree of the second pre-marked target word corresponding to each second position to be predicted.
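As an example, the four steps above can be illustrated by the following Python sketch of one joint training step; the model interfaces (first_model, second_model), the optimizer and the joint_loss function are hypothetical names used only to show the flow of the method (one possible form of joint_loss is sketched after the loss description below), not the claimed implementation.

```python
# Illustrative sketch of one joint training step; all interfaces are assumptions.
def joint_training_step(first_model, second_model, source_words, target_words,
                        confidence_threshold, optimizer):
    # Step 1: forward propagation of the corpus sample in the first translation model;
    # first_conf[t] is the first confidence of the pre-marked target word at position t.
    first_conf = first_model.confidences(source_words, target_words)

    # Step 2: positions below the threshold become second to-be-predicted positions;
    # pre-marked target words at the remaining positions become context words.
    second_positions = [t for t, c in enumerate(first_conf) if c < confidence_threshold]
    context_words = {t: target_words[t]
                     for t, c in enumerate(first_conf) if c >= confidence_threshold}

    # Step 3: forward propagation of the context words and the corpus sample in the
    # second translation model; second_conf[t] is the second confidence at position t.
    second_conf = second_model.confidences(source_words, context_words, second_positions)

    # Step 4: update the parameters of both models based on the two sets of confidences.
    loss = joint_loss(first_conf, second_conf, second_positions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss
```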
The embodiment of the application provides a training device of a translation model, which comprises:
the first task module is used for carrying out forward propagation on the corpus sample in the first translation model to obtain a first confidence coefficient of a first pre-marked target word corresponding to each first position to be predicted;
the selection module is used for determining a first to-be-predicted position with the first confidence coefficient lower than a confidence coefficient threshold value as a second to-be-predicted position of a second translation model, and determining a first pre-marked target word corresponding to a first to-be-predicted position with the first confidence coefficient not lower than the confidence coefficient threshold value as a context word corresponding to the second translation model;
the second task module is used for carrying out forward propagation on the context words and the corpus samples in a second translation model to obtain a second confidence coefficient of a second pre-marked target word corresponding to each second position to be predicted;
and the updating module is used for updating the parameters of the first translation model and the second translation model based on the first confidence coefficient of the first pre-marked target word corresponding to each first position to be predicted and the second confidence coefficient of the second pre-marked target word corresponding to each second position to be predicted.
In the foregoing solution, the update module is further configured to: determine a first loss corresponding to the first translation model based on the first confidence; determine a second loss corresponding to the second translation model based on the first confidences lower than the confidence threshold and the second confidences of the second pre-marked target words corresponding to each second position to be predicted, wherein the second loss is used to characterize a teaching loss of the second translation model to the first translation model; perform aggregation processing on the first loss and the second loss based on aggregation parameters respectively corresponding to the first loss and the second loss to obtain a joint loss; and update the parameters of the first translation model and the second translation model according to the joint loss.
In the above scheme, the plurality of first positions to be predicted correspond one-to-one to a plurality of first pre-marked target words; the update module is further configured to: perform fusion processing on the first confidences acquired for each first to-be-predicted position to obtain the first loss corresponding to the first translation model.
In the above scheme, the plurality of second positions to be predicted correspond one-to-one to a plurality of second pre-marked target words; the update module is further configured to: perform fusion processing on the first confidences lower than the confidence threshold and the second confidences of the second pre-marked target words corresponding to each second position to be predicted, to obtain the second loss corresponding to the second translation model.
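As an example, the first loss, the second loss and their aggregation described above can be sketched as follows (PyTorch-style); the negative-log-likelihood form of the "fusion processing", the distillation form of the teaching loss, and the aggregation parameters alpha and beta are assumptions for illustration only. Here first_conf is a 1-D tensor of first confidences and second_conf is assumed to map each second to-be-predicted position to its second confidence.

```python
import torch

# Illustrative sketch; the exact fusion and aggregation forms are assumptions.
def joint_loss(first_conf, second_conf, second_positions, alpha=1.0, beta=1.0):
    # First loss: fuse the first confidences obtained for every first to-be-predicted
    # position (here: mean negative log of the confidence of the pre-marked target word).
    first_loss = -torch.log(first_conf).mean()

    # Second loss: for each second to-be-predicted position, fuse the first confidence
    # (which fell below the threshold) with the second confidence predicted by the
    # second translation model; it characterises the second model teaching the first.
    if second_positions:
        second_loss = torch.stack(
            [-second_conf[t] * torch.log(first_conf[t]) for t in second_positions]
        ).mean()
    else:
        second_loss = torch.zeros(())

    # Aggregation: weight each loss by its aggregation parameter to obtain the joint loss.
    return alpha * first_loss + beta * second_loss
```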
In the above solution, the first translation model includes a first coding network and a preamble decoding network; the first task module is further configured to: determine each original word of the corpus sample and the original word vector corresponding to each original word, and combine the original word vectors corresponding to each original word to obtain the original word vector sequence of the corpus sample; perform semantic coding processing on the original word vector sequence of the corpus sample through the first coding network to obtain a first source sentence representation corresponding to the corpus sample; and perform corpus decoding processing on the first source sentence representation through the preamble decoding network to obtain a first confidence of the first pre-marked target word corresponding to each first position to be predicted; wherein the first confidence is generated based on the preamble words corresponding to each first position to be predicted.
In the above scheme, the first coding network includes N cascaded first sub-coding networks, where N is an integer greater than or equal to 2; the first task module is further configured to: perform semantic coding processing on the original word vector sequence of the corpus sample, through the N cascaded first sub-coding networks included in the first coding network, in the following manner: performing self-attention processing on the input of the first sub-coding network to obtain a self-attention processing result corresponding to the first sub-coding network, performing hidden state mapping processing on the self-attention processing result to obtain a hidden state vector sequence corresponding to the first sub-coding network, and taking the hidden state vector sequence as the semantic coding processing result of the first sub-coding network; wherein, among the N cascaded first sub-coding networks, the input of the 1st first sub-coding network comprises the original word vector sequence of the corpus sample, and the semantic coding processing result of the Nth first sub-coding network comprises the first source sentence representation corresponding to the corpus sample.
In the foregoing solution, the first task module is further configured to: performing the following for each of the original words of the corpus sample: performing linear transformation processing on a first intermediate vector corresponding to the original word in the input of the first sub-coding network to obtain a query vector, a key vector and a value vector corresponding to the original word; performing point multiplication on the query vector of the original word and the key vector of each original word, and performing normalization processing on a point multiplication processing result based on a maximum likelihood function to obtain the weight of the value vector of the original word; and carrying out weighting processing on the value vector of the original word based on the weight of the value vector of the original word to obtain a self-attention processing result of each original word corresponding to the sub-coding network.
In the foregoing solution, the first task module is further configured to: perform the following for each first to-be-predicted position output by the preamble decoding network: acquiring a first pre-marked target word sequence corresponding to the corpus sample from a corpus sample set; extracting the first pre-marked target words located before the first position to be predicted from the first pre-marked target word sequence, and taking the extracted first pre-marked target words as the preamble words corresponding to the first position to be predicted; and performing semantic decoding processing on the preamble words corresponding to the first position to be predicted and the first source sentence representation through the preamble decoding network to obtain a first confidence of the first pre-marked target word decoded at the first position to be predicted.
In the above scheme, the preamble decoding network includes M concatenated sub-preamble decoding networks, where M is an integer greater than or equal to 2; the first task module is further configured to: perform semantic decoding processing in the following manner through each of the sub-preamble decoding networks: performing mask self-attention processing on the input of the sub-preamble decoding network to obtain a mask self-attention processing result corresponding to the sub-preamble decoding network, performing cross attention processing on the mask self-attention processing result to obtain a cross attention processing result corresponding to the sub-preamble decoding network, and performing hidden state mapping processing on the cross attention processing result; wherein, in the M concatenated sub-preamble decoding networks, the input of the first sub-preamble decoding network comprises: the preamble words corresponding to the first position to be predicted and the first source sentence representation; and the hidden state mapping processing result of the Mth sub-preamble decoding network comprises: the first confidence that the first to-be-predicted position is decoded into the corresponding first pre-marked target word.
In the foregoing solution, the first task module is further configured to: performing linear transformation processing on the mask self-attention processing result to obtain a query vector of the mask self-attention processing result; performing the following for each of the original words: performing linear transformation processing on a first source sentence representation of the original word to obtain a key vector and a value vector represented by the first source sentence; performing point multiplication on the query vector of the mask self-attention processing result and the key vector represented by the first source sentence, and performing normalization processing on the result of the point multiplication processing based on a maximum likelihood function to obtain the weight of the value vector represented by the first source sentence; and carrying out weighting processing on the value vector represented by the first source sentence based on the weight of the value vector represented by the first source sentence to obtain a cross attention processing result corresponding to the sub-preamble decoding network.
In the above solution, the second translation model includes a second coding network and a context decoding network; the second task module is further configured to: acquiring each original word of the corpus sample and an original word vector corresponding to each original word, and combining the original word vectors corresponding to each original word to obtain an original word vector sequence of the corpus sample; performing semantic coding processing on the original word vector sequence of the corpus sample through the second coding network to obtain a second source sentence representation corresponding to the corpus sample; performing corpus decoding processing on the second source sentence representation through the context decoding network to obtain a second confidence coefficient of a second pre-marked target word corresponding to each second position to be predicted; wherein the second confidence is generated based on context words corresponding to a plurality of the second positions to be predicted.
In the foregoing solution, the second task module is further configured to: performing the following for each second to-be-predicted location of the context decoding network output: and performing semantic decoding processing on the context words and the second source sentence representation through the context decoding network to obtain a second confidence coefficient of the second pre-marked target word decoded at the second position to be predicted as the corresponding second pre-marked target word.
In the above scheme, the context decoding network includes P concatenated sub-context decoding networks, where P is an integer greater than or equal to 2; the second task module is further configured to: perform semantic decoding processing on the context word set corresponding to the second position to be predicted and the second source sentence representation, through the P cascaded sub-context decoding networks, in the following manner: performing context self-attention processing on the input of the sub-context decoding network to obtain a context self-attention processing result corresponding to the sub-context decoding network; performing cross attention processing on the context self-attention processing result to obtain a cross attention processing result corresponding to the sub-context decoding network; and performing hidden state mapping processing on the cross attention processing result; wherein, in the P concatenated sub-context decoding networks, the input of the first sub-context decoding network comprises: the context word set corresponding to the second position to be predicted and the second source sentence representation; and the hidden state mapping processing result of the Pth sub-context decoding network comprises: the second confidence that the second to-be-predicted position is decoded into the corresponding second pre-marked target word.
The embodiment of the application provides a corpus translation method, which comprises the following steps:
responding to a translation request aiming at a target corpus, calling a first translation model or a second translation model to translate the target corpus to obtain a translation result aiming at the target corpus;
the first translation model and the second translation model are obtained by training according to the training method of the translation model provided by the embodiment of the application.
The embodiment of the present application provides a corpus translation device, including:
the application module is used for responding to a translation request aiming at a target corpus, calling a first translation model or a second translation model to translate the target corpus, and obtaining a translation result aiming at the target corpus; the first translation model and the second translation model are obtained by training according to the training method of the translation model provided by the embodiment of the application.
An embodiment of the present application provides an electronic device, including:
a memory for storing executable instructions;
and the processor is used for implementing the training method or the corpus translation method of the translation model provided by the embodiment of the application when the executable instructions stored in the memory are executed.
An embodiment of the present application provides a computer-readable storage medium, which stores executable instructions and is configured to, when executed by a processor, implement a training method or a corpus translation method of a translation model provided in an embodiment of the present application.
The embodiment of the application has the following beneficial effects:
the method has the advantages that the characteristics and the confidence coefficient of the translation task of one neural network model are utilized to assist in training the other neural network model in a targeted manner, the second translation model utilizes the context word set to translate, so that the bidirectional global context information based on the context words is effectively introduced through the context word set, and the position, corresponding to the confidence coefficient, of the target end of the second translation model is the first translation model in a targeted manner through the confidence coefficient threshold value, so that the first translation model after joint training can utilize the local context information of the corresponding preposition at each position to be predicted during translation, can also utilize the global context information in a targeted manner, and further effectively improve the accuracy of translation of the first translation model.
Drawings
FIG. 1 is a schematic structural diagram of a corpus translation system according to an embodiment of the present application;
FIG. 2 is a schematic structural diagram of a server 200 provided in an embodiment of the present application;
FIGS. 3A-3D are schematic flow diagrams of a method for training a translation model provided by an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a joint training model of a training method of a translation model provided in an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a first translation model provided by an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a sub-coding network provided in an embodiment of the present application;
FIG. 7 is a block diagram illustrating a sub-preamble decoding network according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a sub-context decoding network provided in an embodiment of the present application;
FIG. 9 is a confidence distribution diagram provided by an embodiment of the present application.
Detailed Description
In order to make the objectives, technical solutions and advantages of the present application clearer, the present application will be described in further detail below with reference to the attached drawings. The described embodiments should not be considered as limiting the present application, and all other embodiments obtained by a person of ordinary skill in the art without creative effort shall fall within the protection scope of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
In the following description, the terms "first \ second \ third" are only used to distinguish similar objects and do not denote a particular order; it is understood that "first \ second \ third" may be interchanged in a specific order or sequence where permitted, so that the embodiments of the application described herein can be implemented in an order other than that illustrated or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.
Before further detailed description of the embodiments of the present application, terms and expressions referred to in the embodiments of the present application will be described, and the terms and expressions referred to in the embodiments of the present application will be used for the following explanation.
1) Neural machine translation: neural machine translation (NMT) is a machine translation method proposed in recent years. Compared with traditional statistical machine translation, NMT can train a neural network that maps one sequence to another sequence and can output a sequence of variable length, so it achieves excellent performance in translation, dialogue and text summarization.
2) Word embedding vector: an important concept in natural language processing technology; a word embedding vector converts a word into a fixed-length vector representation, which facilitates mathematical processing.
In the related art, a backward decoder for backward decoding is introduced at the target end of a translation model. The backward decoder first generates a hidden state vector sequence from right to left, and the forward decoder then performs decoding processing from left to right using that right-to-left hidden state vector sequence, so that the forward decoding process can take subsequent target-end information into account to improve translation quality. However, because the related art only introduces additional backward global context information for the forward decoder at each decoding moment of the target end through the backward decoder, the backward global context information and the local context information of the preamble words are practically independent of each other; the translation model therefore cannot effectively and comprehensively consider the reverse global context information together with the local context information of the preamble words, so the translation quality cannot be effectively improved. In addition, the related art does not consider the confidence with which the first translation model predicts the first pre-marked target word, so that during joint training extra bidirectional global context information is introduced at decoding moments where it is not necessarily needed.
The embodiments of the application provide a training method and apparatus for a translation model, an electronic device, and a computer-readable storage medium, which can, through confidence-based knowledge distillation, purposefully introduce bidirectional global context information (based on context words) at the target-end positions of a neural machine translation model where the prediction confidence of the target answer is low, so that the translation model can use not only the local context information of the corresponding preamble words but also the global context information of the corresponding context words when predicting each position, thereby improving translation performance. In the following, an exemplary application is described in which the electronic device is implemented as a server.
Referring to fig. 1, fig. 1 is a schematic structural diagram of a corpus translation system, which may be used in a social scenario and is provided in an embodiment of the present application, in which a terminal 400 is connected to a server 200 through a network 300, where the network may be a wide area network or a local area network, or a combination of the two.
In some embodiments, the functions of the corpus translation system are implemented based on the modules in the server 200. While a user uses the terminal 400, the terminal 400 collects corpus samples and sends them to the server 200, and the server 200 performs joint training, based on multiple tasks and confidences, of the translation models (the first translation model and the second translation model); the trained first translation model or second translation model is integrated in the server 200. In response to the terminal 400 receiving a translation operation for a corpus signal in a social client, the terminal 400 sends the corpus signal to the server 200, and the server 200 determines the corpus translation result of the corpus signal through the translation model and sends the corpus translation result to the terminal 400, so that the terminal 400 directly presents the corpus translation result.
In some embodiments, when the corpus translation system is applied to a social scenario, the terminal 400 receives a corpus signal sent by another terminal. In response to the terminal 400 receiving a translation operation for the corpus signal, the terminal 400 sends the corpus signal to the server 200, the server 200 determines the corpus translation result of the corpus signal through the translation model and sends the corpus translation result to the terminal 400, so that the terminal 400 directly presents the corpus translation result. For example, the terminal 400 receives a corpus signal (a source-language sentence meaning "where are you") sent by another terminal; the terminal 400 sends the corpus signal to the server 200, the server 200 determines through the translation model that the corpus translation result of the corpus signal is "where are you" and sends the corpus translation result to the terminal 400, so that the terminal 400 directly presents the corpus translation result "where are you".
In some embodiments, when the corpus translation system is applied to a web browsing scenario, the terminal 400 presents an English web page, and in response to the terminal 400 receiving a translation operation for a corpus signal in the English web page, the terminal 400 sends the corpus signal to the server 200, and the server 200 determines a corpus translation result of the corpus signal through a translation model and sends the corpus translation result to the terminal 400, so that the terminal 400 directly presents the corpus translation result.
In other embodiments, after completing the training process of the translation model, the server 200 sends the translation model to the terminal 400, so that the terminal 400 runs the translation model that is jointly trained to determine the corpus translation result of the corpus signal and present the corpus translation result.
In some embodiments, the server 200 may be an independent physical server, may also be a server cluster or a distributed system formed by a plurality of physical servers, and may also be a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, a middleware service, a domain name service, a security service, a CDN, and a big data and artificial intelligence platform. The terminal may be, but is not limited to, a smart phone, a tablet computer, a laptop computer, a desktop computer, a smart speaker, a smart watch, and the like. The terminal and the server may be directly or indirectly connected through wired or wireless communication, and the embodiment of the present application is not limited.
Next, a structure of an electronic device for implementing the training method of the translation model according to the embodiment of the present application is described, and as described above, the electronic device according to the embodiment of the present application may be the server 200 in fig. 1. Referring to fig. 2, fig. 2 is a schematic structural diagram of a server 200 according to an embodiment of the present application, where the server 200 shown in fig. 2 includes: at least one processor 210, memory 250, at least one network interface 220. The various components in server 200 are coupled together by a bus system 240. It is understood that the bus system 240 is used to enable communications among the components. The bus system 240 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus system 240 in fig. 2.
The processor 210 may be an integrated circuit chip having signal processing capabilities, such as a general purpose processor, a digital signal processor (DSP), or another programmable logic device, discrete gate or transistor logic device, discrete hardware component, or the like, where the general purpose processor may be a microprocessor or any conventional processor, or the like.
The memory 250 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard disk drives, optical disk drives, and the like. Memory 250 optionally includes one or more storage devices physically located remotely from processor 210.
The memory 250 includes volatile memory or nonvolatile memory, and may include both volatile and nonvolatile memory. The nonvolatile memory may be a Read Only Memory (ROM), and the volatile memory may be a Random Access Memory (RAM). The memory 250 described in embodiments herein is intended to comprise any suitable type of memory.
In some embodiments, memory 250 is capable of storing data, examples of which include programs, modules, and data structures, or a subset or superset thereof, to support various operations, as exemplified below.
An operating system 251 including system programs for processing various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and processing hardware-based tasks; a network communication module 252 for communicating with other computing devices via one or more (wired or wireless) network interfaces 220, exemplary network interfaces 220 including: Bluetooth, Wireless Fidelity (WiFi), and Universal Serial Bus (USB), among others.
In some embodiments, the training apparatus for the translation model provided in the embodiments of the present application may be implemented in software. Fig. 2 shows the training apparatus 255 for the translation model stored in the memory 250, which may be software in the form of programs, plug-ins and the like, and includes the following software modules: a first task module 2551, a selection module 2552, a second task module 2553 and an update module 2554. Fig. 2 also shows the corpus translation apparatus 256 stored in the memory 250, which may likewise be software in the form of programs, plug-ins and the like, and includes the following software module: an application module 2555. These apparatuses may also be installed in the terminal 400. The modules are logical, and therefore may be arbitrarily combined or further divided according to the functions implemented; the functions of the respective modules are described hereinafter.
The method for training the translation model provided by the embodiment of the present application will be described in conjunction with an exemplary application and implementation of the server 200 provided by the embodiment of the present application.
Referring to fig. 4, fig. 4 is a schematic structural diagram of a joint training model of the training method of a translation model according to an embodiment of the present application. The first translation model is first used to predict, at each first position to be predicted and conditioned on the completely correct preamble words corresponding to that position, a first probability distribution p(y_t | y_<t, x) of the first pre-marked target word y_t at that position. The input to the coding network of the first translation model is the original word vector sequence x, and the input to the preamble decoding network of the first translation model is the sequence of preamble words corresponding to each first pre-marked target word; for example, for the first pre-marked target word y6, the preamble word sequence is the BOS symbol together with the first pre-marked target words y1-y5. Given a confidence threshold, the first to-be-predicted positions whose first probability (first confidence) is smaller than the confidence threshold are taken as the occluded subset ym that is subsequently input into the second translation model, for example the first pre-marked target words y2, y3 and y5. When input into the second translation model, the occluded subset ym is invisible and replaced by the mask symbol M, while the remaining first pre-marked target words y1, y4 and y6 appear as the partially visible sequence yo that is input into the second translation model; the partially visible sequence yo of the second translation model is thus determined by the first translation model. Given the source sentence x and the partially visible sequence yo, the input of the coding network of the second translation model is the original word vector sequence x, and the pre-trained second translation model predicts each word yt in the occluded target word subset ym, obtaining the corresponding prediction probability distribution q(y_t | y_o, x) as the second confidence (e.g., q2, q3 and q5). Then, for the second to-be-predicted positions at the target end of the second translation model, which are the first to-be-predicted positions with a first confidence lower than the confidence threshold, bidirectional global context information (based on the context words) is introduced into the first translation model in a targeted manner by means of knowledge distillation; the first pre-marked target words at the other positions, which do not belong to ym, are still trained using the first loss function of the first translation model.
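As an example, the selection step illustrated in fig. 4 can be sketched as follows in Python; the probability values, the "MASK" placeholder and the threshold are hypothetical and only illustrate the split of the target sequence.

```python
# Sketch of the Fig. 4 selection step: y is the pre-marked target word sequence,
# p[t] the first probability (first confidence) the first model assigns to y[t].
def split_target(y, p, confidence_threshold):
    y_m, y_o = [], []           # occluded subset / partially visible sequence
    masked_positions = []
    for t, (word, prob) in enumerate(zip(y, p)):
        if prob < confidence_threshold:
            y_m.append(word)
            y_o.append("MASK")   # invisible to the second translation model
            masked_positions.append(t)
        else:
            y_o.append(word)     # visible context word
    return y_o, y_m, masked_positions

# Example matching the description: y2, y3 and y5 fall below the threshold.
y = ["y1", "y2", "y3", "y4", "y5", "y6"]
p = [0.9, 0.3, 0.2, 0.8, 0.4, 0.95]
y_o, y_m, pos = split_target(y, p, confidence_threshold=0.5)
# y_o == ["y1", "MASK", "MASK", "y4", "MASK", "y6"], y_m == ["y2", "y3", "y5"]
```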
Referring to fig. 5, fig. 5 is a schematic structural diagram of a first translation model provided in an embodiment of the present application. The first translation model includes a coding network and a preamble decoding network; the coding network includes a plurality of sub-coding networks (encoders), and the preamble decoding network includes the same number of sub-preamble decoding networks (decoders), one corresponding to each encoder. The input of the coding network is the original word vectors of the corpus sample, and the coding network outputs the source sentence representation of the corpus sample; the input of the preamble decoding network (and of each sub-preamble decoding network) is the source sentence representation of the corpus sample together with the preamble words, and the output of the preamble decoding network is the first pre-marked target words of the corpus sample. For example, the corpus sample is a source sentence meaning "where are you", and the first pre-marked target words are "where are you".
Referring to fig. 6, fig. 6 is a schematic structural diagram of a first sub-coding network provided in an embodiment of the present application. The first sub-coding network includes a self-attention processing layer and a feedforward processing layer: the self-attention processing layer performs self-attention processing on the input to this layer (e.g., x1, x2 and x3) to obtain the corresponding self-attention processing results (e.g., z1, z2 and z3), and the feedforward processing layer performs hidden state mapping processing on the self-attention processing results to obtain the corresponding hidden state vector sequence.
Referring to fig. 7, fig. 7 is a schematic structural diagram of a sub-preamble decoding network provided in an embodiment of the present application, where the sub-preamble decoding network includes a mask self-attention processing layer, a cross-attention processing layer, and a feed-forward processing layer, and performs mask self-attention processing on an input of the sub-preamble decoding network through the mask self-attention processing layer to obtain a corresponding mask self-attention processing result, performs cross-attention processing on the mask self-attention processing result through the cross-attention processing layer to obtain a corresponding cross-attention processing result, and performs hidden state mapping processing on the cross-attention processing result through the feed-forward processing layer, where an input of the sub-preamble decoding network is at least one preamble word of each first position to be predicted, and outputs first pre-labeled target words of different first positions to be predicted simultaneously through parallel processing.
Referring to fig. 8, fig. 8 is a schematic structural diagram of a sub-context decoding network provided in an embodiment of the present application, where the sub-context decoding network includes a context self-attention processing layer, a cross-attention processing layer, and a feed-forward processing layer. The structure of the context decoding network shown in fig. 8 is similar to that of the preamble decoding network, except that the sub-context decoding network includes a context self-attention processing layer that differs from the mask self-attention processing layer in fig. 7; the input of the context decoding network is the set of context words, and its output is the second pre-marked target word at each second position to be predicted.
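As an example, the following Python sketch illustrates how the partially visible sequence can be turned into the input of the context decoding network; the embedding interfaces and the "MASK" placeholder are assumptions for illustration, not the claimed structure.

```python
# Illustrative sketch: visible context words keep their word embedding, occluded
# positions share a mask embedding, and all second to-be-predicted positions are
# then predicted in parallel by the context decoding network.
def build_context_decoder_input(partially_visible_sequence, word_embedding, mask_embedding):
    inputs = []
    for word in partially_visible_sequence:
        if word == "MASK":                       # second to-be-predicted position
            inputs.append(mask_embedding)
        else:                                    # visible context word
            inputs.append(word_embedding[word])
    # Together with the second source sentence representation, this sequence is fed to
    # the context decoding network; its outputs at the masked positions yield the second
    # confidences of the corresponding second pre-marked target words.
    return inputs
```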
Referring to fig. 3A, fig. 3A is a flowchart illustrating a method for training a translation model according to an embodiment of the present application, and will be described with reference to steps 101-104 shown in fig. 3A.
In step 101, the corpus sample is propagated in the forward direction in the first translation model, so as to obtain a first confidence of the first pre-marked target word corresponding to each first position to be predicted.
As an example, when a corpus sample is forward propagated in a first translation model, the corpus sample needs to be sequentially processed by an encoding network and a preamble decoding network, wherein when the corpus sample is decoded by the preamble decoding network, the confidence prediction of a first pre-marked target word needs to be performed on each first to-be-predicted position in a parallel manner.
In some embodiments, before the corpus sample is forward propagated in the first translation model, the corpus sample may first be forward propagated in the first translation model separately to obtain a first forward propagation result of the corpus sample; the first forward propagation result of the corpus sample is back propagated in the first translation model to update the parameters of the first translation model, and the updated first translation model is taken as the first translation model used to process the corpus sample in step 101. Likewise, before the corpus sample is forward propagated in the second translation model, the corpus sample is forward propagated in the second translation model to obtain a second forward propagation result of the corpus sample; the second forward propagation result of the corpus sample is back propagated in the second translation model to update the parameters of the second translation model, and the updated second translation model is taken as the second translation model used to process the corpus sample in step 101.
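As an example, the separate pre-training of either translation model described above may be sketched as follows (a PyTorch-style sketch; the model is assumed, for illustration only, to return its training loss directly from the forward pass, and the optimizer, learning rate and number of epochs are hypothetical).

```python
import torch

# Illustrative pre-training loop; model interface and hyper-parameters are assumptions.
def pretrain(model, corpus_samples, epochs=1, lr=1e-4):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for source_words, target_words in corpus_samples:
            loss = model(source_words, target_words)   # forward propagation result
            optimizer.zero_grad()
            loss.backward()                            # backward propagation
            optimizer.step()                           # parameter update
    return model                                       # used as the model in step 101
```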
Referring to fig. 3B, fig. 3B is a schematic flowchart of a training method for a translation model provided in an embodiment of the present application, where the first translation model includes a first coding network and a preamble decoding network; in step 101, the corpus sample is propagated forward in the first translation model to obtain a first confidence of the first pre-marked target word corresponding to each first to-be-predicted position, which may be implemented through steps 1011 to 1013 shown in fig. 3B.
In step 1011, each original word of the corpus sample and the original word vector corresponding to each original word are determined, and the original word vectors corresponding to each original word are combined to obtain an original word vector sequence of the corpus sample.
As an example, in natural language processing tasks it is necessary to consider how words are represented in a computer. There are generally two ways of representation, one-hot encoding and distributed encoding, by which the original word vector of each original word is obtained; for example, for the source sentence meaning "where are you", there are four original words and correspondingly four original word vectors.
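As an example, the distributed-encoding approach of step 1011 may be sketched as follows with a learned embedding table; the vocabulary, embedding dimension and token names are hypothetical.

```python
import torch
import torch.nn as nn

# Hypothetical vocabulary and embedding table for a four-word source sentence.
vocab = {"<pad>": 0, "w1": 1, "w2": 2, "w3": 3, "w4": 4}
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)

original_words = ["w1", "w2", "w3", "w4"]                 # the four original words
ids = torch.tensor([vocab[w] for w in original_words])
# Each original word is mapped to an original word vector; combining (stacking) the
# vectors in order yields the original word vector sequence of the corpus sample.
original_word_vector_sequence = embedding(ids)            # shape: (4, 8)
```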
In step 1012, semantic encoding is performed on the original word vector sequence of the corpus sample through the first encoding network to obtain a first source sentence representation corresponding to the corpus sample.
In some embodiments, the first coding network includes N cascaded first sub-coding networks, where N is an integer greater than or equal to 2, and the semantic coding processing is performed on the original word vector sequence of the corpus sample through the first coding network to obtain the first source sentence representation of the corresponding corpus sample, which may be implemented by the following technical solution: through N cascaded first sub-coding networks included in the first coding network, performing semantic coding processing on an original word vector sequence of a corpus sample in the following mode: performing self-attention processing on the input of the first sub-coding network to obtain a self-attention processing result corresponding to the first sub-coding network, performing hidden state mapping processing on the self-attention processing result to obtain a hidden state vector sequence corresponding to the first sub-coding network, and taking the hidden state vector sequence as a semantic coding processing result of the first sub-coding network; in the N cascaded first sub-coding networks, the input of the first sub-coding network includes an original word vector sequence of the corpus sample, and the semantic coding processing result of the nth first sub-coding network includes a first source sentence representation of the corresponding corpus sample.
As an example, through the nth first sub-coding network of the N cascaded first sub-coding networks, semantic coding processing is performed on the input of the nth first sub-coding network, and the nth semantic coding processing result output by the nth first sub-coding network is transmitted to the (n+1)th first sub-coding network to continue the semantic coding processing, so as to obtain the (n+1)th semantic coding processing result; where n is an integer variable that increases from 1 and satisfies 1 ≤ n < N. When n is 1, the input of the nth first sub-coding network is the original word vector sequence of the corpus sample; when 2 ≤ n < N, the input of the nth first sub-coding network is the (n-1)th semantic coding processing result output by the (n-1)th first sub-coding network; and when n is N-1, the output of the (n+1)th first sub-coding network is the first source sentence representation of the corpus sample.
As an example, each of the first sub-coding networks includes a self-attention processing layer and a feedforward processing layer. Semantic coding processing of the input of the nth first sub-coding network, by the nth of the N cascaded first sub-coding networks, can be implemented by the following technical scheme: performing self-attention processing on the input of the nth first sub-coding network through the self-attention processing layer of the nth first sub-coding network to obtain the self-attention processing result corresponding to the nth first sub-coding network; and transmitting the self-attention processing result corresponding to the nth first sub-coding network to the feedforward processing layer of the nth first sub-coding network for hidden state mapping processing, so as to obtain the hidden state vector sequence corresponding to the nth first sub-coding network as the nth semantic coding processing result output by the nth first sub-coding network. Referring to FIG. 6, the first sub-coding network includes a self-attention processing layer and a feedforward processing layer: the self-attention processing layer performs self-attention processing on the input to the layer (e.g., x1, x2 and x3) to obtain the corresponding self-attention processing results (e.g., z1, z2 and z3), and the feedforward processing layer performs hidden state mapping processing on the self-attention processing results to obtain the corresponding hidden state vector sequence.
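As an example, the first coding network with its N cascaded first sub-coding networks can be sketched as follows in PyTorch; the module names, layer count and dimensions are assumptions, and the splicing/regularization between layers described further below (as well as layer normalization) is omitted for brevity.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """Simplified first sub-coding network: self-attention + feed-forward."""
    def __init__(self, d_model=64, n_heads=4, d_ff=256):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

    def forward(self, x):                    # x: (batch, seq_len, d_model)
        z, _ = self.self_attn(x, x, x)       # self-attention processing result
        return self.ffn(z)                   # hidden state vector sequence

class FirstCodingNetwork(nn.Module):
    """N cascaded first sub-coding networks."""
    def __init__(self, num_layers=6, d_model=64):
        super().__init__()
        self.layers = nn.ModuleList(EncoderLayer(d_model) for _ in range(num_layers))

    def forward(self, original_word_vectors):
        h = original_word_vectors            # input of the 1st first sub-coding network
        for layer in self.layers:            # nth output feeds the (n+1)th sub-network
            h = layer(h)
        return h                             # first source sentence representation
```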
In some embodiments, the self-attention processing is performed on the input of the first sub-coding network to obtain a self-attention processing result corresponding to the first sub-coding network, and the self-attention processing result may be implemented by the following technical solutions: the following processing is performed for each original word of the corpus sample: performing linear transformation processing on a first intermediate vector corresponding to an original word in the input of the first sub-coding network to obtain a query vector, a key vector and a value vector corresponding to the original word; performing point multiplication on the query vector of the original word and the key vector of each original word, and performing normalization processing on the result of the point multiplication processing based on a maximum likelihood function to obtain the weight of the value vector of the original word; and carrying out weighting processing on the value vector of the original word based on the weight of the value vector of the original word to obtain a self-attention processing result of each original word corresponding to the sub-coding network.
As an example, when the first sub-coding network is the first of the plurality of first sub-coding networks, the first intermediate vector is each original word vector; when it is not the first of the plurality of first sub-coding networks, the first intermediate vector is the hidden state vector output by the previous first sub-coding network. The first intermediate vectors correspond one-to-one to the original words. The linear transformation processing actually multiplies the first intermediate vector by three parameter matrices respectively to obtain the query vector Q, the key vector K and the value vector V corresponding to the first intermediate vector. The point multiplication of the query vector of an original word with the key vector of each original word includes point multiplication with its own key vector and point multiplication with the key vectors of the other original words, and the result of the point multiplication represents the degree of correlation; the obtained degrees of correlation therefore include the degree of autocorrelation as well as the degrees of correlation with the other original words. Before the normalization processing based on the maximum likelihood function, the degree of correlation may be divided by the square root of the length of the key vector; the normalization processing based on the maximum likelihood function means that the degree of correlation, or the result obtained after dividing it by the square root of the key vector length, is substituted into a Softmax function, so as to obtain the contribution weight of each original word to a given original word. The value vectors are then weighted by the obtained contribution weights of each original word to the given original word, so that the context correlation corresponding to the given original word is obtained, which benefits the translation accuracy of the subsequent translation model. The three parameter matrices are obtained through training.
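As an example, the query/key/value computation described above can be sketched as follows; the dimensions and random weights are assumptions for illustration, and the three parameter matrices W_q, W_k, W_v stand for the trainable matrices mentioned in the text.

```python
import torch

def self_attention(first_intermediate_vectors, W_q, W_k, W_v):
    # Linear transformation of each original word's first intermediate vector.
    Q = first_intermediate_vectors @ W_q   # query vectors, (seq_len, d_k)
    K = first_intermediate_vectors @ W_k   # key vectors,   (seq_len, d_k)
    V = first_intermediate_vectors @ W_v   # value vectors, (seq_len, d_k)

    # Point multiplication of each query with every key (self-correlation included),
    # divided by the square root of the key vector length.
    scores = Q @ K.transpose(0, 1) / K.shape[-1] ** 0.5

    # Softmax normalization turns the correlations into weights of the value vectors.
    weights = torch.softmax(scores, dim=-1)

    # Weighted sum of the value vectors: the self-attention result for each original word.
    return weights @ V

d_model, d_k = 8, 8
x = torch.randn(4, d_model)                       # four original words (hypothetical)
W_q, W_k, W_v = (torch.randn(d_model, d_k) for _ in range(3))
z = self_attention(x, W_q, W_k, W_v)              # shape: (4, d_k)
```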
In step 1013, the corpus decoding process is performed on the first source sentence representation through the preamble decoding network to obtain a first confidence of the first pre-marked target word corresponding to each first position to be predicted.
As an example, the first confidence is generated based on the preamble words corresponding to each first position to be predicted, and the generation of the first confidences is a parallel process. Taking the 5th first position to be predicted as an example, the first confidence that the 5th first position to be predicted is translated into the corresponding first pre-marked target word is predicted based on the start symbol B and the first pre-marked target words corresponding to the 1st through 4th first positions to be predicted; the first confidence may be, for example, the first probability or the entropy of the probability distribution. For example, the corpus sample should be translated into "where are you", so there are three first positions to be predicted; the 2nd first position to be predicted should be translated into the corresponding first pre-marked target word "are", so the first probability that the 2nd first position to be predicted is translated into "are" is obtained based on the preamble words "B" and "where". If the first translation model performs well, this first probability should exceed a probability threshold, indicating that the first translation model has a high likelihood of translating correctly.
In some embodiments, the foregoing corpus decoding processing performed on the first source sentence representation through the preamble decoding network to obtain the first confidence of the first pre-marked target word corresponding to each first position to be predicted may be implemented by the following technical solution: performing the following for each first to-be-predicted position output by the preamble decoding network: acquiring the first pre-marked target word sequence corresponding to the corpus sample from a corpus sample set; extracting the first pre-marked target words located before the first position to be predicted from the first pre-marked target word sequence, and taking the extracted first pre-marked target words as the preamble words corresponding to the first position to be predicted; and performing semantic decoding processing on the preamble words corresponding to the first position to be predicted and the first source sentence representation through the preamble decoding network to obtain the first confidence that the first position to be predicted is decoded into the corresponding first pre-marked target word.
As an example, the corpus sample is the source sentence "where you are", and the corresponding first pre-marked target word sequence "B where are you E" is acquired from the corpus sample set, where B is a start symbol and E is an end symbol, so there are three first to-be-predicted positions. For any first to-be-predicted position, the first pre-marked target words located before that position are extracted from the first pre-marked target word sequence; for the 2nd first to-be-predicted position, the first pre-marked target words before it are "B" and "where", and the extracted first pre-marked target words are taken as the preamble words corresponding to the 2nd first to-be-predicted position. The preamble words corresponding to the 2nd first to-be-predicted position and the first source sentence representation are then taken as the input of the preamble decoding network for semantic decoding processing, obtaining the first confidence that the 2nd first to-be-predicted position is decoded into the corresponding first pre-marked target word "are".
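Purely for illustration, extracting the preamble words for every first to-be-predicted position from a first pre-marked target word sequence could be sketched as follows; the function name is hypothetical, and the symbols "B" and "E" follow the example above.

```python
def preamble_words(pre_marked_sequence):
    # pre_marked_sequence: e.g. ["B", "where", "are", "you", "E"]
    # returns, for each first to-be-predicted position, the pre-marked target
    # words located before that position (its preamble words)
    targets = pre_marked_sequence[1:-1]              # words between start symbol B and end symbol E
    return [pre_marked_sequence[:i + 1] for i in range(len(targets))]

# For "B where are you E", the 2nd first to-be-predicted position sees ["B", "where"].
print(preamble_words(["B", "where", "are", "you", "E"]))
```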
As an example, in the N cascaded first sub-coding networks, when n is greater than or equal to 2 and n < N, before the input of the nth first sub-coding network is subjected to self-attention processing through the self-attention layer of the nth first sub-coding network, the regularization processing result of the (n-1)th semantic coding processing result and the input of the (n-1)th first sub-coding network may be subjected to splicing processing, and the splicing processing result is taken as the input of the self-attention layer of the nth first sub-coding network, instead of directly taking the (n-1)th semantic coding processing result as the input of the nth first sub-coding network. Similarly, before the self-attention processing result corresponding to the nth first sub-coding network is transmitted to the feedforward processing layer of the nth first sub-coding network for hidden state mapping processing, the regularization processing result corresponding to the self-attention processing result of the nth first sub-coding network and the input of the nth first sub-coding network may be subjected to splicing processing, and the splicing processing result is taken as the input of the feedforward processing layer of the nth first sub-coding network, instead of directly taking the self-attention processing result corresponding to the nth first sub-coding network as the input of the feedforward processing layer.
In some embodiments, the preamble decoding network comprises M concatenated sub-preamble decoding networks, M being an integer greater than or equal to 2; the semantic decoding processing of the preamble words corresponding to the first to-be-predicted position and the first source sentence representation through the preamble decoding network can be realized by the following technical solution: performing semantic decoding processing in the following manner through each sub-preamble decoding network: performing mask self-attention processing on the input of the sub-preamble decoding network to obtain a mask self-attention processing result corresponding to the sub-preamble decoding network, performing cross attention processing on the mask self-attention processing result to obtain a cross attention processing result corresponding to the sub-preamble decoding network, and performing hidden state mapping processing on the cross attention processing result; wherein, among the M concatenated sub-preamble decoding networks, the input of the first sub-preamble decoding network comprises: the preamble words corresponding to the first to-be-predicted position and the first source sentence representation; and the hidden state mapping processing result of the Mth sub-preamble decoding network comprises: the first confidence that the first to-be-predicted position is decoded into the corresponding first pre-marked target word.
By way of example, the preamble decoding network comprises M concatenated sub-preamble decoding networks, M being an integer greater than or equal to 2; semantic decoding processing is performed on the input of the mth sub-preamble decoding network through the mth sub-preamble decoding network of the M cascaded sub-preamble decoding networks, and the mth semantic decoding processing result output by the mth sub-preamble decoding network is transmitted to the (m+1)th sub-preamble decoding network to continue the semantic decoding processing, obtaining the (m+1)th semantic decoding processing result; when the value of m is greater than or equal to 2 and m < M, the input of the mth sub-preamble decoding network is the (m-1)th semantic decoding processing result output by the (m-1)th sub-preamble decoding network, and when the value of m is M-1, the output of the (m+1)th sub-preamble decoding network is the first confidence.
As an example, the following processing is performed for the second intermediate vector corresponding to each preamble word: when the sub-preamble decoding network is the first decoding network, the second intermediate vectors are the word vectors of the first pre-marked target words serving as preamble words; when the sub-preamble decoding network is not the first decoding network, the second intermediate vectors are the output of the previous sub-preamble decoding network; the second intermediate vectors correspond one-to-one to the first pre-marked target words serving as preamble words. The second intermediate vector of each preamble word is subjected to linear transformation processing to obtain the key vector and the value vector of that preamble word; the second intermediate vector of the last preamble word is subjected to linear transformation processing to obtain the query vector of the last preamble word; point multiplication processing is performed on the query vector and the key vector of each preamble word, and normalization processing based on the maximum likelihood function is performed on the point multiplication results to obtain the weight of the value vector of each preamble word; and the value vectors of the preamble words are weighted based on these weights to obtain the mask attention processing result of the last preamble word corresponding to the sub-preamble decoding network.
In some embodiments, the above cross attention processing on the mask self-attention processing result to obtain the cross attention processing result of the corresponding sub-preamble decoding network may be implemented by the following technical solutions: performing linear transformation processing on the mask self-attention processing result to obtain a query vector of the mask self-attention processing result; the following processing is performed for each original word: performing linear transformation processing on a first source sentence representation of an original word to obtain a key vector and a value vector represented by the first source sentence; performing point multiplication on the query vector masked from the attention processing result and the key vector represented by the first source sentence, and performing normalization processing on the result of the point multiplication processing based on a maximum likelihood function to obtain the weight of the value vector represented by the first source sentence; and carrying out weighting processing on the value vector represented by the first source sentence based on the weight of the value vector represented by the first source sentence to obtain a cross attention processing result corresponding to the sub-preamble decoding network.
As an example, the mask self-attention processing result is a vector for the last preamble word; for example, when decoding prediction is performed for the 2nd first to-be-predicted position, linear transformation processing is performed on the mask attention processing result corresponding to "where" to obtain the query vector of the mask self-attention processing result. The first source sentence representation is input into the decoding network, and the following processing is performed for each original word: linear transformation processing is performed on the first source sentence representation of the original word to obtain the key vector and the value vector of the first source sentence representation; point multiplication processing is performed on the query vector of the mask self-attention processing result and the key vector corresponding to each original word in the first source sentence representation, and normalization processing based on the maximum likelihood function is performed on the point multiplication results to obtain the weight of the value vector corresponding to each original word in the first source sentence representation; and the value vectors of the first source sentence representation are weighted based on these weights to obtain the cross attention processing result corresponding to the sub-preamble decoding network. Feed-forward processing is then performed on the cross attention processing result, and the obtained hidden state vector, together with those of the other preamble words (the preamble words other than the last one, such as the start symbol "B"), undergoes the above self-attention processing.
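Purely for illustration, the cross attention step described above could be sketched as follows, reusing NumPy and a softmax helper as in the earlier sketch; q stands for the mask self-attention result of the last preamble word, S for the first source sentence representation (one row per original word), and the parameter matrices are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def cross_attention(q, S, W_q, W_k, W_v):
    # q: (d_model,) mask self-attention result for the last preamble word
    # S: (src_len, d_model) first source sentence representation
    query = q @ W_q                                       # query vector of the mask self-attention result
    K, V = S @ W_k, S @ W_v                               # key / value vectors of the source sentence representation
    weights = softmax(K @ query / np.sqrt(K.shape[-1]))   # weights of the value vectors
    return weights @ V                                    # weighted value vectors = cross attention result
```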
In step 102, the first to-be-predicted position with the first confidence coefficient lower than the confidence coefficient threshold is determined as the second to-be-predicted position of the second translation model, and the first pre-marked target word corresponding to the first to-be-predicted position with the first confidence coefficient not lower than the confidence coefficient threshold is determined as the contextual word corresponding to the second translation model.
By way of example, for the three first to-be-predicted positions in the target word sequence "B where are you E": for the 1st first to-be-predicted position, the first confidence output by the preamble decoding network for the corresponding first pre-marked target word "where" is 0.4; for the 2nd first to-be-predicted position, the first confidence for the corresponding first pre-marked target word "are" is 0.8; and for the 3rd first to-be-predicted position, the first confidence for the corresponding first pre-marked target word "you" is 0.3. If the confidence threshold is 0.6, the 1st and 3rd first to-be-predicted positions are determined as second to-be-predicted positions of the second translation model, and the first pre-marked target word "are" corresponding to the 2nd first to-be-predicted position is a context word of the second translation model. That is, the task of the second translation model is to predict the translation results of the 1st and 3rd positions (the two second to-be-predicted positions) under the condition that the second pre-marked target word "are" at the 2nd position is visible.
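Purely for illustration, and under the assumptions of the example above (first confidences 0.4, 0.8 and 0.3 with a confidence threshold of 0.6), the division into second to-be-predicted positions and context words, together with the partially masked input sequence for the second translation model, could be sketched as follows; the "[M]" placeholder and all names are illustrative.

```python
def split_by_confidence(target_words, first_confidences, threshold=0.6, mask="[M]"):
    # target_words: first pre-marked target words, e.g. ["where", "are", "you"]
    # first_confidences: first confidence output by the preamble decoding network per position
    second_positions, context_words, masked_input = [], {}, []
    for pos, (word, conf) in enumerate(zip(target_words, first_confidences)):
        if conf < threshold:                         # low confidence: re-predicted by the second translation model
            second_positions.append(pos)
            masked_input.append(mask)
        else:                                        # high confidence: visible context word
            context_words[pos] = word
            masked_input.append(word)
    return second_positions, context_words, masked_input

print(split_by_confidence(["where", "are", "you"], [0.4, 0.8, 0.3]))
# -> ([0, 2], {1: 'are'}, ['[M]', 'are', '[M]'])
```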
In step 103, the context words and the corpus samples are propagated in the forward direction in the second translation model, so as to obtain a second confidence of a second pre-marked target word corresponding to each second to-be-predicted position.
As an example, when the corpus sample and the context words are forward-propagated in the second translation model, they are processed by the second coding network and the context decoding network in sequence; when decoding is performed by the context decoding network, the confidence prediction of the second pre-marked target words is performed for all second to-be-predicted positions at the same time.
Referring to fig. 3C, fig. 3C is a schematic flowchart of a training method for a translation model provided in an embodiment of the present application, where the second translation model includes a second coding network and a context decoding network; in step 103, the context words and the corpus samples are propagated forward in the second translation model to obtain a second confidence of the second pre-labeled target word corresponding to each second to-be-predicted position, which may be implemented through steps 1031 to 1033 shown in fig. 3C.
In step 1031, each original word of the corpus sample and the original word vector corresponding to each original word are obtained, and the original word vectors corresponding to each original word are combined to obtain an original word vector sequence of the corpus sample.
As an example, in natural language processing tasks, how words are represented in a computer needs to be considered. There are generally two ways of representation, one-hot encoding and distributed encoding, by which the original word vector of each original word is obtained; for example, for the source sentence "where you are" of the corpus sample, there are four original words in the source language and correspondingly four original word vectors.
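Purely for illustration, obtaining the original word vectors through one-hot encoding followed by a learned (distributed) embedding table could be sketched as follows; the vocabulary and dimensions are made up for the example.

```python
import numpy as np

vocab = {"<unk>": 0, "where": 1, "you": 2, "are": 3}      # hypothetical source-language vocabulary
d_model = 8
embedding_table = np.random.randn(len(vocab), d_model)    # distributed (learned) encoding

def original_word_vectors(original_words):
    ids = [vocab.get(w, vocab["<unk>"]) for w in original_words]
    one_hot = np.eye(len(vocab))[ids]                     # one-hot encoding of each original word
    return one_hot @ embedding_table                      # original word vector sequence

print(original_word_vectors(["where", "you", "are"]).shape)  # (3, 8)
```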
In step 1032, the original word vector sequence of the corpus sample is subjected to semantic coding processing through the second coding network, so as to obtain a second source sentence representation corresponding to the corpus sample.
As an example, the encoding process in step 1032 may refer to a specific implementation in step 1012, wherein the second encoding network may have the same parameters or different parameters than the first encoding network.
In step 1033, the corpus decoding processing is performed on the second source sentence representation through the context decoding network, so as to obtain a second confidence of the second pre-marked target word corresponding to each second to-be-predicted position.
As an example, the second confidence degrees are generated based on context words corresponding to a plurality of second positions to be predicted, and when the second confidence degrees of the second pre-marked target words corresponding to each second position to be predicted are generated, the second confidence degrees are generated based on the same context words, that is, the second confidence degrees are generated by the context words obtained based on the first translation model.
In some embodiments, in step 1033, performing corpus decoding processing on the second source sentence representation through the context decoding network to obtain a second confidence of the second pre-marked target word corresponding to each second position to be predicted, which may be implemented by the following technical solution: performing the following for each second to-be-predicted position of the context decoding network output: and performing semantic decoding processing on the context words and the second source sentence representation through a context decoding network to obtain a second confidence coefficient of the second pre-marked target word decoded to be corresponding at the second position to be predicted.
In some embodiments, the context decoding network comprises P concatenated subcontext decoding networks, P being an integer greater than or equal to 2; the semantic decoding processing of the context words and the second source sentence representation through the context decoding network can be realized by the following technical solution: performing, through the P cascaded subcontext decoding networks, semantic decoding processing on the context word set corresponding to the second to-be-predicted position and the source sentence representation in the following manner: performing context self-attention processing on the input of the subcontext decoding network to obtain a context self-attention processing result corresponding to the subcontext decoding network; performing cross attention processing on the context self-attention processing result to obtain a cross attention processing result corresponding to the subcontext decoding network; and performing hidden state mapping processing on the cross attention processing result; wherein, among the P cascaded subcontext decoding networks, the input of the first subcontext decoding network comprises: the context word set corresponding to the second to-be-predicted position and the second source sentence representation; and the hidden state mapping processing result of the Pth subcontext decoding network comprises: the second confidence that the second to-be-predicted position is decoded into the corresponding second pre-marked target word. As an alternative implementation, there may be only one subcontext decoding network to perform the semantic decoding processing.
As an example, semantic decoding processing is performed on the input of the pth subcontext decoding network through the pth subcontext decoding network of the P cascaded subcontext decoding networks, and the pth semantic decoding processing result output by the pth subcontext decoding network is transmitted to the (p+1)th subcontext decoding network to continue the semantic decoding processing, obtaining the (p+1)th semantic decoding processing result; when the value of p is greater than or equal to 2 and p < P, the input of the pth subcontext decoding network is the (p-1)th semantic decoding processing result output by the (p-1)th subcontext decoding network, and when the value of p is P-1, the output of the (p+1)th subcontext decoding network is the second confidence.
As an example, when semantic decoding processing is performed on the input of the pth subcontext decoding network through the pth subcontext decoding network of the P cascaded subcontext decoding networks, context self-attention processing is performed on the input of the pth subcontext decoding network through the context self-attention layer of the pth subcontext decoding network to obtain a context self-attention processing result corresponding to the pth subcontext decoding network; the context self-attention processing result corresponding to the pth subcontext decoding network is transmitted to the cross attention layer of the pth subcontext decoding network for cross attention processing to obtain a cross attention processing result corresponding to the pth subcontext decoding network; and the cross attention processing result corresponding to the pth subcontext decoding network is transmitted to the feedforward processing layer of the pth subcontext decoding network for hidden state mapping processing to obtain a hidden state sequence corresponding to the pth subcontext decoding network as the pth semantic decoding processing result output by the pth subcontext decoding network.
As an example, the following processing is performed for each context word: when the subcontext decoding network is the first decoding network, the third intermediate vectors are the word vectors of the second pre-marked target words serving as context words; when the subcontext decoding network is not the first decoding network, the third intermediate vectors are the output of the previous subcontext decoding network; the third intermediate vectors correspond one-to-one to the second pre-marked target words serving as context words. Linear transformation processing is performed on the third intermediate vector corresponding to each context word to obtain the key vector and the value vector of that context word; linear transformation processing is performed on the third intermediate vector of the symbol at the second to-be-predicted position (which is marked with a special symbol because the word there is unknown) to obtain the query vector of that symbol; point multiplication processing is performed on the query vector and the key vector of each context word, and normalization processing based on the maximum likelihood function is performed on the point multiplication results to obtain the weight of the value vector of each context word; and the value vectors of the context words are weighted based on these weights to obtain the context self-attention processing result corresponding to the subcontext decoding network.
As an example, the cross attention processing is performed on the context self-attention processing result, and the process of obtaining the cross attention processing result corresponding to the subcontext decoding network is similar to the processing manner in the preamble decoding network.
In step 104, parameters of the first translation model and the second translation model are updated based on a first confidence of the first pre-labeled target word corresponding to each first position to be predicted and a second confidence of the second pre-labeled target word corresponding to each second position to be predicted.
Referring to fig. 3D, fig. 3D is a flowchart illustrating a method for training a translation model according to an embodiment of the present application, and in step 104, parameters of the first translation model and the second translation model are updated based on a first confidence of the first pre-labeled target word corresponding to each first position to be predicted and a second confidence of the second pre-labeled target word corresponding to each second position to be predicted, which may be implemented through steps 1041 to 1044 shown in fig. 3D.
In step 1041, a first penalty is determined for the first translation model based on the first confidence level.
In some embodiments, the plurality of first to-be-predicted positions have a one-to-one correspondence of a plurality of first pre-marked target words; in step 1041, based on the first confidence, determining a first loss corresponding to the first translation model, which may be implemented by the following technical solution: and performing fusion processing on the first confidence coefficient acquired aiming at each first to-be-predicted position to obtain a first loss corresponding to the first translation model.
In step 1042, a second loss corresponding to the second translation model is determined based on the first confidence level lower than the confidence level threshold and the second confidence level of the second pre-labeled target word corresponding to each second to-be-predicted position.
In some embodiments, the second loss is used for characterizing the teaching loss of the first translation model by the second translation model, and the second pre-marked target words are in one-to-one correspondence with the second positions to be predicted; in step 1042, a second loss corresponding to the second translation model is determined based on the first confidence coefficient lower than the confidence coefficient threshold and the second confidence coefficient of the second pre-labeled target word corresponding to each second to-be-predicted position, which may be implemented by the following technical solutions: and performing fusion processing on the first confidence coefficient lower than the confidence coefficient threshold value and the second confidence coefficient of the second pre-marked target word corresponding to each second position to be predicted to obtain a second loss corresponding to the second translation model.
In step 1043, the first loss and the second loss are aggregated based on aggregation parameters corresponding to the first loss and the second loss, respectively, to obtain a combined loss.
As an example, the partially visible sequence y_o of the second translation model is determined by the first translation model. Given the source sentence x and the partially visible sequence y_o, the pre-trained second translation model predicts, for each word y_t in the occluded target word subset y_m, a corresponding prediction probability distribution P(y_t | y_o, x)
as the second confidence. Then, for each second to-be-predicted position at the target end of the second translation model (i.e., each first to-be-predicted position whose first confidence is lower than the confidence threshold), bidirectional global context information (based on the context words) is introduced into the first translation model in a targeted manner by means of knowledge distillation, and the loss function of the knowledge distillation is shown in formula (1):

L_kd(θ_ne, θ_nd) = Σ_{y_t ∈ y_m} [ α · KL( P(y_t | y_o, x) ‖ P(y_t | y_<t, x; θ_ne, θ_nd) ) − (1 − α) · log P(y_t | y_<t, x; θ_ne, θ_nd) ]   (1);
wherein KL(·) represents the Kullback-Leibler divergence, and α is a balance coefficient whose value strategy is as follows:

α = 1 − e/E, where e is the current training round and E is the total number of training rounds;
the summation above is the second loss, and the value of α decreases linearly from 1 to 0 along with the training round, so that the first translation model can be guided to absorb more knowledge from the second translation model with bidirectional global context information in the early stage, and then gradually pay attention again to the prediction of the first pre-marked target words so that they are better trained. For the other words that do not belong to y_m, the first loss function of the first translation model is still used, and thus the joint loss function can be seen in formula (2):

L_CBKD(θ_ne, θ_nd) = − Σ_{y_t ∈ y_o\[M]} log P(y_t | y_<t, x; θ_ne, θ_nd) + L_kd(θ_ne, θ_nd)   (2);
wherein y_t ∈ y_o\[M] represents excluding all special symbols [M] (that is, keeping the first pre-marked target words whose first confidence is not lower than the confidence threshold), L_CBKD(θ_ne, θ_nd) is the joint loss, L_kd(θ_ne, θ_nd) is the second loss, and the result of summing −log P(y_t | y_<t, x; θ_ne, θ_nd) over y_t ∈ y_o\[M] is the first loss. Bidirectional global context information is thus introduced at the target end for the first translation model in a targeted manner through confidence-based knowledge distillation; meanwhile, the second translation model only participates in the training process and does not participate in the inference stage of the first translation model.
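Purely for illustration, the confidence-based knowledge distillation loss of formulas (1) and (2) could be sketched as follows, assuming that per-position vocabulary distributions from both models are already available; the linear schedule for α and every function name are assumptions made for the sketch, not a definitive implementation.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-9):
    # Kullback-Leibler divergence KL(p || q) between two vocabulary distributions
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def joint_loss(first_probs, second_probs, target_ids, confidences, threshold, epoch, total_epochs):
    # first_probs / second_probs: (seq_len, vocab) distributions of the first / second translation model
    # target_ids: indices of the pre-marked target words; confidences: first confidence per position
    alpha = max(0.0, 1.0 - epoch / total_epochs)          # balance coefficient decayed linearly from 1 to 0
    first_loss, second_loss = 0.0, 0.0
    for t, (gold, conf) in enumerate(zip(target_ids, confidences)):
        if conf < threshold:                              # occluded subset y_m: distill from the second model
            second_loss += alpha * kl_divergence(second_probs[t], first_probs[t])
            second_loss += (1.0 - alpha) * -np.log(first_probs[t][gold] + 1e-9)
        else:                                             # visible subset y_o: usual first loss
            first_loss += -np.log(first_probs[t][gold] + 1e-9)
    return first_loss + second_loss                       # joint loss L_CBKD
```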
In step 1044, parameters of the first translation model and the second translation model are updated according to the joint loss.
As an example, the parameters of the two models are updated according to the joint loss, and a gradient descent algorithm may be used for the update.
The method for translating corpus provided by the embodiment of the present application will be described in conjunction with an exemplary application and implementation of the server 200 provided by the embodiment of the present application.
In some embodiments, in response to a translation request for a target corpus, a first translation model or a second translation model is called to perform translation processing on the target corpus to obtain a translation result for the target corpus; the first translation model and the second translation model are obtained by training through the translation model training method provided by the embodiment of the application.
As an example, the application stage of the first translation model differs from its training stage. When the first translation model is applied, the corpus to be translated is encoded through the coding network of the first translation model to obtain a first source sentence representation, and each first to-be-predicted position is then decoded sequentially, step by step: the start symbol "B" and the first source sentence representation are decoded through the preamble decoding network, and the word in the word list with the highest confidence, e.g. "where", is taken as the translation result of the first position; then "B", "where" and the first source sentence representation are decoded through the preamble decoding network, and the word with the highest confidence in the word list is taken as the translation result of the second position; and so on until the termination symbol "E" is output. The application stage of the second translation model is the same as its training stage, namely one forward propagation is performed.
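Purely for illustration, the step-by-step application (inference) phase of the first translation model described above could be sketched as follows; encode, decode_step and the dictionary of confidences are hypothetical stand-ins for the first coding network and the preamble decoding network.

```python
def greedy_translate(source_words, encode, decode_step, start="B", end="E", max_len=50):
    # encode(source_words) -> first source sentence representation
    # decode_step(preamble, source_repr) -> {word: confidence} over the word list
    source_repr = encode(source_words)
    preamble = [start]
    while len(preamble) < max_len:
        vocab_confidences = decode_step(preamble, source_repr)
        next_word = max(vocab_confidences, key=vocab_confidences.get)  # word with the highest confidence
        if next_word == end:                                           # termination symbol reached
            break
        preamble.append(next_word)
    return preamble[1:]                                                # translation result without the start symbol
```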
Next, an exemplary application of the embodiment of the present application in a practical application scenario will be described.
In some embodiments, a language translation function may be provided in the social client, for example, the social client running on terminal a receives text information sent by terminal B via the server, where the text information is in english, and in response to a language translation operation for the text information received by terminal a, terminal a invokes a first translation model provided in an embodiment of the present application to perform translation processing on the text information belonging to a first language to obtain text information corresponding to a second language, where the first translation model is obtained by means of auxiliary training of a second translation model, and joint prediction is performed only on a first to-be-predicted position with a lower first confidence when the auxiliary training is performed by the second translation model.
In some embodiments, the present application provides a joint training framework for merging bidirectional global context information based on confidence, where the joint training framework includes a translation model (a first translation model) and a conditional mask language model (a second translation model), a confidence of a first pre-labeled target word is predicted based on the first translation model in a joint training process, and the bidirectional global context information (context words) is merged into the first translation model by the second translation model using knowledge distillation, where the context words are target words that are not occluded, and the joint training is divided into two stages: (1) pre-training a first translation model and a second translation model; (2) knowledge distillation is performed based on confidence.
In some embodiments, the first translation model is pre-trained (i.e., trained separately). The first translation model and the second translation model have the same coding network, which is used for coding the input source sentence (corresponding to the corpus sample) into a source semantic representation. The coding network may consist of L_e identical sub-coding networks, L_e being an integer greater than or equal to 1, and each sub-coding network comprises two sublayers: (1) a self-attention processing layer, and (2) a feed-forward processing layer. In the sub-coding network, the self-attention processing layer takes the hidden state vector sequence output by the previous layer as input and further maps it by a self-attention mechanism, that is, performs a multi-head self-attention operation; the self-attention processing layer can be expressed as the following formula (3):
c^(l) = AN(SelfAtt(h^(l-1), h^(l-1), h^(l-1)))   (3);
wherein c^(l) is the intermediate vector calculated by the self-attention sublayer, AN(·) represents a layer regularization operation with residual connection, SelfAtt(·) represents the multi-head self-attention operation, h^(l) represents the hidden state vector sequence output by the l-th layer of the coding network, and h^(l-1) represents the hidden state vector sequence output by the (l-1)-th layer of the coding network. c^(l) is mapped to the hidden state vector sequence h^(l) output by the l-th layer of the coding network through the feedforward processing layer, see formula (4):
h^(l) = AN(FFN(c^(l)))   (4);
in some embodiments, h of the network is encoded(0)Namely embedding a vector sequence into the words corresponding to the input source sentences,
Figure BDA0002999612450000271
i.e. codingThe source sentence representation ultimately output by the network.
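Purely for illustration, one sub-coding network layer implementing formulas (3) and (4) could be sketched in PyTorch as follows; the multi-head attention module, the ReLU feed-forward network and the dimensions are assumptions, since the embodiments do not prescribe them.

```python
import torch
import torch.nn as nn

class SubCodingNetwork(nn.Module):
    """One sub-coding network layer: formula (3) followed by formula (4)."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_att = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, h_prev):                       # h_prev: (batch, src_len, d_model) = h^(l-1)
        att, _ = self.self_att(h_prev, h_prev, h_prev)
        c = self.norm1(h_prev + att)                 # c^(l) = AN(SelfAtt(h^(l-1), h^(l-1), h^(l-1)))
        h = self.norm2(c + self.ffn(c))              # h^(l) = AN(FFN(c^(l)))
        return h
```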
In some embodiments, the preamble decoding network may be formed of L_d identical sub-preamble decoding networks, each of which has three layers: (1) a masked self-attention processing layer (MaskedSelfAtt), (2) a cross-attention processing layer (CrossAtt), and (3) a feed-forward processing layer (FFN). In order to ensure the auto-regressive property of the first translation model, the MaskedSelfAtt layer uses an attention mask to block all subsequent words of the target word at each first to-be-predicted position, so that the first translation model relies only on the preamble words at the target end for prediction; because each time step in the test phase of the first translation model can only use the words generated before, the MaskedSelfAtt layer keeps this attention mask that blocks all subsequent words of each first pre-marked target word during training. It can be formalized as formula (5):
a^(l) = AN(MaskedSelfAtt(s^(l-1), s^(l-1), s^(l-1)))   (5);
wherein s^(l-1) represents the hidden state vector sequence of the (l-1)-th sub-preamble decoding network, a^(l) is the intermediate representation (masked attention processing result) after MaskedSelfAtt layer mapping, AN(·) represents the layer regularization operation with residual connection, and FFN(·) represents the mapping operation of the feedforward processing layer.
In some embodiments, the CrossAtt layer models a^(l) and the source sentence representation output by the coding network by adopting a cross attention mechanism, and the calculation process is abstracted as formula (6):

z^(l) = AN(CrossAtt(a^(l), h^(L_e), h^(L_e)))   (6);
wherein z^(l) is the intermediate representation (cross attention processing result) after CrossAtt layer mapping, AN(·) represents the layer regularization operation with residual connection, a^(l) is the intermediate representation after MaskedSelfAtt layer mapping (masked attention processing result), h^(L_e) is the source sentence representation output by the coding network, and CrossAtt(·) is the cross attention processing.
In some embodiments, z^(l) is mapped to the hidden state vector sequence s^(l) output by the sub-preamble decoding network through the FFN layer, and the mapping process is abstracted as formula (7):
s^(l) = AN(FFN(z^(l)))   (7);
wherein s^(l) represents the hidden state vector sequence of the l-th sub-preamble decoding network, AN(·) represents the layer regularization operation with residual connection, FFN(·) is the feed-forward processing, and z^(l) is the intermediate representation after CrossAtt layer mapping (cross attention processing result).
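Purely for illustration, one sub-preamble decoding network layer implementing formulas (5), (6) and (7) could be sketched in PyTorch as follows; as with the encoder sketch, the attention modules, the mask construction and the dimensions are assumptions.

```python
import torch
import torch.nn as nn

class SubPreambleDecodingNetwork(nn.Module):
    """One sub-preamble decoding network layer: formulas (5), (6) and (7)."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.masked_self_att = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_att = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, s_prev, src_repr):             # s_prev: (batch, tgt_len, d_model) = s^(l-1)
        tgt_len = s_prev.size(1)
        causal = torch.triu(torch.ones(tgt_len, tgt_len), diagonal=1).bool()  # block subsequent words
        att, _ = self.masked_self_att(s_prev, s_prev, s_prev, attn_mask=causal)
        a = self.norm1(s_prev + att)                 # a^(l), formula (5)
        cross, _ = self.cross_att(a, src_repr, src_repr)
        z = self.norm2(a + cross)                    # z^(l), formula (6)
        s = self.norm3(z + self.ffn(z))              # s^(l), formula (7)
        return s
```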
In some embodiments, given a source sentence x, the preamble sequence y_<t corresponding to the target end at time t, and the hidden state vector s_t output by the highest layer of the decoding network, the first translation model predicts the probability distribution of y_t over the vocabulary in the form of the following formula (8):
p(y_t | y_<t, x) = softmax(W s_t)   (8);
wherein a time can be understood as a different first to-be-predicted position, each first to-be-predicted position corresponding to a different time step; W represents a learnable linear transformation matrix, and s_t is the hidden state vector corresponding to the t-th time of the decoding network of the first translation model. For example, for the first pre-marked target word y_5 at the 5th first to-be-predicted position, based on the given source sentence x (e.g., "where you are"), the given preamble words y_1 to y_4, and the hidden state vector output by the decoding network corresponding to the t-th time, the probability p(y_5 | y_<5, x) that the 5th first to-be-predicted position is translated into the first pre-marked target word y_5 is predicted, where the hidden state vector corresponding to the t-th time output by the decoding network is also obtained based on the given preamble words y_1 to y_4.
In some embodiments, the first translation model and the second translation model may share parameters of the coding network, or may have independent coding networks, the decoding networks of the first translation model and the second translation model may have partially same parameters, and the first translation model is additionally influenced by a supervisory signal from the context decoding network of the second translation model, the context decoding network of the second translation model has target-side bidirectional global context information, so that the jointly trained first translation model has the capability of capturing bidirectional global context information, and the loss function of the first translation model is shown in formula (9):
L_nmt(θ_ne, θ_nd) = − Σ_{t=1}^{|y|} log P(y_t | x, y_<t; θ_ne, θ_nd)   (9);
wherein x is a source sentence, y_t is the first pre-marked target word at the corresponding time t of the target end (the t-th first to-be-predicted position), and P(y_t | x, y_<t) is the probability distribution of y_t over the vocabulary (the first probability); the first translation model is trained in the standard teacher-forcing manner.
In some embodiments, the second translation model is pre-trained (i.e., trained separately). The context decoding network of the second translation model decodes differently from the preamble decoding network of the first translation model: given the partially visible input word sequence y_o at the target end, it predicts the blocked target word set y_m at the target end. The decoding network of the second translation model consists of L_d identical subcontext decoding networks, each of which comprises three layers: (1) a self-attention processing layer (SelfAtt); (2) a cross attention layer (CrossAtt); (3) a feedforward processing layer (FFN). The SelfAtt layer of the subcontext decoding network of the second translation model does not use a mask to block the subsequent words of each position when executing the self-attention mechanism, but can focus on all the unblocked positions of the target-end input sequence (the visible input word sequence), and otherwise performs the same calculation process as the preamble decoding network. Given the source sentence x, the partially visible sequence y_o at the target end, and the hidden state vector s' output by the highest layer of the context decoding network, the second translation model predicts the probability distribution of y_t over the vocabulary by the following formula (10):
p(y_t | y_o, x) = softmax(W′ s′_t)   (10);
where W′ is a learnable linear transformation matrix and s′_t is the hidden state vector corresponding to the t-th time of the context decoding network. For example, for the second pre-marked target word y_5 at the 5th second to-be-predicted position and the second pre-marked target word y_3 at the 3rd second to-be-predicted position, based on the given source sentence x (e.g., "do you like this flower"), the given visible words y_1, y_2 and y_4, and the hidden state vector output by the decoding network corresponding to the t-th time, the probability p(y_5 | y_{1,2,4}, x) that the 5th second to-be-predicted position is translated into y_5 and the probability p(y_3 | y_{1,2,4}, x) for the 3rd second to-be-predicted position are predicted, where the hidden state vector corresponding to the t-th time output by the decoding network is also obtained based on the given visible words y_1, y_2 and y_4; the visible words form the context word set, constituting the global context information.
In some embodiments, for the second translation model, an integer v is randomly generated from 1 to |y|, and then v words are randomly selected from the |y| second pre-marked target words and replaced by a special symbol, so that the second pre-marked target word sequence is split into the input sequence y_o observable by the context decoding network (consisting of the visible second pre-marked target words) and the masked sequence y_m (consisting of the invisible second pre-marked target words). The training target of the second translation model may be represented by the following formula (11):

L_cmlm = − Σ_{y_t ∈ y_m} log P(y_t | x, y_o)   (11);
wherein x is a source-language sentence, y_t is the second pre-marked target word at the target end corresponding to the t-th time (the t-th second to-be-predicted position), and P(y_t | x, y_o) is the probability distribution of y_t over the vocabulary (the second probability).
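Purely for illustration, the random masking used to build the training pair (y_o, y_m) of the second translation model could be sketched as follows; the "[M]" placeholder follows the description above, and all names are illustrative.

```python
import random

def mask_for_cmlm(target_words, mask="[M]"):
    # target_words: second pre-marked target word sequence, e.g. ["where", "are", "you"]
    v = random.randint(1, len(target_words))                  # random integer v from 1 to |y|
    masked_positions = set(random.sample(range(len(target_words)), v))
    y_o = [mask if i in masked_positions else w for i, w in enumerate(target_words)]
    y_m = {i: target_words[i] for i in masked_positions}      # blocked words the second model must predict
    return y_o, y_m

print(mask_for_cmlm(["where", "are", "you"]))
```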
In some embodiments, even given completely correct preamble words, the first translation model still predicts a low probability for a large number of first pre-marked target words. Referring to fig. 9, fig. 9 is a confidence distribution diagram provided by an embodiment of the present application; the horizontal axis of fig. 9 is the confidence, and the vertical axis is the proportion of target words. For a fully trained first translation model (the pre-training process described above), in the case where the target end is given completely correct preamble words for each first to-be-predicted position, the predicted confidence distribution over the first pre-marked target words shows, for example, that 25.67% of the first pre-marked target words are predicted with a confidence of only 0.1 even though the preamble words given are completely correct.
In some embodiments, the training method for a translation model provided in the embodiments of the present application uses the pre-trained second translation model as a teacher model through confidence-based knowledge distillation, and introduces bidirectional global context information for the first pre-marked target words that the first translation model predicts with lower confidence, so as to improve the training process of the first translation model. Referring to fig. 4, fig. 4 is a schematic structural diagram of the joint training model of the training method for a translation model provided in the embodiments of the present application.
Denoting the prediction probability (first confidence) of the first translation model for the first pre-marked target word at the t-th first to-be-predicted position as P(y_t | y_<t, x), and given a confidence threshold ε, the first to-be-predicted positions whose first probability (first confidence) is lower than the confidence threshold are used as the blocked subset y_m that is subsequently input into the second translation model, and the remaining first pre-marked target words are used as the partially visible sequence y_o that is input into the second translation model. The process can be expressed by formula (12):

y_m = { y_t | P(y_t | y_<t, x) < ε, 1 ≤ t ≤ |y| },  y_o = y \ y_m   (12);
wherein P(y_t | y_<t, x) represents the prediction probability (first confidence) of the first translation model for the first pre-marked target word at the t-th time, ε is the confidence threshold, t is the prediction time step (different first to-be-predicted positions are distinguished through the prediction time step), |y| is the number of first pre-marked target words, and y_t is the first pre-marked target word corresponding to the t-th time (the corresponding first to-be-predicted position).
The partially visible sequence y_o of the second translation model is thereby determined by the first translation model. Given the source sentence x and the partially visible sequence y_o, the pre-trained second translation model predicts, for each word y_t in the occluded target word subset y_m, a corresponding prediction probability distribution P(y_t | y_o, x)
as the second confidence. Next, for each second to-be-predicted position at the target end of the second translation model (i.e., each first to-be-predicted position whose first confidence is lower than the confidence threshold), bidirectional global context information (based on the context words) is introduced into the first translation model in a targeted manner by means of knowledge distillation; the second loss function of the knowledge distillation is shown in formula (13):

L_kd(θ_ne, θ_nd) = Σ_{y_t ∈ y_m} [ α · KL( P(y_t | y_o, x) ‖ P(y_t | y_<t, x; θ_ne, θ_nd) ) − (1 − α) · log P(y_t | y_<t, x; θ_ne, θ_nd) ]   (13);
wherein KL(·) represents the Kullback-Leibler divergence, and α is a balance coefficient whose value decreases linearly from 1 to 0 along with the training round, so that the first translation model can be guided to absorb more knowledge from the second translation model with bidirectional global context information in the early stage, and then gradually pay attention again to the prediction of the first pre-marked target words so that they are better trained. For the other words that do not belong to y_m, the first loss function of the first translation model is still used, and the joint loss function can be seen in formula (14):
L_CBKD(θ_ne, θ_nd) = − Σ_{y_t ∈ y_o\[M]} log P(y_t | y_<t, x; θ_ne, θ_nd) + L_kd(θ_ne, θ_nd)   (14);
wherein y_t ∈ y_o\[M] represents excluding all special symbols [M] (that is, keeping the first pre-marked target words whose first confidence is not lower than the confidence threshold), L_CBKD(θ_ne, θ_nd) is the joint loss, and L_kd(θ_ne, θ_nd) is the second loss. Bidirectional global context information is thus introduced at the target end for the first translation model in a targeted manner through confidence-based knowledge distillation; meanwhile, the second translation model only participates in the training process and does not participate in the inference stage of the first translation model.
By the training method of the translation model, knowledge distillation based on confidence is performed, bidirectional global context information is pertinently introduced to a first to-be-predicted position, with a lower first confidence, of a first pre-marked target word at a target end of a first translation model, so that the first translation model can predict each first to-be-predicted position by using not only local context information of the corresponding pre-marked word, but also global context information, and translation performance of the first translation model is improved.
Continuing with the exemplary structure of the training device 255 for translation models provided by the embodiments of the present application implemented as software modules, in some embodiments, as shown in fig. 2, the software modules stored in the training device 255 for translation models of the memory 250 may include: the first task module 2551 is configured to perform forward propagation on the corpus sample in the first translation model to obtain a first confidence of the first pre-marked target word corresponding to each first to-be-predicted position; a selecting module 2552, configured to determine a first to-be-predicted position where the first confidence coefficient is lower than the confidence coefficient threshold as a second to-be-predicted position of the second translation model, and determine a first pre-tagged target word corresponding to the first to-be-predicted position where the first confidence coefficient is not lower than the confidence coefficient threshold as a contextual word corresponding to the second translation model; the second task module 2553 is configured to forward propagate the context words and the corpus samples in the second translation model, so as to obtain a second confidence of a second pre-marked target word corresponding to each second to-be-predicted position; an updating module 2554, configured to update parameters of the first translation model and the second translation model based on a first confidence of the first pre-labeled target word corresponding to each first position to be predicted and a second confidence of the second pre-labeled target word corresponding to each second position to be predicted.
In some embodiments, the update module 2554 is further configured to: determining a first loss corresponding to the first translation model based on the first confidence; determining a second loss corresponding to the second translation model based on the first confidence coefficient lower than the confidence coefficient threshold value and the second confidence coefficient of the second pre-marked target word corresponding to each second position to be predicted; wherein the second loss is used for characterizing the teaching loss of the first translation model by the second translation model; performing polymerization treatment on the first loss and the second loss based on polymerization parameters respectively corresponding to the first loss and the second loss to obtain combined loss; and updating parameters of the first translation model and the second translation model according to the joint loss.
In some embodiments, the plurality of first to-be-predicted positions have a one-to-one correspondence of a plurality of first pre-marked target words; an update module 2554, further configured to: and performing fusion processing on the first confidence coefficient acquired aiming at each first to-be-predicted position to obtain a first loss corresponding to the first translation model.
In some embodiments, the plurality of second positions to be predicted have a one-to-one correspondence of a plurality of second pre-marked target words; an update module 2554, further configured to: and performing fusion processing on the first confidence coefficient lower than the confidence coefficient threshold value and the second confidence coefficient of the second pre-marked target word corresponding to each second position to be predicted to obtain a second loss corresponding to the second translation model.
In some embodiments, the first translation model includes a first encoding network and a preamble decoding network; a first task module 2551, further configured to: determining each original word of the corpus sample and an original word vector corresponding to each original word, and combining the original word vectors corresponding to each original word to obtain an original word vector sequence of the corpus sample; performing semantic coding processing on an original word vector sequence of the corpus sample through a first coding network to obtain a first source statement representation corresponding to the corpus sample; performing corpus decoding processing on the first source sentence representation through a preamble decoding network to obtain a first confidence coefficient of a first pre-marked target word corresponding to each first position to be predicted; and generating a first confidence coefficient based on the prepositions corresponding to each first position to be predicted.
In some embodiments, the encoding network comprises N concatenated sub-encoding networks, N being an integer greater than or equal to 2; a first task module 2551, further configured to: through N cascaded first sub-coding networks included in the first coding network, performing semantic coding processing on an original word vector sequence of a corpus sample in the following mode: performing self-attention processing on the input of the first sub-coding network to obtain a self-attention processing result corresponding to the first sub-coding network, performing hidden state mapping processing on the self-attention processing result to obtain a hidden state vector sequence corresponding to the first sub-coding network, and taking the hidden state vector sequence as a semantic coding processing result of the first sub-coding network; in the N cascaded first sub-coding networks, the input of the first sub-coding network comprises an original word vector sequence of a corpus sample, and the semantic coding processing result of the Nth first sub-coding network comprises a first source statement representation corresponding to the corpus sample.
In some embodiments, the first task module 2551 is further configured to: the following processing is performed for each original word of the corpus sample: performing linear transformation processing on a first intermediate vector corresponding to an original word in the input of the first sub-coding network to obtain a query vector, a key vector and a value vector corresponding to the original word; performing point multiplication on the query vector of the original word and the key vector of each original word, and performing normalization processing on the result of the point multiplication processing based on a maximum likelihood function to obtain the weight of the value vector of the original word; and carrying out weighting processing on the value vector of the original word based on the weight of the value vector of the original word to obtain a self-attention processing result of each original word corresponding to the sub-coding network.
In some embodiments, the first task module 2551 is further configured to: performing the following for each first to-be-predicted position of the preamble decoding network output: acquiring a first pre-marked target word sequence corresponding to a corpus sample from a corpus sample set; extracting a first pre-marked target word positioned in front of a first position to be predicted from the first pre-marked target word sequence, and taking the extracted first pre-marked target word as a pre-word corresponding to the first position to be predicted; semantic decoding processing is carried out on the preamble word corresponding to the first to-be-predicted position and the first source sentence representation through a preamble decoding network, and a first confidence coefficient of the first to-be-predicted position decoded into the corresponding first pre-marked target word is obtained.
In some embodiments, the preamble decoding network comprises M concatenated sub-preamble decoding networks, M being an integer greater than or equal to 2; the first task module 2551 is further configured to: perform semantic decoding processing in the following manner through each sub-preamble decoding network: performing mask self-attention processing on the input of the sub-preamble decoding network to obtain a mask self-attention processing result corresponding to the sub-preamble decoding network, performing cross attention processing on the mask self-attention processing result to obtain a cross attention processing result corresponding to the sub-preamble decoding network, and performing hidden state mapping processing on the cross attention processing result; wherein, among the M concatenated sub-preamble decoding networks, the input of the first sub-preamble decoding network comprises: the preamble words corresponding to the first to-be-predicted position and the first source sentence representation; and the hidden state mapping processing result of the Mth sub-preamble decoding network comprises: the first confidence that the first to-be-predicted position is decoded into the corresponding first pre-marked target word.
In some embodiments, the first task module 2551 is further configured to: performing linear transformation processing on the mask self-attention processing result to obtain a query vector of the mask self-attention processing result; the following processing is performed for each original word: performing linear transformation processing on a first source sentence representation of an original word to obtain a key vector and a value vector represented by the first source sentence; performing point multiplication on the query vector masked from the attention processing result and the key vector represented by the first source sentence, and performing normalization processing on the result of the point multiplication processing based on a maximum likelihood function to obtain the weight of the value vector represented by the first source sentence; and carrying out weighting processing on the value vector represented by the first source sentence based on the weight of the value vector represented by the first source sentence to obtain a cross attention processing result corresponding to the sub-preamble decoding network.
In some embodiments, the second translation model includes a second encoding network and a context decoding network; a second task module 2553, further configured to: acquiring each original word of the corpus sample and an original word vector corresponding to each original word, and combining the original word vectors corresponding to each original word to obtain an original word vector sequence of the corpus sample; performing semantic coding processing on the original word vector sequence of the corpus sample through a second coding network to obtain a second source statement representation corresponding to the corpus sample; performing corpus decoding processing on the second source sentence representation through a context decoding network to obtain a second confidence coefficient of a second pre-marked target word corresponding to each second position to be predicted; and generating a second confidence coefficient based on the context words corresponding to the second positions to be predicted.
In some embodiments, the second task module 2553 is further configured to perform the following processing for each second to-be-predicted position output by the context decoding network: performing semantic decoding processing on the context words and the second source sentence representation through the context decoding network, to obtain a second confidence coefficient that the second to-be-predicted position is decoded into the corresponding second pre-marked target word.
In some embodiments, the context decoding network comprises P concatenated sub-context decoding networks, P being an integer greater than or equal to 2; the second task module 2553 is further configured to perform semantic decoding processing on the context word set corresponding to the second to-be-predicted position and the second source sentence representation through the P concatenated sub-context decoding networks in the following manner: performing context self-attention processing on the input of the sub-context decoding network to obtain a context self-attention processing result corresponding to the sub-context decoding network; performing cross attention processing on the context self-attention processing result to obtain a cross attention processing result corresponding to the sub-context decoding network; and performing hidden state mapping processing on the cross attention processing result; wherein, among the P concatenated sub-context decoding networks, the input of the first sub-context decoding network comprises the context word set corresponding to the second to-be-predicted position and the second source sentence representation, and the hidden state mapping processing result of the Pth sub-context decoding network comprises the second confidence coefficient that the second to-be-predicted position is decoded into the corresponding second pre-marked target word.
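Structurally, a sub-context decoding network differs from the sub-preamble layer sketched earlier mainly in that its self-attention is not causally masked, so every second to-be-predicted position can attend to context words on both sides. A minimal PyTorch sketch under the same illustrative assumptions:

import torch
import torch.nn as nn

class SubContextDecoderLayer(nn.Module):
    """Context self-attention (unmasked), cross attention over the second source
    sentence representation, then a hidden state mapping step."""
    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.context_self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, x: torch.Tensor, source_repr: torch.Tensor) -> torch.Tensor:
        sa, _ = self.context_self_attn(x, x, x)               # no causal mask: bidirectional context
        x = self.norm1(x + sa)
        ca, _ = self.cross_attn(x, source_repr, source_repr)  # cross attention on the source
        x = self.norm2(x + ca)
        return self.norm3(x + self.ffn(x))                    # hidden state mapping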
An embodiment of the present application provides a corpus translation apparatus 256 based on a translation model, including: an application module 2555, configured to, in response to a translation request for a target corpus, invoke the first translation model or the second translation model to perform translation processing on the target corpus, so as to obtain a translation result for the target corpus; the first translation model and the second translation model are obtained by training according to the training method of the translation model provided by the embodiments of the present application.
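A trivial illustration of how such an application module might route a request is given below; the translate method name is a hypothetical placeholder rather than an interface defined by this application:

def handle_translation_request(target_corpus: str, first_model, second_model,
                               use_second_model: bool = False) -> str:
    """Invoke the first or the second (jointly trained) translation model."""
    model = second_model if use_second_model else first_model
    return model.translate(target_corpus)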
Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the training method of the translation model and the corpus translation method of the embodiment of the application.
Embodiments of the present application provide a computer-readable storage medium storing executable instructions, which when executed by a processor, cause the processor to perform a method for training a translation model provided by embodiments of the present application, for example, a method for training a translation model as shown in fig. 3A-3D.
In some embodiments, the computer-readable storage medium may be memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, optical disk, or CD-ROM; or may be various devices including one or any combination of the above memories.
In some embodiments, executable instructions may be written in any form of programming language (including compiled or interpreted languages), in the form of programs, software modules, scripts or code, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
By way of example, executable instructions may correspond, but do not necessarily have to correspond, to files in a file system, and may be stored in a portion of a file that holds other programs or data, such as in one or more scripts in a HyperText Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
By way of example, executable instructions may be deployed to be executed on one computing device or on multiple computing devices at one site or distributed across multiple sites and interconnected by a communication network.
In summary, in the embodiments of the present application, the characteristics and confidence levels of one neural network model's translation task are used to assist the training of another neural network model in a targeted manner. The second translation model performs translation by using the context word set, so that bidirectional global context information is effectively introduced through the context word set; in particular, this bidirectional global context information based on the context words is introduced at the target-side positions where the first translation model has lower confidence. As a result, the jointly trained first translation model can use not only the local context information of the preamble words corresponding to each to-be-predicted position during translation, but also the global context information in a targeted manner, thereby effectively improving the translation accuracy of the first translation model.
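The overall joint training step summarized above can be sketched as follows. The loss forms, the aggregation weights alpha and beta, and the model call signatures are illustrative assumptions; in particular, the fusion of the below-threshold first confidences into the second (teaching) loss is simplified here:

import torch

def joint_training_step(first_model, second_model, optimizer, corpus_sample,
                        gold_words, threshold: float = 0.5,
                        alpha: float = 1.0, beta: float = 1.0) -> float:
    # 1. Forward propagation of the corpus sample in the first translation model:
    #    one confidence per first to-be-predicted position for its gold word.
    first_conf = first_model(corpus_sample, gold_words)                 # shape [num_positions]

    # 2. Positions with confidence below the threshold become the second model's
    #    to-be-predicted positions; the gold words at the remaining positions
    #    become the context words of the second model.
    low_mask = first_conf < threshold
    positions_to_predict = low_mask.nonzero(as_tuple=True)[0]
    context_words = [w for w, keep in zip(gold_words, (~low_mask).tolist()) if keep]

    # 3. Forward propagation of the context words and the sample in the second model.
    second_conf = second_model(corpus_sample, context_words, positions_to_predict)

    # 4. First loss over all positions; second (teaching) loss over the low-confidence
    #    positions, here a plain negative log-likelihood for simplicity.
    first_loss = -torch.log(first_conf + 1e-9).mean()
    if len(positions_to_predict) > 0:
        second_loss = -torch.log(second_conf + 1e-9).mean()
    else:
        second_loss = torch.zeros(())

    # 5. Aggregate into a joint loss and update both models' parameters.
    joint_loss = alpha * first_loss + beta * second_loss
    optimizer.zero_grad()
    joint_loss.backward()
    optimizer.step()
    return float(joint_loss.item())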
The above description is only an example of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present application are included in the protection scope of the present application.

Claims (15)

1. A method for training a translation model, comprising:
carrying out forward propagation on the corpus samples in a first translation model to obtain a first confidence coefficient of a first pre-marked target word corresponding to each first position to be predicted;
determining a first to-be-predicted position with the first confidence coefficient lower than a confidence coefficient threshold value as a second to-be-predicted position of a second translation model, and determining a first pre-marked target word corresponding to the first to-be-predicted position with the first confidence coefficient not lower than the confidence coefficient threshold value as a context word corresponding to the second translation model;
carrying out forward propagation on the context words and the corpus samples in a second translation model to obtain a second confidence coefficient of a second pre-marked target word corresponding to each second position to be predicted;
and updating parameters of the first translation model and the second translation model based on the first confidence degree of the first pre-marked target word corresponding to each first position to be predicted and the second confidence degree of the second pre-marked target word corresponding to each second position to be predicted.
2. The method of claim 1, wherein updating parameters of the first translation model and the second translation model based on a first confidence of a first pre-labeled target word corresponding to each first position to be predicted and a second confidence of a second pre-labeled target word corresponding to each second position to be predicted comprises:
determining a first loss corresponding to the first translation model based on the first confidence level;
determining a second loss corresponding to the second translation model based on a first confidence coefficient lower than the confidence coefficient threshold value and a second confidence coefficient of a second pre-marked target word corresponding to each second position to be predicted;
wherein the second loss is used to characterize a teaching loss of the second translation model to the first translation model;
performing aggregation processing on the first loss and the second loss based on aggregation parameters respectively corresponding to the first loss and the second loss to obtain a joint loss;
and updating parameters of the first translation model and the second translation model according to the joint loss.
3. The method of claim 2,
there are a plurality of the first positions to be predicted in one-to-one correspondence with a plurality of first pre-marked target words, and a plurality of the second positions to be predicted in one-to-one correspondence with a plurality of second pre-marked target words;
the determining, based on the first confidence, a first loss corresponding to the first translation model comprises:
performing fusion processing on the first confidence coefficient obtained for each first to-be-predicted position to obtain a first loss corresponding to the first translation model;
determining a second loss corresponding to the second translation model based on a first confidence level lower than the confidence level threshold and a second confidence level of a second pre-marked target word corresponding to each second position to be predicted, including:
and performing fusion processing on the first confidence coefficient lower than the confidence coefficient threshold value and the second confidence coefficient of the second pre-marked target word corresponding to each second position to be predicted to obtain a second loss corresponding to the second translation model.
4. The method of claim 1, wherein the first translation model comprises a first coding network and a preamble decoding network;
the forward propagation of the corpus sample in the first translation model to obtain a first confidence coefficient of a first pre-marked target word corresponding to each first to-be-predicted position includes:
determining each original word of the corpus sample and an original word vector corresponding to each original word, and combining the original word vectors corresponding to each original word to obtain an original word vector sequence of the corpus sample;
performing semantic coding processing on the original word vector sequence of the corpus sample through the first coding network to obtain a first source sentence representation corresponding to the corpus sample;
performing corpus decoding processing on the first source sentence representation through the preamble decoding network to obtain a first confidence coefficient of a first pre-marked target word corresponding to each first position to be predicted;
wherein the first confidence is generated based on the preamble words corresponding to each first position to be predicted.
5. The method according to claim 4, wherein the first coding network comprises N concatenated first sub-coding networks, N being an integer greater than or equal to 2;
the semantic coding processing is performed on the original word vector sequence of the corpus sample through the first coding network to obtain a first source sentence representation corresponding to the corpus sample, and the semantic coding processing comprises the following steps:
performing semantic coding processing on the original word vector sequence of the corpus sample in the following mode through N cascaded first sub-coding networks included in the first coding network:
performing self-attention processing on the input of the first sub-coding network to obtain a self-attention processing result corresponding to the first sub-coding network, performing hidden state mapping processing on the self-attention processing result to obtain a hidden state vector sequence corresponding to the first sub-coding network, and taking the hidden state vector sequence as a semantic coding processing result of the first sub-coding network;
in N cascaded first sub-coding networks, the input of the first sub-coding network comprises an original word vector sequence of the corpus sample, and the semantic coding processing result of the Nth first sub-coding network comprises a first source sentence representation corresponding to the corpus sample.
6. The method of claim 4, wherein said performing corpus decoding processing on said first source sentence representation via said preamble decoding network to obtain a first confidence level of a first pre-marked target word corresponding to each of said first positions to be predicted comprises:
performing the following for each first to-be-predicted position of the preamble decoding network output:
acquiring a first pre-marked target word sequence corresponding to a corpus sample from a corpus sample set;
extracting a first pre-marked target word positioned in front of the first position to be predicted from the first pre-marked target word sequence, and taking the extracted first pre-marked target word as a preamble word corresponding to the first position to be predicted;
and performing semantic decoding processing on the preamble word corresponding to the first position to be predicted and the first source sentence representation through the preamble decoding network to obtain a first confidence coefficient that the first position to be predicted is decoded into the corresponding first pre-marked target word.
7. The method of claim 6, wherein the preamble decoding network comprises M concatenated sub-preamble decoding networks, M being an integer greater than or equal to 2;
the performing semantic decoding processing on the preamble word corresponding to the first position to be predicted and the first source sentence representation through the preamble decoding network comprises:
performing semantic decoding processing in the following manner through each of the sub-preamble decoding networks: performing mask self-attention processing on the input of the sub-preamble decoding network to obtain a mask self-attention processing result corresponding to the sub-preamble decoding network, performing cross attention processing on the mask self-attention processing result to obtain a cross attention processing result corresponding to the sub-preamble decoding network, and performing hidden state mapping processing on the cross attention processing result;
wherein, in M concatenated sub-preamble decoding networks, an input of a first one of the sub-preamble decoding networks comprises: a preamble word corresponding to the first position to be predicted and the first source sentence representation; and the hidden state mapping processing result of the Mth sub-preamble decoding network comprises: a first confidence coefficient that the first to-be-predicted position is decoded into the corresponding first pre-marked target word.
8. The method of claim 7, wherein said performing cross attention processing on said mask self-attention processing result to obtain a cross attention processing result corresponding to said sub-preamble decoding network comprises:
performing linear transformation processing on the mask self-attention processing result to obtain a query vector of the mask self-attention processing result;
performing the following for each of the original words:
performing linear transformation processing on the first source sentence representation of the original word to obtain a key vector and a value vector of the first source sentence representation;
performing dot-product processing on the query vector of the mask self-attention processing result and the key vector of the first source sentence representation, and performing normalization processing on the dot-product result based on a maximum likelihood function to obtain the weight of the value vector of the first source sentence representation;
and performing weighting processing on the value vector of the first source sentence representation based on the weight of the value vector of the first source sentence representation to obtain a cross attention processing result corresponding to the sub-preamble decoding network.
9. The method of claim 1, wherein the second translation model comprises a second encoding network and a context decoding network;
the forward propagation of the context words and the corpus samples in the second translation model to obtain a second confidence of a second pre-marked target word corresponding to each second position to be predicted includes:
acquiring each original word of the corpus sample and an original word vector corresponding to each original word, and combining the original word vectors corresponding to each original word to obtain an original word vector sequence of the corpus sample;
performing semantic coding processing on the original word vector sequence of the corpus sample through the second coding network to obtain a second source sentence representation corresponding to the corpus sample;
performing corpus decoding processing on the second source sentence representation through the context decoding network to obtain a second confidence coefficient of a second pre-marked target word corresponding to each second position to be predicted;
wherein the second confidence is generated based on context words corresponding to a plurality of the second positions to be predicted.
10. The method according to claim 9, wherein said performing corpus decoding processing on the second source sentence representation through the context decoding network to obtain a second confidence of a second pre-labeled target word corresponding to each of the second positions to be predicted comprises:
performing the following for each second to-be-predicted position of the context decoding network output:
and performing semantic decoding processing on the context words and the second source sentence representation through the context decoding network to obtain a second confidence coefficient that the second position to be predicted is decoded into the corresponding second pre-marked target word.
11. A corpus translation method, comprising:
responding to a translation request aiming at a target corpus, calling a first translation model or a second translation model to translate the target corpus to obtain a translation result aiming at the target corpus;
wherein the first translation model and the second translation model are trained according to the training method of the translation model of any one of claims 1 to 10.
12. An apparatus for training a translation model, comprising:
the first task module is used for carrying out forward propagation on the corpus sample in the first translation model to obtain a first confidence coefficient of a first pre-marked target word corresponding to each first position to be predicted;
the selection module is used for determining a first to-be-predicted position with the first confidence coefficient lower than a confidence coefficient threshold value as a second to-be-predicted position of a second translation model, and determining a first pre-marked target word corresponding to the first to-be-predicted position with the first confidence coefficient not lower than the confidence coefficient threshold value as a context word corresponding to the second translation model;
the second task module is used for carrying out forward propagation on the context words and the corpus samples in a second translation model to obtain a second confidence coefficient of a second pre-marked target word corresponding to each second position to be predicted;
and the updating module is used for updating the parameters of the first translation model and the second translation model based on the first confidence coefficient of the first pre-marked target word corresponding to each first position to be predicted and the second confidence coefficient of the second pre-marked target word corresponding to each second position to be predicted.
13. A corpus translation apparatus, comprising:
the application module is used for responding to a translation request aiming at a target corpus, calling a first translation model or a second translation model to translate the target corpus, and obtaining a translation result aiming at the target corpus; wherein the first translation model and the second translation model are trained according to the training method of the translation model of any one of claims 1 to 10.
14. An electronic device, comprising:
a memory for storing executable instructions;
a processor, configured to implement the training method of the translation model according to any one of claims 1 to 10 or the corpus translation method according to claim 11 when executing the executable instructions stored in the memory.
15. A computer-readable storage medium storing executable instructions for implementing the method for training a translation model according to any one of claims 1 to 10 or the corpus translation method according to claim 11 when executed by a processor.
CN202110341084.5A 2021-03-30 2021-03-30 Training method of translation model, translation method and device thereof, and electronic device Pending CN113705256A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110341084.5A CN113705256A (en) 2021-03-30 2021-03-30 Training method of translation model, translation method and device thereof, and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110341084.5A CN113705256A (en) 2021-03-30 2021-03-30 Training method of translation model, translation method and device thereof, and electronic device

Publications (1)

Publication Number Publication Date
CN113705256A true CN113705256A (en) 2021-11-26

Family

ID=78647893

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110341084.5A Pending CN113705256A (en) 2021-03-30 2021-03-30 Training method of translation model, translation method and device thereof, and electronic device

Country Status (1)

Country Link
CN (1) CN113705256A (en)

Similar Documents

Publication Publication Date Title
CN110326002B (en) Sequence processing using online attention
CN110210032B (en) Text processing method and device
CN111382584A (en) Text translation method and device, readable storage medium and computer equipment
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
Britz et al. Efficient attention using a fixed-size memory representation
US10789942B2 (en) Word embedding system
CN114676234A (en) Model training method and related equipment
KR20200056001A (en) A decoding method in an artificial neural network and an apparatus thereof
CN111382257A (en) Method and system for generating dialog context
CN111858898A (en) Text processing method and device based on artificial intelligence and electronic equipment
CN113822017A (en) Audio generation method, device, equipment and storage medium based on artificial intelligence
CN112560456A (en) Generation type abstract generation method and system based on improved neural network
CN115186147B (en) Dialogue content generation method and device, storage medium and terminal
CN115803806A (en) Systems and methods for training dual-mode machine-learned speech recognition models
CN114360502A (en) Processing method of voice recognition model, voice recognition method and device
CN114386409A (en) Self-distillation Chinese word segmentation method based on attention mechanism, terminal and storage medium
CN111241843B (en) Semantic relation inference system and method based on composite neural network
CN110955765A (en) Corpus construction method and apparatus of intelligent assistant, computer device and storage medium
CN114757210A (en) Translation model training method, sentence translation method, device, equipment and program
CN116469374A (en) Speech synthesis method, device, equipment and storage medium based on emotion space
CN116644180A (en) Training method and training system for text matching model and text label determining method
KR20210044559A (en) Method and device for determining output token
CN116450839A (en) Knowledge injection and training method and system for knowledge enhancement pre-training language model
CN112818688B (en) Text processing method, device, equipment and storage medium
CN110852066A (en) Multi-language entity relation extraction method and system based on confrontation training mechanism

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination