WO2023116709A1 - Machine translation method and apparatus, electronic device and storage medium - Google Patents

Machine translation method and apparatus, electronic device and storage medium Download PDF

Info

Publication number
WO2023116709A1
WO2023116709A1 (PCT/CN2022/140417)
Authority
WO
WIPO (PCT)
Prior art keywords
word
representation
corpus data
training set
synthesizer
Prior art date
Application number
PCT/CN2022/140417
Other languages
French (fr)
Chinese (zh)
Inventor
高洪
周志浩
黄书剑
陈家骏
张洋铭
周祥生
Original Assignee
中兴通讯股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中兴通讯股份有限公司
Publication of WO2023116709A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • G06F40/58Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • The embodiments of the present application relate to the technical field of machine learning, and in particular to a machine translation method and apparatus, an electronic device, and a storage medium.
  • According to how the model is trained, machine translation can be divided into supervised, semi-supervised, and unsupervised machine translation.
  • Unsupervised machine translation does not need parallel corpus data; it only needs monolingual corpus data collected into monolingual corpus data sets for training, and therefore has broader application prospects.
  • Current unsupervised machine translation is mainly based on an encoder-decoder plus attention mechanism with pre-training and back-translation.
  • During pre-training, the encodings of two different languages learn a shared contextual representation; back-translation then constructs pseudo-parallel corpora for translation training to further improve translation quality.
  • Embodiments of the present application provide a machine translation method and apparatus, an electronic device, and a storage medium.
  • An embodiment of the present application provides a machine translation method comprising the following steps: acquiring corpus data to be translated; inputting the word-segmented corpus data to be translated into an encoder to obtain a subword-based contextual representation; inputting the subword-based contextual representation into a word representation synthesizer to obtain a word-based contextual representation; and inputting the word-based contextual representation into a decoder to obtain a translation result of the corpus data to be translated.
  • An embodiment of the present application also proposes a machine translation apparatus, including: an acquisition module configured to acquire corpus data to be translated; an encoding module configured to input the word-segmented corpus data to be translated into an encoder to obtain a subword-based contextual representation; a synthesis module configured to input the subword-based contextual representation into a word representation synthesizer to obtain a word-based contextual representation; and a decoding module configured to input the word-based contextual representation into a decoder to obtain a translation result of the corpus data to be translated.
  • An embodiment of the present application also proposes an electronic device, including: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can execute the above machine translation method.
  • An embodiment of the present application also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the above machine translation method.
  • FIG. 1 is a flowchart of a machine translation method provided in an embodiment of the present application.
  • FIG. 2 is a schematic structural diagram of a machine translation apparatus provided in another embodiment of the present application.
  • FIG. 3 is a schematic structural diagram of an electronic device provided in another embodiment of the present application.
  • During pre-training, what is input into the decoder are subwords, and what is learned is a subword-based contextual representation shared between the encodings of the two languages.
  • Subwords of language pairs such as English-French and English-German have some commonality, and the ways words are constructed from subwords are also similar.
  • For example, the English word for the sun, "sun", is close to the German "Sonne", and both English and German use "er" and "est" to express the comparative and the superlative.
  • By contrast, the differences between Chinese and English subword representations, and in how words are constructed from subwords, are very large; even if shared subword representations can be learned through training, the diversity of subword composition and the instability of subword meanings within words and sentences mean the decoder still translates subword by subword, straying far from the true meanings of words, so the expected translation quality is not achieved.
  • The embodiment of the present application provides a machine translation method including the following steps: obtaining the corpus data to be translated; inputting the word-segmented corpus data to be translated into an encoder to obtain a subword-based contextual representation; inputting the subword-based contextual representation into a word representation synthesizer to obtain a word-based contextual representation; and inputting the word-based contextual representation into a decoder to obtain the translation result of the corpus data to be translated.
  • After the encoder outputs the subword-based contextual representation of the corpus data to be translated, that representation is not fed directly into the decoder; it is first input into the word representation synthesizer, which merges the subword-based contextual representation at word granularity into a word-based contextual representation, and the word-based contextual representation is then input into the decoder.
  • That is, on top of the original encoder-decoder plus attention architecture, an additional word representation synthesizer is introduced between the encoder and the decoder, and the decoder's input changes from the original subword-based contextual representation to a word-based contextual representation, so the granularity of decoding and translation changes from subwords to words; a minimal sketch of this pipeline follows.
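  • As a concrete illustration, the following is a minimal PyTorch sketch of the encoder → word representation synthesizer → decoder forward pass described above. The module sizes, the mean-pooling merge, and all names are illustrative assumptions, not the patent's reference implementation.

```python
import torch
import torch.nn as nn

class WordRepresentationSynthesizer(nn.Module):
    """Merges subword contextual representations into word-based ones by
    mean-pooling the subword vectors that belong to the same word (the
    pooling choice is an assumption; the patent does not fix the function)."""
    def forward(self, subword_reprs, word_spans):
        # subword_reprs: (seq_len, d_model); word_spans: list of (start, end)
        return torch.stack([subword_reprs[s:e].mean(dim=0) for s, e in word_spans])

class Translator(nn.Module):
    def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead), num_layers)
        self.synthesizer = WordRepresentationSynthesizer()
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead), num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, subword_ids, word_spans, tgt_ids):
        # Step 102: subword-based contextual representation from the encoder.
        memory = self.encoder(self.embed(subword_ids).unsqueeze(1))      # (S, 1, d)
        # Step 103: merge subwords of the same word into word-based vectors.
        word_memory = self.synthesizer(memory.squeeze(1), word_spans).unsqueeze(1)
        # Step 104: the decoder attends over word-level memory only
        # (a causal target mask would be added for real training/decoding).
        dec = self.decoder(self.embed(tgt_ids).unsqueeze(1), word_memory)
        return self.out(dec)                                             # (T, 1, V)

model = Translator(vocab_size=32000)
logits = model(torch.tensor([3, 4, 5, 6]), [(0, 2), (2, 4)], torch.tensor([1, 7, 8]))
```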
  • The embodiment of the present application provides a machine translation method that is applied in the process of mutual translation between two different languages and runs on electronic devices such as mobile phones and servers.
  • The process includes at least, but is not limited to, the following steps:
  • Step 101: acquire corpus data to be translated.
  • The corpus data to be translated in this embodiment is text data; this embodiment does not limit its size, which may be a sentence, a word, or a paragraph.
  • The source of the corpus data to be translated can be audio, video, text, and so on.
  • Acquiring it may also involve other processing such as speech-to-text conversion and text segmentation.
  • For example, to provide subtitles in language B for a video in language A, speech-to-text processing is performed on the audio signal in the video to obtain several sentences; each sentence may in turn serve as the current corpus data to be translated, or the sentences may be taken together as the corpus data to be translated, and the translation results are then added to the video as subtitles.
  • PDF Portable Document Format
  • OCR Optical Character Recognition
  • Step 102: input the word-segmented corpus data to be translated into the encoder to obtain a subword-based contextual representation.
  • The encoder can adopt various network structures, such as a multi-layer attention network (Transformer) or a recurrent neural network (RNN), which are not enumerated here.
  • The input of the encoder is the word-segmented corpus data to be translated; therefore, before inputting data into the encoder, word segmentation must be performed on the corpus data to be translated, i.e., its words are split, for example splitting the word "信息" ("information") into the two subwords "信" and "息". Word segmentation can be realized with a word segmentation model, for example a byte pair encoding (BPE) model learned to segment the corpus data to be translated.
  • BPE byte pair encoding
  • A subword-based contextual representation is one whose minimum granularity is the subword.
  • For example, the subword-based contextual representation obtained after the encoder processes subword A and subword B includes the contextual representation C of subword A and the contextual representation D of subword B.
  • Not every word in the word-segmented corpus data to be translated is necessarily split; some words may remain in their original form, and the number of subwords produced by splitting may differ from word to word. For example, after word segmentation, "技术方案" ("technical scheme") may yield the two subwords "技术" and "方案", while "珍馐美味" ("delicacies") may yield the three subwords "珍", "馐", and "美味"; a segmentation sketch follows.
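  • As an illustration of the segmentation step, here is a self-contained Python sketch of a greedy longest-match splitter over a toy subword vocabulary, recording span-style position information for each word. The vocabulary and the greedy matching are assumptions made for illustration; the patent itself segments with a learned BPE model.

```python
# Toy subword vocabulary; a real system would learn this with BPE.
SUBWORD_VOCAB = {"技术", "方案", "珍", "馐", "美味", "信", "息"}

def segment(words):
    """Greedily split each word into known subwords (longest match first),
    recording a (start, end) span per word so the word representation
    synthesizer can later re-merge the subwords of the same word."""
    subwords, spans = [], []
    for word in words:
        start = len(subwords)
        i = 0
        while i < len(word):
            for j in range(len(word), i, -1):  # longest-match lookup
                if word[i:j] in SUBWORD_VOCAB or j == i + 1:
                    subwords.append(word[i:j])
                    i = j
                    break
        spans.append((start, len(subwords)))
    return subwords, spans

subwords, spans = segment(["技术方案", "珍馐美味"])
# subwords -> ['技术', '方案', '珍', '馐', '美味'], spans -> [(0, 2), (2, 5)]
```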
  • Step 103: input the subword-based contextual representation into the word representation synthesizer to obtain a word-based contextual representation.
  • A word-based contextual representation is one whose minimum granularity is the word; for example, the word representation synthesizer synthesizes the contextual representation C of subword A and the contextual representation D of subword B into a contextual representation E, where E is a word-based contextual representation.
  • The word-based contextual representation will be input into the decoder for decoding and translation, and the way the contextual representations of the subwords are merged affects the decoding and translation quality. For "技术方案", the corresponding subwords are "技", "术", "方", and "案"; merging the contextual representations of "技" and "术" into one representation and those of "方" and "案" into another is more accurate than merging those of "技", "术", and "方" into one, because the latter breaks the subword composition pattern.
  • In some examples, word segmentation is performed on the corpus data to be translated and first position labels are generated for the segmented words; the word representation synthesizer can then determine, from the first position labels, the positions of the subwords contained in each segmented word and merge the contextual representations of the subwords at those positions.
  • That is, inputting the subword-based contextual representation into the word representation synthesizer to obtain the word-based contextual representation can be realized as follows: input the subword-based contextual representation and the first position labels into the word representation synthesizer, so that the contextual representations of the subwords that come from the same word are merged into a word-based contextual representation.
  • In other examples, before the word-segmented corpus data to be translated is input into the encoder to obtain the subword-based contextual representation, the machine translation method further includes: generating a second position label for each word in the corpus data to be translated, and then performing word segmentation on the corpus data to be translated.
  • Correspondingly, inputting the subword-based contextual representation into the word representation synthesizer to obtain the word-based contextual representation can be realized as follows: input the subword-based contextual representation and the second position labels into the word representation synthesizer, so that the contextual representations of the subwords that come from the same word are merged into a word-based contextual representation.
  • Because position labels are generated for every word, even if the position labels of some segmented words are lost, the positions they indicated can still be determined from the position labels of the other words, which avoids the impact of label loss.
  • Generating position labels only for the segmented words, on the other hand, reduces processing and speeds it up; which label generation method to use can be chosen according to actual needs.
  • The generated position labels can also be input into the encoder together with the data, so that the encoder generates a subword-based contextual representation containing the position labels.
  • In that case, the position labels mainly indicate the start and end positions of the contextual representations of the subwords that belong to the same word; a sketch of the merge step follows.
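  • The merge step inside the word representation synthesizer can be sketched as follows, assuming each subword carries the index of the word it belongs to (derived from the position labels) and that the subword vectors of one word are mean-pooled; both are assumptions, since the patent does not fix the synthesis function.

```python
import torch

def synthesize(subword_reprs, word_ids):
    """subword_reprs: (num_subwords, d) tensor of subword contextual vectors;
    word_ids[k]: index of the word that subword k belongs to.
    Returns one contextual vector per original word (mean of its subwords)."""
    num_words = max(word_ids) + 1
    out = torch.zeros(num_words, subword_reprs.size(1))
    counts = torch.zeros(num_words, 1)
    for k, w in enumerate(word_ids):
        out[w] += subword_reprs[k]
        counts[w] += 1
    return out / counts

# "技 / 术 / 方 / 案" -> word 0 = "技术", word 1 = "方案"
word_reprs = synthesize(torch.randn(4, 512), [0, 0, 1, 1])  # shape (2, 512)
```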
  • Step 104: input the word-based contextual representation into the decoder to obtain the translation result of the corpus data to be translated.
  • The decoder in effect determines the language of the corpus data to be translated and then translates it into an expression in the other language.
  • Similarly to how existing encoder-decoder models produce translations, the encoder, word representation synthesizer, and decoder provided in this embodiment can translate between two languages.
  • The same set of encoder, word representation synthesizer, and decoder handles both directions.
  • When the input is Chinese, the decoder outputs the corresponding English expression of the input corpus data to be translated.
  • When the input is English, the decoder outputs the corresponding Chinese expression, i.e., mutual translation between the two languages is realized. The decoder therefore needs to determine the language of the corpus data to be translated, so as to output the expression in the other language; this behavior is determined by the training method of unsupervised machine translation. To help those skilled in the art better understand the above description, the training process is described below.
  • Before the word-segmented corpus data to be translated is input into the encoder to obtain the subword-based contextual representation, the machine translation method also includes: obtaining two monolingual corpus data sets corresponding to two different languages; performing word segmentation on the corpus data in the two monolingual corpus data sets to obtain two first training sets; pre-training the encoder, word representation synthesizer, and decoder on the first training sets; and performing back-translation on the pre-trained encoder, word representation synthesizer, and decoder using the two monolingual corpus data sets to obtain the trained encoder, word representation synthesizer, and decoder.
  • Pre-training the encoder, word representation synthesizer, and decoder on the first training sets can be achieved as follows: randomly add masks to the corpus data of each first training set to obtain the first mask training sets, for example covering 50% of the content with a mask over a randomly selected region; then, using the first mask training sets, jointly train the encoder, word representation synthesizer, and decoder with masked sequence-to-sequence (MASS) training.
  • MASS masked Sequence to Sequence
  • Joint training means training the encoder, word representation synthesizer, and decoder together, where the output of the encoder is the input of the word representation synthesizer, and the output of the word representation synthesizer is the input of the decoder.
  • The loss function of the joint training is as follows:

    $L_{mass}(\theta, l) = \mathbb{E}_{x \sim D_l}\left[-\log P\left(x_{i:j} \mid x^{\setminus i:j}; \theta\right)\right]$

  • where $L_{mass}(\theta, l)$ is the loss value corresponding to the corpus data of the first mask training set for language $l$, mainly describing how well the masked part of the corpus data is recovered during training; $\theta$ is the training parameter; $x$ denotes corpus data in the first training set; $D_l$ is the first training set for language $l$; $P(x_{i:j} \mid x^{\setminus i:j}; \theta)$ is the conditional probability that the part covered by the mask is accurately decoded by the decoder; $x_{i:j}$ is the part of the corpus data covered by the mask, with $i$ its start position and $j$ its end position in the corpus data; and $x^{\setminus i:j}$ denotes the corpus data with the span from $i$ to $j$ masked out.
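  • The construction of a mask training set can be sketched as follows; the contiguous-span masking and the 50% ratio follow the example above, and the rest is an illustrative assumption.

```python
import random

MASK = "<mask>"

def mask_sequence(tokens, ratio=0.5, rng=random):
    """Cover a random contiguous span of roughly `ratio` of the tokens:
    the masked input x^{\\i:j} is the encoder input, and the covered span
    x_{i:j} is what the decoder must recover (MASS-style)."""
    span = max(1, int(len(tokens) * ratio))
    i = rng.randrange(0, len(tokens) - span + 1)
    j = i + span
    source = tokens[:i] + [MASK] * span + tokens[j:]
    target = tokens[i:j]
    return source, target, (i, j)

src, tgt, (i, j) = mask_sequence(["信", "息", "技术", "方案"])
# e.g. src = ['信', '息', '<mask>', '<mask>'], tgt = ['技术', '方案']
```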
  • A new training set can also be constructed through data augmentation for further joint training. That is, pre-training the encoder, word representation synthesizer, and decoder on the first training sets can also be achieved as follows: randomly select corpus data in the first training set and split the still-unsegmented words in the selected corpus data to obtain new corpus data; add the new corpus data to the first training set to obtain the second training set; randomly add masks to the corpus data in the first training set and the second training set to obtain the second mask training set and the third mask training set, respectively; and, using the second mask training set and the third mask training set, jointly train the encoder, word representation synthesizer, and decoder with MASS.
  • The second mask training set is substantially the same as the first mask training set in the previous example; therefore, the loss contributed by the second mask training set can be expressed as in the previous example.
  • The corpus data of the third mask training set is the first training set with its words split one level further, and can be regarded as another first mask training set with stronger word segmentation; therefore, the loss contributed by the third mask training set can be computed in a similar way, namely by the following expression:
    $L_{split}(\theta, l) = \mathbb{E}_{x \sim D_s}\left[-\log P\left(x_{i:j} \mid x^{\setminus i:j}; \theta\right)\right]$

  • where $L_{split}(\theta, l)$ is the loss value corresponding to the corpus data of the third mask training set for language $l$, mainly describing how well the masked part of the corpus data $x$ is recovered during training; $\theta$ is the training parameter; $x$ denotes corpus data in the second training set; $D_s$ is the second training set for language $l$; $P(x_{i:j} \mid x^{\setminus i:j}; \theta)$ is the conditional probability that the part covered by the mask is accurately decoded by the decoder; $x_{i:j}$ is the part of the corpus data covered by the mask, with $i$ its start position and $j$ its end position in the corpus data; and $x^{\setminus i:j}$ denotes the corpus data with the span from $i$ to $j$ masked out; an augmentation sketch follows.
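  • A sketch of this augmentation step: still-unsegmented (here: multi-character) tokens of sampled corpus data are split one level further and the result is appended to the first training set; the character-level fallback and the sampling ratio are illustrative assumptions.

```python
import random

def resplit(tokens):
    """Split every multi-character token one level further, into characters."""
    out = []
    for tok in tokens:
        out.extend(list(tok) if len(tok) > 1 else [tok])
    return out

def augment(first_training_set, sample_ratio=0.3, rng=random):
    """Build the second training set: the first set plus re-split samples."""
    k = max(1, int(len(first_training_set) * sample_ratio))
    new_data = [resplit(sent) for sent in rng.sample(first_training_set, k)]
    return first_training_set + new_data

second_set = augment([["技术", "方案"], ["珍", "馐", "美味"]])
```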
  • Pre-training the encoder, word representation synthesizer, and decoder on the first training sets can also be achieved as follows: determine a preset amount of corpus data from the first training set as target corpus data; split the still-unsegmented words in the target corpus data again to obtain target segmentation corpus data; merge the target corpus data and the target segmentation corpus data to obtain the third training set, in which one piece of training data is a corpus data pair consisting of a piece of target corpus data and its corresponding target segmentation corpus data; randomly add masks to the corpus data in the first training set to obtain the fourth mask training set; and jointly train the encoder, word representation synthesizer, and decoder on the third training set and the fourth mask training set, where the third training set is used for supervised training of the word representation synthesizer and the fourth mask training set is used for MASS training of the encoder, word representation synthesizer, and decoder.
  • The still-unsegmented words in the corpus data of the first training set are split further, so that each word before splitting serves as the supervisory signal for its subwords after splitting, providing additional supervised training for the word representation synthesizer.
  • The loss contributed by the third training set is calculated by the following loss function:

    $L_{combiner}(\theta; l) = \mathbb{E}_{x \sim D_t}\left[\sum_{i} DIS\left(E_{true}(x_i), E_{fake}(x_i)\right)\right]$

  • where $L_{combiner}(\theta; l)$ is the loss value corresponding to the corpus data of the third training set for language $l$, mainly describing how accurately the word representation synthesizer synthesizes subwords into words; $\theta$ is the training parameter; $D_t$ is the third training set for language $l$; $x$ is a corpus data pair; $t(x)$ is the target segmentation corpus data in the corpus data pair; $E_{true}(x_i)$ is the representation of the $i$-th word of the target corpus data in a corpus data pair of the third training set; $E_{fake}(x_i)$ is the output of the word representation synthesizer for the target segmentation corpus data in the corpus data pair; and $DIS(E_{true}(x_i), E_{fake}(x_i))$ is the negative distance between $E_{true}(x_i)$ and $E_{fake}(x_i)$; a sketch of this loss follows.
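  • A sketch of this supervised loss, under the stated assumptions that E_true is the encoder's representation of the unsplit word, E_fake is the synthesizer's output for the re-split subwords, and DIS is the negative Euclidean distance; the outer sign is an assumption chosen so that minimizing the loss pulls the two representations together.

```python
import torch

def dis(e_true, e_fake):
    """Negative (Euclidean) distance between the two representations."""
    return -torch.norm(e_true - e_fake, dim=-1)

def combiner_loss(e_true_words, e_fake_words):
    """Sum over the words of a corpus data pair; negating DIS makes the
    loss a plain distance, so minimizing it aligns E_fake with E_true."""
    return -dis(e_true_words, e_fake_words).sum()

loss = combiner_loss(torch.randn(5, 512), torch.randn(5, 512))
```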
  • This example in effect adds supervised training of the word representation synthesizer on top of the first example.
  • The training set obtained by data augmentation and the supervised training data set can also both be added for joint training.
  • The above examples are described mainly in terms of a single first training set; during execution, the joint effect of the two first training sets needs to be considered.
  • In joint training, the total loss is not simply the sum of two loss functions: the loss functions are superimposed over the two monolingual corpus data sets. Take as an example joint training with the data-augmentation training set and the supervised training data set, where the monolingual corpus data sets are Chinese and English respectively.
  • The overall loss function should then be:

    $L(\theta) = \sum_{l \in \{zh,\, en\}} \left[ L_{mass}(\theta, l) + L_{split}(\theta, l) + L_{combiner}(\theta; l) \right]$

  • where $L(\theta)$ is the total loss value; $L_{mass}(\theta, zh)$ and $L_{mass}(\theta, en)$ are the MASS losses on the first training sets corresponding to Chinese and English; $L_{split}(\theta, zh)$ and $L_{split}(\theta, en)$ are the MASS losses on the data-augmented (re-segmented) training sets corresponding to Chinese and English; and $L_{combiner}(\theta; zh)$ and $L_{combiner}(\theta; en)$ are the supervised word representation synthesizer losses corresponding to Chinese and English.
  • The process of back-translation is as follows: the monolingual corpus data set is used as the input of the pre-trained encoder, word representation synthesizer, and decoder, and the decoder outputs constitute the translation reference data set; then the translation reference data set is used as the input of the pre-trained encoder, word representation synthesizer, and decoder, with the corresponding monolingual corpus data set as the supervisory signal, thereby constructing a pseudo-parallel corpus data set for supervised training.
  • For example, the pre-trained encoder, word representation synthesizer, and decoder process the Chinese monolingual corpus data set to obtain the corresponding English expressions, constructing an English translation reference data set; the English translation reference data set is then used as the input of the pre-trained encoder, word representation synthesizer, and decoder, and training proceeds with the Chinese monolingual corpus data set as the supervisory signal.
  • The loss function during back-translation is as follows:

    $L_{bt}(\theta, l) = \mathbb{E}_{x \sim D_l}\left[-\log P\left(x \mid M(x); \theta\right)\right]$

  • where $L_{bt}(\theta, l)$ is the back-translation loss value for language $l$; $\theta$ is the training parameter; $D_l$ is the monolingual corpus data set of language $l$; and $M(x)$ is the translation corresponding to corpus data $x$ in the translation reference data set.
  • After pre-training, the model can encode contextual word representations in a bilingual shared space, and the decoder can be made to decode into English by controlling the start symbol or the language encoding. This initial translation result has a certain quality, but it is not ideal.
  • Back-translation training can use monolingual data and the existing translation model to further improve the translation quality.
  • During training, the training parameter $\theta$ can be adjusted according to the loss value, and training continues with the adjusted parameter until the loss value meets a preset threshold, the loss value converges, the number of training iterations reaches a preset value, or the like; a back-translation sketch follows.
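  • One round of back-translation can be sketched as follows; `model.translate` and `model.train_step` are assumed interfaces standing in for the pre-trained encoder, word representation synthesizer, and decoder.

```python
def back_translation_round(model, zh_corpus, en_corpus):
    """One round of back-translation over two monolingual corpus data sets."""
    # 1. Translate each monolingual set to build translation reference sets.
    en_refs = [model.translate(x, tgt_lang="en") for x in zh_corpus]
    zh_refs = [model.translate(x, tgt_lang="zh") for x in en_corpus]
    # 2. Pseudo-parallel pairs: the reference translation is the input and
    #    the original monolingual sentence is the supervisory signal.
    pseudo_pairs = list(zip(en_refs, zh_corpus)) + list(zip(zh_refs, en_corpus))
    # 3. Supervised training on the pseudo-parallel corpus (the L_bt term).
    for src, tgt in pseudo_pairs:
        model.train_step(src, tgt)
```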
  • The embodiment of the present application also provides a machine translation apparatus, as shown in FIG. 2, including:
  • the acquiring module 201 is configured to acquire corpus data to be translated.
  • the encoding module 202 is configured to input the word-segmented corpus data to be translated into the encoder to obtain a contextual representation based on subwords.
  • the synthesis module 203 is configured to input the subword-based context representation into the word representation synthesizer to obtain the word-based context representation.
  • the decoding module 204 is configured to input the word-based context representation into the decoder to obtain the translation result of the corpus data to be translated.
  • This embodiment is an apparatus embodiment corresponding to the method embodiment above, and the two can be implemented in cooperation with each other.
  • The relevant technical details mentioned in the method embodiment remain valid in this embodiment and are not repeated here.
  • Correspondingly, the relevant technical details mentioned in this embodiment can also be applied in the method embodiment.
  • The modules involved in this embodiment are logical modules.
  • A logical unit can be a physical unit, a part of a physical unit, or a combination of multiple physical units.
  • Units not closely related to solving the technical problem proposed in the present application are not introduced in this embodiment, which does not mean that no other units exist in this embodiment.
  • Another aspect of the embodiments of the present application provides an electronic device, as shown in FIG. 3, including: at least one processor 301; and a memory 302 communicatively connected to the at least one processor 301, where the memory 302 stores instructions executable by the at least one processor 301, and the instructions are executed by the at least one processor 301 so that the at least one processor 301 can execute the machine translation method described in any one of the above method embodiments.
  • The memory 302 and the processor 301 are connected by a bus; the bus may include any number of interconnected buses and bridges, and connects one or more processors 301 and various circuits of the memory 302 together.
  • the bus may also connect together various other circuits such as peripherals, voltage regulators, and power management circuits, all of which are well known in the art and therefore will not be further described herein.
  • the bus interface provides an interface between the bus and the transceivers.
  • a transceiver may be a single element or multiple elements, such as multiple receivers and transmitters, providing means for communicating with various other devices over a transmission medium.
  • the data processed by the processor 301 is transmitted on the wireless medium through the antenna, and the antenna also receives the data and transmits the data to the processor 301 .
  • The processor 301 is responsible for managing the bus and general processing, and may also provide various functions including timing, peripheral interfaces, voltage regulation, power management, and other control functions, while the memory 302 can be used to store data used by the processor 301 when performing operations.
  • Another aspect of the embodiment of the present application provides a computer-readable storage medium storing a computer program.
  • When the computer program is executed by a processor, the machine translation method described in any one of the above method embodiments is realized.
  • The storage medium includes several instructions that cause a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to execute all or part of the steps of the methods described in the various embodiments of the present application.
  • The aforementioned storage media include media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The embodiment of the present application relates to the technical field of machine learning. Disclosed are a machine translation method and apparatus, an electronic device and a storage medium. The machine translation method comprises the following steps: acquiring corpus data to be translated (101); inputting said corpus data subjected to word segmentation into an encoder to obtain a subword-based contextual representation (102); inputting the subword-based contextual representation into a word representation synthesizer to obtain a word-based contextual representation (103); and inputting the word-based contextual representation into a decoder to obtain a translation result of said corpus data (104).

Description

Machine Translation Method and Apparatus, Electronic Device and Storage Medium
Cross-Reference to Related Applications
This application is based on the Chinese patent application with application number 202111567148.X, filed on December 20, 2021, and claims priority to that Chinese patent application, the entire content of which is hereby incorporated into this application by reference.
Technical Field
The embodiments of the present application relate to the technical field of machine learning, and in particular to a machine translation method and apparatus, an electronic device, and a storage medium.
Background
According to how the model is trained, machine translation can be divided into supervised machine translation, semi-supervised machine translation, and unsupervised machine translation. Unsupervised machine translation does not need parallel corpus data; it only needs monolingual corpus data collected into monolingual corpus data sets for training, and therefore has broader application prospects. Current unsupervised machine translation is mainly based on an encoder-decoder plus attention mechanism with pre-training and back-translation: pre-training learns contextual representations shared between the encodings of two different languages, and back-translation constructs pseudo-parallel corpora for translation training to further improve translation quality.
However, translation based on the above models is effective only between language pairs such as English-French and English-German; for pairs such as Chinese-English and French-Korean, the expected results are often not achieved.
Contents of the Invention
Embodiments of the present application provide a machine translation method and apparatus, an electronic device, and a storage medium.
An embodiment of the present application provides a machine translation method, including the following steps: acquiring corpus data to be translated; inputting the word-segmented corpus data to be translated into an encoder to obtain a subword-based contextual representation; inputting the subword-based contextual representation into a word representation synthesizer to obtain a word-based contextual representation; and inputting the word-based contextual representation into a decoder to obtain a translation result of the corpus data to be translated.
An embodiment of the present application also proposes a machine translation apparatus, including: an acquisition module configured to acquire corpus data to be translated; an encoding module configured to input the word-segmented corpus data to be translated into an encoder to obtain a subword-based contextual representation; a synthesis module configured to input the subword-based contextual representation into a word representation synthesizer to obtain a word-based contextual representation; and a decoding module configured to input the word-based contextual representation into a decoder to obtain a translation result of the corpus data to be translated.
An embodiment of the present application also proposes an electronic device, including: at least one processor; and a memory communicatively connected to the at least one processor; where the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can execute the machine translation method described above.
An embodiment of the present application also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the machine translation method described above.
Description of the Drawings
One or more embodiments are illustrated by the figures in the corresponding accompanying drawings; these illustrations do not constitute a limitation on the embodiments.
FIG. 1 is a flowchart of a machine translation method provided in an embodiment of the present application;
FIG. 2 is a schematic structural diagram of a machine translation apparatus provided in another embodiment of the present application;
FIG. 3 is a schematic structural diagram of an electronic device provided in another embodiment of the present application.
Detailed Description
As can be seen from the background, the translation quality of current machine translation methods is limited by the source and target languages; not every pair of languages can be translated with the expected quality.
Analysis shows that the effect of current machine translation methods is limited by language because, during pre-training, what is input into the decoder are subwords, and what is learned is a subword-based contextual representation shared between the encodings of the two languages. Subwords of language pairs such as English-French and English-German have some commonality, and the ways words are constructed from subwords are also similar: for example, the English word for the sun, "sun", is close to the German "Sonne", and both English and German use "er" and "est" to express the comparative and the superlative. By contrast, the differences between Chinese and English subword representations, and in how words are constructed from subwords, are very large. Even if shared-space subword representations can be learned through training, the diversity of subword composition and the instability of subword meanings within words and sentences cause the decoder, translating subword by subword, to stray far from the true meanings of the words, so the expected translation quality is not achieved.
To solve the above problem, an embodiment of the present application provides a machine translation method, including the following steps: acquiring corpus data to be translated; inputting the word-segmented corpus data to be translated into an encoder to obtain a subword-based contextual representation; inputting the subword-based contextual representation into a word representation synthesizer to obtain a word-based contextual representation; and inputting the word-based contextual representation into a decoder to obtain a translation result of the corpus data to be translated.
In the machine translation method proposed by the embodiments of this application, after the encoder outputs the subword-based contextual representation of the corpus data to be translated, that representation is not fed directly into the decoder; it is first input into the word representation synthesizer, which merges the subword-based contextual representation at word granularity into a word-based contextual representation, which is then input into the decoder. That is, on top of the original encoder-decoder plus attention architecture, an additional word representation synthesizer is introduced between the encoder and the decoder, and the decoder's input changes from the original subword-based contextual representation to a word-based contextual representation, so the granularity of decoding and translation changes from subwords to words. Because the meaning of a word is more stable than that of a subword and is less affected by linguistic structure, such as context and how the word is embedded in a sentence, the decoder no longer needs to reconstruct words from subwords. This avoids the problem that differences in how subwords compose words cause the meanings of words reconstructed from subwords, and hence the meanings of sentences, to change between the source and target languages. The meaning of each word is therefore rendered more accurately, the translated sentences are more accurate, the language restrictions of the translation process are overcome, and effective, accurate translation between arbitrary languages becomes possible.
To make the purpose, technical solutions, and advantages of the embodiments of the present application clearer, the embodiments of the present application are described in detail below with reference to the accompanying drawings. Those of ordinary skill in the art will understand, however, that many technical details are provided in each embodiment so that readers can better understand the application; the technical solutions claimed in this application can be realized even without these technical details and with various changes and modifications based on the following embodiments. The division into the following embodiments is for convenience of description and does not limit the implementation of the present application; the embodiments can be combined with and refer to one another where they do not contradict.
An aspect of the embodiments of the present application provides a machine translation method, applied in the process of mutual translation between two different languages and applied to electronic devices such as mobile phones and servers. As shown in FIG. 1, the process includes at least, but is not limited to, the following steps.
Step 101: acquire corpus data to be translated.
The corpus data to be translated in this embodiment is text data, but this embodiment does not limit its size: it may be a sentence, a word, or a paragraph.
It should be noted that the source of the corpus data to be translated may be audio, video, text, and so on, and acquiring it may also involve other processing such as speech-to-text conversion and text segmentation.
In one example, subtitles in language B must be provided for a video in language A. Speech-to-text processing is first performed on the audio signal in the video to obtain several sentences; each of the sentences may in turn serve as the current corpus data to be translated, or the sentences may be taken together as the corpus data to be translated, and the translation results are then added to the video as subtitles.
In another example, a Portable Document Format (PDF) document must be translated into an editable document in a target language. Optical Character Recognition (OCR) may be performed on the PDF document to obtain the text data of the whole document; the whole document may be used as the corpus data to be translated, or the text may be divided by paragraphs and each divided part used in turn as the corpus data to be translated, after which the translated document is saved in an editable text format.
Step 102: input the word-segmented corpus data to be translated into the encoder to obtain a subword-based contextual representation.
In this embodiment, the encoder can adopt various network structures, such as a multi-layer attention network (Transformer) or a recurrent neural network (RNN), which are not enumerated here.
It will be understood that the input of the encoder is the word-segmented corpus data to be translated; therefore, before data is input into the encoder, word segmentation must be performed on the corpus data to be translated, i.e., the words in the corpus data are split, for example splitting the word "信息" ("information") into the two subwords "信" and "息". Word segmentation can be realized with a word segmentation model, for example a byte pair encoding (BPE) model learned to segment the corpus data to be translated.
It should be noted that a subword-based contextual representation is one whose minimum granularity is the subword; for example, the subword-based contextual representation obtained after the encoder processes subword A and subword B includes the contextual representation C of subword A and the contextual representation D of subword B.
It should also be noted that not every word in the word-segmented corpus data to be translated is necessarily split; some words may remain in their original form, and the number of subwords produced by splitting may differ from word to word. For example, after word segmentation, "技术方案" ("technical scheme") may yield the two subwords "技术" and "方案", while "珍馐美味" ("delicacies") may yield the three subwords "珍", "馐", and "美味".
步骤103,将基于子词的上下文表示输入词表示合成器,得到基于词的上下文表示。 Step 103, input the subword-based context representation into the word representation synthesizer to obtain a word-based context representation.
本实施例中,基于词的上下文表示是指上下文的最小粒度为词,如词表示合成器将子词A的上下文表示C和子词B的上下文表示D合成为上下文表示E,其中,上下文表示E为基于词的上下文表示。In this embodiment, the word-based context representation means that the minimum granularity of the context is a word, such as the word representation synthesizer synthesizes the context representation C of the subword A and the context representation D of the subword B into a context representation E, wherein the context representation E is a word-based contextual representation.
可以理解的是,基于词的上下文表示将会被输入到解码器中进行解码翻译,而基于词的上下文表示中对各个子词的上下文表示的合成方式会影响解码翻译的效果,如对于“技术方案”,其对应的子词为“技”、“术”、“方”和“案”,若是将“技”和“术”的上下文表示合成为一个上下文表示、将“方”和“案”的上下文表示合成为一个上下文表示,比将“技”、“术”和“方”的上下文表示合成为一个上下文表示,更加准确,因为第二种合成 方式破坏了子词构词的模式,也就是说,词表示合成器在合成处理时如果能够按照词切分前待翻译语料数据中词的分布位置进行合成会更加准确。It can be understood that the word-based contextual representation will be input into the decoder for decoding and translation, and the synthesis method of the contextual representation of each subword in the word-based contextual representation will affect the effect of decoding and translation, such as for "technical scheme", its corresponding subwords are "technique", "skill", "fang" and "case", if the context expressions of "technique" and "technique" are synthesized into one context representation, "fang" and "case" The context representation of "is synthesized into a context representation, which is more accurate than combining the context representations of "Technology", "Skill" and "Fang" into a context representation, because the second synthesis method destroys the pattern of subword formation, That is to say, it will be more accurate if the word representation synthesizer can synthesize according to the distribution position of words in the corpus data to be translated before word segmentation.
因此,在一些例子中,将词切分后的待翻译语料数据输入编码器,得到基于子词的上下文表示之前,机器翻译方法还包括:对待翻译语料数据进行词切分并为待翻译语料数据中被切分的词生成第一位置标签,其中,在有多个词被切分的情况下会生成多个第一位置标签,每个第一位置标签都指示了一个被切分的词的位置信息,例如,对于语料数据x=(x1,x2,x3),其中,x1、x2和x3为3个词,假设词x2被切分为x2'和x2'',则在切分后的待翻译语料数据x'=(x1,x2',x2'',x3)中生成指示X1之后的位置的标签和生成指示X2之前的位置的标签共同作为第一位置标签,以用于标记词x2被切分得到的子词的起始位置和终止位置,特别地,还可以直接在切分后的待翻译语料数据中进行标注,如切分后的待翻译语料数据x'=(x1,E beg,x2',x2'',E end,x3),其中,E beg标志某个被切分的词的起始位置,E end标志某个被切分的词的起始位置。进而在生成了第一位置标签的情况下,词表示合成器可以根据第一位置标签确定被切分的词包含的子词位置,从而可以将该位置内的子词的上下文表示合成,即将基于子词的上下文表示输入词表示合成器,得到基于词的上下文表示,可以通过如下方式实现:将基于子词的上下文表示和第一位置标签输入词表示合成器,以对基于子词的上下文表示中来自同一个词的若干子词的上下文表示进行合成,得到基于词的上下文表示。 Therefore, in some examples, before inputting the word-segmented corpus data to be translated into the encoder to obtain the context representation based on subwords, the machine translation method further includes: performing word segmentation on the to-be-translated corpus data and generating The segmented word in generates the first position label, where multiple first position labels are generated in the case of multiple words being segmented, and each first position label indicates the position of a segmented word Position information, for example, for corpus data x=(x1, x2, x3), where x1, x2 and x3 are 3 words, assuming that word x2 is divided into x2' and x2'', then after segmentation In the corpus data to be translated x'=(x1, x2', x2'', x3), the generated label indicating the position after X1 and the generated label indicating the position before X2 are used together as the first position label to mark the word x2 The start position and end position of the subwords obtained by segmentation can also be directly marked in the segmented corpus data to be translated, such as the segmented corpus data to be translated x'=(x1, E beg , x2', x2'', E end , x3), where E beg marks the starting position of a segmented word, and E end marks the starting position of a segmented word. Furthermore, when the first position label is generated, the word representation synthesizer can determine the subword position contained in the segmented word according to the first position label, so that the context representation of the subword in the position can be synthesized, that is, based on The contextual representation of the subword is input into the word representation synthesizer, and the context representation based on the word is obtained, which can be realized in the following way: the context representation based on the subword and the first position label are input into the word representation synthesizer, and the context representation based on the subword The contextual representations of several subwords from the same word are synthesized to obtain a word-based contextual representation.
在另一些例子中,将词切分后的待翻译语料数据输入编码器,得到基于子词的上下文表示之前,机器翻译方法还包括:为待翻译语料数据中的每一个词生成第二位置标签并对待翻译语料数据进行词切分。相应地,将基于子词的上下文表示输入词表示合成器,得到基于词的上下文表示,可以通过如下方式实现:将基于子词的上下文表示和第二位置标签输入词表示合成器,以对基于子词的上下文表示中来自同一个词的若干子词的上下文表示进行合成,得到基于词的上下文表示。值得一提的是,不对词按照是否被切分进行区分,而是为每个词生成位置标签,这样即使部分被切分的词的位置标签丢失,也仍然能够根据其他词的位置标签确定除丢失的位置标签指示的位置,避免了标签丢失的影响。当然,为被切分的词生成位置标签则能够减少处理加快处理效率,可以根据实际需求选择使用哪种标签生成方法。In other examples, before inputting the word-segmented corpus data to be translated into the encoder to obtain the subword-based context representation, the machine translation method further includes: generating a second position label for each word in the corpus data to be translated And perform word segmentation on the corpus data to be translated. Correspondingly, the subword-based context representation is input into the word representation synthesizer to obtain the word-based context representation, which can be realized in the following manner: input the subword-based context representation and the second position label into the word representation synthesizer to obtain the word representation synthesizer based on In the context representation of subwords, the context representations of several subwords from the same word are synthesized to obtain a word-based context representation. It is worth mentioning that instead of distinguishing words according to whether they are segmented or not, position labels are generated for each word, so that even if the position labels of some of the segmented words are lost, it can still be determined based on the position labels of other words. The position indicated by the lost position label avoids the influence of label loss. Of course, generating location tags for segmented words can reduce processing and speed up processing efficiency, and you can choose which tag generation method to use according to actual needs.
It should be noted that the above examples only describe the position label generation process and the word representation synthesis process. In practice, the generated position labels can also be fed into the encoder together with the input, so that the encoder produces a subword-based context representation carrying the position labels; in that case the position labels mainly indicate the start and end positions of the context representations of the several subwords belonging to the same word. Details are not repeated here.
Of course, the above is only illustrative; in other examples, the subword-based context representations can also be synthesized at random. Details are likewise not repeated here.
Step 104: the word-based context representation is input into the decoder to obtain the translation result of the corpus data to be translated.
In this embodiment, the decoder in effect determines the language of the corpus data to be translated, and then translates it into an expression in the other language.
It should be noted that, similar to how an existing encoder and decoder process the corpus data to be translated to obtain a translation, the encoder, word representation synthesizer and decoder provided in this embodiment can likewise translate between two languages. For one set of encoder, word representation synthesizer and decoder, when the input is Chinese, the decoder outputs the corresponding English expression of the input corpus data to be translated; when the input is English, the decoder outputs the corresponding Chinese expression, i.e. mutual translation between the two languages is achieved. The decoder therefore needs to determine the language of the corpus data to be translated, so that the output expression is produced in the other language. All of this is determined by the training method of unsupervised machine translation. To help those skilled in the art better understand the above description, the training process is explained below.
Before the word-segmented corpus data to be translated is input into the encoder to obtain the subword-based context representation, the machine translation method further includes: obtaining two monolingual corpus data sets corresponding to different languages; performing word segmentation on the corpus data in the two monolingual corpus data sets to obtain two first training sets; pre-training the encoder, the word representation synthesizer and the decoder according to the first training sets; and, according to the two monolingual corpus data sets, performing back-translation processing on the pre-trained encoder, word representation synthesizer and decoder to obtain the trained encoder, word representation synthesizer and decoder.
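At a high level, the training procedure just described could be orchestrated as in the sketch below; every function and class name here (load_monolingual, word_segment, pretrain_mass, back_translate, Model) is a hypothetical placeholder for the corresponding step, not an API defined by the embodiment.

# Hypothetical outline of the unsupervised training pipeline.
zh_corpus = load_monolingual("zh")   # two monolingual corpus data sets
en_corpus = load_monolingual("en")   # in different languages

zh_train = word_segment(zh_corpus)   # the two first training sets
en_train = word_segment(en_corpus)

model = Model(encoder, word_synthesizer, decoder)
pretrain_mass(model, [zh_train, en_train])    # MASS pre-training
back_translate(model, zh_corpus, en_corpus)   # back-translation fine-tuning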
In some examples, pre-training the encoder, the word representation synthesizer and the decoder according to the first training sets can be implemented as follows: masks are randomly added to the corpus data of each first training set to obtain the first mask training sets, for example by covering 50% of the content with a mask over a randomly selected region; then, according to the first mask training sets, the encoder, the word representation synthesizer and the decoder are jointly trained based on Masked Sequence to Sequence (MASS).
It should be noted that joint training means training the encoder, the word representation synthesizer and the decoder together, where the output of the encoder serves as the input of the word representation synthesizer and the output of the word representation synthesizer serves as the input of the decoder. The loss function of the joint training is as follows:
L_mass(θ, l) = E_{x∈D_l}[ −log p(x_{i:j} | x_{\i:j}; θ) ]
Here, L_mass(θ, l) denotes the loss value for the corpus data of the first mask training set corresponding to language l, and mainly describes the probability that the masked part of that corpus data is correctly decoded by the decoder during training; θ is the training parameter; x denotes corpus data in the first training set; D_l denotes the first training set corresponding to language l; log p(x_{i:j} | x_{\i:j}; θ) denotes the log conditional probability that the masked part of the corpus data is accurately decoded by the decoder; x_{i:j} denotes the masked part of the corpus data; i and j denote the start and end positions of the masked part within the corpus data; and x_{\i:j} denotes the corpus data with the fragment from i to j masked out, i.e. the input on which the decoder conditions.
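A sketch of this objective is given below, assuming a contiguous mask covering 50% of the tokens as in the example above; the model interface, the encode token-id lookup, and the mask token are illustrative assumptions.

import random
import torch.nn.functional as F

MASK = "<mask>"

def mask_fragment(tokens, ratio=0.5):
    # Mask one random contiguous fragment covering `ratio` of the sequence.
    n = max(1, int(len(tokens) * ratio))
    i = random.randint(0, len(tokens) - n)
    return tokens[:i] + [MASK] * n + tokens[i + n:], i, i + n

def mass_loss(model, tokens):
    masked, i, j = mask_fragment(tokens)     # masked is x_{\i:j}
    logits = model(masked)                   # (seq_len, vocab) scores, assumed interface
    target = encode(tokens[i:j])             # hypothetical id lookup for x_{i:j}
    # negative log-likelihood of the masked fragment given the masked input
    return F.cross_entropy(logits[i:j], target)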
In other examples, a new training set can also be constructed through data augmentation for further joint training. That is, pre-training the encoder, the word representation synthesizer and the decoder according to the first training set can be implemented as follows: corpus data in the first training set is randomly selected and the unsegmented words in the selected corpus data are segmented to obtain new corpus data; the new corpus data is added to the first training set to obtain a second training set; masks are randomly added to the corpus data of the first training set and of the second training set to obtain a second mask training set and a third mask training set respectively; and the encoder, the word representation synthesizer and the decoder are jointly trained with MASS according to the second mask training set and the third mask training set.
It should be noted that the second mask training set is substantially the same as the first mask training set in the previous example; the loss brought by the second mask training set can therefore be computed with the expression shown in the previous example. The corpus data of the third mask training set is obtained by segmenting the corpus data of the first training set a further time before masking, so it can be regarded as another first mask training set with a stronger degree of word segmentation. The loss brought by the third mask training set can thus be computed in a similar way, i.e. through the following expression:
L_split(θ, l) = E_{x∈D_s}[ −log p(x_{i:j} | x_{\i:j}; θ) ]
Here, L_split(θ, l) denotes the loss value for the corpus data of the third mask training set corresponding to language l, and mainly describes the probability that the masked part of corpus data x in the third mask training set is correctly decoded by the decoder during training; θ is the training parameter; x denotes corpus data in the second training set; D_s denotes the second training set corresponding to language l; log p(x_{i:j} | x_{\i:j}; θ) denotes the log conditional probability that the masked part of the corpus data is accurately decoded by the decoder; x_{i:j} denotes the masked part of the corpus data; i and j denote the start and end positions of the masked part within the corpus data; and x_{\i:j} denotes the corpus data with the fragment from i to j masked out.
It is easy to see that this example adds a data-augmentation training set to the previous example for training.
It is worth mentioning that, by splitting words, the word representation synthesizer participates more in the MASS training, which further improves the performance of the model.
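The augmentation itself can be sketched as follows; the sampling rate and the split_further helper (which splits words left whole by the first segmentation) are assumptions for illustration.

import random

def augment(first_training_set, sample_rate=0.3):
    # Build the second training set: keep the original sentences and append
    # re-split copies of randomly selected ones.
    second = list(first_training_set)
    for sent in first_training_set:
        if random.random() < sample_rate:
            second.append([piece for w in sent for piece in split_further(w)])
    return second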
In still other examples, pre-training the encoder, the word representation synthesizer and the decoder according to the first training set can also be implemented as follows: a preset amount of corpus data is selected from the first training set as target corpus data; the unsegmented words in the target corpus data are segmented again to obtain target segmented corpus data; the target corpus data and the target segmented corpus data are merged to obtain a third training set, where one piece of training data in the third training set is a corpus data pair consisting of one piece of target corpus data and the corresponding target segmented corpus data; masks are randomly added to the corpus data of the first training set to obtain a fourth mask training set; and the encoder, the word representation synthesizer and the decoder are jointly trained according to the third training set and the fourth mask training set, where the third training set is used for supervised training of the word representation synthesizer and the fourth mask training set is used for MASS-based training of the encoder, the word representation synthesizer and the decoder.
It should be noted that this example further segments the unsegmented words in the corpus data of the first training set, so that the word before segmentation serves as a supervision signal for the several subwords after segmentation, thereby providing additional supervised training for the word representation synthesizer. The loss brought by the third training set is computed through the following loss function expression:
L_combiner(θ; l) = E_{x∈D_t}[ Σ_{x_i} −DIS(E_true(x_i), E_fake(x_i)) ]
Here, L_combiner(θ; l) denotes the loss value for the corpus data of the third training set corresponding to language l, and mainly describes the accuracy with which the word representation synthesizer combines subwords into words; θ is the training parameter; D_t denotes the third training set corresponding to language l; x denotes a corpus data pair; t(x) denotes the target segmented corpus data in the pair; E_true(x_i) denotes the output of the word representation synthesizer for the target corpus data of a pair in the third training set; E_fake(x_i) denotes the output of the word representation synthesizer for the target segmented corpus data of that pair; and DIS(E_true(x_i), E_fake(x_i)) denotes the negative distance between E_true(x_i) and E_fake(x_i).
It is easy to see that this example adds supervised training of the word representation synthesizer on top of the first example.
It is worth mentioning that splitting previously unsplit words and driving the synthesizer's representation toward the representation of the word before splitting trains the synthesizer explicitly, which helps obtain better synthesized word representations.
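A sketch of this supervised objective is shown below: the synthesizer output for the re-split sentence is pulled toward its output for the unsplit sentence. Squared Euclidean distance is an assumed choice of DIS (the embodiment only requires some negative distance measure), and model.synthesize is a hypothetical handle on the encoder plus word representation synthesizer.

def combiner_loss(model, original, resplit):
    # original / resplit: the two halves of one third-training-set pair.
    e_true = model.synthesize(original)   # E_true(x_i), unsplit words
    e_fake = model.synthesize(resplit)    # E_fake(x_i), re-split words
    # minimizing the distance maximizes the negative distance DIS
    return ((e_true - e_fake) ** 2).sum(dim=-1).mean()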
It should be noted that the above is only an illustration of pre-training. In other examples, both the training set obtained by data augmentation and the data set for supervised training can be added on top of the first example for joint training. Moreover, the above examples are mainly described with respect to one first training set; in execution, the joint effect of the two first training sets must be considered. In joint training, it is not simply two loss functions being added; rather, each loss function is accumulated over the two monolingual corpus data sets. Taking as an example joint training that simultaneously adds the data-augmentation training set and the supervised-training data set, with Chinese and English monolingual corpus data sets, the total loss function is:
L(θ) = L_mass(θ, zh) + L_mass(θ, en) + L_combiner(θ; zh) + L_combiner(θ; en) + L_split(θ; zh) + L_split(θ; en)
Here, L(θ) is the total loss value; L_mass(θ, zh) and L_mass(θ, en) denote the loss values of the MASS training part for the first training sets corresponding to Chinese and English respectively; L_combiner(θ; zh) and L_combiner(θ; en) denote the loss values of the supervised training part of the word representation synthesizer for Chinese and English respectively; and L_split(θ; zh) and L_split(θ; en) denote the loss values of the data-augmentation MASS training part for Chinese and English respectively.
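In code, the total objective is just the sum of the per-language terms; the three loss helpers below stand for the expressions defined earlier and are assumed to exist.

def total_loss(model, data):
    loss = 0.0
    for lang in ("zh", "en"):          # one term per monolingual corpus
        loss = loss + L_mass(model, data[lang]) \
                    + L_combiner(model, data[lang]) \
                    + L_split(model, data[lang])
    return loss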
In addition, the back-translation process is as follows: a monolingual corpus data set is fed as input to the pre-trained encoder, word representation synthesizer and decoder, and the outputs at the decoder form a translation reference data set; the translation reference data set is then fed as input to the pre-trained encoder, word representation synthesizer and decoder, with the corresponding monolingual corpus data set serving as the supervision signal. This constructs a pseudo-parallel corpus data set and realizes supervised training. For example, processing the Chinese monolingual corpus data set with the pre-trained encoder, word representation synthesizer and decoder yields the corresponding English expressions, from which an English translation reference data set is constructed; the English translation reference data set is then fed as input to the pre-trained encoder, word representation synthesizer and decoder, with the Chinese monolingual corpus data set serving as the supervision signal for training.
In particular, the loss function expression for back-translation is as follows:
L_bt(θ, l) = E_{x∈D_l}[ −log p(x | M(x); θ) ]
Here, L_bt(θ, l) denotes the back-translation loss value for language l; θ is the training parameter; D_l denotes the monolingual corpus data set of language l; and M(x) denotes the corpus data corresponding to x in the translation reference data set, i.e. the output at the decoder after x passes through the pre-trained encoder, word representation synthesizer and decoder.
When a Chinese sentence is input to the encoder, the model encodes contextual word representations in a bilingual space, and the decoder can be made to decode into English, generally controlled through a start symbol or a language code, so that an English translation is generated. This initial translation result has a certain quality but is not yet ideal. Back-translation training can exploit monolingual data and the existing translation model to further improve translation quality.
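One back-translation round might look like the sketch below; translate and train_step are hypothetical wrappers around the pre-trained encoder-synthesizer-decoder stack and its optimizer.

def back_translation_round(model, zh_corpus, en_corpus):
    # 1. Build the translation reference data sets with the current model.
    en_refs = [translate(model, x, tgt="en") for x in zh_corpus]
    zh_refs = [translate(model, x, tgt="zh") for x in en_corpus]
    # 2. Train on the pseudo-parallel pairs, with the original monolingual
    #    sentence serving as the supervision signal.
    for src, tgt in zip(en_refs, zh_corpus):
        train_step(model, src, tgt)    # reconstruct zh from its en translation
    for src, tgt in zip(zh_refs, en_corpus):
        train_step(model, src, tgt)    # reconstruct en from its zh translation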
It can be understood that, after the loss value is obtained, the training parameter θ can be adjusted according to the loss value, so that training continues with the adjusted θ until the loss value meets a preset threshold, the loss value converges, or the number of training iterations reaches a preset value.
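A generic loop matching these stopping criteria (threshold, convergence, iteration cap; all constants illustrative):

def train(model, step_fn, threshold=0.1, eps=1e-4, max_steps=100_000):
    prev = float("inf")
    for _ in range(max_steps):       # stop once the preset step count is reached
        loss = step_fn(model)        # one optimization step; returns the loss
        if loss < threshold:         # stop once the loss meets the threshold
            break
        if abs(prev - loss) < eps:   # stop once the loss has converged
            break
        prev = loss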
In addition, it should be understood that the division of the above methods into steps is only for clarity of description; in implementation, steps may be merged into one step, or a step may be split into multiple steps, and all such variants fall within the protection scope of this patent as long as they contain the same logical relationship. Adding insignificant modifications to, or introducing insignificant designs into, the algorithm or flow without changing its core design also falls within the protection scope of this patent.
Another aspect of the embodiments of the present application provides a machine translation apparatus, as shown in Figure 2, including:
an acquisition module 201, configured to acquire corpus data to be translated;
an encoding module 202, configured to input the word-segmented corpus data to be translated into an encoder to obtain a subword-based context representation;
a synthesis module 203, configured to input the subword-based context representation into a word representation synthesizer to obtain a word-based context representation;
a decoding module 204, configured to input the word-based context representation into a decoder to obtain a translation result of the corpus data to be translated.
It is not difficult to find that this embodiment is an apparatus embodiment corresponding to the method embodiment, and the two can be implemented in cooperation with each other. The relevant technical details mentioned in the method embodiment remain valid in this embodiment and are not repeated here to reduce repetition; correspondingly, the relevant technical details mentioned in this embodiment can also be applied in the method embodiment.
It is worth mentioning that all modules involved in this embodiment are logical modules. In practical applications, a logical unit may be one physical unit, a part of one physical unit, or a combination of multiple physical units. In addition, in order to highlight the innovative part of the present application, units not closely related to solving the technical problem raised by the present application are not introduced in this embodiment, which does not mean that no other units exist in this embodiment.
Another aspect of the embodiments of the present application provides an electronic device, as shown in Figure 3, including: at least one processor 301; and a memory 302 communicatively connected to the at least one processor 301, where the memory 302 stores instructions executable by the at least one processor 301, and the instructions are executed by the at least one processor 301 so that the at least one processor 301 can execute the machine translation method described in any of the above method embodiments.
The memory 302 and the processor 301 are connected by a bus. The bus may include any number of interconnected buses and bridges, connecting the one or more processors 301 and the various circuits of the memory 302 together. The bus may also connect various other circuits, such as peripherals, voltage regulators and power management circuits, which are well known in the art and therefore not described further here. A bus interface provides an interface between the bus and a transceiver. The transceiver may be one element or multiple elements, such as multiple receivers and transmitters, providing a unit for communicating with various other apparatuses over a transmission medium. Data processed by the processor 301 is transmitted over a wireless medium through an antenna, and the antenna also receives data and transmits it to the processor 301.
The processor 301 is responsible for managing the bus and general processing, and may also provide various functions including timing, peripheral interfaces, voltage regulation, power management and other control functions, while the memory 302 may be used to store data used by the processor 301 when performing operations.
Another aspect of the embodiments of the present application provides a computer-readable storage medium storing a computer program. When the computer program is executed by a processor, the machine translation method described in any of the above method embodiments is realized.
That is, those skilled in the art can understand that all or part of the steps in the methods of the above embodiments can be completed by instructing relevant hardware through a program. The program is stored in a storage medium and includes several instructions to cause a device (which may be a single-chip microcomputer, a chip, etc.) or a processor to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.
In the machine translation method proposed by the embodiments of the present application, after the encoder outputs the subword-based context representation of the corpus data to be translated, the subword-based context representation is not fed directly into the decoder. Instead, it is first fed into the word representation synthesizer, which synthesizes the subword-based context representation at word granularity to obtain a word-based context representation, and the word-based context representation is then fed into the decoder. In other words, on top of the original encoder-decoder plus attention mechanism, a word representation synthesizer is additionally introduced between the encoder and the decoder, changing the decoder's input from the original subword-based context representation to a word-based context representation, i.e. changing the decoder's translation granularity from subwords to words. Since the meaning of a word is more stable than that of a subword and is less affected by linguistic structure, such as context and the way a unit is embedded in a sentence, the decoder does not need to reconstruct words from subwords. This avoids the problem that, because subword composition differs across languages, the meaning of words reconstructed from subwords changes between the source and target languages, which in turn causes large changes in sentence meaning before and after translation. The meaning of each word in the translated sentence is therefore more accurate, and the translated sentence is accordingly more accurate, overcoming language restrictions in the translation process and enabling effective, accurate translation between any languages.
Those of ordinary skill in the art can understand that the above embodiments are embodiments for implementing the present application, and in practical applications various changes in form and detail may be made to them without departing from the spirit and scope of the present application.

Claims (10)

  1. A machine translation method, comprising:
    obtaining corpus data to be translated;
    inputting the word-segmented corpus data to be translated into an encoder to obtain a subword-based context representation;
    inputting the subword-based context representation into a word representation synthesizer to obtain a word-based context representation;
    inputting the word-based context representation into a decoder to obtain a translation result of the corpus data to be translated.
  2. The machine translation method according to claim 1, wherein before inputting the word-segmented corpus data to be translated into the encoder to obtain the subword-based context representation, the method further comprises:
    performing word segmentation on the corpus data to be translated and generating first position labels for the segmented words in the corpus data to be translated;
    wherein inputting the subword-based context representation into the word representation synthesizer to obtain the word-based context representation comprises:
    inputting the subword-based context representation and the first position labels into the word representation synthesizer, so as to synthesize the context representations of several subwords from the same word in the subword-based context representation and obtain the word-based context representation.
  3. The machine translation method according to claim 1, wherein before inputting the word-segmented corpus data to be translated into the encoder to obtain the subword-based context representation, the method further comprises:
    generating a second position label for every word in the corpus data to be translated and performing word segmentation on the corpus data to be translated;
    wherein inputting the subword-based context representation into the word representation synthesizer to obtain the word-based context representation comprises:
    inputting the subword-based context representation and the second position labels into the word representation synthesizer, so as to synthesize the context representations of several subwords from the same word in the subword-based context representation and obtain the word-based context representation.
  4. The machine translation method according to any one of claims 1 to 3, wherein before inputting the word-segmented corpus data to be translated into the encoder to obtain the subword-based context representation, the method further comprises:
    obtaining two monolingual corpus data sets;
    performing word segmentation on the corpus data in the two monolingual corpus data sets to obtain two first training sets;
    pre-training the encoder, the word representation synthesizer and the decoder according to the first training sets;
    performing back-translation processing on the pre-trained encoder, word representation synthesizer and decoder according to the two monolingual corpus data sets to obtain the trained encoder, word representation synthesizer and decoder.
  5. The machine translation method according to claim 4, wherein pre-training the encoder, the word representation synthesizer and the decoder according to the first training set comprises:
    randomly adding masks to the corpus data of the first training set to obtain a first mask training set;
    performing joint training based on Masked Sequence to Sequence (MASS) on the encoder, the word representation synthesizer and the decoder according to the first mask training set.
  6. The machine translation method according to claim 4, wherein pre-training the encoder, the word representation synthesizer and the decoder according to the first training set comprises:
    randomly selecting corpus data in the first training set and segmenting the unsegmented words in the selected corpus data to obtain new corpus data;
    adding the new corpus data to the first training set to obtain a second training set;
    randomly adding masks to the corpus data in the first training set and the second training set to obtain a second mask training set and a third mask training set respectively;
    performing joint MASS training on the encoder, the word representation synthesizer and the decoder according to the second mask training set and the third mask training set.
  7. The machine translation method according to claim 4, wherein pre-training the encoder, the word representation synthesizer and the decoder according to the first training set comprises:
    determining a preset amount of corpus data from the corpus data in the first training set as target corpus data;
    performing word segmentation again on the unsegmented words in the target corpus data to obtain target segmented corpus data;
    merging the target corpus data and the target segmented corpus data to obtain a third training set, wherein one piece of training data in the third training set is a corpus data pair consisting of one piece of the target corpus data and the corresponding target segmented corpus data;
    randomly adding masks to the corpus data in the first training set to obtain a fourth mask training set;
    jointly training the encoder, the word representation synthesizer and the decoder according to the third training set and the fourth mask training set, wherein the third training set is used for supervised training of the word representation synthesizer and the fourth mask training set is used for MASS-based training of the encoder, the word representation synthesizer and the decoder.
  8. A machine translation apparatus, comprising:
    an acquisition module, configured to acquire corpus data to be translated;
    an encoding module, configured to input the word-segmented corpus data to be translated into an encoder to obtain a subword-based context representation;
    a synthesis module, configured to input the subword-based context representation into a word representation synthesizer to obtain a word-based context representation;
    a decoding module, configured to input the word-based context representation into a decoder to obtain a translation result of the corpus data to be translated.
  9. An electronic device, comprising:
    at least one processor; and
    a memory communicatively connected to the at least one processor; wherein
    the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to execute the machine translation method according to any one of claims 1 to 7.
  10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the machine translation method according to any one of claims 1 to 7.
PCT/CN2022/140417 2021-12-20 2022-12-20 Machine translation method and apparatus, electronic device and storage medium WO2023116709A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111567148.X 2021-12-20
CN202111567148.XA CN116384414A (en) 2021-12-20 2021-12-20 Machine translation method, device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2023116709A1 true WO2023116709A1 (en) 2023-06-29

Family

ID=86901251

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/140417 WO2023116709A1 (en) 2021-12-20 2022-12-20 Machine translation method and apparatus, electronic device and storage medium

Country Status (2)

Country Link
CN (1) CN116384414A (en)
WO (1) WO2023116709A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH02206869A (en) * 1989-02-06 1990-08-16 Nippon Telegr & Teleph Corp <Ntt> Japanese noun compound word translation system in automatic japanese language translation system
CN107977364A (en) * 2017-12-30 2018-05-01 科大讯飞股份有限公司 Tie up language word segmentation method and device
CN110334360A (en) * 2019-07-08 2019-10-15 腾讯科技(深圳)有限公司 Machine translation method and device, electronic equipment and storage medium
CN113297841A (en) * 2021-05-24 2021-08-24 哈尔滨工业大学 Neural machine translation method based on pre-training double-word vectors

Also Published As

Publication number Publication date
CN116384414A (en) 2023-07-04

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22910021

Country of ref document: EP

Kind code of ref document: A1