WO2023116709A1 - Machine translation method and apparatus, electronic device and storage medium - Google Patents

Machine translation method and apparatus, electronic device and storage medium Download PDF

Info

Publication number
WO2023116709A1
WO2023116709A1 (PCT/CN2022/140417)
Authority
WO
WIPO (PCT)
Prior art keywords
word
representation
corpus data
training set
synthesizer
Prior art date
Application number
PCT/CN2022/140417
Other languages
French (fr)
Chinese (zh)
Inventor
高洪
周志浩
黄书剑
陈家骏
张洋铭
周祥生
Original Assignee
中兴通讯股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中兴通讯股份有限公司
Publication of WO2023116709A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • G06F40/58Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • The embodiments of the present application relate to the technical field of machine learning, and in particular to a machine translation method and apparatus, an electronic device, and a storage medium.
  • According to how the model is trained, machine translation can be divided into supervised, semi-supervised, and unsupervised machine translation.
  • Unsupervised machine translation does not need parallel corpus data; it only needs monolingual corpus data collected into monolingual corpus data sets for training, and therefore has broader application prospects.
  • Current unsupervised machine translation is mainly based on an encoder-decoder plus attention mechanism with pre-training and back-translation.
  • During pre-training, the encodings of two different languages learn a shared contextual representation; back-translation then constructs pseudo-parallel corpora for translation training to further improve translation quality.
  • Embodiments of the present application provide a machine translation method and apparatus, an electronic device, and a storage medium.
  • An embodiment of the present application provides a machine translation method comprising the following steps: acquiring corpus data to be translated; inputting the word-segmented corpus data to be translated into an encoder to obtain a subword-based contextual representation; inputting the subword-based contextual representation into a word representation synthesizer to obtain a word-based contextual representation; and inputting the word-based contextual representation into a decoder to obtain a translation result of the corpus data to be translated.
  • An embodiment of the present application also proposes a machine translation apparatus, including: an acquisition module configured to acquire corpus data to be translated; an encoding module configured to input the word-segmented corpus data to be translated into an encoder to obtain a subword-based contextual representation; a synthesis module configured to input the subword-based contextual representation into a word representation synthesizer to obtain a word-based contextual representation; and a decoding module configured to input the word-based contextual representation into a decoder to obtain a translation result of the corpus data to be translated.
  • An embodiment of the present application also proposes an electronic device, including: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can execute the above machine translation method.
  • An embodiment of the present application also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the above machine translation method.
  • FIG. 1 is a flowchart of a machine translation method provided in an embodiment of the present application.
  • FIG. 2 is a schematic structural diagram of a machine translation apparatus provided in another embodiment of the present application.
  • FIG. 3 is a schematic structural diagram of an electronic device provided in another embodiment of the present application.
  • During pre-training, what is input into the decoder are subwords, and what is learned is a subword-based contextual representation shared between the encodings of the two languages.
  • Subwords of language pairs such as English-French and English-German have some commonality, and the ways words are constructed from subwords are also similar.
  • For example, the English word for the sun, "sun", is close to the German "Sonne", and both English and German use "er" and "est" to express the comparative and the superlative.
  • By contrast, the differences between Chinese and English subword representations, and in how words are constructed from subwords, are very large; even if shared subword representations can be learned through training, the diversity of subword composition and the instability of subword meanings within words and sentences mean the decoder still translates subword by subword, straying far from the true meanings of words, so the expected translation quality is not achieved.
  • The embodiment of the present application provides a machine translation method including the following steps: obtaining the corpus data to be translated; inputting the word-segmented corpus data to be translated into an encoder to obtain a subword-based contextual representation; inputting the subword-based contextual representation into a word representation synthesizer to obtain a word-based contextual representation; and inputting the word-based contextual representation into a decoder to obtain the translation result of the corpus data to be translated.
  • After the encoder outputs the subword-based contextual representation of the corpus data to be translated, that representation is not fed directly into the decoder; it is first input into the word representation synthesizer, which merges the subword-based contextual representation at word granularity into a word-based contextual representation, and the word-based contextual representation is then input into the decoder.
  • That is, on top of the original encoder-decoder plus attention architecture, an additional word representation synthesizer is introduced between the encoder and the decoder, and the decoder's input changes from the original subword-based contextual representation to a word-based contextual representation, so the granularity of decoding and translation changes from subwords to words; a minimal sketch of this pipeline follows.
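  • As a concrete illustration, the following is a minimal PyTorch sketch of the encoder → word representation synthesizer → decoder forward pass described above. The module sizes, the mean-pooling merge, and all names are illustrative assumptions, not the patent's reference implementation.

```python
import torch
import torch.nn as nn

class WordRepresentationSynthesizer(nn.Module):
    """Merges subword contextual representations into word-based ones by
    mean-pooling the subword vectors that belong to the same word (the
    pooling choice is an assumption; the patent does not fix the function)."""
    def forward(self, subword_reprs, word_spans):
        # subword_reprs: (seq_len, d_model); word_spans: list of (start, end)
        return torch.stack([subword_reprs[s:e].mean(dim=0) for s, e in word_spans])

class Translator(nn.Module):
    def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead), num_layers)
        self.synthesizer = WordRepresentationSynthesizer()
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead), num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, subword_ids, word_spans, tgt_ids):
        # Step 102: subword-based contextual representation from the encoder.
        memory = self.encoder(self.embed(subword_ids).unsqueeze(1))      # (S, 1, d)
        # Step 103: merge subwords of the same word into word-based vectors.
        word_memory = self.synthesizer(memory.squeeze(1), word_spans).unsqueeze(1)
        # Step 104: the decoder attends over word-level memory only
        # (a causal target mask would be added for real training/decoding).
        dec = self.decoder(self.embed(tgt_ids).unsqueeze(1), word_memory)
        return self.out(dec)                                             # (T, 1, V)

model = Translator(vocab_size=32000)
logits = model(torch.tensor([3, 4, 5, 6]), [(0, 2), (2, 4)], torch.tensor([1, 7, 8]))
```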
  • The embodiment of the present application provides a machine translation method that is applied in the process of mutual translation between two different languages and runs on electronic devices such as mobile phones and servers.
  • The process includes at least, but is not limited to, the following steps:
  • Step 101: acquire corpus data to be translated.
  • The corpus data to be translated in this embodiment is text data; this embodiment does not limit its size, which may be a sentence, a word, or a paragraph.
  • The source of the corpus data to be translated can be audio, video, text, and so on.
  • Acquiring it may also involve other processing such as speech-to-text conversion and text segmentation.
  • For example, to provide subtitles in language B for a video in language A, speech-to-text processing is performed on the audio signal in the video to obtain several sentences; each sentence may in turn serve as the current corpus data to be translated, or the sentences may be taken together as the corpus data to be translated, and the translation results are then added to the video as subtitles.
  • PDF Portable Document Format
  • OCR Optical Character Recognition
  • Step 102: input the word-segmented corpus data to be translated into the encoder to obtain a subword-based contextual representation.
  • The encoder can adopt various network structures, such as a multi-layer attention network (Transformer) or a recurrent neural network (RNN), which are not enumerated here.
  • The input of the encoder is the word-segmented corpus data to be translated; therefore, before inputting data into the encoder, word segmentation must be performed on the corpus data to be translated, i.e., its words are split, for example splitting the word "信息" ("information") into the two subwords "信" and "息". Word segmentation can be realized with a word segmentation model, for example a byte pair encoding (BPE) model learned to segment the corpus data to be translated.
  • BPE byte pair encoding
  • A subword-based contextual representation is one whose minimum granularity is the subword.
  • For example, the subword-based contextual representation obtained after the encoder processes subword A and subword B includes the contextual representation C of subword A and the contextual representation D of subword B.
  • Not every word in the word-segmented corpus data to be translated is necessarily split; some words may remain in their original form, and the number of subwords produced by splitting may differ from word to word. For example, after word segmentation, "技术方案" ("technical scheme") may yield the two subwords "技术" and "方案", while "珍馐美味" ("delicacies") may yield the three subwords "珍", "馐", and "美味"; a segmentation sketch follows.
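  • As an illustration of the segmentation step, here is a self-contained Python sketch of a greedy longest-match splitter over a toy subword vocabulary, recording span-style position information for each word. The vocabulary and the greedy matching are assumptions made for illustration; the patent itself segments with a learned BPE model.

```python
# Toy subword vocabulary; a real system would learn this with BPE.
SUBWORD_VOCAB = {"技术", "方案", "珍", "馐", "美味", "信", "息"}

def segment(words):
    """Greedily split each word into known subwords (longest match first),
    recording a (start, end) span per word so the word representation
    synthesizer can later re-merge the subwords of the same word."""
    subwords, spans = [], []
    for word in words:
        start = len(subwords)
        i = 0
        while i < len(word):
            for j in range(len(word), i, -1):  # longest-match lookup
                if word[i:j] in SUBWORD_VOCAB or j == i + 1:
                    subwords.append(word[i:j])
                    i = j
                    break
        spans.append((start, len(subwords)))
    return subwords, spans

subwords, spans = segment(["技术方案", "珍馐美味"])
# subwords -> ['技术', '方案', '珍', '馐', '美味'], spans -> [(0, 2), (2, 5)]
```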
  • Step 103: input the subword-based contextual representation into the word representation synthesizer to obtain a word-based contextual representation.
  • A word-based contextual representation is one whose minimum granularity is the word; for example, the word representation synthesizer synthesizes the contextual representation C of subword A and the contextual representation D of subword B into a contextual representation E, where E is a word-based contextual representation.
  • The word-based contextual representation will be input into the decoder for decoding and translation, and the way the contextual representations of the subwords are merged affects the decoding and translation quality. For "技术方案", the corresponding subwords are "技", "术", "方", and "案"; merging the contextual representations of "技" and "术" into one representation and those of "方" and "案" into another is more accurate than merging those of "技", "术", and "方" into one, because the latter breaks the subword composition pattern.
  • In some examples, word segmentation is performed on the corpus data to be translated and first position labels are generated for the segmented words; the word representation synthesizer can then determine, from the first position labels, the positions of the subwords contained in each segmented word and merge the contextual representations of the subwords at those positions.
  • That is, inputting the subword-based contextual representation into the word representation synthesizer to obtain the word-based contextual representation can be realized as follows: input the subword-based contextual representation and the first position labels into the word representation synthesizer, so that the contextual representations of the subwords that come from the same word are merged into a word-based contextual representation.
  • In other examples, before the word-segmented corpus data to be translated is input into the encoder to obtain the subword-based contextual representation, the machine translation method further includes: generating a second position label for each word in the corpus data to be translated, and then performing word segmentation on the corpus data to be translated.
  • Correspondingly, inputting the subword-based contextual representation into the word representation synthesizer to obtain the word-based contextual representation can be realized as follows: input the subword-based contextual representation and the second position labels into the word representation synthesizer, so that the contextual representations of the subwords that come from the same word are merged into a word-based contextual representation.
  • Because position labels are generated for every word, even if the position labels of some segmented words are lost, the positions they indicated can still be determined from the position labels of the other words, which avoids the impact of label loss.
  • Generating position labels only for the segmented words, on the other hand, reduces processing and speeds it up; which label generation method to use can be chosen according to actual needs.
  • The generated position labels can also be input into the encoder together with the data, so that the encoder generates a subword-based contextual representation containing the position labels.
  • In that case, the position labels mainly indicate the start and end positions of the contextual representations of the subwords that belong to the same word; a sketch of the merge step follows.
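  • The merge step inside the word representation synthesizer can be sketched as follows, assuming each subword carries the index of the word it belongs to (derived from the position labels) and that the subword vectors of one word are mean-pooled; both are assumptions, since the patent does not fix the synthesis function.

```python
import torch

def synthesize(subword_reprs, word_ids):
    """subword_reprs: (num_subwords, d) tensor of subword contextual vectors;
    word_ids[k]: index of the word that subword k belongs to.
    Returns one contextual vector per original word (mean of its subwords)."""
    num_words = max(word_ids) + 1
    out = torch.zeros(num_words, subword_reprs.size(1))
    counts = torch.zeros(num_words, 1)
    for k, w in enumerate(word_ids):
        out[w] += subword_reprs[k]
        counts[w] += 1
    return out / counts

# "技 / 术 / 方 / 案" -> word 0 = "技术", word 1 = "方案"
word_reprs = synthesize(torch.randn(4, 512), [0, 0, 1, 1])  # shape (2, 512)
```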
  • Step 104: input the word-based contextual representation into the decoder to obtain the translation result of the corpus data to be translated.
  • The decoder in effect determines the language of the corpus data to be translated and then translates it into an expression in the other language.
  • Similarly to how existing encoder-decoder models produce translations, the encoder, word representation synthesizer, and decoder provided in this embodiment can translate between two languages.
  • The same set of encoder, word representation synthesizer, and decoder handles both directions.
  • When the input is Chinese, the decoder outputs the corresponding English expression of the input corpus data to be translated.
  • When the input is English, the decoder outputs the corresponding Chinese expression, i.e., mutual translation between the two languages is realized. The decoder therefore needs to determine the language of the corpus data to be translated, so as to output the expression in the other language; this behavior is determined by the training method of unsupervised machine translation. To help those skilled in the art better understand the above description, the training process is described below.
  • Before the word-segmented corpus data to be translated is input into the encoder to obtain the subword-based contextual representation, the machine translation method also includes: obtaining two monolingual corpus data sets corresponding to two different languages; performing word segmentation on the corpus data in the two monolingual corpus data sets to obtain two first training sets; pre-training the encoder, word representation synthesizer, and decoder on the first training sets; and performing back-translation on the pre-trained encoder, word representation synthesizer, and decoder using the two monolingual corpus data sets to obtain the trained encoder, word representation synthesizer, and decoder.
  • Pre-training the encoder, word representation synthesizer, and decoder on the first training sets can be achieved as follows: randomly add masks to the corpus data of each first training set to obtain the first mask training sets, for example covering 50% of the content with a mask over a randomly selected region; then, using the first mask training sets, jointly train the encoder, word representation synthesizer, and decoder with masked sequence-to-sequence (MASS) training.
  • MASS masked Sequence to Sequence
  • Joint training means training the encoder, word representation synthesizer, and decoder together, where the output of the encoder is the input of the word representation synthesizer, and the output of the word representation synthesizer is the input of the decoder.
  • The loss function of the joint training is as follows:

    $L_{mass}(\theta, l) = \mathbb{E}_{x \sim D_l}\left[-\log P\left(x_{i:j} \mid x^{\setminus i:j}; \theta\right)\right]$

  • where $L_{mass}(\theta, l)$ is the loss value corresponding to the corpus data of the first mask training set for language $l$, mainly describing how well the masked part of the corpus data is recovered during training; $\theta$ is the training parameter; $x$ denotes corpus data in the first training set; $D_l$ is the first training set for language $l$; $P(x_{i:j} \mid x^{\setminus i:j}; \theta)$ is the conditional probability that the part covered by the mask is accurately decoded by the decoder; $x_{i:j}$ is the part of the corpus data covered by the mask, with $i$ its start position and $j$ its end position in the corpus data; and $x^{\setminus i:j}$ denotes the corpus data with the span from $i$ to $j$ masked out.
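  • The construction of a mask training set can be sketched as follows; the contiguous-span masking and the 50% ratio follow the example above, and the rest is an illustrative assumption.

```python
import random

MASK = "<mask>"

def mask_sequence(tokens, ratio=0.5, rng=random):
    """Cover a random contiguous span of roughly `ratio` of the tokens:
    the masked input x^{\\i:j} is the encoder input, and the covered span
    x_{i:j} is what the decoder must recover (MASS-style)."""
    span = max(1, int(len(tokens) * ratio))
    i = rng.randrange(0, len(tokens) - span + 1)
    j = i + span
    source = tokens[:i] + [MASK] * span + tokens[j:]
    target = tokens[i:j]
    return source, target, (i, j)

src, tgt, (i, j) = mask_sequence(["信", "息", "技术", "方案"])
# e.g. src = ['信', '息', '<mask>', '<mask>'], tgt = ['技术', '方案']
```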
  • A new training set can also be constructed through data augmentation for further joint training. That is, pre-training the encoder, word representation synthesizer, and decoder on the first training sets can also be achieved as follows: randomly select corpus data in the first training set and split the still-unsegmented words in the selected corpus data to obtain new corpus data; add the new corpus data to the first training set to obtain the second training set; randomly add masks to the corpus data in the first training set and the second training set to obtain the second mask training set and the third mask training set, respectively; and, using the second mask training set and the third mask training set, jointly train the encoder, word representation synthesizer, and decoder with MASS.
  • The second mask training set is substantially the same as the first mask training set in the previous example; therefore, the loss contributed by the second mask training set can be expressed as in the previous example.
  • The corpus data of the third mask training set is the first training set with its words split one level further, and can be regarded as another first mask training set with stronger word segmentation; therefore, the loss contributed by the third mask training set can be computed in a similar way, namely by the following expression:
    $L_{split}(\theta, l) = \mathbb{E}_{x \sim D_s}\left[-\log P\left(x_{i:j} \mid x^{\setminus i:j}; \theta\right)\right]$

  • where $L_{split}(\theta, l)$ is the loss value corresponding to the corpus data of the third mask training set for language $l$, mainly describing how well the masked part of the corpus data $x$ is recovered during training; $\theta$ is the training parameter; $x$ denotes corpus data in the second training set; $D_s$ is the second training set for language $l$; $P(x_{i:j} \mid x^{\setminus i:j}; \theta)$ is the conditional probability that the part covered by the mask is accurately decoded by the decoder; $x_{i:j}$ is the part of the corpus data covered by the mask, with $i$ its start position and $j$ its end position in the corpus data; and $x^{\setminus i:j}$ denotes the corpus data with the span from $i$ to $j$ masked out; an augmentation sketch follows.
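  • A sketch of this augmentation step: still-unsegmented (here: multi-character) tokens of sampled corpus data are split one level further and the result is appended to the first training set; the character-level fallback and the sampling ratio are illustrative assumptions.

```python
import random

def resplit(tokens):
    """Split every multi-character token one level further, into characters."""
    out = []
    for tok in tokens:
        out.extend(list(tok) if len(tok) > 1 else [tok])
    return out

def augment(first_training_set, sample_ratio=0.3, rng=random):
    """Build the second training set: the first set plus re-split samples."""
    k = max(1, int(len(first_training_set) * sample_ratio))
    new_data = [resplit(sent) for sent in rng.sample(first_training_set, k)]
    return first_training_set + new_data

second_set = augment([["技术", "方案"], ["珍", "馐", "美味"]])
```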
  • Pre-training the encoder, word representation synthesizer, and decoder on the first training sets can also be achieved as follows: determine a preset amount of corpus data from the first training set as target corpus data; split the still-unsegmented words in the target corpus data again to obtain target segmentation corpus data; merge the target corpus data and the target segmentation corpus data to obtain the third training set, in which one piece of training data is a corpus data pair consisting of a piece of target corpus data and its corresponding target segmentation corpus data; randomly add masks to the corpus data in the first training set to obtain the fourth mask training set; and jointly train the encoder, word representation synthesizer, and decoder on the third training set and the fourth mask training set, where the third training set is used for supervised training of the word representation synthesizer and the fourth mask training set is used for MASS training of the encoder, word representation synthesizer, and decoder.
  • The still-unsegmented words in the corpus data of the first training set are split further, so that each word before splitting serves as the supervisory signal for its subwords after splitting, providing additional supervised training for the word representation synthesizer.
  • The loss contributed by the third training set is calculated by the following loss function:

    $L_{combiner}(\theta; l) = \mathbb{E}_{x \sim D_t}\left[\sum_{i} DIS\left(E_{true}(x_i), E_{fake}(x_i)\right)\right]$

  • where $L_{combiner}(\theta; l)$ is the loss value corresponding to the corpus data of the third training set for language $l$, mainly describing how accurately the word representation synthesizer synthesizes subwords into words; $\theta$ is the training parameter; $D_t$ is the third training set for language $l$; $x$ is a corpus data pair; $t(x)$ is the target segmentation corpus data in the corpus data pair; $E_{true}(x_i)$ is the representation of the $i$-th word of the target corpus data in a corpus data pair of the third training set; $E_{fake}(x_i)$ is the output of the word representation synthesizer for the target segmentation corpus data in the corpus data pair; and $DIS(E_{true}(x_i), E_{fake}(x_i))$ is the negative distance between $E_{true}(x_i)$ and $E_{fake}(x_i)$; a sketch of this loss follows.
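  • A sketch of this supervised loss, under the stated assumptions that E_true is the encoder's representation of the unsplit word, E_fake is the synthesizer's output for the re-split subwords, and DIS is the negative Euclidean distance; the outer sign is an assumption chosen so that minimizing the loss pulls the two representations together.

```python
import torch

def dis(e_true, e_fake):
    """Negative (Euclidean) distance between the two representations."""
    return -torch.norm(e_true - e_fake, dim=-1)

def combiner_loss(e_true_words, e_fake_words):
    """Sum over the words of a corpus data pair; negating DIS makes the
    loss a plain distance, so minimizing it aligns E_fake with E_true."""
    return -dis(e_true_words, e_fake_words).sum()

loss = combiner_loss(torch.randn(5, 512), torch.randn(5, 512))
```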
  • This example in effect adds supervised training of the word representation synthesizer on top of the first example.
  • The training set obtained by data augmentation and the supervised training data set can also both be added for joint training.
  • The above examples are described mainly in terms of a single first training set; during execution, the joint effect of the two first training sets needs to be considered.
  • In joint training, the total loss is not simply the sum of two loss functions: the loss functions are superimposed over the two monolingual corpus data sets. Take as an example joint training with the data-augmentation training set and the supervised training data set, where the monolingual corpus data sets are Chinese and English respectively.
  • The overall loss function should then be:

    $L(\theta) = \sum_{l \in \{zh,\, en\}} \left[ L_{mass}(\theta, l) + L_{split}(\theta, l) + L_{combiner}(\theta; l) \right]$

  • where $L(\theta)$ is the total loss value; $L_{mass}(\theta, zh)$ and $L_{mass}(\theta, en)$ are the MASS losses on the first training sets corresponding to Chinese and English; $L_{split}(\theta, zh)$ and $L_{split}(\theta, en)$ are the MASS losses on the data-augmented (re-segmented) training sets corresponding to Chinese and English; and $L_{combiner}(\theta; zh)$ and $L_{combiner}(\theta; en)$ are the supervised word representation synthesizer losses corresponding to Chinese and English.
  • The process of back-translation is as follows: the monolingual corpus data set is used as the input of the pre-trained encoder, word representation synthesizer, and decoder, and the decoder outputs constitute the translation reference data set; then the translation reference data set is used as the input of the pre-trained encoder, word representation synthesizer, and decoder, with the corresponding monolingual corpus data set as the supervisory signal, thereby constructing a pseudo-parallel corpus data set for supervised training.
  • For example, the pre-trained encoder, word representation synthesizer, and decoder process the Chinese monolingual corpus data set to obtain the corresponding English expressions, constructing an English translation reference data set; the English translation reference data set is then used as the input of the pre-trained encoder, word representation synthesizer, and decoder, and training proceeds with the Chinese monolingual corpus data set as the supervisory signal.
  • The loss function during back-translation is as follows:

    $L_{bt}(\theta, l) = \mathbb{E}_{x \sim D_l}\left[-\log P\left(x \mid M(x); \theta\right)\right]$

  • where $L_{bt}(\theta, l)$ is the back-translation loss value for language $l$; $\theta$ is the training parameter; $D_l$ is the monolingual corpus data set of language $l$; and $M(x)$ is the translation corresponding to corpus data $x$ in the translation reference data set.
  • After pre-training, the model can encode contextual word representations in a bilingual shared space, and the decoder can be made to decode into English by controlling the start symbol or the language encoding. This initial translation result has a certain quality, but it is not ideal.
  • Back-translation training can use monolingual data and the existing translation model to further improve the translation quality.
  • During training, the training parameter $\theta$ can be adjusted according to the loss value, and training continues with the adjusted parameter until the loss value meets a preset threshold, the loss value converges, the number of training iterations reaches a preset value, or the like; a back-translation sketch follows.
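  • One round of back-translation can be sketched as follows; `model.translate` and `model.train_step` are assumed interfaces standing in for the pre-trained encoder, word representation synthesizer, and decoder.

```python
def back_translation_round(model, zh_corpus, en_corpus):
    """One round of back-translation over two monolingual corpus data sets."""
    # 1. Translate each monolingual set to build translation reference sets.
    en_refs = [model.translate(x, tgt_lang="en") for x in zh_corpus]
    zh_refs = [model.translate(x, tgt_lang="zh") for x in en_corpus]
    # 2. Pseudo-parallel pairs: the reference translation is the input and
    #    the original monolingual sentence is the supervisory signal.
    pseudo_pairs = list(zip(en_refs, zh_corpus)) + list(zip(zh_refs, en_corpus))
    # 3. Supervised training on the pseudo-parallel corpus (the L_bt term).
    for src, tgt in pseudo_pairs:
        model.train_step(src, tgt)
```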
  • The embodiment of the present application also provides a machine translation apparatus, as shown in FIG. 2, including:
  • the acquiring module 201 is configured to acquire corpus data to be translated.
  • the encoding module 202 is configured to input the word-segmented corpus data to be translated into the encoder to obtain a contextual representation based on subwords.
  • the synthesis module 203 is configured to input the subword-based context representation into the word representation synthesizer to obtain the word-based context representation.
  • the decoding module 204 is configured to input the word-based context representation into the decoder to obtain the translation result of the corpus data to be translated.
  • This embodiment is an apparatus embodiment corresponding to the method embodiment above, and the two can be implemented in cooperation with each other.
  • The relevant technical details mentioned in the method embodiment remain valid in this embodiment and are not repeated here.
  • Correspondingly, the relevant technical details mentioned in this embodiment can also be applied in the method embodiment.
  • The modules involved in this embodiment are logical modules.
  • A logical unit can be a physical unit, a part of a physical unit, or a combination of multiple physical units.
  • Units not closely related to solving the technical problem proposed in the present application are not introduced in this embodiment, which does not mean that no other units exist in this embodiment.
  • Another aspect of the embodiments of the present application provides an electronic device, as shown in FIG. 3, including: at least one processor 301; and a memory 302 communicatively connected to the at least one processor 301, where the memory 302 stores instructions executable by the at least one processor 301, and the instructions are executed by the at least one processor 301 so that the at least one processor 301 can execute the machine translation method described in any one of the above method embodiments.
  • The memory 302 and the processor 301 are connected by a bus; the bus may include any number of interconnected buses and bridges, and connects one or more processors 301 and various circuits of the memory 302 together.
  • the bus may also connect together various other circuits such as peripherals, voltage regulators, and power management circuits, all of which are well known in the art and therefore will not be further described herein.
  • the bus interface provides an interface between the bus and the transceivers.
  • a transceiver may be a single element or multiple elements, such as multiple receivers and transmitters, providing means for communicating with various other devices over a transmission medium.
  • the data processed by the processor 301 is transmitted on the wireless medium through the antenna, and the antenna also receives the data and transmits the data to the processor 301 .
  • The processor 301 is responsible for managing the bus and general processing, and may also provide various functions including timing, peripheral interfaces, voltage regulation, power management, and other control functions, while the memory 302 can be used to store data used by the processor 301 when performing operations.
  • Another aspect of the embodiment of the present application provides a computer-readable storage medium storing a computer program.
  • When the computer program is executed by a processor, the machine translation method described in any one of the above method embodiments is realized.
  • The storage medium includes several instructions that cause a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to execute all or part of the steps of the methods described in the various embodiments of the present application.
  • The aforementioned storage media include media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The embodiment of the present application relates to the technical field of machine learning. Disclosed are a machine translation method and apparatus, an electronic device and a storage medium. The machine translation method comprises the following steps: acquiring corpus data to be translated (101); inputting said corpus data subjected to word segmentation into an encoder to obtain a subword-based contextual representation (102); inputting the subword-based contextual representation into a word representation synthesizer to obtain a word-based contextual representation (103); and inputting the word-based contextual representation into a decoder to obtain a translation result of said corpus data (104).

Description

Machine Translation Method and Apparatus, Electronic Device and Storage Medium
Cross-Reference to Related Applications
This application is based on the Chinese patent application with application number 202111567148.X, filed on December 20, 2021, and claims priority to that Chinese patent application, the entire content of which is hereby incorporated into this application by reference.
Technical Field
The embodiments of the present application relate to the technical field of machine learning, and in particular to a machine translation method and apparatus, an electronic device, and a storage medium.
Background
According to how the model is trained, machine translation can be divided into supervised machine translation, semi-supervised machine translation, and unsupervised machine translation. Unsupervised machine translation does not need parallel corpus data; it only needs monolingual corpus data collected into monolingual corpus data sets for training, and therefore has broader application prospects. Current unsupervised machine translation is mainly based on an encoder-decoder plus attention mechanism with pre-training and back-translation: pre-training learns contextual representations shared between the encodings of two different languages, and back-translation constructs pseudo-parallel corpora for translation training to further improve translation quality.
However, translation based on the above models is effective only between language pairs such as English-French and English-German; for pairs such as Chinese-English and French-Korean, the expected results are often not achieved.
Contents of the Invention
Embodiments of the present application provide a machine translation method and apparatus, an electronic device, and a storage medium.
An embodiment of the present application provides a machine translation method, including the following steps: acquiring corpus data to be translated; inputting the word-segmented corpus data to be translated into an encoder to obtain a subword-based contextual representation; inputting the subword-based contextual representation into a word representation synthesizer to obtain a word-based contextual representation; and inputting the word-based contextual representation into a decoder to obtain a translation result of the corpus data to be translated.
An embodiment of the present application also proposes a machine translation apparatus, including: an acquisition module configured to acquire corpus data to be translated; an encoding module configured to input the word-segmented corpus data to be translated into an encoder to obtain a subword-based contextual representation; a synthesis module configured to input the subword-based contextual representation into a word representation synthesizer to obtain a word-based contextual representation; and a decoding module configured to input the word-based contextual representation into a decoder to obtain a translation result of the corpus data to be translated.
An embodiment of the present application also proposes an electronic device, including: at least one processor; and a memory communicatively connected to the at least one processor; where the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can execute the machine translation method described above.
An embodiment of the present application also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the machine translation method described above.
Description of the Drawings
One or more embodiments are illustrated by the figures in the corresponding accompanying drawings; these illustrations do not constitute a limitation on the embodiments.
FIG. 1 is a flowchart of a machine translation method provided in an embodiment of the present application;
FIG. 2 is a schematic structural diagram of a machine translation apparatus provided in another embodiment of the present application;
FIG. 3 is a schematic structural diagram of an electronic device provided in another embodiment of the present application.
Detailed Description
As can be seen from the background, the translation quality of current machine translation methods is limited by the source and target languages; not every pair of languages can be translated with the expected quality.
Analysis shows that the effect of current machine translation methods is limited by language because, during pre-training, what is input into the decoder are subwords, and what is learned is a subword-based contextual representation shared between the encodings of the two languages. Subwords of language pairs such as English-French and English-German have some commonality, and the ways words are constructed from subwords are also similar: for example, the English word for the sun, "sun", is close to the German "Sonne", and both English and German use "er" and "est" to express the comparative and the superlative. By contrast, the differences between Chinese and English subword representations, and in how words are constructed from subwords, are very large. Even if shared-space subword representations can be learned through training, the diversity of subword composition and the instability of subword meanings within words and sentences cause the decoder, translating subword by subword, to stray far from the true meanings of the words, so the expected translation quality is not achieved.
To solve the above problem, an embodiment of the present application provides a machine translation method, including the following steps: acquiring corpus data to be translated; inputting the word-segmented corpus data to be translated into an encoder to obtain a subword-based contextual representation; inputting the subword-based contextual representation into a word representation synthesizer to obtain a word-based contextual representation; and inputting the word-based contextual representation into a decoder to obtain a translation result of the corpus data to be translated.
In the machine translation method proposed by the embodiments of this application, after the encoder outputs the subword-based contextual representation of the corpus data to be translated, that representation is not fed directly into the decoder; it is first input into the word representation synthesizer, which merges the subword-based contextual representation at word granularity into a word-based contextual representation, which is then input into the decoder. That is, on top of the original encoder-decoder plus attention architecture, an additional word representation synthesizer is introduced between the encoder and the decoder, and the decoder's input changes from the original subword-based contextual representation to a word-based contextual representation, so the granularity of decoding and translation changes from subwords to words. Because the meaning of a word is more stable than that of a subword and is less affected by linguistic structure, such as context and how the word is embedded in a sentence, the decoder no longer needs to reconstruct words from subwords. This avoids the problem that differences in how subwords compose words cause the meanings of words reconstructed from subwords, and hence the meanings of sentences, to change between the source and target languages. The meaning of each word is therefore rendered more accurately, the translated sentences are more accurate, the language restrictions of the translation process are overcome, and effective, accurate translation between arbitrary languages becomes possible.
To make the purpose, technical solutions, and advantages of the embodiments of the present application clearer, the embodiments of the present application are described in detail below with reference to the accompanying drawings. Those of ordinary skill in the art will understand, however, that many technical details are provided in each embodiment so that readers can better understand the application; the technical solutions claimed in this application can be realized even without these technical details and with various changes and modifications based on the following embodiments. The division into the following embodiments is for convenience of description and does not limit the implementation of the present application; the embodiments can be combined with and refer to one another where they do not contradict.
An aspect of the embodiments of the present application provides a machine translation method, applied in the process of mutual translation between two different languages and applied to electronic devices such as mobile phones and servers. As shown in FIG. 1, the process includes at least, but is not limited to, the following steps.
Step 101: acquire corpus data to be translated.
The corpus data to be translated in this embodiment is text data, but this embodiment does not limit its size: it may be a sentence, a word, or a paragraph.
It should be noted that the source of the corpus data to be translated may be audio, video, text, and so on, and acquiring it may also involve other processing such as speech-to-text conversion and text segmentation.
In one example, subtitles in language B must be provided for a video in language A. Speech-to-text processing is first performed on the audio signal in the video to obtain several sentences; each of the sentences may in turn serve as the current corpus data to be translated, or the sentences may be taken together as the corpus data to be translated, and the translation results are then added to the video as subtitles.
In another example, a Portable Document Format (PDF) document must be translated into an editable document in a target language. Optical Character Recognition (OCR) may be performed on the PDF document to obtain the text data of the whole document; the whole document may be used as the corpus data to be translated, or the text may be divided by paragraphs and each divided part used in turn as the corpus data to be translated, after which the translated document is saved in an editable text format.
Step 102: input the word-segmented corpus data to be translated into the encoder to obtain a subword-based contextual representation.
In this embodiment, the encoder can adopt various network structures, such as a multi-layer attention network (Transformer) or a recurrent neural network (RNN), which are not enumerated here.
It will be understood that the input of the encoder is the word-segmented corpus data to be translated; therefore, before data is input into the encoder, word segmentation must be performed on the corpus data to be translated, i.e., the words in the corpus data are split, for example splitting the word "信息" ("information") into the two subwords "信" and "息". Word segmentation can be realized with a word segmentation model, for example a byte pair encoding (BPE) model learned to segment the corpus data to be translated.
It should be noted that a subword-based contextual representation is one whose minimum granularity is the subword; for example, the subword-based contextual representation obtained after the encoder processes subword A and subword B includes the contextual representation C of subword A and the contextual representation D of subword B.
It should also be noted that not every word in the word-segmented corpus data to be translated is necessarily split; some words may remain in their original form, and the number of subwords produced by splitting may differ from word to word. For example, after word segmentation, "技术方案" ("technical scheme") may yield the two subwords "技术" and "方案", while "珍馐美味" ("delicacies") may yield the three subwords "珍", "馐", and "美味".
步骤103,将基于子词的上下文表示输入词表示合成器,得到基于词的上下文表示。 Step 103, input the subword-based context representation into the word representation synthesizer to obtain a word-based context representation.
本实施例中,基于词的上下文表示是指上下文的最小粒度为词,如词表示合成器将子词A的上下文表示C和子词B的上下文表示D合成为上下文表示E,其中,上下文表示E为基于词的上下文表示。In this embodiment, the word-based context representation means that the minimum granularity of the context is a word, such as the word representation synthesizer synthesizes the context representation C of the subword A and the context representation D of the subword B into a context representation E, wherein the context representation E is a word-based contextual representation.
可以理解的是,基于词的上下文表示将会被输入到解码器中进行解码翻译,而基于词的上下文表示中对各个子词的上下文表示的合成方式会影响解码翻译的效果,如对于“技术方案”,其对应的子词为“技”、“术”、“方”和“案”,若是将“技”和“术”的上下文表示合成为一个上下文表示、将“方”和“案”的上下文表示合成为一个上下文表示,比将“技”、“术”和“方”的上下文表示合成为一个上下文表示,更加准确,因为第二种合成 方式破坏了子词构词的模式,也就是说,词表示合成器在合成处理时如果能够按照词切分前待翻译语料数据中词的分布位置进行合成会更加准确。It can be understood that the word-based contextual representation will be input into the decoder for decoding and translation, and the synthesis method of the contextual representation of each subword in the word-based contextual representation will affect the effect of decoding and translation, such as for "technical scheme", its corresponding subwords are "technique", "skill", "fang" and "case", if the context expressions of "technique" and "technique" are synthesized into one context representation, "fang" and "case" The context representation of "is synthesized into a context representation, which is more accurate than combining the context representations of "Technology", "Skill" and "Fang" into a context representation, because the second synthesis method destroys the pattern of subword formation, That is to say, it will be more accurate if the word representation synthesizer can synthesize according to the distribution position of words in the corpus data to be translated before word segmentation.
因此,在一些例子中,将词切分后的待翻译语料数据输入编码器,得到基于子词的上下文表示之前,机器翻译方法还包括:对待翻译语料数据进行词切分并为待翻译语料数据中被切分的词生成第一位置标签,其中,在有多个词被切分的情况下会生成多个第一位置标签,每个第一位置标签都指示了一个被切分的词的位置信息,例如,对于语料数据x=(x1,x2,x3),其中,x1、x2和x3为3个词,假设词x2被切分为x2'和x2'',则在切分后的待翻译语料数据x'=(x1,x2',x2'',x3)中生成指示X1之后的位置的标签和生成指示X2之前的位置的标签共同作为第一位置标签,以用于标记词x2被切分得到的子词的起始位置和终止位置,特别地,还可以直接在切分后的待翻译语料数据中进行标注,如切分后的待翻译语料数据x'=(x1,E beg,x2',x2'',E end,x3),其中,E beg标志某个被切分的词的起始位置,E end标志某个被切分的词的起始位置。进而在生成了第一位置标签的情况下,词表示合成器可以根据第一位置标签确定被切分的词包含的子词位置,从而可以将该位置内的子词的上下文表示合成,即将基于子词的上下文表示输入词表示合成器,得到基于词的上下文表示,可以通过如下方式实现:将基于子词的上下文表示和第一位置标签输入词表示合成器,以对基于子词的上下文表示中来自同一个词的若干子词的上下文表示进行合成,得到基于词的上下文表示。 Therefore, in some examples, before inputting the word-segmented corpus data to be translated into the encoder to obtain the context representation based on subwords, the machine translation method further includes: performing word segmentation on the to-be-translated corpus data and generating The segmented word in generates the first position label, where multiple first position labels are generated in the case of multiple words being segmented, and each first position label indicates the position of a segmented word Position information, for example, for corpus data x=(x1, x2, x3), where x1, x2 and x3 are 3 words, assuming that word x2 is divided into x2' and x2'', then after segmentation In the corpus data to be translated x'=(x1, x2', x2'', x3), the generated label indicating the position after X1 and the generated label indicating the position before X2 are used together as the first position label to mark the word x2 The start position and end position of the subwords obtained by segmentation can also be directly marked in the segmented corpus data to be translated, such as the segmented corpus data to be translated x'=(x1, E beg , x2', x2'', E end , x3), where E beg marks the starting position of a segmented word, and E end marks the starting position of a segmented word. Furthermore, when the first position label is generated, the word representation synthesizer can determine the subword position contained in the segmented word according to the first position label, so that the context representation of the subword in the position can be synthesized, that is, based on The contextual representation of the subword is input into the word representation synthesizer, and the context representation based on the word is obtained, which can be realized in the following way: the context representation based on the subword and the first position label are input into the word representation synthesizer, and the context representation based on the subword The contextual representations of several subwords from the same word are synthesized to obtain a word-based contextual representation.
在另一些例子中,将词切分后的待翻译语料数据输入编码器,得到基于子词的上下文表示之前,机器翻译方法还包括:为待翻译语料数据中的每一个词生成第二位置标签并对待翻译语料数据进行词切分。相应地,将基于子词的上下文表示输入词表示合成器,得到基于词的上下文表示,可以通过如下方式实现:将基于子词的上下文表示和第二位置标签输入词表示合成器,以对基于子词的上下文表示中来自同一个词的若干子词的上下文表示进行合成,得到基于词的上下文表示。值得一提的是,不对词按照是否被切分进行区分,而是为每个词生成位置标签,这样即使部分被切分的词的位置标签丢失,也仍然能够根据其他词的位置标签确定除丢失的位置标签指示的位置,避免了标签丢失的影响。当然,为被切分的词生成位置标签则能够减少处理加快处理效率,可以根据实际需求选择使用哪种标签生成方法。In other examples, before inputting the word-segmented corpus data to be translated into the encoder to obtain the subword-based context representation, the machine translation method further includes: generating a second position label for each word in the corpus data to be translated And perform word segmentation on the corpus data to be translated. Correspondingly, the subword-based context representation is input into the word representation synthesizer to obtain the word-based context representation, which can be realized in the following manner: input the subword-based context representation and the second position label into the word representation synthesizer to obtain the word representation synthesizer based on In the context representation of subwords, the context representations of several subwords from the same word are synthesized to obtain a word-based context representation. It is worth mentioning that instead of distinguishing words according to whether they are segmented or not, position labels are generated for each word, so that even if the position labels of some of the segmented words are lost, it can still be determined based on the position labels of other words. The position indicated by the lost position label avoids the influence of label loss. Of course, generating location tags for segmented words can reduce processing and speed up processing efficiency, and you can choose which tag generation method to use according to actual needs.
It should be noted that the above examples only describe the position label generation process and the word representation synthesis process. In practice, the generated position labels can also be fed into the encoder together with the input, so that the encoder produces a subword-based context representation carrying the position labels; in that case the position labels mainly indicate the start and end positions of the context representations of the several subwords belonging to the same word. Details are not repeated here.
Of course, the above is only illustrative; in other examples, the subword-based context representations can also be synthesized at random. Details are likewise not repeated here.
Step 104: the word-based context representation is input into the decoder to obtain the translation result of the corpus data to be translated.
In this embodiment, the decoder in effect determines the language of the corpus data to be translated, and then translates it into an expression in the other language.
It should be noted that, similar to how an existing encoder and decoder process the corpus data to be translated to obtain a translation, the encoder, word representation synthesizer and decoder provided in this embodiment can likewise translate between two languages. For one set of encoder, word representation synthesizer and decoder, when the input is Chinese, the decoder outputs the corresponding English expression of the input corpus data to be translated; when the input is English, the decoder outputs the corresponding Chinese expression, i.e. mutual translation between the two languages is achieved. The decoder therefore needs to determine the language of the corpus data to be translated, so that the output expression is produced in the other language. All of this is determined by the training method of unsupervised machine translation. To help those skilled in the art better understand the above description, the training process is explained below.
Before the word-segmented corpus data to be translated is input into the encoder to obtain the subword-based context representation, the machine translation method further includes: obtaining two monolingual corpus data sets corresponding to different languages; performing word segmentation on the corpus data in the two monolingual corpus data sets to obtain two first training sets; pre-training the encoder, the word representation synthesizer and the decoder according to the first training sets; and, according to the two monolingual corpus data sets, performing back-translation processing on the pre-trained encoder, word representation synthesizer and decoder to obtain the trained encoder, word representation synthesizer and decoder.
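At a high level, the training procedure just described could be orchestrated as in the sketch below; every function and class name here (load_monolingual, word_segment, pretrain_mass, back_translate, Model) is a hypothetical placeholder for the corresponding step, not an API defined by the embodiment.

# Hypothetical outline of the unsupervised training pipeline.
zh_corpus = load_monolingual("zh")   # two monolingual corpus data sets
en_corpus = load_monolingual("en")   # in different languages

zh_train = word_segment(zh_corpus)   # the two first training sets
en_train = word_segment(en_corpus)

model = Model(encoder, word_synthesizer, decoder)
pretrain_mass(model, [zh_train, en_train])    # MASS pre-training
back_translate(model, zh_corpus, en_corpus)   # back-translation fine-tuning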
In some examples, pre-training the encoder, the word representation synthesizer and the decoder according to the first training sets can be implemented as follows: masks are randomly added to the corpus data of each first training set to obtain the first mask training sets, for example by covering 50% of the content with a mask over a randomly selected region; then, according to the first mask training sets, the encoder, the word representation synthesizer and the decoder are jointly trained based on Masked Sequence to Sequence (MASS).
It should be noted that joint training means training the encoder, the word representation synthesizer and the decoder together, where the output of the encoder serves as the input of the word representation synthesizer and the output of the word representation synthesizer serves as the input of the decoder. The loss function of the joint training is as follows:
L_mass(θ, l) = E_{x∈D_l}[ −log p(x_{i:j} | x_{\i:j}; θ) ]
Here, L_mass(θ, l) denotes the loss value for the corpus data of the first mask training set corresponding to language l, and mainly describes the probability that the masked part of that corpus data is correctly decoded by the decoder during training; θ is the training parameter; x denotes corpus data in the first training set; D_l denotes the first training set corresponding to language l; log p(x_{i:j} | x_{\i:j}; θ) denotes the log conditional probability that the masked part of the corpus data is accurately decoded by the decoder; x_{i:j} denotes the masked part of the corpus data; i and j denote the start and end positions of the masked part within the corpus data; and x_{\i:j} denotes the corpus data with the fragment from i to j masked out, i.e. the input on which the decoder conditions.
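A sketch of this objective is given below, assuming a contiguous mask covering 50% of the tokens as in the example above; the model interface, the encode token-id lookup, and the mask token are illustrative assumptions.

import random
import torch.nn.functional as F

MASK = "<mask>"

def mask_fragment(tokens, ratio=0.5):
    # Mask one random contiguous fragment covering `ratio` of the sequence.
    n = max(1, int(len(tokens) * ratio))
    i = random.randint(0, len(tokens) - n)
    return tokens[:i] + [MASK] * n + tokens[i + n:], i, i + n

def mass_loss(model, tokens):
    masked, i, j = mask_fragment(tokens)     # masked is x_{\i:j}
    logits = model(masked)                   # (seq_len, vocab) scores, assumed interface
    target = encode(tokens[i:j])             # hypothetical id lookup for x_{i:j}
    # negative log-likelihood of the masked fragment given the masked input
    return F.cross_entropy(logits[i:j], target)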
In other examples, a new training set can also be constructed through data augmentation for further joint training. That is, pre-training the encoder, the word representation synthesizer and the decoder according to the first training set can be implemented as follows: corpus data in the first training set is randomly selected and the unsegmented words in the selected corpus data are segmented to obtain new corpus data; the new corpus data is added to the first training set to obtain a second training set; masks are randomly added to the corpus data of the first training set and of the second training set to obtain a second mask training set and a third mask training set respectively; and the encoder, the word representation synthesizer and the decoder are jointly trained with MASS according to the second mask training set and the third mask training set.
It should be noted that the second mask training set is substantially the same as the first mask training set in the previous example; the loss brought by the second mask training set can therefore be computed with the expression shown in the previous example. The corpus data of the third mask training set is obtained by segmenting the corpus data of the first training set a further time before masking, so it can be regarded as another first mask training set with a stronger degree of word segmentation. The loss brought by the third mask training set can thus be computed in a similar way, i.e. through the following expression:
L_split(θ, l) = E_{x∈D_s}[ −log p(x_{i:j} | x_{\i:j}; θ) ]
Here, L_split(θ, l) denotes the loss value for the corpus data of the third mask training set corresponding to language l, and mainly describes the probability that the masked part of corpus data x in the third mask training set is correctly decoded by the decoder during training; θ is the training parameter; x denotes corpus data in the second training set; D_s denotes the second training set corresponding to language l; log p(x_{i:j} | x_{\i:j}; θ) denotes the log conditional probability that the masked part of the corpus data is accurately decoded by the decoder; x_{i:j} denotes the masked part of the corpus data; i and j denote the start and end positions of the masked part within the corpus data; and x_{\i:j} denotes the corpus data with the fragment from i to j masked out.
It is easy to see that this example adds a data-augmentation training set to the previous example for training.
It is worth mentioning that, by splitting words, the word representation synthesizer participates more in the MASS training, which further improves the performance of the model.
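The augmentation itself can be sketched as follows; the sampling rate and the split_further helper (which splits words left whole by the first segmentation) are assumptions for illustration.

import random

def augment(first_training_set, sample_rate=0.3):
    # Build the second training set: keep the original sentences and append
    # re-split copies of randomly selected ones.
    second = list(first_training_set)
    for sent in first_training_set:
        if random.random() < sample_rate:
            second.append([piece for w in sent for piece in split_further(w)])
    return second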
In still other examples, pre-training the encoder, the word representation synthesizer and the decoder according to the first training set can also be implemented as follows: a preset amount of corpus data is selected from the first training set as target corpus data; the unsegmented words in the target corpus data are segmented again to obtain target segmented corpus data; the target corpus data and the target segmented corpus data are merged to obtain a third training set, where one piece of training data in the third training set is a corpus data pair consisting of one piece of target corpus data and the corresponding target segmented corpus data; masks are randomly added to the corpus data of the first training set to obtain a fourth mask training set; and the encoder, the word representation synthesizer and the decoder are jointly trained according to the third training set and the fourth mask training set, where the third training set is used for supervised training of the word representation synthesizer and the fourth mask training set is used for MASS-based training of the encoder, the word representation synthesizer and the decoder.
It should be noted that this example further segments the unsegmented words in the corpus data of the first training set, so that the word before segmentation serves as a supervision signal for the several subwords after segmentation, thereby providing additional supervised training for the word representation synthesizer. The loss brought by the third training set is computed through the following loss function expression:
L_combiner(θ; l) = E_{x∈D_t}[ Σ_{x_i} −DIS(E_true(x_i), E_fake(x_i)) ]
Here, L_combiner(θ; l) denotes the loss value for the corpus data of the third training set corresponding to language l, and mainly describes the accuracy with which the word representation synthesizer combines subwords into words; θ is the training parameter; D_t denotes the third training set corresponding to language l; x denotes a corpus data pair; t(x) denotes the target segmented corpus data in the pair; E_true(x_i) denotes the output of the word representation synthesizer for the target corpus data of a pair in the third training set; E_fake(x_i) denotes the output of the word representation synthesizer for the target segmented corpus data of that pair; and DIS(E_true(x_i), E_fake(x_i)) denotes the negative distance between E_true(x_i) and E_fake(x_i).
It is easy to see that this example adds supervised training of the word representation synthesizer on top of the first example.
It is worth mentioning that splitting previously unsplit words and driving the synthesizer's representation toward the representation of the word before splitting trains the synthesizer explicitly, which helps obtain better synthesized word representations.
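A sketch of this supervised objective is shown below: the synthesizer output for the re-split sentence is pulled toward its output for the unsplit sentence. Squared Euclidean distance is an assumed choice of DIS (the embodiment only requires some negative distance measure), and model.synthesize is a hypothetical handle on the encoder plus word representation synthesizer.

def combiner_loss(model, original, resplit):
    # original / resplit: the two halves of one third-training-set pair.
    e_true = model.synthesize(original)   # E_true(x_i), unsplit words
    e_fake = model.synthesize(resplit)    # E_fake(x_i), re-split words
    # minimizing the distance maximizes the negative distance DIS
    return ((e_true - e_fake) ** 2).sum(dim=-1).mean()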
It should be noted that the above is only an illustration of pre-training. In other examples, both the training set obtained by data augmentation and the data set for supervised training can be added on top of the first example for joint training. Moreover, the above examples are mainly described with respect to one first training set; in execution, the joint effect of the two first training sets must be considered. In joint training, it is not simply two loss functions being added; rather, each loss function is accumulated over the two monolingual corpus data sets. Taking as an example joint training that simultaneously adds the data-augmentation training set and the supervised-training data set, with Chinese and English monolingual corpus data sets, the total loss function is:
L(θ) = L_mass(θ, zh) + L_mass(θ, en) + L_combiner(θ; zh) + L_combiner(θ; en) + L_split(θ; zh) + L_split(θ; en)
Here, L(θ) is the total loss value; L_mass(θ, zh) and L_mass(θ, en) denote the loss values of the MASS training part for the first training sets corresponding to Chinese and English respectively; L_combiner(θ; zh) and L_combiner(θ; en) denote the loss values of the supervised training part of the word representation synthesizer for Chinese and English respectively; and L_split(θ; zh) and L_split(θ; en) denote the loss values of the data-augmentation MASS training part for Chinese and English respectively.
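In code, the total objective is just the sum of the per-language terms; the three loss helpers below stand for the expressions defined earlier and are assumed to exist.

def total_loss(model, data):
    loss = 0.0
    for lang in ("zh", "en"):          # one term per monolingual corpus
        loss = loss + L_mass(model, data[lang]) \
                    + L_combiner(model, data[lang]) \
                    + L_split(model, data[lang])
    return loss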
In addition, the back-translation process is as follows: a monolingual corpus data set is fed as input to the pre-trained encoder, word representation synthesizer and decoder, and the outputs at the decoder form a translation reference data set; the translation reference data set is then fed as input to the pre-trained encoder, word representation synthesizer and decoder, with the corresponding monolingual corpus data set serving as the supervision signal. This constructs a pseudo-parallel corpus data set and realizes supervised training. For example, processing the Chinese monolingual corpus data set with the pre-trained encoder, word representation synthesizer and decoder yields the corresponding English expressions, from which an English translation reference data set is constructed; the English translation reference data set is then fed as input to the pre-trained encoder, word representation synthesizer and decoder, with the Chinese monolingual corpus data set serving as the supervision signal for training.
In particular, the loss function expression for back-translation is as follows:
L_bt(θ, l) = E_{x∈D_l}[ −log p(x | M(x); θ) ]
Here, L_bt(θ, l) denotes the back-translation loss value for language l; θ is the training parameter; D_l denotes the monolingual corpus data set of language l; and M(x) denotes the corpus data corresponding to x in the translation reference data set, i.e. the output at the decoder after x passes through the pre-trained encoder, word representation synthesizer and decoder.
When a Chinese sentence is input to the encoder, the model encodes contextual word representations in a bilingual space, and the decoder can be made to decode into English, generally controlled through a start symbol or a language code, so that an English translation is generated. This initial translation result has a certain quality but is not yet ideal. Back-translation training can exploit monolingual data and the existing translation model to further improve translation quality.
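One back-translation round might look like the sketch below; translate and train_step are hypothetical wrappers around the pre-trained encoder-synthesizer-decoder stack and its optimizer.

def back_translation_round(model, zh_corpus, en_corpus):
    # 1. Build the translation reference data sets with the current model.
    en_refs = [translate(model, x, tgt="en") for x in zh_corpus]
    zh_refs = [translate(model, x, tgt="zh") for x in en_corpus]
    # 2. Train on the pseudo-parallel pairs, with the original monolingual
    #    sentence serving as the supervision signal.
    for src, tgt in zip(en_refs, zh_corpus):
        train_step(model, src, tgt)    # reconstruct zh from its en translation
    for src, tgt in zip(zh_refs, en_corpus):
        train_step(model, src, tgt)    # reconstruct en from its zh translation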
It can be understood that, after the loss value is obtained, the training parameter θ can be adjusted according to the loss value, so that training continues with the adjusted θ until the loss value meets a preset threshold, the loss value converges, or the number of training iterations reaches a preset value.
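A generic loop matching these stopping criteria (threshold, convergence, iteration cap; all constants illustrative):

def train(model, step_fn, threshold=0.1, eps=1e-4, max_steps=100_000):
    prev = float("inf")
    for _ in range(max_steps):       # stop once the preset step count is reached
        loss = step_fn(model)        # one optimization step; returns the loss
        if loss < threshold:         # stop once the loss meets the threshold
            break
        if abs(prev - loss) < eps:   # stop once the loss has converged
            break
        prev = loss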
In addition, it should be understood that the division of the above methods into steps is only for clarity of description; in implementation, steps may be merged into one step, or a step may be split into multiple steps, and all such variants fall within the protection scope of this patent as long as they contain the same logical relationship. Adding insignificant modifications to, or introducing insignificant designs into, the algorithm or flow without changing its core design also falls within the protection scope of this patent.
Another aspect of the embodiments of the present application provides a machine translation apparatus, as shown in Figure 2, including:
an acquisition module 201, configured to acquire corpus data to be translated;
an encoding module 202, configured to input the word-segmented corpus data to be translated into an encoder to obtain a subword-based context representation;
a synthesis module 203, configured to input the subword-based context representation into a word representation synthesizer to obtain a word-based context representation;
a decoding module 204, configured to input the word-based context representation into a decoder to obtain a translation result of the corpus data to be translated.
It is not difficult to find that this embodiment is an apparatus embodiment corresponding to the method embodiment, and the two can be implemented in cooperation with each other. The relevant technical details mentioned in the method embodiment remain valid in this embodiment and are not repeated here to reduce repetition; correspondingly, the relevant technical details mentioned in this embodiment can also be applied in the method embodiment.
It is worth mentioning that all modules involved in this embodiment are logical modules. In practical applications, a logical unit may be one physical unit, a part of one physical unit, or a combination of multiple physical units. In addition, in order to highlight the innovative part of the present application, units not closely related to solving the technical problem raised by the present application are not introduced in this embodiment, which does not mean that no other units exist in this embodiment.
Another aspect of the embodiments of the present application provides an electronic device, as shown in Figure 3, including: at least one processor 301; and a memory 302 communicatively connected to the at least one processor 301, where the memory 302 stores instructions executable by the at least one processor 301, and the instructions are executed by the at least one processor 301 so that the at least one processor 301 can execute the machine translation method described in any of the above method embodiments.
The memory 302 and the processor 301 are connected by a bus. The bus may include any number of interconnected buses and bridges, connecting the one or more processors 301 and the various circuits of the memory 302 together. The bus may also connect various other circuits, such as peripherals, voltage regulators and power management circuits, which are well known in the art and therefore not described further here. A bus interface provides an interface between the bus and a transceiver. The transceiver may be one element or multiple elements, such as multiple receivers and transmitters, providing a unit for communicating with various other apparatuses over a transmission medium. Data processed by the processor 301 is transmitted over a wireless medium through an antenna, and the antenna also receives data and transmits it to the processor 301.
The processor 301 is responsible for managing the bus and general processing, and may also provide various functions including timing, peripheral interfaces, voltage regulation, power management and other control functions, while the memory 302 may be used to store data used by the processor 301 when performing operations.
Another aspect of the embodiments of the present application provides a computer-readable storage medium storing a computer program. When the computer program is executed by a processor, the machine translation method described in any of the above method embodiments is realized.
That is, those skilled in the art can understand that all or part of the steps in the methods of the above embodiments can be completed by instructing relevant hardware through a program. The program is stored in a storage medium and includes several instructions to cause a device (which may be a single-chip microcomputer, a chip, etc.) or a processor to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.
In the machine translation method proposed by the embodiments of the present application, after the encoder outputs the subword-based context representation of the corpus data to be translated, the subword-based context representation is not fed directly into the decoder. Instead, it is first fed into the word representation synthesizer, which synthesizes the subword-based context representation at word granularity to obtain a word-based context representation, and the word-based context representation is then fed into the decoder. In other words, on top of the original encoder-decoder plus attention mechanism, a word representation synthesizer is additionally introduced between the encoder and the decoder, changing the decoder's input from the original subword-based context representation to a word-based context representation, i.e. changing the decoder's translation granularity from subwords to words. Since the meaning of a word is more stable than that of a subword and is less affected by linguistic structure, such as context and the way a unit is embedded in a sentence, the decoder does not need to reconstruct words from subwords. This avoids the problem that, because subword composition differs across languages, the meaning of words reconstructed from subwords changes between the source and target languages, which in turn causes large changes in sentence meaning before and after translation. The meaning of each word in the translated sentence is therefore more accurate, and the translated sentence is accordingly more accurate, overcoming language restrictions in the translation process and enabling effective, accurate translation between any languages.
Those of ordinary skill in the art can understand that the above embodiments are embodiments for implementing the present application, and in practical applications various changes in form and detail may be made to them without departing from the spirit and scope of the present application.

Claims (10)

  1. A machine translation method, comprising:
    obtaining corpus data to be translated;
    inputting the word-segmented corpus data to be translated into an encoder to obtain a subword-based context representation;
    inputting the subword-based context representation into a word representation synthesizer to obtain a word-based context representation;
    inputting the word-based context representation into a decoder to obtain a translation result of the corpus data to be translated.
  2. The machine translation method according to claim 1, wherein before inputting the word-segmented corpus data to be translated into the encoder to obtain the subword-based context representation, the method further comprises:
    performing word segmentation on the corpus data to be translated and generating first position labels for the segmented words in the corpus data to be translated;
    wherein inputting the subword-based context representation into the word representation synthesizer to obtain the word-based context representation comprises:
    inputting the subword-based context representation and the first position labels into the word representation synthesizer, so as to synthesize the context representations of several subwords from the same word in the subword-based context representation and obtain the word-based context representation.
  3. The machine translation method according to claim 1, wherein before inputting the word-segmented corpus data to be translated into the encoder to obtain the subword-based context representation, the method further comprises:
    generating a second position label for every word in the corpus data to be translated and performing word segmentation on the corpus data to be translated;
    wherein inputting the subword-based context representation into the word representation synthesizer to obtain the word-based context representation comprises:
    inputting the subword-based context representation and the second position labels into the word representation synthesizer, so as to synthesize the context representations of several subwords from the same word in the subword-based context representation and obtain the word-based context representation.
  4. The machine translation method according to any one of claims 1 to 3, wherein before inputting the word-segmented corpus data to be translated into the encoder to obtain the subword-based context representation, the method further comprises:
    obtaining two monolingual corpus data sets;
    performing word segmentation on the corpus data in the two monolingual corpus data sets to obtain two first training sets;
    pre-training the encoder, the word representation synthesizer and the decoder according to the first training sets;
    performing back-translation processing on the pre-trained encoder, word representation synthesizer and decoder according to the two monolingual corpus data sets to obtain the trained encoder, word representation synthesizer and decoder.
  5. The machine translation method according to claim 4, wherein pre-training the encoder, the word representation synthesizer and the decoder according to the first training set comprises:
    randomly adding masks to the corpus data of the first training set to obtain a first mask training set;
    performing joint training based on Masked Sequence to Sequence (MASS) on the encoder, the word representation synthesizer and the decoder according to the first mask training set.
  6. The machine translation method according to claim 4, wherein pre-training the encoder, the word representation synthesizer and the decoder according to the first training set comprises:
    randomly selecting corpus data in the first training set and segmenting the unsegmented words in the selected corpus data to obtain new corpus data;
    adding the new corpus data to the first training set to obtain a second training set;
    randomly adding masks to the corpus data in the first training set and the second training set to obtain a second mask training set and a third mask training set respectively;
    performing joint MASS training on the encoder, the word representation synthesizer and the decoder according to the second mask training set and the third mask training set.
  7. The machine translation method according to claim 4, wherein pre-training the encoder, the word representation synthesizer and the decoder according to the first training set comprises:
    determining a preset amount of corpus data from the corpus data in the first training set as target corpus data;
    performing word segmentation again on the unsegmented words in the target corpus data to obtain target segmented corpus data;
    merging the target corpus data and the target segmented corpus data to obtain a third training set, wherein one piece of training data in the third training set is a corpus data pair consisting of one piece of the target corpus data and the corresponding target segmented corpus data;
    randomly adding masks to the corpus data in the first training set to obtain a fourth mask training set;
    jointly training the encoder, the word representation synthesizer and the decoder according to the third training set and the fourth mask training set, wherein the third training set is used for supervised training of the word representation synthesizer and the fourth mask training set is used for MASS-based training of the encoder, the word representation synthesizer and the decoder.
  8. A machine translation apparatus, comprising:
    an acquisition module, configured to acquire corpus data to be translated;
    an encoding module, configured to input the word-segmented corpus data to be translated into an encoder to obtain a subword-based context representation;
    a synthesis module, configured to input the subword-based context representation into a word representation synthesizer to obtain a word-based context representation;
    a decoding module, configured to input the word-based context representation into a decoder to obtain a translation result of the corpus data to be translated.
  9. An electronic device, comprising:
    at least one processor; and
    a memory communicatively connected to the at least one processor; wherein
    the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to execute the machine translation method according to any one of claims 1 to 7.
  10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the machine translation method according to any one of claims 1 to 7.
PCT/CN2022/140417 2021-12-20 2022-12-20 Machine translation method and apparatus, electronic device and storage medium WO2023116709A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111567148.X 2021-12-20
CN202111567148.XA CN116384414A (en) 2021-12-20 2021-12-20 Machine translation method, device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2023116709A1 true WO2023116709A1 (en) 2023-06-29

Family

ID=86901251

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/140417 WO2023116709A1 (en) 2021-12-20 2022-12-20 Machine translation method and apparatus, electronic device and storage medium

Country Status (2)

Country Link
CN (1) CN116384414A (en)
WO (1) WO2023116709A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH02206869A (en) * 1989-02-06 1990-08-16 Nippon Telegr & Teleph Corp <Ntt> Japanese noun compound word translation system in automatic japanese language translation system
CN107977364A (en) * 2017-12-30 2018-05-01 科大讯飞股份有限公司 Tie up language word segmentation method and device
CN110334360A (en) * 2019-07-08 2019-10-15 腾讯科技(深圳)有限公司 Machine translation method and device, electronic equipment and storage medium
CN113297841A (en) * 2021-05-24 2021-08-24 哈尔滨工业大学 Neural machine translation method based on pre-training double-word vectors

Also Published As

Publication number Publication date
CN116384414A (en) 2023-07-04

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22910021

Country of ref document: EP

Kind code of ref document: A1