WO2022116841A1 - Text translation method, apparatus and device, and storage medium - Google Patents

Text translation method, apparatus and device, and storage medium Download PDF

Info

Publication number
WO2022116841A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
training
sequence
translation
original
Prior art date
Application number
PCT/CN2021/131360
Other languages
French (fr)
Chinese (zh)
Inventor
赵程绮
王涛
王明轩
李磊
Original Assignee
北京有竹居网络技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京有竹居网络技术有限公司 filed Critical 北京有竹居网络技术有限公司
Publication of WO2022116841A1 publication Critical patent/WO2022116841A1/en

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • G06F40/58Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • G06F40/35Discourse or dialogue representation

Definitions

  • the present disclosure relates to the technical field of machine translation, for example, to a text translation method, apparatus, device, and storage medium.
  • MT has evolved from rule-based methods to statistical methods, and then to neural network-based neural machine translation (NMT).
  • NMT adopts a sequence-to-sequence (seq2seq) structure, which consists of an encoder and a decoder.
  • The encoder encodes the source sentence into a vector representation, and the decoder then generates the corresponding translation word by word from that representation.
  • the present disclosure provides a text translation method, apparatus, device, and storage medium, so as to realize translation of dialogue text and improve the accuracy of dialogue-text translation.
  • the present disclosure provides a text translation method, including:
  • at least two sentences of the text to be translated are segmented using a preset segmentation label;
  • the segmented text to be translated is input into an encoder of a machine translation model to obtain a word sequence and a tag sequence of the text to be translated;
  • the word sequence and the tag sequence are input into a decoder of the machine translation model to obtain a target translation result.
  • the present disclosure also provides a text translation device, comprising:
  • a sentence segmentation module, configured to segment at least two sentences of the text to be translated using a preset segmentation label;
  • a sequence acquisition module, configured to input the segmented text to be translated into an encoder of a machine translation model to obtain a word sequence and a tag sequence of the text to be translated;
  • a target translation result acquisition module, configured to input the word sequence and the tag sequence into a decoder of the machine translation model to obtain a target translation result.
  • the present disclosure also provides an electronic device, the electronic device comprising:
  • a storage device configured to store one or more instructions;
  • the one or more instructions, when executed by the one or more processing devices, cause the one or more processing devices to implement the above-described text translation method.
  • the present disclosure also provides a computer-readable storage medium on which a computer program is stored; when the program is executed by a processing device, the above-mentioned text translation method is implemented.
  • FIG. 1 is a flowchart of a text translation method provided by an embodiment of the present disclosure
  • FIG. 2a is an example of a training machine translation model provided by an embodiment of the present disclosure
  • Figure 2b is an effect diagram of a text translation provided by an embodiment of the present disclosure
  • FIG. 3 is a schematic structural diagram of a text translation apparatus provided by an embodiment of the present disclosure.
  • FIG. 4 is a schematic structural diagram of an electronic device provided by an embodiment of the present disclosure.
  • method embodiments of the present disclosure may be performed in different orders and/or in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this regard.
  • the term "including" and variations thereof are open-ended inclusions, i.e., "including but not limited to".
  • the term “based on” is “based at least in part on.”
  • the term “one embodiment” means “at least one embodiment”; the term “another embodiment” means “at least one additional embodiment”; the term “some embodiments” means “at least some embodiments”. Relevant definitions of other terms will be given in the description below.
  • Translation systems used in dialogue are often at the single-sentence level. Problems such as personal pronoun omission, punctuation omission, and typos often occur in dialogue. It is difficult for a single-sentence-level translation system to solve these problems, resulting in low accuracy of translation results.
  • Table 1 is an example table of the translation of dialogue fragments by a translation model.
  • The second example is punctuation omission, which is common in everyday chat scenarios, where spaces are used to mark pauses instead of punctuation. However, this can have a large impact on the translation system.
  • The omitted "?" in the example results in a loss of semantics in the translation result.
  • The third example is a typo: the character "了" (le) was mistakenly typed as "乐" ("happy"), so "happy" appears in the translation result with completely wrong semantics.
  • FIG. 1 is a flowchart of a text translation method provided by an embodiment of the present disclosure.
  • This embodiment can be applied to translating dialogue text. The method can be executed by a text translation apparatus, which can be composed of hardware and/or software and can generally be integrated in a device with a text translation function, such as a server or a server cluster.
  • the method includes the following steps:
  • Step 110: segment at least two sentences of the text to be translated using a preset segmentation label.
  • The text to be translated may be text formed by a dialogue between two or more people, and contains at least two sentences.
  • The preset segmentation label can be a predefined label used to segment sentences in the text, for example: <sep>.
  • the text to be translated may be offline text or online text.
  • offline text can be understood as non-real-time text that has been generated, such as subtitles in film and television dramas
  • online text can be understood as dialogue text generated in real time.
  • During actual translation, each sentence is translated in real time as it is generated, and when the dialogue ends, the translation of the dialogue is complete.
  • historical dialogue information can be referred to when translating the currently generated dialogue.
  • The method of segmenting at least two sentences of the text to be translated with the preset segmentation label may be: obtain the current sentence and a set number of forward sentences to form the text to be translated, then segment the current sentence and the set number of forward sentences using the preset segmentation label.
  • The set number can be chosen by the developer, for example: 5.
  • A forward sentence can be understood as the context preceding the current sentence.
  • The text to be translated is thus composed of the current sentence and a set number of forward sentences. The advantage is that the preceding context can be referenced when translating the current sentence, improving translation accuracy.
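The context-window construction described above can be sketched as follows. The label `<sep>` and the window size of 5 follow the examples in the text; the function name and list-based interface are assumptions for illustration only.

```python
# Minimal sketch of the context-window segmentation; SEP and the default
# window size follow the disclosure's examples, everything else is assumed.
SEP = "<sep>"

def build_input(forward_sentences, current_sentence, context_size=5):
    """Join up to `context_size` forward sentences with the current
    sentence, separated by the preset segmentation label."""
    context = forward_sentences[-context_size:]
    return f" {SEP} ".join(context + [current_sentence])

# Usage: the current sentence is translated together with its context.
history = ["How are you", "Fine thanks"]
print(build_input(history, "See you tomorrow"))
# -> How are you <sep> Fine thanks <sep> See you tomorrow
```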
  • Step 120: input the segmented text to be translated into the encoder of the machine translation model to obtain a word sequence and a tag sequence of the text to be translated.
  • The role of the encoder is to encode the text to be translated into vector sequences.
  • The word sequence can be understood as a sequence of the values corresponding to the words contained in the text to be translated.
  • The tag sequence can be understood as a sequence formed by adding a tag to each word contained in the text to be translated; the tag indicates whether a word follows an omitted pronoun, stands at omitted punctuation, is a typo, or is a normal word.
  • The role of the tag sequence is to assist the decoder in translating the word sequence.
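A toy illustration of the aligned pair the encoder produces. The vocabulary ids are invented, and the tag values follow the set values given later in the training section (e.g. 3 for an omitted-punctuation position):

```python
# Hypothetical encoder output for a three-word input whose trailing "?"
# was omitted; the ids below are invented for illustration.
word_sequence = [1012, 2044, 3301]  # toy vocabulary ids for the three words
tag_sequence = [0, 0, 3]            # 3 marks the dropped-punctuation position

# The sequences are aligned token by token, so the decoder can consult
# the tag of each word while translating it.
assert len(word_sequence) == len(tag_sequence)
print(list(zip(word_sequence, tag_sequence)))
```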
  • Step 130: input the word sequence and the tag sequence into the decoder of the machine translation model to obtain the target translation result.
  • The decoder is used to decode the vectors generated by the encoder to obtain the target translation result.
  • the text to be translated is online text
  • The following two methods can be used for translation: one is to translate the current sentence with reference to the preceding context, and the other is to translate the current sentence with reference to the historical translation results.
  • The first method may proceed as follows: the text to be translated, composed of the current sentence and a set number of forward sentences, is segmented with the preset segmentation label and input into the machine translation model to obtain the target translation result; the translation corresponding to the current sentence is then cut out of the target translation result. That is, the preceding context is re-translated, which prevents erroneous historical translations from being propagated forward.
  • The second method may proceed as follows: after the text to be translated, composed of the current sentence and a set number of forward sentences, is segmented with the preset segmentation label, the historical translation results corresponding to the forward sentences are obtained; the historical translation results and the segmented text to be translated are input into the machine translation model to obtain the target translation result, and the translation corresponding to the current sentence is then cut out of the target translation result.
  • Since the historical translation results corresponding to the forward sentences are provided as input, the machine translation model does not need to re-translate the forward sentences; it only translates the current sentence with reference to the historical translation results, which saves computation and maintains translation coherence.
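The first online mode (re-translate the windowed context, then keep only the current sentence's segment) can be sketched as below. The `translate` callable is a stand-in for the machine translation model, and the assumption that the model preserves segmentation labels in its output is ours, not the disclosure's:

```python
SEP = "<sep>"  # assumed segmentation label

def translate_online_cut(forward_sentences, current_sentence, translate,
                         context_size=5):
    """Re-translate the current sentence with its context, then cut out
    only the part corresponding to the current sentence. `translate` is a
    stand-in for the machine translation model and is assumed to keep the
    segmentation labels in its output."""
    window = forward_sentences[-context_size:] + [current_sentence]
    target = translate(f" {SEP} ".join(window))
    # Keep only the segment for the current (last) sentence.
    return target.split(SEP)[-1].strip()

# Toy "model" for illustration only: a fixed phrase table.
toy_translate = lambda s: s.replace("ni hao", "hello").replace("zai jian", "goodbye")
print(translate_online_cut(["ni hao"], "zai jian", toy_translate))
# -> goodbye
```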
  • the machine translation model consists of an encoder and a decoder (encoder-decoder).
  • The training process of the machine translation model can be as follows: obtain the original text and the original translation result of the original text; segment at least two sentences of the original text using the preset segmentation label; preprocess the segmented original text according to set rules to obtain the training text; add tags to the training text according to the set rules to obtain the original tag sequence; train the machine translation model based on the training text, the original translation result, and the original tag sequence.
  • The set rules may include at least one of the following: discarding pronouns, discarding punctuation marks, and replacing words with typos.
  • Tags are added to the training text according to the set rules, and the original tag sequence can be obtained in the following way: if the preprocessing of the original text discarded a pronoun, the tag added to the word after the pronoun in the training text is a first set value; if the preprocessing discarded punctuation, the tag added at the punctuation position in the training text is a second set value; if the preprocessing replaced a word with a typo, the tag added to the misspelled word in the training text is a third set value; the tag added to words that were not preprocessed is a fourth set value.
  • The first set value may be 2, the second set value may be 3, the third set value may be 1, and the fourth set value may be 0.
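Under the set values just listed (0 = normal, 1 = typo, 2 = word after a dropped pronoun, 3 = dropped punctuation), the preprocessing and tagging rules can be sketched as follows. Anchoring the punctuation tag on the preceding word is an assumption here, since the disclosure only says the tag is added "at the punctuation position":

```python
# Tag values per the disclosure's set values; the punctuation-anchoring
# choice and all function/parameter names are assumptions.
NORMAL, TYPO, AFTER_PRONOUN, DROPPED_PUNCT = 0, 1, 2, 3

def make_training_sample(tokens, pronouns, puncts, typos):
    """Apply the three noising rules to a tokenized sentence and build the
    matching original tag sequence. `typos` maps a correct token to its
    typo form."""
    out_tokens, tags = [], []
    after_dropped_pronoun = False
    for tok in tokens:
        if tok in pronouns:            # rule 1: discard the pronoun
            after_dropped_pronoun = True
            continue
        if tok in puncts:              # rule 2: discard the punctuation and
            if tags:                   # tag the preceding word (assumption)
                tags[-1] = DROPPED_PUNCT
            continue
        if tok in typos:               # rule 3: replace with a typo form
            out_tokens.append(typos[tok])
            tags.append(TYPO)
        else:
            out_tokens.append(tok)
            tags.append(AFTER_PRONOUN if after_dropped_pronoun else NORMAL)
        after_dropped_pronoun = False
    return out_tokens, tags

# "我 去 上 班 。" with the pronoun and period dropped and 上 -> 尚:
print(make_training_sample(["我", "去", "上", "班", "。"],
                           pronouns={"我"}, puncts={"。"}, typos={"上": "尚"}))
# -> (['去', '尚', '班'], [2, 1, 3])
```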
  • Table 2 is an example table for generating training samples provided by an embodiment of the present disclosure.
  • x (1) and x (2) are two consecutive Chinese sentences, and y (1) and y (2) are their corresponding translations, respectively.
  • The process of training the machine translation model based on the training text, the original translation result, and the original tag sequence may be as follows: input the training text into the encoder of the machine translation model to obtain the training tag sequence and the training word sequence; input the training word sequence and the training tag sequence into the decoder of the machine translation model to obtain the training translation result; calculate a first loss function from the training tag sequence and the original tag sequence; calculate a second loss function from the training translation result and the original translation result; train the encoder based on the first loss function and the second loss function, and train the decoder based on the second loss function.
  • The encoder has the function of encoding text into the word sequence and the tag sequence.
  • The decoder has the function of decoding the word sequence and the tag sequence and outputting the translation result.
  • FIG. 2a is an example of training a machine translation model provided by an embodiment of the present disclosure.
  • x'_d is the training text.
  • The training tag sequence L_SL is obtained.
  • The output of the decoder is the training translation result L_MT, where l'_x is the original tag sequence and y_d is the original translation result.
  • The first loss function is obtained from L_SL and l'_x,
  • and the second loss function is obtained from L_MT and y_d.
  • The parameters of the encoder are trained according to the first loss function and the second loss function, and the parameters of the decoder are trained according to the second loss function, until the machine translation model reaches the required translation accuracy.
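The two-loss objective can be illustrated numerically. This is not the disclosure's actual training code; it is a toy cross-entropy computation showing only that the encoder's objective combines both terms while the decoder's uses only the translation term:

```python
import numpy as np

def cross_entropy(probs, targets):
    """Mean negative log-likelihood of the target indices."""
    return float(-np.mean(np.log(probs[np.arange(len(targets)), targets])))

# Toy predicted distributions (each row sums to 1) for a 3-token sequence.
tag_probs = np.array([[0.7, 0.1, 0.1, 0.1],   # over the 4 tag values
                      [0.1, 0.7, 0.1, 0.1],
                      [0.1, 0.1, 0.1, 0.7]])
tag_targets = np.array([0, 1, 3])             # original tag sequence l'_x

mt_probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])  # toy 2-word vocab
mt_targets = np.array([0, 1, 0])              # original translation y_d

first_loss = cross_entropy(tag_probs, tag_targets)   # tag-sequence loss
second_loss = cross_entropy(mt_probs, mt_targets)    # translation loss

encoder_loss = first_loss + second_loss  # encoder is trained on both losses
decoder_loss = second_loss               # decoder only on the translation loss
print(round(encoder_loss, 4), round(decoder_loss, 4))
```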
  • The translation results of the translation model in this embodiment are compared with those of other translation models.
  • Table 3 is a comparison table of translation results.
  • BASE is the original translation model
  • DIALREPAIR and DIALROBUST are translation models improved from the original translation model
  • DIALMTL is the translation model of this embodiment.
  • The overall translation effect of the translation model in the embodiment of the present disclosure, its translation accuracy on sentences with dropped subjects, its translation accuracy on sentences with dropped punctuation, and its translation accuracy on sentences containing typos all reach the best results relative to the other translation models.
  • The embodiment of the present disclosure also verifies the overall translation effect (BLEU) and the translation accuracy (Accuracy) on sentences with dropped subjects when the method in this embodiment is used to translate offline text and online text.
  • context_length refers to the maximum number of sentences the model can use each time for online text (online); in the online-cut case, each newly obtained sentence is combined with the preceding dialogue, segmented with the segmentation label, and translated, and only the last sentence is kept; online-fd refers to using historical translation information for translation; offline refers to the offline-text translation scenario.
  • At least two sentences of the text to be translated are first segmented using a preset segmentation label; the segmented text to be translated is then input into the encoder of the machine translation model to obtain the word sequence and tag sequence of the text to be translated; finally, the word sequence and the tag sequence are input into the decoder of the machine translation model to obtain the target translation result.
  • The encoder of the machine translation model encodes the text to be translated into a word sequence and a tag sequence, and the decoder decodes the word sequence and the tag sequence together to obtain the final translation result, which realizes translation of dialogue text and improves the accuracy of dialogue-text translation.
  • FIG. 3 is a schematic structural diagram of a text translation apparatus provided by an embodiment of the present disclosure. As shown in Figure 3, the device includes:
  • The sentence segmentation module 210 is configured to segment at least two sentences of the text to be translated using a preset segmentation label; the sequence acquisition module 220 is configured to input the segmented text to be translated into the encoder of the machine translation model to obtain the word sequence and tag sequence of the text to be translated; the target translation result acquisition module 230 is configured to input the word sequence and the tag sequence into the decoder of the machine translation model to obtain the target translation result.
  • the text to be translated includes online text; the sentence segmentation module 210 is configured to:
  • a translation result interception module, configured to:
  • a historical translation result acquisition module, configured to:
  • the sequence acquisition module 220 is further configured to:
  • a machine translation model training module, configured to:
  • the machine translation model training module is further configured to:
  • calculate the first loss function according to the training tag sequence and the original tag sequence; calculate the second loss function according to the training translation result and the original translation result; train the encoder according to the first loss function and the second loss function, and train the decoder according to the second loss function.
  • the setting rule includes at least one of the following: discarding pronouns, discarding punctuation marks, and replacing words with typos.
  • the machine translation model training module is further configured such that:
  • the tag added to the word after a discarded pronoun in the training text is the first set value; if the original text was preprocessed to discard punctuation, the tag added at the punctuation position in the training text is the second set value; if the original text was preprocessed to replace a word with a typo, the tag added to the typo in the training text is the third set value; the tag added to words that were not preprocessed is the fourth set value.
  • the foregoing apparatus can execute the methods provided by all the foregoing embodiments of the present disclosure, and has functional modules and effects corresponding to executing the foregoing methods.
  • FIG. 4 shows a schematic structural diagram of an electronic device 300 suitable for implementing an embodiment of the present disclosure.
  • The electronic devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, personal digital assistants (PDAs), tablet computers (PADs), portable multimedia players (PMPs), and in-vehicle terminals (such as in-vehicle navigation terminals); fixed terminals such as digital televisions (TVs) and desktop computers; and various forms of servers, such as independent servers or server clusters.
  • The electronic device 300 may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 301, which may perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 302 or a program loaded from a storage device 308 into a random access memory (RAM) 303.
  • The RAM 303 also stores various programs and data required for the operation of the electronic device 300.
  • the processing device 301, the ROM 302, and the RAM 303 are connected to each other through a bus 304.
  • An input/output (I/O) interface 305 is also connected to the bus 304.
  • The following devices can be connected to the I/O interface 305: input devices 306 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 307 including, for example, a liquid crystal display (LCD), speaker, vibrator, etc.; storage devices 308 including, for example, magnetic tape, hard disk, etc.; and a communication device 309.
  • Communication means 309 may allow electronic device 300 to communicate wirelessly or by wire with other devices to exchange data.
  • Although FIG. 4 shows the electronic device 300 with various devices, it is not required to implement or include all of the devices shown; more or fewer devices may alternatively be implemented or provided.
  • Embodiments of the present disclosure include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for performing the methods shown in the flowcharts.
  • the computer program may be downloaded and installed from the network via the communication device 309, or from the storage device 308, or from the ROM 302.
  • When the computer program is executed by the processing device 301, the above-mentioned functions defined in the methods of the embodiments of the present disclosure are executed.
  • the computer-readable medium described above in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two.
  • the computer-readable storage medium can be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or a combination of any of the above.
  • Examples of computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, RAM, ROM, erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with computer-readable program code embodied thereon. Such propagated data signals may take a variety of forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • a computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium that can transmit, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
  • the program code embodied on the computer-readable medium may be transmitted by any suitable medium, including but not limited to: electric wire, optical fiber cable, radio frequency (RF), etc., or any suitable combination of the above.
  • Clients and servers can communicate using any currently known or future-developed network protocol, such as HyperText Transfer Protocol (HTTP), and can be interconnected with digital data communication in any form or medium (e.g., a communication network).
  • Examples of communication networks include local area networks (LANs), wide area networks (WANs), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future-developed networks.
  • the above-mentioned computer-readable medium may be included in the above-mentioned electronic device; or may exist alone without being assembled into the electronic device.
  • The above-mentioned computer-readable medium carries one or more programs, and when the one or more programs are executed by the electronic device, the electronic device: segments at least two sentences of the text to be translated using a preset segmentation label; inputs the segmented text to be translated into the encoder of the machine translation model to obtain the word sequence and tag sequence of the text to be translated; and inputs the word sequence and the tag sequence into the decoder of the machine translation model to obtain the target translation result.
  • Computer program code for performing the operations of the present disclosure may be written in one or more programming languages, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server.
  • the remote computer may be connected to the user's computer through any kind of network, including a LAN or WAN, or may be connected to an external computer (eg, using an Internet service provider to connect through the Internet).
  • Each block in the flowcharts or block diagrams may represent a module, segment, or portion of code that contains one or more executable instructions for implementing the specified logical functions.
  • the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
  • Each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by dedicated hardware-based systems that perform the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
  • The units involved in the embodiments of the present disclosure may be implemented in software or in hardware. In some cases, the name of a unit does not constitute a limitation on the unit itself.
  • Exemplary types of hardware logic components include: field programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), and so on.
  • a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with the instruction execution system, apparatus or device.
  • the machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • Machine-readable media may include, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any suitable combination of the foregoing. Examples of machine-readable storage media include one or more wire-based electrical connections, portable computer disks, hard disks, RAM, ROM, EPROM or flash memory, optical fibers, CD-ROMs, optical storage devices, magnetic storage devices, or any suitable combination of the above.
  • the embodiments of the present disclosure disclose a text translation method, including:
  • At least two sentences of the text to be translated are segmented using a preset segmentation label; the segmented text to be translated is input into the encoder of the machine translation model to obtain the word sequence and tag sequence of the text to be translated; the word sequence and the tag sequence are input into the decoder of the machine translation model to obtain the target translation result.
  • The text to be translated includes online text; the segmenting of at least two sentences of the text to be translated using the preset segmentation label includes:
  • the method further includes:
  • the translation result corresponding to the current sentence is cut out from the target translation result.
  • the method further includes:
  • the training process of the machine translation model is as follows:
  • Obtain the original text and the original translation result of the original text; segment at least two sentences of the original text using the preset segmentation label; preprocess the segmented original text according to set rules to obtain the training text; add tags to the training text according to the set rules to obtain the original tag sequence; train the machine translation model based on the training text, the original translation result, and the original tag sequence.
  • the machine translation model is trained based on the training text, the original translation result and the original tag sequence, including:
  • The set rules include at least one of the following: discarding pronouns, discarding punctuation marks, and replacing words with typos; adding tags to the training text according to the set rules to obtain the original tag sequence includes:
  • the tag added to the word after a discarded pronoun in the training text is the first set value; if the original text was preprocessed to discard punctuation, the tag added at the punctuation position in the training text is the second set value; if the original text was preprocessed to replace a word with a typo, the tag added to the misspelled word in the training text is the third set value; the tag added to words that were not preprocessed is the fourth set value.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

A text translation method. The text translation method comprises: S110, using a set segmentation label to segment at least two sentences of text to be translated; S120, inputting said text, after segmentation, into an encoder of a machine translation model to obtain a word sequence and a mark sequence of said text; and S130, inputting the word sequence and the mark sequence into a decoder of the machine translation model to obtain a target translation result.

Description

Text Translation Method, Apparatus, Device and Storage Medium

This application claims priority to Chinese Patent Application No. 202011408602.2, filed with the China Patent Office on December 4, 2020, the entire contents of which are incorporated herein by reference.

TECHNICAL FIELD

The present disclosure relates to the technical field of machine translation, and relates, for example, to a text translation method, apparatus, device and storage medium.
BACKGROUND

Machine Translation (MT) is an important area of Natural Language Processing (NLP) that aims to use machines to translate one language into another. Over years of development, MT has evolved from rule-based methods to statistics-based methods, and then to neural-network-based Neural Machine Translation (NMT). Generally speaking, like many other mainstream NLP tasks, NMT adopts a sequence-to-sequence (seq2seq) structure composed of an encoder and a decoder: the encoder encodes the source sentence into a vector representation, and the decoder then generates the corresponding translation word by word from that representation.
SUMMARY OF THE INVENTION

The present disclosure provides a text translation method, apparatus, device and storage medium, so as to realize the translation of dialogue text and improve the accuracy of dialogue text translation.

The present disclosure provides a text translation method, including:

segmenting at least two sentences of text to be translated using a set segmentation label;

inputting the segmented text to be translated into an encoder of a machine translation model to obtain a word sequence and a tag sequence of the text to be translated; and

inputting the word sequence and the tag sequence into a decoder of the machine translation model to obtain a target translation result.

The present disclosure further provides a text translation apparatus, including:

a sentence segmentation module, configured to segment at least two sentences of text to be translated using a set segmentation label;

a sequence acquisition module, configured to input the segmented text to be translated into an encoder of a machine translation model to obtain a word sequence and a tag sequence of the text to be translated; and

a target translation result acquisition module, configured to input the word sequence and the tag sequence into a decoder of the machine translation model to obtain a target translation result.

The present disclosure further provides an electronic device, including:

one or more processing devices; and

a storage device configured to store one or more instructions,

wherein the one or more instructions, when executed by the one or more processing devices, cause the one or more processing devices to implement the above text translation method.

The present disclosure further provides a computer-readable storage medium storing a computer program which, when executed by a processing device, implements the above text translation method.
BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a flowchart of a text translation method provided by an embodiment of the present disclosure;

FIG. 2a is an example of training a machine translation model provided by an embodiment of the present disclosure;

FIG. 2b is a diagram of text translation effects provided by an embodiment of the present disclosure;

FIG. 3 is a schematic structural diagram of a text translation apparatus provided by an embodiment of the present disclosure;

FIG. 4 is a schematic structural diagram of an electronic device provided by an embodiment of the present disclosure.
DETAILED DESCRIPTION

Embodiments of the present disclosure will be described below with reference to the accompanying drawings. Although some embodiments of the present disclosure are shown in the drawings, the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; these embodiments are provided for a more thorough and complete understanding of the present disclosure. The drawings and embodiments of the present disclosure are for illustrative purposes only.

The steps described in the method embodiments of the present disclosure may be performed in different orders and/or in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this regard.

As used herein, the term "including" and variations thereof are open-ended inclusions, i.e., "including but not limited to". The term "based on" means "based at least in part on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions of other terms will be given in the description below.

Concepts such as "first" and "second" mentioned in the present disclosure are only used to distinguish different apparatuses, modules or units, and are not used to limit the order of, or interdependence between, the functions performed by these apparatuses, modules or units.

The modifiers "a/an" and "a plurality of" mentioned in the present disclosure are illustrative rather than limiting; those skilled in the art should understand that, unless the context indicates otherwise, they should be construed as "one or more".

The names of messages or information exchanged between multiple apparatuses in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of these messages or information.
Translation systems applied to dialogue are often single-sentence-level. Problems such as omitted personal pronouns, omitted punctuation and typos occur frequently in dialogue, and a single-sentence-level translation system can hardly handle them, resulting in low translation accuracy.

Table 1 shows examples of a translation model translating dialogue fragments.

Table 1
Example 1 (omitted pronoun)
ZH: Nancy怎么了？[她]drop 是不是哭了啊。
MT: What happened to Nancy? Did you cry?
REF: What happened to Nancy? Did she cry?

Example 2 (omitted punctuation)
ZH: Nancy怎么了[？]drop 是不是哭了啊。
MT: Did Nancy cry?
REF: What happened to Nancy? Did she cry?

Example 3 (typo)
ZH: Nancy怎么[乐]typo？
MT: How happy is Nancy?
REF: What happened to Nancy?
As can be seen, in the first example the first sentence asks "Nancy怎么了？" ("What happened to Nancy?"), and the second sentence asks "是不是哭了啊" ("Did (she) cry?"), with "她" ("she") omitted. The translation system then guesses a subject and produces "you", yielding a wrong translation. For brevity and compactness, such omissions occur frequently in dialogue, especially in Chinese, Japanese, Korean, Vietnamese and the like.

The second example is punctuation omission, which is common in everyday chat scenarios, where spaces stand in for the marks. This has a large impact on the translation system: the omitted "？" in the example causes a loss of meaning in the translation result.

The third example is a typo: "了" was mistyped as "乐", so "happy" appears in the translation result, which is semantically completely wrong.
To solve the above problems, FIG. 1 is a flowchart of a text translation method provided by an embodiment of the present disclosure. This embodiment is applicable to the case of translating dialogue text. The method may be executed by a text translation apparatus, which may be composed of hardware and/or software and may generally be integrated in a device with a text translation function; the device may be an electronic device such as a server or a server cluster. As shown in FIG. 1, the method includes the following steps.

Step 110: segment at least two sentences of the text to be translated using a set segmentation label.

The text to be translated may be a text formed by a dialogue between two or more people and contains at least two sentences. The set segmentation label may be a preset label used to separate sentences in the text, for example <sep>. In this embodiment, the text to be translated may be offline text or online text, where offline text is non-real-time text that has already been generated, such as subtitles in films and TV dramas, and online text is dialogue text generated in real time.

After the text to be translated is obtained, the set segmentation label <sep> is added between every two adjacent sentences to separate the sentences in the text to be translated.
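As a minimal sketch (the function name and example sentences are illustrative; only the <sep> label itself comes from the description above), segmentation amounts to joining adjacent sentences with the set segmentation label:

```python
SEP = "<sep>"  # the set segmentation label described above

def segment_dialogue(sentences):
    """Insert the segmentation label between adjacent sentences."""
    return f" {SEP} ".join(sentences)

example = segment_dialogue(["Nancy怎么了？", "是不是哭了啊。"])
# "Nancy怎么了？ <sep> 是不是哭了啊。"
```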
In this embodiment, when the text to be translated is online text, sentences produced in real time are translated in real time during the actual translation process, and when the dialogue ends, the translation of the dialogue is complete. To improve translation accuracy, historical dialogue information may be referred to when the currently produced dialogue is translated.

When the text to be translated is online text, the at least two sentences of the text to be translated may be segmented using the set segmentation label as follows: obtain the current sentence and a set number of forward sentences to form the text to be translated, and segment the current sentence and the set number of forward sentences using the set segmentation label.

The set number may be configured by developers, for example 5. The forward sentences can be understood as the context preceding the current sentence. In this embodiment, the text to be translated is composed of the current sentence and the set number of forward sentences. The benefit of this is that, when the current sentence is translated, the preceding context can be consulted, thereby improving translation accuracy.
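A sketch of how the input for online text might be assembled, assuming the preceding sentences are kept in a simple list (all names here are illustrative):

```python
SEP = "<sep>"

def build_input(history, current, set_number=5):
    """Form the text to be translated from at most `set_number`
    forward (preceding) sentences plus the current sentence."""
    window = history[-set_number:] + [current]
    return f" {SEP} ".join(window)
```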
Step 120: input the segmented text to be translated into the encoder of the machine translation model to obtain a word sequence and a tag sequence of the text to be translated.

The function of the encoder is to compile the text to be translated into vectors, i.e., sequences. The word sequence can be understood as a sequence of values corresponding to the words contained in the text to be translated; the tag sequence is formed by adding a tag to each word, and indicates whether a word in the text to be translated follows an omitted pronoun, sits at an omitted punctuation position, is a typo, or is a normal word. The function of the tag sequence is to assist the decoder in translating the word sequence.

Step 130: input the word sequence and the tag sequence into the decoder of the machine translation model to obtain a target translation result.

The decoder parses, or translates, the vectors generated by the encoder to obtain the target translation result.

In this embodiment, when the text to be translated is online text, either of two translation modes may be used: one translates the current sentence with reference to the preceding context, and the other translates the current sentence with reference to historical translation results.

In the first mode, the text to be translated, composed of the current sentence and the set number of forward sentences, is segmented using the set segmentation label and input into the machine translation model to obtain a target translation result, from which the translation result corresponding to the current sentence is then cut out. That is, the preceding context is translated anew, preventing erroneous historical translation results from propagating downstream.
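The first mode can be sketched as follows. Here `translate` stands in for the machine translation model, and it is assumed (a detail the description does not fix) that the model's output preserves the <sep> boundaries, so the current sentence's translation can be cut out as the last segment:

```python
SEP = "<sep>"

def translate_online_cut(history, current, translate, set_number=5):
    """Translate the whole context window, then keep only the
    translation of the current (last) sentence."""
    window = history[-set_number:] + [current]
    target = translate(f" {SEP} ".join(window))
    return target.split(f" {SEP} ")[-1]  # cut out the current sentence
```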
In the second mode, after the text to be translated, composed of the current sentence and the set number of forward sentences, is segmented using the set segmentation label, the historical translation results corresponding to the set number of forward sentences are obtained, and both the historical translation results and the segmented text to be translated are input into the machine translation model to obtain a target translation result, from which the translation result corresponding to the current sentence is cut out. In this embodiment, since the historical translation results corresponding to the forward sentences are provided, the machine translation model does not need to translate the forward sentences again; it only refers to the historical translation results while translating the current sentence, which saves computation and keeps the translation consistent.
In this embodiment, the machine translation model is composed of an encoder and a decoder (encoder-decoder). The training process of the machine translation model may be: obtain the original text and the original translation result of the original text; segment at least two sentences of the original text using the set segmentation label; preprocess the segmented original text according to set rules to obtain training text; add tags to the training text according to the set rules to obtain an original tag sequence; and train the machine translation model based on the training text, the original translation result and the original tag sequence.

When training the machine translation model, since dialogue corpora are scarce, a large amount of document-level and sentence-level corpora is introduced to simulate the use of context. To increase the diversity of the corpora, the original text needs to be preprocessed according to set rules, which may include at least one of the following: discarding pronouns, discarding punctuation marks, and replacing words with typos.

Tags are added to the training text according to the set rules to obtain the original tag sequence as follows: if the original text was preprocessed by discarding a pronoun, the word following the pronoun in the training text is tagged with a first set value; if the original text was preprocessed by discarding punctuation, the punctuation position in the training text is tagged with a second set value; if the original text was preprocessed by replacing a word with a typo, the typo in the training text is tagged with a third set value; and words in the training text that were not preprocessed are tagged with a fourth set value.

The first set value may be 2, the second set value may be 3, the third set value may be 1, and the fourth set value may be 0. Exemplarily, Table 2 is an example of generating training samples provided by an embodiment of the present disclosure.
Table 2

(Table 2 is reproduced as an image in the original publication; its content is described below.)
As shown in Table 2, x(1) and x(2) are two consecutive Chinese sentences, and y(1) and y(2) are their corresponding translations, respectively. To use context information, the two sentences are connected with the set segmentation label <sep> (in a dialogue scenario, multiple preceding and following sentences are connected), yielding x_d, whose corresponding target y_d is composed of y(1) and y(2). Then, according to the set rules, some pronouns and punctuation marks are randomly dropped from the sentence and some words are replaced with typos, yielding a new sentence x'_d. Each position of the new sentence is tagged according to the set rules: the word following a dropped subject is tagged 2, a dropped punctuation mark 3, a typo 1, and an unprocessed word 0. This yields a tag sequence l'_x of the same length as x'_d.
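A sketch of this sample-generation step, under stated assumptions: the pronoun, punctuation and typo tables below are illustrative, and a dropped-punctuation position is represented with a `<blank>` placeholder token so that x'_d and l'_x stay aligned (the description does not fix this detail):

```python
import random

# Tag values from the description: 0 = untouched word, 1 = typo,
# 2 = word after a dropped pronoun, 3 = dropped-punctuation position.
PRONOUNS = {"她", "他", "你"}       # illustrative
PUNCTUATION = {"？", "。", "，"}    # illustrative
TYPO_MAP = {"了": "乐"}             # illustrative typo table

def noise_and_tag(tokens, rng=random):
    """Return (x'_d, l'_x): a noised token list and an equal-length tag list."""
    out, tags = [], []
    tag_next = 0
    for tok in tokens:
        if tok in PRONOUNS and rng.random() < 0.5:
            tag_next = 2                  # drop the pronoun; tag the next word
            continue
        if tok in PUNCTUATION and rng.random() < 0.5:
            out.append("<blank>")         # drop punctuation; keep the position
            tags.append(3)
            tag_next = 0
            continue
        if tok in TYPO_MAP and rng.random() < 0.5:
            out.append(TYPO_MAP[tok])     # substitute a typo
            tags.append(1)
            tag_next = 0
            continue
        out.append(tok)
        tags.append(tag_next)
        tag_next = 0
    return out, tags
```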
In this embodiment, the process of training the machine translation model based on the training text, the original translation result and the original tag sequence may be: input the training text into the encoder of the machine translation model to obtain a training tag sequence and a training word sequence; input the training word sequence and the training tag sequence into the decoder of the machine translation model to obtain a training translation result; compute a first loss function from the training tag sequence and the original tag sequence; compute a second loss function from the training translation result and the original translation result; train the encoder based on the first loss function and the second loss function, and train the decoder based on the second loss function.

The encoder has the function of compiling text into a word sequence and a tag sequence; the decoder has the function of decoding the word sequence and the tag sequence and outputting a translation result. Exemplarily, FIG. 2a illustrates training the machine translation model provided by an embodiment of the present disclosure. As shown in FIG. 2a, x'_d is the training text; after x'_d is input into the encoder, the training tag sequence L_SL is obtained, and the decoder outputs the training translation result L_MT; l'_x is the original tag sequence, and y_d is the original translation result. The first loss function is computed from L_SL and l'_x, and the second loss function from L_MT and y_d. The parameters of the encoder are trained according to the first and second loss functions, and the parameters of the decoder according to the second loss function, until the machine translation model reaches the required translation accuracy.
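A toy numeric sketch of how the two loss terms combine (plain cross-entropy over hand-written distributions stands in for both; a real model would compute these over softmax outputs of the encoder's tagging head and of the decoder):

```python
import math

def cross_entropy(pred_probs, target_ids):
    """Mean negative log-likelihood of the targets under per-position
    predicted distributions; stands in for both loss terms."""
    nll = -sum(math.log(p[t]) for p, t in zip(pred_probs, target_ids))
    return nll / len(target_ids)

# First loss: predicted tag distributions vs. the original tag sequence l'_x
# (4 tag classes, values 0-3 as in the description).
tag_probs = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.1, 0.7, 0.1]]
loss_sl = cross_entropy(tag_probs, [0, 2])

# Second loss: predicted target-word distributions vs. the original translation y_d.
word_probs = [[0.9, 0.05, 0.05], [0.2, 0.6, 0.2]]
loss_mt = cross_entropy(word_probs, [0, 1])

encoder_loss = loss_sl + loss_mt   # encoder trained on both losses
decoder_loss = loss_mt             # decoder trained on the translation loss only
```

The encoder's parameters receive gradients from both terms while the decoder's receive only the translation term, matching the training scheme described above.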
Exemplarily, in order to verify the translation effect of the translation method of the embodiments of the present disclosure, the translation results of the translation model in this embodiment were compared with those of other translation models. Table 3 is a comparison table of the translation results, where BASE is the original translation model, DIALREPAIR and DIALROBUST are translation models improved from the original model, and DIALMTL is the translation model of this embodiment.
Table 3

(Table 3 is reproduced as an image in the original publication; its findings are summarized below.)
As can be seen from Table 3, the translation model of the embodiments of the present disclosure achieves the best results among the compared models in overall translation quality and in translation accuracy for sentences with omitted subjects, sentences with omitted punctuation, and sentences containing typos.

The embodiments of the present disclosure also verified the overall translation quality (BLEU) and the omitted-subject translation accuracy (Accuracy) of the method of this embodiment on offline text and online text. As shown in FIG. 2b, context_length is the maximum number of context sentences the model can use at a time for online text; in the online-cut case, each newly obtained sentence is concatenated with the preceding dialogue using the label, the whole is translated, and only the last sentence is kept; online-fd translates using historical translation information; offline denotes the translation scenario for offline text. As can be seen from FIG. 2b, since offline uses the most context, it achieves the best results on subject completion, a task that strongly depends on context, and its overall BLEU quality is also high. In the online mode, online-fd continues translating using historical translation information, which keeps the translation consistent and yields the best BLEU result; however, because errors may propagate, its subject accuracy is slightly lower than that of the online-cut method.

In the technical solution of the embodiments of the present disclosure, at least two sentences of the text to be translated are first segmented using a set segmentation label, the segmented text to be translated is then input into the encoder of the machine translation model to obtain a word sequence and a tag sequence of the text to be translated, and finally the word sequence and the tag sequence are input into the decoder of the machine translation model to obtain a target translation result. In the text translation method provided by the embodiments of the present disclosure, the encoder of the machine translation model compiles the text to be translated into a word sequence and a tag sequence, and the decoder decodes the word sequence and the tag sequence together to obtain the final translation result, which realizes the translation of dialogue text and improves the accuracy of dialogue text translation.
FIG. 3 is a schematic structural diagram of a text translation apparatus provided by an embodiment of the present disclosure. As shown in FIG. 3, the apparatus includes:

a sentence segmentation module 210, configured to segment at least two sentences of text to be translated using a set segmentation label; a sequence acquisition module 220, configured to input the segmented text to be translated into an encoder of a machine translation model to obtain a word sequence and a tag sequence of the text to be translated; and a target translation result acquisition module 230, configured to input the word sequence and the tag sequence into a decoder of the machine translation model to obtain a target translation result.

Optionally, the text to be translated includes online text, and the sentence segmentation module 210 is configured to:

obtain the current sentence and a set number of forward sentences to form the text to be translated, and segment the current sentence and the set number of forward sentences using the set segmentation label.
Optionally, the apparatus further includes a translation result interception module, configured to, after the target translation result is obtained:

cut out the translation result corresponding to the current sentence from the target translation result.

Optionally, the apparatus further includes a historical translation result acquisition module, configured to:

obtain the historical translation results corresponding to the set number of forward sentences.

Optionally, the sequence acquisition module 220 is further configured to:

input the historical translation results and the segmented text to be translated into the encoder of the machine translation model to obtain the word sequence and the tag sequence of the text to be translated.
Optionally, the apparatus further includes a machine translation model training module, configured to:

obtain the original text and the original translation result of the original text; segment at least two sentences of the original text using the set segmentation label; preprocess the segmented original text according to set rules to obtain training text; add tags to the training text according to the set rules to obtain an original tag sequence; and train the machine translation model based on the training text, the original translation result and the original tag sequence.

Optionally, the machine translation model training module is further configured to:

input the training text into the encoder of the machine translation model to obtain a training tag sequence and a training word sequence; input the training word sequence and the training tag sequence into the decoder of the machine translation model to obtain a training translation result; compute a first loss function from the training tag sequence and the original tag sequence; compute a second loss function from the training translation result and the original translation result; train the encoder based on the first loss function and the second loss function; and train the decoder based on the second loss function.

Optionally, the set rules include at least one of the following: discarding pronouns, discarding punctuation marks, and replacing words with typos.

Optionally, the machine translation model training module is further configured to:

if the original text was preprocessed by discarding a pronoun, tag the word following the pronoun in the training text with a first set value; if the original text was preprocessed by discarding punctuation, tag the punctuation position in the training text with a second set value; if the original text was preprocessed by replacing a word with a typo, tag the typo in the training text with a third set value; and tag words in the training text that were not preprocessed with a fourth set value.
The foregoing apparatus can execute the methods provided by all of the foregoing embodiments of the present disclosure, and has the functional modules and effects corresponding to those methods. For technical details not described in detail in this embodiment, reference may be made to the methods provided by the foregoing embodiments of the present disclosure.
Referring now to FIG. 4, a schematic structural diagram of an electronic device 300 suitable for implementing embodiments of the present disclosure is shown. Electronic devices in embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, personal digital assistants (PDAs), tablet computers (PADs), portable multimedia players (PMPs), and in-vehicle terminals (e.g., in-vehicle navigation terminals); fixed terminals such as digital TVs and desktop computers; and various forms of servers, such as standalone servers or server clusters. The electronic device shown in FIG. 4 is only an example and should not impose any limitation on the functions or scope of use of the embodiments of the present disclosure.
As shown in FIG. 4, the electronic device 300 may include a processing device (e.g., a central processing unit or graphics processor) 301, which may perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 302 or a program loaded from a storage device 308 into a random access memory (RAM) 303. The RAM 303 also stores various programs and data required for the operation of the electronic device 300. The processing device 301, the ROM 302, and the RAM 303 are connected to one another through a bus 304. An input/output (I/O) interface 305 is also connected to the bus 304.
Typically, the following devices may be connected to the I/O interface 305: input devices 306 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, and gyroscope; output devices 307 including, for example, a liquid crystal display (LCD), speaker, and vibrator; storage devices 308 including, for example, magnetic tape and hard disk; and a communication device 309. The communication device 309 may allow the electronic device 300 to communicate wirelessly or by wire with other devices to exchange data. Although FIG. 4 shows the electronic device 300 with various devices, it is not required to implement or provide all of the devices shown; more or fewer devices may alternatively be implemented or provided.
According to embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for executing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication device 309, installed from the storage device 308, or installed from the ROM 302. When the computer program is executed by the processing device 301, the above-described functions defined in the methods of the embodiments of the present disclosure are performed.
The computer-readable medium described above in the present disclosure may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. Examples of computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, RAM, ROM, an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal may take a variety of forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the foregoing. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. The program code contained on a computer-readable medium may be transmitted by any suitable medium, including but not limited to: electric wire, optical cable, radio frequency (RF), or any suitable combination of the above.
In some embodiments, the client and the server may communicate using any currently known or future-developed network protocol, such as the HyperText Transfer Protocol (HTTP), and may be interconnected by digital data communication in any form or medium (e.g., a communication network). Examples of communication networks include a local area network (LAN), a wide area network (WAN), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future-developed network.
The above computer-readable medium may be included in the above electronic device, or it may exist separately without being assembled into the electronic device.
The above computer-readable medium carries one or more programs, and when the one or more programs are executed by the electronic device, they cause the electronic device to: segment at least two sentences of the text to be translated with a set segmentation label; input the segmented text to be translated into the encoder of a machine translation model to obtain a word sequence and a tag sequence of the text to be translated; and input the word sequence and the tag sequence into the decoder of the machine translation model to obtain a target translation result.
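These three steps can be sketched as a pipeline; the `<sep>` label and the toy encoder and decoder below are stand-ins for the trained model components, used only to show the data flow:

```python
SEP = "<sep>"  # assumed segmentation label

def translate(sentences, encoder, decoder):
    """Join the sentences with the segmentation label, encode the result
    into a word sequence plus a tag sequence, then decode both into the
    target translation."""
    segmented = f" {SEP} ".join(sentences)
    word_seq, tag_seq = encoder(segmented)
    return decoder(word_seq, tag_seq)

def toy_encoder(text):
    # Stands in for the real encoder: emits the word sequence and an
    # all-zero tag sequence ("no noise detected" at every position).
    words = text.split()
    return words, [0] * len(words)

def toy_decoder(word_seq, tag_seq):
    # Stands in for the real decoder: "translates" by uppercasing.
    return " ".join(w.upper() for w in word_seq if w != SEP)
```

For example, `translate(["hello world", "bye"], toy_encoder, toy_decoder)` runs both stub components end to end.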
Computer program code for executing the operations of the present disclosure may be written in one or more programming languages or a combination thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a LAN or WAN, or may be connected to an external computer (e.g., through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the possible architectures, functions, and operations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, program segment, or portion of code, which contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may in fact be executed substantially concurrently, or sometimes in the reverse order, depending on the functions involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or by hardware, and the name of a unit does not, in some cases, constitute a limitation on the unit itself.
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), and so on.
In the context of the present disclosure, a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. A machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Examples of machine-readable storage media would include an electrical connection based on one or more wires, a portable computer disk, a hard disk, RAM, ROM, EPROM or flash memory, an optical fiber, a CD-ROM, an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
According to one or more embodiments of the present disclosure, a text translation method is disclosed, including:
segmenting at least two sentences of the text to be translated with a set segmentation label; inputting the segmented text to be translated into the encoder of a machine translation model to obtain a word sequence and a tag sequence of the text to be translated; and inputting the word sequence and the tag sequence into the decoder of the machine translation model to obtain a target translation result.
The text to be translated includes online text, and segmenting at least two sentences of the text to be translated with the set segmentation label includes:
obtaining the current sentence and a set number of preceding sentences to form the text to be translated, and segmenting the current sentence and the set number of preceding sentences with the set segmentation label.
After the target translation result is obtained, the method further includes:
extracting the translation result corresponding to the current sentence from the target translation result.
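One way to realize this sliding window and extraction, assuming the decoder keeps the segmentation labels in its output and using `k` as the set number of preceding sentences (both assumptions, not fixed by the disclosure):

```python
SEP = "<sep>"  # assumed segmentation label

def build_window(previous_sentences, current_sentence, k=2):
    """The text to be translated: the current sentence preceded by the k
    most recent earlier sentences, joined with the segmentation label."""
    return f" {SEP} ".join(previous_sentences[-k:] + [current_sentence])

def extract_current(target_translation):
    """If the decoder keeps the segmentation labels in its output, the
    current sentence's translation is the final segment."""
    return target_translation.split(SEP)[-1].strip()
```

For example, `build_window(["a", "b", "c"], "d")` yields `"b <sep> c <sep> d"`, and `extract_current` then recovers only the last segment of the translated window.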
After the current sentence and the set number of preceding sentences are segmented with the set segmentation label, the method further includes:
obtaining the historical translation results corresponding to the set number of preceding sentences.
Inputting the segmented text to be translated into the encoder of the machine translation model to obtain the word sequence and tag sequence of the text to be translated then includes:
inputting the historical translation results and the segmented text to be translated into the encoder of the machine translation model to obtain the word sequence and tag sequence of the text to be translated.
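A sketch of assembling the encoder input from the target-side history plus the segmented source; the `<ctx>` marker is an assumption, since the disclosure does not fix how the two parts are combined into one input sequence:

```python
SEP = "<sep>"  # assumed segmentation label
CTX = "<ctx>"  # assumed marker separating target-side history from the source

def build_encoder_input(history_translations, segmented_source):
    """Prefix the segmented source text with the historical translations
    of the preceding sentences, so the encoder sees both."""
    if not history_translations:
        return segmented_source
    history = f" {SEP} ".join(history_translations)
    return f"{history} {CTX} {segmented_source}"
```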
The training process of the machine translation model is as follows:
obtain the original text and the original translation result of the original text; segment at least two sentences of the original text with the set segmentation label; preprocess the segmented original text according to set rules to obtain a training text; add tags to the training text according to the set rules to obtain an original tag sequence; and train the machine translation model based on the training text, the original translation result, and the original tag sequence.
Training the machine translation model based on the training text, the original translation result, and the original tag sequence includes:
inputting the training text into the encoder of the machine translation model to obtain a training tag sequence and a training word sequence; inputting the training word sequence and the training tag sequence into the decoder of the machine translation model to obtain a training translation result; computing a first loss function from the training tag sequence and the original tag sequence; computing a second loss function from the training translation result and the original translation result; and training the encoder with the first loss function and the second loss function, and training the decoder with the second loss function.
The set rules include at least one of the following: discarding pronouns, discarding punctuation marks, and replacing words with typos. Adding tags to the training text according to the set rules to obtain the original tag sequence includes:
if the preprocessing performed on the original text is discarding pronouns, the tag added to the word following the pronoun in the training text is a first set value; if the preprocessing performed on the original text is discarding punctuation, the tag added at the punctuation position in the training text is a second set value; if the preprocessing performed on the original text is replacing words with typos, the tag added to the typo in the training text is a third set value; and the tag added to words in the training text that were not preprocessed is a fourth set value.

Claims (10)

  1. A text translation method, comprising:
    segmenting at least two sentences of a text to be translated with a set segmentation label;
    inputting the segmented text to be translated into an encoder of a machine translation model to obtain a word sequence and a tag sequence of the text to be translated; and
    inputting the word sequence and the tag sequence into a decoder of the machine translation model to obtain a target translation result.
  2. The method according to claim 1, wherein the text to be translated comprises online text, and segmenting at least two sentences of the text to be translated with the set segmentation label comprises:
    obtaining a current sentence and a set number of preceding sentences to form the text to be translated; and
    segmenting the current sentence and the set number of preceding sentences with the set segmentation label.
  3. The method according to claim 2, further comprising, after the target translation result is obtained:
    extracting the translation result corresponding to the current sentence from the target translation result.
  4. The method according to claim 2, further comprising, after the current sentence and the set number of preceding sentences are segmented with the set segmentation label:
    obtaining historical translation results corresponding to the set number of preceding sentences;
    wherein inputting the segmented text to be translated into the encoder of the machine translation model to obtain the word sequence and the tag sequence of the text to be translated comprises:
    inputting the historical translation results and the segmented text to be translated into the encoder of the machine translation model to obtain the word sequence and the tag sequence of the text to be translated.
  5. The method according to claim 1, wherein a training process of the machine translation model comprises:
    obtaining an original text and an original translation result of the original text;
    segmenting at least two sentences of the original text with the set segmentation label;
    preprocessing the segmented original text according to set rules to obtain a training text;
    adding tags to the training text according to the set rules to obtain an original tag sequence; and
    training the machine translation model based on the training text, the original translation result, and the original tag sequence.
  6. The method according to claim 5, wherein training the machine translation model based on the training text, the original translation result, and the original tag sequence comprises:
    inputting the training text into the encoder of the machine translation model to obtain a training tag sequence and a training word sequence;
    inputting the training word sequence and the training tag sequence into the decoder of the machine translation model to obtain a training translation result;
    computing a first loss function from the training tag sequence and the original tag sequence;
    computing a second loss function from the training translation result and the original translation result; and
    training the encoder according to the first loss function and the second loss function, and training the decoder according to the second loss function.
  7. The method according to claim 5 or 6, wherein the set rules comprise at least one of the following: discarding pronouns, discarding punctuation marks, and replacing words with typos; and
    adding tags to the training text according to the set rules to obtain the original tag sequence comprises:
    in a case where the preprocessing performed on the original text is discarding pronouns, adding a tag with a first set value to the word following the pronoun in the training text;
    in a case where the preprocessing performed on the original text is discarding punctuation, adding a tag with a second set value at the punctuation position in the training text;
    in a case where the preprocessing performed on the original text is replacing words with typos, adding a tag with a third set value to the typo in the training text; and
    adding a tag with a fourth set value to words in the training text that were not preprocessed.
  8. A text translation apparatus, comprising:
    a sentence segmentation module configured to segment at least two sentences of a text to be translated with a set segmentation label;
    a sequence acquisition module configured to input the segmented text to be translated into an encoder of a machine translation model to obtain a word sequence and a tag sequence of the text to be translated; and
    a target translation result acquisition module configured to input the word sequence and the tag sequence into a decoder of the machine translation model to obtain a target translation result.
  9. An electronic device, comprising:
    at least one processing device; and
    a storage device configured to store at least one instruction,
    wherein the at least one instruction, when executed by the at least one processing device, causes the at least one processing device to implement the text translation method according to any one of claims 1-7.
  10. A computer-readable storage medium storing a computer program, wherein the program, when executed by a processing device, implements the text translation method according to any one of claims 1-7.
PCT/CN2021/131360 2020-12-04 2021-11-18 Text translation method, apparatus and device, and storage medium WO2022116841A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011408602.2A CN112417902A (en) 2020-12-04 2020-12-04 Text translation method, device, equipment and storage medium
CN202011408602.2 2020-12-04

Publications (1)

Publication Number Publication Date
WO2022116841A1 true WO2022116841A1 (en) 2022-06-09

Family

ID=74830332

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/131360 WO2022116841A1 (en) 2020-12-04 2021-11-18 Text translation method, apparatus and device, and storage medium

Country Status (2)

Country Link
CN (1) CN112417902A (en)
WO (1) WO2022116841A1 (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112417902A (en) * 2020-12-04 2021-02-26 北京有竹居网络技术有限公司 Text translation method, device, equipment and storage medium
CN113038184B (en) * 2021-03-01 2023-05-05 北京百度网讯科技有限公司 Data processing method, device, equipment and storage medium
CN112966506A (en) * 2021-03-23 2021-06-15 北京有竹居网络技术有限公司 Text processing method, device, equipment and storage medium
CN113139391B (en) * 2021-04-26 2023-06-06 北京有竹居网络技术有限公司 Translation model training method, device, equipment and storage medium
CN113221576B (en) * 2021-06-01 2023-01-13 复旦大学 Named entity identification method based on sequence-to-sequence architecture
CN116070643B (en) * 2023-04-03 2023-08-15 武昌理工学院 Fixed style translation method and system from ancient text to English

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108549644A (en) * 2018-04-12 2018-09-18 苏州大学 Omission pronominal translation method towards neural machine translation
CN109948166A (en) * 2019-03-25 2019-06-28 腾讯科技(深圳)有限公司 Text interpretation method, device, storage medium and computer equipment
WO2020048195A1 (en) * 2018-09-05 2020-03-12 腾讯科技(深圳)有限公司 Text translation method and apparatus, storage medium and computer device
CN112417902A (en) * 2020-12-04 2021-02-26 北京有竹居网络技术有限公司 Text translation method, device, equipment and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110020440B (en) * 2018-01-09 2023-05-23 深圳市腾讯计算机系统有限公司 Machine translation method, device, server and storage medium
US10963652B2 (en) * 2018-12-11 2021-03-30 Salesforce.Com, Inc. Structured text translation
CN110750959B (en) * 2019-10-28 2022-05-10 腾讯科技(深圳)有限公司 Text information processing method, model training method and related device
CN111160050A (en) * 2019-12-20 2020-05-15 沈阳雅译网络技术有限公司 Chapter-level neural machine translation method based on context memory network
CN111382577B (en) * 2020-03-11 2023-05-02 北京字节跳动网络技术有限公司 Document translation method, device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LONGYUE WANG; ZHAOPENG TU; SHUMING SHI; TONG ZHANG; YVETTE GRAHAM; QUN LIU: "Translating Pro-Drop Languages with Reconstruction Models", ARXIV.ORG, 10 January 2018 (2018-01-10), pages 1 - 10, XP080851838 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116595999A (en) * 2023-07-17 2023-08-15 深圳须弥云图空间科技有限公司 Machine translation model training method and device
CN116595999B (en) * 2023-07-17 2024-04-16 深圳须弥云图空间科技有限公司 Machine translation model training method and device

Also Published As

Publication number Publication date
CN112417902A (en) 2021-02-26

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 21899875; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 21899875; Country of ref document: EP; Kind code of ref document: A1)