WO2022088570A1 - Method and apparatus for post-editing of translation, electronic device, and storage medium - Google Patents

Method and apparatus for post-editing of translation, electronic device, and storage medium

Info

Publication number
WO2022088570A1
Authority
WO
WIPO (PCT)
Prior art keywords
translation
text
sample
post
editing
Prior art date
Application number
PCT/CN2021/078814
Other languages
French (fr)
Chinese (zh)
Inventor
张睦
Original Assignee
语联网(武汉)信息技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 语联网(武汉)信息技术有限公司
Publication of WO2022088570A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

A method and apparatus for post-editing of translation. The method comprises: determining machine translated text to be edited (110); and inputting into a post-editing model the machine translated text and original text corresponding thereto to obtain post-edited translation text output by the post-editing model (120), wherein the post-editing model is obtained by fine-tuning a pre-trained post-editing model on the basis of sample fine-tuning original text, sample fine-tuning post-edited translation text, and sample machine translated text of the sample fine-tuning original text, and the pre-trained post-editing model is trained on the basis of the sample pre-training original text, sample pre-training post-edited translation text, and simulated translation text of the sample pre-training original text. The method and apparatus improve the efficiency and effect of post-editing model training and improve post-editing accuracy by means of pre-training and fine-tuning as well as translation data synthesis by error simulation.

Description

Method and apparatus for post-editing of translation, electronic device, and storage medium
CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims priority to Chinese patent application No. 2020111868691, filed on October 29, 2020 and entitled "Translation post-editing method, apparatus, electronic", which is incorporated herein by reference in its entirety.
TECHNICAL FIELD
The present disclosure relates to the technical field of natural language processing, and in particular to a method and apparatus for post-editing of translation, an electronic device, and a storage medium.
BACKGROUND
Post-editing means that, given an original text to be translated, the corresponding machine translation result is retrieved and a translator then revises and polishes it to improve translation quality. The machine translation result gives the translator a draft to work from, so the translator need not translate from scratch, which reduces the translator's workload.
In practice, when the machine translation result differs greatly from the expected translation, the post-editing workflow forces the translator to make many revisions, which increases rather than reduces the workload. For example, when a machine translation model processes original text from a low-resource, specialized domain, it performs poorly and its output is far from the correct translation. Likewise, when the model mistranslates entity words such as person names, place names, or organization names, or mistranslates numerals, the accuracy of the output is poor. And when the model cannot properly handle long sentences, the output is again insufficiently accurate and requires extensive post-editing. Automatic post-editing models therefore play an increasingly important role in computer-assisted translation. Given the original text to be translated and its machine translation, such a model automatically post-edits the machine translation, corrects translation errors, and outputs the post-edited translation. By narrowing the gap between the output translation and the translation the translator expects, it further reduces the translator's workload.
However, existing methods for training post-editing models require a large number of three-way parallel corpora, that is, triples consisting of the original text, the machine translation, and the post-edited translation. Such triples are hard to obtain and costly to annotate manually, which leads to poor training results and low training efficiency for the post-editing model, and in turn to poor post-editing accuracy.
SUMMARY OF THE INVENTION
(1) Technical problem to be solved
Embodiments of the present disclosure provide a method and apparatus for post-editing of translation, an electronic device, and a storage medium, to overcome the defects of the prior art, namely the poor training results and low training efficiency of the post-editing model and the poor accuracy of post-editing.
(2) Summary of the invention
An embodiment of the present disclosure provides a method for post-editing of translation, comprising:
determining machine-translated text to be edited;
inputting the machine-translated text and its corresponding original text into a post-editing model to obtain post-edited translation text output by the post-editing model;
wherein the post-editing model is obtained by fine-tuning a pre-trained post-editing model on the basis of sample fine-tuning original text, its sample fine-tuning post-edited translation text, and sample machine-translated text of the sample fine-tuning original text; and
the pre-trained post-editing model is trained on the basis of sample pre-training original text, its sample pre-training post-edited translation text, and simulated translation text of the sample pre-training original text.
According to the method of an embodiment of the present disclosure, the sample machine-translated text corresponds to at least one error type among long-sentence translation errors, entity-name translation errors, and domain translation errors.
According to the method of an embodiment of the present disclosure, the sample machine-translated text is determined in at least one of the following ways:
applying a first machine translation model to translate the sample fine-tuning original text to obtain sample machine-translated text of the long-sentence translation error type, wherein the first machine translation model is trained on first sample translation original text and its first sample translation text, the sample fine-tuning original text is a long sentence, and the first sample translation original text is a short sentence;
randomly modifying entity names in the sample fine-tuning post-edited translation text to obtain sample machine-translated text of the entity-name translation error type;
applying a second machine translation model to translate the sample fine-tuning original text to obtain sample machine-translated text of the domain translation error type, wherein the second machine translation model is trained on second sample translation original text, whose domain differs from that of the sample fine-tuning original text, and its second sample translation text.
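The second of these ways, random modification of entity names, can be illustrated with a minimal Python sketch. It assumes the entity names in the post-edited translation have already been identified by some upstream step (for example a named-entity recognizer) and that a pool of replacement names is available. The function name `corrupt_entities` and its parameters are hypothetical and do not come from the disclosure.

```python
import random
import re


def corrupt_entities(text, entities, pool, rng=None):
    """Swap each known entity name in `text` for a different name from `pool`.

    `entities` and `pool` are assumed inputs: the disclosure does not say how
    entity names are detected or where replacement names come from.
    """
    rng = rng or random.Random()
    if not entities:
        return text
    # Longest names first, so "New York City" is matched before "New York".
    ordered = sorted(entities, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(name) for name in ordered))

    def swap(match):
        choices = [name for name in pool if name != match.group(0)]
        # Leave the name unchanged if the pool offers no different candidate.
        return rng.choice(choices) if choices else match.group(0)

    # A single pass, so a freshly inserted name is never replaced again.
    return pattern.sub(swap, text)


reference = "Zhang Wei visited Wuhan in 2020."
noisy = corrupt_entities(reference, ["Zhang Wei", "Wuhan"],
                         ["Li Na", "Beijing", "Shanghai"])
print(noisy)  # the two names are replaced; the rest of the sentence is kept
```

The pool here mixes person and place names for brevity. A closer imitation of real mistranslations might draw replacements of the same entity type, but that is a design choice the source leaves open.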
According to the method of an embodiment of the present disclosure, the pre-trained post-editing model comprises a pre-trained source-language encoder, a pre-trained target-language encoder, and a decoder.
According to the method of an embodiment of the present disclosure, the pre-trained source-language encoder and the pre-trained target-language encoder are each trained on sample monolingual text in the corresponding language and on sample erroneous text obtained by performing conventional error simulation on the sample monolingual text.
According to the method of an embodiment of the present disclosure, the simulated translation text is determined by the following step:
performing conventional error simulation on the sample pre-training original text or the sample pre-training post-edited translation text to obtain the simulated translation text.
According to the method of an embodiment of the present disclosure, performing conventional error simulation specifically comprises:
randomly selecting several text segments in the corresponding text, and performing a deletion, rearrangement, replacement, or transfer operation on the text segments.
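The conventional error simulation described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the disclosed implementation: the text is assumed to be pre-tokenized, "rearrangement" is read as shuffling the tokens inside a segment, "transfer" is read as moving a segment to another position, and the name `simulate_errors` and all of its parameters are hypothetical.

```python
import random


def simulate_errors(tokens, num_spans=2, max_span_len=3,
                    ops=("delete", "reorder", "replace", "move"),
                    vocab=None, rng=None):
    """Return a noisy copy of `tokens`; the input list is not modified."""
    rng = rng or random.Random()
    vocab = vocab or list(tokens)  # assumed source of replacement tokens
    out = list(tokens)
    for _ in range(num_spans):
        if not out:
            break
        # Pick a random segment of 1..max_span_len tokens.
        span_len = rng.randint(1, min(max_span_len, len(out)))
        start = rng.randint(0, len(out) - span_len)
        span = out[start:start + span_len]
        op = rng.choice(ops)
        if op == "delete":
            del out[start:start + span_len]
        elif op == "reorder":
            rng.shuffle(span)
            out[start:start + span_len] = span
        elif op == "replace":
            out[start:start + span_len] = [rng.choice(vocab) for _ in span]
        else:  # "move": cut the segment and paste it at a random position
            del out[start:start + span_len]
            pos = rng.randint(0, len(out))
            out[pos:pos] = span
    return out


sentence = "the quick brown fox jumps over the lazy dog".split()
print(" ".join(simulate_errors(sentence, rng=random.Random(0))))
```

Passing a seeded `random.Random` makes the noise reproducible, which helps when the same synthetic corpus has to be regenerated.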
An embodiment of the present disclosure further provides an apparatus for post-editing of translation, comprising:
a translation determination unit, configured to determine machine-translated text to be edited; and
a post-editing unit, configured to input the machine-translated text and its corresponding original text into a post-editing model to obtain post-edited translation text output by the post-editing model;
wherein the post-editing model is obtained by fine-tuning a pre-trained post-editing model on the basis of sample fine-tuning original text, its sample fine-tuning post-edited translation text, and sample machine-translated text of the sample fine-tuning original text; and
the pre-trained post-editing model is trained on the basis of sample pre-training original text, its sample pre-training post-edited translation text, and simulated translation text of the sample pre-training original text.
An embodiment of the present disclosure further provides an electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of any of the post-editing methods described above.
An embodiment of the present disclosure further provides a non-transitory computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of any of the post-editing methods described above.
(3) Beneficial effects
In the method and apparatus for post-editing of translation, the electronic device, and the storage medium provided by the embodiments of the present disclosure, a pre-trained post-editing model is trained on sample pre-training original text, its sample pre-training post-edited translation text, and simulated translation text of the sample pre-training original text. The pre-trained post-editing model is then fine-tuned on sample fine-tuning original text, its sample fine-tuning post-edited translation text, and sample machine-translated text of the sample fine-tuning original text to obtain the post-editing model. Pre-training followed by fine-tuning, together with the synthesis of translation data by error simulation, improves the training efficiency and training effect of the post-editing model and improves post-editing accuracy.
BRIEF DESCRIPTION OF THE DRAWINGS
To describe the technical solutions in the embodiments of the present application or in the prior art more clearly, the accompanying drawings needed for the description are briefly introduced below. Evidently, the drawings described below show some embodiments of the present application, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic flowchart of a method for post-editing of translation provided by an embodiment of the present disclosure;
FIG. 2 is a schematic flowchart of a method for training a post-editing model provided by an embodiment of the present disclosure;
FIG. 3 is a schematic structural diagram of an apparatus for post-editing of translation provided by an embodiment of the present disclosure;
FIG. 4 is a schematic structural diagram of an electronic device provided by an embodiment of the present disclosure.
DETAILED DESCRIPTION
To make the objectives, technical solutions, and advantages of the embodiments of the present disclosure clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. Evidently, the described embodiments are some rather than all of the embodiments of the present disclosure. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments of the present disclosure without creative effort fall within the protection scope of the present disclosure.
Post-editing means that, given an original text to be translated, the corresponding machine translation result is retrieved and a translator then revises and polishes it to improve translation quality. The machine translation result gives the translator a draft to work from, so the translator need not translate from scratch, which reduces the translator's workload. However, when the machine translation result differs greatly from the expected translation, the post-editing workflow forces the translator to make many revisions, which increases rather than reduces the workload. For example, when a machine translation model processes original text from a low-resource, specialized domain, when it mistranslates entity words such as person names, place names, or organization names, when it mistranslates numerals, or when it cannot properly handle long sentences, translation quality is poor, the machine translation result is far from the correct translation, and extensive post-editing is required. Automatic post-editing models therefore play an increasingly important role in computer-assisted translation.
However, existing methods for training post-editing models require a large number of three-way parallel corpora, that is, triples consisting of the original text, the machine translation, and the post-edited translation. Such triples are hard to obtain and costly to annotate manually, which leads to poor training results and low training efficiency for the post-editing model, and in turn to poor post-editing accuracy.
To address this, an embodiment of the present disclosure provides a method for post-editing of translation. FIG. 1 is a schematic flowchart of the method. As shown in FIG. 1, the method comprises:
Step 110: determining machine-translated text to be edited;
Step 120: inputting the machine-translated text and its corresponding original text into a post-editing model to obtain post-edited translation text output by the post-editing model;
wherein the post-editing model is obtained by fine-tuning a pre-trained post-editing model on the basis of sample fine-tuning original text, its sample fine-tuning post-edited translation text, and sample machine-translated text of the sample fine-tuning original text; and
the pre-trained post-editing model is trained on the basis of sample pre-training original text, its sample pre-training post-edited translation text, and simulated translation text of the sample pre-training original text.
Specifically, the machine-translated text corresponding to the original text is obtained so that the post-editing model can post-edit it automatically. The machine-translated text may be obtained by inputting the original text into a machine translation model.
The machine-translated text and its corresponding original text are then input into the post-editing model. Based on the semantic information of the original text and of the machine-translated text, the post-editing model corrects errors in the machine-translated text and produces the corrected post-edited translation text. The post-edited translation text is in the same language as the machine-translated text.
The post-editing model is obtained by fine-tuning a pre-trained post-editing model on the basis of sample fine-tuning original text, its sample fine-tuning post-edited translation text, and sample machine-translated text of the sample fine-tuning original text. The pre-trained post-editing model is in turn trained on the basis of sample pre-training original text, its sample pre-training post-edited translation text, and simulated translation text of the sample pre-training original text.
Here, the post-editing model is trained by pre-training followed by fine-tuning. FIG. 2 is a schematic flowchart of a method for training the post-editing model provided by an embodiment of the present disclosure. As shown in FIG. 2, the training method comprises:
Step 210: training an initial model on the basis of sample pre-training original text, its sample pre-training post-edited translation text, and simulated translation text of the sample pre-training original text to obtain the pre-trained post-editing model;
Step 220: fine-tuning the pre-trained post-editing model on the basis of sample fine-tuning original text, its sample fine-tuning post-edited translation text, and sample machine-translated text of the sample fine-tuning original text to obtain the post-editing model.
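The ordering of steps 210 and 220 can be summarized in a short Python skeleton. It deliberately leaves the model and the optimization routine abstract: `fit` stands for any training procedure supplied by the caller, and every name below is a hypothetical placeholder rather than part of the disclosure.

```python
from typing import Callable, Sequence, Tuple, TypeVar

Model = TypeVar("Model")
# (original text, machine or simulated translation, post-edited translation)
Triple = Tuple[str, str, str]


def train_post_editing_model(
    initial_model: Model,
    pretraining_triples: Sequence[Triple],
    finetuning_triples: Sequence[Triple],
    fit: Callable[[Model, Sequence[Triple]], Model],
) -> Model:
    """Run the two training stages in order and return the final model."""
    # Step 210: pre-train on triples whose middle element is simulated text.
    pretrained = fit(initial_model, pretraining_triples)
    # Step 220: fine-tune on triples whose middle element is sample
    # machine-translated text carrying realistic translation errors.
    return fit(pretrained, finetuning_triples)


# Toy usage: the "model" is just a log of how many triples each stage saw.
log = train_post_editing_model(
    [],
    [("src a", "noisy a", "ref a"), ("src b", "noisy b", "ref b")],
    [("src c", "mt c", "ref c")],
    fit=lambda model, data: model + [len(data)],
)
print(log)  # [2, 1]
```

Both stages consume triples of the same shape, so a single training routine can serve for pre-training and fine-tuning; only the origin of the middle element differs.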
First, the initial model is pre-trained on a large amount of sample pre-training original text, its sample pre-training post-edited translation text, and simulated translation text, yielding the pre-trained post-editing model. The sample pre-training original text and its post-edited translation text can be obtained by downloading public bilingual parallel corpora from the Internet, for example United Nations official documents and the Chinese-English parallel corpora released by the Conference on Machine Translation (WMT). Error simulation can then be performed on the bilingual parallel corpora to obtain simulated translation text of the sample pre-training original text, which imitates machine translation output. Because pre-training requires only bilingual parallel corpora, with simulated translation text resembling machine translation synthesized by error simulation, training data becomes much easier to obtain and the cost of manually annotating post-edited translations is avoided, which makes the overall training process more efficient and less difficult.
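As a rough Python illustration of how pre-training triples might be synthesized from a bilingual parallel corpus, the sketch below keeps each reference translation as the post-edited target and derives the simulated translation by applying a noise function to it. The noise function shown, which drops one random word, is a deliberately simple stand-in for the error simulation, and the names `Triple`, `drop_random_word`, and `build_pretraining_triples` are hypothetical.

```python
import random
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class Triple:
    source: str     # sample pre-training original text
    simulated: str  # simulated translation text (synthetic "MT output")
    post_edit: str  # sample pre-training post-edited translation text


def drop_random_word(sentence: str, rng: random.Random) -> str:
    """Toy noise function: delete one whitespace-separated word."""
    words = sentence.split()
    if len(words) > 1:
        del words[rng.randrange(len(words))]
    return " ".join(words)


def build_pretraining_triples(
    pairs: Iterable[Tuple[str, str]],
    noise: Callable[[str, random.Random], str] = drop_random_word,
    seed: int = 0,
) -> List[Triple]:
    """Turn (source, reference) pairs into (source, noisy, reference) triples."""
    rng = random.Random(seed)
    return [Triple(src, noise(ref, rng), ref) for src, ref in pairs]


corpus = [("今天下雨。", "It is raining today.")]
for triple in build_pretraining_triples(corpus):
    print(triple)
```

No human post-editing is involved here: the reference side of the parallel corpus plays the role of the post-edited translation, which is what removes the manual annotation cost mentioned above.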
In addition, during training on the sample pre-training original text, its sample pre-training post-edited translation text, and the simulated translation text of the sample pre-training original text, the pre-trained post-editing model learns the textual errors that may appear in a translation, such as repeated words, reversed word order, and missing words, and learns how to correct such errors in the translation text according to the original text so as to obtain correct post-edited translation text.
To further improve post-editing accuracy and better accomplish the post-editing task, the pre-trained post-editing model can be fine-tuned on sample fine-tuning original text, its sample fine-tuning post-edited translation text, and sample machine-translated text of the sample fine-tuning original text, yielding the post-editing model. The sample fine-tuning original text and its post-edited translation text can likewise be obtained from bilingual parallel corpora. Here, to improve the accuracy of fine-tuning, bilingual parallel corpora produced in a translation production environment can be used. Each entry in such a corpus includes the original text to be translated and the high-quality translation text produced after human translation and review. From these corpora, the sample fine-tuning original text and high-quality sample fine-tuning post-edited translation text are obtained. The sample machine-translated text, in turn, contains the translation errors that arise in real machine translation in post-editing scenarios because of the limitations of machine translation models. Fine-tuning on the sample fine-tuning original text, its sample fine-tuning post-edited translation text, and the sample machine-translated text lets the post-editing model learn, beyond conventional textual errors, the translation errors that may occur in machine translation, which strengthens its ability to locate and correct errors in post-editing scenarios and further improves post-editing accuracy. Moreover, because fine-tuning needs less data than pre-training, fewer <original text, machine translation, post-edited translation> triples have to be obtained, which further reduces the difficulty of model training and improves training efficiency.
In the method provided by the embodiments of the present disclosure, a pre-trained post-editing model is trained on sample pre-training original text, its sample pre-training post-edited translation text, and simulated translation text of the sample pre-training original text. The pre-trained post-editing model is then fine-tuned on sample fine-tuning original text, its sample fine-tuning post-edited translation text, and sample machine-translated text of the sample fine-tuning original text to obtain the post-editing model. Pre-training followed by fine-tuning, together with the synthesis of translation data by error simulation, improves the training efficiency and training effect of the post-editing model and improves post-editing accuracy.
Based on the above embodiment, the sample machine-translated text corresponds to at least one error type among long-sentence translation errors, entity-name translation errors, and domain translation errors.
Specifically, so that during fine-tuning the post-editing model learns the translation errors that arise in real machine translation in post-editing scenarios because of the limitations of machine translation models, sample machine-translated text containing such errors can be obtained. Typical translation errors include long-sentence translation errors, entity-name translation errors, and domain translation errors. A long-sentence translation error occurs when the machine translation model cannot properly handle a long sentence. An entity-name translation error occurs when the model translates entity words, such as person names, place names, or organization names, or translates numerals. A domain translation error occurs when the model processes original text from a low-resource, specialized domain and the domain the model is suited to differs from the domain of the original text. The sample machine-translated text obtained may therefore correspond to at least one error type among long-sentence translation errors, entity-name translation errors, and domain translation errors.
Based on any of the above embodiments, the sample machine-translated text is determined in at least one of the following ways:
applying a first machine translation model to translate the sample fine-tuning original text to obtain sample machine-translated text of the long-sentence translation error type, wherein the first machine translation model is trained on first sample translation original text and its first sample translation text, the sample fine-tuning original text is a long sentence, and the first sample translation original text is a short sentence;
randomly modifying entity names in the sample fine-tuning post-edited translation text to obtain sample machine-translated text of the entity-name translation error type;
applying a second machine translation model to translate the sample fine-tuning original text to obtain sample machine-translated text of the domain translation error type, wherein the second machine translation model is trained on second sample translation original text, whose domain differs from that of the sample fine-tuning original text, and its second sample translation text.
Specifically, for long-sentence translation errors, the first machine translation model may be trained on the first sample translation source text and its first sample translation target text, and may be built on a single Transformer model. The first sample translation source text and its target text may be bilingual parallel corpora downloaded from the Internet. Here, the first sample translation source text is short, for example containing only one sentence. Because the first machine translation model is trained on short sentences, it is good only at translating short sentences; when long text is fed to it, the resulting translation is prone to long-sentence translation errors. Long text, for example text containing two or more sentences, is therefore selected as the sample fine-tuning source text and fed to the first machine translation model, yielding sample machine-translated text of the long-sentence translation error type.
For entity-name translation errors, an entity recognition tool such as spaCy may be used to perform entity recognition on the sample fine-tuning post-edited translation text, for example on the English text of bilingual parallel corpora produced in a translation production environment. Post-edited translation fragments that contain entities such as person names, place names, organization names, and numbers are selected and randomly modified, for example by deletion or replacement, to obtain sample machine-translated text of the entity-name translation error type.
For domain translation errors, the second machine translation model may be trained on the second sample translation source text and its second sample translation target text, and may be built on a single Transformer model. The domain of the second sample translation source text and target text differs from the domain of the sample fine-tuning source text. For example, a high-quality but narrow-domain bilingual parallel corpus of United Nations official documents may be downloaded from the Internet and used as the second sample translation source text and target text. The second machine translation model trained in this way is good only at translating text from the domain of its training data, so when source text from a different domain is fed to it, the resulting translation is prone to domain translation errors. The translation that the second machine translation model produces for the sample fine-tuning source text can therefore serve as sample machine-translated text of the domain translation error type.
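The entity-name corruption step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the disclosure: the entity spans are passed in already recognized (in practice they would come from a tool such as spaCy), spans are assumed not to overlap, and the function name `corrupt_entities` and the `replacements` table are hypothetical.

```python
import random

def corrupt_entities(tokens, entity_spans, replacements, rng):
    """Randomly delete or replace each entity span in a tokenized translation.

    entity_spans: list of (start, end, label) with end exclusive, non-overlapping.
    replacements: dict mapping a label to a list of candidate token lists.
    """
    out = []
    cursor = 0
    for start, end, label in sorted(entity_spans):
        out.extend(tokens[cursor:start])          # copy text before the entity
        original = tokens[start:end]
        # Only keep candidates that actually differ from the original entity.
        candidates = [c for c in replacements.get(label, []) if c != original]
        if candidates and rng.random() < 0.5:
            out.extend(rng.choice(candidates))    # replacement
        # otherwise the entity is deleted (nothing is appended)
        cursor = end
    out.extend(tokens[cursor:])                   # copy the remaining text
    return out

reference = ["Alice", "visited", "Paris", "in", "2019", "."]
spans = [(0, 1, "PERSON"), (2, 3, "GPE"), (4, 5, "DATE")]
pool = {"PERSON": [["Bob"]], "GPE": [["Berlin"]], "DATE": [["2021"]]}
corrupted = corrupt_entities(reference, spans, pool, random.Random(0))
```

Together with the source text, `corrupted` and `reference` would form one fine-tuning triple of the entity-name error type.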
The method provided by this embodiment of the present disclosure uses different data synthesis approaches to efficiently generate sample machine-translated text for three different translation error types. This removes the data annotation step from fine-tuning and can further improve the training efficiency of the post-editing model.
Based on any of the above embodiments, the pre-trained post-editing model includes a pre-trained source-language encoder, a pre-trained target-language encoder, and a decoder.
Specifically, the pre-trained post-editing model may include two encoders, a source-language encoder and a target-language encoder, which encode the source text and the machine-translated text respectively, and one decoder, which decodes from the encoding of the source text and the encoding of the machine-translated text to correct the errors in the machine-translated text and produce the post-edited translation text. The source-language encoder, the target-language encoder, and the decoder may each be built on a single Transformer model. Here, the two encoders may themselves be obtained through pre-training, which improves the pre-training efficiency of the pre-trained post-editing model and thus further improves the overall training efficiency of the post-editing model.
In the method provided by this embodiment of the present disclosure, the pre-trained post-editing model is built jointly from a pre-trained source-language encoder, a pre-trained target-language encoder, and a decoder, which further improves the overall training efficiency of the post-editing model.
Based on any of the above embodiments, the pre-trained source-language encoder and the pre-trained target-language encoder are trained on sample monolingual text in the corresponding language and on sample error text obtained by performing conventional error simulation on the sample monolingual text.
Specifically, so that the source-language encoder and the target-language encoder learn to extract correct semantic information from erroneous text, and thus produce source and target encodings that carry correct semantic information and are more expressive, the two encoders may be trained on sample monolingual text in the corresponding language, its corresponding sample error text, and a word-vector model for that language. For example, if the source language is Chinese and the target language is English, the source-language encoder may be pre-trained on Chinese sample monolingual text, its corresponding sample error text, and a Chinese word-vector model, and the target-language encoder may be pre-trained on English sample monolingual text, its corresponding sample error text, and an English word-vector model. The sample monolingual text may be obtained by collecting a large amount of monolingual data, for example by downloading public Chinese monolingual corpora, such as Chinese Wikipedia and news corpora, and public English corpora, such as English Wikipedia and news corpora, from the Internet. To make the training data easier to obtain, a portion of the monolingual corpus, for example 20%, may be randomly selected, and conventional error simulation performed on the selected entries, that is, on the sample monolingual text, to obtain sample error text containing conventional text errors.
In the method provided by this embodiment of the present disclosure, the source-language encoder and the target-language encoder are pre-trained on sample monolingual text in the corresponding language and on sample error text obtained by performing conventional error simulation on that text. The encoders can therefore produce source and target encodings that carry correct semantic information, which improves the expressiveness of the encodings.
Based on any of the above embodiments, the simulated translation text is determined by the following step:
performing conventional error simulation on the sample pre-training source text or on the sample pre-training post-edited translation text, to obtain the simulated translation text.
Specifically, a portion of the bilingual parallel corpus, for example 10%, may be randomly selected, and conventional error simulation performed on the sample pre-training source text of each selected entry to obtain simulated translation text containing conventional text errors. The sample pre-training post-edited translation text of that entry, the simulated translation text, and the sample pre-training source text together form one training example for the pre-trained post-editing model. A further portion of the bilingual parallel corpus, for example 10%, may also be randomly selected, and conventional error simulation performed on the sample pre-training post-edited translation text of each selected entry to obtain simulated translation text containing conventional text errors. The sample pre-training source text of that entry, the simulated translation text, and the sample pre-training post-edited translation text together form one training example for the pre-trained post-editing model.
Based on any of the above embodiments, performing conventional error simulation specifically includes:
randomly selecting several text fragments in the corresponding text, and performing a deletion, rearrangement, replacement, or transfer operation on each text fragment.
Specifically, conventional text errors include missing words, reversed word order, wrong words, and repetition. When performing conventional error simulation, several text fragments in the corresponding text may therefore be selected at random, and a deletion, rearrangement, replacement, or transfer operation applied to each. Deletion removes the whole fragment; rearrangement reverses the order of the words within the fragment; replacement overwrites the fragment with a fragment taken from another position in the original text; transfer swaps the fragment with a fragment at another position in the original text. For example, conventional error simulation may be performed as shown in the following table:
Original text: 今天天气真好。 ("The weather is really nice today.")
Deletion: 今天天DEL DEL好。 (the fragment 气真 is deleted)
Rearrangement: 今天天真气好。 (the fragment 气真 is reversed to 真气)
Replacement: 今天天今天好。 (the fragment 气真 is overwritten with 今天, taken from the start of the sentence)
Transfer: 今气真天天好。 (the fragments 气真 and 天天 swap positions)
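The four operations in the table can be sketched in Python as below. This is an illustrative sketch only: the function names, the fixed fragment length, and the way fragment positions are sampled are assumptions, and the deletion here removes the tokens outright instead of leaving DEL markers as the table does.

```python
import random

def corrupt(tokens, op, start, length, other_start=None):
    """Apply one conventional-error operation to tokens[start:start+length]."""
    tokens = list(tokens)
    end = start + length
    span = tokens[start:end]
    if op == "delete":
        # Remove the whole fragment.
        return tokens[:start] + tokens[end:]
    if op == "rearrange":
        # Reverse the order of the tokens inside the fragment.
        return tokens[:start] + span[::-1] + tokens[end:]
    other = tokens[other_start:other_start + length]
    if op == "replace":
        # Overwrite the fragment with a fragment from another position.
        return tokens[:start] + other + tokens[end:]
    if op == "transfer":
        # Swap the fragment with a non-overlapping fragment of equal length.
        tokens[start:end] = other
        tokens[other_start:other_start + length] = span
        return tokens
    raise ValueError(f"unknown operation: {op}")

def random_corrupt(tokens, rng, length=2):
    """Pick an operation and fragment positions at random (len(tokens) >= length)."""
    op = rng.choice(["delete", "rearrange", "replace", "transfer"])
    start = rng.randrange(0, len(tokens) - length + 1)
    # Start positions of fragments that do not overlap the chosen one.
    others = [i for i in range(len(tokens) - length + 1) if abs(i - start) >= length]
    if not others and op in ("replace", "transfer"):
        op = "rearrange"  # fall back when no second fragment exists
    other_start = rng.choice(others) if others else None
    return op, corrupt(tokens, op, start, length, other_start)

sentence = list("今天天气真好。")
rearranged = "".join(corrupt(sentence, "rearrange", 3, 2))
replaced = "".join(corrupt(sentence, "replace", 3, 2, other_start=0))
transferred = "".join(corrupt(sentence, "transfer", 3, 2, other_start=1))
```

With the fragment 气真 at positions 3 to 4, the three calls at the end reproduce the rearrangement, replacement, and transfer rows of the table.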
Based on any of the above embodiments, a further embodiment of the present disclosure provides a method for building a post-editing model. The method includes:
First, the corpus data required for model training are collected, as follows:
Bilingual parallel corpora produced in a translation production environment are accumulated and denoted bilingual parallel corpus C. Each entry consists of a source text to be translated and the high-quality translation produced after human translation and review.
Public bilingual parallel corpora, such as the United Nations and WMT bilingual parallel corpora, are downloaded from the Internet and denoted bilingual parallel corpus T.
Public monolingual corpora in the source language, such as Chinese Wikipedia and news corpora, are downloaded from the Internet and denoted monolingual corpus Z.
Public monolingual corpora in the target language, such as English Wikipedia and news corpora, are downloaded from the Internet and denoted monolingual corpus E.
All corpora are tokenized. English corpora may be tokenized with the spaCy tool. Chinese corpora may be tokenized at the character level using grammar rules: each individual Chinese character, each run of consecutive digits or English letters, and each punctuation mark is treated as a separate token. A language identifier is then prepended to each corpus entry, as shown in the following table:
[Table images PCTCN2021078814-appb-000001 and PCTCN2021078814-appb-000002: corpus entries with language identifiers; the image content is not reproduced in this text.]
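The character-level Chinese tokenization and the language identifier can be sketched as follows. This is a rough sketch under assumptions: runs of ASCII letters and runs of digits are grouped separately, every other non-space character becomes its own token, and the `<zh>` tag format is hypothetical because the table showing the actual identifiers is only available as an image.

```python
import re

# A token is a run of ASCII letters, a run of digits, or any single
# non-space character (a Chinese character or a punctuation mark).
_TOKEN = re.compile(r"[A-Za-z]+|[0-9]+|\S")

def tokenize_zh(text):
    """Split Chinese text into character-level tokens."""
    return _TOKEN.findall(text)

def add_language_tag(tokens, lang):
    """Prepend a language identifier token (tag format is an assumption)."""
    return [f"<{lang}>"] + tokens

tokens = tokenize_zh("今天买了3台iPhone 12。")
tagged = add_language_tag(tokens, "zh")
```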
Based on the tokenized corpus data, word vectors are trained separately for the source language and the target language using the Skip-Gram algorithm. The word-vector dimension may be set to 300 and the context window to 5.
20% of the entries in Z are randomly sampled and subjected to conventional error simulation, synthesizing a parallel corpus that pairs each possibly corrupted entry with its original entry. Together with the source-language word-vector model, this corpus is used to pre-train the source-language encoder of a standard Transformer model.
20% of the entries in E are randomly sampled and subjected to conventional error simulation, synthesizing a parallel corpus that pairs each possibly corrupted entry with its original entry. Together with the target-language word-vector model, this corpus is used to pre-train the target-language encoder of a standard Transformer model.
10% of the entries in T are randomly sampled, and conventional error simulation is performed on the source side of each, producing a triple (possibly corrupted source text, original translation, original source text). Another 10% of the entries in T are likewise randomly sampled, and conventional error simulation is performed on the translation side of each, producing a triple (original source text, possibly corrupted translation, original translation). The synthesized triple parallel corpus is used to pre-train a model with two Transformer encoders and a single Transformer decoder, yielding the pre-trained post-editing model. The two Transformer encoders are the source-language encoder and the target-language encoder.
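The construction of the pre-training triples from T might be sketched as follows. The element order of each triple follows the text above; the function name `build_pretraining_triples` and its parameters are hypothetical, and `corrupt` stands for any conventional error simulation routine supplied by the caller.

```python
import random

def build_pretraining_triples(pairs, corrupt, rng, fraction=0.1):
    """Build triples from (source, translation) pairs by corrupting one side."""
    k = max(1, int(len(pairs) * fraction))
    triples = []
    # Source-side corruption: (corrupted source, original translation, original source).
    for src, tgt in rng.sample(pairs, k):
        triples.append((corrupt(src), tgt, src))
    # Translation-side corruption: (original source, corrupted translation, original translation).
    for src, tgt in rng.sample(pairs, k):
        triples.append((src, corrupt(tgt), tgt))
    return triples

# Toy corpus and a toy corruption that drops the last character.
T = [(f"src{i}。", f"tgt{i}.") for i in range(20)]
triples = build_pretraining_triples(T, lambda s: s[:-1], random.Random(0))
```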
Next, the training data for the fine-tuning task are obtained, as follows:
a) The source side of C is split into sentences using a rule-based Chinese sentence segmentation method, and the bilingual parallel entries whose source side contains two or more sentences are selected to form a subset C1. Likewise, the source side of T is split into sentences, and the entries whose source side contains exactly one sentence are selected to form another subset T1. A Transformer-based machine translation engine is built from corpus T1. The source side of C1 is then fed to this model and decoded to produce machine translations, yielding triples (C1 source text, machine translation, C1 translation).
b) The spaCy tool is used to perform entity recognition on the translation side of C, and the bilingual parallel entries containing entities such as person names, place names, organization names, and numbers are selected to form C2. The entity nouns in the translation side of C2 are randomly modified, for example by deletion or replacement, yielding triples (C2 source text, translation with corrupted entity nouns, C2 translation).
c) The United Nations bilingual parallel corpus is selected from T and used to build a Transformer-based machine translation engine. A subset C3 is drawn from C, and the source side of C3 is fed to this model and decoded to produce machine translations, yielding triples (C3 source text, machine translation, C3 translation).
The triples produced in a), b), and c) are combined to form the complete training data for the fine-tuning task, and the pre-trained post-editing model is fine-tuned on them to obtain the final post-editing model.
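The sentence-count filtering in step a) and the assembly of triples can be sketched as below. This is a simplified sketch: the segmentation rule splits only on 。, ! and ? (full-width or ASCII), which is an assumption about the rule-based method, and `translate` is a hypothetical callable standing in for the machine translation engine trained on T1.

```python
import re

# A sentence is text up to and including a run of terminators, or a
# trailing piece of text that has no terminator.
_SENTENCE = re.compile(r"[^。!?!?]*[。!?!?]+|[^。!?!?]+$")

def split_sentences_zh(text):
    """Split Chinese text into sentences with a simple punctuation rule."""
    return [s.strip() for s in _SENTENCE.findall(text) if s.strip()]

def select_by_sentence_count(pairs, keep):
    """Keep (source, translation) pairs whose source sentence count satisfies keep."""
    return [(src, tgt) for src, tgt in pairs if keep(len(split_sentences_zh(src)))]

def make_triples(pairs, translate):
    """Build (source, machine translation, reference translation) triples."""
    return [(src, translate(src), tgt) for src, tgt in pairs]

C = [("今天下雨。我们在家。", "It rains today. We stay home."), ("你好。", "Hello.")]
C1 = select_by_sentence_count(C, lambda n: n >= 2)   # long-text subset
T1 = select_by_sentence_count(C, lambda n: n == 1)   # single-sentence subset
triples = make_triples(C1, lambda src: "MT:" + src)  # stub engine for illustration
```

The triples from steps b) and c) have the same shape, so the three lists could simply be concatenated to form the fine-tuning set.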
The translation post-editing apparatus provided by the embodiments of the present disclosure is described below. The apparatus described below and the translation post-editing method described above may be read with reference to each other.
Based on any of the above embodiments, FIG. 3 is a schematic structural diagram of the translation post-editing apparatus provided by an embodiment of the present disclosure. As shown in FIG. 3, the apparatus includes a translation determining unit 310 and a post-editing unit 320.
The translation determining unit 310 is configured to determine the machine-translated text to be edited.
The post-editing unit 320 is configured to input the machine-translated text and its corresponding source text into the post-editing model, and to obtain the post-edited translation text output by the post-editing model.
The post-editing model is obtained by fine-tuning a pre-trained post-editing model on sample fine-tuning source text, its sample fine-tuning post-edited translation text, and the sample machine-translated text of the sample fine-tuning source text.
The pre-trained post-editing model is trained on sample pre-training source text, its sample pre-training post-edited translation text, and the simulated translation text of the sample pre-training source text.
In the apparatus provided by this embodiment of the present disclosure, the pre-trained post-editing model is trained on sample pre-training source text, its sample pre-training post-edited translation text, and the simulated translation text of the sample pre-training source text; the post-editing model is then obtained by fine-tuning the pre-trained post-editing model on sample fine-tuning source text, its sample fine-tuning post-edited translation text, and the sample machine-translated text of the sample fine-tuning source text. Combining pre-training with fine-tuning, and synthesizing translation data through error simulation, improves the training efficiency and training quality of the post-editing model and the accuracy of post-editing.
Based on any of the above embodiments, the sample machine-translated text corresponds to at least one of the following error types: long-sentence translation errors, entity-name translation errors, and domain translation errors.
Based on any of the above embodiments, the sample machine-translated text is determined in at least one of the following ways:
applying a first machine translation model to translate the sample fine-tuning source text, to obtain sample machine-translated text of the long-sentence translation error type, where the first machine translation model is trained on first sample translation source text and its first sample translation target text, the sample fine-tuning source text consists of long sentences, and the first sample translation source text consists of short sentences;
randomly modifying the entity names in the sample fine-tuning post-edited translation text, to obtain sample machine-translated text of the entity-name translation error type;
applying a second machine translation model to translate the sample fine-tuning source text, to obtain sample machine-translated text of the domain translation error type, where the second machine translation model is trained on second sample translation source text, whose domain differs from that of the sample fine-tuning source text, and on its second sample translation target text.
The apparatus provided by this embodiment of the present disclosure uses different data synthesis approaches to efficiently generate sample machine-translated text for three different translation error types. This removes the data annotation step from fine-tuning and can further improve the training efficiency of the post-editing model.
Based on any of the above embodiments, the pre-trained post-editing model includes a pre-trained source-language encoder, a pre-trained target-language encoder, and a decoder.
In the apparatus provided by this embodiment of the present disclosure, the pre-trained post-editing model is built jointly from a pre-trained source-language encoder, a pre-trained target-language encoder, and a decoder, which further improves the overall training efficiency of the post-editing model.
Based on any of the above embodiments, the pre-trained source-language encoder and the pre-trained target-language encoder are trained on sample monolingual text in the corresponding language and on sample error text obtained by performing conventional error simulation on the sample monolingual text.
In the apparatus provided by this embodiment of the present disclosure, the source-language encoder and the target-language encoder are pre-trained on sample monolingual text in the corresponding language and on sample error text obtained by performing conventional error simulation on that text. The encoders can therefore produce source and target encodings that carry correct semantic information, which improves the expressiveness of the encodings.
Based on any of the above embodiments, the simulated translation text is determined by the following step:
performing conventional error simulation on the sample pre-training source text or on the sample pre-training post-edited translation text, to obtain the simulated translation text.
Based on any of the above embodiments, the apparatus further includes a conventional error simulation unit, configured to:
randomly select several text fragments in the corresponding text, and perform a deletion, rearrangement, replacement, or transfer operation on each text fragment.
FIG. 4 is a schematic diagram of the physical structure of an electronic device. As shown in FIG. 4, the electronic device may include a processor 410, a communications interface 420, a memory 430, and a communication bus 440, where the processor 410, the communications interface 420, and the memory 430 communicate with one another through the communication bus 440. The processor 410 may invoke logic instructions in the memory 430 to execute the translation post-editing method, which includes: determining the machine-translated text to be edited; and inputting the machine-translated text and its corresponding source text into the post-editing model, to obtain the post-edited translation text output by the post-editing model. The post-editing model is obtained by fine-tuning a pre-trained post-editing model on sample fine-tuning source text, its sample fine-tuning post-edited translation text, and the sample machine-translated text of the sample fine-tuning source text. The pre-trained post-editing model is trained on sample pre-training source text, its sample pre-training post-edited translation text, and the simulated translation text of the sample pre-training source text.
In addition, when the logic instructions in the memory 430 are implemented as software functional units and sold or used as a standalone product, they may be stored in a computer-readable storage medium. On this understanding, the technical solution of the present disclosure, in essence, or the part of it that contributes to the prior art, or a part of the technical solution, may be embodied as a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods described in the embodiments of the present disclosure. The storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc.
In another aspect, an embodiment of the present disclosure further provides a computer program product. The computer program product includes a computer program stored on a non-transitory computer-readable storage medium, and the computer program includes program instructions. When the program instructions are executed by a computer, the computer can execute the translation post-editing method provided by the above method embodiments, which includes: determining the machine-translated text to be edited; and inputting the machine-translated text and its corresponding source text into the post-editing model, to obtain the post-edited translation text output by the post-editing model. The post-editing model is obtained by fine-tuning a pre-trained post-editing model on sample fine-tuning source text, its sample fine-tuning post-edited translation text, and the sample machine-translated text of the sample fine-tuning source text. The pre-trained post-editing model is trained on sample pre-training source text, its sample pre-training post-edited translation text, and the simulated translation text of the sample pre-training source text.
In yet another aspect, an embodiment of the present disclosure further provides a non-transitory computer-readable storage medium storing a computer program. When executed by a processor, the computer program implements the translation post-editing method provided by the above embodiments, which includes: determining the machine-translated text to be edited; and inputting the machine-translated text and its corresponding source text into the post-editing model, to obtain the post-edited translation text output by the post-editing model. The post-editing model is obtained by fine-tuning a pre-trained post-editing model on sample fine-tuning source text, its sample fine-tuning post-edited translation text, and the sample machine-translated text of the sample fine-tuning source text. The pre-trained post-editing model is trained on sample pre-training source text, its sample pre-training post-edited translation text, and the simulated translation text of the sample pre-training source text.
The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement it without creative effort.
From the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of software plus a necessary general-purpose hardware platform, or, of course, by hardware. Based on this understanding, the above technical solutions, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform the methods described in the embodiments or in certain parts of the embodiments.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present disclosure, not to limit them. Although the present disclosure has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be equivalently replaced, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present disclosure.

Claims (10)

  1. A translation post-editing method, characterized by comprising:
    determining a machine-translated text to be edited;
    inputting the machine-translated text and its corresponding source text into a post-editing model to obtain a post-edited translation output by the post-editing model;
    wherein the post-editing model is obtained by fine-tuning a pre-trained post-editing model based on a sample fine-tuning source text, its sample fine-tuning post-edited translation, and a sample machine-translated text of the sample fine-tuning source text; and
    the pre-trained post-editing model is obtained by training based on a sample pre-training source text, its sample pre-training post-edited translation, and a simulated translation of the sample pre-training source text.
  2. The translation post-editing method according to claim 1, characterized in that the sample machine-translated text corresponds to at least one error type among long-sentence translation errors, entity-name translation errors, and domain translation errors.
  3. The translation post-editing method according to claim 2, characterized in that the sample machine-translated text is determined in at least one of the following ways:
    applying a first machine translation model to translate the sample fine-tuning source text, to obtain a sample machine-translated text of the long-sentence translation error type, wherein the first machine translation model is trained on first sample translation source texts and their first sample translation target texts, the sample fine-tuning source text is a long sentence, and the first sample translation source texts are short sentences;
    randomly modifying entity names in the sample fine-tuning post-edited translation, to obtain a sample machine-translated text of the entity-name translation error type;
    applying a second machine translation model to translate the sample fine-tuning source text, to obtain a sample machine-translated text of the domain translation error type, wherein the second machine translation model is trained on second sample translation source texts, whose domain differs from that of the sample fine-tuning source text, and their second sample translation target texts.
  4. The translation post-editing method according to claim 1, characterized in that the pre-trained post-editing model comprises a pre-trained source-language encoder, a pre-trained target-language encoder, and a decoder.
  5. The translation post-editing method according to claim 4, characterized in that the pre-trained source-language encoder and the pre-trained target-language encoder are each trained on sample monolingual texts in the corresponding language and on sample erroneous texts obtained by performing conventional error simulation on the sample monolingual texts.
  6. The translation post-editing method according to claim 1, characterized in that the simulated translation is determined by the following step:
    performing conventional error simulation on the sample pre-training source text or the sample pre-training post-edited translation, to obtain the simulated translation.
  7. The translation post-editing method according to claim 5 or 6, characterized in that performing conventional error simulation specifically comprises:
    randomly selecting several text fragments in the corresponding text, and performing a deletion, rearrangement, replacement, or transfer operation on the text fragments.
  8. A translation post-editing apparatus, characterized by comprising:
    a translation determination unit, configured to determine a machine-translated text to be edited;
    a post-editing unit, configured to input the machine-translated text and its corresponding source text into a post-editing model to obtain a post-edited translation output by the post-editing model;
    wherein the post-editing model is obtained by fine-tuning a pre-trained post-editing model based on a sample fine-tuning source text, its sample fine-tuning post-edited translation, and a sample machine-translated text of the sample fine-tuning source text; and
    the pre-trained post-editing model is obtained by training based on a sample pre-training source text, its sample pre-training post-edited translation, and a simulated translation of the sample pre-training source text.
  9. An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the program, implements the steps of the translation post-editing method according to any one of claims 1 to 7.
  10. A non-transitory computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the translation post-editing method according to any one of claims 1 to 7.
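
The sample-construction steps recited in claims 3 and 7 can be sketched in Python as follows. This is a hedged illustration, not the applicant's implementation. The claims name the four operations but do not define them here, so the sketch assumes that rearrangement shuffles the tokens inside a fragment, that replacement substitutes tokens drawn from a vocabulary, and that transfer moves a fragment to another random position. The function names `simulate_errors` and `corrupt_entity_names`, whitespace tokenization, and the caller-supplied list of entity names are likewise assumptions.

```python
import random
from typing import List, Optional, Sequence


def simulate_errors(tokens: Sequence[str], num_spans: int = 2, max_span_len: int = 3,
                    vocab: Optional[Sequence[str]] = None,
                    rng: Optional[random.Random] = None) -> List[str]:
    """Apply one random operation to each of `num_spans` randomly chosen fragments."""
    if rng is None:
        rng = random.Random()
    out = list(tokens)                        # work on a copy; the input is not mutated
    pool = list(vocab) if vocab else list(tokens)
    for _ in range(num_spans):
        if not out:
            break
        length = rng.randint(1, min(max_span_len, len(out)))
        start = rng.randint(0, len(out) - length)
        span = out[start:start + length]
        op = rng.choice(["delete", "rearrange", "replace", "transfer"])
        if op == "delete":
            del out[start:start + length]
        elif op == "rearrange":               # assumed: shuffle tokens within the fragment
            rng.shuffle(span)
            out[start:start + length] = span
        elif op == "replace":                 # assumed: substitute tokens from the pool
            out[start:start + length] = [rng.choice(pool) for _ in span]
        else:                                 # assumed: move the fragment elsewhere
            del out[start:start + length]
            pos = rng.randint(0, len(out))
            out[pos:pos] = span
    return out


def corrupt_entity_names(post_edited: str, entities: Sequence[str],
                         replacement_pool: Sequence[str],
                         rng: Optional[random.Random] = None) -> str:
    """Swap each listed entity name found in the text for a different name from the pool."""
    if rng is None:
        rng = random.Random()
    noisy = post_edited
    for name in entities:
        candidates = [c for c in replacement_pool if c != name]
        if name in noisy and candidates:
            noisy = noisy.replace(name, rng.choice(candidates))
    return noisy


reference = "the quick brown fox jumps over the lazy dog".split()
print(" ".join(simulate_errors(reference, rng=random.Random(0))))
print(corrupt_entity_names("The office is in Wuhan.", ["Wuhan"], ["Wuhan", "Beijing"]))
```

Passing a seeded `random.Random` makes the corruption reproducible, which helps when regenerating a training corpus. Pairing each corrupted output with its clean reference yields the (erroneous text, correct text) pairs that claims 5 and 6 rely on.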
PCT/CN2021/078814 2020-10-29 2021-03-03 Method and apparatus for post-editing of translation, electronic device, and storage medium WO2022088570A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011186869.1A CN112287696B (en) 2020-10-29 2020-10-29 Post-translation editing method and device, electronic equipment and storage medium
CN202011186869.1 2020-10-29

Publications (1)

Publication Number Publication Date
WO2022088570A1 true WO2022088570A1 (en) 2022-05-05

Family

ID=74352729

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/078814 WO2022088570A1 (en) 2020-10-29 2021-03-03 Method and apparatus for post-editing of translation, electronic device, and storage medium

Country Status (2)

Country Link
CN (1) CN112287696B (en)
WO (1) WO2022088570A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116956946A (en) * 2023-07-14 2023-10-27 上海一者信息科技有限公司 Machine translation text fine granularity error type identification and positioning method

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112287696B (en) * 2020-10-29 2024-02-23 语联网(武汉)信息技术有限公司 Post-translation editing method and device, electronic equipment and storage medium
CN112836528B (en) * 2021-02-07 2023-10-03 语联网(武汉)信息技术有限公司 Machine post-translation editing method and system
CN114091483B (en) * 2021-10-27 2023-02-28 北京百度网讯科技有限公司 Translation processing method and device, electronic equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105740218A (en) * 2015-12-31 2016-07-06 成都数联铭品科技有限公司 Post-editing processing method for mechanical translation
CN109670191A (en) * 2019-01-24 2019-04-23 语联网(武汉)信息技术有限公司 Calibration optimization method, device and the electronic equipment of machine translation
US20200073949A1 (en) * 2018-02-24 2020-03-05 International Business Machines Corporation System and method for adaptive quality estimation for machine translation post-editing
CN111382580A (en) * 2020-01-21 2020-07-07 沈阳雅译网络技术有限公司 Encoder-decoder framework pre-training method for neural machine translation
CN111597778A (en) * 2020-04-15 2020-08-28 哈尔滨工业大学 Method and system for automatically optimizing machine translation based on self-supervision
CN112287696A (en) * 2020-10-29 2021-01-29 语联网(武汉)信息技术有限公司 Post-translation editing method and device, electronic equipment and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6471074B2 (en) * 2015-09-30 2019-02-13 株式会社東芝 Machine translation apparatus, method and program
CN111144137B (en) * 2019-12-17 2023-09-05 语联网(武汉)信息技术有限公司 Method and device for generating corpus of machine post-translation editing model

Also Published As

Publication number Publication date
CN112287696B (en) 2024-02-23
CN112287696A (en) 2021-01-29

Similar Documents

Publication Publication Date Title
WO2022088570A1 (en) Method and apparatus for post-editing of translation, electronic device, and storage medium
WO2018010455A1 (en) Neural network-based translation method and apparatus
CN110852117B (en) Effective data enhancement method for improving translation effect of neural machine
WO2022095345A1 (en) Multi-modal model training method, apparatus, device, and storage medium
WO2022148104A1 (en) Machine translation method and system based on pre-training model
CN101458681A (en) Voice translation method and voice translation apparatus
CN103631772A (en) Machine translation method and device
US20120296633A1 (en) Syntax-based augmentation of statistical machine translation phrase tables
JP7413630B2 (en) Summary generation model training method, apparatus, device and storage medium
CN111539229A (en) Neural machine translation model training method, neural machine translation method and device
WO2022179149A1 (en) Machine translation method and apparatus based on translation memory
CN111144140B (en) Zhongtai bilingual corpus generation method and device based on zero-order learning
CN112541365A (en) Machine translation method and device based on term replacement
CN112446221B (en) Translation evaluation method, device, system and computer storage medium
CN114201975B (en) Translation model training method, translation method and translation device
CN110705317B (en) Translation method and related device
CN115730585A (en) Text error correction and model training method and device, storage medium and equipment
CN113204978B (en) Machine translation enhancement training method and system
CN109657244B (en) English long sentence automatic segmentation method and system
CN109871550B (en) Method for improving digital translation quality based on post-processing technology
CN111178060A (en) Korean word segmentation reduction method based on language model
WO2022166267A1 (en) Machine translation post-editing method and system
CN114519358A (en) Translation quality evaluation method and device, electronic equipment and storage medium
CN114185573A (en) Implementation and online updating system and method for human-computer interaction machine translation system
CN117034968B (en) Neural machine translation method, device, electronic equipment and medium

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 21884314; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 21884314; Country of ref document: EP; Kind code of ref document: A1)