WO2022088570A1

WO2022088570A1 - Method and apparatus for post-editing of translation, electronic device, and storage medium

Info

Publication number: WO2022088570A1
Application number: PCT/CN2021/078814
Authority: WO
Inventors: 张睦
Original assignee: 语联网（武汉）信息技术有限公司
Priority date: 2020-10-29
Filing date: 2021-03-03
Publication date: 2022-05-05
Also published as: CN112287696B; CN112287696A

Abstract

A method and apparatus for post-editing of translation. The method comprises: determining machine translated text to be edited (110); and inputting into a post-editing model the machine translated text and original text corresponding thereto to obtain post-edited translation text output by the post-editing model (120), wherein the post-editing model is obtained by fine-tuning a pre-trained post-editing model on the basis of sample fine-tuning original text, sample fine-tuning post-edited translation text, and sample machine translated text of the sample fine-tuning original text, and the pre-trained post-editing model is trained on the basis of the sample pre-training original text, sample pre-training post-edited translation text, and simulated translation text of the sample pre-training original text. The method and apparatus improve the efficiency and effect of post-editing model training and improve post-editing accuracy by means of pre-training and fine-tuning as well as translation data synthesis by error simulation.

Description

Post-translation editing method, apparatus, electronic device and storage medium

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority to Chinese patent application No. 2020111868691, filed on October 29, 2020 and entitled "post-translation editing method, apparatus, and electronic device", which is incorporated herein by reference in its entirety.

Technical field

The present disclosure relates to the technical field of natural language processing, and in particular, to a post-translation editing method, apparatus, electronic device and storage medium.

Background

Post-editing means that, given an original text to be translated, the corresponding machine translation result is retrieved and a translator then revises and polishes it, thereby improving the quality of the translation. The machine translation result gives the translator a reference translation, so the translator does not have to start from scratch, which reduces the translator's workload.

In practice, when the machine translation result is far from the expected translation, post-editing forces the translator to make extensive revisions, which instead increases the translator's workload. For example, when a machine translation model processes low-resource original text from a specialized domain, it performs poorly and the resulting machine translation is far from the correct translation. Likewise, when the machine translation model mistranslates entity words such as person names, place names, or institution names, or mistranslates numerals, the accuracy of the machine translation is poor. When the machine translation model cannot handle long sentences properly, the accuracy is also insufficient and a great deal of post-editing is required. Automatic post-editing models therefore play an increasingly important role in computer-assisted translation. Given the original text to be translated and its machine translation as input, a post-editing model automatically post-edits the machine translation to correct translation errors and outputs the post-edited translation, which narrows the gap between the machine translation and the expected translation and further reduces the translator's workload.

However, existing methods for training post-editing models require a large triplet parallel corpus, in which each triplet consists of the original text, its machine translation, and the post-edited translation. Such triplet training data is difficult to obtain and costly to annotate manually, which leads to poor training results and low training efficiency for the post-editing model and, in turn, to poor post-editing accuracy.

SUMMARY OF THE INVENTION

(1) Technical problems to be solved

Embodiments of the present disclosure provide a post-translation editing method, apparatus, electronic device, and storage medium to overcome the defects of the prior art, namely that post-editing models train poorly and inefficiently and that post-editing accuracy is poor.

(2) Content of the invention

An embodiment of the present disclosure provides a post-translation editing method, including:

determining machine-translated text to be edited;

inputting the machine-translated text and its corresponding original text into a post-editing model to obtain post-edited translation text output by the post-editing model;

wherein the post-editing model is obtained by fine-tuning a pre-trained post-editing model based on sample fine-tuning original text, its sample fine-tuning post-edited translation text, and sample machine-translated text of the sample fine-tuning original text; and

the pre-trained post-editing model is obtained by training based on sample pre-training original text, its sample pre-training post-edited translation text, and simulated translation text of the sample pre-training original text.

According to the post-translation editing method of an embodiment of the present disclosure, the sample machine-translated text corresponds to at least one error type among long sentence translation errors, entity name translation errors, and domain translation errors.

According to the post-translation editing method of an embodiment of the present disclosure, the sample machine-translated text is determined in at least one of the following ways:

translating the sample fine-tuning original text with a first machine translation model to obtain sample machine-translated text of the long sentence translation error type, wherein the first machine translation model is obtained by training based on first sample translation original text and its first sample translation target text, the sample fine-tuning original text is a long sentence, and the first sample translation original text is a short sentence;

randomly modifying entity names in the sample fine-tuning post-edited translation text to obtain sample machine-translated text of the entity name translation error type;

translating the sample fine-tuning original text with a second machine translation model to obtain sample machine-translated text of the domain translation error type, wherein the second machine translation model is obtained by training based on second sample translation original text, which belongs to a different domain from the sample fine-tuning original text, and its second sample translation target text.

According to the post-translation editing method of an embodiment of the present disclosure, the pre-trained post-editing model includes a pre-trained source language encoder, a pre-trained target language encoder, and a decoder.

According to the post-translation editing method of an embodiment of the present disclosure, the pre-trained source language encoder and the pre-trained target language encoder are obtained by training based on sample monolingual text in the corresponding language and sample error text obtained by performing conventional error simulation on the sample monolingual text.

According to the post-translation editing method of an embodiment of the present disclosure, the simulated translation text is determined by the following step:

performing conventional error simulation on the sample pre-training original text or the sample pre-training post-edited translation text to obtain the simulated translation text.

According to the post-translation editing method of an embodiment of the present disclosure, performing conventional error simulation specifically includes:

randomly selecting several text fragments in the corresponding text, and performing a deletion, rearrangement, replacement, or transfer operation on the text fragments.

An embodiment of the present disclosure also provides a post-translation editing device, including:

a translation determination unit, configured to determine machine-translated text to be edited;

a post-editing unit, configured to input the machine-translated text and its corresponding original text into the post-editing model to obtain the post-edited translation text output by the post-editing model.

Embodiments of the present disclosure further provide an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of any of the post-translation editing methods described above.

Embodiments of the present disclosure further provide a non-transitory computer-readable storage medium storing a computer program that, when executed by a processor, implements the steps of any of the post-translation editing methods described above.

(3) Beneficial effects

In the post-translation editing method, apparatus, electronic device, and storage medium provided by the embodiments of the present disclosure, a pre-trained post-editing model is obtained by training based on sample pre-training original text, its sample pre-training post-edited translation text, and simulated translation text of the sample pre-training original text; the pre-trained post-editing model is then fine-tuned based on sample fine-tuning original text, its sample fine-tuning post-edited translation text, and sample machine-translated text of the sample fine-tuning original text to obtain the post-editing model. Combining pre-training with fine-tuning, and synthesizing translation data by error simulation, improves the training efficiency and training effect of the post-editing model and improves post-editing accuracy.

Brief description of the drawings

To illustrate the embodiments of the present application or the technical solutions of the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present application; those of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic flowchart of a post-translation editing method provided by an embodiment of the present disclosure;

FIG. 2 is a schematic flowchart of a post-editing model training method provided by an embodiment of the present disclosure;

FIG. 3 is a schematic structural diagram of a post-translation editing apparatus provided by an embodiment of the present disclosure;

FIG. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.

Detailed description

To make the purposes, technical solutions, and advantages of the embodiments of the present disclosure clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, but not all, of the embodiments of the present disclosure. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present disclosure without creative effort fall within the protection scope of the present disclosure.

Post-editing means that, given an original text to be translated, the corresponding machine translation result is retrieved and a translator then revises and polishes it, thereby improving the quality of the translation. The machine translation result gives the translator a reference translation, so the translator does not have to start from scratch, which reduces the translator's workload. However, when the machine translation result is far from the expected translation, post-editing forces the translator to make extensive revisions, which instead increases the translator's workload. For example, when a machine translation model processes low-resource original text from a specialized domain, when it mistranslates entity words such as person names, place names, or institution names, or numerals, or when it cannot handle long sentences properly, the translation quality is poor and the resulting machine translation is far from the correct translation, so a great deal of post-editing is required. Automatic post-editing models therefore play an increasingly important role in computer-assisted translation.

However, existing methods for training post-editing models require a large triplet parallel corpus, in which each triplet consists of the original text, its machine translation, and the post-edited translation. Such triplet training data is difficult to obtain and costly to annotate manually, which leads to poor training results and low training efficiency for the post-editing model and, in turn, to poor post-editing accuracy.

To address this, an embodiment of the present disclosure provides a post-translation editing method. FIG. 1 is a schematic flowchart of the post-translation editing method provided by an embodiment of the present disclosure. As shown in FIG. 1, the method includes:

Step 110, determining machine-translated text to be edited;

Step 120, inputting the machine-translated text and its corresponding original text into a post-editing model to obtain post-edited translation text output by the post-editing model;

wherein the post-editing model is obtained by fine-tuning a pre-trained post-editing model based on sample fine-tuning original text, its sample fine-tuning post-edited translation text, and sample machine-translated text of the sample fine-tuning original text; and

the pre-trained post-editing model is obtained by training based on sample pre-training original text, its sample pre-training post-edited translation text, and simulated translation text of the sample pre-training original text.

Specifically, the machine-translated text corresponding to the original text is obtained so that the post-editing model can post-edit it automatically. The machine-translated text can be obtained by inputting the original text into a machine translation model for translation.

The machine-translated text and its corresponding original text are then input into the post-editing model, which corrects the errors in the machine-translated text based on the semantic information of the original text and of the machine-translated text, yielding the corrected post-edited translation text. The post-edited translation text is in the same language as the machine-translated text.

The post-editing model is obtained by fine-tuning the pre-trained post-editing model based on the sample fine-tuning original text, its sample fine-tuning post-edited translation text, and the sample machine-translated text of the sample fine-tuning original text; the pre-trained post-editing model is obtained by training based on the sample pre-training original text, its sample pre-training post-edited translation text, and the simulated translation text of the sample pre-training original text.

Here, the post-editing model is trained by pre-training followed by fine-tuning. FIG. 2 is a schematic flowchart of a post-editing model training method provided by an embodiment of the present disclosure. As shown in FIG. 2, the training method for the post-editing model includes:

Step 210, training an initial model based on the sample pre-training original text, its sample pre-training post-edited translation text, and the simulated translation text of the sample pre-training original text, to obtain the pre-trained post-editing model;

Step 220, fine-tuning the pre-trained post-editing model based on the sample fine-tuning original text, its sample fine-tuning post-edited translation text, and the sample machine-translated text of the sample fine-tuning original text, to obtain the post-editing model.
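The data flow of steps 210 and 220 can be sketched as follows. This is a minimal illustration rather than the disclosed implementation: the `Triplet` record, the `train_step` callback (standing in for one parameter update of whatever model is used), and the batch size and epoch parameters are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List, Sequence


@dataclass(frozen=True)
class Triplet:
    source: str      # original text
    hypothesis: str  # simulated (step 210) or machine-translated (step 220) text
    post_edit: str   # reference post-edited translation text


# Hypothetical callback: performs one update of the model on a batch.
TrainStep = Callable[[List[Triplet]], None]


def batches(data: Sequence[Triplet], size: int) -> Iterator[List[Triplet]]:
    """Yield consecutive batches of at most `size` triplets."""
    for i in range(0, len(data), size):
        yield list(data[i:i + size])


def train_two_stage(train_step: TrainStep,
                    pretrain_data: Sequence[Triplet],
                    finetune_data: Sequence[Triplet],
                    batch_size: int = 2,
                    pretrain_epochs: int = 1,
                    finetune_epochs: int = 1) -> List[str]:
    """Run pre-training on synthetic triplets, then fine-tuning on MT triplets.

    Returns a log naming the stage of every update, in order.
    """
    log: List[str] = []
    stages = (("pretrain", pretrain_data, pretrain_epochs),
              ("finetune", finetune_data, finetune_epochs))
    for stage, data, epochs in stages:
        for _ in range(epochs):
            for batch in batches(data, batch_size):
                train_step(batch)  # same model, so fine-tuning starts from pre-trained weights
                log.append(stage)
    return log
```

Both stages consume records of the same shape; only the origin of the erroneous middle field differs, which is what allows the large synthetic set and the smaller machine-translated set to train the same model in sequence.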

First, the initial model is pre-trained using a large amount of sample pre-training original text, its sample pre-training post-edited translation text, and simulated translation text, to obtain the pre-trained post-editing model. The sample pre-training original text and the sample pre-training post-edited translation text can be obtained by downloading public bilingual parallel corpus data from the Internet, such as official United Nations documents and the Chinese-English parallel corpora of the Conference on Machine Translation (WMT). Error simulation can then be performed on the bilingual parallel corpus to obtain simulated translation text of the sample pre-training original text, which imitates machine translation output. Because pre-training needs only a bilingual parallel corpus, and simulated translation text resembling machine translation output is synthesized by error simulation, the difficulty of acquiring training data is greatly reduced and the cost of manually annotating post-edited translations is saved, which improves the efficiency of the whole training process and lowers the difficulty of training.

In addition, by training on the sample pre-training original text, its sample pre-training post-edited translation text, and the simulated translation text of the sample pre-training original text, the pre-trained post-editing model learns the conventional text errors that may appear in a translation and learns how to correct them according to the original text, so as to obtain correct post-edited translation text.

To further improve post-editing accuracy, the pre-trained post-editing model can be fine-tuned based on the sample fine-tuning original text, its sample fine-tuning post-edited translation text, and the sample machine-translated text of the sample fine-tuning original text, to obtain the post-editing model. The sample fine-tuning original text and its sample fine-tuning post-edited translation text can also be obtained from bilingual parallel corpora. To improve the accuracy of fine-tuning, bilingual parallel corpora generated in a translation production environment can be used. Each entry of such a corpus includes the original text to be translated and the high-quality translation produced by human translation and review, from which the sample fine-tuning original text and the high-quality sample fine-tuning post-edited translation text are obtained. The sample machine-translated text contains the translation errors that arise in real post-editing scenarios from the limitations of machine translation models. By fine-tuning on the sample fine-tuning original text, its sample fine-tuning post-edited translation text, and the sample machine-translated text, the post-editing model learns, in addition to conventional text errors, the translation errors that can occur in machine translation, which improves its ability to locate and correct errors in post-editing scenarios and further improves post-editing accuracy.
In addition, because fine-tuning requires less data than pre-training, it reduces the difficulty of acquiring <original text, machine translation, post-edited translation> triplets, which further lowers the difficulty of model training and improves training efficiency.

In the method provided by the embodiments of the present disclosure, a pre-trained post-editing model is obtained by training based on the sample pre-training original text, its sample pre-training post-edited translation text, and the simulated translation text of the sample pre-training original text, and the pre-trained post-editing model is then fine-tuned based on the sample fine-tuning original text, its sample fine-tuning post-edited translation text, and the sample machine-translated text of the sample fine-tuning original text to obtain the post-editing model. Combining pre-training with fine-tuning, and synthesizing translation data by error simulation, improves the training efficiency and training effect of the post-editing model and improves post-editing accuracy.

Based on the above embodiment, the sample machine-translated text corresponds to at least one error type among long sentence translation errors, entity name translation errors, and domain translation errors.

Specifically, so that during fine-tuning the post-editing model learns the translation errors that arise in real post-editing scenarios from the limitations of machine translation models, sample machine-translated text containing such errors can be obtained. Typical translation errors include long sentence translation errors, entity name translation errors, and domain translation errors. Long sentence translation errors occur when the machine translation model cannot handle long sentences properly. Entity name translation errors occur when the machine translation model translates entity words, such as person names, place names, or institution names, or translates numerals. Domain translation errors occur when the machine translation model processes low-resource original text from a specialized domain and the domain for which the model is suited differs from the domain of the text to be translated. The obtained sample machine-translated text may therefore correspond to at least one error type among long sentence translation errors, entity name translation errors, and domain translation errors.

Based on any of the above embodiments, the sample machine-translated text is determined in at least one of the following ways:

translating the sample fine-tuning original text with a first machine translation model to obtain sample machine-translated text of the long sentence translation error type, wherein the first machine translation model is obtained by training based on first sample translation original text and its first sample translation target text, the sample fine-tuning original text is a long sentence, and the first sample translation original text is a short sentence;

randomly modifying entity names in the sample fine-tuning post-edited translation text to obtain sample machine-translated text of the entity name translation error type;

translating the sample fine-tuning original text with a second machine translation model to obtain sample machine-translated text of the domain translation error type, wherein the second machine translation model is obtained by training based on second sample translation original text, which belongs to a different domain from the sample fine-tuning original text, and its second sample translation target text.

Specifically, for long sentence translation errors, a first machine translation model can be obtained by training based on first sample translation original text and its first sample translation target text; the first machine translation model can be built on a single Transformer model. The first sample translation original text and the first sample translation target text may be a bilingual parallel corpus downloaded from the Internet. Here, the first sample translation original text is a short sentence, for example one containing only a single sentence. Because the first machine translation model is trained on short sentences, it is good only at translating short sentences; if a long sentence is input into the model, the resulting translation is prone to long sentence translation errors. Therefore, a long sentence, for example one containing more than two sentences, is selected as the sample fine-tuning original text and input into the first machine translation model to obtain sample machine-translated text of the long sentence translation error type.
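The selection of long inputs can be sketched as below. This is a rough illustration under stated assumptions: sentences are counted with a naive punctuation-based regular expression (which will miscount abbreviations and decimals), `translate` is a hypothetical callable wrapping the short-sentence-trained first machine translation model, and the `min_sentences` threshold of 3 encodes "more than two sentences".

```python
import re
from typing import Callable, List, Sequence, Tuple

# Naive sentence delimiter covering common Latin and CJK terminal punctuation.
_SENT_END = re.compile(r"[.!?。！？]+")


def count_sentences(text: str) -> int:
    """Count non-empty segments between terminal punctuation marks."""
    return len([s for s in _SENT_END.split(text) if s.strip()])


def synthesize_long_sentence_errors(
        pairs: Sequence[Tuple[str, str]],
        translate: Callable[[str], str],
        min_sentences: int = 3) -> List[Tuple[str, str, str]]:
    """Build (source, machine-translated, post-edited) triplets from long sources.

    `pairs` holds (source, reference translation) entries; only sources with
    at least `min_sentences` sentences are passed to `translate`.
    """
    triplets = []
    for source, post_edit in pairs:
        if count_sentences(source) >= min_sentences:
            triplets.append((source, translate(source), post_edit))
    return triplets
```

The same wrapper shape would also serve the domain-error case, with `translate` bound to a model trained on out-of-domain data and the length filter replaced by a domain filter.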

For entity name translation errors, an entity recognition tool such as spaCy can be used to perform entity recognition on the sample fine-tuning post-edited translation text. Fragments of that text containing entities such as person names, place names, institution names, and numbers are picked out and randomly modified, for example deleted or replaced, to obtain sample machine-translated text of the entity name translation error type.
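A minimal sketch of the random entity modification follows. It assumes the entity spans have already been produced by an entity recognition tool and are passed in as character offsets; the function name, the `replacements` pool, the probability `p`, and the even split between deletion and replacement are illustrative choices, not details from the disclosure.

```python
import random
from typing import List, Sequence, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def corrupt_entities(text: str,
                     spans: Sequence[Span],
                     replacements: Sequence[str],
                     rng: random.Random,
                     p: float = 1.0) -> str:
    """Randomly delete or replace the entity mentions marked by `spans`.

    Each entity is modified with probability `p`; a modified entity is either
    removed or swapped for a different string drawn from `replacements`.
    Spans are assumed not to overlap.
    """
    out: List[str] = []
    cursor = 0
    for start, end in sorted(spans):
        out.append(text[cursor:start])  # keep text before the entity unchanged
        original = text[start:end]
        entity = original
        if rng.random() < p:
            candidates = [r for r in replacements if r != original]
            if rng.random() < 0.5 or not candidates:
                entity = ""  # deletion
            else:
                entity = rng.choice(candidates)  # replacement
        out.append(entity)
        cursor = end
    out.append(text[cursor:])
    return "".join(out)
```

For example, with `text = "Alice visited Paris in 2019."` and spans `[(0, 5), (14, 19), (23, 27)]`, the function returns a copy in which each of the three entities has been dropped or replaced while the surrounding words are preserved.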

For domain translation errors, a second machine translation model can be obtained by training based on second sample translation original text and its second sample translation target text; the second machine translation model can be built on a single Transformer model. The second sample translation original text and the second sample translation target text belong to a different domain from the sample fine-tuning original text. For example, a high-quality but narrow-domain bilingual parallel corpus of United Nations government documents can be downloaded from the Internet as the second sample translation original text and the second sample translation target text. The second machine translation model trained in this way is good only at translating text from the domain of the second sample translation original text and target text. If original text from a different domain is input into the model, the resulting translation is prone to domain translation errors. The text obtained by translating the sample fine-tuning original text with the second machine translation model can therefore be used as sample machine-translated text of the domain translation error type.

The method provided by the embodiments of the present disclosure efficiently generates sample machine-translated text for three different types of translation error through different data synthesis methods, which removes the data annotation step from fine-tuning and further improves the training efficiency of the post-editing model.

Based on any of the above embodiments, the pre-trained post-editing model includes a pre-trained source language encoder, a pre-trained target language encoder, and a decoder.

Specifically, the pre-trained post-editing model may include two encoders, a source language encoder and a target language encoder, which encode the original text and the machine-translated text respectively, and a decoder, which decodes the encoding of the original text and the encoding of the machine-translated text to correct the errors in the machine-translated text and produce the post-edited translation text. The source language encoder, the target language encoder, and the decoder can each be built on a single Transformer model. The two encoders can be obtained through pre-training, which improves the pre-training efficiency of the pre-trained post-editing model and thereby further improves the overall training efficiency of the post-editing model.

In the method provided by the embodiments of the present disclosure, a pre-trained post-editing model is jointly constructed by a pre-trained source language encoder, a pre-trained target language encoder, and a decoder, which further improves the overall training efficiency of the post-editing model.

Based on any of the above embodiments, the pre-trained source language encoder and the pre-trained target language encoder are obtained by training based on sample monolingual text in the corresponding language and sample error text obtained by performing conventional error simulation on the sample monolingual text.

Specifically, so that the source language encoder and the target language encoder learn to extract correct semantic information from erroneous text, and thus produce original-text encodings and translation encodings that carry correct semantic information and are more expressive, the two encoders can be trained based on sample monolingual text in the corresponding language, its corresponding sample error text, and a word vector model of the corresponding language. For example, if the original text is Chinese and the translation is English, the source language encoder can be pre-trained based on Chinese sample monolingual text, its corresponding sample error text, and a Chinese word vector model, and the target language encoder can be pre-trained based on English sample monolingual text, its corresponding sample error text, and an English word vector model. The sample monolingual text can be obtained by collecting a large monolingual corpus; for example, public Chinese monolingual corpora such as Chinese Wikipedia and news corpora, and public English monolingual corpora such as English Wikipedia and news corpora, can be downloaded from the Internet. To reduce the difficulty of obtaining training data, part of the monolingual corpus, for example 20%, can be randomly selected and subjected to conventional error simulation to obtain sample error text containing conventional text errors.
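The sampling step for encoder pre-training data might look like the sketch below. The function name and the `corrupt` callable (standing in for conventional error simulation) are assumptions; only the idea of randomly selecting a fraction such as 20% of the monolingual corpus comes from the description.

```python
import random
from typing import Callable, List, Sequence, Tuple


def build_denoising_pairs(corpus: Sequence[str],
                          corrupt: Callable[[str], str],
                          rng: random.Random,
                          fraction: float = 0.2) -> List[Tuple[str, str]]:
    """Return (sample error text, clean monolingual text) pairs.

    A random `fraction` of `corpus` is selected without replacement and each
    selected sentence is paired with its corrupted version.
    """
    k = int(len(corpus) * fraction)
    chosen = rng.sample(list(corpus), k)
    return [(corrupt(text), text) for text in chosen]
```

One such set would be built per language, so that the source language encoder and the target language encoder each see erroneous and clean text in their own language.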

In the method provided by the embodiments of the present disclosure, the source language encoder and the target language encoder are pre-trained on sample monolingual text in the corresponding language and on sample error text obtained by performing conventional error simulation on that text, so that they produce original-text encodings and translation encodings that carry correct semantic information, which improves the expressive power of the encodings.

Based on any of the above embodiments, the simulated translation text is determined based on the following steps:

The conventional error simulation is performed on the original text of the sample pre-training or the edited translation text after the sample pre-training, and the simulated translation text is obtained.

Specifically, part of the bilingual parallel corpus, such as 10%, can be randomly selected, and conventional error simulation can be performed on the sample pre-training original text in each entry to obtain a simulated translation text containing conventional text errors. The sample pre-training post-edited translation text, the simulated translation text and the sample pre-training original text of that entry are then used as one piece of training data for the pre-trained post-editing model. It is also possible to randomly select part of the bilingual parallel corpus, such as 10%, and perform conventional error simulation on the sample pre-training post-edited translation text to obtain a simulated translation text containing conventional text errors. The sample pre-training original text, the simulated translation text and the sample pre-training post-edited translation text of that entry are then used as one piece of training data for the pre-trained post-editing model.

Based on any of the above embodiments, performing conventional error simulation specifically includes:

Randomly select several text fragments in the corresponding text, and perform a deletion, rearrangement, replacement or transfer operation on each text fragment.

Specifically, conventional text errors include missing words, reversed word order, wrong words, repetition and the like. Therefore, when simulating conventional errors, several text fragments in the corresponding text can be randomly selected, and a deletion, rearrangement, replacement or transfer operation can be performed on each of them. Deletion removes the text fragment as a whole; rearrangement reverses the order of the words within the text fragment; replacement overwrites the text fragment with a text fragment taken from another position in the same text; and transfer swaps the text fragment with a text fragment located elsewhere in the same text. For example, conventional error simulation can be performed as follows:

Operation	Example
Original text	<zh>今天天气真好。 <en>The weather is great today.
Delete	<zh>今天天DEL DEL好。 <en>Today is a good day DEL DEL.
Rearrange	<zh>今天天真气好。 <en>Today is very innocuous.
Replace	<zh>今天天今天好。 <en>Good day today.
Transfer	<zh>今气真天天好。 <en>It's a good day today.
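The four operations can be illustrated with a minimal Python sketch. It is not taken from the disclosure: the function names, the fixed fragment length and the random driver `simulate_errors` are assumptions made for illustration, and the deletion here simply drops the tokens instead of leaving DEL markers as in the table.

```python
import random

def delete(tokens, start, end):
    """Remove the fragment tokens[start:end] as a whole."""
    return tokens[:start] + tokens[end:]

def rearrange(tokens, start, end):
    """Reverse the order of the tokens inside the fragment."""
    return tokens[:start] + tokens[start:end][::-1] + tokens[end:]

def replace(tokens, start, end, src):
    """Overwrite the fragment with the equally long fragment starting at src."""
    n = end - start
    return tokens[:start] + tokens[src:src + n] + tokens[end:]

def transfer(tokens, start, end, other):
    """Swap the fragment with the equally long, non-overlapping fragment at other."""
    n = end - start
    out = list(tokens)
    out[start:end], out[other:other + n] = tokens[other:other + n], tokens[start:end]
    return out

def simulate_errors(tokens, num_fragments=1, frag_len=2, rng=None):
    """Hypothetical driver: apply a random operation to randomly chosen fragments."""
    rng = rng or random.Random()
    out = list(tokens)
    for _ in range(num_fragments):
        if len(out) < frag_len:
            break
        start = rng.randrange(len(out) - frag_len + 1)
        end = start + frag_len
        # Start positions of fragments that do not overlap the selected one.
        others = [i for i in range(len(out) - frag_len + 1)
                  if i + frag_len <= start or i >= end]
        ops = ["delete", "rearrange"] + (["replace", "transfer"] if others else [])
        op = rng.choice(ops)
        if op == "delete":
            out = delete(out, start, end)
        elif op == "rearrange":
            out = rearrange(out, start, end)
        elif op == "replace":
            out = replace(out, start, end, rng.choice(others))
        else:
            out = transfer(out, start, end, rng.choice(others))
    return out
```

Applied character by character to 今天天气真好。 with the fragment 气真, `rearrange`, `replace` (taking 今天 as the source fragment) and `transfer` (swapping with 天天) reproduce the Chinese strings in the corresponding rows of the table.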

Based on any of the above embodiments, another embodiment of the present disclosure provides a post-editing model construction method. The method includes:

First, collect the corpus data required for model training, including:

Accumulate the bilingual parallel corpus generated in the translation production environment and record it as bilingual parallel corpus C. Each entry includes an original text to be translated and a high-quality translation text produced by manual translation and review.

Download public bilingual parallel corpora from the Internet, such as the United Nations and WMT bilingual parallel corpora, and record them as bilingual parallel corpus T.

Download a public monolingual corpus in the source language from the Internet, such as Chinese Wikipedia and news corpora, and record it as monolingual corpus Z.

Download a public monolingual corpus in the target language from the Internet, such as English Wikipedia and news corpora, and record it as monolingual corpus E.

Perform word segmentation on all corpora. For the English corpus, the spaCy tool can be used for word segmentation. For the Chinese corpus, segmentation can be performed character by character using rules, that is, each individual Chinese character, each run of consecutive digits or English letters, and each punctuation mark is treated as one token. Then add a language identifier to the beginning of each corpus entry, for example <zh> for Chinese text and <en> for English text.
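The rule-based Chinese segmentation and the language identifier can be sketched as follows. This is an illustrative assumption, not the disclosed implementation: the regular expression, the choice to treat letter runs and digit runs as separate tokens, and the helper names `segment_zh` and `add_language_tag` are all hypothetical.

```python
import re

# One token per CJK character, per run of ASCII letters, per run of digits,
# or per single remaining non-space symbol (for example punctuation).
_TOKEN = re.compile(r"[\u4e00-\u9fff]|[A-Za-z]+|[0-9]+|[^\s]")

def segment_zh(text):
    """Split Chinese text into character-level tokens by rule."""
    return _TOKEN.findall(text)

def add_language_tag(tokens, lang):
    """Prefix a token sequence with a language identifier such as <zh> or <en>."""
    return [f"<{lang}>"] + list(tokens)

tagged = add_language_tag(segment_zh("今天是2020年，用spaCy分词。"), "zh")
```

Here `tagged` begins with `<zh>` followed by the tokens 今, 天, 是, 2020, 年 and so on, so digits and Latin words stay whole while Chinese characters are split individually.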

Based on the segmented corpus data, the Skip-Gram algorithm is used to train word vectors for the source language and the target language separately. The dimension of the word vectors can be set to 300, and the context window can be set to 5.
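As a rough illustration of what the context window means, the sketch below enumerates the (center word, context word) pairs that a Skip-Gram model would be trained to predict from one segmented sentence. The function name `skipgram_pairs` is hypothetical, and the actual training of the 300-dimensional vectors is not shown.

```python
def skipgram_pairs(tokens, window=5):
    """List (center, context) pairs within `window` tokens on each side."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs(["the", "weather", "is", "great", "today"], window=5)
```

With a window of 5 and a five-token sentence, every token is paired with every other token; longer sentences only pair tokens that lie within five positions of each other.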

Randomly extract 20% of the corpus from Z and perform conventional error simulation on it to synthesize a parallel corpus that pairs each possibly corrupted text with its original text. Using this parallel corpus together with the word vector model of the source language, pre-train a standard Transformer model as the source language encoder.

Randomly extract 20% of the corpus from E and perform conventional error simulation on it to synthesize a parallel corpus that pairs each possibly corrupted text with its original text. Using this parallel corpus together with the word vector model of the target language, pre-train a standard Transformer model as the target language encoder.

Randomly select 10% of the corpus from T and perform conventional error simulation on the source text, generating triples (possibly corrupted source text, original translation text, original source text). Similarly, randomly select 10% of the corpus from T and perform conventional error simulation on the translation text, generating triples (original source text, possibly corrupted translation text, original translation text). The synthesized triple parallel corpus is used to pre-train a model consisting of dual Transformer encoders and a single Transformer decoder, which yields the pre-trained post-editing model. The dual Transformer encoders are the source language encoder and the target language encoder.
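The construction of the two kinds of pre-training triples can be sketched as below. The names `corrupt`, `sample` and `build_pretraining_triples` are hypothetical, and `corrupt` is only a stand-in for conventional error simulation (it swaps one random pair of adjacent tokens) so that the example stays self-contained.

```python
import random

def corrupt(tokens, rng):
    """Stand-in for conventional error simulation: swap two adjacent tokens."""
    if len(tokens) < 2:
        return list(tokens)
    i = rng.randrange(len(tokens) - 1)
    return tokens[:i] + [tokens[i + 1], tokens[i]] + tokens[i + 2:]

def sample(corpus, ratio, rng):
    """Randomly draw the given fraction of entries from the corpus."""
    return rng.sample(corpus, int(len(corpus) * ratio))

def build_pretraining_triples(corpus_t, ratio=0.10, seed=0):
    """corpus_t is a list of (source_tokens, translation_tokens) pairs."""
    rng = random.Random(seed)
    triples = []
    # Source side corrupted: (corrupted source, translation, original source).
    for src, tgt in sample(corpus_t, ratio, rng):
        triples.append((corrupt(src, rng), tgt, src))
    # Translation side corrupted: (source, corrupted translation, translation).
    for src, tgt in sample(corpus_t, ratio, rng):
        triples.append((src, corrupt(tgt, rng), tgt))
    return triples
```

In both cases the last element of the triple is the clean text that the decoder is expected to reproduce, and the first two elements are the inputs to the two encoders.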

Subsequently, the training data for the fine-tuning task is acquired, including:

a) Use rule-based Chinese sentence segmentation to split the source text in C into sentences, and select the bilingual parallel entries whose source text contains 2 or more sentences to form subset C1. Similarly, split the source text in T into sentences, and select the bilingual parallel entries whose source text contains exactly 1 sentence to form another subset T1. Using T1, build a machine translation engine based on the Transformer model. Then input the source text of C1 into this model for decoding to generate machine translations, producing triples (C1 original text, machine translation, C1 translation).

b) Use the spaCy tool to perform entity recognition on the translation text in C, and select the bilingual parallel entries that contain entities such as person names, place names, institution names and numbers to form C2. Randomly modify the entity nouns in the C2 translation text, for example by deletion or substitution, to generate triples (C2 original text, translation with corrupted entity nouns, C2 translation).

c) Select the United Nations bilingual parallel corpus from T and use it to build a machine translation engine based on the Transformer model. Extract a subset C3 from C, input the source text of C3 into this model for decoding to generate machine translations, and produce triples (C3 original text, machine translation, C3 translation).

The triples generated in a), b) and c) are pooled to form the complete training data for the fine-tuning task, and the pre-trained post-editing model is fine-tuned on this data to obtain the final post-editing model.
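Steps a) to c) and the final pooling can be sketched as follows. Everything here is an illustrative assumption: the sentence-ending punctuation set, the helper names, the representation of entities as character offsets (as an entity recognizer might return them), and the two translation callables `mt_short` and `mt_domain`, which stand in for the engines trained on T1 and on the United Nations corpus.

```python
import random
import re

# Assumed rule: a sentence ends at 。！？ or their ASCII counterparts.
_SENTENCE = re.compile(r"[^。！？!?]*[。！？!?]+|[^。！？!?]+$")

def sentence_count(text):
    """Count sentences in a text using the punctuation rule above."""
    return len([s for s in _SENTENCE.findall(text) if s.strip()])

def mt_triples(pairs, translate):
    """Build (original text, machine translation, reference translation) triples."""
    return [(src, translate(src), ref) for src, ref in pairs]

def corrupt_entity(translation, entity_spans, substitutes, rng):
    """Delete or substitute one randomly chosen entity given as (start, end) offsets."""
    start, end = rng.choice(entity_spans)
    new = "" if rng.random() < 0.5 else rng.choice(substitutes)
    edited = translation[:start] + new + translation[end:]
    return " ".join(edited.split())  # tidy whitespace left by a deletion

def build_finetuning_triples(c1, c2_with_spans, c3, mt_short, mt_domain,
                             substitutes, seed=0):
    rng = random.Random(seed)
    a = mt_triples(c1, mt_short)                      # a) long-sentence errors
    b = [(src, corrupt_entity(ref, spans, substitutes, rng), ref)
         for src, ref, spans in c2_with_spans]        # b) entity name errors
    c = mt_triples(c3, mt_domain)                     # c) domain errors
    return a + b + c
```

Subset C1 can then be selected with `[p for p in corpus_c if sentence_count(p[0]) >= 2]`, and T1 with the condition `== 1`, before the pooled triples are passed to fine-tuning.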

The post-translation editing apparatus provided by the embodiments of the present disclosure will be described below. The post-translation editing apparatus described below and the post-translation editing method described above can be referred to each other correspondingly.

Based on any of the above embodiments, FIG. 3 is a schematic structural diagram of an apparatus for post-editing translation provided by an embodiment of the present disclosure. As shown in FIG. 3, the apparatus includes a translation determining unit 310 and a post-editing unit 320.

The translation determining unit 310 is configured to determine the machine-translated text to be edited;

The post-editing unit 320 is configured to input the machine-translated text and its corresponding original text into the post-editing model, and obtain the post-edited translation text output by the post-editing model;

The post-editing model is obtained by fine-tuning the pre-trained post-editing model based on the sample fine-tuning original text, the sample fine-tuning post-edited translation text, and the sample machine-translated text of the sample fine-tuning original text; the pre-trained post-editing model is trained based on the sample pre-training original text, the sample pre-training post-edited translation text, and the simulated translation text of the sample pre-training original text.

The apparatus provided by the embodiment of the present disclosure obtains the pre-trained post-editing model by training on the sample pre-training original text, the sample pre-training post-edited translation text, and the simulated translation text of the sample pre-training original text. It then obtains the post-editing model by fine-tuning the pre-trained post-editing model on the sample fine-tuning original text, the sample fine-tuning post-edited translation text, and the sample machine-translated text of the sample fine-tuning original text. By combining pre-training with fine-tuning and by synthesizing translation data, the training efficiency and training effect of the post-editing model are improved, and the accuracy of post-editing is improved.

Based on any of the foregoing embodiments, the sample machine-translated text corresponds to at least one error type among long sentence translation errors, entity name translation errors, and domain translation errors.

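Based on any of the above embodiments, the sample machine-translated text is determined in at least one of the following ways:

Translating the sample fine-tuning original text with a first machine translation model to obtain a sample machine-translated text of the long sentence translation error type, where the first machine translation model is trained on a first sample translation original text and its first sample translation text, the sample fine-tuning original text is a long sentence, and the first sample translation original text is a short sentence;

Randomly modifying the entity names in the sample fine-tuning post-edited translation text to obtain a sample machine-translated text of the entity name translation error type;

Translating the sample fine-tuning original text with a second machine translation model to obtain a sample machine-translated text of the domain translation error type, where the second machine translation model is trained on a second sample translation original text, whose domain differs from that of the sample fine-tuning original text, and its second sample translation text.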

The apparatus provided by the embodiment of the present disclosure can efficiently generate sample machine-translated texts corresponding to three different types of translation errors through different data synthesis methods. This removes the need for data labeling in the fine-tuning process and can further improve the training efficiency of the post-editing model.

In the apparatus provided by the embodiments of the present disclosure, a pre-trained post-editing model is jointly constructed by a pre-trained source language encoder, a pre-trained target language encoder, and a decoder, which further improves the overall training efficiency of the post-editing model.

Based on any of the above embodiments, the pre-trained source language encoder and the pre-trained target language encoder are obtained by training based on sample monolingual texts of the corresponding language and sample error texts obtained by performing conventional error simulation on the sample monolingual texts.

In the apparatus provided by the embodiment of the present disclosure, the source language encoder and the target language encoder are pre-trained on sample monolingual text of the corresponding language and on sample error text obtained by performing conventional error simulation on that text. The encoders can therefore produce source encodings and translation encodings that carry correct semantic information, which improves the expressive ability of the encoding.

Based on any of the above embodiments, the simulated translation text is obtained by performing conventional error simulation on the sample pre-training original text or the sample pre-training post-edited translation text.

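Based on any of the above embodiments, the apparatus further includes a conventional error simulation unit configured to randomly select several text fragments in the corresponding text and perform a deletion, rearrangement, replacement or transfer operation on the text fragments.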

FIG. 4 illustrates a schematic diagram of the physical structure of an electronic device. As shown in FIG. 4, the electronic device may include a processor 410, a communications interface 420, a memory 430 and a communication bus 440, where the processor 410, the communications interface 420 and the memory 430 communicate with each other through the communication bus 440. The processor 410 may invoke the logic instructions in the memory 430 to execute a post-translation editing method, the method comprising: determining the machine-translated text to be edited; inputting the machine-translated text and its corresponding original text into the post-editing model to obtain the post-edited translation text output by the post-editing model; wherein the post-editing model is obtained by fine-tuning the pre-trained post-editing model based on the sample fine-tuning original text, the sample fine-tuning post-edited translation text, and the sample machine-translated text of the sample fine-tuning original text; and the pre-trained post-editing model is obtained by training based on the sample pre-training original text, the sample pre-training post-edited translation text, and the simulated translation text of the sample pre-training original text.

In addition, the above-mentioned logic instructions in the memory 430 can be implemented in the form of software functional units and, when sold or used as an independent product, can be stored in a computer-readable storage medium. Based on such understanding, the technical solutions of the present disclosure in essence, or the parts that contribute to the prior art, or parts of the technical solutions, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present disclosure. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, and other media that can store program code.

In another aspect, an embodiment of the present disclosure also provides a computer program product. The computer program product includes a computer program stored on a non-transitory computer-readable storage medium, and the computer program includes program instructions. When the program instructions are executed by a computer, the computer can execute the post-translation editing method provided by the above method embodiments. The method includes: determining the machine-translated text to be edited; inputting the machine-translated text and its corresponding original text into the post-editing model to obtain the post-edited translation text output by the post-editing model; wherein the post-editing model is obtained by fine-tuning the pre-trained post-editing model based on the sample fine-tuning original text, the sample fine-tuning post-edited translation text, and the sample machine-translated text of the sample fine-tuning original text; and the pre-trained post-editing model is obtained by training based on the sample pre-training original text, the sample pre-training post-edited translation text, and the simulated translation text of the sample pre-training original text.

In yet another aspect, an embodiment of the present disclosure further provides a non-transitory computer-readable storage medium on which a computer program is stored. When the computer program is executed by a processor, the post-translation editing method provided by the foregoing embodiments is carried out. The method includes: determining the machine-translated text to be edited; inputting the machine-translated text and its corresponding original text into the post-editing model to obtain the post-edited translation text output by the post-editing model; wherein the post-editing model is obtained by fine-tuning the pre-trained post-editing model based on the sample fine-tuning original text, the sample fine-tuning post-edited translation text, and the sample machine-translated text of the sample fine-tuning original text; and the pre-trained post-editing model is obtained by training based on the sample pre-training original text, the sample pre-training post-edited translation text, and the simulated translation text of the sample pre-training original text.

The apparatus embodiments described above are only illustrative. The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution in this embodiment. Those of ordinary skill in the art can understand and implement it without creative effort.

From the description of the above embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of software plus a necessary general hardware platform, and can of course also be implemented by hardware. Based on this understanding, the above technical solutions in essence, or the parts that contribute to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk or an optical disc, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform the methods described in the various embodiments or in some parts of the embodiments.

Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present disclosure, not to limit them. Although the present disclosure has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent replacements can be made to some of their technical features, and that these modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present disclosure.

Claims

1. A post-translation editing method, comprising:

determining the machine-translated text to be edited;

inputting the machine-translated text and its corresponding original text into the post-editing model to obtain the post-edited translation text output by the post-editing model;

wherein the post-editing model is obtained by fine-tuning the pre-trained post-editing model based on the sample fine-tuning original text, the sample fine-tuning post-edited translation text, and the sample machine-translated text of the sample fine-tuning original text;

and the pre-trained post-editing model is obtained by training based on the sample pre-training original text, the sample pre-training post-edited translation text, and the simulated translation text of the sample pre-training original text.
2. The post-translation editing method according to claim 1, wherein the sample machine-translated text corresponds to at least one error type among long sentence translation errors, entity name translation errors, and domain translation errors.
3. The post-translation editing method according to claim 2, wherein the sample machine-translated text is determined in at least one of the following ways:

translating the sample fine-tuning original text with a first machine translation model to obtain a sample machine-translated text of the long sentence translation error type, wherein the first machine translation model is obtained by training on a first sample translation original text and its first sample translation text, the sample fine-tuning original text is a long sentence, and the first sample translation original text is a short sentence;

randomly modifying the entity names in the sample fine-tuning post-edited translation text to obtain a sample machine-translated text of the entity name translation error type;

translating the sample fine-tuning original text with a second machine translation model to obtain a sample machine-translated text of the domain translation error type, wherein the second machine translation model is obtained by training on a second sample translation original text, whose domain differs from that of the sample fine-tuning original text, and its second sample translation text.
4. The post-translation editing method according to claim 1, wherein the pre-trained post-editing model comprises a pre-trained source language encoder, a pre-trained target language encoder, and a decoder.
5. The post-translation editing method according to claim 4, wherein the pre-trained source language encoder and the pre-trained target language encoder are obtained by training on sample monolingual texts of the corresponding languages and on sample error texts obtained by performing conventional error simulation on the sample monolingual texts.
6. The post-translation editing method according to claim 1, wherein the simulated translation text is determined based on the following step:

performing conventional error simulation on the sample pre-training original text or the sample pre-training post-edited translation text to obtain the simulated translation text.
7. The post-translation editing method according to claim 5 or 6, wherein performing conventional error simulation specifically comprises:

randomly selecting several text fragments in the corresponding text, and performing a deletion, rearrangement, replacement or transfer operation on the text fragments.
8. A post-translation editing apparatus, comprising:

a translation determining unit, configured to determine the machine-translated text to be edited;

a post-editing unit, configured to input the machine-translated text and its corresponding original text into the post-editing model to obtain the post-edited translation text output by the post-editing model;

wherein the post-editing model is obtained by fine-tuning the pre-trained post-editing model based on the sample fine-tuning original text, the sample fine-tuning post-edited translation text, and the sample machine-translated text of the sample fine-tuning original text;

and the pre-trained post-editing model is obtained by training based on the sample pre-training original text, the sample pre-training post-edited translation text, and the simulated translation text of the sample pre-training original text.
9. An electronic device, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that, when the processor executes the program, the steps of the post-translation editing method according to any one of claims 1 to 7 are implemented.
10. A non-transitory computer-readable storage medium on which a computer program is stored, characterized in that, when the computer program is executed by a processor, the steps of the post-translation editing method according to any one of claims 1 to 7 are implemented.