CN115270825A - Method for automatically completing translated text fragments during post-editing of machine-translated text - Google Patents

Method for automatically completing translated text fragments during post-editing of machine-translated text

Info

Publication number
CN115270825A
CN115270825A (application CN202210942707.9A)
Authority
CN
China
Prior art keywords
translation
training
tgt
editing
sentence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210942707.9A
Other languages
Chinese (zh)
Inventor
毛红保
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Iol Wuhan Information Technology Co ltd
Original Assignee
Iol Wuhan Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Iol Wuhan Information Technology Co ltd filed Critical Iol Wuhan Information Technology Co ltd
Priority to CN202210942707.9A priority Critical patent/CN115270825A/en
Publication of CN115270825A publication Critical patent/CN115270825A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/40 Processing or translation of natural language
    • G06F 40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G06F 40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method for automatically completing translated text fragments during post-editing of machine-translated text, comprising the steps of synthesizing training data and training a translation-completion fragment prediction model. The beneficial effects of the invention are: the original sentence is machine-translated to obtain a machine-translated sentence; the translator reviews it, identifies a fragment that needs intervention, and deletes that fragment to obtain a translation to be completed. The original sentence and the translation to be completed are spliced and fed into the fragment prediction model, which automatically generates the content of the missing fragment; this content is backfilled into the sentence to obtain the completed translation, finishing the post-editing of that sentence. Because the content to be completed is generated automatically by the prediction model, the manual revision and typing otherwise done by the translator is replaced, and post-editing efficiency can be significantly improved.

Description

Method for automatically completing translated text fragments during post-editing of machine-translated text
Technical Field
The invention relates to a method for automatically completing translated text fragments, in particular during post-editing of machine-translated text, and belongs to the technical field of translation editing.
Background
With the continuous improvement of machine translation quality, post-editing of machine-translated text has become common practice for translators. During post-editing, the translator checks the quality of the machine-translated sentences one by one and revises them where necessary.
The quality of current neural machine translation engines has improved greatly, and in most cases a translator only needs to revise partial fragments of a translated sentence during post-editing. The revision process is: the translator first deletes the fragment to be revised and then manually types in new content to complete the sentence. Because this relies entirely on manual work, it is inefficient and the potential of AI-assisted translation is not fully exploited.
Disclosure of Invention
The present invention is directed to solving at least one of the above problems by providing a method for automatically completing translated text fragments during post-editing of machine-translated text.
The invention achieves this purpose through the following technical scheme. A method for automatically completing translated text fragments during post-editing of machine-translated text comprises the following steps:
Step one, synthesize training data. The translation-completion fragment predictor is a generative NLP model that must be learned on large-scale data. The training data is produced by two methods, sampling parallel corpora and generating data from the translation process, and the data produced by the two methods can be used independently or combined.
Step two, train the translation-completion fragment prediction model. A transformer-based model can be trained as a generative translation task, or fine-tuning can be performed on the basis of an encoder-decoder pre-trained model.
As a still further scheme of the invention: in the first step, sampling the parallel corpus specifically includes:
assume a parallel corpus pair (src, tgt), where src represents an original sentence and tgt its translation. Simulating the process of manual post-editing, a contiguous fragment of the tgt sentence is intercepted at random (keeping words and expressions intact) and recorded as tgt_fragment; that fragment in the tgt sentence is replaced with a <mask> token, and the resulting sentence is recorded as tgt_mask. This forms a new training triple (src, tgt_mask, tgt_fragment), where src and tgt_mask are the input text for model training and tgt_fragment is the output text. Because tgt_fragment is intercepted at random, varying the interception position generates multiple different target corpora from the same original pair (src, tgt).
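The sampling procedure above can be sketched as follows (a minimal illustration; the function and variable names are ours, not from the patent):

```python
import random

MASK = "<mask>"

def sample_training_example(src, tgt, rng=None):
    """Synthesize one training example from a parallel pair (src, tgt)
    by masking a random contiguous span of whole words in tgt.
    Splitting on whitespace keeps word integrity, as required."""
    rng = rng or random.Random()
    words = tgt.split()
    start = rng.randrange(len(words))           # first word of the span
    end = rng.randrange(start, len(words)) + 1  # one past the last word
    tgt_fragment = " ".join(words[start:end])
    tgt_mask = " ".join(words[:start] + [MASK] + words[end:])
    # (src, tgt_mask) is the model input; tgt_fragment is the target output.
    return src, tgt_mask, tgt_fragment
```

Repeated calls with different random states yield multiple target corpora from one (src, tgt) pair, as the text notes.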
As a still further scheme of the invention: in the first step, generating data based on the translation process specifically comprises:
the translation process data comes from recording and collecting real translation project data, including: the original sentence src, the machine translation mt, and the manually post-edited translation pe. Because pe is produced by modifying mt, the differing region between mt and pe can be found by comparison; that region in mt is replaced with a <mask> token and the result is recorded as mt_mask, while the corresponding region of pe becomes the model's prediction target, recorded as pe_fragment. This yields a training example with src and mt_mask as input and pe_fragment as output.
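A word-level sketch of this comparison using Python's standard difflib (the patent does not specify the diff algorithm; for simplicity this version merges all differing regions into a single masked span):

```python
import difflib

MASK = "<mask>"

def make_example_from_process_data(src, mt, pe):
    """Build one training pair from real post-editing data: find where
    the post-edited translation pe differs from the machine translation
    mt, replace that region in mt with <mask>, and take the corresponding
    pe span as the prediction target."""
    mt_words, pe_words = mt.split(), pe.split()
    sm = difflib.SequenceMatcher(a=mt_words, b=pe_words)
    diffs = [op for op in sm.get_opcodes() if op[0] != "equal"]
    if not diffs:
        return None  # nothing was edited; no example to synthesize
    # One span from the first to the last differing region.
    i1, j1 = diffs[0][1], diffs[0][3]
    i2, j2 = diffs[-1][2], diffs[-1][4]
    mt_mask = " ".join(mt_words[:i1] + [MASK] + mt_words[i2:])
    pe_fragment = " ".join(pe_words[j1:j2])
    return (src, mt_mask), pe_fragment
```

Run on the patent's own example, this reproduces the target corpus shown later in the description (mt_mask = "&lt;mask&gt; shall be considered for load application", pe_fragment = "Actual dimensions of stud walls").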
As a still further scheme of the invention: in the second step, training based on a transformer model specifically comprises:
the training process is similar to training a neural machine translation model: the input sentence is formed by splicing src and tgt_mask, with tgt_fragment as the output; or the input is formed by splicing src and mt_mask, with pe_fragment as the output.
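For concreteness, the splicing could look like this (the separator token is an assumption; the patent only says the two parts are spliced):

```python
SEP = "</s>"  # hypothetical separator token; the patent does not name one

def build_io(src, masked_translation, fragment):
    """Format one seq2seq training example: the input splices the source
    sentence with the masked translation; the output is the fragment."""
    model_input = f"{src} {SEP} {masked_translation}"
    return model_input, fragment
```

The same formatting serves both corpus types: (src, tgt_mask, tgt_fragment) pairs from sampling, and (src, mt_mask, pe_fragment) pairs from translation process data.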
As a still further scheme of the invention: in the second step, the transformer comprises an encoder and a decoder.
As a still further scheme of the invention: in the standard architecture of the transform, each encoder is divided into six layers, and each layer is divided into a feed-forward network and a multi-head attention layer; the decoder is also divided into six layers, each layer is composed of two attention layers and a feedforward network, one is a self-attention layer, and the other is an attention layer composed of the final output of the encoder.
As a still further scheme of the invention: in the second step, fine-tuning based on an encoder-decoder pre-trained model specifically comprises:
the method is characterized in that a DeltaLM pre-training model is adopted to finely tune training corpora of synthetic training data, the DeltaLM is a pre-training model based on an encoder-decoder framework, strong text representation and multilingual translation capability are achieved through parameter sharing and pre-training of large-scale monolingual and bilingual corpora, based on the pre-training model, fine tuning is conducted on tasks of a translation completion segment prediction model, and good translation segment prediction effect can be achieved under the condition that small-scale training corpora are used.
The invention has the beneficial effects that: the original sentence is machine-translated to obtain a machine-translated sentence; the translator reviews it, identifies a fragment that needs intervention, and deletes that fragment to obtain a translation to be completed. The original sentence and the translation to be completed are spliced and fed into the fragment prediction model, which automatically generates the missing content; this content is backfilled into the sentence to obtain the completed translation, finishing the post-editing of that sentence. Because the content to be completed is generated automatically by the prediction model, the translator's manual revision and typing is replaced, and post-editing efficiency can be significantly improved.
Drawings
FIG. 1 is a schematic overall flow diagram of the present invention;
FIG. 2 is a schematic diagram of a training process of the transformer model of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example one
As shown in FIG. 1, a method for automatically completing translated text fragments during post-editing of machine-translated text comprises the following steps:
Step one, synthesize training data. The translation-completion fragment predictor is a generative NLP model that must be learned on large-scale data. The training data is produced by two methods, sampling parallel corpora and generating data from the translation process, and the data produced by the two methods can be used independently or combined.
Step two, train the translation-completion fragment prediction model. A transformer-based model can be trained as a generative translation task, or fine-tuning can be performed on the basis of an encoder-decoder pre-trained model.
In the embodiment of the present invention, in the first step, the sampling of the parallel corpus specifically includes:
assume a parallel corpus pair (src, tgt), where src represents an original sentence and tgt its translation. Simulating the process of manual post-editing, a contiguous fragment of the tgt sentence is intercepted at random (keeping words and expressions intact) and recorded as tgt_fragment; that fragment in the tgt sentence is replaced with a <mask> token, and the resulting sentence is recorded as tgt_mask. This forms a new training triple (src, tgt_mask, tgt_fragment), where src and tgt_mask are the input text for model training and tgt_fragment is the output text. Because tgt_fragment is intercepted at random, varying the interception position generates multiple different target corpora from the same original pair (src, tgt).
Original parallel corpus:
src: load application considering actual size of column wall
tgt: Actual dimensions of stud walls shall be considered for load application
Target corpus one:
src: load application considering actual size of column wall
tgt_mask: Actual dimensions of <mask> shall be considered for load application
tgt_fragment: stud walls
Target corpus two:
src: load application considering actual size of column wall
tgt_mask: Actual dimensions of stud walls shall be considered for <mask>
tgt_fragment: load application
In the embodiment of the present invention, in the first step, generating data based on the translation process specifically comprises:
the translation process data comes from recording and collecting real translation project data, including: the original sentence src, the machine translation mt, and the manually post-edited translation pe. Because pe is produced by modifying mt, the differing region between mt and pe can be found by comparison; that region in mt is replaced with a <mask> token and the result is recorded as mt_mask, while the corresponding region of pe becomes the model's prediction target, recorded as pe_fragment. This yields a training example with src and mt_mask as input and pe_fragment as output.
Raw translation process data:
src: load application considering actual size of column wall
mt: The actual size of column and wall shall be considered for load application
pe: Actual dimensions of stud walls shall be considered for load application
Target corpus:
src: load application considering actual size of column wall
mt_mask: <mask> shall be considered for load application
pe_fragment: Actual dimensions of stud walls
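Given a fragment predicted by the model, the final post-editing step backfills it into the masked translation; completing the target corpus above, for example:

```python
MASK = "<mask>"

def backfill(masked_translation, predicted_fragment):
    """Replace the <mask> token with the predicted fragment to obtain
    the completed translation, finishing post-editing of the sentence."""
    return masked_translation.replace(MASK, predicted_fragment, 1)
```

Calling backfill("&lt;mask&gt; shall be considered for load application", "Actual dimensions of stud walls") recovers the post-edited translation pe.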
Example two
Referring to FIGS. 1-2, a method for automatically completing translated text fragments during post-editing of machine-translated text comprises the following steps:
Step one, synthesize training data. The translation-completion fragment predictor is a generative model based on NLP (natural language processing) that must be learned on large-scale data. The training data is produced by two methods, sampling parallel corpora and generating data from the translation process, and the data produced by the two methods can be used independently or combined.
Step two, train the translation-completion fragment prediction model. A transformer-based model can be trained as a generative translation task, or fine-tuning can be performed on the basis of an encoder-decoder pre-trained model.
In the embodiment of the present invention, in the second step, training based on a transformer model specifically comprises:
the training process is similar to training a neural machine translation model: the input sentence is formed by splicing src and tgt_mask, with tgt_fragment as the output; or the input is formed by splicing src and mt_mask, with pe_fragment as the output.
In the embodiment of the present invention, in the second step, the transformer comprises an encoder and a decoder.
In the embodiment of the invention, in the standard transformer architecture, the encoder is a stack of six layers, each consisting of a multi-head attention layer and a feed-forward network; the decoder is also a stack of six layers, each consisting of two attention layers and a feed-forward network, one being a self-attention layer and the other an encoder-decoder attention layer over the final encoder output.
In the embodiment of the present invention, in the second step, fine-tuning based on an encoder-decoder pre-trained model specifically comprises:
the method is characterized in that a DeltaLM pre-training model is adopted to finely tune training corpora of synthetic training data, the DeltaLM is a pre-training model based on an encoder-decoder framework, strong text representation and multilingual translation capability are achieved through parameter sharing and pre-training of large-scale monolingual and bilingual corpora, based on the pre-training model, fine tuning is conducted on tasks of a translation completion segment prediction model, and good translation segment prediction effect can be achieved under the condition that small-scale training corpora are used.
The working principle is as follows: a two-stage fine-tuning scheme is used. In the first stage, fine-tuning is performed on the corpus synthesized by sampling parallel corpora; after the first stage finishes, the second stage fine-tunes on the corpus synthesized from translation process data. Through this two-stage fine-tuning, a better model prediction effect can be achieved.
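The two-stage schedule amounts to running the same fine-tuning procedure twice over different corpora; a minimal sketch, where `finetune_pass` is a hypothetical stand-in for one fine-tuning pass, not an API from the patent:

```python
def two_stage_finetune(finetune_pass, sampled_corpus, process_corpus):
    """Run stage 1 on the corpus synthesized by sampling parallel
    corpora, then stage 2 on the corpus derived from real translation
    process data, as the working principle describes."""
    for stage, corpus in ((1, sampled_corpus), (2, process_corpus)):
        finetune_pass(stage, corpus)
```

The ordering matters: the cleaner, more abundant sampled corpus shapes the model first, and the smaller but more realistic process-data corpus adapts it last.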
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.
Furthermore, it should be understood that although the present description refers to embodiments, not every embodiment may contain only a single embodiment, and such description is for clarity only, and those skilled in the art should integrate the description, and the embodiments may be combined as appropriate to form other embodiments understood by those skilled in the art.

Claims (7)

1. A method for automatically completing translated text fragments during post-editing of machine-translated text, characterized by comprising the following steps:
Step one, synthesizing training data, wherein the translation-completion fragment predictor is a generative model based on NLP (natural language processing), obtained through learning and training on data, and the training data is generated by two methods: sampling parallel corpora and generating data from the translation process;
Step two, training the translation-completion fragment prediction model, either by training a transformer-based model as a generative translation task or by fine-tuning on the basis of an encoder-decoder pre-trained model.
2. The method according to claim 1, wherein in step one, sampling the parallel corpus specifically comprises:
1) Assuming a parallel corpus pair (src, tgt), wherein src represents an original sentence and tgt represents a translated sentence;
2) Simulating the process of manual post-editing: randomly intercepting a contiguous fragment of the tgt sentence while keeping words and expressions intact, recorded as tgt_fragment; replacing that fragment in the tgt sentence with a <mask> token, the resulting sentence recorded as tgt_mask; thereby forming a new triple (src, tgt_mask, tgt_fragment), wherein src and tgt_mask are the input text for model training and tgt_fragment is the output text;
3) Because tgt_fragment is intercepted at random, varying the interception position generates multiple different target corpora from the same original pair (src, tgt).
3. The method according to claim 1, wherein in step one, generating data based on the translation process specifically comprises:
1) The translation process data is derived from recording and collecting real translation project data, comprising: the original sentence src, the machine translation mt, and the manually post-edited translation pe;
2) Because the post-edited translation pe is produced by modifying the machine translation mt, the differing region between mt and pe is found by comparison; that region in mt is replaced with a <mask> token, recorded as mt_mask, and the corresponding region of pe is taken as the model's prediction target, recorded as pe_fragment; thereby obtaining a training example with src and mt_mask as input and pe_fragment as output.
4. The method according to claim 1, wherein in step two, training based on a transformer model specifically comprises:
the input sentence is composed of src and tgt mask Spliced to form a single output of tgt fragment Or input by src and mt mask Formed by splicing, the output is pe fragment
5. The method according to claim 1, wherein in step two, the transformer comprises an encoder and a decoder.
6. The method according to claim 5, wherein in the standard transformer architecture, the encoder is a stack of six layers, each consisting of a multi-head attention layer and a feed-forward network, and the decoder is also a stack of six layers, each consisting of two attention layers and a feed-forward network, one being a self-attention layer and the other an encoder-decoder attention layer over the final encoder output.
7. The method according to claim 1, wherein in step two, fine-tuning based on an encoder-decoder pre-trained model specifically comprises:
the method is characterized in that a DeltaLM pre-training model is adopted to finely tune on a training corpus of synthetic training data, the DeltaLM is a pre-training model based on an encoder-decoder framework, and the DeltaLM pre-training model has text representation and multilingual translation capabilities through parameter sharing and pre-training of large-scale monolingual and bilingual corpora.
CN202210942707.9A 2022-08-08 2022-08-08 Method for automatically completing translated text fragments during post-editing of machine-translated text Pending CN115270825A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210942707.9A CN115270825A (en) 2022-08-08 2022-08-08 Method for automatically completing translated text fragments during post-editing of machine-translated text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210942707.9A CN115270825A (en) 2022-08-08 2022-08-08 Method for automatically completing translated text fragments during post-editing of machine-translated text

Publications (1)

Publication Number Publication Date
CN115270825A (en) 2022-11-01

Family

ID=83749940

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210942707.9A Pending CN115270825A (en) 2022-08-08 2022-08-08 Method for automatically completing translated text fragments during post-editing of machine-translated text

Country Status (1)

Country Link
CN (1) CN115270825A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230070302A1 (en) * 2021-09-07 2023-03-09 Lilt, Inc. Partial execution of translation in browser
US11900073B2 (en) * 2021-09-07 2024-02-13 Lilt, Inc. Partial execution of translation in browser

Similar Documents

Publication Publication Date Title
CN111597778B (en) Automatic optimizing method and system for machine translation based on self-supervision
CN105005642B (en) A kind of threedimensional model batch format conversion and light weight method
CN110365933A (en) A kind of online generating means of video conference meeting summary and method based on AI
CN108363704A (en) A kind of neural network machine translation corpus expansion method based on statistics phrase table
JPH06110701A (en) Device and method for converting computer program language
CN106529028A (en) Technological procedure automatic generating method
Kenny Human and machine translation
CN115270825A (en) Method for automatically completing translated text fragments during post-editing of machine-translated text
Moors et al. Human language technology audit 2018: Analysing the development trends in resource availability in all South African languages
Shen et al. Data player: Automatic generation of data videos with narration-animation interplay
CN112765948B (en) Document generation editing method
CN113657125B (en) Mongolian non-autoregressive machine translation method based on knowledge graph
Morgan et al. Facilitating the spread of new sign language technologies across Europe
JP2002063033A (en) Knowledge control system provided with ontology
CN101330389A (en) Method and system for composing group decision plan based on question disintegration
Nallusamy et al. A software redocumentation process using ontology based approach in software maintenance
CN114116779A (en) Deep learning-based power grid regulation and control field information retrieval method, system and medium
Svoboda Computing and Translation: An Overview for Technical Communicators
Schneider Notes on Social Production: A Brief Commentary
Bu Research on Computer Aided English Translation in the Wave of Globalization
Wang et al. Intelligent English Automatic Translation System Based on Multi-Feature Fusion
Larraz Towards Automated Fact-Checking in Africa: The Experience With Artificial Intelligence at Africa Check
Shiina et al. Comment Generation System for Program Procedure Learning
KR20230156595A (en) Learning method and apparatus of artficial intelligence translation model for multilingual sentences
Zang et al. Multimodal Enhanced Target Representation for Machine Translation

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination