CN115270825A - Method for automatically completing translated text fragments during post-editing of machine-translated text - Google Patents

Method for automatically completing translated text fragments during post-editing of machine-translated text

Info

Publication number
CN115270825A
CN115270825A (application CN202210942707.9A)
Authority
CN
China
Prior art keywords
translation
training
tgt
editing
sentence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210942707.9A
Other languages
Chinese (zh)
Inventor
毛红保
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Iol Wuhan Information Technology Co ltd
Original Assignee
Iol Wuhan Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Iol Wuhan Information Technology Co ltd filed Critical Iol Wuhan Information Technology Co ltd
Priority to CN202210942707.9A priority Critical patent/CN115270825A/en
Publication of CN115270825A publication Critical patent/CN115270825A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/40 Processing or translation of natural language
    • G06F 40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G06F 40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method for automatically completing translated text fragments during post-editing of machine-translated text, comprising the steps of synthesizing training data and training a translation-completion fragment prediction model. The beneficial effects of the invention are: the original sentence is machine-translated to obtain a machine-translated sentence; the translator reviews it, identifies a fragment that needs intervention, and deletes that fragment to obtain a translation to be completed. The original sentence and the translation to be completed are spliced and fed into the fragment prediction model, which automatically generates the content of the missing fragment; this content is backfilled into the sentence to obtain the completed translation, finishing the post-editing of that sentence. Because the content to be completed is generated automatically by the prediction model, the manual revision and typing otherwise done by the translator is replaced, and post-editing efficiency can be significantly improved.

Description

Method for automatically completing translated text fragments during post-editing of machine-translated text
Technical Field
The invention relates to a method for automatically completing translated text fragments, in particular during post-editing of machine-translated text, and belongs to the technical field of translation editing.
Background
With the continuous improvement of machine translation quality, post-editing of machine-translated text has become common practice for translators. During post-editing, the translator checks the quality of the machine-translated sentences one by one and revises them where necessary.
The quality of current neural machine translation engines has improved greatly, and in most cases a translator only needs to revise partial fragments of a translated sentence during post-editing. The revision process is: the translator first deletes the fragment to be revised and then manually types in new content to complete the sentence. Because this relies entirely on manual work, it is inefficient and the potential of AI-assisted translation is not fully exploited.
Disclosure of Invention
The present invention is directed to solving at least one of the above problems by providing a method for automatically completing translated text fragments during post-editing of machine-translated text.
The invention achieves this purpose through the following technical scheme. A method for automatically completing translated text fragments during post-editing of machine-translated text comprises the following steps:
Step one, synthesize training data. The translation-completion fragment predictor is a generative NLP model that must be learned on large-scale data. The training data is produced by two methods, sampling parallel corpora and generating data from the translation process, and the data produced by the two methods can be used independently or combined.
Step two, train the translation-completion fragment prediction model. A transformer-based model can be trained as a generative translation task, or fine-tuning can be performed on the basis of an encoder-decoder pre-trained model.
As a still further scheme of the invention: in the first step, sampling the parallel corpus specifically includes:
assume a parallel corpus pair (src, tgt), where src represents an original sentence and tgt its translation. Simulating the process of manual post-editing, a contiguous fragment of the tgt sentence is intercepted at random (keeping words and expressions intact) and recorded as tgt_fragment; that fragment in the tgt sentence is replaced with a <mask> token, and the resulting sentence is recorded as tgt_mask. This forms a new training triple (src, tgt_mask, tgt_fragment), where src and tgt_mask are the input text for model training and tgt_fragment is the output text. Because tgt_fragment is intercepted at random, varying the interception position generates multiple different target corpora from the same original pair (src, tgt).
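The sampling procedure above can be sketched as follows (a minimal illustration; the function and variable names are ours, not from the patent):

```python
import random

MASK = "<mask>"

def sample_training_example(src, tgt, rng=None):
    """Synthesize one training example from a parallel pair (src, tgt)
    by masking a random contiguous span of whole words in tgt.
    Splitting on whitespace keeps word integrity, as required."""
    rng = rng or random.Random()
    words = tgt.split()
    start = rng.randrange(len(words))           # first word of the span
    end = rng.randrange(start, len(words)) + 1  # one past the last word
    tgt_fragment = " ".join(words[start:end])
    tgt_mask = " ".join(words[:start] + [MASK] + words[end:])
    # (src, tgt_mask) is the model input; tgt_fragment is the target output.
    return src, tgt_mask, tgt_fragment
```

Repeated calls with different random states yield multiple target corpora from one (src, tgt) pair, as the text notes.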
As a still further scheme of the invention: in the first step, generating data based on the translation process specifically comprises:
the translation process data comes from recording and collecting real translation project data, including: the original sentence src, the machine translation mt, and the manually post-edited translation pe. Because pe is produced by modifying mt, the differing region between mt and pe can be found by comparison; that region in mt is replaced with a <mask> token and the result is recorded as mt_mask, while the corresponding region of pe becomes the model's prediction target, recorded as pe_fragment. This yields a training example with src and mt_mask as input and pe_fragment as output.
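A word-level sketch of this comparison using Python's standard difflib (the patent does not specify the diff algorithm; for simplicity this version merges all differing regions into a single masked span):

```python
import difflib

MASK = "<mask>"

def make_example_from_process_data(src, mt, pe):
    """Build one training pair from real post-editing data: find where
    the post-edited translation pe differs from the machine translation
    mt, replace that region in mt with <mask>, and take the corresponding
    pe span as the prediction target."""
    mt_words, pe_words = mt.split(), pe.split()
    sm = difflib.SequenceMatcher(a=mt_words, b=pe_words)
    diffs = [op for op in sm.get_opcodes() if op[0] != "equal"]
    if not diffs:
        return None  # nothing was edited; no example to synthesize
    # One span from the first to the last differing region.
    i1, j1 = diffs[0][1], diffs[0][3]
    i2, j2 = diffs[-1][2], diffs[-1][4]
    mt_mask = " ".join(mt_words[:i1] + [MASK] + mt_words[i2:])
    pe_fragment = " ".join(pe_words[j1:j2])
    return (src, mt_mask), pe_fragment
```

Run on the patent's own example, this reproduces the target corpus shown later in the description (mt_mask = "&lt;mask&gt; shall be considered for load application", pe_fragment = "Actual dimensions of stud walls").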
As a still further scheme of the invention: in the second step, training based on a transformer model specifically comprises:
the training process is similar to training a neural machine translation model: the input sentence is formed by splicing src and tgt_mask, with tgt_fragment as the output; or the input is formed by splicing src and mt_mask, with pe_fragment as the output.
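For concreteness, the splicing could look like this (the separator token is an assumption; the patent only says the two parts are spliced):

```python
SEP = "</s>"  # hypothetical separator token; the patent does not name one

def build_io(src, masked_translation, fragment):
    """Format one seq2seq training example: the input splices the source
    sentence with the masked translation; the output is the fragment."""
    model_input = f"{src} {SEP} {masked_translation}"
    return model_input, fragment
```

The same formatting serves both corpus types: (src, tgt_mask, tgt_fragment) pairs from sampling, and (src, mt_mask, pe_fragment) pairs from translation process data.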
As a still further scheme of the invention: in the second step, the transformer comprises an encoder and a decoder.
As a still further scheme of the invention: in the standard architecture of the transform, each encoder is divided into six layers, and each layer is divided into a feed-forward network and a multi-head attention layer; the decoder is also divided into six layers, each layer is composed of two attention layers and a feedforward network, one is a self-attention layer, and the other is an attention layer composed of the final output of the encoder.
As a still further scheme of the invention: in the second step, fine-tuning based on an encoder-decoder pre-trained model specifically comprises:
the method is characterized in that a DeltaLM pre-training model is adopted to finely tune training corpora of synthetic training data, the DeltaLM is a pre-training model based on an encoder-decoder framework, strong text representation and multilingual translation capability are achieved through parameter sharing and pre-training of large-scale monolingual and bilingual corpora, based on the pre-training model, fine tuning is conducted on tasks of a translation completion segment prediction model, and good translation segment prediction effect can be achieved under the condition that small-scale training corpora are used.
The invention has the beneficial effects that: the original sentence is machine-translated to obtain a machine-translated sentence; the translator reviews it, identifies a fragment that needs intervention, and deletes that fragment to obtain a translation to be completed. The original sentence and the translation to be completed are spliced and fed into the fragment prediction model, which automatically generates the missing content; this content is backfilled into the sentence to obtain the completed translation, finishing the post-editing of that sentence. Because the content to be completed is generated automatically by the prediction model, the translator's manual revision and typing is replaced, and post-editing efficiency can be significantly improved.
Drawings
FIG. 1 is a schematic overall flow diagram of the present invention;
FIG. 2 is a schematic diagram of a training process of the transformer model of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example one
As shown in FIG. 1, a method for automatically completing translated text fragments during post-editing of machine-translated text comprises the following steps:
Step one, synthesize training data. The translation-completion fragment predictor is a generative NLP model that must be learned on large-scale data. The training data is produced by two methods, sampling parallel corpora and generating data from the translation process, and the data produced by the two methods can be used independently or combined.
Step two, train the translation-completion fragment prediction model. A transformer-based model can be trained as a generative translation task, or fine-tuning can be performed on the basis of an encoder-decoder pre-trained model.
In the embodiment of the present invention, in the first step, the sampling of the parallel corpus specifically includes:
assume a parallel corpus pair (src, tgt), where src represents an original sentence and tgt its translation. Simulating the process of manual post-editing, a contiguous fragment of the tgt sentence is intercepted at random (keeping words and expressions intact) and recorded as tgt_fragment; that fragment in the tgt sentence is replaced with a <mask> token, and the resulting sentence is recorded as tgt_mask. This forms a new training triple (src, tgt_mask, tgt_fragment), where src and tgt_mask are the input text for model training and tgt_fragment is the output text. Because tgt_fragment is intercepted at random, varying the interception position generates multiple different target corpora from the same original pair (src, tgt).
Original parallel corpus:
src: load application considering actual size of column wall
tgt: Actual dimensions of stud walls shall be considered for load application
Target corpus one:
src: load application considering actual size of column wall
tgt_mask: Actual dimensions of <mask> shall be considered for load application
tgt_fragment: stud walls
Target corpus two:
src: load application considering actual size of column wall
tgt_mask: Actual dimensions of stud walls shall be considered for <mask>
tgt_fragment: load application
In the embodiment of the present invention, in the first step, generating data based on the translation process specifically comprises:
the translation process data comes from recording and collecting real translation project data, including: the original sentence src, the machine translation mt, and the manually post-edited translation pe. Because pe is produced by modifying mt, the differing region between mt and pe can be found by comparison; that region in mt is replaced with a <mask> token and the result is recorded as mt_mask, while the corresponding region of pe becomes the model's prediction target, recorded as pe_fragment. This yields a training example with src and mt_mask as input and pe_fragment as output.
Raw translation process data:
src: load application considering actual size of column wall
mt: The actual size of column and wall shall be considered for load application
pe: Actual dimensions of stud walls shall be considered for load application
Target corpus:
src: load application considering actual size of column wall
mt_mask: <mask> shall be considered for load application
pe_fragment: Actual dimensions of stud walls
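Given a fragment predicted by the model, the final post-editing step backfills it into the masked translation; completing the target corpus above, for example:

```python
MASK = "<mask>"

def backfill(masked_translation, predicted_fragment):
    """Replace the <mask> token with the predicted fragment to obtain
    the completed translation, finishing post-editing of the sentence."""
    return masked_translation.replace(MASK, predicted_fragment, 1)
```

Calling backfill("&lt;mask&gt; shall be considered for load application", "Actual dimensions of stud walls") recovers the post-edited translation pe.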
Example two
Referring to FIGS. 1-2, a method for automatically completing translated text fragments during post-editing of machine-translated text comprises the following steps:
Step one, synthesize training data. The translation-completion fragment predictor is a generative model based on NLP (natural language processing) that must be learned on large-scale data. The training data is produced by two methods, sampling parallel corpora and generating data from the translation process, and the data produced by the two methods can be used independently or combined.
Step two, train the translation-completion fragment prediction model. A transformer-based model can be trained as a generative translation task, or fine-tuning can be performed on the basis of an encoder-decoder pre-trained model.
In the embodiment of the present invention, in the second step, training based on a transformer model specifically comprises:
the training process is similar to training a neural machine translation model: the input sentence is formed by splicing src and tgt_mask, with tgt_fragment as the output; or the input is formed by splicing src and mt_mask, with pe_fragment as the output.
In the embodiment of the present invention, in the second step, the transformer comprises an encoder and a decoder.
In the embodiment of the invention, in the standard transformer architecture, the encoder is a stack of six layers, each consisting of a multi-head attention layer and a feed-forward network; the decoder is also a stack of six layers, each consisting of two attention layers and a feed-forward network, one being a self-attention layer and the other an encoder-decoder attention layer over the final encoder output.
In the embodiment of the present invention, in the second step, fine-tuning based on an encoder-decoder pre-trained model specifically comprises:
the method is characterized in that a DeltaLM pre-training model is adopted to finely tune training corpora of synthetic training data, the DeltaLM is a pre-training model based on an encoder-decoder framework, strong text representation and multilingual translation capability are achieved through parameter sharing and pre-training of large-scale monolingual and bilingual corpora, based on the pre-training model, fine tuning is conducted on tasks of a translation completion segment prediction model, and good translation segment prediction effect can be achieved under the condition that small-scale training corpora are used.
The working principle is as follows: a two-stage fine-tuning scheme is used. In the first stage, fine-tuning is performed on the corpus synthesized by sampling parallel corpora; after the first stage finishes, the second stage fine-tunes on the corpus synthesized from translation process data. Through this two-stage fine-tuning, a better model prediction effect can be achieved.
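The two-stage schedule amounts to running the same fine-tuning procedure twice over different corpora; a minimal sketch, where `finetune_pass` is a hypothetical stand-in for one fine-tuning pass, not an API from the patent:

```python
def two_stage_finetune(finetune_pass, sampled_corpus, process_corpus):
    """Run stage 1 on the corpus synthesized by sampling parallel
    corpora, then stage 2 on the corpus derived from real translation
    process data, as the working principle describes."""
    for stage, corpus in ((1, sampled_corpus), (2, process_corpus)):
        finetune_pass(stage, corpus)
```

The ordering matters: the cleaner, more abundant sampled corpus shapes the model first, and the smaller but more realistic process-data corpus adapts it last.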
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.
Furthermore, it should be understood that although the present description refers to embodiments, not every embodiment may contain only a single embodiment, and such description is for clarity only, and those skilled in the art should integrate the description, and the embodiments may be combined as appropriate to form other embodiments understood by those skilled in the art.

Claims (7)

1. A method for automatically completing translated text fragments during post-editing of machine-translated text, characterized by comprising the following steps:
Step one, synthesizing training data, wherein the translation-completion fragment predictor is a generative model based on NLP (natural language processing), obtained through learning and training on data, and the training data is generated by two methods: sampling parallel corpora and generating data from the translation process;
Step two, training the translation-completion fragment prediction model, either by training a transformer-based model as a generative translation task or by fine-tuning on the basis of an encoder-decoder pre-trained model.
2. The method according to claim 1, wherein in step one, sampling the parallel corpus specifically comprises:
1) Assuming a parallel corpus pair (src, tgt), wherein src represents an original sentence and tgt represents a translated sentence;
2) Simulating the process of manual post-editing: randomly intercepting a contiguous fragment of the tgt sentence while keeping words and expressions intact, recorded as tgt_fragment; replacing that fragment in the tgt sentence with a <mask> token, the resulting sentence recorded as tgt_mask; thereby forming a new triple (src, tgt_mask, tgt_fragment), wherein src and tgt_mask are the input text for model training and tgt_fragment is the output text;
3) Because tgt_fragment is intercepted at random, varying the interception position generates multiple different target corpora from the same original pair (src, tgt).
3. The method according to claim 1, wherein in step one, generating data based on the translation process specifically comprises:
1) The translation process data is derived from recording and collecting real translation project data, comprising: the original sentence src, the machine translation mt, and the manually post-edited translation pe;
2) Because the post-edited translation pe is produced by modifying the machine translation mt, the differing region between mt and pe is found by comparison; that region in mt is replaced with a <mask> token, recorded as mt_mask, and the corresponding region of pe is taken as the model's prediction target, recorded as pe_fragment; thereby obtaining a training example with src and mt_mask as input and pe_fragment as output.
4. The method according to claim 1, wherein in step two, training based on a transformer model specifically comprises:
the input sentence is composed of src and tgt mask Spliced to form a single output of tgt fragment Or input by src and mt mask Formed by splicing, the output is pe fragment
5. The method according to claim 1, wherein in step two, the transformer comprises an encoder and a decoder.
6. The method according to claim 5, wherein in the standard transformer architecture, the encoder is a stack of six layers, each consisting of a multi-head attention layer and a feed-forward network, and the decoder is also a stack of six layers, each consisting of two attention layers and a feed-forward network, one being a self-attention layer and the other an encoder-decoder attention layer over the final encoder output.
7. The method according to claim 1, wherein in step two, fine-tuning based on an encoder-decoder pre-trained model specifically comprises:
the method is characterized in that a DeltaLM pre-training model is adopted to finely tune on a training corpus of synthetic training data, the DeltaLM is a pre-training model based on an encoder-decoder framework, and the DeltaLM pre-training model has text representation and multilingual translation capabilities through parameter sharing and pre-training of large-scale monolingual and bilingual corpora.
CN202210942707.9A 2022-08-08 2022-08-08 Method for automatically completing translated text fragments during post-editing of machine-translated text Pending CN115270825A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210942707.9A CN115270825A (en) 2022-08-08 2022-08-08 Method for automatically completing translated text fragments during post-editing of machine-translated text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210942707.9A CN115270825A (en) 2022-08-08 2022-08-08 Method for automatically completing translated text fragments during post-editing of machine-translated text

Publications (1)

Publication Number Publication Date
CN115270825A (en) 2022-11-01

Family

ID=83749940

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210942707.9A Pending CN115270825A (en) 2022-08-08 2022-08-08 Method for automatically completing translated text fragments during post-editing of machine-translated text

Country Status (1)

Country Link
CN (1) CN115270825A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230070302A1 (en) * 2021-09-07 2023-03-09 Lilt, Inc. Partial execution of translation in browser
US11900073B2 (en) * 2021-09-07 2024-02-13 Lilt, Inc. Partial execution of translation in browser

Similar Documents

Publication Publication Date Title
CN111597778B (en) Automatic optimizing method and system for machine translation based on self-supervision
CN105005642B (en) A kind of threedimensional model batch format conversion and light weight method
CN110365933A (en) A kind of online generating means of video conference meeting summary and method based on AI
CN108363704A (en) A kind of neural network machine translation corpus expansion method based on statistics phrase table
JPH06110701A (en) Device and method for converting computer program language
CN106529028A (en) Technological procedure automatic generating method
Kenny Human and machine translation
CN115270825A (en) Method for automatically completing translated text fragments during post-editing of machine-translated text
Moors et al. Human language technology audit 2018: Analysing the development trends in resource availability in all South African languages
Shen et al. Data player: Automatic generation of data videos with narration-animation interplay
CN112765948B (en) Document generation editing method
CN113657125B (en) Mongolian non-autoregressive machine translation method based on knowledge graph
Morgan et al. Facilitating the spread of new sign language technologies across Europe
JP2002063033A (en) Knowledge control system provided with ontology
CN101330389A (en) Method and system for composing group decision plan based on question disintegration
Nallusamy et al. A software redocumentation process using ontology based approach in software maintenance
CN114116779A (en) Deep learning-based power grid regulation and control field information retrieval method, system and medium
Svoboda Computing and Translation: An Overview for Technical Communicators
Schneider Notes on Social Production: A Brief Commentary
Bu Research on Computer Aided English Translation in the Wave of Globalization
Wang et al. Intelligent English Automatic Translation System Based on Multi-Feature Fusion
Larraz Towards Automated Fact-Checking in Africa: The Experience With Artificial Intelligence at Africa Check
Shiina et al. Comment Generation System for Program Procedure Learning
KR20230156595A (en) Learning method and apparatus of artficial intelligence translation model for multilingual sentences
Zang et al. Multimodal Enhanced Target Representation for Machine Translation

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination