CN107885729B

CN107885729B - An Interactive Machine Translation Method Based on Bilingual Fragments

Info

Publication number: CN107885729B
Application number: CN201710877018.3A
Authority: CN
Inventors: 叶娜
Original assignee: Shenyang Aerospace University
Current assignee: Shenyang Aerospace University
Priority date: 2017-09-25
Filing date: 2017-09-25
Publication date: 2021-05-11
Anticipated expiration: 2037-09-25
Also published as: CN107885729A

Abstract

The invention relates to an interactive machine translation method based on bilingual fragments. The steps are: establishing a mathematical model: for each source language fragment, a plurality of translation options are provided to the translator, and the optimal translation is obtained through the mathematical model; designing the translator interface: the interface includes an interactive area and an editing area, where the interactive area gives the phrase-segmented source sentence and the translation options, and the editing area gives the machine translation once the translator completes the confirmation and clicks the "translate" button; decoding: after the translator completes the confirmation of bilingual fragments in the interactive area, the translator's choice of translation option for each fragment f_i and the current segmentation result of the source sentence are captured, and a phrase-based statistical machine translation decoder is implemented through a multi-stack decoding algorithm. The invention improves the interaction protocol by allowing the translator to confirm bilingual fragments, which provides more clues to the translator, gives the decoder more direct guidance, reduces human labor in the process of human-computer interaction, and improves the efficiency and translation quality of interactive machine translation.

Description

Interactive machine translation method based on bilingual fragments

Technical Field

The invention relates to a natural language translation technology, in particular to an interactive machine translation method based on bilingual fragments.

Background

Statistical machine translation and neural machine translation techniques have resulted in a significant improvement in the performance of machine translation systems. However, in many tasks with higher quality requirements, the output quality of machine translation is still insufficient and must be modified by a human translator during post-editing before it can be used.

To enhance human-computer collaboration, Foster proposed interactive machine translation techniques. In an interactive machine translation system, a modification-prediction process is repeated. First, the system provides an initial translation. The translator then confirms the longest correct prefix in it and modifies the next word. Next, the system predicts a new suffix that is expected to be better than the previous one. This process is repeated until a correct translation is obtained.

Recently, this left-to-right protocol (i.e., the interaction process described in the above paragraph) has been extended to make human-computer interaction more flexible. In the extended protocol, the translator may identify the segments that should be retained in the translation. However, this protocol still suffers from three problems: first, the location of a confirmed segment is unknown, so the search process can only be optimized in the form of a soft constraint; second, the translator's confirmation is limited to translations provided by the system, and no clues about other translation options are available; third, identifying the correct segments within an incorrect translation often requires a great deal of cognitive effort, especially when the translation is of low quality.

Disclosure of Invention

In view of the above problems in the prior art, the present invention aims to provide a bilingual segment-based interactive machine translation method that provides more clues to the translator and gives the decoder more direct guidance.

In order to solve the technical problems, the invention adopts the technical scheme that:

the invention relates to an interactive machine translation method based on bilingual fragments, which comprises the following steps:

1) establishing a mathematical model: for each source language segment, a plurality of translation options are provided to the translator, and the optimal translation is obtained through the mathematical model;

2) designing a translator interface: the interface comprises an interactive area and an editing area, wherein the interactive area provides the phrase-segmented source sentence and the translation options provided to the translator in step 1), and the editing area provides the machine translation when the translator finishes confirming and clicks the "translate" button;

3) decoding: after the translator has completed the confirmation of bilingual segments in the interactive area, the translator's selected translation option for each segment f_i and the current segmentation result of the source sentence are captured, and a phrase-based statistical machine translation decoder is implemented through a multi-stack decoding algorithm.

The mathematical model is implemented by the following formula:

wherein e_i is the correct translation of f_i confirmed by the translator, f_i is the i-th source language segment, t is a candidate translation, N is the number of bilingual segments, i is the index of the bilingual segment, P is the translation probability of the candidate translation, and s is the source sentence.

The translator interface also has three auxiliary functions, namely segment splitting-merging, translation option reordering and suffix prediction. For segment splitting-merging, two double-headed arrows are arranged above each segment: the arrow whose heads point outward is the split arrow, which splits the segment into two shorter segments; the arrow whose heads point inward is the merge arrow, which merges the current segment and the following segment into one longer segment.

If the phrase table contains no shorter or longer segment, the corresponding arrow does not appear; otherwise, the arrow appears when the mouse is placed over the segment.

Translation option reordering works as follows: the translator selects either the default mode or the reordering mode before starting the translation; when a new segment is generated, its translation options change; in the default mode, the options of the segment are displayed in their phrase-table order; when the reordering mode is selected, the top N translation options in the phrase table are reordered to generate a new option list.

The reordering is as follows:

for each source language phrase p, a new option list T is set up; first, the highest-scoring option in the original phrase table is added to T; then the remaining N-1 options are traversed, and the option with the highest diversity relative to the options already in T is found and added to T; this traversal is repeated over the remaining options until all N options are reordered. The diversity between translation options t_a and t_b is calculated by the following formula:

wherein c(t_a, t_b) is the number of words shared by translation options t_a and t_b, and a and b are the indices of the translation options.

The suffix prediction is as follows: in the editing area, the translator clicks the "predict" button to obtain a predicted suffix from the system; when the button is clicked, the current position of the cursor is recorded, and the text before the cursor is used as the prefix; the confirmed bilingual segments and the prefix are both used as constraints to find the optimal suffix; when a new suffix is generated, it replaces the current suffix.

If the decoder does not find any matching candidate translations, the suffix is not altered.

The decoding includes the following processes:

construct a set as a constraint for decoding:

C = {S, <p₁, f₁, e₁>, <p₂, f₂, e₂>, ..., <p_N, f_N, e_N>} (3)

wherein p_i is the position of segment f_i in the source sentence; f_i is the i-th segment; S is the current segmentation of the source sentence; e_i is the correct translation of f_i confirmed by the translator; and N is the number of bilingual segments.

S is taken as the only segmentation of the source sentence during decoding;

the translation options of each source language phrase or segment are restricted by <p_i, f_i, e_i>: only e_i is retained and participates in the subsequent decoding process.

The translator must click on an option for that option and its source language segment to become a confirmed bilingual segment; if no translation option of a segment has been clicked, the segment and its options cannot be used as decoding constraints.
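The decoding procedure described above can be illustrated with a heavily simplified stack decoder. This is a sketch under stated assumptions, not the decoder of the invention: the function names, the toy language model, the scores and the unrestricted reordering are all hypothetical, and hypothesis recombination and future-cost estimation are omitted. It only shows how a fixed segmentation and single-option confirmed segments narrow the search.

```python
import heapq

def stack_decode(options, lm_score, beam=10):
    """Simplified multi-stack decoding over a fixed segmentation S.

    options[i] -- list of (translation, log_prob) pairs for source segment i;
                  a confirmed segment holds exactly one pair (its e_i).
    lm_score   -- hypothetical callable (previous_words, new_words) -> float.
    """
    n = len(options)
    # stacks[k] holds hypotheses that cover k source segments.
    # A hypothesis is (score, target_words, covered_segment_indices).
    stacks = [[] for _ in range(n + 1)]
    stacks[0].append((0.0, (), frozenset()))
    for k in range(n):
        # Histogram pruning: expand only the `beam` best hypotheses of stack k.
        for score, words, covered in heapq.nlargest(beam, stacks[k]):
            for i in range(n):
                if i in covered:
                    continue
                for translation, log_prob in options[i]:
                    new_words = tuple(translation.split())
                    new_score = score + log_prob + lm_score(words, new_words)
                    stacks[k + 1].append(
                        (new_score, words + new_words, covered | {i}))
    if not stacks[n]:
        return ""
    return " ".join(max(stacks[n])[1])


# Toy bigram "language model" (an assumption for illustration only):
# listed word pairs cost nothing, every other pair costs 1.
GOOD_BIGRAMS = {("<s>", "we"), ("we", "study"), ("study", "machine")}

def toy_lm(previous_words, new_words):
    last = previous_words[-1] if previous_words else "<s>"
    return 0.0 if (last, new_words[0]) in GOOD_BIGRAMS else -1.0

options = [
    [("we", -0.1)],                         # confirmed: single option
    [("study", -0.3), ("research", -0.2)],  # unconfirmed: two options
    [("machine translation", -0.1)],        # confirmed: single option
]
print(stack_decode(options, toy_lm))
```

Because each confirmed segment contributes exactly one option, the number of hypotheses added per stack shrinks accordingly, which is the sense in which confirmation acts as a hard constraint rather than a soft one.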

The invention has the following beneficial effects and advantages:

1. The invention improves the interaction protocol by allowing the translator to confirm bilingual segments. This provides more clues to the translator, gives the decoder more direct guidance, reduces human labor in the human-computer interaction process, and improves the efficiency and translation quality of interactive machine translation; confirming a bilingual segment is also easier than identifying the correct segment within an incorrect translation.

2. The invention also designs an interface for real translators, allows the translator to split and merge the segmented phrases, and provides a reordering method that increases the diversity of translation options, which helps improve interactive translation efficiency in real scenarios. Experimental results with real translators show that the new protocol improves the efficiency and quality of interactive machine translation on three Chinese-English translation tasks.

Drawings

FIG. 1 is a diagram of an example bilingual-segment-based interactive machine translation protocol according to the present invention;

FIG. 2 is a diagram of the translator interface of the bilingual-segment-based interactive machine translation system of the present invention.

Detailed Description

The invention is further elucidated with reference to the accompanying drawings.

Aiming at the problems in the interactive machine translation, the interactive protocol is improved, the translator is allowed to confirm the bilingual segment, more clues are provided for the translator, the decoder is given more direct guidance, the human labor in the human-computer interaction process is reduced, and the interactive machine translation efficiency and the translation quality are improved.

3) decoding: after the translator has completed the confirmation of bilingual segments in the interactive area, the translator's selected translation option for each segment f_i and the current segmentation result of the source sentence are captured, and a phrase-based statistical machine translation decoder is implemented through a multi-stack decoding algorithm.

In step 1), the source language segments are aligned with their corresponding target language segments. For each source language segment, multiple translation options are provided. The translator can confirm bilingual segments of the form <f_i, e_i>. The optimal translation is obtained by the following formula:

wherein e_i is the correct translation of f_i confirmed by the translator, f_i is the i-th segment, t is a candidate translation, N is the number of bilingual segments, i is the index of the bilingual segment, P is the translation probability of the candidate translation, and s is the source sentence.

In equation (1), the search space consists of the translation hypotheses that are consistent with these bilingual segments.
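Equation (1) itself is not reproduced in this text. One form that is consistent with the symbols defined above, given here as an assumption rather than as the exact formula of the invention, is:

```latex
\hat{t} \;=\; \operatorname*{arg\,max}_{\substack{t \,:\; t \text{ translates } f_i \text{ as } e_i \\ \text{for all } i = 1, \dots, N}} \; P(t \mid s)
```

Here \hat{t} is a symbol introduced only for illustration and denotes the optimal translation. The condition under the arg max expresses the restriction of the search space to hypotheses that agree with every confirmed bilingual segment <f_i, e_i>.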

As shown in FIG. 1, an example of the new protocol is given. The translator has confirmed three bilingual segments (the boxed portions of the figure), and the decoder has produced a better translation; the translator then enters the prefix "a" and the sentence is decoded again, yielding the correct translation IT-2.

In step 2), the present invention employs the translator interface shown in FIG. 2. The interface consists of two areas. One is the interactive area, which shows the phrase-segmented source sentence and the translation options, with each segment left-aligned with its options. When the mouse is placed on a source language segment, a menu with the K-best translation options is displayed, and the translator can click to confirm the preferred option. The other is the editing area, which shows the machine translation once the translator completes the confirmation and clicks the "translate" button. There the translator may make modifications freely until the translation is accepted. The interactive process and the editing process may alternate.

One prominent feature of phrase-based statistical machine translation is that it extracts translations of longer phrases. Using long phrases as basic translation units effectively alleviates word sense ambiguity and yields good results. Therefore, longer segments and their translations are displayed preferentially in the interface: the source sentence is initially segmented against the phrase table using the forward maximum matching algorithm, and the translation options displayed are the top K options in the phrase table.
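The initial segmentation can be sketched as follows. This is a minimal illustration that assumes the phrase table is a dictionary from source-word tuples to score-sorted lists of (translation, score) pairs; that layout, the function names and the `max_len` limit are assumptions, and the toy entries are invented for the example.

```python
def forward_max_match(words, phrase_table, max_len=7):
    """Segment `words` left to right, always taking the longest phrase
    found in `phrase_table`; a single word with no entry still becomes
    its own segment so that the loop always advances."""
    segments = []
    i = 0
    while i < len(words):
        # Try the longest candidate first and shrink until one matches.
        for n in range(min(max_len, len(words) - i), 0, -1):
            candidate = tuple(words[i:i + n])
            if n == 1 or candidate in phrase_table:
                segments.append(candidate)
                i += n
                break
    return segments


def top_k_options(segment, phrase_table, k=10):
    """Return the first k translations listed for `segment`."""
    return [translation for translation, _ in phrase_table.get(segment, [])[:k]]


# Toy phrase table (invented entries, sorted by score).
phrase_table = {
    ("我们",): [("we", -0.1)],
    ("研究",): [("study", -0.3), ("research", -0.5)],
    ("机器", "翻译"): [("machine translation", -0.2)],
}
segments = forward_max_match(["我们", "研究", "机器", "翻译"], phrase_table)
print(segments)
print(top_k_options(("研究",), phrase_table, k=1))
```

Preferring the longest match means "机器 翻译" is kept as one unit instead of being translated word by word, which is the behaviour the paragraph above motivates.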

The translator interface also provides three auxiliary functions: segment splitting-merging, translation option reordering, and suffix prediction.

a. Fragment splitting-merging

Two double-headed arrows are arranged above each segment. The arrow whose heads point outward is the split arrow, which splits the segment into two shorter segments; the arrow whose heads point inward is the merge arrow, which merges the current segment and the following segment into one longer segment.

If no shorter or longer segments are present in the phrase table, the arrow does not appear. Otherwise, when the mouse is placed over the segment, the arrow will appear. Once a new segment is generated, its translation options are changed.
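The split and merge operations, together with the rule that decides whether an arrow is shown, might look like the sketch below. The function names are hypothetical, and the choice of split point (the rightmost cut whose two halves are both in the phrase table) is an assumption, since the text does not say which cut is used when several are possible.

```python
def split_points(segment, phrase_table):
    """Cut positions at which both halves of `segment` are phrase-table entries."""
    return [k for k in range(1, len(segment))
            if segment[:k] in phrase_table and segment[k:] in phrase_table]


def can_split(segments, idx, phrase_table):
    """True if the split arrow should be shown for segment `idx`."""
    return bool(split_points(segments[idx], phrase_table))


def can_merge(segments, idx, phrase_table):
    """True if the merge arrow should be shown for segment `idx`."""
    return (idx + 1 < len(segments)
            and segments[idx] + segments[idx + 1] in phrase_table)


def split_segment(segments, idx, phrase_table):
    """Split segment `idx` in two; return an unchanged copy if impossible."""
    points = split_points(segments[idx], phrase_table)
    if not points:
        return list(segments)
    k = points[-1]  # ASSUMPTION: keep the left half as long as possible
    seg = segments[idx]
    return segments[:idx] + [seg[:k], seg[k:]] + segments[idx + 1:]


def merge_segment(segments, idx, phrase_table):
    """Merge segment `idx` with its successor; unchanged copy if impossible."""
    if not can_merge(segments, idx, phrase_table):
        return list(segments)
    merged = segments[idx] + segments[idx + 1]
    return segments[:idx] + [merged] + segments[idx + 2:]


# Toy phrase table: only membership matters here, so a set is enough.
table = {("a",), ("b",), ("c",), ("a", "b")}
segs = [("a", "b"), ("c",)]
after_split = split_segment(segs, 0, table)
after_merge = merge_segment(after_split, 0, table)
print(after_split, after_merge)
```

Both operations return a new segmentation, so the interface can refresh the translation options of the affected segments from the phrase table after each click.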

b. Translation option reordering

By default, the options of a segment are displayed in their phrase-table order. However, the highest-scoring options are sometimes very similar. The invention therefore provides an alternative mode that increases the diversity of the options. The translator may select either the default mode or the reordering mode before beginning the translation.

In reordering mode, the top N translation options in the phrase table are reordered to produce a new option list. For each source language phrase p, a new option list T (initially empty) is set up. First, the highest-scoring option in the original phrase table is added to T. Then the remaining N-1 options are traversed, and the option with the highest diversity relative to the options already in T is found and added to T. This process is repeated until all N options are reordered. The diversity between translation options t_a and t_b is calculated by the following formula:

wherein c(t_a, t_b) is the number of words shared by t_a and t_b (after lemmatization), and a and b are the indices of the translation options.
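The greedy reordering can be sketched as below. Two points are assumptions, because the diversity formula is not reproduced in this text: the `diversity` function is a stand-in (one minus the shared-word count divided by the length of the longer option), and an option's diversity "relative to T" is taken as its minimum diversity against any member of T. Lemmatization is replaced by simple lowercasing.

```python
from collections import Counter


def shared_words(t_a, t_b):
    """c(t_a, t_b): number of words the two options share (multiset overlap)."""
    overlap = Counter(t_a.lower().split()) & Counter(t_b.lower().split())
    return sum(overlap.values())


def diversity(t_a, t_b):
    """ASSUMPTION: stand-in for the diversity formula, in [0, 1]."""
    longest = max(len(t_a.split()), len(t_b.split()))
    if longest == 0:
        return 0.0
    return 1.0 - shared_words(t_a, t_b) / longest


def reorder(options, n=20):
    """Greedily reorder the top-n options so that dissimilar ones come first."""
    pool = list(options[:n])
    if not pool:
        return []
    reordered = [pool.pop(0)]  # the highest-scoring option always stays first
    while pool:
        # ASSUMPTION: diversity to T = minimum diversity to any member of T.
        # max() keeps the earliest candidate on ties, i.e. phrase-table order.
        best = max(pool, key=lambda t: min(diversity(t, u) for u in reordered))
        pool.remove(best)
        reordered.append(best)
    return reordered


ranked = ["the study", "a study", "the research", "investigate"]
print(reorder(ranked))
```

In the toy list, "investigate" shares no word with "the study" and is therefore promoted to second place, while the two near-duplicates of the top option move down.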

c. Suffix prediction

For the auxiliary function of suffix prediction, a constraint is added in the decoder: the translation hypothesis must match the given prefix t_p.

In the editing area, the translator may click the "predict" button to obtain a predicted suffix from the system. When the button is clicked, the current position of the cursor is recorded, and the text before the cursor is used as the prefix. Both the confirmed bilingual segments and the prefix are used as constraints to find the optimal suffix. Once a new suffix is generated, it replaces the current suffix. If the decoder does not find any compatible hypothesis, the suffix is not altered.
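The behaviour of the "predict" button can be illustrated with the sketch below. It swaps in a simpler technique than the one described: instead of enforcing the prefix inside the decoder's search, it filters an n-best list that is assumed to have been produced already under the confirmed-segment constraints. The function name and the n-best format are hypothetical.

```python
def predict_suffix(text, cursor, nbest):
    """Return the editing-area text after a click on "predict".

    text   -- current content of the editing area
    cursor -- cursor offset; text[:cursor] is the prefix t_p
    nbest  -- list of (hypothesis, score) pairs, ASSUMED to respect the
              confirmed bilingual segments already
    """
    prefix = text[:cursor]
    matches = [(score, hyp) for hyp, score in nbest if hyp.startswith(prefix)]
    if not matches:
        # No compatible hypothesis: the current suffix is left unchanged.
        return text
    _, best = max(matches)
    # Keep the prefix and replace everything after the cursor.
    return prefix + best[len(prefix):]


nbest = [
    ("we study machine translation", -1.0),
    ("a study of machine translation", -1.5),
]
print(predict_suffix("a stdy machine", 2, nbest))
print(predict_suffix("the study", 4, nbest))
```

The second call shows the fallback: no hypothesis starts with "the ", so the text in the editing area is returned as it was.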

In step 3), the decoding process is as follows:

After the translator completes the confirmation of bilingual segments in the interactive area, the system captures the translator's selected translation option for each segment f_i and the current segmentation result S of the source sentence. A set is constructed as the constraint for decoding:

C = {S, <p₁, f₁, e₁>, <p₂, f₂, e₂>, ..., <p_N, f_N, e_N>} (3)

wherein p_i is the position of segment f_i in the source sentence; f_i is the i-th segment; S is the current segmentation result of the source sentence; and e_i is the correct translation of f_i confirmed by the translator.

The position p_i is recorded to avoid the ambiguity caused by multiple occurrences of the same segment. The translator must click on an option for that option and its source language segment to become a confirmed bilingual segment. If no translation option of a segment has been clicked, the segment and its options cannot be used as decoding constraints.
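Building the constraint set and applying it to the translation options could be sketched as follows. The data layout is an assumption: p_i is taken to be the word offset at which the segment starts, the clicks are passed as a dictionary from segment index to the chosen translation, and the phrase table maps source-word tuples to score-sorted (translation, score) lists. All names are hypothetical.

```python
from typing import NamedTuple


class Confirmed(NamedTuple):
    position: int   # p_i: word offset of the segment in the source sentence
    source: tuple   # f_i: the source language segment
    target: str     # e_i: the translation the translator clicked


def build_constraints(segmentation, clicked):
    """Build C = {S, <p_i, f_i, e_i>, ...} from the clicked segments only."""
    confirmed = []
    offset = 0
    for idx, segment in enumerate(segmentation):
        if idx in clicked:  # segments that were never clicked add no constraint
            confirmed.append(Confirmed(offset, segment, clicked[idx]))
        offset += len(segment)
    return {"S": list(segmentation), "confirmed": confirmed}


def restrict_options(constraints, phrase_table, k=10):
    """Keep only e_i for confirmed segments; others keep their top-k options."""
    by_position = {c.position: c for c in constraints["confirmed"]}
    options = []
    offset = 0
    for segment in constraints["S"]:
        if offset in by_position:
            options.append([by_position[offset].target])
        else:
            options.append([t for t, _ in phrase_table.get(segment, [])[:k]])
        offset += len(segment)
    return options


# The segment ("a",) occurs twice; only its second occurrence was clicked.
segmentation = [("a",), ("b", "c"), ("a",)]
table = {("a",): [("x1", -0.1), ("x2", -0.2)], ("b", "c"): [("y", -0.1)]}
constraints = build_constraints(segmentation, {2: "X"})
print(restrict_options(constraints, table))
```

Keying the confirmed segments by position rather than by their words is what keeps the first, unconfirmed occurrence of ("a",) free to use any of its options.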

Table 1 shows a comparative example of a real interactive machine translation.

Table 1. Comparative example of interactive machine translation protocols

In this example, the prefix-based protocol requires 6 decodings, including 2 tense changes ("study" and "consider"), 1 insertion of a missing word ("functions"), and 1 word-order adjustment ("of"). In contrast, the protocol of the present invention decodes only twice after the bilingual segments are confirmed, and the correct translation options for the content words are all displayed in the list, so the translator can click on them directly for confirmation.

(1) Data setting

The present invention was tested on three different Chinese-English translation tasks with real translators. "Law" is the legal text of the LDC2000T47 corpus. "Meeting record" is the meeting record text of the LDC2000T50 corpus. "News" is the news text of the LDC2000T46 corpus. Table 2 gives the main information for these corpora (S, T and V indicate the number of sentences, the number of words, and the size of the vocabulary, respectively; K and M represent thousands and ten thousand, respectively).

Table 2. Main information of the test corpora

The Chinese part of the data is preprocessed with the ICTCLAS word segmentation tool, and the English part is tokenized and lowercased. The word alignment model is trained with GIZA++, a 5-gram language model is trained with IRSTLM, and a phrase-based statistical machine translation model is built with Moses; the model includes the 14 default features, and the feature weights are tuned with MERT.

Three interactive machine translation systems were evaluated in the experiment. Baseline is a prefix-based system, BiSeg is the bilingual-segment system without the option reordering function, and BiSeg+D is the same system with the option reordering function. In the translator interface, the number of translation options displayed is set to 10 and the number of reordered translation options is set to 20.

(2) Evaluation index

In the field of interactive machine translation, because experiments with real translators are costly, automatic evaluation metrics are mainly used to evaluate prototype systems. These metrics simulate translator behavior rather than measure actual translator behavior during interaction. However, direct evaluation of interactive machine translation systems still requires experiments with real translators. The invention evaluates the performance of the interactive machine translation system with real translators in terms of efficiency and quality. Three indices are used to evaluate translation efficiency: translation time, keystroke and mouse-action ratio (KSMR), and number of decodings.
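The text names KSMR without spelling out how it is normalized. In the interactive machine translation literature it is commonly computed as the number of keystrokes plus the number of mouse actions, divided by the number of characters in the reference translation; the sketch below assumes that definition, and the function name and the sample counts are hypothetical.

```python
def ksmr(keystrokes, mouse_actions, reference):
    """Keystroke and mouse-action ratio for one sentence.

    ASSUMPTION: normalized by the character length of the reference
    translation, as is common in the literature.
    """
    if not reference:
        raise ValueError("reference translation must not be empty")
    return (keystrokes + mouse_actions) / len(reference)


# Invented interaction log for one sentence: 12 keystrokes, 8 mouse actions.
reference = "x" * 100
print(ksmr(12, 8, reference))
```

Under this definition, every click that confirms a bilingual segment counts as a mouse action, so a protocol that replaces typing with clicking does not necessarily lower the ratio even when it saves time.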

Translation quality is evaluated by the BLEU score, using the English side of the original bilingual corpus as the reference translation. The final translation accepted by the translator is considered correct even though it may not be identical to the reference translation.

(3) Participants and processes

Nine investigators (6 women) volunteered to participate in the experiment as non-professional translators. They are all native speakers of Chinese and are proficient in English. In this example, the participants are randomly divided into 3 groups (G₁ to G₃) of 3 people each. The test set of each corpus is randomly divided into 3 parts (C₁ to C₃) of 25 sentences each. The evaluation was performed in a balanced manner, as shown in Table 3.
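Table 3 is not reproduced in this text. One common way to obtain a balanced arrangement of three groups, three corpus parts and three systems is a Latin square, sketched below as an assumption about what "balanced" means here; the actual assignment used in the experiment may differ, and the function name is hypothetical.

```python
def latin_square_assignment(groups, parts, systems):
    """Map each (group, part) pair to a system so that every group uses every
    system exactly once and every part is translated with every system once."""
    n = len(systems)
    if not (len(groups) == len(parts) == n):
        raise ValueError("groups, parts and systems must have the same size")
    return {
        (group, part): systems[(gi + pi) % n]
        for gi, group in enumerate(groups)
        for pi, part in enumerate(parts)
    }


groups = ["G1", "G2", "G3"]
parts = ["C1", "C2", "C3"]
systems = ["Baseline", "BiSeg", "BiSeg+D"]
plan = latin_square_assignment(groups, parts, systems)
for group in groups:
    print(group, [plan[(group, part)] for part in parts])
```

With such a design no participant sees the same sentence twice, and differences between groups or between corpus parts are spread evenly across the three systems.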

Table 3. Translation task assignment

(4) Results and analysis

Table 4 shows the average translation time of the three translator groups on the test corpora. The numbers in parentheses are the relative differences between the inventive system and the baseline system.

Table 4. Translation time of the different interactive machine translation systems

It can be seen that the translation time of the inventive system is significantly lower than that of the baseline system. This indicates a significant reduction in human labor. Diversifying the translation options may further reduce human labor.

Table 5 gives the KSMR values over the three corpora.

Table 5. KSMR values of the different interactive machine translation systems

It can be seen that the KSMR values of the inventive system are significantly higher than those of the baseline system. However, the additional mouse actions require little thought or time, so they have little impact on translation efficiency.

Table 6 gives the number of decodings over the three corpora.

Table 6. Number of decodings of the different interactive machine translation systems

Table 6 shows that the number of decodings in the new protocol is significantly reduced.

The translation quality (BLEU value) over the three corpora is given in table 7.

Table 7. Translation quality of the different interactive machine translation systems

The results show that the translation quality of the system of the invention is better than that of the baseline system.

Claims

1. An interactive machine translation method based on bilingual fragments is characterized by comprising the following steps:

3) decoding: after the translator has completed the confirmation of bilingual segments in the interactive area, the translator's selected translation option for each segment f_i and the current segmentation result of the source sentence are captured, and a phrase-based statistical machine translation decoder is implemented through a multi-stack decoding algorithm;

the decoding includes the following processes:

construct a set as a constraint for decoding:

C = {S, <p₁, f₁, e₁>, <p₂, f₂, e₂>, ..., <p_N, f_N, e_N>}

wherein p_i is the position of segment f_i in the source sentence; f_i is the i-th source language segment; S is the current segmentation result of the source sentence; e_i is the correct translation of f_i confirmed by the translator; N is the number of bilingual segments;

the translation options of each source language phrase or segment are restricted by <p_i, f_i, e_i>, and only e_i is retained and participates in the subsequent decoding process;

the mathematical model is implemented by the following formula:

wherein t is a candidate translation, i is a bilingual fragment sequence number, and s is a source sentence.

2. The bilingual segment-based interactive machine translation method of claim 1, wherein: the translator interface also has three auxiliary functions, namely segment splitting-merging, translation option reordering and suffix prediction; for segment splitting-merging, two double-headed arrows are arranged above each segment: the arrow whose heads point outward is the split arrow, which splits the segment into two shorter segments; the arrow whose heads point inward is the merge arrow, which merges the current segment and the following segment into one longer segment.

3. The bilingual segment-based interactive machine translation method of claim 2, wherein: if no shorter or longer segments are present in the phrase table, then both double-headed arrows do not appear; otherwise if there is a shorter or longer segment in the phrase table, the arrow appears when the mouse is placed over the segment.

4. The bilingual segment-based interactive machine translation method of claim 2, wherein the translation option reordering is: the translator selects either the default mode or the reordering mode before starting the translation; when a new segment is generated, its translation options change; in the default mode, the options of the segment are displayed in their phrase-table order; when the reordering mode is selected, the top N translation options in the phrase table are reordered to generate a new option list.

5. The bilingual segment-based interactive machine translation method of claim 4, wherein the reordering comprises:

setting up a new option list T for each source language phrase p, adding the highest-scoring option in the original phrase table to T, traversing the remaining N-1 options, finding the option with the highest diversity relative to the options already in T, and adding that option to T;

repeatedly traversing the remaining options, finding the option with the highest diversity relative to the options already in T, and adding that option to T until all N options are reordered; the diversity between translation options t_a and t_b is calculated by the following formula:

6. The bilingual segment-based interactive machine translation method of claim 4, wherein the suffix prediction is: in the editing area, the translator clicks the "predict" button to obtain a predicted suffix from the system; when the button is clicked, the current position of the cursor is recorded, and the text before the cursor is used as the prefix; the confirmed bilingual segments and the prefix are both used as constraints to find the optimal suffix; when a new suffix is generated, it replaces the current suffix.

7. The bilingual segment-based interactive machine translation method of claim 6, wherein: if the decoder does not find any matching candidate translations, the suffix is not altered.

8. The bilingual segment-based interactive machine translation method of claim 1, wherein: the translator must click on an option for that option and its source language segment to become a confirmed bilingual segment, and if no translation option of a segment has been clicked, the segment and its options cannot be used as decoding constraints.