Disclosure of Invention
In view of the above problems in the prior art, the object of the present invention is to provide an interactive machine translation method based on bilingual segments that gives the translator more clues and gives the decoder more direct guidance.
To solve the above technical problem, the invention adopts the following technical solution:
The invention relates to an interactive machine translation method based on bilingual segments, which comprises the following steps:
1) establishing a mathematical model: for each source language segment, a plurality of translation options are provided to the translator, and the optimal translation is obtained through the mathematical model;
2) designing a translator interface: the interface comprises an interactive area and an editing area, wherein the interactive area shows the source sentence after phrase segmentation together with the plurality of translation options provided to the translator in step 1), and the editing area shows the machine translation once the translator finishes confirming and clicks the translate button;
3) decoding: after the translator has finished confirming bilingual segments in the interactive area, capturing the translation option the translator selected for each segment f_i and the current segmentation result of the source sentence, and decoding with a phrase-based statistical machine translation decoder implemented by a multi-stack decoding algorithm.
The mathematical model is implemented by the following formula:
where e_i is the correct translation of f_i confirmed by the translator, f_i is the i-th source language segment, t is a candidate translation, N is the number of bilingual segments, i is the index of a bilingual segment, P is the translation probability of a candidate translation, and s is the source sentence.
The translator interface also has three auxiliary functions, namely segment splitting-merging, translation option reordering and suffix prediction. For segment splitting-merging, two double-headed arrows are placed above each segment: the arrow pointing outward is the split arrow, which splits the segment into two shorter segments; the arrow pointing inward is the merge arrow, which merges the current segment and the segment that follows it into one longer segment.
If the phrase table contains no corresponding shorter (or longer) segment, the split (or merge) arrow does not appear; otherwise the arrow appears when the mouse is placed over the segment.
Translation option reordering works as follows: the translator selects either the default mode or the reordering mode before starting to translate; when a new segment is generated, its translation options change accordingly; in the default mode, the options of a segment are displayed in the order in which they appear in the phrase table; in the reordering mode, the top N translation options in the phrase table are reordered to generate a new option list.
The reordering is performed as follows:
for each source language phrase p, a new option list T is created; the highest-scoring option in the original phrase table is first added to T; the remaining N-1 options are then traversed, and the option with the highest diversity relative to the options already in T is found and added to T; this traversal is repeated over the remaining options until all N options have been reordered. The diversity between translation options t_a and t_b is calculated by the following formula:
where c(t_a, t_b) is the number of words that translation options t_a and t_b have in common, and a and b are the indices of the translation options.
Suffix prediction works as follows: in the editing area, the translator clicks the "predict" button to obtain a predicted suffix from the system; when the button is clicked, the current cursor position is recorded, and the text before the cursor is used as the prefix; the confirmed bilingual segments and the prefix are used together as constraints to find the optimal suffix; when a new suffix is generated, it replaces the current suffix.
If the decoder does not find any matching candidate translations, the suffix is not altered.
The decoding includes the following processes:
A set is constructed as the constraint for decoding:
C = {S, <p_1, f_1, e_1>, <p_2, f_2, e_2>, ..., <p_N, f_N, e_N>} (3)
where p_i is the position of segment f_i in the source sentence; f_i is the i-th source language segment; S is the current segmentation of the source sentence; e_i is the correct translation of f_i confirmed by the translator; and N is the number of bilingual segments.
S is taken as the only segmentation result of the source sentence during decoding;
the translation options of each source language phrase or segment are restricted by <p_i, f_i, e_i>: only e_i is retained and takes part in the subsequent decoding process.
The translator must click an option for that option and its source language segment to become a confirmed bilingual segment; if no translation option of a segment has been clicked, that segment and its options cannot be used as decoding constraints.
The invention has the following beneficial effects and advantages:
1. The invention improves the interaction protocol by allowing the translator to confirm bilingual segments. This provides more clues to the translator, gives the decoder more direct guidance, reduces human labor during human-computer interaction, and improves both the efficiency and the translation quality of interactive machine translation. Confirming a bilingual segment is also easier for the translator than identifying the correct part of an incorrect translation.
2. The invention also designs an interface intended for real translators, allows the translator to split and merge the segmented phrases, and provides a reordering method that increases the diversity of translation options, which helps improve interactive translation efficiency in real scenarios. Experiments with real translators show that the new protocol improves the efficiency and quality of interactive machine translation on three Chinese-English translation tasks.
Detailed Description
The invention is further elucidated with reference to the accompanying drawings.
To address the problems of interactive machine translation, the invention improves the interaction protocol: the translator is allowed to confirm bilingual segments, which provides more clues to the translator, gives the decoder more direct guidance, reduces human labor during human-computer interaction, and improves the efficiency and translation quality of interactive machine translation.
The invention relates to an interactive machine translation method based on bilingual segments, which comprises the following steps:
1) establishing a mathematical model: for each source language segment, a plurality of translation options are provided to the translator, and the optimal translation is obtained through the mathematical model;
2) designing a translator interface: the interface comprises an interactive area and an editing area, wherein the interactive area shows the source sentence after phrase segmentation together with the plurality of translation options provided to the translator in step 1), and the editing area shows the machine translation once the translator finishes confirming and clicks the translate button;
3) decoding: after the translator has finished confirming bilingual segments in the interactive area, capturing the translation option the translator selected for each segment f_i and the current segmentation result of the source sentence, and decoding with a phrase-based statistical machine translation decoder implemented by a multi-stack decoding algorithm.
In step 1), the source language segments are aligned with their corresponding target language segments. For each source language segment, multiple translation options are provided. The translator can confirm bilingual segments of the form <f_i, e_i>. The optimal translation is obtained by the following formula:
where e_i is the correct translation of f_i confirmed by the translator, f_i is the i-th source language segment, t is a candidate translation, N is the number of bilingual segments, i is the index of a bilingual segment, P is the translation probability of a candidate translation, and s is the source sentence.
In equation (1), the search space consists of the translation hypotheses that are consistent with these bilingual segments.
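Equation (1) itself is not reproduced in this text. A form consistent with the symbol definitions above, given here as an assumption rather than as the original formula, would condition the search on the source sentence and on the N confirmed bilingual segments:

```latex
\hat{t} = \operatorname*{arg\,max}_{t} \; P\left(t \mid s, \{\langle f_i, e_i \rangle\}_{i=1}^{N}\right)
```

Under this reading, a hypothesis that does not translate some confirmed f_i as e_i lies outside the search space, which matches the statement that only hypotheses consistent with the bilingual segments are searched.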
FIG. 1 gives an example of the new protocol. The translator has confirmed three bilingual segments (the boxed portions of the figure), and the decoder has produced a better translation; the translator then enters the prefix "a" and decodes again, yielding the correct translation IT-2.
In step 2), the present invention employs the translator interface shown in FIG. 2. The interface consists of two areas. One is the interactive area, which shows the source sentence after phrase segmentation together with the translation options, with each segment left-aligned with its options. When the mouse is placed on a source language segment, a menu with the K-best translation options is displayed, and the translator can click to confirm the preferred option. The other is the editing area, which shows the machine translation once the translator finishes confirming and clicks the "translate" button. There the translator may freely modify the translation until it is acceptable. The interaction process and the editing process may alternate.
One prominent feature of phrase-based statistical machine translation is that it extracts translations of longer phrases. Using long phrases as the basic translation units effectively alleviates the word sense disambiguation problem and gives good results. Therefore, longer segments and their translations are displayed preferentially in the interface, and the source sentence is initially segmented against the phrase table with the forward maximum matching algorithm. The translation options displayed are the top K options in the phrase table.
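The initial segmentation can be sketched as follows. This is a minimal illustration of forward maximum matching, not the implementation of the invention: the function name, the `max_len` limit, the fallback to single words, and the toy phrase table are all assumptions made for the example.

```python
def forward_max_match(words, phrase_table, max_len=7):
    """Segment a tokenized source sentence greedily, longest phrase first."""
    segments = []
    i = 0
    while i < len(words):
        # Try the longest span starting at position i first, then shrink it.
        for n in range(min(max_len, len(words) - i), 0, -1):
            candidate = tuple(words[i:i + n])
            # A single word is always accepted so that the loop makes progress
            # even when the word is missing from the phrase table (assumption).
            if n == 1 or candidate in phrase_table:
                segments.append(candidate)
                i += n
                break
    return segments


# Hypothetical toy phrase table: source phrase -> translation options.
phrase_table = {("a", "b", "c"): ["x"], ("a", "b"): ["y"], ("d", "e"): ["z"]}
segments = forward_max_match(["a", "b", "c", "d", "e", "f"], phrase_table)
```

Here the longest entry starting at each position wins, so `("a", "b", "c")` is preferred over `("a", "b")`, which mirrors the preference for longer segments described above.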
The translator interface also provides three auxiliary functions: segment splitting-merging, translation option reordering, and suffix prediction.
a. Segment splitting-merging
Two double-headed arrows are placed above each segment. The arrow pointing outward is the split arrow, which splits the segment into two shorter segments; the arrow pointing inward is the merge arrow, which merges the current segment and the segment that follows it into one longer segment.
If the phrase table contains no corresponding shorter (or longer) segment, the split (or merge) arrow does not appear. Otherwise, the arrow appears when the mouse is placed over the segment. Once a new segment is generated, its translation options change accordingly.
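The arrow logic can be sketched as below. The text does not say how a split point is chosen, so this sketch assumes that a split is offered only where both halves are phrase table entries; the function names and the toy data are likewise hypothetical.

```python
def can_merge(segments, i, phrase_table):
    """Show the merge arrow only if segment i joined with the next one is a known phrase."""
    return i + 1 < len(segments) and segments[i] + segments[i + 1] in phrase_table


def merge(segments, i):
    """Merge segment i with the segment that follows it."""
    return segments[:i] + [segments[i] + segments[i + 1]] + segments[i + 2:]


def split_points(segment, phrase_table):
    """Offsets at which the segment splits into two shorter known phrases (assumption)."""
    return [k for k in range(1, len(segment))
            if segment[:k] in phrase_table and segment[k:] in phrase_table]


def split(segments, i, k):
    """Split segment i into its first k words and the rest."""
    seg = segments[i]
    return segments[:i] + [seg[:k], seg[k:]] + segments[i + 1:]


# Hypothetical phrase table, reduced to its set of source phrases.
known = {("a", "b"), ("a",), ("b",), ("c",), ("b", "c")}
segments = [("a", "b"), ("c",)]

points = split_points(segments[0], known)      # the split arrow is shown if non-empty
after_split = split(segments, 0, points[0])
after_merge = merge(after_split, 1) if can_merge(after_split, 1, known) else after_split
```

After either operation the option menu of the new segment would be looked up again in the phrase table, as stated above.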
b. Translation option reordering
By default, the options of a segment are displayed in the order in which they appear in the phrase table. However, the highest-scoring options are sometimes very similar to one another. The invention therefore provides an alternative reordering mode that increases the diversity of the options. The translator may select either the default mode or the reordering mode before beginning to translate.
In the reordering mode, the top N translation options in the phrase table are reordered to produce a new option list. For each source language phrase p, a new option list T (initially empty) is created. First, the highest-scoring option in the original phrase table is added to T. Then the remaining N-1 options are traversed, and the option with the highest diversity relative to the options already in T is found and added to T. This process is repeated until all N options have been reordered. The diversity between translation options t_a and t_b is calculated by the following formula:
where c(t_a, t_b) is the number of words that t_a and t_b have in common (after lemmatization), and a and b are the indices of the translation options.
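The greedy procedure above can be sketched as follows. Two points are assumptions made for the example and are not taken from the text: the diversity function is a stand-in, 1 - 2c / (|t_a| + |t_b|), because equation (2) is not reproduced here, and "highest diversity relative to the options in T" is read as maximizing the minimum diversity to any option already in T. Lemmatization is replaced by simple lowercasing.

```python
from collections import Counter


def shared_words(ta, tb):
    """c(t_a, t_b): number of words the two options have in common (multiset overlap)."""
    return sum((Counter(ta.lower().split()) & Counter(tb.lower().split())).values())


def diversity(ta, tb):
    """Stand-in for equation (2), assumed here: 1 - 2c / (|t_a| + |t_b|)."""
    total = len(ta.split()) + len(tb.split())
    return 1.0 - 2.0 * shared_words(ta, tb) / total if total else 0.0


def reorder(options):
    """Greedily reorder the top-N options, which are given best-first."""
    if not options:
        return []
    remaining = list(options)
    reordered = [remaining.pop(0)]  # the highest-scoring option goes first
    while remaining:
        # Pick the option whose closest neighbour in T is farthest away.
        # max() returns the first maximum, so ties keep phrase table order.
        best = max(remaining, key=lambda t: min(diversity(t, u) for u in reordered))
        remaining.remove(best)
        reordered.append(best)
    return reordered


# Hypothetical option list for one source phrase, in phrase table order.
options = ["the study", "the studies", "a study", "research"]
reordered = reorder(options)
```

With this toy list the option that shares no words with the top choice moves up to second place, so a menu that shows only the first few entries exposes more distinct alternatives.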
c. Suffix prediction
For the suffix prediction function, a constraint is added to the decoder: the translation hypothesis must match the given prefix t_p.
In the editing area, the translator may click the "predict" button to obtain a predicted suffix from the system. When the button is clicked, the current cursor position is recorded, and the text before the cursor is used as the prefix. Both the confirmed bilingual segments and the prefix are used as constraints to find the optimal suffix. Once a new suffix is generated, it replaces the current suffix. If the decoder does not find any compatible hypothesis, the suffix is not altered.
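The behavior of the "predict" button can be sketched as below. In the method described here the constraints act inside the decoder; this sketch instead filters a scored list of complete hypotheses, which is a simplification. The function name, the hypothesis list, and the use of substring matching for confirmed target segments are assumptions made for the example.

```python
def predict_suffix(hypotheses, confirmed, prefix, current_suffix):
    """Return the suffix of the best-scoring hypothesis that matches the prefix
    and contains every confirmed target segment; otherwise keep the current
    suffix unchanged."""
    compatible = [
        (score, text) for text, score in hypotheses
        if text.startswith(prefix) and all(e in text for e in confirmed)
    ]
    if not compatible:
        return current_suffix
    _, best = max(compatible)
    return best[len(prefix):]


# Hypothetical scored hypotheses (text, log-probability) for one source sentence.
hypotheses = [
    ("we study the functions of genes", -2.0),
    ("we studied the functions of genes", -2.5),
    ("we studied gene functions", -3.0),
]
new_suffix = predict_suffix(hypotheses, ["functions"], "we studied ", "gene roles")
kept_suffix = predict_suffix(hypotheses, ["functions"], "they ", "gene roles")
```

The second call finds no hypothesis that starts with the prefix, so the existing suffix is returned unchanged, as in the fallback described above.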
In step 3), the decoding process is as follows:
After the translator has finished confirming bilingual segments in the interactive area, the system captures the translation option the translator selected for each segment f_i and the current segmentation result S of the source sentence. A set is constructed as the constraint for decoding:
C = {S, <p_1, f_1, e_1>, <p_2, f_2, e_2>, ..., <p_N, f_N, e_N>} (3)
where p_i is the position of segment f_i in the source sentence; f_i is the i-th source language segment; S is the current segmentation result of the source sentence; e_i is the correct translation of f_i confirmed by the translator; and N is the number of bilingual segments.
S is taken as the only segmentation result of the source sentence during decoding.
The translation options of each source language phrase or segment are restricted by <p_i, f_i, e_i>: only e_i is retained and takes part in the subsequent decoding process.
p_i is recorded to avoid the ambiguity that arises when a segment occurs more than once. The translator must click an option for that option and its source language segment to become a confirmed bilingual segment. If no translation option of a segment has been clicked, that segment and its options cannot be used as decoding constraints.
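The construction of C and its effect on the option lists can be sketched as follows. The data layout is an assumption made for the example: p_i is represented as the index of the segment within S, clicks are a mapping from that index to the chosen translation, and the class and function names are hypothetical.

```python
from typing import NamedTuple


class Confirmed(NamedTuple):
    position: int   # p_i: index of the segment within the segmentation S (assumption)
    source: tuple   # f_i
    target: str     # e_i


def build_constraints(segmentation, clicks):
    """C = {S, <p_1, f_1, e_1>, ...}; only clicked segments contribute a triple."""
    triples = [Confirmed(p, segmentation[p], e) for p, e in sorted(clicks.items())]
    return segmentation, triples


def restrict_options(segmentation, triples, phrase_table):
    """Per-segment option lists for the decoder: a confirmed segment keeps only e_i."""
    confirmed = {c.position: c.target for c in triples}
    return [
        [confirmed[p]] if p in confirmed else list(phrase_table.get(f, []))
        for p, f in enumerate(segmentation)
    ]


# Hypothetical data: the segment ("x",) occurs twice, only the second is confirmed.
phrase_table = {("x",): ["a", "b"], ("y",): ["c", "d"]}
S = [("x",), ("y",), ("x",)]
S, triples = build_constraints(S, {2: "b"})
options = restrict_options(S, triples, phrase_table)
```

Because the triple carries the position, the first occurrence of the repeated segment keeps all of its options while the confirmed occurrence is reduced to the single clicked translation.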
Table 1 shows a comparative example of a real interactive machine translation.
TABLE 1 Comparative example of interactive machine translation protocols
In this embodiment, the prefix-based protocol requires 6 decodings, including 2 tense changes ("study" and "consider"), 1 insertion of a missing word ("functions"), and 1 word order adjustment ("of"). In contrast, the protocol of the present invention decodes only twice after the bilingual segments are confirmed, and the correct translation options for the content words are all displayed in the list. The translator can click them directly to confirm.
(1) Data setting
The present invention is tested on three different Chinese-English translation tasks with real translators. "Law" is the legal text of the LDC2000T47 corpus. "Meeting record" is the meeting record text of the LDC2000T50 corpus. "News" is the news text of the LDC2000T46 corpus. Table 2 gives the main information for these corpora (S, T and V indicate the number of sentences, the number of words, and the size of the vocabulary, respectively. K and M represent thousands and ten thousand, respectively).
TABLE 2 Main information of test corpus
The Chinese side of the data is segmented with the ICTCLAS word segmentation tool, and the English side is tokenized and lowercased. The word alignment model is trained with GIZA++, a 5-gram language model is trained with IRSTLM, and the phrase-based statistical machine translation model, which contains 14 default features, is built with Moses; the feature weights are tuned with MERT.
Three interactive machine translation systems were evaluated in the experiment. Baseline is a prefix-based system, BiSeg is the system of the invention without the option reordering function, and BiSeg+D is the system of the invention with the option reordering function. In the translator interface, the number of translation options displayed is set to 10 and the number of translation options reordered is set to 20.
(2) Evaluation index
In the field of interactive machine translation, experiments with real translators are costly, so prototype systems are mainly evaluated with automatic metrics. These metrics simulate the translator's behavior instead of measuring the behavior of an actual translator during interaction. However, a direct evaluation of an interactive machine translation system still requires experiments with real translators. The invention therefore evaluates the performance of the interactive machine translation system with real translators in terms of both efficiency and quality. Three indices are used to evaluate translation efficiency: translation time, keystroke and mouse-action ratio (KSMR), and number of decodings.
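The text does not spell out how KSMR is computed. The sketch below assumes the definition commonly used in the interactive machine translation literature, namely keystrokes plus mouse actions divided by the number of characters in the reference translation; the function name and the sample counts are hypothetical.

```python
def ksmr(keystrokes, mouse_actions, reference_chars):
    """Keystroke and mouse-action ratio under the assumed definition:
    (keystrokes + mouse actions) / characters in the reference translation."""
    if reference_chars <= 0:
        raise ValueError("reference_chars must be positive")
    return (keystrokes + mouse_actions) / reference_chars


# Hypothetical interaction log for one sentence.
value = ksmr(keystrokes=30, mouse_actions=10, reference_chars=200)
```

Under this definition every click that confirms a bilingual segment counts as one mouse action, so a protocol built on clicking can show a higher KSMR even when the translator types less.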
Translation quality is evaluated with the BLEU score, using the English side of the original bilingual corpus as the reference translation. The final translation accepted by the translator is considered correct even though it may not be identical to the reference translation.
(3) Participants and processes
Nine researchers (6 women) volunteered to participate in the experiment as non-professional translators. They are all native Chinese speakers who are proficient in English. The participants are randomly divided into 3 groups (G_1 to G_3) of 3 people each. The test set of each corpus is randomly divided into 3 parts (C_1 to C_3) of 25 sentences each. The evaluation is carried out in a balanced manner, as shown in Table 3.
TABLE 3 Translation task assignment
(4) Results and analysis
Table 4 shows the average time for three translator groups on the test corpus. The numbers in parentheses are the relative differences between the inventive system and the baseline system.
TABLE 4 Translation time for different interactive machine translation systems
It can be seen that the translation time of the inventive system is significantly lower than that of the baseline system. This indicates a significant reduction in human labor. Increasing the diversity of translation options may reduce human labor further.
Table 5 gives the KSMR values over three corpora.
TABLE 5 KSMR values for different interactive machine translation systems
It can be seen that the KSMR values of the inventive system are significantly higher than the baseline system. However, these mouse actions do not take much thought and action time, so they have little impact on translation efficiency.
Table 6 gives the number of decodings over the three corpora.
TABLE 6 Number of decodings for different interactive machine translation systems
Table 6 shows that the number of decodings in the new protocol is significantly reduced.
The translation quality (BLEU value) over the three corpora is given in table 7.
TABLE 7 Translation quality for different interactive machine translation systems
The results show that the translation quality of the system of the invention is better than that of the baseline system.